Hi all,
I've spent a bit more time today trying to figure out why this crashes under OpenBIOS and so decided to start working backwards through the -d in_asm QEMU output to try and figure out why we seem to get stuck in an infinite loop.
Anyhow the short version is that I've traced the series of functions that get called just before we get stuck in the infinite loop and it looks like this:
0xf0044384: cbe_set_level 0xf0063954: splr 0xf0044268: cbe_enable 0xf0062cf0: __div64
... invokes "ta 2" trap instruction
0xf004127c: sys_trap() 0xf0064658: vpanic () 0xf0064618: panic_trigger () 0xf005bf90: trap
... infinite loop ...
The interesting part here is that the fatal trap sequence is being invoked from __div64 which makes me think it is some kind of divide by zero error. I'm not exactly sure what the cbe_* functions are doing, but a quick browse of the OpenSolaris source shows that it's related to PIL/RTC things.
My current feeling is that maybe OpenBIOS isn't doing something right with PIL state somewhere, or we're reading back a zero value from either the clock (or a missing OpenBIOS property) which is causing us to die in the division in cbe_enable(). And another thing, what does cbe actually stand for?
ATB,
Mark.
On 13/02/11 14:31, Mark Cave-Ayland wrote:
My current feeling is that maybe OpenBIOS isn't doing something right with PIL state somewhere, or we're reading back a zero value from either the clock (or a missing OpenBIOS property) which is causing us to die in the division in cbe_enable(). And another thing, what does cbe actually stand for?
Incidentally if I also enable romvec debugging in OpenBIOS this is what I get on the console just before the crash:
vac: enabled in write through mode mem = 131072K (0x8000000) avail mem = 110419968 obp_nextnode(0x0) = 0xffd4527c obp_proplen(0xffd4527c, reg) (not found) obp_proplen(0xffd4527c, ranges) (not found) obp_proplen(0xffd4527c, intr) (not found) obp_proplen(0xffd4527c, interrupts) (not found)
So maybe it could be that the cbe_* functions are trying to get this information from the root node node of the device tree, but I don't see this in the prtconf output from the real SS5. Artyom, can you see these properties on a real SS5 OBP device tree?
ATB,
Mark.
On 2011-2-13 9:47 AM, Mark Cave-Ayland wrote:
[...] Incidentally if I also enable romvec debugging in OpenBIOS this is what I get on the console just before the crash:
vac: enabled in write through mode mem = 131072K (0x8000000) avail mem = 110419968 obp_nextnode(0x0) = 0xffd4527c obp_proplen(0xffd4527c, reg) (not found) obp_proplen(0xffd4527c, ranges) (not found) obp_proplen(0xffd4527c, intr) (not found) obp_proplen(0xffd4527c, interrupts) (not found)
That's not good. obp_nextnode() should be giving you a pointer to a valid node (I believe root), where it looks at properties.
The divide by zero is probably Solaris signalling an error; if things are bad enough that it can't talk with the PROM (or doesn't trust it), it does a divide by zero to blow up. In usr/src/psm/promif/ieee1275/sun4/prom_init.c :
/* * Fatal promif internal error, not an external interface */
/*ARGSUSED*/ void prom_fatal_error(const char *errormsg) {
volatile int zero = 0; volatile int i = 1;
/* * No prom interface, try to cause a trap by * dividing by zero, leaving the message in %i0. */
i = i / zero; /*NOTREACHED*/
I don't think this has anything to do with the PIL 14 or 10 issues you discuss later on.
On 13/02/11 22:17, Tarl Neustaedter wrote:
Hi Tarl,
Incidentally if I also enable romvec debugging in OpenBIOS this is what I get on the console just before the crash:
vac: enabled in write through mode mem = 131072K (0x8000000) avail mem = 110419968 obp_nextnode(0x0) = 0xffd4527c obp_proplen(0xffd4527c, reg) (not found) obp_proplen(0xffd4527c, ranges) (not found) obp_proplen(0xffd4527c, intr) (not found) obp_proplen(0xffd4527c, interrupts) (not found)
That's not good. obp_nextnode() should be giving you a pointer to a valid node (I believe root), where it looks at properties.
Yes, that is actually what is happening in the trace above - obp_nextnode(0x0) means 0 is being passed in, and then 0xffd4527c is being returned as the handle.
The divide by zero is probably Solaris signalling an error; if things are bad enough that it can't talk with the PROM (or doesn't trust it), it does a divide by zero to blow up. In usr/src/psm/promif/ieee1275/sun4/prom_init.c :
/*
- Fatal promif internal error, not an external interface
*/
/*ARGSUSED*/ void prom_fatal_error(const char *errormsg) {
volatile int zero = 0; volatile int i = 1;
/*
- No prom interface, try to cause a trap by
- dividing by zero, leaving the message in %i0.
*/
i = i / zero; /*NOTREACHED*/
I don't think this has anything to do with the PIL 14 or 10 issues you discuss later on.
Oh that's interesting. However I don't think that this is the case here for 2 reasons:
1) The backtraces definitely point to an issue with clock initialisation based upon the symbol names, and enabling the L14 timer does allow the division by zero to succeed with a value between 0 and the counter limit.
2) The address where the trap is invoked is definitely outside the main kernel space by some margin, which makes me think that this is because it is coming from an external kernel module which is being dynamically loaded - otherwise if it were being caused by the above, I would expect the trap address to be within the main kernel image.
ATB,
Mark.
On 13/02/11 14:31, Mark Cave-Ayland wrote:
My current feeling is that maybe OpenBIOS isn't doing something right with PIL state somewhere, or we're reading back a zero value from either the clock (or a missing OpenBIOS property) which is causing us to die in the division in cbe_enable(). And another thing, what does cbe actually stand for?
Finally I finally have made some progress! :) Firstly here is the gdb session with OpenBIOS from SVN trunk at the point where we hit the division by zero:
build@zeno:~/src/openbios/openbios-git/openbios-devel$ sparc64-linux-gdb GNU gdb 6.8 Copyright (C) 2008 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "--host=x86_64-unknown-linux-gnu --target=sparc64-linux". (gdb) target remote :1234 Remote debugging using :1234 [New Thread 1] 0x00000000 in ?? () (gdb) file ../../ss5 A program is being debugged already. Are you sure you want to change the file? (y or n) y Reading symbols from /home/build/src/openbios/ss5...(no debugging symbols found)...done. (gdb) break *0xf00a3f2c Breakpoint 1 at 0xf00a3f2c (gdb) cont Continuing.
Breakpoint 1, 0xf00a3f2c in ?? () (gdb) stepi 0xf00a3f30 in ?? () (gdb) 0xf0062cf0 in __div64 () (gdb) 0xf0062cf4 in __div64 () (gdb) 0xf0062cf8 in __div64 () (gdb) info regi i0 i0 0x0 0 (gdb) info regi i1 i1 0x3e60c 255500 (gdb) info regi i2 i2 0x0 0 (gdb) info regi i3 i3 0x0 0 (gdb)
Browsing around the Linux source code, I saw some comments mentioning that the level 14 timer was handled by OBP, so I made the following experimental change in drivers/obio.c:
counter_regs->cpu_timers[0].l14_timer_limit = 0x55555555;
And now I don't get the divide by zero error any more:
build@zeno:~/src/openbios/openbios-git/openbios-devel$ sparc64-linux-gdb GNU gdb 6.8 Copyright (C) 2008 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "--host=x86_64-unknown-linux-gnu --target=sparc64-linux". (gdb) target remote :1234 Remote debugging using :1234 [New Thread 1] 0x00000000 in ?? () (gdb) file ../../ss5 A program is being debugged already. Are you sure you want to change the file? (y or n) y Reading symbols from /home/build/src/openbios/ss5...(no debugging symbols found)...done. (gdb) break *0xf00a3f2c Breakpoint 1 at 0xf00a3f2c (gdb) cont Continuing.
Breakpoint 1, 0xf00a3f2c in ?? () (gdb) stepi 0xf00a3f30 in ?? () (gdb) 0xf0062cf0 in __div64 () (gdb) 0xf0062cf4 in __div64 () (gdb) 0xf0062cf8 in __div64 () (gdb) info regi i0 i0 0x0 0 (gdb) info regi i1 i1 0x7a6fc 501500 (gdb) info regi i2 i2 0x0 0 (gdb) info regi i3 i3 0x53555408 1398101000 (gdb)
However I still don't get much further with the boot because I now get stuck in another loop which keeps outputting the following onto the QEMU console:
spurious interrupt at processor level 10 spurious interrupt at processor level 10 spurious interrupt at processor level 10 ...etc...
So my next question is that if OpenBIOS is supposed to handle the level 14 timer, what should it doing with it? And is there a relationship between the level 14 timer and level 10 timer which causes one to trigger the other?
ATB,
Mark.
On Sun, Feb 13, 2011 at 10:30 PM, Mark Cave-Ayland mark.cave-ayland@siriusit.co.uk wrote:
On 13/02/11 14:31, Mark Cave-Ayland wrote:
My current feeling is that maybe OpenBIOS isn't doing something right with PIL state somewhere, or we're reading back a zero value from either the clock (or a missing OpenBIOS property) which is causing us to die in the division in cbe_enable(). And another thing, what does cbe actually stand for?
Finally I finally have made some progress! :) Firstly here is the gdb session with OpenBIOS from SVN trunk at the point where we hit the division by zero:
build@zeno:~/src/openbios/openbios-git/openbios-devel$ sparc64-linux-gdb GNU gdb 6.8 Copyright (C) 2008 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "--host=x86_64-unknown-linux-gnu --target=sparc64-linux". (gdb) target remote :1234 Remote debugging using :1234 [New Thread 1] 0x00000000 in ?? () (gdb) file ../../ss5 A program is being debugged already. Are you sure you want to change the file? (y or n) y Reading symbols from /home/build/src/openbios/ss5...(no debugging symbols found)...done. (gdb) break *0xf00a3f2c Breakpoint 1 at 0xf00a3f2c (gdb) cont Continuing.
Breakpoint 1, 0xf00a3f2c in ?? () (gdb) stepi 0xf00a3f30 in ?? () (gdb) 0xf0062cf0 in __div64 () (gdb) 0xf0062cf4 in __div64 () (gdb) 0xf0062cf8 in __div64 () (gdb) info regi i0 i0 0x0 0 (gdb) info regi i1 i1 0x3e60c 255500 (gdb) info regi i2 i2 0x0 0 (gdb) info regi i3 i3 0x0 0 (gdb)
Browsing around the Linux source code, I saw some comments mentioning that the level 14 timer was handled by OBP, so I made the following experimental change in drivers/obio.c:
counter_regs->cpu_timers[0].l14_timer_limit = 0x55555555;
And now I don't get the divide by zero error any more:
build@zeno:~/src/openbios/openbios-git/openbios-devel$ sparc64-linux-gdb GNU gdb 6.8 Copyright (C) 2008 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "--host=x86_64-unknown-linux-gnu --target=sparc64-linux". (gdb) target remote :1234 Remote debugging using :1234 [New Thread 1] 0x00000000 in ?? () (gdb) file ../../ss5 A program is being debugged already. Are you sure you want to change the file? (y or n) y Reading symbols from /home/build/src/openbios/ss5...(no debugging symbols found)...done. (gdb) break *0xf00a3f2c Breakpoint 1 at 0xf00a3f2c (gdb) cont Continuing.
Breakpoint 1, 0xf00a3f2c in ?? () (gdb) stepi 0xf00a3f30 in ?? () (gdb) 0xf0062cf0 in __div64 () (gdb) 0xf0062cf4 in __div64 () (gdb) 0xf0062cf8 in __div64 () (gdb) info regi i0 i0 0x0 0 (gdb) info regi i1 i1 0x7a6fc 501500 (gdb) info regi i2 i2 0x0 0 (gdb) info regi i3 i3 0x53555408 1398101000 (gdb)
However I still don't get much further with the boot because I now get stuck in another loop which keeps outputting the following onto the QEMU console:
spurious interrupt at processor level 10 spurious interrupt at processor level 10 spurious interrupt at processor level 10 ...etc...
So my next question is that if OpenBIOS is supposed to handle the level 14 timer, what should it doing with it? And is there a relationship between the level 14 timer and level 10 timer which causes one to trigger the other?
Nice work! Maybe OBP uses it to blink the cursor or poll some devices. If we don't use the level 14 interrupt, it should be handled but ignored. About level 10 timer, maybe OpenBIOS should program it initially so that it won't cause interrupts.
On 13/02/11 20:50, Blue Swirl wrote:
So my next question is that if OpenBIOS is supposed to handle the level 14 timer, what should it doing with it? And is there a relationship between the level 14 timer and level 10 timer which causes one to trigger the other?
Nice work! Maybe OBP uses it to blink the cursor or poll some devices. If we don't use the level 14 interrupt, it should be handled but ignored. About level 10 timer, maybe OpenBIOS should program it initially so that it won't cause interrupts.
I did find something that may be useful here: http://www.ibiblio.org/pub/historic-linux/early-ports/Sparc/NCR/NCR89C105.tx...
<quote> The interrupt is cleared and the limit bits reset by reading the appropriate limit register. Reading the counter register does not change the state of the limit bit. Writing the limit register resets the corresponding counter to 500nS (0x200). </quote>
So does that mean that the OpenBIOS timer handlers need to read the limit register in order to reset the interrupt status? Then again, I'm not convinced by this is the total solution since if I add breakpoints on irq_entry10 and irq_entry14 then they never seem to hit in gdb anyhow.
ATB,
Mark.
On Sun, Feb 13, 2011 at 11:23 PM, Mark Cave-Ayland mark.cave-ayland@siriusit.co.uk wrote:
On 13/02/11 20:50, Blue Swirl wrote:
So my next question is that if OpenBIOS is supposed to handle the level 14 timer, what should it doing with it? And is there a relationship between the level 14 timer and level 10 timer which causes one to trigger the other?
Nice work! Maybe OBP uses it to blink the cursor or poll some devices. If we don't use the level 14 interrupt, it should be handled but ignored. About level 10 timer, maybe OpenBIOS should program it initially so that it won't cause interrupts.
I did find something that may be useful here: http://www.ibiblio.org/pub/historic-linux/early-ports/Sparc/NCR/NCR89C105.tx...
<quote> The interrupt is cleared and the limit bits reset by reading the appropriate limit register. Reading the counter register does not change the state of the limit bit. Writing the limit register resets the corresponding counter to 500nS (0x200). </quote>
So does that mean that the OpenBIOS timer handlers need to read the limit register in order to reset the interrupt status? Then again, I'm not convinced by this is the total solution since if I add breakpoints on irq_entry10 and irq_entry14 then they never seem to hit in gdb anyhow.
Right. Then I'd try to program the timer and also interrupt controller. In theory also PIL level could matter if the OS preserved the value which was used when it took charge, but I doubt this.
On 13/02/11 21:57, Blue Swirl wrote:
So does that mean that the OpenBIOS timer handlers need to read the limit register in order to reset the interrupt status? Then again, I'm not convinced by this is the total solution since if I add breakpoints on irq_entry10 and irq_entry14 then they never seem to hit in gdb anyhow.
Right. Then I'd try to program the timer and also interrupt controller. In theory also PIL level could matter if the OS preserved the value which was used when it took charge, but I doubt this.
I'm wondering if there is actually a qemu bug here too, since the documentation also mentions the following:
"When a counter reached the value in its corresponding limit register, it is reset to 500ns (i.e. 0x00000200)"
and also:
"Writing the limit register resets the corresponding counter to 500nS (0x200)"
AFAICT looking at slavio_timer.c I can't see a 0x200 initial base when the timer is reset - Blue, what do you think? I'm trying to figure out why writing a non-zero value to the limit register seems to get the L14 timer to return something non-zero, whereas in theory with a limit set to 0 it should be free-running and with a base limit of 0x200 set, never return a zero value.
ATB,
Mark.