On 13/02/11 22:17, Tarl Neustaedter wrote:
Hi Tarl,
Incidentally if I also enable romvec debugging in OpenBIOS this is what I get on the console just before the crash:
vac: enabled in write through mode
mem = 131072K (0x8000000)
avail mem = 110419968
obp_nextnode(0x0) = 0xffd4527c
obp_proplen(0xffd4527c, reg) (not found)
obp_proplen(0xffd4527c, ranges) (not found)
obp_proplen(0xffd4527c, intr) (not found)
obp_proplen(0xffd4527c, interrupts) (not found)
That's not good. obp_nextnode() should be giving you a pointer to a valid node (I believe root), where it looks at properties.
Yes, that is actually what is happening in the trace above - obp_nextnode(0x0) means 0 is being passed in, and then 0xffd4527c is being returned as the handle.
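To make that convention concrete, here is a minimal mock of the two calls seen in the trace. It is only a sketch and is not OpenBIOS or Solaris code: the names node_t, mock_tree, mock_nextnode and mock_proplen are hypothetical, the "name" property and its length are invented for illustration, and returning -1 for a missing property is an assumption about how "(not found)" is reported to the caller. Only the root handle value is taken from the trace.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handle type and mock tree; not OpenBIOS or Solaris code. */
typedef unsigned long node_t;

struct mock_prop {
    const char *name;
    int len;
};

struct mock_node {
    node_t handle;
    node_t next;                /* next sibling, 0 when there is none */
    struct mock_prop props[4];  /* unused slots have name == NULL */
};

/*
 * One-node tree. The root handle is the value from the trace; the
 * "name" property and its length are invented for illustration.
 */
static const struct mock_node mock_tree[] = {
    { 0xffd4527cUL, 0, { { "name", 5 } } },
};

#define MOCK_NODE_COUNT (sizeof(mock_tree) / sizeof(mock_tree[0]))

/* Look up a node by handle; NULL if the handle is unknown. */
static const struct mock_node *mock_find(node_t handle)
{
    for (size_t i = 0; i < MOCK_NODE_COUNT; i++) {
        if (mock_tree[i].handle == handle)
            return &mock_tree[i];
    }
    return NULL;
}

/* Passing 0 asks for the root node; otherwise return the next sibling. */
node_t mock_nextnode(node_t node)
{
    if (node == 0)
        return mock_tree[0].handle;
    const struct mock_node *n = mock_find(node);
    return n ? n->next : 0;
}

/* Return the property length, or -1 (assumed) when it is not found. */
int mock_proplen(node_t node, const char *name)
{
    const struct mock_node *n = mock_find(node);
    if (n == NULL)
        return -1;
    for (size_t i = 0; i < 4 && n->props[i].name != NULL; i++) {
        if (strcmp(n->props[i].name, name) == 0)
            return n->props[i].len;
    }
    return -1;
}
```

With this mock, mock_nextnode(0) hands back 0xffd4527c, and asking that node for reg, ranges, intr or interrupts yields "not found", which is the same shape as the trace: a valid handle comes back, and the root simply lacks those properties.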
The divide by zero is probably Solaris signalling an error; if things are bad enough that it can't talk with the PROM (or doesn't trust it), it does a divide by zero to blow up. In usr/src/psm/promif/ieee1275/sun4/prom_init.c:
/*
 * Fatal promif internal error, not an external interface
 */

/*ARGSUSED*/
void
prom_fatal_error(const char *errormsg)
{
	volatile int zero = 0;
	volatile int i = 1;

	/*
	 * No prom interface, try to cause a trap by
	 * dividing by zero, leaving the message in %i0.
	 */
	i = i / zero;
	/*NOTREACHED*/
}
I don't think this has anything to do with the PIL 14 or 10 issues you discuss later on.
Oh that's interesting. However I don't think that this is the case here for 2 reasons:
1) The backtraces definitely point to an issue with clock initialisation, based upon the symbol names, and enabling the L14 timer does allow the previously failing division to succeed, with a value between 0 and the counter limit.
2) The address where the trap is invoked is definitely outside the main kernel space by some margin, which makes me think that this is because it is coming from an external kernel module which is being dynamically loaded - otherwise if it were being caused by the above, I would expect the trap address to be within the main kernel image.
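The check behind point 2 is just a range test on the trap address. A minimal sketch follows, assuming the start and end of the main kernel image are known; struct image_range and addr_in_range are hypothetical names, and no real addresses from this system are implied.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, end) covering a loaded image. */
struct image_range {
    uintptr_t start;
    uintptr_t end;
};

/* True when addr lies inside the image, false when it lies outside. */
bool addr_in_range(const struct image_range *r, uintptr_t addr)
{
    return addr >= r->start && addr < r->end;
}
```

A trap address for which addr_in_range() is false against the main kernel image would be consistent with the trap coming from a dynamically loaded module rather than from prom_fatal_error() in the kernel proper.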
ATB,
Mark.