Hi all,
Currently attempting to boot a Solaris 8 install CD results in the following output:
Configuration device id QEMU version 1 machine id 32
CPUs: 1 x FMI,MB86904
UUID: 00000000-0000-0000-0000-000000000000
Welcome to OpenBIOS v1.0 built on Jan 2 2011 00:28
Type 'help' for detailed information
Trying cdrom:d...
Not a bootable ELF image
Loading a.out image...
Loaded 7680 bytes
entry point is 0x4000
bootpath: /iommu/sbus/espdma/esp/sd@2,0:d

Jumping to entry point 00004000 for type 00000005...
switching to new context:
SunOS Release 5.8 Version Generic_108528-09 32-bit
Copyright 1983-2001 Sun Microsystems, Inc.  All rights reserved.
qemu: fatal: Trap 0x29 while interrupts disabled, Error state
pc: f004127c  npc: f0041280
General Registers:
%g0-7: 00000000 00000808 00000001 f0041b74 00000000 f0243b88 00000000 f0244020
Current Register Window:
%o0-7: f025831c f5a0f00c f0240374 f0240370 f024036c 00000004 f0240300 f005bd84
%l0-7: 04400cc2 f005bf94 f005bf98 00000004 00000209 00000004 00000000 f023fe60
%i0-7: 00000001 f02403f4 f5a0f00c f025831c 00000001 00000009 f023ff08 f005c6b8
Floating Point Registers:
%f00: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
%f04: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
%f08: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
%f12: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
%f16: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
%f20: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
%f24: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
%f28: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
psr: 04000cc2 (icc: ---- SPE: SP-) wim: 00000004
fsr: 00080000 y: 00000000
Aborted
With the SPARC32 OFMEM migration complete, we can now get lots of debugging information about the memory mappings being made at run time. Setting a breakpoint at the crash address shows that it is part of a loop that is called several times during boot, so we can compare the successful iterations of the loop with the failing one to determine where the crash happens.
Here is the gdb output from the last successful iteration of the loop:
Breakpoint 1, 0xf004127c in ?? ()
(gdb) disas 0xf0041270 0xf00412a0
Dump of assembler code from 0xf0041270 to 0xf00412a0:
0xf0041270:     rett  %l2 + 4
0xf0041274:     b  0xf004127c
0xf0041278:     nop
0xf004127c:     mov  1, %l5     ! 0x1
0xf0041280:     sll  %l5, %l0, %l5
0xf0041284:     rd  %wim, %l3
0xf0041288:     btst  0x40, %l0
0xf004128c:     be  0xf0041318
0xf0041290:     btst  %l3, %l5
0xf0041294:     sub  %fp, 0xa8, %l7
0xf0041298:     st  %g1, [ %l7 + 0x6c ]
0xf004129c:     std  %g2, [ %l7 + 0x70 ]
End of assembler dump.
(gdb) info regi
g0             0x0        0
g1             0x808      2056
g2             0xf5a0f000 -174002176
g3             0x19       25
g4             0x0        0
g5             0xf0243b88 -266060920
g6             0x0        0
g7             0xf0244020 -266059744
o0             0x0        0
o1             0xf02406b4 -266074444
o2             0xf5a0f00c -174002164
o3             0xf0258398 -265976936
o4             0xf0252b10 -265999600
o5             0x0        0
sp             0xf0240658 0xf0240658
o7             0xf0041b74 -268166284
l0             0x4400cc0  71306432
l1             0xf004b1f8 -268127752
l2             0xf004b1fc -268127748
l3             0xf0041000 -268169216
l4             0x209      521
l5             0x1        1
l6             0x7        7
l7             0xf0240658 -266074536
i0             0xf024d870 -266020752
i1             0x0        0
i2             0xff812201 -8314367
i3             0x0        0
i4             0x0        0
i5             0xf01582dc -267025700
fp             0xf0240290 0xf0240290
i7             0xf004ef98 -268111976
y              0x0        0
psr            0x4400cc0  [ PS S #10 #11 #22 #26 ]
wim            0x1        1
tbr            0xf0040090 -268173168
pc             0xf004127c 0xf004127c
npc            0xf0041280 0xf0041280
fsr            0x80000    [ #19 ]
csr            0x0        0
(gdb) stepi
0xf0041280 in ?? ()
(gdb)
0xf0041284 in ?? ()
(gdb)
0xf0041288 in ?? ()
(gdb)
0xf004128c in ?? ()
(gdb)
0xf0041290 in ?? ()
(gdb)
0xf0041294 in ?? ()
(gdb)
0xf0041298 in ?? ()
(gdb)
0xf004129c in ?? ()
(gdb) info regi
g0             0x0        0
g1             0x808      2056
g2             0xf5a0f000 -174002176
g3             0x19       25
g4             0x0        0
g5             0xf0243b88 -266060920
g6             0x0        0
g7             0xf0244020 -266059744
o0             0x0        0
o1             0xf02406b4 -266074444
o2             0xf5a0f00c -174002164
o3             0xf0258398 -265976936
o4             0xf0252b10 -265999600
o5             0x0        0
sp             0xf0240658 0xf0240658
o7             0xf0041b74 -268166284
l0             0x4400cc0  71306432
l1             0xf004b1f8 -268127752
l2             0xf004b1fc -268127748
l3             0x1        1
l4             0x209      521
l5             0x1        1
l6             0x7        7
l7             0xf02401e8 -266075672
i0             0xf024d870 -266020752
i1             0x0        0
i2             0xff812201 -8314367
i3             0x0        0
i4             0x0        0
i5             0xf01582dc -267025700
fp             0xf0240290 0xf0240290
i7             0xf004ef98 -268111976
y              0x0        0
psr            0x4000cc0  [ PS S #10 #11 #26 ]
wim            0x1        1
tbr            0xf0040090 -268173168
pc             0xf004129c 0xf004129c
npc            0xf00412a0 0xf00412a0
fsr            0x80000    [ #19 ]
csr            0x0        0
And here is the failing version:
(gdb) info regi
g0             0x0        0
g1             0x80       128
g2             0xf5a0f000 -174002176
g3             0x1a       26
g4             0x0        0
g5             0xf0243b88 -266060920
g6             0x0        0
g7             0xf0244020 -266059744
o0             0x0        0
o1             0xf024047c -266075012
o2             0xf5a0f00c -174002164
o3             0xf0258398 -265976936
o4             0xf0252b10 -265999600
o5             0x0        0
sp             0xf0240420 0xf0240420
o7             0xf0041b74 -268166284
l0             0x4400cc4  71306436
l1             0xf004b1f8 -268127752
l2             0xf004b1fc -268127748
l3             0x10       16
l4             0x209      521
l5             0x10       16
l6             0x7        7
l7             0xf023ffb0 -266076240
i0             0xf5a0f01c -174002148
i1             0x100      256
i2             0xf0000000 -268435456
i3             0xff000000 -16777216
i4             0x4100cc5  68160709
i5             0x4100ce5  68160741
fp             0xf0240058 0xf0240058
i7             0xf0054be8 -268088344
y              0x0        0
psr            0x4000cc4  [ #2 PS S #10 #11 #26 ]
wim            0x10       16
tbr            0xf0040090 -268173168
pc             0xf00412a4 0xf00412a4
npc            0xf00412a8 0xf00412a8
fsr            0x80000    [ #19 ]
csr            0x0        0
(gdb) cont
Continuing.
Breakpoint 1, 0xf004127c in ?? ()
(gdb) stepi
0xf0041280 in ?? ()
(gdb)
0xf0041284 in ?? ()
(gdb)
0xf0041288 in ?? ()
(gdb)
0xf004128c in ?? ()
(gdb)
0xf0041290 in ?? ()
(gdb)
0xf0041294 in ?? ()
(gdb)
0xf0041298 in ?? ()
(gdb)
Remote connection closed
(gdb)
So the failure appears to be happening on this instruction:
0xf0041298: st %g1, [ %l7 + 0x6c ]
For the successful iteration:
l7 0xf02401e8 -266075672
For the failing iteration:
l7 0xf023ffb0 -266076240
With OFMEM debugging enabled, it's fairly easy to see the following in the console output:
Jumping to entry point 00004000 for type 00000005...
switching to new context:
OFMEM: ofmem_claim phys=ffffffffffffffff size=00040000 align=00000008
OFMEM: ofmem_claim_virt virt=f0040000 size=00040000 align=00000000
OFMEM: ofmem_map_page_range f0040000 -> 006fc0000 00040000 mode 000000bc
OFMEM: ofmem_claim phys=ffffffffffffffff size=00019000 align=00000008
OFMEM: ofmem_claim_virt virt=f0240000 size=00019000 align=00000000
OFMEM: ofmem_map_page_range f0240000 -> 006fa7000 00019000 mode 000000bc
So what is happening is that %l7 is being set below 0xf0240000, and the trap is triggered because the kernel is attempting to write to unmapped virtual memory.
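To make the arithmetic concrete, here is a quick sketch using only the values from the logs above (the 0xa8 per-frame figure comes from the "sub %fp, 0xa8, %l7" in the disassembly):

#include <stdio.h>

int main(void)
{
    /* Values taken from the OFMEM debug output and gdb dumps above. */
    unsigned int map_base = 0xf0240000;  /* base of the claimed region */
    unsigned int map_size = 0x00019000;  /* size of the claimed region */
    unsigned int l7_good  = 0xf02401e8;  /* %l7 on the last good pass  */
    unsigned int l7_bad   = 0xf023ffb0;  /* %l7 on the failing pass    */

    printf("mapped: 0x%x .. 0x%x\n", map_base, map_base + map_size);
    printf("good %%l7: 0x%x above base\n", l7_good - map_base);  /* 0x1e8 */
    printf("bad  %%l7: 0x%x below base\n", map_base - l7_bad);   /* 0x50  */
    return 0;
}

So the last good trap frame sits only 0x1e8 bytes above the bottom of the mapping, and the next one lands 0x50 bytes below it.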
Using Artyom's blog, I was able to fire up kadb to try and figure out which part of the kernel is raising the exception:
kadb[0]: 0xf004127c?
sys_trap:
sys_trap:       aa102001 =      mov 0x1, %l5
kadb[0]:
Based upon this, it would seem that the Solaris kernel allocates a stack for saving state when a trap is taken, with its lower bound at 0xf0240000, but for some reason we push frames below that bound, beyond the memory region allocated for it. I suspect that this is a side effect of a property or device not being set up correctly, but I'm not yet sure which. Anyhow, I thought I'd post the results of my investigations so far in case anyone else has any ideas as to what could cause this.
ATB,
Mark.
On Sun, Jan 2, 2011 at 1:17 AM, Mark Cave-Ayland mark.cave-ayland@siriusit.co.uk wrote:
[...]

0xf0041270:     rett  %l2 + 4
0xf0041274:     b  0xf004127c
0xf0041278:     nop
0xf004127c:     mov  1, %l5     ! 0x1
0xf0041280:     sll  %l5, %l0, %l5
0xf0041284:     rd  %wim, %l3
0xf0041288:     btst  0x40, %l0
0xf004128c:     be  0xf0041318
0xf0041290:     btst  %l3, %l5
0xf0041294:     sub  %fp, 0xa8, %l7
0xf0041298:     st  %g1, [ %l7 + 0x6c ]
0xf004129c:     std  %g2, [ %l7 + 0x70 ]
Seems to be some kind of window trap handler.
[...]
Based upon this, it would seem that the Solaris kernel allocates a stack for saving state when a trap is taken, with its lower bound at 0xf0240000, but for some reason we push frames below that bound, beyond the memory region allocated for it. I suspect that this is a side effect of a property or device not being set up correctly, but I'm not yet sure which. Anyhow, I thought I'd post the results of my investigations so far in case anyone else has any ideas as to what could cause this.
The kernel stack has overflowed. Perhaps some recursive loop (iterating the device tree, since this doesn't happen on real hardware?) never exits, or maybe OpenBIOS consumes much more kernel stack than OBP does.
On 02/01/11 09:53, Blue Swirl wrote:
The kernel stack has overflowed. Perhaps some recursive loop (iterating the device tree, since this doesn't happen on real hardware?) never exits, or maybe OpenBIOS consumes much more kernel stack than OBP does.
Yeah, that's the conclusion I came to, although I'm not familiar enough with the overall Solaris boot process to figure out what should happen as opposed to what does happen.
After idly adding a few breakpoints in obp_fortheval_v2(), a couple of interesting things stand out:
Breakpoint 1, obp_fortheval_v2 (str=0x11a268 "h# f800.0000 rmap@ swap ! ", arg0=1247220, arg1=0, arg2=0, arg3=0, arg4=0) at ../arch/sparc32/romvec.c:428
Firstly, we don't do full region/segment mapping in OpenBIOS (we only have one-level PTEs), but I'm not sure this is relevant here.
Secondly, the majority of the calls to obp_devopen() pass an argument string like this: "/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@2,0:d". However, towards the end it changes to this: "/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@2,0:b". Maybe the kernel can't find something in the d slice and falls back to the b slice, at which point it runs out of stack space because this isn't supposed to happen?
ATB,
Mark.
On 02/01/11 11:14, Mark Cave-Ayland wrote:
Yeah, that's the conclusion I came to, although I'm not familiar enough with the overall Solaris boot process to figure out what should happen as opposed to what does happen.
After some fiddling with the Solaris ISO, I extracted the SS-5 kernel, loaded the symbols from it into gdb, and tried to step through various bits to see what is happening.
Stepping through the code manually, it looks like we're going through the following function names:
startup()
hat_kern_setup()
vac_flush()
fix_prom_pages()
Moving further, it's a little difficult to tell, but it looks as if we're dying in some kind of MMU setup here:
#0  0xf007057c in page_numtopp_nolock ()
#1  0xf005517c in load_l3 ()
#2  0xf0054e7c in load_l2 ()
#3  0xf024429c in rootfs ()
#4  0xf024429c in rootfs ()
I added a breakpoint at 0xf02442a0 and that wasn't reached before the fatal trap fired. Taking a look at these routines in the OpenSolaris source, it looks like fix_prom_pages() does some interesting things with memory lists to work out which parts of memory are already mapped, and so my current suspicion is that the memory lists are somehow wrong.
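For anyone following along, the kind of check fix_prom_pages() has to make is: walk a list of (address, size) chunks and decide whether a given range is covered. A rough sketch of that shape (the struct layout and names are illustrative, not the actual Solaris definitions):

#include <stdint.h>

/* Illustrative only: a v0-style memory list is a linked chain of
 * (start, size) chunks. Field names are made up for this sketch. */
struct memlist {
    struct memlist *next;
    uint32_t        start;
    uint32_t        size;
};

/* Is [addr, addr + len) fully contained in one of the chunks?
 * If the lists handed over by the PROM are wrong, a walk like
 * this misjudges which pages are already mapped. */
static int memlist_covers(const struct memlist *ml,
                          uint32_t addr, uint32_t len)
{
    for (; ml != NULL; ml = ml->next)
        if (addr >= ml->start && addr + len <= ml->start + ml->size)
            return 1;
    return 0;
}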
Does anyone know whether Solaris 8 uses the romvec v0 memlist arrays or whether it uses the relevant properties read directly from the /virtual-memory and /memory nodes?
ATB,
Mark.
On 03/01/11 13:51, Mark Cave-Ayland wrote:
Does anyone know whether Solaris 8 uses the romvec v0 memlist arrays or whether it uses the relevant properties read directly from the /virtual-memory and /memory nodes?
Ah, I think there is a distinct possibility that the memory properties are being generated incorrectly on SPARC32 :/
Artyom: any chance you could send the output of the following in OBP?
cd / .properties
cd /memory .properties
cd /virtual-memory .properties
My suspicion is that the translations property is wrong (definitely the virtual address part of the translation struct, because it doesn't respect #address-cells), and possibly the mode too.
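To spell the suspicion out, here are the two interpretations side by side (both layouts are my illustration, not definitions lifted from any header; the field set is a guess). With #address-cells = 2 on /virtual-memory, a client should find a two-cell virtual address at the start of each translations entry:

#include <stdint.h>

/* What I suspect OFMEM currently emits: a one-cell virtual address. */
struct xlat_1cell {
    uint32_t virt;      /* single-cell vaddr                */
    uint32_t size;
    uint32_t mode;      /* MMU mode bits, possibly also off */
};

/* What a #address-cells = 2 client would expect instead. */
struct xlat_2cell {
    uint32_t virt_hi;   /* upper cell of vaddr (zero on sparc32) */
    uint32_t virt_lo;   /* lower cell of vaddr                   */
    uint32_t size;
    uint32_t mode;
};

A client expecting the second layout but handed the first would misread every field after the address, so garbage translations could follow.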
ATB,
Mark.
On Mon, Jan 3, 2011 at 4:09 PM, Mark Cave-Ayland mark.cave-ayland@siriusit.co.uk wrote:
Artyom: any chance you could send the output of the following in OBP?
cd / .properties
cd /memory .properties
cd /virtual-memory .properties
Do you mean these,
http://lists.openbios.org/pipermail/openbios/2010-October/005611.html
or some other ones?
On 03/01/11 21:40, Artyom Tarasenko wrote:
Do you mean these,
http://lists.openbios.org/pipermail/openbios/2010-October/005611.html
or some other ones?
Hmmm, those are the ones, but I thought they were incomplete, as there was no translations property in /virtual-memory?
If this property is only available on newer versions of OBP, we may need to tweak OFMEM to update the ROMVEC arrays too.
ATB,
Mark.
On 03.01.2011 at 16:09, Mark Cave-Ayland wrote:
Artyom: any chance you could send the output of the following in OBP?
Some of these he already posted. :)
My suspicion is that the translations property is wrong (definitely the virtual address part of the translation struct, because it doesn't respect #address-cells), and possibly the mode too.
According to Tarl, the virtual address is not supposed to respect #address-cells but to use as many (integer) cells as needed for - hardcoded - one (stack) cell. I would thus expect the virtual address to be 4 bytes on sparc32.
Maybe the physical address is wider than expected?
Andreas
On 03/01/11 21:59, Andreas Färber wrote:
According to Tarl, the virtual address is not supposed to respect #address-cells but to use as many (integer) cells as needed for - hardcoded - one (stack) cell. I would thus expect the virtual address to be 4 bytes on sparc32.
Maybe the physical address is wider than expected?
Reviewing this again now, the obvious thing to spot is that the virtual address should be 2 cells wide in /virtual-memory:
ok cd /virtual-memory
ok .properties
.properties ?
ok .attributes
available        00000000 fff00000 00100000
i.e. the virtual address does seem to respect #address-cells here. I did a quick hack on OFMEM to see what happens if I do the same, and now the Solaris kernel gets stuck in a panic loop rather than invoking the fatal trap in Qemu - which I guess is progress ;)
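For clarity, this is how I'm decoding the .attributes output quoted above (the struct is just my reading of the three cells, not a real header):

#include <stdint.h>

/* "available    00000000 fff00000 00100000" from the real SS-5 OBP:
 * two address cells (#address-cells = 2), then one size cell. */
struct vmem_avail {
    uint32_t virt_hi;   /* 0x00000000                      */
    uint32_t virt_lo;   /* 0xfff00000                      */
    uint32_t size;      /* 0x00100000, i.e. 1MB at the top */
};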
It does however mean that the translation information must be passed via the romvec memlist arrays, rather than being read from the device tree properties, unless there is another MMU device node somewhere that we haven't found?
ATB,
Mark.
On 2011-1-3 1:59 PM, Andreas Färber wrote:
[...] According to Tarl, the virtual address is not supposed to respect #address-cells but to use as many (integer) cells as needed for - hardcoded - one (stack) cell. I would thus expect the virtual address to be 4 bytes on sparc32.
I don't recall phrasing it that way, but indeed #address-cells is specific to physical addresses. I might guess that back in 32-bit forth the virtual address would be 4 bytes, but that predates me (by 1994 we were doing 64-bit forth). I'd expect any recent Solaris to believe OBP is 64-bit. Matter of fact, by Solaris 2.4 we were already doing 64-bit; why are you running into 32-bit with Solaris 8?
By the way, have you seen P1275.1/D14a, Supplement for IEEE 1754 ISA (SPARC)? It talks a lot about issues you've been dealing with recently (lessee - that's http://playground.sun.com/1275/bindings/sparc/d14a/12751d1a.ps ).
On 03/01/11 22:45, Tarl Neustaedter wrote:
I don't recall phrasing it that way, but indeed #address-cells is specific to physical addresses. I might guess that back in 32-bit forth the virtual address would be 4 bytes, but that predates me (by 1994 we were doing 64-bit forth). I'd expect any recent Solaris to believe OBP is 64-bit. Matter of fact, by Solaris 2.4 we were already doing 64-bit; why are you running into 32-bit with Solaris 8?
Currently I have a 32-bit Solaris 8 install CD at work, and Qemu SPARC32 by default emulates an SS-5 sun4m machine (which Artyom has managed to get working with a PROM image from a real SS-5). Hence I've been comparing outputs to try and figure out what the Solaris kernel does with OBP that we aren't yet doing correctly in OpenBIOS.
By the way, have you seen P1275.1/D14a, Supplement for IEEE 1754 ISA (SPARC)? It talks a lot about issues you've been dealing with recently (lessee - that's http://playground.sun.com/1275/bindings/sparc/d14a/12751d1a.ps ).
Yes thanks - I already have a copy of this in my PDF library :)
ATB,
Mark.
On 2011-1-3 1:59 PM, Andreas Färber wrote:
[...] According to Tarl, the virtual address is not supposed to respect #address-cells but to use as many (integer) cells as needed for - hardcoded - one (stack) cell. I would thus expect the virtual address to be 4 bytes on sparc32.
What I find in Solaris (current Solaris, I don't currently have easy access to back versions) is below. Look in usr/src/psm/stand/boot/sparc/sun4/sys/prom_plat.h
/*
 * The 'format' of the "translations" property in the 'mmu' node ...
 */

struct translation {
        uint32_t virt_hi;       /* upper 32 bits of vaddr */
        uint32_t virt_lo;       /* lower 32 bits of vaddr */
        uint32_t size_hi;       /* upper 32 bits of size in bytes */
        uint32_t size_lo;       /* lower 32 bits of size in bytes */
        uint32_t tte_hi;        /* upper 32 bits of tte */
        uint32_t tte_lo;        /* lower 32 bits of tte */
};
On 03/01/11 22:52, Tarl Neustaedter wrote:
What I find in Solaris (current Solaris, I don't currently have easy access to back versions) is below. Look in usr/src/psm/stand/boot/sparc/sun4/sys/prom_plat.h
[...]
Hmmm, that looks right for 64-bit, but I'm not sure about 32-bit (note the use of tte rather than pte). Given that I don't think there is an MMU node in the SS-5 OBP boot tree, I'm not sure how the kernel is picking up the MMU translations.
I can only think that they are coming from the ROMVEC v0_prommap memlist which is mentioned here: http://tracker.coreboot.org/trac/openbios/browser/trunk/openbios-devel/arch/.... However, that structure doesn't match the memlist definition in http://src.opensolaris.org/source/xref/systemz/betelgeuse/usr/src/uts/common..., plus a list item can only contain one address, which would still make it unsuitable for holding translation information.
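To spell out that last point (both shapes below are illustrative, written from memory rather than copied from the headers): a memlist item carries one address plus a size, while a translation needs two addresses and a mode, so the fields simply aren't there:

/* Roughly the shape of a v0 memlist item: one address, one size. */
struct mlist_item {
    struct mlist_item *next;
    char              *addr;   /* the only address in the item */
    unsigned int       size;
};

/* A translation entry needs strictly more per item. */
struct xlat_item {
    unsigned int virt;   /* virtual address  */
    unsigned int phys;   /* physical address */
    unsigned int size;
    unsigned int mode;   /* MMU mode bits    */
};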
ATB,
Mark.