[OpenBIOS] Analysis of current Solaris 8 boot failure on SPARC32

Blue Swirl blauwirbel at gmail.com
Sun Jan 2 10:53:58 CET 2011


On Sun, Jan 2, 2011 at 1:17 AM, Mark Cave-Ayland
<mark.cave-ayland at siriusit.co.uk> wrote:
> Hi all,
>
> Currently attempting to boot a Solaris 8 install CD results in the following
> output:
>
>
> Configuration device id QEMU version 1 machine id 32
> CPUs: 1 x FMI,MB86904
> UUID: 00000000-0000-0000-0000-000000000000
> Welcome to OpenBIOS v1.0 built on Jan 2 2011 00:28
>  Type 'help' for detailed information
> Trying cdrom:d...
> Not a bootable ELF image
> Loading a.out image...
> Loaded 7680 bytes
> entry point is 0x4000
> bootpath: /iommu/sbus/espdma/esp/sd@2,0:d
>
> Jumping to entry point 00004000 for type 00000005...
> switching to new context:
> SunOS Release 5.8 Version Generic_108528-09 32-bit
> Copyright 1983-2001 Sun Microsystems, Inc.  All rights reserved.
> qemu: fatal: Trap 0x29 while interrupts disabled, Error state
> pc: f004127c  npc: f0041280
> General Registers:
> %g0-7: 00000000 00000808 00000001 f0041b74 00000000 f0243b88 00000000
> f0244020
>
> Current Register Window:
> %o0-7: f025831c f5a0f00c f0240374 f0240370 f024036c 00000004 f0240300
> f005bd84
> %l0-7: 04400cc2 f005bf94 f005bf98 00000004 00000209 00000004 00000000
> f023fe60
> %i0-7: 00000001 f02403f4 f5a0f00c f025831c 00000001 00000009 f023ff08
> f005c6b8
>
> Floating Point Registers:
> %f00: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
> %f04: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
> %f08: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
> %f12: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
> %f16: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
> %f20: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
> %f24: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
> %f28: 000000000.000000 000000000.000000 000000000.000000 000000000.000000
> psr: 04000cc2 (icc: ---- SPE: SP-) wim: 00000004
> fsr: 00080000 y: 00000000
> Aborted
>
>
> With the SPARC32 OFMEM migration complete, we can now get lots of debugging
> information regarding the memory mappings being made at run time. Setting a
> breakpoint at the crash address, it is possible to see that it is part of a
> loop that is called several times during boot. Using this we can compare the
> successful iterations of the loop with the failing version in order to
> determine where the crash is happening.
>
>
> Here is the gdb output from the last successful iteration of the loop:
>
> Breakpoint 1, 0xf004127c in ?? ()
> (gdb) disas 0xf0041270 0xf00412a0
> Dump of assembler code from 0xf0041270 to 0xf00412a0:
> 0xf0041270:     rett  %l2 + 4
> 0xf0041274:     b  0xf004127c
> 0xf0041278:     nop
> 0xf004127c:     mov  1, %l5     ! 0x1
> 0xf0041280:     sll  %l5, %l0, %l5
> 0xf0041284:     rd  %wim, %l3
> 0xf0041288:     btst  0x40, %l0
> 0xf004128c:     be  0xf0041318
> 0xf0041290:     btst  %l3, %l5
> 0xf0041294:     sub  %fp, 0xa8, %l7
> 0xf0041298:     st  %g1, [ %l7 + 0x6c ]
> 0xf004129c:     std  %g2, [ %l7 + 0x70 ]

Seems to be some kind of window trap handler.
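
As a rough illustration of why this looks like window handling: the first
instructions build a one-bit mask from the low bits of %l0 (which holds the
same value as %psr in the dumps) and test it against %wim. The sketch below
mirrors that test in Python. The function names are made up for this
example, the field positions follow the SPARC V8 PSR layout, and the
interpretation of the test is my reading of the disassembly, not something
confirmed from the Solaris source.

```python
def decode_psr(psr):
    """Split a SPARC V8 %psr value into the fields relevant here."""
    return {
        "cwp": psr & 0x1F,        # current window pointer, bits 4:0
        "et": (psr >> 5) & 1,     # enable traps
        "ps": (psr >> 6) & 1,     # previous supervisor (btst 0x40, %l0)
        "s": (psr >> 7) & 1,      # supervisor
        "pil": (psr >> 8) & 0xF,  # processor interrupt level
    }


def current_window_invalid(psr, wim):
    """Mirror: mov 1,%l5 ; sll %l5,%l0,%l5 ; rd %wim,%l3 ; btst %l3,%l5.

    sll uses only the low 5 bits of the shift count, which is the CWP field.
    """
    mask = 1 << (psr & 0x1F)
    return (mask & wim) != 0


if __name__ == "__main__":
    # (psr, wim) pairs copied from the register dumps in this thread.
    for psr, wim in [(0x4400CC0, 0x1), (0x4400CC4, 0x10), (0x4000CC2, 0x4)]:
        print(hex(psr), decode_psr(psr), current_window_invalid(psr, wim))
```

In all three dumps the WIM bit lines up with the CWP field, and ET is clear,
which would fit the "while interrupts disabled" part of the QEMU message.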

> End of assembler dump.
> (gdb) info regi
> g0             0x0      0
> g1             0x808    2056
> g2             0xf5a0f000       -174002176
> g3             0x19     25
> g4             0x0      0
> g5             0xf0243b88       -266060920
> g6             0x0      0
> g7             0xf0244020       -266059744
> o0             0x0      0
> o1             0xf02406b4       -266074444
> o2             0xf5a0f00c       -174002164
> o3             0xf0258398       -265976936
> o4             0xf0252b10       -265999600
> o5             0x0      0
> sp             0xf0240658       0xf0240658
> o7             0xf0041b74       -268166284
> l0             0x4400cc0        71306432
> l1             0xf004b1f8       -268127752
> l2             0xf004b1fc       -268127748
> l3             0xf0041000       -268169216
> l4             0x209    521
> l5             0x1      1
> l6             0x7      7
> l7             0xf0240658       -266074536
> i0             0xf024d870       -266020752
> i1             0x0      0
> i2             0xff812201       -8314367
> i3             0x0      0
> i4             0x0      0
> i5             0xf01582dc       -267025700
> fp             0xf0240290       0xf0240290
> i7             0xf004ef98       -268111976
> y              0x0      0
> psr            0x4400cc0        [ PS S #10 #11 #22 #26 ]
> wim            0x1      1
> tbr            0xf0040090       -268173168
> pc             0xf004127c       0xf004127c
> npc            0xf0041280       0xf0041280
> fsr            0x80000  [ #19 ]
> csr            0x0      0
> (gdb) stepi
> 0xf0041280 in ?? ()
> (gdb)
> 0xf0041284 in ?? ()
> (gdb)
> 0xf0041288 in ?? ()
> (gdb)
> 0xf004128c in ?? ()
> (gdb)
> 0xf0041290 in ?? ()
> (gdb)
> 0xf0041294 in ?? ()
> (gdb)
> 0xf0041298 in ?? ()
> (gdb)
> 0xf004129c in ?? ()
> (gdb) info regi
> g0             0x0      0
> g1             0x808    2056
> g2             0xf5a0f000       -174002176
> g3             0x19     25
> g4             0x0      0
> g5             0xf0243b88       -266060920
> g6             0x0      0
> g7             0xf0244020       -266059744
> o0             0x0      0
> o1             0xf02406b4       -266074444
> o2             0xf5a0f00c       -174002164
> o3             0xf0258398       -265976936
> o4             0xf0252b10       -265999600
> o5             0x0      0
> sp             0xf0240658       0xf0240658
> o7             0xf0041b74       -268166284
> l0             0x4400cc0        71306432
> l1             0xf004b1f8       -268127752
> l2             0xf004b1fc       -268127748
> l3             0x1      1
> l4             0x209    521
> l5             0x1      1
> l6             0x7      7
> l7             0xf02401e8       -266075672
> i0             0xf024d870       -266020752
> i1             0x0      0
> i2             0xff812201       -8314367
> i3             0x0      0
> i4             0x0      0
> i5             0xf01582dc       -267025700
> fp             0xf0240290       0xf0240290
> i7             0xf004ef98       -268111976
> y              0x0      0
> psr            0x4000cc0        [ PS S #10 #11 #26 ]
> wim            0x1      1
> tbr            0xf0040090       -268173168
> pc             0xf004129c       0xf004129c
> npc            0xf00412a0       0xf00412a0
> fsr            0x80000  [ #19 ]
> csr            0x0      0
>
>
> And here is the failing version:
>
>
> (gdb) info regi
> g0             0x0      0
> g1             0x80     128
> g2             0xf5a0f000       -174002176
> g3             0x1a     26
> g4             0x0      0
> g5             0xf0243b88       -266060920
> g6             0x0      0
> g7             0xf0244020       -266059744
> o0             0x0      0
> o1             0xf024047c       -266075012
> o2             0xf5a0f00c       -174002164
> o3             0xf0258398       -265976936
> o4             0xf0252b10       -265999600
> o5             0x0      0
> sp             0xf0240420       0xf0240420
> o7             0xf0041b74       -268166284
> l0             0x4400cc4        71306436
> l1             0xf004b1f8       -268127752
> l2             0xf004b1fc       -268127748
> l3             0x10     16
> l4             0x209    521
> l5             0x10     16
> l6             0x7      7
> l7             0xf023ffb0       -266076240
> i0             0xf5a0f01c       -174002148
> i1             0x100    256
> i2             0xf0000000       -268435456
> i3             0xff000000       -16777216
> i4             0x4100cc5        68160709
> i5             0x4100ce5        68160741
> fp             0xf0240058       0xf0240058
> i7             0xf0054be8       -268088344
> y              0x0      0
> psr            0x4000cc4        [ #2 PS S #10 #11 #26 ]
> wim            0x10     16
> tbr            0xf0040090       -268173168
> pc             0xf00412a4       0xf00412a4
> npc            0xf00412a8       0xf00412a8
> fsr            0x80000  [ #19 ]
> csr            0x0      0
> (gdb) cont
> Continuing.
>
> Breakpoint 1, 0xf004127c in ?? ()
> (gdb) stepi
> 0xf0041280 in ?? ()
> (gdb)
> 0xf0041284 in ?? ()
> (gdb)
> 0xf0041288 in ?? ()
> (gdb)
> 0xf004128c in ?? ()
> (gdb)
> 0xf0041290 in ?? ()
> (gdb)
> 0xf0041294 in ?? ()
> (gdb)
> 0xf0041298 in ?? ()
> (gdb)
> Remote connection closed
> (gdb)
>
>
> So the failure appears to be happening on this instruction:
>
> 0xf0041298:     st  %g1, [ %l7 + 0x6c ]
>
> For the successful iteration:
>
> l7             0xf02401e8       -266075672
>
> For the failing iteration:
>
> l7             0xf023ffb0       -266076240
>
> With OFMEM debugging enabled, it's fairly easy to see the following in the
> console output:
>
> Jumping to entry point 00004000 for type 00000005...
> switching to new context:
> OFMEM: ofmem_claim phys=ffffffffffffffff size=00040000 align=00000008
> OFMEM: ofmem_claim_virt virt=f0040000 size=00040000 align=00000000
> OFMEM: ofmem_map_page_range f0040000 -> 006fc0000 00040000 mode 000000bc
> OFMEM: ofmem_claim phys=ffffffffffffffff size=00019000 align=00000008
> OFMEM: ofmem_claim_virt virt=f0240000 size=00019000 align=00000000
> OFMEM: ofmem_map_page_range f0240000 -> 006fa7000 00019000 mode 000000bc
>
> So what is happening is that %l7 is getting set to below 0xf0240000 and
> hence the trap is triggered because the kernel is attempting to write to
> unmapped virtual memory.
>
> Using Artyom's blog, I was able to fire up kadb to try and figure out which
> part of the kernel is raising the exception:
>
> kadb[0]: 0xf004127c?
> sys_trap:
> sys_trap:       aa102001        = mov     0x1, %l5
> kadb[0]:
>
> Based upon this, it would seem that the Solaris kernel allocates a stack for
> saving state when a trap is called with a base of 0xf0240000, but for some
> reason we are stacking to a point where we go beyond the memory region
> allocated for it. I suspect that this is a side effect of a property/device
> not being set up correctly, but I'm not yet sure what it is. Anyhow, I
> thought I'd post the results of my investigations so far in case anyone else
> has any ideas as to what could cause this.

The kernel stack has overflowed. Perhaps some recursive loop (iterating
the device tree, since this doesn't happen on real hardware?) never exits,
or maybe OpenBIOS consumes much more kernel stack than OBP.
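
To make the overflow concrete, here is a small sketch of the address
arithmetic, using only numbers from the dumps above: the two
ofmem_map_page_range entries, the 0xa8 frame size from
"sub %fp, 0xa8, %l7", and the 0x6c offset of the faulting store. The
constant and function names are hypothetical, and it assumes those two
ranges are the only kernel mappings that matter here.

```python
# (virtual base, size) taken from the OFMEM debug output in this thread.
MAPPED_RANGES = [
    (0xF0040000, 0x00040000),
    (0xF0240000, 0x00019000),
]

FRAME_SIZE = 0xA8  # sub %fp, 0xa8, %l7
G1_OFFSET = 0x6C   # st %g1, [ %l7 + 0x6c ]


def is_mapped(addr):
    """True if addr lies inside one of the ranges listed above."""
    return any(base <= addr < base + size for base, size in MAPPED_RANGES)


def store_address(fp):
    """Address written by the st instruction for a given frame pointer."""
    return fp - FRAME_SIZE + G1_OFFSET


if __name__ == "__main__":
    # %fp from the successful dump, the dump before the crash, and the
    # QEMU fatal dump (%i6), in that order.
    for fp in (0xF0240290, 0xF0240058, 0xF023FF08):
        addr = store_address(fp)
        print(f"fp={fp:#010x} store={addr:#010x} mapped={is_mapped(addr)}")
```

With %fp = 0xf0240058 the frame starts at 0xf023ffb0, already below the
mapping, but the store itself still lands inside it. With the %i6 value
from the QEMU dump (0xf023ff08) the store goes to 0xf023fecc, which is
outside both ranges.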

