On Sun, May 6, 2012 at 2:02 PM, Andreas Färber afaerber@suse.de wrote:
On 06.05.2012 13:32, Blue Swirl wrote:
On Sat, May 5, 2012 at 3:37 PM, Andreas Färber afaerber@suse.de wrote:
Hello Blue,
While testing a potential AREG0 fix for ppc hosts by malc, I got an error running `./sparc-softmmu/qemu-system-sparc` (same with CD/kernel):
qemu: fatal: Trap 0x07 while interrupts disabled, Error state
pc: 00005e0c npc: 00005e10
General Registers:
%g0-7: 00000000 00000001 babababa 00000000 00000020 07ffff08 07ffe000 babababa

Current Register Window:
%o0-7: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
%l0-7: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
%i0-7: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Floating Point Registers:
%f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
%f08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
%f16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
%f24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
psr: 048000c0 (icc: N--- SPE: SP-) wim: 00000001
fsr: 00000000 y: 00000020
Aborted
The 0xbabababa in %g2 and %g7 is a signature I've seen in uninitialized memory on openSUSE 12.1 Betas. So I ran valgrind, and the following caught my eye on both ppc and x86_64:
==18801== Command: ./sparc-softmmu/qemu-system-sparc
==18801==
==18801== Thread 2:
==18801== Conditional jump or move depends on uninitialised value(s)
==18801==    at 0x25C5AF: compute_all_logic (cc_helper.c:37)
==18801==    by 0x25C648: helper_compute_psr (cc_helper.c:470)
==18801==    by 0x8CD0981: ???
==18801==  Uninitialised value was created by a heap allocation
==18801==    at 0x4C27CE8: memalign (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==18801==    by 0x4C27D97: posix_memalign (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==18801==    by 0x1F2101: qemu_memalign (oslib-posix.c:93)
==18801==    by 0x1F21A9: qemu_vmalloc (oslib-posix.c:126)
==18801==    by 0x2665F6: qemu_ram_alloc_from_ptr (exec.c:2647)
==18801==    by 0x286D76: memory_region_init_ram (memory.c:954)
==18801==    by 0x297FFD: ram_init1 (sun4m.c:757)
==18801==    by 0x204DAE: qdev_init (qdev.c:151)
==18801==    by 0x204EEC: qdev_init_nofail (qdev.c:258)
==18801==    by 0x298845: ram_init.constprop.7 (sun4m.c:783)
==18801==    by 0x298980: sun4m_hw_init (sun4m.c:862)
==18801==    by 0x2994A2: ss5_init (sun4m.c:1289)
This is at commit 8f473dd104f0937ce98523fa6f9de0bd845aebbe, and cc_helper.c:37 is the int32_t dst argument of get_NZ_icc(), which is always called with CC_DST, i.e. env->cc_dst.
This seems to indicate that a read from uninitialized memory occurred, from which cc_dst is being initialized?
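To illustrate why a poisoned cc_dst would be consistent with the N flag in the dump above (psr: 048000c0, icc: N---), here is a minimal sketch of an N/Z computation in the spirit of get_NZ_icc(). The names nz_icc_sketch and nz_icc_selfcheck are made up for illustration; only the bit positions (N is bit 23, Z is bit 22 of the SPARC V8 PSR) come from the architecture, and the real cc_helper.c may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* SPARC V8 PSR integer condition code bits: N is bit 23, Z is bit 22. */
#define PSR_NEG  (UINT32_C(1) << 23)
#define PSR_ZERO (UINT32_C(1) << 22)

/* Hypothetical stand-in for get_NZ_icc(): derive N and Z from a result. */
uint32_t nz_icc_sketch(int32_t dst)
{
    if (dst == 0) {
        return PSR_ZERO;
    }
    if (dst < 0) {
        return PSR_NEG;
    }
    return 0;
}

/* Small self-check; call it from a test or from main(). */
void nz_icc_selfcheck(void)
{
    /* 0xbabababa has its top bit set, so it converts to a negative
     * int32_t (two's complement conversion, as done by GCC and Clang). */
    int32_t poisoned = (int32_t)UINT32_C(0xbabababa);

    assert(nz_icc_sketch(poisoned) == PSR_NEG);
    assert(nz_icc_sketch(0) == PSR_ZERO);
    assert(nz_icc_sketch(1) == 0);
}
```

If cc_dst really did pick up 0xbabababa, a branch on dst < 0 is exactly the kind of conditional jump valgrind would flag.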
This should happen in target-sparc/cpu.c:45 memset(env, 0, offsetof(CPUSPARCState, breakpoints));
cc_dst is between structure start and CPU_COMMON.
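The partial-zeroing pattern mentioned here can be sketched roughly as follows. DemoCPUState and its fields are a hypothetical miniature, not the real CPUSPARCState; the point is only that everything laid out before the marker field is cleared and everything from the marker onwards is left alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical miniature of a CPU state struct (not CPUSPARCState). */
typedef struct DemoCPUState {
    uint32_t cc_src;
    uint32_t cc_dst;      /* before the marker, so reset clears it */
    uint32_t breakpoints; /* marker: first field that survives reset */
    uint32_t persistent;  /* after the marker, untouched by reset */
} DemoCPUState;

/* Zero every byte that precedes the marker field. */
void demo_cpu_reset(DemoCPUState *env)
{
    memset(env, 0, offsetof(DemoCPUState, breakpoints));
}

/* Self-check: poison the struct with 0xba bytes, then reset it. */
void demo_cpu_reset_check(void)
{
    DemoCPUState s;

    memset(&s, 0xba, sizeof(s));
    demo_cpu_reset(&s);

    assert(s.cc_src == 0);
    assert(s.cc_dst == 0);
    assert(s.breakpoints == UINT32_C(0xbabababa));
    assert(s.persistent == UINT32_C(0xbabababa));
}
```

So if reset runs as intended, a nonzero cc_dst has to have been written afterwards, which matches the observation that the value is traced to guest RAM rather than to the CPU object allocation.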
89aaf60dedbe0e6415acfe816e02b538e5c54e68 fixed a bug relating to reset recently.
The master commit above, which is still current, already includes that fix though, and it would not explain uninitialized memory that stems from sun4m RAM as opposed to QOM object_new(). Somewhere a read from uninitialized memory is happening, possibly in OpenBIOS, and the value is then stored into the CPUSPARCState after that has been zero-initialized, IIUC.
OK, I see it now. OpenBIOS treats the Sparc32 SMP table as valid whenever its valid field is nonzero, takes that to mean secondary processor setup, and jumps to the location given in the SMP table. With 0xbabababa in memory, this fragile logic fails, hence the early crash.
https://tracker.coreboot.org/trac/openbios/browser/trunk/openbios-devel/arch...
https://tracker.coreboot.org/trac/openbios/browser/trunk/openbios-devel/arch...
I think the current logic would also not survive a reset just when a secondary processor is brought online.
The fix is to make the SMP table logic robust, for example with a checksum. We could also read the CPU ID from MXCC and skip the check for the boot CPU, though MXCC presumably does not exist on all models.
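A checksum-guarded table could look roughly like the sketch below. The struct layout, the DEMO_SMP_MAGIC constant and all function names are assumptions for illustration only; the real OpenBIOS SMP table may be laid out differently, and a plain XOR with a constant is a deliberately weak checksum chosen to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table layout; not the real OpenBIOS structure. */
typedef struct DemoSmpTable {
    uint32_t valid;    /* nonzero when a secondary CPU should start */
    uint32_t entry;    /* address the secondary CPU would jump to */
    uint32_t context;
    uint32_t checksum; /* must match demo_smp_checksum() */
} DemoSmpTable;

/* Arbitrary nonzero constant, so a uniform fill pattern cannot match. */
#define DEMO_SMP_MAGIC UINT32_C(0x534d5031)

uint32_t demo_smp_checksum(const DemoSmpTable *t)
{
    return t->valid ^ t->entry ^ t->context ^ DEMO_SMP_MAGIC;
}

/* Writer side: fill in the table and seal it with the checksum. */
void demo_smp_publish(DemoSmpTable *t, uint32_t entry, uint32_t context)
{
    assert(t != NULL);
    t->valid = 1;
    t->entry = entry;
    t->context = context;
    t->checksum = demo_smp_checksum(t);
}

/* Reader side: trust the table only if valid is set AND the checksum fits. */
int demo_smp_usable(const DemoSmpTable *t)
{
    return t->valid != 0 && t->checksum == demo_smp_checksum(t);
}
```

With RAM filled by any repeated 32-bit pattern such as 0xbabababa, the three data words XOR to the pattern itself, so the stored checksum can only match if the magic constant were zero. A nonzero-only valid test, by contrast, accepts such garbage.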
My issue here is that sparc64 boots HelenOS fine up to the point where it tries to load the kernel (identical to the x86_64 host), but sparc32 exits really early on ppc. It might well be that there's a bug hidden in malc's TCG patch that's causing the fatal error state, but the uninitialized memory report shows up on both TCG hosts, so it is unlikely to be TCG-related.
/-F
Any idea where that could originate from, or how to debug it further? It doesn't seem to be caused by the 7d21dcc84b8c07918124a9c0708694d2fb013f65 OpenBIOS r1056 update.
Regards,
Andreas
-- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg