On Tue, Feb 14, 2017 at 02:50:23PM +0000, Dr. David Alan Gilbert wrote:
I've been sporadically carrying on debugging this and I think I have a bit more understanding, but not quite all the way.
I'm pretty sure we are trying to run code that's been over written by variable data - which I believe to be TimerLast, and I'm also fairly sure the problem is not a delayed reset from multiple reboot signals.
Some detail: a) I added some debug to qemu to print what triggered the reboot by always passing qemu_system_reset_request a string; the kernel is always rebooting via KBD_CCMD_RESET, I see that enter the BIOS and then see SeaBIOS then does KBD_CCMD_RESET again - which I think is the one coming from handle_resume32. But then I noticed that every so often we got a reboot from a KVM_SHUTDOWN which is apparently a triple-fault or the like.
b) With a KVM trace I got the addresses the fault was happening at, and it was always an address in the copy fo the bios around bffbcf.. although the actual failure jumped about a bit, taking the: Relocating init from 0x000e4810 to 0xbffb2050 (size 57120) I worked the fault address back to 'ef799' and from the objdump found that was an address in maininit but also corresponded to TimerLast:
I can confirm there is a defect with TimerLast and reboot handling. I'm not sure what the best fix is - suggestions welcome.
The root of the problem is that QEMU does not reset the f-segment on a reboot signal and so SeaBIOS must manually reset that memory. It does this in the call to qemu_prep_reset() in tryReboot(). That call copies the pristine copy of the bios at 0xffff0000 to 0xf0000. The code then goes on to signal a hard reboot so that all the hardware is reset and so that the code starts just like it would on first boot.
Unfortunately, the hard reboot logic makes calls to udelay() - specifically in i8042_reboot() and pci_reboot(). These calls to udelay() are not valid here as the timer isn't even necessarily configured at this point. Alas - it also has the nasty possibility of corrupting the pristine copy of the bios that was just made in qemu_prep_reset(). Random crashes can then occur on the next boot.
It's easy to catch this with the following code change:
--- a/src/hw/timer.c +++ b/src/hw/timer.c @@ -146,6 +146,8 @@ timer_adjust_bits(u32 value, u32 validbits) static u32 timer_read(void) { + if (!GET_GLOBAL(HaveRunPost)) + panic("timer read too early\n"); u16 port = GET_GLOBAL(TimerPort); if (CONFIG_TSC_TIMER && !port) // Read from CPU TSC
With that, I get a panic on every reboot. Commenting out the udelay() calls then gets reboots working again.
The simplest fix is to remove the udelay() calls from the reboot logic. However, it's not clear to me if it's valid to omit the delays when running on real hardware. Another possibility would be to try and harden timer_read() so that it can be run very early (prior to code relocation), but that's tricky to implement correctly..
Good find Dave - I know that wasn't easy to track down. -Kevin