Hi Kevin,
-----Original Message-----
From: Kevin O'Connor [mailto:kevin@koconnor.net]
On Fri, Dec 18, 2015 at 03:04:58AM +0000, Gonglei (Arei) wrote:
Hi Kevin & Paolo,
Luckily, I reproduced this problem last night, and I got the log below while
SeaBIOS was stuck. [...]
[2015-12-18 10:38:10] gonglei: finish while
[...]
<...>-31509 [035] 154753.180077: kvm_exit: reason EXCEPTION_NMI rip 0x3 info 0 80000306
<...>-31509 [035] 154753.180077: kvm_emulate_insn: 0:3:f0 53 (real)
<...>-31509 [035] 154753.180077: kvm_inj_exception: #UD (0x0)
<...>-31509 [035] 154753.180077: kvm_entry: vcpu 0
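As a reading aid for the kvm_exit line, here is a minimal sketch that decodes the value 80000306. It assumes the host is Intel VMX (the EXCEPTION_NMI reason name suggests this) and that the second "info" value is the VM-exit interruption-information field; neither is confirmed in this thread. The helper names are hypothetical and are not taken from KVM or SeaBIOS.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: VMX VM-exit interruption-information field.
 * bits 7:0 = vector, bits 10:8 = event type,
 * bit 11 = error code valid, bit 31 = field valid. */
#define INTR_INFO_VECTOR_MASK  0x000000ffu
#define INTR_INFO_TYPE_MASK    0x00000700u
#define INTR_INFO_TYPE_SHIFT   8
#define INTR_INFO_ERRCODE_BIT  (1u << 11)
#define INTR_INFO_VALID_BIT    (1u << 31)

/* Hypothetical helpers, for illustration only. */
static inline unsigned intr_info_vector(uint32_t info)
{
    return info & INTR_INFO_VECTOR_MASK;
}

static inline unsigned intr_info_type(uint32_t info)
{
    return (info & INTR_INFO_TYPE_MASK) >> INTR_INFO_TYPE_SHIFT;
}

static inline int intr_info_valid(uint32_t info)
{
    return (info & INTR_INFO_VALID_BIT) != 0;
}

static inline int intr_info_has_error_code(uint32_t info)
{
    return (info & INTR_INFO_ERRCODE_BIT) != 0;
}

static inline const char *intr_info_type_name(unsigned type)
{
    switch (type) {
    case 0:  return "external interrupt";
    case 2:  return "NMI";
    case 3:  return "hardware exception";
    case 5:  return "privileged software exception";
    case 6:  return "software exception";
    default: return "other/reserved";
    }
}

/* Print one decoded value, e.g. print_intr_info(0x80000306u). */
static inline void print_intr_info(uint32_t info)
{
    printf("valid=%d vector=%u type=%u (%s) errcode=%d\n",
           intr_info_valid(info), intr_info_vector(info),
           intr_info_type(info),
           intr_info_type_name(intr_info_type(info)),
           intr_info_has_error_code(info));
}
```

Under those assumptions, 0x80000306 decodes to a valid entry with vector 6 (#UD) and type 3 (hardware exception), with no error code. That matches the kvm_inj_exception: #UD line and would mean the exit was caused by an invalid-opcode exception rather than an NMI, since the EXCEPTION_NMI exit reason covers both.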
This is an odd finding. It seems to indicate that the code is caught in an infinite irq loop once irqs are enabled. What doesn't make sense is that an NMI shouldn't depend on the cpu irq enable flag.
Maybe the root cause is not an NMI but an INTR: yield() enables hardware interrupts, an interrupt handler then runs and corrupts the SeaBIOS stack, so the BSP can't execute the next instruction and raises an exception. That causes a VM exit to the KVM kernel module, and the whole sequence repeats as an infinite loop. But I have no proof beyond the observed symptoms.
Kevin, can we drop yield() in smp_setup()?
diff --git a/src/fw/smp.c b/src/fw/smp.c
index 579acdb..dd23eda 100644
--- a/src/fw/smp.c
+++ b/src/fw/smp.c
@@ -136,7 +136,6 @@ smp_setup(void)
         " jc 1b\n"
         : "+m" (SMPLock), "+m" (SMPStack)
         : : "cc", "memory");
-    yield();
 
     // Restore memory.
     *(u64*)BUILD_AP_BOOT_ADDR = old;
Is it really useful and allowable for SeaBIOS? Maybe for other components? I'm not sure. I ask because we found that if we inject an NMI through QMP while SeaBIOS is booting, the guest gets *stuck*, and the kvm tracing log is the same as for the current problem.
Regards,
-Gonglei
Also, I can't explain why rip would be 0x03, nor why a #UD in an exception handler wouldn't result in a triple fault. Maybe someone with more kvm knowledge could help here.
I did notice that you appear to be running with SeaBIOS v1.8.1 - I recommend you upgrade to the latest. There were two important fixes in this area (8b9942fa and 3156b71a). I don't think either of these fixes would explain the log above, but it would be best to eliminate the possibility.
-Kevin