On Thu, Nov 19, 2015 at 12:42:50PM +0000, Xulei (Stone) wrote:
Kevin,
After deeply analyzing, i think there may be 3 possible reasons: 1)wrong CountCPUs value. It seems CountCPUs++ in handle_smp() has no lock to protect. So, sometimes, 2 or more vcpu may get the same current value of CountCPUs. Then we'll get a single incrementation instead of 2 or more and "while (cmos_smp_count != CountCPUs)" will loop forever;
The handle_smp() code is called from romlayout.S:entry_smp() which does take a lock. So, all of handle_smp() should run synchronous.
Ok, this possibility is ruled out!
2)wrong cmos_smp_count value. SeaBIOS rtc reads an incorrect number?
Not sure - the last time there were problems in this area of the code others used kvmtrace to try and track this down. Since you are getting dprintf statements, you could also try outputting cmos_smp_count prior to the loop (see patch below).
I'll test again with this patch, and observe the output. But frankly speaking, i don't think SeaBIOS may read an incorrect number. Because, QEMU set smp_cpus value in pc_cmos_init (cmos ox5f) at the 1st time (QEMU does not execute pc_cmos_init again during repetitive reboot). SeaBIOS works well for many times which means there is no reason that a time SeaBIOS suddenly reads an incorrect number.
3)yield() stuck. Is it possible that SeaBIOS is stuck during yield? I've tested, when yield() is running, SeaBIOS seems has not created some other threads except the main thread. So I don't know what's the function of yield() here.?
The yield() allows hardware interrupts to occur. But note that yield() isn't called in the loop - is is only called after the loop completes.
If you are only getting this on massive repetitive reboot requests, there are some other possible explanations:
- perhaps the SIPI is getting lost because one of the CPUs is still
resetting or still processing a SIPI from the last reboot?
Seems impossible. Seen from the qemu log and SeaBIOS log, the VM has booted successfully and can execute our "regular rebooting" daemon process before the last reboot.
- the seabios code itself may have been corrupted if the memcpy() in
qemu_prep_reset() got far enough along to clear HaveRunPost, but did not get far enough along to fully complete the memcpy().
BTW, my VM reboots every 220 second (reboot time interval = 220s). I think SeaBIOS has enough time to process all kinds of affairs, like SIPI and memcpy().
Kevin, I want to know whether it is possible that if my VM is stuck at QEMU (a point of pci device reset procedure? or whatever) SeaBIOS will hold at hanle_smp() and could not printf "Found %d cpu(s) max supported %d cpu(s)"? Is this possible?
================== bad QEMU log======= [2015-11-13 18:45:57] qemu_devices_reset:1941 reset all devices [2015-11-13 18:45:57] set_nmi_flag:71 set nmi val = 0 [2015-11-13 18:45:58] monitor_qapi_event_emit:483 {"timestamp": {"seconds": 1447411558, "microseconds": 650381}, "event": "VSERPORT_CHANGE", "data": {"open": false, "id": "channel0"}} [2015-11-13 18:45:58] monitor_qapi_event_emit:483 {"timestamp": {"seconds": 1447411558, "microseconds": 796285}, "event": "RESET"} [2015-11-13 18:45:58] qemu_devices_reset:1941 reset all devices [2015-11-13 18:45:59] set_nmi_flag:71 set nmi val = 0 [2015-11-13 18:46:00] monitor_qapi_event_emit:483 {"timestamp": {"seconds": 1447411560, "microseconds": 212196}, "event": "RESET"} [2015-11-13 18:46:00] qemu_reset_report:749 domain is rebooting [2015-11-13 18:46:00] monitor_qapi_event_emit:483 {"timestamp": {"seconds": 1447411558, "microseconds": 650558}, "event": "VSERPORT_CHANGE", "data": {"open": false, "id": "channel3"}}
================ good QEMU log========= [2015-11-13 18:42:12] qemu_devices_reset:1941 reset all devices [2015-11-13 18:42:12] set_nmi_flag:71 set nmi val = 0 [2015-11-13 18:42:13] monitor_qapi_event_emit:483 {"timestamp": {"seconds": 1447411333, "microseconds": 718617}, "event": "VSERPORT_CHANGE", "data": {"open": false, "id": "channel0"}} [2015-11-13 18:42:13] monitor_qapi_event_emit:483 {"timestamp": {"seconds": 1447411333, "microseconds": 848236}, "event": "RESET"} [2015-11-13 18:42:14] qemu_devices_reset:1941 reset all devices [2015-11-13 18:42:14] set_nmi_flag:71 set nmi val = 0 [2015-11-13 18:42:15] monitor_qapi_event_emit:483 {"timestamp": {"seconds": 1447411335, "microseconds": 280198}, "event": "RESET"} [2015-11-13 18:42:15] qemu_reset_report:749 domain is rebooting [2015-11-13 18:42:15] monitor_qapi_event_emit:483 {"timestamp": {"seconds": 1447411333, "microseconds": 718794}, "event": "VSERPORT_CHANGE", "data": {"open": false, "id": "channel3"}} [2015-11-13 18:42:15] virtio_set_status:524 virtio-blk device status is 3 that means DRIVER [2015-11-13 18:42:15] virtio_set_status:524 virtio-blk device status is 7 that means DRIVER OK [2015-11-13 18:42:15] virtio_set_status:524 virtio-blk device status is 3 that means DRIVER [2015-11-13 18:42:15] virtio_set_status:524 virtio-blk device status is 7 that means DRIVER OK [2015-11-13 18:42:23] virtio_set_status:524 virtio-serial device status is 1 that means ACKNOWLEDGE [2015-11-13 18:42:23] virtio_set_status:524 virtio-serial device status is 3 that means DRIVER [2015-11-13 18:42:23] handle_control_message:333 virtio serial port '-1' hanle control message event = 0, value = 1 [2015-11-13 18:42:23] send_control_event:225 virtio serial port 1 send control message event = 1, value = 1 [2015-11-13 18:42:23] send_control_event:225 virtio serial port 2 send control message event = 1, value = 1 [2015-11-13 18:42:23] send_control_event:225 virtio serial port 3 send control message event = 1, value = 1 [2015-11-13 18:42:23] send_control_event:225 virtio serial port 4 send control message event = 1, value = 1 [2015-11-13 18:42:23] virtio_set_status:524 virtio-serial device status is 7 that means DRIVER OK [2015-11-13 18:42:23] handle_control_message:333 virtio serial port '1' hanle control message event = 3, value = 1 [2015-11-13 18:42:23] send_control_event:225 virtio serial port 1 send control message event = 6, value = 1 [2015-11-13 18:42:23] handle_control_message:333 virtio serial port '2' hanle control message event = 3, value = 1 [2015-11-13 18:42:23] send_control_event:225 virtio serial port 2 send control message event = 6, value = 1 [2015-11-13 18:42:23] handle_control_message:333 virtio serial port '3' hanle control message event = 3, value = 1 [2015-11-13 18:42:23] send_control_event:225 virtio serial port 3 send control message event = 6, value = 1 [2015-11-13 18:42:23] handle_control_message:333 virtio serial port '4' hanle control message event = 3, value = 1 [2015-11-13 18:42:23] virtio_set_status:524 virtio-blk device status is 1 that means ACKNOWLEDGE [2015-11-13 18:42:23] virtio_set_status:524 virtio-blk device status is 1 that means ACKNOWLEDGE [2015-11-13 18:42:23] virtio_set_status:524 virtio-blk device status is 3 that means DRIVER [2015-11-13 18:42:23] virtio_set_status:524 virtio-blk device status is 7 that means DRIVER OK [2015-11-13 18:42:23] virtio_set_status:524 virtio-blk device status is 3 that means DRIVER [2015-11-13 18:42:23] virtio_set_status:524 virtio-blk device status is 7 that means DRIVER OK [2015-11-13 18:42:30] handle_control_message:333 virtio serial port '2' hanle control message event = 6, value = 1 [2015-11-13 18:42:30] monitor_qapi_event_emit:483 {"timestamp": {"seconds": 1447411350, "microseconds": 214826}, "event": "VSERPORT_CHANGE", "data": {"open": true, "id": "channel1"}} [2015-11-13 18:42:30] handle_control_message:333 virtio serial port '1' hanle control message event = 6, value = 1 [2015-11-13 18:42:30] handle_control_message:333 virtio serial port '3' hanle control message event = 6, value = 1 [2015-11-13 18:42:30] handle_control_message:333 virtio serial port '4' hanle control message event = 6, value = 1 [2015-11-13 18:42:31] monitor_qapi_event_emit:483 {"timestamp": {"seconds": 1447411350, "microseconds": 220665}, "event": "VSERPORT_CHANGE", "data": {"open": true, "id": "channel3"}}
If the failure is reproducible, the patch below could help narrow the possibilities.
-Kevin
--- a/src/fw/smp.c +++ b/src/fw/smp.c @@ -125,6 +125,7 @@ smp_setup(void)
// Wait for other CPUs to process the SIPI. u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
- dprintf(1, "cmos_smp_count=%d\n", cmos_smp_count); while (cmos_smp_count != CountCPUs) asm volatile( // Release lock and allow other processors to use the stack.
@@ -136,6 +137,7 @@ smp_setup(void) " jc 1b\n" : "+m" (SMPLock), "+m" (SMPStack) : : "cc", "memory");
dprintf(1, "finish smp\n"); yield();
// Restore memory.