Hi,
I noticed that under OVMF + SeaBIOS CSM + your related patches for both, reset requested by the guest doesn't work as expected. The behavior is an infinite loop, with the following debug fragment repeated by the CSM-ized SeaBIOS:
  In resume (status=0)
  In 32bit resume
  Attempting a hard reboot
  i8042_wait_write
The corresponding call chain seems to be:
  reset_vector() [src/romlayout.S]
    entry_post()
      entry_resume()
        handle_resume() [src/resume.c]        prints "In resume"
          handle_resume32()                   prints "In 32bit resume"
            tryReboot()                       prints "Attempting a hard reboot"
              i8042_reboot() [src/ps2port.c]
                i8042_wait_write()            prints "i8042_wait_write"
                outb(0xfe, PORT_PS2_STATUS)
(The entry_post -> entry_resume jump occurs because HaveRunPost has been set to 1 by csm_maininit() --> interface_init() --> ivt_init().)
At this point kbd_write_command() in qemu-kvm's "hw/pckbd.c", case KBD_CCMD_RESET, requests a system reset. Soon the reset handlers run, among them cpu_reset() (which was registered by
  pc_init1() [hw/pc.c]
    pc_new_cpu()
). cpu_reset() [target-i386/helper.c] sets CS:IP to f000:fff0, which is the exact address of... reset_vector() in SeaBIOS.
Of course OVMF should be re-run instead of SeaBIOS. When qemu-kvm starts, "OVMF.fd" is installed as ROM, such that the last address it occupies is "all-bits-one", independently of its size (below a limit of course):
  pc_init1() [hw/pc.c]
    rom_add_file_fixed() []
      open() / read() / close()
      rom_insert()
    some calls to inform KVM
I think when OVMF runs SeaBIOS as CSM, OVMF shadows the original ROM (containing the OVMF binary itself) with the SeaBIOS code + static data (I'm peeking at http://en.wikipedia.org/wiki/Shadow_RAM#Shadow_RAM...). This should render SeaBIOS visible / executable / writeable in the top 16-bit segment, and leave OVMF in a permanently unusable state (in RAM at least).
My guess at the relevant edk2 function is ShadowAndStartLegacy16() [IntelFrameworkModulePkg/Csm/LegacyBiosDxe/LegacyBios.c]. The LegacyRegion->UnLock() call should be instrumental (implemented in "OvmfPkg/Csm/CsmSupportLib/LegacyRegion.c" with PAM (Programmable Attribute Map) registers?)
Hence I *presume* qemu-kvm should un-shadow the ROM at reset time (ie. make OVMF visible again as ROM) not later than allowing the VCPU to continue at f000:fff0 again. Normally that address should be occupied by OVMF code from (I guess) "UefiCpuPkg/ResetVector/Vtf0".
Does this make any sense? Is qemu-kvm forgetting to reset the PAMs? Or would that be the responsibility of tryReboot() in SeaBIOS?
... Aah! qemu_prep_reset() in SeaBIOS [src/shadow.c] goes like:
  void
  qemu_prep_reset(void)
  {
      if (!CONFIG_QEMU)
          return;
      // QEMU doesn't map 0xc0000-0xfffff back to the original rom on a
      // reset, so do that manually before invoking a hard reset.
      make_bios_writable();
      extern u8 code32flat_start[], code32flat_end[];
      memcpy(code32flat_start, code32flat_start + BIOS_SRC_OFFSET
             , code32flat_end - code32flat_start);

      if (HaveRunPost)
          // Memory copy failed to work - try to halt the machine.
          apm_shutdown();
  }
and this function is actually called inside the infinite loop (I ignored it before):
  tryReboot()                       prints "Attempting a hard reboot"
    qemu_prep_reset() [src/shadow.c]            <-------------- here
    i8042_reboot() [src/ps2port.c]
      i8042_wait_write()            prints "i8042_wait_write"
      outb(0xfe, PORT_PS2_STATUS)
but of course it doesn't do anything with CONFIG_CSM (since that implies !CONFIG_QEMU). What's more, qemu_prep_reset() and make_bios_writable_intel() seem to exploit SeaBIOS characteristics (code32flat_*, HaveRunPost etc.) that probably make no sense when the data being restored is a different (= OVMF) image.
Can we dumb down ^W^W generalize this code? :) Or maybe should qemu introduce a reset handler for PAMs?
(I realize I've been reading all the time about PAMs, in the "Writeable files in fw_cfg" thread, in the discussion about unlocking the 0xE0000 segment for stack purposes... Didn't understand a single word before, sorry. Downloaded my copy of the i440FX spec just today; I finally have a remote idea how shadowing / write-protecting works.)
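For anyone else who is new to PAMs, here is a rough sketch of how one PAM attribute field could be decoded. It assumes my reading of the i440FX data sheet is right: PAM0..PAM6 sit at PCI config offsets 0x59..0x5f, each register holds two 4-bit fields, and within a field bit 0 is "read enable" and bit 1 is "write enable" (the high nibble of PAM0 covering 0xf0000-0xfffff). The names pam_attr, pam_decode_field and PAM_ATTR_* are made up for illustration; this is not code from qemu, SeaBIOS or edk2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit meanings within one 4-bit PAM field (please verify against
 * the data sheet): bit 0 = read enable, bit 1 = write enable. */
#define PAM_ATTR_RE 0x1u
#define PAM_ATTR_WE 0x2u

/* Hypothetical decoded view of one PAM field. */
struct pam_attr {
    bool read_from_ram;  /* reads hit shadow RAM instead of being forwarded to ROM */
    bool write_to_ram;   /* writes land in shadow RAM instead of being forwarded */
};

/* Decode the low or high nibble of a single PAM register value. */
static struct pam_attr pam_decode_field(uint8_t reg, bool high_nibble)
{
    uint8_t field = high_nibble ? (uint8_t)(reg >> 4) : reg;
    struct pam_attr a = {
        .read_from_ram = (field & PAM_ATTR_RE) != 0,
        .write_to_ram  = (field & PAM_ATTR_WE) != 0,
    };
    return a;
}
```

Under these assumptions, a PAM0 value of 0x00 would leave the F-segment fully backed by ROM, 0x10 would be "read from RAM, write-protected", and 0x30 would be plain read/write RAM.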
Thanks! Laszlo
On 02/14/2013 12:41 PM, Laszlo Ersek wrote:
). cpu_reset() [target-i386/helper.c] sets CS:IP to f000:fff0, which is the exact address of... reset_vector() in SeaBIOS.
This would be a bug, but it isn't quite true.
If you look at x86_cpu_reset() you will note that it sets the code segment base to 0xffff0000, not 0xf0000 as one could expect from the above. This is also true of a physical x86.
As such, the *real* reset vector is at 0xfffffff0 as opposed to the SeaBIOS vector at 0xffff0 -- this is a backwards compatibility vector which typically just issues a real reset.
Now, if Qemu doesn't handle the distinction here correctly, that is a bug.
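To make the arithmetic concrete, here is a tiny sketch of the two address computations being contrasted. The helper names are invented for illustration and are not taken from qemu; the only inputs are the values already mentioned above (selector f000, IP fff0, hidden base 0xffff0000).

```c
#include <stdint.h>

/* Linear address as one would naively derive it from the visible
 * real-mode selector: (segment << 4) + offset. */
static uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* Linear address as derived from the hidden (cached) segment base,
 * which is what is in effect right after reset. */
static uint32_t cached_base_linear(uint32_t cs_base, uint16_t ip)
{
    return cs_base + ip;
}
```

With these, real_mode_linear(0xf000, 0xfff0) gives 0xffff0 (the compatibility vector), while cached_base_linear(0xffff0000, 0xfff0) gives 0xfffffff0 (the real reset vector).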
-hpa
On Thu, 2013-02-14 at 12:54 -0800, H. Peter Anvin wrote:
This would be a bug, but it isn't quite true.
If you look at x86_cpu_reset() you will note that it sets the code segment base to 0xffff0000, not 0xf0000 as one could expect from the above. This is also true of a physical x86.
As such, the *real* reset vector is at 0xfffffff0 as opposed to the SeaBIOS vector at 0xffff0 -- this is a backwards compatibility vector which typically just issues a real reset.
In SeaBIOS it doesn't. It jumps to entry_post(). Which is fine for native SeaBIOS, but I suppose I need to fix it to do a *real* reset in the CSM case, for those operating systems which will switch back to 16-bit mode and jump to f000:fff0 to reboot.
Of course, if said "real reset" is only going to get straight back to the same 0xffff0 reset vector, that's not going to help. But at least then none of it will be *my* fault :)
On 02/14/13 21:54, H. Peter Anvin wrote:
On 02/14/2013 12:41 PM, Laszlo Ersek wrote:
). cpu_reset() [target-i386/helper.c] sets CS:IP to f000:fff0, which is the exact address of... reset_vector() in SeaBIOS.
This would be a bug, but it isn't quite true.
If you look at x86_cpu_reset() you will note that it sets the code segment base to 0xffff0000, not 0xf0000 as one could expect from the above. This is also true of a physical x86.
As such, the *real* reset vector is at 0xfffffff0 as opposed to the SeaBIOS vector at 0xffff0 -- this is a backwards compatibility vector which typically just issues a real reset.
Now, if Qemu doesn't handle the distinction here correctly, that is a bug.
I think I was simply wrong :)
Thanks Laszlo
On Thu, 2013-02-14 at 22:14 +0100, Laszlo Ersek wrote:
On 02/14/13 21:54, H. Peter Anvin wrote:
On 02/14/2013 12:41 PM, Laszlo Ersek wrote:
). cpu_reset() [target-i386/helper.c] sets CS:IP to f000:fff0, which is the exact address of... reset_vector() in SeaBIOS.
This would be a bug, but it isn't quite true.
If you look at x86_cpu_reset() you will note that it sets the code segment base to 0xffff0000, not 0xf0000 as one could expect from the above. This is also true of a physical x86.
As such, the *real* reset vector is at 0xfffffff0 as opposed to the SeaBIOS vector at 0xffff0 -- this is a backwards compatibility vector which typically just issues a real reset.
Now, if Qemu doesn't handle the distinction here correctly, that is a bug.
I think I was simply wrong :)
So it *is* jumping to 0xfffffff0 but the memory at that location isn't what we expect? Do the PAM registers affect *that* too, or only the region from 0xc0000-0xfffff? Surely the contents at 4GiB-δ should be unchanged by *anything* we do with the PAM registers?
Or maybe not... after also downloading the i440fx data sheet, I'm even more confused. There's some aliasing with... not the region at 1MiB-δ but the region at 16MiB-δ:
(From §4.1 System Address Map):
2. High BIOS Area (FFE0_0000h-FFFF_FFFFh)
The top 2 Mbytes of the Extended Memory Region is reserved for System BIOS (High BIOS), extended BIOS for PCI devices, and the A20 alias of the system BIOS. The CPU begins execution from the High BIOS after reset. This region is mapped to the PCI so that the upper subset of this region is aliased to 16 Mbytes minus 256-Kbyte range.
On 02/14/2013 01:27 PM, David Woodhouse wrote:
So it *is* jumping to 0xfffffff0 but the memory at that location isn't what we expect? Do the PAM registers affect *that* too, or only the region from 0xc0000-0xfffff? Surely the contents at 4GiB-δ should be unchanged by *anything* we do with the PAM registers?
Or maybe not... after also downloading the i440fx data sheet, I'm even more confused. There's some aliasing with... not the region at 1MiB-δ but the region at 16MiB-δ:
(From §4.1 System Address Map):
- High BIOS Area (FFE0_0000h-FFFF_FFFFh)
The top 2 Mbytes of the Extended Memory Region is reserved for System BIOS (High BIOS), extended BIOS for PCI devices, and the A20 alias of the system BIOS. The CPU begins execution from the High BIOS after reset. This region is mapped to the PCI so that the upper subset of this region is aliased to 16 Mbytes minus 256-Kbyte range.
That is presumably a 286 compatibility hack -- the 286 had 24 address lines. I doubt anyone gives a hoot about it, and neither EDK2 nor SeaBIOS should care.
-hpa
On Feb 14, 2013, at 2:09 PM, "H. Peter Anvin" hpa@zytor.com wrote:
On 02/14/2013 01:27 PM, David Woodhouse wrote:
So it *is* jumping to 0xfffffff0 but the memory at that location isn't what we expect? Do the PAM registers affect *that* too, or only the region from 0xc0000-0xfffff? Surely the contents at 4GiB-δ should be unchanged by *anything* we do with the PAM registers?
Or maybe not... after also downloading the i440fx data sheet, I'm even more confused. There's some aliasing with... not the region at 1MiB-δ but the region at 16MiB-δ:
I don't remember the specific registers for the 440BX....
The i486 moved the reset vector to 0xFFFFFFF0, but it is in real mode. The processor CS register has some magic internal value that lets you run real mode code up high, but the 1st long jmp you do sends you down low. Thus the chipset needs to alias 0xF000:0xFFF0 to the high address. If your BIOS is written in protected mode, then it will turn on the High BIOS Area and jump back into the region just under 4GB, where it now has access to a ROM that can be up to 2MB in size.
If you hardware reset, the PAM registers should get set back to defaults, and the CPU goes into the reset state. If you soft (also called warm) reset, i.e. jump to 0xF000:0xFFF0, then you are not running the reset code in ROM (called SEC in the PI lingo); you are running the shadowed copy from memory provided by SeaBIOS for compatibility.
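A toy model of the "first long jmp sends you down low" behavior, in case it helps. This is only a sketch of the real-mode segment cache idea under the values discussed in this thread; the struct and function names are hypothetical and do not come from any emulator.

```c
#include <stdint.h>

/* Hypothetical minimal CPU model: visible CS selector, hidden CS base, IP. */
struct cpu_model {
    uint16_t cs_selector;
    uint32_t cs_base;   /* hidden part of CS, not visible to software */
    uint16_t ip;
};

/* Hard reset: selector f000, but the hidden base points just under 4GB. */
static void model_hard_reset(struct cpu_model *c)
{
    c->cs_selector = 0xf000;
    c->cs_base = 0xffff0000u;
    c->ip = 0xfff0;
}

/* Real-mode far jump: reloading CS recomputes the hidden base as
 * selector << 4, so the "magic" high base is lost. */
static void model_far_jmp(struct cpu_model *c, uint16_t seg, uint16_t off)
{
    c->cs_selector = seg;
    c->cs_base = (uint32_t)seg << 4;
    c->ip = off;
}

/* Address of the next instruction fetch. */
static uint32_t model_fetch_addr(const struct cpu_model *c)
{
    return c->cs_base + c->ip;
}
```

After model_hard_reset() the fetch address is 0xfffffff0 (ROM); after model_far_jmp(&cpu, 0xf000, 0xfff0), which is what a warm reset amounts to, it is 0xffff0, i.e. whatever currently sits in the F-segment.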
Thanks,
Andrew
(From §4.1 System Address Map):
- High BIOS Area (FFE0_0000h-FFFF_FFFFh)
The top 2 Mbytes of the Extended Memory Region is reserved for System BIOS (High BIOS), extended BIOS for PCI devices, and the A20 alias of the system BIOS. The CPU begins execution from the High BIOS after reset. This region is mapped to the PCI so that the upper subset of this region is aliased to 16 Mbytes minus 256-Kbyte range.
That is presumably a 286 compatibility hack -- the 286 had 24 address lines. I doubt anyone gives a hoot about it, and neither EDK2 nor SeaBIOS should care.
-hpa
-- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
On 02/14/13 22:27, David Woodhouse wrote:
On Thu, 2013-02-14 at 22:14 +0100, Laszlo Ersek wrote:
On 02/14/13 21:54, H. Peter Anvin wrote:
On 02/14/2013 12:41 PM, Laszlo Ersek wrote:
). cpu_reset() [target-i386/helper.c] sets CS:IP to f000:fff0, which is the exact address of... reset_vector() in SeaBIOS.
This would be a bug, but it isn't quite true.
If you look at x86_cpu_reset() you will note that it sets the code segment base to 0xffff0000, not 0xf0000 as one could expect from the above. This is also true of a physical x86.
As such, the *real* reset vector is at 0xfffffff0 as opposed to the SeaBIOS vector at 0xffff0 -- this is a backwards compatibility vector which typically just issues a real reset.
Now, if Qemu doesn't handle the distinction here correctly, that is a bug.
I think I was simply wrong :)
So it *is* jumping to 0xfffffff0 but the memory at that location isn't what we expect? Do the PAM registers affect *that* too, or only the region from 0xc0000-0xfffff? Surely the contents at 4GiB-δ should be unchanged by *anything* we do with the PAM registers?
I meant that my reading of x86_cpu_reset() [nee cpu_reset()] was wrong: the constant that it passes as the argument does in fact conform to what Peter says.
( Also check the rom_add_file_fixed() call in pc_init1() / qemu:
ret = rom_add_file_fixed(bios_name, (uint32_t)(-bios_size), -1);
("bios_size" is an "int"). I referred to this in my thread starter as
When qemu-kvm starts, "OVMF.fd" is installed as ROM, such that the last address it occupies is "all-bits-one", independently of its size [...]
Namely, the value of (uint32_t)(-bios_size) is
UINT32_MAX + 1 - bios_size
(the above is meant as a math formula, not as a C expression), hence the last byte occupied is at UINT32_MAX. In my first post I silently thought that this value would be truncated to fewer bits somewhere, but apparently that's not the case. )
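The same computation as a small self-contained sketch. Only the (uint32_t)(-bios_size) expression is from qemu; the helper names are made up here, and the 2 MiB size used in the note below is just an example value, not a claim about the actual size of OVMF.fd.

```c
#include <stdint.h>

/* Load address chosen so that the image ends at the top of the 32-bit
 * address space: converting the negative int to uint32_t yields
 * UINT32_MAX + 1 - bios_size (modular arithmetic). */
static uint32_t rom_base_for_size(int bios_size)
{
    return (uint32_t)(-bios_size);
}

/* Address of the last byte occupied by the image. Unsigned wraparound is
 * well defined in C, so base + size wraps to 0 and minus 1 is UINT32_MAX. */
static uint32_t rom_last_byte(int bios_size)
{
    return rom_base_for_size(bios_size) + (uint32_t)bios_size - 1u;
}
```

For a hypothetical 2 MiB image, rom_base_for_size(0x200000) is 0xffe00000 and rom_last_byte(0x200000) is 0xffffffff; the last byte stays at 0xffffffff for any positive size.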
Or maybe not... after also downloading the i440fx data sheet, I'm even more confused. There's some aliasing with... not the region at 1MiB-δ but the region at 16MiB-δ:
(From §4.1 System Address Map):
- High BIOS Area (FFE0_0000h-FFFF_FFFFh)
The top 2 Mbytes of the Extended Memory Region is reserved for System BIOS (High BIOS), extended BIOS for PCI devices, and the A20 alias of the system BIOS. The CPU begins execution from the High BIOS after reset. This region is mapped to the PCI so that the upper subset of this region is aliased to 16 Mbytes minus 256-Kbyte range.
After Peter emphasized that the code segment base was 0xffff0000, I went back to wikipedia http://en.wikipedia.org/wiki/Reset_vector, and finally managed to parse
The reset vector for the 80386DX and later x86 processors is 0xFFFF0, although the value of the CS register at reset is 0xF000 and the value of the IP register at reset is 0xFFF0. In actuality, current x86 processors fetch from the physical address 0xFFFFFFF0. This is due to a hidden base address portion of the CS register in real mode which defaults to 0xFFFF0000 after reset.
This is again consistent with the 0xfffffff0 vector pointed out by Peter (= 0xFFFF0000 + 0xFFF0), but I don't know how to match it to the data sheet language...
Thanks Laszlo
On Thu, 2013-02-14 at 21:41 +0100, Laszlo Ersek wrote:
Can we dumb down ^W^W generalize this code? :) Or maybe should qemu introduce a reset handler for PAMs?
In the UEFI+CSM model, I believe the handling of PAM stuff is left *entirely* to the UEFI side and the CSM is supposed to be hardware-agnostic. So actually bashing on the PAM registers from the CSM side would be my last resort. It's why it's important to fix the UmbStart/UmbEnd thing correctly, too.
Other people might have been happy to hack up something machine-specific, given that they control both UEFI and CSM sides of it and they're shipping their own proprietary version where nobody can see what they're doing. But when the CSM spec is the interface between two entirely *separate* projects (OVMF and SeaBIOS), I think it's important that we *follow* the spec and don't have nasty hacks.
So, if real hardware would reset the PAMs on reset and avoid the need for SeaBIOS to do so, I think we should be doing the same in qemu too.
Thanks for testing this, btw. Are you looking at suspend/resume too? :)
On 02/14/13 21:55, David Woodhouse wrote:
So, if real hardware would reset the PAMs on reset and avoid the need for SeaBIOS to do so, I think we should be doing the same in qemu too.
That's what I couldn't figure out from the i440FX spec, but I believe one could argue that "reset" should in fact re-set the state that was observable at VM startup.
Thanks for testing this, btw. Are you looking at suspend/resume too? :)
I'm either not looking, or not admitting it! :)
In earnest: I "approached" Platform Initialization S3 resume cautiously, then fled in a panic. See Chapter 8 in Volume 5 -- Jordan convinced me that this scripting language was in fact reasonable, but the work needed to follow through OVMF PI, make it S3 resume compliant, and record everything into a script at cold boot looks very threatening.
Regarding S3 in SeaBIOS CSM: I haven't tried it yet, but I can press the button in guests if you want me to.
Laszlo
On 02/14/13 21:55, David Woodhouse wrote:
Thanks for testing this, btw. Are you looking at suspend/resume too? :)
Entering S3 seemed OK (except the screen was not cleared; using Cirrus). I woke up the guest with
  # virsh qemu-monitor-command fw-ovmf.g-f18xfce2012121716.e-rhel63 \
      --hmp --cmd system_wakeup
Trailing portion of the log:
  In resume (status=254)
  In 32bit resume
  rsdp=0x00000000
  No resume vector set!
  Attempting a hard reboot
  i8042_wait_write
  In resume (status=0)
  In 32bit resume
  Attempting a hard reboot
  [...]
I can see the following CSM calls earlier:
- Legacy16InitializeYourself
- Legacy16GetTableAddress
- Legacy16DispatchOprom
- Legacy16UpdateBbs
No calls to PrepareToBoot (which could set RsdpAddr); this is a UEFI guest. (The CSM is used for the GOP only.)
Laszlo
On Fri, 2013-02-15 at 00:01 +0100, Laszlo Ersek wrote:
Entering S3 seemed OK (except the screen was not cleared; using Cirrus). I woke up the guest with
  # virsh qemu-monitor-command fw-ovmf.g-f18xfce2012121716.e-rhel63 \
      --hmp --cmd system_wakeup
Trailing portion of the log:
  In resume (status=254)
  In 32bit resume
  rsdp=0x00000000
  No resume vector set!
  Attempting a hard reboot
  i8042_wait_write
  In resume (status=0)
  In 32bit resume
  Attempting a hard reboot
  [...]
I can see the following CSM calls earlier:
- Legacy16InitializeYourself
- Legacy16GetTableAddress
- Legacy16DispatchOprom
- Legacy16UpdateBbs
No calls to PrepareToBoot (which could set RsdpAddr); this is a UEFI guest. (The CSM is used for the GOP only.)
So you have the same problem as with reset — you're ending up back in the CSM in RAM, when you ought to be in the OVMF "ROM".
I wonder if a *Legacy* guest might actually fare a little better? At least find_resume_vector() would have a chance of working if the CSM has actually been told where the ACPI tables are...
On Fri, Feb 15, 2013 at 12:01:38AM +0100, Laszlo Ersek wrote:
On 02/14/13 21:55, David Woodhouse wrote:
Thanks for testing this, btw. Are you looking at suspend/resume too? :)
Entering S3 seemed OK (except the screen was not cleared; using Cirrus). I woke up the guest with
  # virsh qemu-monitor-command fw-ovmf.g-f18xfce2012121716.e-rhel63 \
      --hmp --cmd system_wakeup
Trailing portion of the log:
  In resume (status=254)
  In 32bit resume
  rsdp=0x00000000
  No resume vector set!
That is strange. As noted elsewhere, on a resume or reboot the cpu should have started execution at 0xfffffff0 which is OVMF and not SeaBIOS. I don't understand why/how SeaBIOS would be involved in the resume code path at all.
-Kevin
On Thu, Feb 14, 2013 at 08:16:02PM -0500, Kevin O'Connor wrote:
On Fri, Feb 15, 2013 at 12:01:38AM +0100, Laszlo Ersek wrote:
On 02/14/13 21:55, David Woodhouse wrote:
Thanks for testing this, btw. Are you looking at suspend/resume too? :)
Entering S3 seemed OK (except the screen was not cleared; using Cirrus). I woke up the guest with
  # virsh qemu-monitor-command fw-ovmf.g-f18xfce2012121716.e-rhel63 \
      --hmp --cmd system_wakeup
Trailing portion of the log:
  In resume (status=254)
  In 32bit resume
  rsdp=0x00000000
  No resume vector set!
That is strange. As noted elsewhere, on a resume or reboot the cpu should have started execution at 0xfffffff0 which is OVMF and not SeaBIOS. I don't understand why/how SeaBIOS would be involved in the resume code path at all.
By chance, are you using an older version of kvm? There was a bug in kvm that caused changes to memory mapped at 0xe0000-0xfffff to also be reflected in the "rom" image at 0xfffe0000-0xffffffff. It was my understanding that this bug was fixed though.
-Kevin
On 02/15/13 02:22, Kevin O'Connor wrote:
On Thu, Feb 14, 2013 at 08:16:02PM -0500, Kevin O'Connor wrote:
On Fri, Feb 15, 2013 at 12:01:38AM +0100, Laszlo Ersek wrote:
On 02/14/13 21:55, David Woodhouse wrote:
Thanks for testing this, btw. Are you looking at suspend/resume too? :)
Entering S3 seemed OK (except the screen was not cleared; using Cirrus). I woke up the guest with
  # virsh qemu-monitor-command fw-ovmf.g-f18xfce2012121716.e-rhel63 \
      --hmp --cmd system_wakeup
Trailing portion of the log:
  In resume (status=254)
  In 32bit resume
  rsdp=0x00000000
  No resume vector set!
That is strange. As noted elsewhere, on a resume or reboot the cpu should have started execution at 0xfffffff0 which is OVMF and not SeaBIOS. I don't understand why/how SeaBIOS would be involved in the resume code path at all.
By chance, are you using an older version of kvm? There was a bug in kvm that caused changes to memory mapped at 0xe0000-0xfffff to also be reflected in the "rom" image at 0xfffe0000-0xffffffff. It was my understanding that this bug was fixed though.
You are great! Disabling KVM for the guest (/domain/@type='qemu') made the reboot work on both the RHEL-6 devel version of qemu and on upstream 1.3.1.
(I didn't try suspend/resume yet.)
Do you recall the precise commit that fixed the "reflection"? I've been eyeballing kvm commit messages for a few ten minutes now, but of course in vain. (CC'ing Gleb and Marcelo.)
Thank you, Laszlo
On Fri, Feb 15, 2013 at 04:10:59AM +0100, Laszlo Ersek wrote:
On 02/15/13 02:22, Kevin O'Connor wrote:
On Thu, Feb 14, 2013 at 08:16:02PM -0500, Kevin O'Connor wrote:
By chance, are you using an older version of kvm? There was a bug in kvm that caused changes to memory mapped at 0xe0000-0xfffff to also be reflected in the "rom" image at 0xfffe0000-0xffffffff. It was my understanding that this bug was fixed though.
You are great! Disabling KVM for the guest (/domain/@type='qemu') made the reboot work on both the RHEL-6 devel version of qemu and on upstream 1.3.1.
(I didn't try suspend/resume yet.)
Do you recall the precise commit that fixed the "reflection"? I've been eyeballing kvm commit messages for a few ten minutes now, but of course in vain. (CC'ing Gleb and Marcelo.)
I found this email thread:
http://kerneltrap.org/mailarchive/linux-kvm/2010/9/21/6267744
and: http://marc.info/?l=kvm-commits&m=128576215909532
-Kevin
15.02.2013 07:43, Kevin O'Connor wrote:
On Fri, Feb 15, 2013 at 04:10:59AM +0100, Laszlo Ersek wrote:
On 02/15/13 02:22, Kevin O'Connor wrote:
On Thu, Feb 14, 2013 at 08:16:02PM -0500, Kevin O'Connor wrote:
By chance, are you using an older version of kvm? There was a bug in kvm that caused changes to memory mapped at 0xe0000-0xfffff to also be reflected in the "rom" image at 0xfffe0000-0xffffffff. It was my understanding that this bug was fixed though.
You are great! Disabling KVM for the guest (/domain/@type='qemu') made the reboot work on both the RHEL-6 devel version of qemu and on upstream 1.3.1.
(I didn't try suspend/resume yet.)
Do you recall the precise commit that fixed the "reflection"? I've been eyeballing kvm commit messages for a few ten minutes now, but of course in vain. (CC'ing Gleb and Marcelo.)
I found this email thread:
http://kerneltrap.org/mailarchive/linux-kvm/2010/9/21/6267744
This patch is more than 2 years old and is applied to all more or less recent qemu versions. This does not tell us why disabling kvm (with this patch applied!) makes a difference. So there must be another (maybe similar) bug somewhere...
/mjt
On Fri, 2013-02-15 at 11:19 +0400, Michael Tokarev wrote:
This patch is more than 2 years old and is applied to all more or less recent qemu versions.
RHEL 6.3?
I'm *not* seeing this bug with recent qemu versions.
This does not tell us why disabling kvm (with this patch applied!) makes a difference. So there must be another (maybe similar) bug somewhere...
Are you looking at the same patch I'm looking at? Before the patch, if KVM is enabled then the i440fx_update_memory_mappings() function just bails out without doing anything. As the commit message describes, it fails to remap the 0xf0000 memory from ROM to RAM, so subsequent writes to the F-segment actually modify the *ROM* content instead of the RAM copy as they should. (KVM doesn't write-protect the ROM). So on reset, it ends up running the *modified* copy of the BIOS.
That's an *exact* description of what Laszlo was seeing, surely?
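To spell out the failure mode, here is a toy model of it. This is emphatically not qemu code: the struct, the function names, the 16-byte "segment" and the 0xAA fill pattern are all invented for illustration. It only models the idea that, when the ROM-to-RAM remap is skipped and the ROM is not write-protected, writes through the low alias clobber the image that the CPU later fetches from the high alias after reset.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_SEG_SIZE 16u  /* toy size; the real F-segment is 64 KiB */

struct toy_machine {
    uint8_t rom[TOY_SEG_SIZE];  /* firmware image, meant to be immutable */
    uint8_t ram[TOY_SEG_SIZE];  /* shadow RAM behind the low alias */
    bool remap_works;           /* false models the "bail out early" case */
    bool low_maps_ram;          /* which backing store the low alias uses */
};

static void toy_init(struct toy_machine *m, bool remap_works)
{
    memset(m->rom, 0xAA, sizeof(m->rom));  /* pretend firmware bytes */
    memset(m->ram, 0x00, sizeof(m->ram));
    m->remap_works = remap_works;
    m->low_maps_ram = false;
}

/* Guest programs the chipset to shadow the segment into RAM. */
static void toy_enable_shadow(struct toy_machine *m)
{
    if (m->remap_works)
        m->low_maps_ram = true;
    /* otherwise the request is silently ignored */
}

/* Guest write through the low (below 1 MiB) alias. */
static void toy_write_low(struct toy_machine *m, unsigned off, uint8_t val)
{
    if (off >= TOY_SEG_SIZE)
        return;
    if (m->low_maps_ram)
        m->ram[off] = val;
    else
        m->rom[off] = val;  /* no write protection: the ROM image is modified */
}

/* Byte fetched through the high (just under 4 GiB) alias after reset. */
static uint8_t toy_fetch_high(const struct toy_machine *m, unsigned off)
{
    return off < TOY_SEG_SIZE ? m->rom[off] : 0;
}
```

With remap_works set, a write after toy_enable_shadow() leaves toy_fetch_high() returning the original 0xAA; with it cleared, the same write shows up in the "ROM", which is the modified-BIOS-on-reset situation described above.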
(removing edk2-devel, adding Jan)
On 02/15/13 08:19, Michael Tokarev wrote:
15.02.2013 07:43, Kevin O'Connor wrote:
On Fri, Feb 15, 2013 at 04:10:59AM +0100, Laszlo Ersek wrote:
On 02/15/13 02:22, Kevin O'Connor wrote:
On Thu, Feb 14, 2013 at 08:16:02PM -0500, Kevin O'Connor wrote:
By chance, are you using an older version of kvm? There was a bug in kvm that caused changes to memory mapped at 0xe0000-0xfffff to also be reflected in the "rom" image at 0xfffe0000-0xffffffff. It was my understanding that this bug was fixed though.
You are great! Disabling KVM for the guest (/domain/@type='qemu') made the reboot work on both the RHEL-6 devel version of qemu and on upstream 1.3.1.
(I didn't try suspend/resume yet.)
Do you recall the precise commit that fixed the "reflection"? I've been eyeballing kvm commit messages for a few ten minutes now, but of course in vain. (CC'ing Gleb and Marcelo.)
I found this email thread:
http://kerneltrap.org/mailarchive/linux-kvm/2010/9/21/6267744
I confirm RHEL-6 qemu-kvm lacks that patch; we still have the FIXME and the return statement that depend on kvm_enabled() in i440fx_update_memory_mappings().
This patch is more than 2 years old and is applied to all more or less recent qemu versions. This does not tell us why disabling kvm (with this patch applied!) makes a difference.
I just retested on v1.3.1 + kvm, the problem is still there indeed.
(Note that neither Gleb's patch, aa85bd8b "support piix PAM registers in KVM", nor the patch that it partially undid:
  commit d03f4d2defd76f35f46f5418979f3e6d14a11183
  Author: Jan Kiszka <jan.kiszka@web.de>
  Date:   Wed Sep 10 21:34:44 2008 +0200

      I440fx: do change ISA mappings under KVM

      As long as KVM does not support remapping or protection state changes
      of guest memory, do not fiddle with the ISA mappings that QEMU see,
      confusing both the monitor and the gdbstub.

      Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
      Signed-off-by: Avi Kivity <avi@qumranet.com>
made it ever to qemu; these are qemu-kvm commits.)
So there must be another (maybe similar) bug somewhere...
Maybe there was a concurrent or slightly earlier change to KVM that enabled the userspace fix too?... IOW the KVM fix could be necessary but not sufficient, the KVM fix + the qemu-kvm fix together are sufficient.
If I disable KVM, i440fx_update_memory_mappings() probably does the same thing in RHEL-6 qemu-kvm as in upstream qemu v1.3.1. If I enable KVM, then RHEL-6 qemu-kvm breaks immediately in userspace, while upstream 1.3.1 might want to rely on KVM, but runs into a bug (?) on the RHEL-6 host kernel.
Thanks, Laszlo
On Thu, 2013-02-14 at 21:41 +0100, Laszlo Ersek wrote:
I noticed that under OVMF + SeaBIOS CSM + your related patches for both, reset requested by the guest doesn't work as expected. The behavior is an infinite loop, with the following debug fragment repeated by the CSM-ized SeaBIOS:
  In resume (status=0)
  In 32bit resume
  Attempting a hard reboot
  i8042_wait_write
Hmm. My build from http://david.woodhou.se/OVMF.fd works fine. I did a legacy boot into (Ubuntu Oneiric's) Grub, then issued the 'reboot' command...
This appears to be the case for qemu 1.2.0 and 1.3.0, both with and without KVM.
  enter handle_13:
     a=00004200 b=00000801 c=0000003f d=00000080 ds=6000 es=0000 ss=0000
    si=0000fe00 di=00000000 bp=00001ff0 sp=00001ff2 cs=0000 ip=9157 f=0202
  disk_op d=0x0000db20 lba=9269505 buf=0x00068000 count=63 cmd=2
  pmtimer: 2:15494096
  pmtimer: 2:15494211
  In resume (status=0)
  In 32bit resume
  Attempting a hard reboot
  i8042_wait_write
  pmtimer: 2:15501497
  pmtimer: 2:15501593
  pmtimer: 2:15501750
  SecCoreStartupWithStack(0xFFFE6000, 0x80000)
  File->Type: 0xB
  Section->Type: 0x2
  Section->Type: 0x19
  Section->Type (0x19) != SectionType (0x17)
On 02/14/13 23:24, David Woodhouse wrote:
On Thu, 2013-02-14 at 21:41 +0100, Laszlo Ersek wrote:
I noticed that under OVMF + SeaBIOS CSM + your related patches for both, reset requested by the guest doesn't work as expected. The behavior is an infinite loop, with the following debug fragment repeated by the CSM-ized SeaBIOS:
  In resume (status=0)
  In 32bit resume
  Attempting a hard reboot
  i8042_wait_write
Hmm. My build from http://david.woodhou.se/OVMF.fd works fine. I did a legacy boot into (Ubuntu Oneiric's) Grub, then issued the 'reboot' command...
This appears to be the case for qemu 1.2.0 and 1.3.0, both with and without KVM.
I retested:
- on a pristine v3.0.0 host kernel, and
- a pristine v1.3.1 qemu build (+ KVM enabled), and
- using your OVMF.fd from the above link (which of course includes your build of the CSM-ized SeaBIOS).
Same infinite loop, alas...
(i) What is your host kernel exactly?
(ii) And when you say you did a "legacy boot", does that mean you installed the guest OS as a traditional one? Is that grub or grub-efi?
In my case the guest is a "full" UEFI installation of RHEL-6: I perform a UEFI boot from an emulated IDE disk to load grub-efi (which is thus pointed to by a non-BBS-devpath). The only thing I'm using the CSM for is the GOP based on vgabios-cirrus.bin.
(iii) Can vgabios.bin make a difference? Could you please upload your build? I gather you use stdvga; I also tried that, makes no difference, same loop.
Comparing our logs,
--- dwmw2.log	2013-02-15 18:47:39.654360652 +0100
+++ lersek.log	2013-02-15 18:49:18.061364128 +0100
@@ -1,18 +1,12 @@
-enter handle_13:
-   a=00004200 b=00000801 c=0000003f d=00000080 ds=6000 es=0000 ss=0000
-  si=0000fe00 di=00000000 bp=00001ff0 sp=00001ff2 cs=0000 ip=9157 f=0202
-disk_op d=0x0000db20 lba=9269505 buf=0x00068000 count=63 cmd=2
-pmtimer: 2:15494096
-pmtimer: 2:15494211
+enter handle_15:
+   a=00002401 b=00004118 c=00000000 d=00000003 ds=0000 es=4000 ss=4000
+  si=00000000 di=00004380 bp=00000000 sp=0000ffc6 cs=4f00 ip=0030 f=3002
+Trying to allocate 971 pages for VMLINUZ
+[Linux-EFI, setup=0x10fa, size=0x3ca030]
+  [Initrd, addr=0x3c089000, size=0xf68cb9]
+˙Changing serial settings was 0/0 now 3/0
 In resume (status=0)
 In 32bit resume
 Attempting a hard reboot
 i8042_wait_write
-pmtimer: 2:15501497
-pmtimer: 2:15501593
-pmtimer: 2:15501750
-SecCoreStartupWithStack(0xFFFE6000, 0x80000)
-File->Type: 0xB
-Section->Type: 0x2
-Section->Type: 0x19
-Section->Type (0x19) != SectionType (0x17)
+Changing serial settings was 0/0 now 3/0
[...]
Thanks Laszlo
On Fri, 2013-02-15 at 19:54 +0100, Laszlo Ersek wrote:
Same infinite loop, alas...
(i) What is your host kernel exactly?
3.7.5-201.fc18.x86_64 (booted from EFI on a MacBookPro 8,3).
(ii) And when you say you did a "legacy boot", does that mean you installed the guest OS as a traditional one? Is that grub or grub-efi?
That one was a traditional install. I've just now tested an EFI install of Fedora 17, which also reboots fine; both from grub and the installed kernel. It *doesn't* seem to hit the SeaBIOS CSM on the way back, but reboots directly into OVMF.
I have two versions of qemu; the Fedora 18 one (1.2.0) and a locally-rebuilt copy of the Fedora rawhide 1.3.0, which I installed because OVMF's native video driver doesn't work in 1.2.0. Not that that matters any more since I've disabled that and I'm using the VGA BIOS.
There's also a current build from qemu git. They all behave the same way, both with and without KVM.
(iii) Can vgabios.bin make a difference? Could you please upload your build? I gather you use stdvga; I also tried that, makes no difference, same loop.
This shouldn't make a difference. For the old vgabios, I only have the alignment fix, and no video would work if you didn't have that. http://david.woodhou.se/vgabios-cirrus.bin
On 02/15/13 21:57, David Woodhouse wrote:
On Fri, 2013-02-15 at 19:54 +0100, Laszlo Ersek wrote:
Same infinite loop, alas...
(i) What is your host kernel exactly?
3.7.5-201.fc18.x86_64 (booted from EFI on a MacBookPro 8,3).
- host CPU: Xeon W3550 (family/model/stepping = 6/26/5)
- host kernel: 3.7.8; config attached
- md5sums of firmware binaries (both from you):
61daae4d085f646093e31df2b13b13e8 OVMF-david.fd
3a6a829c55cbd4e27db745326a5fed44 vgabios-cirrus-david.bin
- qemu: upstream v1.3.1; configs attached
./configure --target-list=x86_64-softmmu --prefix=/opt/qemu-upstream \
    --enable-debug
- libvirt XML: attached; <emulator> refers to qemu wrapper script
- qemu command line (from ps -f):
/opt/qemu-upstream/bin/qemu-system-x86_64 \
  -name fw-mixed.g-f18xfce2012121716.e-upstream \
  -S \
  -M pc-1.3 \
  -enable-kvm \
  -bios /root/OVMF-david.fd \
  -m 1024 \
  -smp 4,sockets=4,cores=1,threads=1 \
  -uuid 0865a1c1-6474-249b-5e84-81171c4e1d0c \
  -no-user-config \
  -nodefaults \
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/fw-mixed.g-f18xfce2012121716.e-upstream.monitor,server,nowait \
  -mon chardev=charmonitor,id=monitor,mode=control \
  -rtc base=utc \
  -no-shutdown \
  -boot c \
  -drive file=/var/lib/libvirt/images/fw-mixed.g-f18xfce2012121716.e-upstream.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none \
  -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 \
  -drive file=/filestore/isos/f18/Fedora-18-Nightly-20121217.16-x86_64-Live-xfce.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw \
  -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
  -chardev pty,id=charserial0 \
  -device isa-serial,chardev=charserial0,id=serial0 \
  -usb \
  -vnc 127.0.0.1:0 \
  -vga cirrus \
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
  -debugcon file:/tmp/fw-mixed.g-f18xfce2012121716.e-upstream.debug \
  -global isa-debugcon.iobase=0x402 \
  -global PIIX4_PM.disable_s3=0 \
  -global PIIX4_PM.disable_s4=0
- guest: F18 XFCE nightly, UEFI installed, UEFI booted
- serial output: attached
- result: infinite loop at guest reset
I give up. Thanks for the help & sorry about spamming three lists.
Laszlo
On Sat, 2013-02-16 at 02:37 +0100, Laszlo Ersek wrote:
I give up. Thanks for the help & sorry about spamming three lists.
I've managed to reproduce this on a clean F18 system. This is the stock qemu 1.2.2-6.fc18 on kernel 3.7.6-201.fc18.x86_64 with a newly-installed Fedora 18 VM in the guest.
qemu-system-x86_64 -enable-kvm -cdrom F18boot.iso -serial mon:stdio -bios OVMF.fd
On my laptop where I'd been doing most of my testing, even after running 'yum distro-sync qemu*' to get back to the stock qemu, I still can't reproduce the issue. They are both running the *same* kernel.
I'll try reverting a whole bunch of other stuff that ought to be irrelevant to the stock distro packages, and see if/when it breaks...
On Mon, 2013-02-18 at 10:40 +0000, David Woodhouse wrote:
On Sat, 2013-02-16 at 02:37 +0100, Laszlo Ersek wrote:
I give up. Thanks for the help & sorry about spamming three lists.
I've managed to reproduce this on a clean F18 system. This is the stock qemu 1.2.2-6.fc18 on kernel 3.7.6-201.fc18.x86_64 with a newly-installed Fedora 18 VM in the guest.
qemu-system-x86_64 -enable-kvm -cdrom F18boot.iso -serial mon:stdio -bios OVMF.fd
On my laptop where I'd been doing most of my testing, even after running 'yum distro-sync qemu*' to get back to the stock qemu, I still can't reproduce the issue. They are both running the *same* kernel.
I'll try reverting a whole bunch of other stuff that ought to be irrelevant to the stock distro packages, and see if/when it breaks...
I cannot make these two machines behave consistently. I have absolutely no clue what is going on here.
At reset, the PAM regions are all set to '1' (read only). So the CSM should reside in RAM at 0xffff0 but THAT SHOULDN'T MATTER. After a reset we should be running from 0xfffffff0 and there's unconditionally ROM there, isn't there?
Nevertheless, on my workstation as on yours, we do seem to end up executing from the CSM in RAM when we reset. But on my laptop, it executes the *ROM* as it should.
This patch 'fixes' it, and I think it might even be correct in itself, but I don't think it's a correct fix for the problem we're discussing. And I certainly want to know what's different on my laptop that makes it work *without* this patch.
Either there's some weirdness with setting the high CS base address, on CPU reset. Or perhaps the contents of the memory region at 0xfffffff0 have *really* been changed along with the sub-1MiB range. Or maybe the universe just hates us...
diff --git a/hw/piix_pci.c b/hw/piix_pci.c
index 6c77e49..6dcf1c5 100644
--- a/hw/piix_pci.c
+++ b/hw/piix_pci.c
@@ -171,6 +171,23 @@ static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
     return 0;
 }
 
+static void i440fx_reset(void *opaque)
+{
+    PCII440FXState *d = opaque;
+    uint8_t *pci_conf = d->dev.config;
+
+    pci_conf[0x59] = 0x00; // Reset PAM setup
+    pci_conf[0x5a] = 0x00;
+    pci_conf[0x5b] = 0x00;
+    pci_conf[0x5c] = 0x00;
+    pci_conf[0x5d] = 0x00;
+    pci_conf[0x5e] = 0x00;
+    pci_conf[0x5f] = 0x00;
+    pci_conf[0x72] = 0x02; // And SMM
+
+    i440fx_update_memory_mappings(d);
+}
+
 static int i440fx_post_load(void *opaque, int version_id)
 {
     PCII440FXState *d = opaque;
@@ -217,6 +234,8 @@ static int i440fx_initfn(PCIDevice *dev)
     d->dev.config[I440FX_SMRAM] = 0x02;
 
     cpu_smm_register(&i440fx_set_smm, d);
+
+    qemu_register_reset(i440fx_reset, d);
     return 0;
 }
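What zeroing the PAM registers means for the BIOS segment can be illustrated with a small sketch. It is not QEMU code: the helper names are hypothetical, and the attribute encoding (bit 0 = read enable, bit 1 = write enable, with the 0xF0000-0xFFFFF area in the high nibble of PAM0 at config offset 0x59) is assumed from the i440FX datasheet.

```c
#include <stdint.h>

/* Assumed per-region attribute encoding (i440FX datasheet):
 * bit 0 = read enable, bit 1 = write enable. */
enum pam_mode {
    PAM_PCI        = 0, /* reads and writes are forwarded to PCI (the ROM) */
    PAM_READ_ONLY  = 1, /* reads come from DRAM, writes go to PCI */
    PAM_WRITE_ONLY = 2, /* writes go to DRAM, reads come from PCI */
    PAM_READ_WRITE = 3  /* reads and writes both hit DRAM */
};

/* Hypothetical helper: attribute of the 0xF0000-0xFFFFF BIOS area, assumed
 * to live in the high nibble of PAM0 (config offset 0x59). */
static enum pam_mode pam0_bios_mode(const uint8_t *pci_conf)
{
    return (enum pam_mode)((pci_conf[0x59] >> 4) & 0x3);
}

/* Mirrors what the posted i440fx_reset() does to PAM0..PAM6. */
static void pam_reset(uint8_t *pci_conf)
{
    for (int reg = 0x59; reg <= 0x5f; reg++) {
        pci_conf[reg] = 0x00;
    }
}
```

With the registers back at zero, fetches from the top of the first megabyte would be served by the ROM again rather than by the shadowed CSM in RAM.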
On 18/02/2013 13:53, David Woodhouse wrote:
diff --git a/hw/piix_pci.c b/hw/piix_pci.c
index 6c77e49..6dcf1c5 100644
--- a/hw/piix_pci.c
+++ b/hw/piix_pci.c
@@ -171,6 +171,23 @@ static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
     return 0;
 }
 
+static void i440fx_reset(void *opaque)
+{
+    PCII440FXState *d = opaque;
+    uint8_t *pci_conf = d->dev.config;
+
+    pci_conf[0x59] = 0x00; // Reset PAM setup
+    pci_conf[0x5a] = 0x00;
+    pci_conf[0x5b] = 0x00;
+    pci_conf[0x5c] = 0x00;
+    pci_conf[0x5d] = 0x00;
+    pci_conf[0x5e] = 0x00;
+    pci_conf[0x5f] = 0x00;
+    pci_conf[0x72] = 0x02; // And SMM
+
+    i440fx_update_memory_mappings(d);
+}
+
 static int i440fx_post_load(void *opaque, int version_id)
 {
     PCII440FXState *d = opaque;
@@ -217,6 +234,8 @@ static int i440fx_initfn(PCIDevice *dev)
     d->dev.config[I440FX_SMRAM] = 0x02;
 
     cpu_smm_register(&i440fx_set_smm, d);
+
+    qemu_register_reset(i440fx_reset, d);
     return 0;
 }
If you want to submit this patch for upstream QEMU (I agree it is a good idea), please set dc->reset instead in i440fx_class_init.
Paolo
On Mon, 2013-02-18 at 15:46 +0100, Paolo Bonzini wrote:
If you want to submit this patch for upstream QEMU (I agree it is a good idea), please set dc->reset instead in i440fx_class_init.
Thanks.
I just copied the way that PIIX3 does it... is that something that piix3_class_init() should be doing for *its* reset function too?
I'll submit this for upstream, but I consider it a workaround for the real bug that Laszlo has been suffering from. So I'd rather wait until we've solved that properly, or at least until we understand why we get such different results on different CPUs.
On 18/02/2013 16:00, David Woodhouse wrote:
On Mon, 2013-02-18 at 15:46 +0100, Paolo Bonzini wrote:
If you want to submit this patch for upstream QEMU (I agree it is a good idea), please set dc->reset instead in i440fx_class_init.
Thanks.
I just copied the way that PIIX3 does it... is that something that piix3_class_init() should be doing for *its* reset function too?
Yes.
I'll submit this for upstream, but I consider it a workaround for the real bug that Laszlo has been suffering from. So I'd rather wait until we've solved that properly, or at least until we understand why we get such different results on different CPUs.
Indeed the difference between CPUs is puzzling.
Paolo
On 02/18/13 13:53, David Woodhouse wrote:
Nevertheless, on my workstation as on yours, we do seem to end up executing from the CSM in RAM when we reset. But on my laptop, it executes the *ROM* as it should.
This patch 'fixes' it, and I think it might even be correct in itself, but I don't think it's a correct fix for the problem we're discussing. And I certainly want to know what's different on my laptop that makes it work *without* this patch.
Either there's some weirdness with setting the high CS base address, on CPU reset. Or perhaps the contents of the memory region at 0xfffffff0 have *really* been changed along with the sub-1MiB range. Or maybe the universe just hates us...
We're ending up in the wrong place, under 1MB (which is consistent with your "reset the PAMs" patch -- state of PAMs should only matter below 1MB).
I single-stepped qemu-1.3.1 in x86_cpu_reset() / cpu_x86_load_seg_cache(), and we seem to set the correct base. However when I pause the VM when it's spinning in the reset loop, and I issue the following in virsh:
# qemu-monitor-command --domain \
    fw-mixed.g-f18xfce2012121716.e-upstream --hmp --cmd \
    cpu 0

# qemu-monitor-command --domain \
    fw-mixed.g-f18xfce2012121716.e-upstream --hmp --cmd \
    info registers
for EIP and CS I get (from cpu_x86_dump_seg_cache(), in the "HF_CS64_MASK clear" branch):
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000623
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 0000f300
CS =f000 000f0000 0000ffff 0000f300
    ^    ^        ^        ^
    |    base     limit    flags
    selector

SS =0000 00000000 0000ffff 0000f300
DS =0000 00000000 0000ffff 0000f300
FS =0000 00000000 0000ffff 0000f300
GS =0000 00000000 0000ffff 0000f300
LDT=0000 00000000 0000ffff 00008200
TR =0000 feffd000 00002088 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000
(1) The three high nibbles of CS base are lost.
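To make the consequence concrete, here is a minimal sketch (the function name is made up for illustration) of how the fetch address follows from the cached CS base and EIP. With the architectural reset base the first instruction comes from just below 4 GiB; with the base from the dump it comes from just below 1 MiB.

```c
#include <stdint.h>

/* Linear address of the next instruction fetch: cached segment base plus
 * offset, wrapping at 32 bits. */
static uint32_t linear_address(uint32_t cs_base, uint32_t eip)
{
    return cs_base + eip;
}
```

linear_address(0xffff0000, 0xfff0) gives 0xfffffff0, whereas linear_address(0x000f0000, 0xfff0) gives 0x000ffff0, which is where the shadowed CSM lives.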
Furthermore, the flags value is (Intel SDM Vol.3A, 3.4.5):
1 11  1 0011 00000000
P DPL S type base 23:16
^ ^   ^
| |   descriptor type (1 == code or data segment, 0 == system segment), DESC_S_MASK
| descriptor privilege level (3 == least privileged)
segment present, DESC_P_MASK
The "type" field depends on the S bit (here 1 == code/data). 0011b means (see 3.4.5.1):
0    0  1  1
D/C  E  W  A
     C  R  A
^    ^  ^  ^
|    |  |  accessed, DESC_A_MASK
|    |  |
|    |  for data: 0=r/o, 1==r/w
|    |  for code: 0==exec/only, 1==exec/read, DESC_R_MASK
|    |
|    for data: 1==expand down
|    for code: 1==conforming
|
0 == data, 1 == code, DESC_CS_MASK
The type dumped by "info registers" is "data segment, expand up, read/write, accessed".
I believe the D/C bit (bit 11) should be set, and then 1011b would mean "code segment, non-conforming, exec/read, accessed".
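The decoding above can be checked mechanically. The following is only a sketch with hypothetical names; it assumes the 16-bit "flags" value in the dump is laid out as bits 15:8 of the descriptor's second dword in the upper byte (P, DPL, S, type), as described above.

```c
#include <stdbool.h>
#include <stdint.h>

struct seg_attrs {
    bool     present;       /* P */
    unsigned dpl;           /* descriptor privilege level */
    bool     code_or_data;  /* S: 1 == code/data, 0 == system */
    unsigned type;          /* 4-bit type field */
    bool     is_code;       /* top bit of the type field (D/C) */
};

/* Decode the flags column printed by "info registers". */
static struct seg_attrs decode_seg_flags(uint32_t flags)
{
    struct seg_attrs a;

    a.present      = (flags >> 15) & 1;
    a.dpl          = (flags >> 13) & 3;
    a.code_or_data = (flags >> 12) & 1;
    a.type         = (flags >> 8) & 0xf;
    a.is_code      = (flags >> 11) & 1;
    return a;
}
```

For 0xf300 this yields present, DPL 3, a code/data descriptor, type 0011b, and is_code false; 0xfb00 would be the same with the code bit set.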
(2) x86_cpu_reset() does pass DESC_CS_MASK for R_CS, but it doesn't seem to be present in the dumped value.
I have no idea what's going on, but vmx_set_segment() in the kernel has a bunch of hacks for CS && selector == 0xf000 && base == 0xffff0000, and it seems to be host processor dependent. E.g., from commit b246dd5d:
/*
 * Fix segments for real mode guest in hosts that don't have
 * "unrestricted_mode" or it was disabled.
 * This is done to allow migration of the guests from hosts with
 * unrestricted guest like Westmere to older host that don't have
 * unrestricted guest like Nehelem.
 */
if (vmx->rmode.vm86_active) {
        switch (seg) {
        case VCPU_SREG_CS:
                vmcs_write32(GUEST_CS_AR_BYTES, 0xf3);
                vmcs_write32(GUEST_CS_LIMIT, 0xffff);
                if (vmcs_readl(GUEST_CS_BASE) == 0xffff0000)
                        vmcs_writel(GUEST_CS_BASE, 0xf0000);
                vmcs_write16(GUEST_CS_SELECTOR,
                             vmcs_readl(GUEST_CS_BASE) >> 4);
                break;
Also in init_vmcb() [arch/x86/kvm/svm.c] I can see (from commit d92899a0):
/*
 * cs.base should really be 0xffff0000, but vmx can't handle that, so
 * be consistent with it.
 *
 * Replace when we have real mode working for vmx.
 */
save->cs.base = 0xf0000;
Going back to vmx, vmx_vcpu_reset() [arch/x86/kvm/vmx.c]:
/*
 * GUEST_CS_BASE should really be 0xffff0000, but VT vm86 mode
 * insists on having GUEST_CS_BASE == GUEST_CS_SELECTOR << 4.  Sigh.
 */
if (kvm_vcpu_is_bsp(&vmx->vcpu)) {
        vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
        vmcs_writel(GUEST_CS_BASE, 0x000f0000);
} else {
        vmcs_write16(GUEST_CS_SELECTOR, vmx->vcpu.arch.sipi_vector << 8);
        vmcs_writel(GUEST_CS_BASE, vmx->vcpu.arch.sipi_vector << 12);
}
The leading comment and the main logic date back to commit 6aa8b732 ([PATCH] kvm: userspace interface).
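The constraint quoted in that comment (base == selector << 4) is easy to state as code. This is an illustrative sketch with made-up helper names, not kernel code; it only shows why the architectural reset state cannot be expressed under that rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base that vm86-style segmentation derives from a selector. */
static uint32_t vm86_base(uint16_t selector)
{
    return (uint32_t)selector << 4;
}

/* True if a (selector, base) pair satisfies base == selector << 4. */
static bool vm86_representable(uint16_t selector, uint32_t base)
{
    return base == vm86_base(selector);
}
```

The reset state (selector 0xf000, base 0xffff0000) fails the check, while (0xf000, 0x000f0000) passes, which matches the base seen in the register dump.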
(3) I wanted to ask you whether your laptop CPU is "more modern" than your workstation CPU, but from your other email I guess they're indeed different.
Laszlo
On Mon, Feb 18, 2013 at 06:12:55PM +0100, Laszlo Ersek wrote:
On 02/18/13 13:53, David Woodhouse wrote: I single-stepped qemu-1.3.1 in x86_cpu_reset() / cpu_x86_load_seg_cache(), and we seem to set the correct base. However when I pause the VM when it's spinning in the reset loop, and I issue the following in virsh:
[...]
EIP=0000fff0 EFL=00000002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 0000f300
CS =f000 000f0000 0000ffff 0000f300
If you're seeing the CPU running at 0x000ffff0 then that would certainly be wrong. It needs to run at 0xfffffff0. Maybe this has something to do with KVM's difficulty with executing in "big real" mode?
Just to verify this is a cpu eip issue and not a memory mapping issue, one could try applying the patch below to seabios. On a working system during a reboot the patch should report "before val=1/0" and "after val=2/0" (the second value could be anything, but should not change). If you do see the second value changing it would indicate memory mapping issues.
-Kevin
--- a/src/resume.c
+++ b/src/resume.c
@@ -129,6 +129,12 @@ tryReboot(void)
 {
     dprintf(1, "Attempting a hard reboot\n");
 
+    dprintf(1, "before val=%x/%x\n", HaveRunPost, *(int*)((void*)&HaveRunPost + 0xfff00000));
+    barrier();
+    HaveRunPost = 2;
+    barrier();
+    dprintf(1, "after val=%x/%x\n", HaveRunPost, *(int*)((void*)&HaveRunPost + 0xfff00000));
+
     // Setup for reset on qemu.
     qemu_prep_reset();
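The 0xfff00000 offset in the patch is the distance between the last megabyte below 1 MiB and the last megabyte below 4 GiB. A small sketch of that arithmetic and of how to read the two printed pairs follows; the names are hypothetical and this is not SeaBIOS code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Distance from the top of the first megabyte to the top of 4 GiB. */
#define HIGH_ALIAS_DELTA 0xfff00000u

/* Address near 4 GiB that corresponds to a sub-1MiB firmware address. */
static uint32_t high_alias(uint32_t low_addr)
{
    return low_addr + HIGH_ALIAS_DELTA;
}

/* Interpretation of the test: the value read through the high alias must
 * not change when the low (shadow RAM) copy is written. */
static bool mapping_looks_sane(int high_before, int high_after)
{
    return high_before == high_after;
}
```

For example, high_alias(0xffff0) is 0xfffffff0, the address where execution should begin after a reset.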
On Mon, Feb 18, 2013 at 06:12:55PM +0100, Laszlo Ersek wrote:
On 02/18/13 13:53, David Woodhouse wrote:
Nevertheless, on my workstation as on yours, we do seem to end up executing from the CSM in RAM when we reset. But on my laptop, it executes the *ROM* as it should.
This patch 'fixes' it, and I think it might even be correct in itself, but I don't think it's a correct fix for the problem we're discussing. And I certainly want to know what's different on my laptop that makes it work *without* this patch.
Either there's some weirdness with setting the high CS base address, on CPU reset. Or perhaps the contents of the memory region at 0xfffffff0 have *really* been changed along with the sub-1MiB range. Or maybe the universe just hates us...
We're ending up in the wrong place, under 1MB (which is consistent with your "reset the PAMs" patch -- state of PAMs should only matter below 1MB).
I single-stepped qemu-1.3.1 in x86_cpu_reset() / cpu_x86_load_seg_cache(), and we seem to set the correct base. However when I pause the VM when it's spinning in the reset loop, and I issue the following in virsh:
# qemu-monitor-command --domain \
    fw-mixed.g-f18xfce2012121716.e-upstream --hmp --cmd \
    cpu 0

# qemu-monitor-command --domain \
    fw-mixed.g-f18xfce2012121716.e-upstream --hmp --cmd \
    info registers
for EIP and CS I get (from cpu_x86_dump_seg_cache(), in the "HF_CS64_MASK clear" branch):
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000623
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 0000f300
CS =f000 000f0000 0000ffff 0000f300
    ^    ^        ^        ^
    |    base     limit    flags
    selector
This is because real mode is emulated as vm86 mode on Intel CPUs without the "unrestricted guest" flag.
-- Gleb.
On 02/18/13 18:45, Gleb Natapov wrote:
On Mon, Feb 18, 2013 at 06:12:55PM +0100, Laszlo Ersek wrote:
CS =f000 000f0000 0000ffff 0000f300
    ^    ^        ^        ^
    |    base     limit    flags
    selector
This is because real mode is emulated as vm86 mode on intel cpus without "unrestricted guest" flag.
Awesome, this supports my desperate hunch in http://lists.nongnu.org/archive/html/qemu-devel/2013-02/msg02689.html. I hope David can confirm in practice!
Thanks! Laszlo
On Mon, 2013-02-18 at 19:16 +0100, Laszlo Ersek wrote:
On 02/18/13 18:45, Gleb Natapov wrote:
On Mon, Feb 18, 2013 at 06:12:55PM +0100, Laszlo Ersek wrote:
CS =f000 000f0000 0000ffff 0000f300
    ^    ^        ^        ^
    |    base     limit    flags
    selector
This is because real mode is emulated as vm86 mode on intel cpus without "unrestricted guest" flag.
Awesome, this supports my desperate hunch in http://lists.nongnu.org/archive/html/qemu-devel/2013-02/msg02689.html. I hope David can confirm in practice!
Yes, my working machines have unrestricted_guest support, and the non-working machines don't. So when we're emulating it in vm86, the extended segment base handling is broken.
On Mon, Feb 18, 2013 at 07:16:25PM +0100, Laszlo Ersek wrote:
On 02/18/13 18:45, Gleb Natapov wrote:
On Mon, Feb 18, 2013 at 06:12:55PM +0100, Laszlo Ersek wrote:
CS =f000 000f0000 0000ffff 0000f300
    ^    ^        ^        ^
    |    base     limit    flags
    selector
This is because real mode is emulated as vm86 mode on intel cpus without "unrestricted guest" flag.
Awesome, this supports my desperate hunch in http://lists.nongnu.org/archive/html/qemu-devel/2013-02/msg02689.html. I hope David can confirm in practice!
Laszlo explained to me that the problem is that after reset we end up in the SeaBIOS reset code instead of the OVMF one. This is because KVM starts executing from ffff0 instead of fffffff0 after reset, and this memory location is modified during CSM loading. SeaBIOS solves this problem by detecting the reset condition and copying a pristine image of itself from the end of 4G to the end of 1M. OVMF should do the same, but with the CSM it does not get control back after reset, since the SeaBIOS reset vector is executed instead. Why not put the OVMF reset code at the reset vector in a CSM-built SeaBIOS to solve the problem?
-- Gleb.
On Mon, Feb 18, 2013 at 08:31:01PM +0200, Gleb Natapov wrote:
Laszlo explained to me that the problem is that after reset we end up in SeaBIOS reset code instead of OVMF one. This is because kvm starts to execute from ffff0 instead of fffffff0 after reset and this memory location is modifying during CSM loading. Seabios solves this problem by detecting reset condition and copying pristine image of itself from the end of 4G to the end of 1M. OVMF should do the same, but with CSM it does not get control back after reset since Seabios reset vector is executed instead. Why not put OVMF reset code at reset vector in CSM built SeaBIOS to solve the problem?
Why not fix KVM so that it runs at fffffff0 after reset?
The only thing SeaBIOS could do is set up the segment registers and then jump to fffffff0, which is a bit of work for the same end result.
-Kevin
On Mon, 2013-02-18 at 14:00 -0500, Kevin O'Connor wrote:
On Mon, Feb 18, 2013 at 08:31:01PM +0200, Gleb Natapov wrote:
Laszlo explained to me that the problem is that after reset we end up in SeaBIOS reset code instead of OVMF one. This is because kvm starts to execute from ffff0 instead of fffffff0 after reset and this memory location is modifying during CSM loading. Seabios solves this problem by detecting reset condition and copying pristine image of itself from the end of 4G to the end of 1M. OVMF should do the same, but with CSM it does not get control back after reset since Seabios reset vector is executed instead. Why not put OVMF reset code at reset vector in CSM built SeaBIOS to solve the problem?
Why not fix KVM so that it runs at fffffff0 after reset?
The only thing SeaBIOS could do is setup the segment registers and then jump to fffffff0, which is a bit of work for the same end result.
Well, what SeaBIOS already *does* is bash on the keyboard controller to cause a reset. Which *ought* to work too; I have a patch to at least fix *that*, by resetting the PAM setup in the i440.
But yes, KVM definitely ought to be running at 0xfffffff0. This is the *vm86* code that's broken, not the native KVM version.
On Mon, Feb 18, 2013 at 07:04:08PM +0000, David Woodhouse wrote:
Well, what SeaBIOS already *does* is bash on the keyboard controller to cause a reset. Which *ought* to work too; I have a patch to at least fix *that*, by resetting the PAM setup in the i440.
The thing to be aware of here is that not all resets are equal. There is old code out there that will force a reset to go from 80386 mode to 8086 mode (or was it 286 to 8086?). So, some resets are really resumes (which must not alter memory) and some are real resets. It's a mystery to me which is which, but I know this came up the last time the QEMU reset logic was discussed.
-Kevin
On Mon, 2013-02-18 at 14:11 -0500, Kevin O'Connor wrote:
On Mon, Feb 18, 2013 at 07:04:08PM +0000, David Woodhouse wrote:
Well, what SeaBIOS already *does* is bash on the keyboard controller to cause a reset. Which *ought* to work too; I have a patch to at least fix *that*, by resetting the PAM setup in the i440.
The thing to be aware of here is that not all resets are equal. There is old code out there that will force a reset to go from 80386 mode to 8086 mode (or was it 286 to 8086?). So, some resets are really resumes (which must not alter memory) and some are real resets. It's a mystery to me which is which, but I know this came up the last time the QEMU reset logic was discussed.
Hm, yes. It will have been 286 to 8086, because ISTR there was no *other* way for the CPU to get back from 286 mode.
The i440fx data sheet (§3.0) appears to say that the default values are loaded on a *hard* reset, not a soft reset. And a reset invoked by the keyboard controller (as SeaBIOS does) is a *soft* reset. The only way to do a *hard* reset from software that's mentioned in the datasheet is the PMC turbo/reset control register (port 0x93). And that, presumably, is chipset-dependent and not something we can easily use from the reset vector without doing a bunch of hardware probing.
I suppose we could set it up in advance, during the *first* initialisation. Just point a 'do_hard_reset()' function pointer at a function of our choice, perhaps with the existing keyboard reset as a default if we don't know of anything better.
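As a rough sketch of that idea (all names are hypothetical, and the real functions would poke I/O ports rather than record a value), the reset path would call through a pointer that defaults to the keyboard controller method and is overridden once at first initialisation if something better is known:

```c
/* Which method the last reset request used; recorded here only so the
 * sketch has observable behaviour. Real code would write to hardware. */
enum reset_method { RESET_NONE, RESET_KEYBOARD, RESET_CHIPSET };
static enum reset_method last_reset_method = RESET_NONE;

/* Existing behaviour: pulse the reset line via the keyboard controller. */
static void keyboard_reset(void)
{
    last_reset_method = RESET_KEYBOARD;
}

/* Hypothetical chipset-specific hard reset. */
static void chipset_hard_reset(void)
{
    last_reset_method = RESET_CHIPSET;
}

/* Default to the keyboard reset if we don't know of anything better. */
static void (*do_hard_reset)(void) = keyboard_reset;

/* Called once during the first initialisation, after hardware probing. */
static void reset_init(int chipset_reset_known)
{
    if (chipset_reset_known) {
        do_hard_reset = chipset_hard_reset;
    }
}
```

The point of the design is that the code at the reset vector only calls do_hard_reset() and needs no probing of its own.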
So we could probably solve the software side, in the guest... but qemu doesn't seem to distinguish between a hard reset and a soft reset, so there's no way to make it reset the PAM registers in one case but not the other. Does this reset for 286->8086 mode actually work in qemu at all? Is qemu's "reset" a hard reset, or a soft reset?
I suppose given that the RCR is part of the I440FX, and the behaviour that we want to vary for hard vs. soft reset is also within the I440FX, we *could* contrive to reset the PAM registers *only* when reset via the RCR. But if I propose a patch which does it that way, will someone hunt me down and hurt me?
On Mon, Feb 18, 2013 at 09:12:46PM +0000, David Woodhouse wrote:
The i440fx data sheet (§3.0) appears to say that the default values are loaded on a *hard* reset, not a soft reset. And a reset invoked by the keyboard controller (as SeaBIOS does) is a *soft* reset. The only way to do a *hard* reset from software that's mentioned in the datasheet is the PMC turbo/reset control register (port 0x93). And that, presumably, is chipset-dependent and not something we can easily use from the reset vector without doing a bunch of hardware probing.
The ACPI v2 spec describes a "hard" reset register. SeaBIOS could extract it from the FADT and then use it. Of course, we'd probably want to update the QEMU ACPI tables to implement ACPI v2 then.
-Kevin
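Extracting the reset register from a raw FADT could look roughly like the sketch below. The struct and function names are invented for illustration; the byte offsets (Flags at 112, RESET_REG at 116 as a 12-byte Generic Address Structure, RESET_VALUE at 128) and the RESET_REG_SUP flag in bit 10 are taken from the ACPI FADT layout, and a real implementation would also validate the table length and revision.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte offsets within the FADT (ACPI 2.0 and later). */
#define FADT_FLAGS_OFFSET        112
#define FADT_RESET_REG_OFFSET    116  /* 12-byte Generic Address Structure */
#define FADT_RESET_VALUE_OFFSET  128
#define FADT_FLAG_RESET_REG_SUP  (1u << 10)

#define GAS_SPACE_SYSTEM_IO      1

struct fadt_reset {
    bool     supported;  /* RESET_REG_SUP flag */
    uint8_t  space_id;   /* 0 = memory, 1 = I/O, 2 = PCI config */
    uint8_t  bit_width;
    uint64_t address;
    uint8_t  value;      /* value to write to trigger the reset */
};

/* Read an unaligned little-endian integer of nbytes bytes. */
static uint64_t read_le(const uint8_t *p, int nbytes)
{
    uint64_t v = 0;
    for (int i = nbytes - 1; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Parse the reset-related fields; fadt must be at least 129 bytes long. */
static struct fadt_reset parse_fadt_reset(const uint8_t *fadt)
{
    const uint8_t *gas = fadt + FADT_RESET_REG_OFFSET;
    struct fadt_reset r;

    r.supported = (read_le(fadt + FADT_FLAGS_OFFSET, 4)
                   & FADT_FLAG_RESET_REG_SUP) != 0;
    r.space_id  = gas[0];
    r.bit_width = gas[1];
    r.address   = read_le(gas + 4, 8);
    r.value     = fadt[FADT_RESET_VALUE_OFFSET];
    return r;
}
```

The reset itself is then a single write of r.value to r.address in the address space named by r.space_id.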
On Mon, 2013-02-18 at 17:37 -0500, Kevin O'Connor wrote:
On Mon, Feb 18, 2013 at 09:12:46PM +0000, David Woodhouse wrote:
The i440fx data sheet (§3.0) appears to say that the default values are loaded on a *hard* reset, not a soft reset. And a reset invoked by the keyboard controller (as SeaBIOS does) is a *soft* reset. The only way to do a *hard* reset from software that's mentioned in the datasheet is the PMC turbo/reset control register (port 0x93). And that, presumably, is chipset-dependent and not something we can easily use from the reset vector without doing a bunch of hardware probing.
The ACPI v2 spec describes a "hard" reset register. SeaBIOS could extract it from the FADT and then use it. Of course, we'd probably want to update the QEMU ACPI tables to implement ACPI v2 then.
Yeah, that makes me somewhat happier about the SeaBIOS side of it being hardware-specific. That way the code at the reset vector only has to cope with a single 8-bit write to memory, IO or config space.
Laszlo has hooked up the RCR on the PIIX3 already, so something like this ought to make it reset the PAM setup *only* if reset via that...
diff --git a/hw/piix_pci.c b/hw/piix_pci.c
index 6c77e49..f4420bd 100644
--- a/hw/piix_pci.c
+++ b/hw/piix_pci.c
@@ -77,6 +77,7 @@ typedef struct PIIX3State {
 
     /* Reset Control Register contents */
     uint8_t rcr;
+    uint8_t rcr_hard_reset;
 
     /* IO memory region for Reset Control Register (RCR_IOPORT) */
     MemoryRegion rcr_mem;
@@ -84,6 +85,7 @@ typedef struct PIIX3State {
 
 struct PCII440FXState {
     PCIDevice dev;
+    PIIX3State *piix3;
     MemoryRegion *system_memory;
     MemoryRegion *pci_address_space;
     MemoryRegion *ram_memory;
@@ -171,6 +173,29 @@ static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
     return 0;
 }
 
+static void i440fx_reset(DeviceState *ds)
+{
+    PCIDevice *dev = DO_UPCAST(PCIDevice, qdev, ds);
+    PCII440FXState *d = DO_UPCAST(PCII440FXState, dev, dev);
+    uint8_t *pci_conf = d->dev.config;
+
+    if (!d->piix3->rcr_hard_reset)
+        return;
+
+    pci_conf[0x59] = 0x00; // Reset PAM setup
+    pci_conf[0x5a] = 0x00;
+    pci_conf[0x5b] = 0x00;
+    pci_conf[0x5c] = 0x00;
+    pci_conf[0x5d] = 0x00;
+    pci_conf[0x5e] = 0x00;
+    pci_conf[0x5f] = 0x00;
+    pci_conf[0x72] = 0x02; // And SMM
+
+    i440fx_update_memory_mappings(d);
+
+    d->piix3->rcr_hard_reset = 0;
+}
+
 static int i440fx_post_load(void *opaque, int version_id)
 {
     PCII440FXState *d = opaque;
@@ -297,6 +322,7 @@ static PCIBus *i440fx_common_init(const char *device_name,
         pci_bus_set_route_irq_fn(b, piix3_route_intx_pin_to_irq);
     }
     piix3->pic = pic;
+    f->piix3 = piix3;
     *isa_bus = DO_UPCAST(ISABus, qbus,
                          qdev_get_child_bus(&piix3->dev.qdev, "isa.0"));
 
@@ -521,6 +547,8 @@ static void rcr_write(void *opaque, hwaddr addr, uint64_t val, unsigned len)
     PIIX3State *d = opaque;
 
     if (val & 4) {
+        if (val & 2)
+            d->rcr_hard_reset = 1;
         qemu_system_reset_request();
         return;
     }
@@ -615,6 +643,7 @@ static void i440fx_class_init(ObjectClass *klass, void *data)
     dc->desc = "Host bridge";
     dc->no_user = 1;
     dc->vmsd = &vmstate_i440fx;
+    dc->reset = i440fx_reset;
 }
 
 static const TypeInfo i440fx_info = {
On Mon, 2013-02-18 at 23:08 +0000, David Woodhouse wrote:
Laszlo has hooked up the RCR on the PIIX3 already, so something like this ought to make it reset the PAM setup *only* if reset via that...
+static void i440fx_reset(DeviceState *ds)
+{
+    PCIDevice *dev = DO_UPCAST(PCIDevice, qdev, ds);
+    PCII440FXState *d = DO_UPCAST(PCII440FXState, dev, dev);
+    uint8_t *pci_conf = d->dev.config;
+
+    if (!d->piix3->rcr_hard_reset)
+        return;
... except that bit (referring to PIIX3 state directly from i440fx_reset(), having stashed a pointer to it) is horrible.
I've posted a 'cleaner' but much larger and more intrusive patch which shows how we could introduce a 'reset type' as a proper concept, which may well be useful for other platforms and situations too.
I'm not too bothered which way we go, but it would be very good to fix the PAM reset in qemu, because it's a genuine fix and it's *extremely* convenient to work around the KVM CS segment base bug.
On Mon, 2013-02-18 at 17:37 -0500, Kevin O'Connor wrote:
The ACPI v2 spec describes a "hard" reset register. SeaBIOS could extract it from the FADT and then use it. Of course, we'd probably want to update the QEMU ACPI tables to implement ACPI v2 then.
This sounded great until I actually came to implement it.
The PIIX reset at 0xcf9 requires *two* writes; one to set the reset type and then a second write with bit 2 set to actually do the reset.
The ACPI RESET_REG definition only allows for *one* value to be written.
Is that because the PIIX will actually do a hard reset when you write 0x06 to it *anyway*, despite theoretically saying that you should write 0x02 first? Or is the ACPI definition of RESET_REG simply incapable of being used on the PIIX?
On 02/19/13 16:29, David Woodhouse wrote:
On Mon, 2013-02-18 at 17:37 -0500, Kevin O'Connor wrote:
The ACPI v2 spec describes a "hard" reset register. SeaBIOS could extract it from the FADT and then use it. Of course, we'd probably want to update the QEMU ACPI tables to implement ACPI v2 then.
This sounded great until I actually came to implement it.
The PIIX reset at 0xcf9 requires *two* writes; one to set the reset type and then a second write with bit 2 set to actually do the reset.
The ACPI RESET_REG definition only allows for *one* value to be written.
Is that because the PIIX will actually do a hard reset when you write 0x06 to it *anyway*, despite theoretically saying that you should write 0x02 first? Or is the ACPI definition of RESET_REG simply incapable of being used on the PIIX?
The linux kernel actually considers BOOT_ACPI and BOOT_CF9 separate things; see
- native_machine_emergency_restart() [arch/x86/kernel/reboot.c],
- acpi_reboot() [drivers/acpi/reboot.c],
- acpi_reset() [drivers/acpi/acpica/hwxface.c].
BOOT_ACPI looks like a single write in any case (io space, memory, pci config).
Funnily enough, on my Thinkpad (acpidump --table FACP --binary -o fadt.aml; iasl -d fadt.aml):

/*
 * Intel ACPI Component Architecture
 * AML Disassembler version 20090123
 *
 * Disassembly of fadt.aml, Tue Feb 19 17:13:43 2013
 *
 * ACPI Data Table [FACP]
 *
 * Format: [HexOffset DecimalOffset ByteLength] FieldName : FieldValue
 */

[000h 000 4] Signature : "FACP" /* Fixed ACPI Description Table */

[070h 112 4] Flags (decoded below) : 0000C2AD
             Reset Register Supported (V2) : 0

[074h 116 12] Reset Register : <Generic Address Structure>
[074h 116 1] Space ID : 01 (SystemIO)
[075h 117 1] Bit Width : 08
[076h 118 1] Bit Offset : 00
[077h 119 1] Access Width : 00
[078h 120 8] Address : 0000000000000CF9
[080h 128 1] Value to cause reset : 06
Same on my HP Z400. "Reset register is not supported, but you could still write 6 to 0xcf9" :)
I'd say "6 to 0xCF9" is good enough; rcr_write() in qemu is OK with it too (including your patch at http://thread.gmane.org/gmane.comp.emulators.qemu/195351/focus=195387.)
Laszlo
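The single-write behaviour discussed here can be modelled in a few lines. This sketch (hypothetical names) follows the logic of the rcr_write() hunk in the patch posted earlier in the thread: bit 2 triggers the reset, and bit 1, sampled in the same write, selects a hard reset.

```c
#include <stdint.h>

enum rcr_action { RCR_NO_RESET, RCR_SOFT_RESET, RCR_HARD_RESET };

/* Decide what a single write of val to the Reset Control Register does. */
static enum rcr_action rcr_decode(uint8_t val)
{
    if (!(val & 4)) {
        return RCR_NO_RESET;   /* only latches the reset type, if any */
    }
    return (val & 2) ? RCR_HARD_RESET : RCR_SOFT_RESET;
}
```

Under this model a lone write of 0x06 requests a hard reset, so the one-value ACPI RESET_REG description is sufficient, while 0x02 alone does nothing and 0x04 alone requests a soft reset.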
On 02/18/13 20:00, Kevin O'Connor wrote:
On Mon, Feb 18, 2013 at 08:31:01PM +0200, Gleb Natapov wrote:
Laszlo explained to me that the problem is that after reset we end up in SeaBIOS reset code instead of OVMF one. This is because kvm starts to execute from ffff0 instead of fffffff0 after reset and this memory location is modifying during CSM loading. Seabios solves this problem by detecting reset condition and copying pristine image of itself from the end of 4G to the end of 1M. OVMF should do the same, but with CSM it does not get control back after reset since Seabios reset vector is executed instead. Why not put OVMF reset code at reset vector in CSM built SeaBIOS to solve the problem?
Why not fix KVM so that it runs at fffffff0 after reset?
The only thing SeaBIOS could do is setup the segment registers and then jump to fffffff0, which is a bit of work for the same end result.
Gleb told me to test under a kvm/next host kernel; there have been many real-mode related commits. I'll report back.
Laszlo
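The difference between the two addresses in this exchange can be sketched with a few lines of arithmetic. The sketch assumes the architectural x86 reset state (CS selector 0xF000 with a hidden base of 0xFFFF0000, IP 0xFFF0) and that a vm86-style approximation derives the base from the selector; the helper names are hypothetical and are not KVM code.

```python
# Sketch of the two address computations discussed above (not KVM code).
# Assumption: after reset the CPU has CS selector 0xF000 with a hidden
# base of 0xFFFF0000, and IP 0xFFF0.
RESET_SELECTOR = 0xF000
RESET_IP = 0xFFF0
RESET_HIDDEN_BASE = 0xFFFF0000

def linear_address(cs_base: int, ip: int) -> int:
    """Linear address of the next instruction fetch (32-bit wrap)."""
    return (cs_base + ip) & 0xFFFFFFFF

def vm86_cs_base(selector: int) -> int:
    """In vm86 the segment base is always selector * 16."""
    return selector << 4

architectural = linear_address(RESET_HIDDEN_BASE, RESET_IP)
approximated = linear_address(vm86_cs_base(RESET_SELECTOR), RESET_IP)
print(hex(architectural))  # top of 4 GiB, where the ROM lives
print(hex(approximated))   # top of 1 MiB, where the CSM was loaded
```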
On 02/18/13 20:09, Laszlo Ersek wrote:
On 02/18/13 20:00, Kevin O'Connor wrote:
On Mon, Feb 18, 2013 at 08:31:01PM +0200, Gleb Natapov wrote:
Laszlo explained to me that the problem is that after reset we end up in the SeaBIOS reset code instead of the OVMF one. This is because kvm starts to execute from ffff0 instead of fffffff0 after reset, and this memory location is modified during CSM loading. SeaBIOS solves this problem by detecting the reset condition and copying a pristine image of itself from the end of 4G to the end of 1M. OVMF should do the same, but with CSM it does not get control back after reset, since the SeaBIOS reset vector is executed instead. Why not put OVMF reset code at the reset vector in CSM-built SeaBIOS to solve the problem?
Why not fix KVM so that it runs at fffffff0 after reset?
The only thing SeaBIOS could do is set up the segment registers and then jump to fffffff0, which is a bit of work for the same end result.
Gleb told me to test under a kvm/next host kernel; there have been many real-mode related commits. I'll report back.
I built a host kernel from http://git.kernel.org/?p=virt/kvm/kvm.git;a=shortlog;h=refs/heads/next, currently at commit cbd29cb6.
The guest reboot works now. :)
Thanks all! Laszlo
On Mon, Feb 18, 2013 at 02:00:52PM -0500, Kevin O'Connor wrote:
On Mon, Feb 18, 2013 at 08:31:01PM +0200, Gleb Natapov wrote:
Laszlo explained to me that the problem is that after reset we end up in the SeaBIOS reset code instead of the OVMF one. This is because kvm starts to execute from ffff0 instead of fffffff0 after reset, and this memory location is modified during CSM loading. SeaBIOS solves this problem by detecting the reset condition and copying a pristine image of itself from the end of 4G to the end of 1M. OVMF should do the same, but with CSM it does not get control back after reset, since the SeaBIOS reset vector is executed instead. Why not put OVMF reset code at the reset vector in CSM-built SeaBIOS to solve the problem?
Why not fix KVM so that it runs at fffffff0 after reset?
Because KVM uses the VMX extension, and VMX on a CPU without "unrestricted guest" is not capable of doing so. Recent KVM code should be able to emulate real mode from the fffffff0 address instead of trying to enter vmx guest mode. I asked Laszlo to check whether that is so, but even if KVM in 3.9 works, it will not fix all the existing kernels out there. The old behaviour of approximating real mode by vm86 is still supported via the emulate_invalid_guest_state=false kernel module option, and it would be nice if it did not break OVMF, since it can be used as a workaround in case an unemulated instruction is encountered.
The only thing SeaBIOS could do is set up the segment registers and then jump to fffffff0, which is a bit of work for the same end result.
If it jumps to fffffff0, KVM will jump to ffff0 instead :) It should restore the OVMF state from before the CSM was loaded, and then reset.
-- Gleb.
On Mon, Feb 18, 2013 at 09:17:05PM +0200, Gleb Natapov wrote:
On Mon, Feb 18, 2013 at 02:00:52PM -0500, Kevin O'Connor wrote:
Why not fix KVM so that it runs at fffffff0 after reset?
Because KVM uses the VMX extension, and VMX on a CPU without "unrestricted guest" is not capable of doing so. Recent KVM code should be able to emulate real mode from the fffffff0 address instead of trying to enter vmx guest mode. I asked Laszlo to check whether that is so, but even if KVM in 3.9 works, it will not fix all the existing kernels out there. The old behaviour of approximating real mode by vm86 is still supported via the emulate_invalid_guest_state=false kernel module option, and it would be nice if it did not break OVMF, since it can be used as a workaround in case an unemulated instruction is encountered.
For old versions of KVM, SeaBIOS can detect the loop and issue a shutdown. Not nice for users to have their "reboot" turn into a "poweroff", but likely better than just a hang.
The only thing SeaBIOS could do is set up the segment registers and then jump to fffffff0, which is a bit of work for the same end result.
If it jumps to fffffff0, KVM will jump to ffff0 instead :) It should restore the OVMF state from before the CSM was loaded, and then reset.
I take it you mean copy 0xfffe0000 to 0xe0000? That would not be fun. SeaBIOS would need to detect that it's in that state (it's definitely not correct to do that on real hardware or on "working" kvm instances), then set up a trampoline somewhere outside of 0xe0000-0xfffff to do the memcpy, jump to that trampoline, copy the memory, restore segment registers, and then jump to 0xfffffff0. That's a lot of kvm-specific code to add to SeaBIOS as a workaround, and it seems fragile anyway.
-Kevin
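The copy described above, and the reason a trampoline comes into it, can be illustrated with a toy memory model. This is only a sketch: the 1 MiB bytearray, the fill bytes "S" (CSM) and "O" (ROM image) and both helper names are assumptions made for the example, and nothing here is SeaBIOS code.

```python
# Toy model of "copy 0xfffe0000 to 0xe0000" (not SeaBIOS code).
BIOS_SIZE = 0x20000      # 128 KiB, i.e. the range 0xe0000-0xfffff
LOW_BASE = 0xE0000       # start of the shadowed range below 1 MiB

def restore_low_bios(low_ram: bytearray, rom_image: bytes) -> None:
    """Copy the top 128 KiB of the ROM image over 0xe0000-0xfffff."""
    if len(rom_image) < BIOS_SIZE:
        raise ValueError("ROM image smaller than the shadowed range")
    low_ram[LOW_BASE:LOW_BASE + BIOS_SIZE] = rom_image[-BIOS_SIZE:]

def overlaps_copy_target(code_start: int, code_end: int) -> bool:
    """True if [code_start, code_end) intersects the range being overwritten."""
    return code_start < LOW_BASE + BIOS_SIZE and code_end > LOW_BASE

low_ram = bytearray(0x100000)            # first 1 MiB of guest RAM
low_ram[LOW_BASE:] = b"S" * BIOS_SIZE    # CSM currently loaded there
rom_image = b"O" * (2 * BIOS_SIZE)       # stand-in for the ROM at the top of 4 GiB

# Copy code running from the f-segment would overwrite itself with
# different bytes, hence the trampoline outside 0xe0000-0xfffff.
print(overlaps_copy_target(0xF0000, 0xF0100))
print(overlaps_copy_target(0x7000, 0x7100))

restore_low_bios(low_ram, rom_image)
print(chr(low_ram[0xFFFF0]))
```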
On Mon, Feb 18, 2013 at 02:33:23PM -0500, Kevin O'Connor wrote:
On Mon, Feb 18, 2013 at 09:17:05PM +0200, Gleb Natapov wrote:
On Mon, Feb 18, 2013 at 02:00:52PM -0500, Kevin O'Connor wrote:
Why not fix KVM so that it runs at fffffff0 after reset?
Because KVM uses the VMX extension, and VMX on a CPU without "unrestricted guest" is not capable of doing so. Recent KVM code should be able to emulate real mode from the fffffff0 address instead of trying to enter vmx guest mode. I asked Laszlo to check whether that is so, but even if KVM in 3.9 works, it will not fix all the existing kernels out there. The old behaviour of approximating real mode by vm86 is still supported via the emulate_invalid_guest_state=false kernel module option, and it would be nice if it did not break OVMF, since it can be used as a workaround in case an unemulated instruction is encountered.
For old versions of KVM, SeaBIOS can detect the loop and issue a shutdown. Not nice for users to have their "reboot" turn into a "poweroff", but likely better than just a hang.
The only thing SeaBIOS could do is set up the segment registers and then jump to fffffff0, which is a bit of work for the same end result.
If it jumps to fffffff0, KVM will jump to ffff0 instead :) It should restore the OVMF state from before the CSM was loaded, and then reset.
I take it you mean copy 0xfffe0000 to 0xe0000? That would not be fun. SeaBIOS would need to detect that it's in that state (it's definitely not correct to do that on real hardware or on "working" kvm instances), then set up a trampoline somewhere outside of 0xe0000-0xfffff to do the memcpy, jump to that trampoline, copy the memory, restore segment registers, and then jump to 0xfffffff0. That's a lot of kvm-specific code to add to SeaBIOS as a workaround, and it seems fragile anyway.
Isn't this exactly what qemu_prep_reset() is doing now?
-- Gleb.
On Tue, 2013-02-19 at 20:13 +0200, Gleb Natapov wrote:
I take it you mean copy 0xfffe0000 to 0xe0000? That would not be fun. SeaBIOS would need to detect that it's in that state (it's definitely not correct to do that on real hardware or on "working" kvm instances), then set up a trampoline somewhere outside of 0xe0000-0xfffff to do the memcpy, jump to that trampoline, copy the memory, restore segment registers, and then jump to 0xfffffff0. That's a lot of kvm-specific code to add to SeaBIOS as a workaround, and it seems fragile anyway.
Isn't this exactly what qemu_prep_reset() is doing now?
No. It doesn't do the trampoline thing because it doesn't *have* to; it's copying an identical copy of the code back over itself.
On Tue, Feb 19, 2013 at 06:35:03PM +0000, David Woodhouse wrote:
On Tue, 2013-02-19 at 20:13 +0200, Gleb Natapov wrote:
I take it you mean copy 0xfffe0000 to 0xe0000? That would not be fun. SeaBIOS would need to detect that it's in that state (it's definitely not correct to do that on real hardware or on "working" kvm instances), then set up a trampoline somewhere outside of 0xe0000-0xfffff to do the memcpy, jump to that trampoline, copy the memory, restore segment registers, and then jump to 0xfffffff0. That's a lot of kvm-specific code to add to SeaBIOS as a workaround, and it seems fragile anyway.
Isn't this exactly what qemu_prep_reset() is doing now?
No. It doesn't do the trampoline thing because it doesn't *have* to; it's copying an identical copy of the code back over itself.
Ah, yes of course. So does the CSM take the whole 0xe0000-0xfffff segment, or does it leave OVMF code there somewhere? If OVMF code remains there, the CSM reset code can jump into it in the 0xe0000-0xfffff range and let it do the copy.
-- Gleb.
On Tue, 2013-02-19 at 20:41 +0200, Gleb Natapov wrote:
Ah, yes of course. So does the CSM take the whole 0xe0000-0xfffff segment, or does it leave OVMF code there somewhere? If OVMF code remains there, the CSM reset code can jump into it in the 0xe0000-0xfffff range and let it do the copy.
There is no OVMF code there; OVMF doesn't bother to put *anything* into the RAM at 1MiB-δ unless there's a CSM.
CSM code isn't supposed to be hardware-specific, but I suppose for the CSM running under KVM case we could *potentially* have a hack at the reset vector so that when we do find ourselves there under a buggy qemu/KVM implementation, it could set up a trampoline, reset the PAM registers manually (so that the KVM CS base address bug doesn't actually *hurt* us), then try again?
I'd rather implement the 0xcf9 reset properly in qemu though, and make SeaBIOS use that (which it can do *sanely* as a CSM if it's in the ACPI tables).
On Tue, Feb 19, 2013 at 06:48:41PM +0000, David Woodhouse wrote:
On Tue, 2013-02-19 at 20:41 +0200, Gleb Natapov wrote:
Ah, yes of course. So does the CSM take the whole 0xe0000-0xfffff segment, or does it leave OVMF code there somewhere? If OVMF code remains there, the CSM reset code can jump into it in the 0xe0000-0xfffff range and let it do the copy.
There is no OVMF code there; OVMF doesn't bother to put *anything* into the RAM at 1MiB-δ unless there's a CSM.
It runs from ROM and does not shadow itself?
CSM code isn't supposed to be hardware-specific, but I suppose for the CSM running under KVM case we could *potentially* have a hack at the reset vector so that when we do find ourselves there under a buggy qemu/KVM implementation, it could set up a trampoline, reset the PAM registers manually (so that the KVM CS base address bug doesn't actually *hurt* us), then try again?
Yes, we are trying to come up with a qemu/KVM-specific hack here.
I'd rather implement the 0xcf9 reset properly in qemu though, and make SeaBIOS use that (which it can do *sanely* as a CSM if it's in the ACPI tables).
I didn't follow that other discussion about hard/soft reset. How will a proper 0xcf9 reset fix the problem? What will it do that system_reset does not?
-- Gleb.
On Tue, 2013-02-19 at 21:01 +0200, Gleb Natapov wrote:
On Tue, Feb 19, 2013 at 06:48:41PM +0000, David Woodhouse wrote:
On Tue, 2013-02-19 at 20:41 +0200, Gleb Natapov wrote:
Ah, yes of course. So does the CSM take the whole 0xe0000-0xfffff segment, or does it leave OVMF code there somewhere? If OVMF code remains there, the CSM reset code can jump into it in the 0xe0000-0xfffff range and let it do the copy.
There is no OVMF code there; OVMF doesn't bother to put *anything* into the RAM at 1MiB-δ unless there's a CSM.
It runs from ROM and does not shadow itself?
It has no need to shadow itself. It loads the SeaBIOS CSM into the range under 1MiB, if it wants to support legacy BIOS. Other than that, it never cares about 16-bit code at all.
I'd rather implement the 0xcf9 reset properly in qemu though, and make SeaBIOS use that (which it can do *sanely* as a CSM if it's in the ACPI tables).
I didn't follow that other discussion about hard/soft reset. How will a proper 0xcf9 reset fix the problem? What will it do that system_reset does not?
A full *hard* reset (0xcf9) will reset the PAM configuration, and thus the BIOS from 4GiB-δ *would* be shadowed into 1MiB-δ, by hardware.
But qemu doesn't *implement* a full hard reset; it doesn't reset the PAM registers.
And making it do so the naïve way, by just hooking up a simple device reset function to do so, would be wrong. Because it *shouldn't* happen on a soft reset, such as a triple-fault or a reset triggered by the keyboard controller. Since a soft reset was the only way to get back from 80286 protected mode to 8086 mode, some software may actually *use* it and expect it to behave correctly.
Hence the discussion about reset handling.
We'd need to fix SeaBIOS to use the 0xcf9 reset too; currently it'll sit in an endless loop of keyboard-induced *soft* resets anyway, because it tries that before 0xcf9.
And in fact it probably shouldn't use the hard-coded 0xcf9 reset; it should use the one indicated by the ACPI RESET_REG field (which *is* 0xcf9... or should be).
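The hard/soft distinction argued above can be summarised in a small model. This is a deliberately simplified sketch of the behaviour being proposed, not of what qemu did at the time: the class, its attributes and the single boolean standing in for the PAM registers are all invented for illustration. The port numbers are the ones mentioned in the thread (0xfe written to the keyboard controller at port 0x64, 6 written to 0xcf9).

```python
# Toy model of the proposed reset semantics (not qemu code).
class ToyMachine:
    def __init__(self) -> None:
        self.pam_reads_ram = False   # power-on: f-segment reads hit the ROM
        self.f_segment_ram = None    # whatever was shadowed below 1 MiB

    def load_csm(self, name: str) -> None:
        """Firmware shadows a CSM into RAM and flips the PAM setting."""
        self.f_segment_ram = name
        self.pam_reads_ram = True

    def code_at_reset_vector(self) -> str:
        """What a fetch from 0xffff0 would execute."""
        return self.f_segment_ram if self.pam_reads_ram else "OVMF"

    def io_write(self, port: int, value: int) -> None:
        if port == 0xCF9 and value == 0x06:
            # Hard reset: chipset state, including PAM, returns to defaults.
            self.pam_reads_ram = False
        elif port == 0x64 and value == 0xFE:
            # Soft reset via the keyboard controller: CPU only, PAM untouched.
            pass

machine = ToyMachine()
machine.load_csm("SeaBIOS")
machine.io_write(0x64, 0xFE)
print(machine.code_at_reset_vector())   # still the CSM after a soft reset
machine.io_write(0xCF9, 0x06)
print(machine.code_at_reset_vector())   # ROM visible again after a hard reset
```

Keeping the soft path free of PAM changes is the point of the argument about 80286-era software that relies on a soft reset leaving memory configuration alone.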
On 19/02/2013 20:39, David Woodhouse wrote:
We'd need to fix SeaBIOS to use the 0xcf9 reset too; currently it'll sit in an endless loop of keyboard-induced *soft* resets anyway, because it tries that before 0xcf9.
And in fact it probably shouldn't use the hard-coded 0xcf9 reset; it should use the one indicated by the ACPI RESET_REG field (which *is* 0xcf9... or should be).
We should implement this: http://mjg59.dreamwidth.org/3561.html
A while back I did some tests with Windows running on top of qemu. This is a great way to evaluate OS behaviour, because you've got complete control of what's handed to the OS and what the OS tries to do to the hardware. And what I discovered was a little surprising. In the absence of an ACPI reboot vector, Windows will hit the keyboard controller, wait a while, hit it again and then give up. If an ACPI reboot vector is present, Windows will poke it, try the keyboard controller, poke the ACPI vector again and try the keyboard controller one more time.
This turns out to be important. The first thing it means is that it generates two writes to the ACPI reboot vector. The second is that it leaves a gap between them while it's fiddling with the keyboard controller. And, shockingly, it turns out that on most systems the ACPI reboot vector points at 0xcf9 in system IO space. Even though most implementations nominally require two different values be written, it seems that this isn't a strict requirement and the ACPI method works.
Paolo
On Tue, 2013-02-19 at 21:49 +0100, Paolo Bonzini wrote:
And in fact it probably shouldn't use the hard-coded 0xcf9 reset; it should use the one indicated by the ACPI RESET_REG field (which *is* 0xcf9... or should be).
We should implement this: http://mjg59.dreamwidth.org/3561.html
Matthew fails to distinguish between a hard reset and a soft reset. From the CSM if we do find ourselves running at 0xffff0 (which should never happen except under buggy KVM emulation anyway), we really do need to be using the 0xcf9 reset (or the ACPI reset, which is going to point to the same thing in general), and *not* the keyboard reset. And, of course, we need it to work correctly and reset the PAM configuration (qv).
However, a single bash on the 0xcf9 register ought to suffice so the ACPI/kbd/ACPI/kbd loop that Matthew describes is probably acceptable. As long as it does the ACPI one *first*.
( It's also interesting that, as Laszlo observes, machines tend to set the RESET_REG in the FADT *without* setting the enabled bit in the FADT flags. Does Windows use it anyway? And is there a reason for *not* setting the enabled bit, or is it just that all PC BIOSes are written by crack-smoking hobos that they drag in off the street, and this is just an artefact of the rule "anything they *can* get wrong and still boot Windows, they *will* get wrong"? )
David Woodhouse wrote:
is it just that all PC BIOSes are written by crack-smoking hobos that they drag in off the street, and this is just an artefact of the rule "anything they *can* get wrong and still boot Windows, they *will* get wrong"?
I wouldn't be surprised.
//Peter
On 02/19/13 19:41, Gleb Natapov wrote:
On Tue, Feb 19, 2013 at 06:35:03PM +0000, David Woodhouse wrote:
On Tue, 2013-02-19 at 20:13 +0200, Gleb Natapov wrote:
I take it you mean copy 0xfffe0000 to 0xe0000? That would not be fun. SeaBIOS would need to detect that it's in that state (it's definitely not correct to do that on real hardware or on "working" kvm instances), then set up a trampoline somewhere outside of 0xe0000-0xfffff to do the memcpy, jump to that trampoline, copy the memory, restore segment registers, and then jump to 0xfffffff0. That's a lot of kvm-specific code to add to SeaBIOS as a workaround, and it seems fragile anyway.
Isn't this exactly what qemu_prep_reset() is doing now?
No. It doesn't do the trampoline thing because it doesn't *have* to; it's copying an identical copy of the code back over itself.
Ah, yes of course. So does the CSM take the whole 0xe0000-0xfffff segment, or does it leave OVMF code there somewhere? If OVMF code remains there, the CSM reset code can jump into it in the 0xe0000-0xfffff range and let it do the copy.
I think the only thing you could know about the UEFI environment (call-wise or jump-wise) while in the CSM is ReverseThunkCallSegment / ReverseThunkCallOffset. Theoretically those allow the CSM to call back into EfiCompatibility (32-bit protected mode environment).
Some problems:

- Using the reverse thunk might only be allowed if we ended up in real mode coming through the forward thunk to begin with. When qemu/kvm simply jumps to 0xffff0 (reset request from guest OS), this doesn't hold.

- Currently no reverse thunk functions are defined in the CSM spec, and the implementation in TianoCore seems ... absent. The directory "IntelFrameworkModulePkg/Csm/LegacyBiosDxe/Ipf" appears to contain some incomplete Itanium assembly.
Anyway, David is fixing qemu to reset the PAMs at hard reset, so OVMF should show up again in the f-segment, accommodating older kvm hosts.
Laszlo