I've recently upgraded a host from Ubuntu Precise (qemu-kvm-1.0) to Ubuntu Trusty (qemu 2.0.0). I have a Windows 2008 32-bit OS running on a 64-bit VM that runs a 16-bit line of business application. (While I realize that "upgrade the application" is the right answer, I can't.) The Windows VM boots, but the business application crashes on startup. Changing the VM to be a 32-bit VM doesn't help.
After several hours of compiling and testing intermediate qemu versions, I accidentally stumbled into the real issue. Trusty has switched from vgabios to seabios. I have confirmed that switching the vgabios*.bin images back to the vgabios package (rather than seabios) fixes the 16-bit application in the guest.
Per a suggestion on the Ubuntu bug I filed, I built an updated seabios package using the source from git (specifically, revision 60e0e55f212dadd043ab9e39bee05a48013ddd8f). It has the same problem.
I then set CONFIG_DEBUG_LEVEL=8 and booted with "-chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios" per: http://www.seabios.org/pipermail/seabios/2011-May/001718.html
The debug log is attached. For more details, including a couple of screenshots of the NTVDM crash dialog, see: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1404396
What's the next step in debugging this?
On 01/09/15 04:36, Richard Laager wrote:
I've recently upgraded a host from Ubuntu Precise (qemu-kvm-1.0) to Ubuntu Trusty (qemu 2.0.0). I have a Windows 2008 32-bit OS running on a 64-bit VM that runs a 16-bit line of business application. (While I realize that "upgrade the application" is the right answer, I can't.) The Windows VM boots, but the business application crashes on startup. Changing the VM to be a 32-bit VM doesn't help.
After several hours of compiling and testing intermediate qemu versions, I accidentally stumbled into the real issue. Trusty has switched from vgabios to seabios. I have confirmed that switching the vgabios*.bin images back to the vgabios package (rather than seabios) fixes the 16-bit application in the guest.
Per a suggestion on the Ubuntu bug I filed, I built an updated seabios package using the source from git (specifically, revision 60e0e55f212dadd043ab9e39bee05a48013ddd8f). It has the same problem.
I then set CONFIG_DEBUG_LEVEL=8 and booted with "-chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios" per: http://www.seabios.org/pipermail/seabios/2011-May/001718.html
The debug log is attached. For more details, including a couple of screenshots of the NTVDM crash dialog, see: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1404396
http://en.wikipedia.org/wiki/NTVDM
Very roughly, it's Windows' 16-bit emulator. It parses real mode code and emulates the instructions. The SeaVGABIOS binary apparently contains at least one instruction that, albeit valid, confuses NTVDM and causes it to crash.
This has happened several times before. Not just with NTVDM but also x86emu -- search this list for "x86emu". x86emu is free software with a similar role, and one version or another of the X server uses it to "execute" 16-bit VBE code.
The original vgabios was written in assembly, which (probably) made its maintenance hell, but it provided full control over the instructions in the final binary (so issues like this were either never encountered or were quickly fixed). SeaVGABIOS is (mostly) written in C, and sometimes gcc generates "sophisticated" code that confuses old emulators. Then usually Kevin tracks it down and does some magic to make it go away (check out "scripts/vgafixup.py").
What's the next step in debugging this?
The offending instruction should be found.
The NTVDM crash info is not directly useful because that pinpoints (?) a location in the NTVDM code (for which you don't have the source). The problematic SeaVGABIOS instruction counts as data for NTVDM.
You could try to bisect SeaVGABIOS, and/or build it with an older gcc.
(I probably made several errors in the above; corrections more than welcome...)
Thanks,
Laszlo
On Thu, Jan 08, 2015 at 09:36:44PM -0600, Richard Laager wrote:
I've recently upgraded a host from Ubuntu Precise (qemu-kvm-1.0) to Ubuntu Trusty (qemu 2.0.0). I have a Windows 2008 32-bit OS running on a 64-bit VM that runs a 16-bit line of business application. (While I realize that "upgrade the application" is the right answer, I can't.) The Windows VM boots, but the business application crashes on startup. Changing the VM to be a 32-bit VM doesn't help.
After several hours of compiling and testing intermediate qemu versions, I accidentally stumbled into the real issue. Trusty has switched from vgabios to seabios. I have confirmed that switching the vgabios*.bin images back to the vgabios package (rather than seabios) fixes the 16-bit application in the guest.
Hi Richard,
Thanks for the detailed report, and I'm sorry that you are having problems.
Per a suggestion on the Ubuntu bug I filed, I built an updated seabios package using the source from git (specifically, revision 60e0e55f212dadd043ab9e39bee05a48013ddd8f). It has the same problem.
I then set CONFIG_DEBUG_LEVEL=8 and booted with "-chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios" per: http://www.seabios.org/pipermail/seabios/2011-May/001718.html
The debug log is attached. For more details, including a couple of screenshots of the NTVDM crash dialog, see: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1404396
Does the crash occur around the time one of the debug messages is produced, or is the crash seemingly uncorrelated? I don't see anything suspicious in the log.
What's the next step in debugging this?
Getting a test case that we could run to reproduce on our side would really help. Do other 16bit programs in your environment also crash?
Does the app or some part of its 16bit code run for some time before the crash, or does the crash occur immediately (ie, is it NTVDM crashing or is something in the app causing NTVDM to crash)?
I'm a bit surprised that NTVDM would be directly accessing the video bios, or would allow the 16bit programs it runs to directly access the bios. Can you provide some more info on the app itself - is it a dos program, a windows 3.0 program, does it run full screen or in a window? If it runs full screen, is it possible to run it in a window, and does that still crash?
Ultimately what we want to try and do is find what part of seavgabios is at issue.
There are a couple of things you could try to see if it makes any difference - entirely disable debugging in seavgabios (CONFIG_DEBUG_LEVEL=0) in the unlikely case that the debug port writes themselves are causing confusion, and try disabling CONFIG_VGA_ALLOCATE_EXTRA_STACK.
Also, can you check if the emulated cirrus vga card has the same issue (qemu command line of "-vga cirrus" and make sure there is no "-vga std" - not sure how one does that from libvirt).
-Kevin
[I've reordered the quoted text.]
On Fri, 2015-01-09 at 11:58 -0500, Kevin O'Connor wrote:
Does the crash occur around the time one of the debug messages is produced, or is the crash seemingly uncorrelated? I don't see anything suspicious in the log.
Sorry, I forgot to note this in my email. All of the messages are from bootup. There are no logs from when I start the 16-bit application.
Do other 16bit programs in your environment also crash?
The programs in this suite are the only 16-bit applications we run.
Does the app or some part of its 16bit code run for some time before the crash, or does the crash occur immediately (ie, is it NTVDM crashing or is something in the app causing NTVDM to crash)?
It crashes immediately on startup.
I'm a bit surprised that NTVDM would be directly accessing the video bios, or would allow the 16bit programs it runs to directly access the bios. Can you provide some more info on the app itself - is it a dos program, a windows 3.0 program, does it run full screen or in a window? If it runs full screen, is it possible to run it in a window, and does that still crash?
I'm fairly confident they're Windows programs, as opposed to DOS programs. They are not fullscreen, they run in a window.
Getting a test case that we could run to reproduce on our side would really help.
Paolo Bonzini suggested I try the 16-bit SkiFree. It works on vgabios. It crashes on seabios, immediately on start.
To be specific, 16-bit Skifree crashes with the Ubuntu-packaged seabios 1.7.4-4 on 32-bit Windows Server 2008 running on Ubuntu Trusty. It also crashes with a build from seabios git that I did just now.
Also, can you check if the emulated cirrus vga card has the same issue (qemu command line of "-vga cirrus" and make sure there is no "-vga std" - not sure how one does that from libvirt).
Skifree still crashes with seabios git and cirrus. I verified there is no -vga std in the command line in the cirrus test. The only "vga" on the command line is: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2
There are a couple of things you could try to see if it makes any difference - entirely disable debugging in seavgabios (CONFIG_DEBUG_LEVEL=0) in the unlikely case that the debug port writes themselves are causing confusion, and try disabling CONFIG_VGA_ALLOCATE_EXTRA_STACK.
Turning off CONFIG_VGA_ALLOCATE_EXTRA_STACK makes Skifree work on cirrus and vga.
I don't know how this all works, but as a guess, I further tried with "vga" (not cirrus) using VGA_EXTRA_STACK_SIZE=256 and VGA_EXTRA_STACK_SIZE=16. The crash recurs with both.
On Mon, Jan 12, 2015 at 02:53:58AM -0600, Richard Laager wrote:
On Fri, 2015-01-09 at 11:58 -0500, Kevin O'Connor wrote:
There are a couple of things you could try to see if it makes any difference - entirely disable debugging in seavgabios (CONFIG_DEBUG_LEVEL=0) in the unlikely case that the debug port writes themselves are causing confusion, and try disabling CONFIG_VGA_ALLOCATE_EXTRA_STACK.
Turning off CONFIG_VGA_ALLOCATE_EXTRA_STACK makes Skifree work on cirrus and vga.
Interesting. Does your original business app then also work with CONFIG_VGA_ALLOCATE_EXTRA_STACK disabled?
-Kevin
On Mon, 2015-01-12 at 10:32 -0500, Kevin O'Connor wrote:
Interesting. Does your original business app then also work with CONFIG_VGA_ALLOCATE_EXTRA_STACK disabled?
Yes. I couldn't be sure if I tested before, since I didn't write it down. But I just tested now.
Obviously this extra stack feature exists for a reason. What are the implications of turning it off?
On 12/01/2015 16:59, Richard Laager wrote:
On Mon, 2015-01-12 at 10:32 -0500, Kevin O'Connor wrote:
Interesting. Does your original business app then also work with CONFIG_VGA_ALLOCATE_EXTRA_STACK disabled?
Yes. I couldn't be sure if I tested before, since I didn't write it down. But I just tested now.
Obviously this extra stack feature exists for a reason. What are the implications of turning it off?
commit 4a8b58cb6cccc8f6431167dfdd36f3e39601ff79
Author: Kevin O'Connor kevin@koconnor.net
Date:   Sat Nov 30 19:16:15 2013 -0500

    vgabios: Support allocating an extra stack for vgabios calls and default on.

    Add code to allocate an extra stack for the main vgabios int 0x10 entry point. The allocation is done via the PMM spec and uses a PCI v3 permanent low memory region request. This request will work with SeaBIOS - it is unknown how many other main BIOS implementations support this PMM call.

    The extra stack is useful for old DOS programs that call the VGABIOS and expect it to work with very small amounts of stack space.

    Signed-off-by: Kevin O'Connor kevin@koconnor.net
Paolo
On Mon, Jan 12, 2015 at 09:59:49AM -0600, Richard Laager wrote:
On Mon, 2015-01-12 at 10:32 -0500, Kevin O'Connor wrote:
Interesting. Does your original business app then also work with CONFIG_VGA_ALLOCATE_EXTRA_STACK disabled?
Yes. I couldn't be sure if I tested before, since I didn't write it down. But I just tested now.
Obviously this extra stack feature exists for a reason. What are the implications of turning it off?
There is a high-level description of the feature at:
http://seabios.org/Execution_and_code_flow#Extra_16bit_stack
Basically, SeaVGABIOS is capable of allocating space for an internal stack at startup and then switching to that stack on each vgabios call so that it can work with some really old DOS-era programs that used ridiculously small stacks. If things work for you with the feature off, then it's not a problem to leave it off.
I'm trying to reproduce the fault with skifree locally. If I can't, it would be helpful if you can run some additional tests and grab some additional logs. I'll let you know.
-Kevin
On Mon, Jan 12, 2015 at 02:53:58AM -0600, Richard Laager wrote:
Turning off CONFIG_VGA_ALLOCATE_EXTRA_STACK makes Skifree work on cirrus and vga.
I was able to reproduce this locally with 16bit skifree on Windows Vista. (Interestingly, the problem doesn't occur on winxp.)
The issue doesn't appear to be with the SeaVGABIOS stack switching, but with the fact that the SeaBIOS PMM code places the stack in the e-segment. Turning off MALLOC_UPPERMEMORY in SeaBIOS allows SeaVGABIOS to run even with CONFIG_VGA_ALLOCATE_EXTRA_STACK set.
My guess is that Windows is emulating the vgabios, but marking the 0xc0000-0x100000 region as read-only. Oddly, it doesn't appear that Windows actually lets the code talk to the VGA hardware, as the debug output (and presumably other in/out accesses) is suppressed. So, it's unclear what Windows is attempting to do with its emulation.
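To make the address arithmetic behind this guess concrete, here is a minimal sketch in Python. It assumes the read-only region is exactly 0xc0000-0x100000 as guessed above; the segment values 0xE000 and 0x9FC0 and the 512-byte stack size are illustrative placeholders, not values taken from the debug log.

```python
# Assumption from the guess above: NTVDM treats 0xc0000-0x100000 as read-only.
READ_ONLY_REGION = range(0xC0000, 0x100000)


def linear(segment: int, offset: int) -> int:
    """Real-mode address translation: linear = segment * 16 + offset."""
    return (segment << 4) + offset


def stack_is_writable(stack_seg: int, stack_size: int) -> bool:
    """True if neither end of a small stack lies in the assumed read-only region."""
    first = linear(stack_seg, 0)
    last = linear(stack_seg, stack_size - 1)
    return first not in READ_ONLY_REGION and last not in READ_ONLY_REGION


# Hypothetical placements: a stack handed out in the e-segment versus one
# placed below 640K (for example in the EBDA).
print(stack_is_writable(0xE000, 512))  # e-segment placement
print(stack_is_writable(0x9FC0, 512))  # placement below 640K
```

Under that assumption, any stack handed out in the e-segment fails the check, which would be consistent with the crash going away when MALLOC_UPPERMEMORY is turned off.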
Not sure what the best way forward is here. It seems this is a choice between supporting some very old programs vs support for some other very old programs. Paolo and Gerd, maybe you have some ideas?
I can think of a few options:
1 - do nothing - let users use a modified seabios/seavgabios or the "lgpl vgabios" for this situation. Not great - especially considering how difficult it would be to know if one is in this situation or not.
2 - default SeaVGABIOS to CONFIG_VGA_ALLOCATE_EXTRA_STACK off. Known to break old programs - for example, DOS 1.0. SeaVGABIOS can use just under 300 bytes of stack space for some calls.
3 - default SeaBIOS to MALLOC_UPPERMEMORY off. Unfortunately, this wastes additional space below 640K and it's unclear what impact that would have on old programs.
4 - Change SeaVGABIOS to allocate its stack in the EBDA instead of via a PMM call. Unfortunately, I've seen at least one old DOS-era program that ignores the EBDA allocations and writes to the end of 640K memory. It's unclear how it would react to a SeaVGABIOS stack being there.
5 - Like 4, but know that SeaBIOS doesn't use the bottom half of the first 1K of EBDA and use that. Same problems as 4.
6 - Try to detect if the code is called in VM86 mode and don't use the extra stack then - see patch below. The patch does make skifree work, but I'm uncertain if it would catch other users (eg, kvm on some intel chipsets?, some old dos program if dos is using emm386 mode).
-Kevin
--- a/vgasrc/vgaentry.S
+++ b/vgasrc/vgaentry.S
@@ -8,6 +8,7 @@
 #include "asm-offsets.h" // BREGS_*
 #include "config.h" // CONFIG_*
 #include "entryfuncs.S" // ENTRY_*
+#include "x86.h" // CR0_PE
 
 
 /****************************************************************
@@ -109,6 +110,13 @@ entry_10:
 entry_10_extrastack:
         cli
         cld
+
+        push %ax                // Don't use extra stack if in protected mode
+        smsww %ax
+        test $CR0_PE, %ax
+        pop %ax
+        jne entry_10
+
         pushw %ds               // Set %ds:%eax to space on ExtraStack
         pushl %eax
         movw %cs:ExtraStackSeg, %ds
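The decision the patch adds can be sketched outside of assembly. This is an illustration only: `use_extra_stack` is a hypothetical helper name, and the only fact it relies on is that PE is bit 0 of CR0 (and of the machine status word that SMSW stores). The sample status words are made up.

```python
CR0_PE = 1 << 0  # protection-enable bit: bit 0 of CR0 / the machine status word


def use_extra_stack(msw: int) -> bool:
    """Mirror of the patch's test: switch stacks only when PE is clear.

    In the assembly, `test $CR0_PE, %ax` followed by `jne entry_10` falls
    back to the caller's stack whenever the PE bit is set.
    """
    return (msw & CR0_PE) == 0


# Hypothetical status words: PE clear (real mode) and PE set (e.g. a VM86
# task started by a protected-mode OS).
print(use_extra_stack(0x0010))
print(use_extra_stack(0x0011))
```

Because `pop %ax` does not modify the flags, the result of the `test` is still valid when the `jne` executes.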
On 12/01/2015 19:19, Kevin O'Connor wrote:
I was able to reproduce this locally with 16bit skifree on Windows Vista. (Interestingly, the problem doesn't occur on winxp.)
I put "there is a 16-bit freely downloadable version of skifree" on my personal list of trivia that ended up becoming useful. :D
5 - Like 4, but know that SeaBIOS doesn't use the bottom half of the first 1K of EBDA and use that. Same problems as 4.
The 1K EBDA has been around for 15 years now, hasn't it?
6 - Try to detect if the code is called in VM86 mode and don't use the extra stack then - see patch below. The patch does make skifree work, but I'm uncertain if it would catch other users (eg, kvm on some intel chipsets?, some old dos program if dos is using emm386 mode).
No, KVM hides the fact that you are in protected mode. EMM386 would be affected, but then it is not impossible for old programs to require disabling it.
All in all (5) or (6) both sound good.
Paolo
On Mon, Jan 12, 2015 at 07:25:58PM +0100, Paolo Bonzini wrote:
On 12/01/2015 19:19, Kevin O'Connor wrote:
I was able to reproduce this locally with 16bit skifree on Windows Vista. (Interestingly, the problem doesn't occur on winxp.)
I put "there is a 16-bit freely downloadable version of skifree" on my personal list of trivia that ended up becoming useful. :D
5 - Like 4, but know that SeaBIOS doesn't use the bottom half of the first 1K of EBDA and use that. Same problems as 4.
The 1K EBDA has been around for 15 years now, hasn't it?
Way longer than that. Not sure when it was introduced, but my Phoenix "CBIOS" book from 1989 has it. So, at least 25 years.
6 - Try to detect if the code is called in VM86 mode and don't use the extra stack then - see patch below. The patch does make skifree work, but I'm uncertain if it would catch other users (eg, kvm on some intel chipsets?, some old dos program if dos is using emm386 mode).
No, KVM hides the fact that you are in protected mode. EMM386 would be affected, but then it is not impossible for old programs to require disabling it.
I was under the vague impression that kvm uses VM86 mode to run 16bit code on some Intel chipsets. The SMSW instruction isn't privileged so I didn't think it could be hidden.
All in all (5) or (6) both sound good.
Thanks, -Kevin
No, KVM hides the fact that you are in protected mode. EMM386 would be affected, but then it is not impossible for old programs to require disabling it.
I was under the vague impression that kvm uses VM86 mode to run 16bit code on some Intel chipsets. The SMSW instruction isn't privileged so I didn't think it could be hidden.
It isn't privileged indeed (nice trick in fact!), but that doesn't matter for VT-x extensions.
Old processors let you run the processor in VMX non-root mode (i.e. as a VM) only in protected mode, so KVM uses VM86 when the processor is in real mode (and uses an interpreter while in big real mode or during real<->protected mode transitions).
But all the bells and whistles of VMX still apply, including the ability to fake the value of CR0 for both MOV and [LS]MSW instructions.
Paolo
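A rough model of the faking described above, assuming the usual VMX rule that a guest read of a CR0 bit returns the read-shadow bit when the corresponding guest/host mask bit is set, and the real bit otherwise. The function name and the sample register values are illustrative and are not taken from KVM's source.

```python
CR0_PE = 1 << 0  # protection-enable bit


def guest_visible_cr0(real_cr0: int, guest_host_mask: int, read_shadow: int) -> int:
    """Value a guest observes when it reads CR0 (MOV from CR0 or SMSW).

    Bits owned by the host (set in guest_host_mask) come from the read
    shadow; all other bits come from the real register.
    """
    return (real_cr0 & ~guest_host_mask) | (read_shadow & guest_host_mask)


# Illustrative case: the hardware really runs with PE set (VM86 is used
# internally), but the hypervisor owns the PE bit and shadows it as 0.
real = 0x80000031
seen = guest_visible_cr0(real, guest_host_mask=CR0_PE, read_shadow=0)
print(hex(seen), bool(seen & CR0_PE))
```

In this model a real-mode guest under such a hypervisor still observes PE clear, whereas code running in a VM86 task created by the guest OS itself observes PE set.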
On Mon, Jan 12, 2015 at 02:00:24PM -0500, Paolo Bonzini wrote:
No, KVM hides the fact that you are in protected mode. EMM386 would be affected, but then it is not impossible for old programs to require disabling it.
I was under the vague impression that kvm uses VM86 mode to run 16bit code on some Intel chipsets. The SMSW instruction isn't privileged so I didn't think it could be hidden.
It isn't privileged indeed (nice trick in fact!), but that doesn't matter for VT-x extensions.
Old processors let you run the processor in VMX non-root mode (i.e. as a VM) only in protected mode, so KVM uses VM86 when the processor is in real mode (and uses an interpreter while in big real mode or during real<->protected mode transitions).
But all the bells and whistles of VMX still apply, including the ability to fake the value of CR0 for both MOV and [LS]MSW instructions.
Okay, so it fakes real-mode by setting up a protected mode guest with a fake CR0 that is running vm86, and so SMSW still returns a value with PE off? (As opposed to a regular guest that itself launches a VM86 instance, in which case CR0 from SMSW would have PE on.)
Good to know - thanks. -Kevin
On 12/01/2015 20:36, Kevin O'Connor wrote:
Okay, so it fakes real-mode by setting up a protected mode guest with a fake CR0 that is running vm86, and so SMSW still returns a value with PE off? (As opposed to a regular guest that itself launches a VM86 instance, in which case CR0 from SMSW would have PE on.)
Yes.
Paolo
Your patch implementing idea 6 fixes the problem for my business application on a stock Ubuntu package (with extra stack enabled).
On Mon, Jan 12, 2015 at 01:13:54PM -0600, Richard Laager wrote:
Your patch implementing idea 6 fixes the problem for my business application on a stock Ubuntu package (with extra stack enabled).
Thanks.
The key part of option 5 (as described in my previous email) looks like the patch below. It also works with skifree on Vista for me.
-Kevin
--- a/vgasrc/vgaentry.S
+++ b/vgasrc/vgaentry.S
@@ -111,8 +111,10 @@ entry_10_extrastack:
         cld
         pushw %ds               // Set %ds:%eax to space on ExtraStack
         pushl %eax
-        movw %cs:ExtraStackSeg, %ds
-        movl $(CONFIG_VGA_EXTRA_STACK_SIZE-PUSHBREGS_size-16), %eax
+        movw $SEG_BDA, %ax
+        movw %ax, %ds
+        movw 0x0e, %ds
+        movl $(1024-PUSHBREGS_size-16), %eax
         SAVEBREGS_POP_DSEAX     // Save registers on extra stack
         movl %esp, PUSHBREGS_size+8(%eax)
         movw %ss, PUSHBREGS_size+12(%eax)
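For readers following the addressing in this patch: it loads %ds from the word at 0x40:0x0e, which the BIOS Data Area uses for the EBDA segment, and then points %eax near the top of the first 1K of that segment. Below is a small sketch of the same arithmetic. The parameter `pushbregs_size` stands in for PUSHBREGS_size, whose real value is not given in this thread; the value 44 and the memory image are made up for illustration.

```python
import struct

SEG_BDA = 0x40          # BIOS Data Area segment
EBDA_SEG_OFFSET = 0x0E  # word in the BDA that holds the EBDA segment


def ebda_segment(memory: bytes) -> int:
    """Read the EBDA segment from the BDA word at 0x40:0x0e (linear 0x40e)."""
    addr = (SEG_BDA << 4) + EBDA_SEG_OFFSET
    (segment,) = struct.unpack_from("<H", memory, addr)
    return segment


def save_area(memory: bytes, pushbregs_size: int) -> tuple[int, int, int]:
    """Return (segment, offset, linear address) of the register save area."""
    segment = ebda_segment(memory)
    offset = 1024 - pushbregs_size - 16  # same expression as in the patch
    return segment, offset, (segment << 4) + offset


# Made-up low-memory image with the EBDA at segment 0x9FC0.
image = bytearray(0x500)
struct.pack_into("<H", image, 0x40E, 0x9FC0)
print(save_area(image, pushbregs_size=44))
```

Note that this hard-codes the 1K size of the EBDA area, which is why the discussion above turns to how long a 1K EBDA has been standard.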
On Mon, 2015-01-12 at 15:06 -0500, Kevin O'Connor wrote:
On Mon, Jan 12, 2015 at 01:13:54PM -0600, Richard Laager wrote:
Your patch implementing idea 6 fixes the problem for my business application on a stock Ubuntu package (with extra stack enabled).
Thanks.
The key part of option 5 (as described in my previous email) looks like the patch below. It also works with skifree on Vista for me.
-Kevin
--- a/vgasrc/vgaentry.S
+++ b/vgasrc/vgaentry.S
@@ -111,8 +111,10 @@ entry_10_extrastack:
         cld
         pushw %ds               // Set %ds:%eax to space on ExtraStack
         pushl %eax
-        movw %cs:ExtraStackSeg, %ds
-        movl $(CONFIG_VGA_EXTRA_STACK_SIZE-PUSHBREGS_size-16), %eax
+        movw $SEG_BDA, %ax
+        movw %ax, %ds
+        movw 0x0e, %ds
+        movl $(1024-PUSHBREGS_size-16), %eax
         SAVEBREGS_POP_DSEAX     // Save registers on extra stack
         movl %esp, PUSHBREGS_size+8(%eax)
         movw %ss, PUSHBREGS_size+12(%eax)
I may have screwed something up while applying, but when I tested with git plus this patch, I didn't get any video output during booting.
I tested with Ubuntu's seabios 1.7.4, and adding a modified version of this patch did fix the problem. The changes seemed pretty obvious in context. It seems that BREGS_size-8 was changed to PUSHBREGS_size-16 somewhere after 1.7.4. Here's the patch that I applied to 1.7.4:
Index: seabios-1.7.4/vgasrc/vgaentry.S
===================================================================
--- seabios-1.7.4.orig/vgasrc/vgaentry.S 2015-01-12 15:28:20.060000981 -0600
+++ seabios-1.7.4/vgasrc/vgaentry.S 2015-01-12 15:29:35.296142288 -0600
@@ -101,8 +101,10 @@
         cld
         pushw %ds               // Set %ds:%eax to space on ExtraStack
         pushl %eax
-        movw %cs:ExtraStackSeg, %ds
-        movl $(CONFIG_VGA_EXTRA_STACK_SIZE-BREGS_size-8), %eax
+        movw $SEG_BDA, %ax
+        movw %ax, %ds
+        movw 0x0e, %ds
+        movl $(1024-BREGS_size-8), %eax
         popl BREGS_eax(%eax)    // Backup registers
         popw BREGS_ds(%eax)
         movl %edi, BREGS_edi(%eax)
So it seems you have two working solutions (though I should retest with git, if you want to go with option 5).
If/when you choose one of these and commit it, I'll update the Ubuntu bug asking them to deploy the patch as an SRU (stable release update).
On Mon, 2015-01-12 at 16:07 -0600, Richard Laager wrote:
I may have screwed something up while applying, but when I tested with git plus this patch, I didn't get any video output during booting.
...
So it seems you have two working solutions (though I should retest with git, if you want to go with option 5).
Have you decided on a preferred approach? If option 5, I'll retest your patch with seabios git.
On Fri, Jan 16, 2015 at 10:32:45PM -0600, Richard Laager wrote:
On Mon, 2015-01-12 at 16:07 -0600, Richard Laager wrote:
I may have screwed something up while applying, but when I tested with git plus this patch, I didn't get any video output during booting.
...
So it seems you have two working solutions (though I should retest with git, if you want to go with option 5).
Have you decided on a preferred approach? If option 5, I'll retest your patch with seabios git.
I'm inclined to go with option 5. The real patch for that option, however, is a little bigger. I should be able to send a more complete patch in the next couple of days.
Thanks, -Kevin
Kevin O'Connor wrote:
I can think of a few options:
7 - use top of video memory as stack?
//Peter
Paolo Bonzini wrote:
I can think of a few options:
7 - use top of video memory as stack?
That would be pretty slow on KVM, since video memory is MMIO.
slow reliable > fast unreliable
But worse, one could imagine that NTVDM blocks a0000-bffff as well.
I doubt that? Then how would the VGA BIOS write its data?
//Peter
On 13/01/2015 15:30, Peter Stuge wrote:
That would be pretty slow on KVM, since video memory is MMIO.
slow reliable > fast unreliable
But worse, one could imagine that NTVDM blocks a0000-bffff as well.
I doubt that? Then how would the VGA BIOS write its data?
The question is more "why is NTVDM executing the VGA BIOS?" Remember this is a 16-bit windowed application. It's possible that some old Win16 app was doing INT 10h calls and so they have to support it in NTVDM: they initialize the VGABIOS, but do not write to VRAM because... nothing should be there, right?
Not writing the data to VRAM could very well be intentional.
Paolo