On 02/18/13 13:53, David Woodhouse wrote:
> Nevertheless, on my workstation as on yours, we do seem to end up executing from the CSM in RAM when we reset. But on my laptop, it executes the *ROM* as it should.
>
> This patch 'fixes' it, and I think it might even be correct in itself, but I don't think it's a correct fix for the problem we're discussing. And I certainly want to know what's different on my laptop that makes it work *without* this patch.
>
> Either there's some weirdness with setting the high CS base address, on CPU reset. Or perhaps the contents of the memory region at 0xfffffff0 have *really* been changed along with the sub-1MiB range. Or maybe the universe just hates us...
We're ending up in the wrong place, under 1MB (which is consistent with your "reset the PAMs" patch -- state of PAMs should only matter below 1MB).
I single-stepped qemu-1.3.1 in x86_cpu_reset() / cpu_x86_load_seg_cache(), and we seem to set the correct base. However, when I pause the VM while it is spinning in the reset loop and issue the following in virsh:
  # qemu-monitor-command --domain \
      fw-mixed.g-f18xfce2012121716.e-upstream --hmp --cmd \
      cpu 0

  # qemu-monitor-command --domain \
      fw-mixed.g-f18xfce2012121716.e-upstream --hmp --cmd \
      info registers
for EIP and CS I get (from cpu_x86_dump_seg_cache(), in the "HF_CS64_MASK clear" branch):
  EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000623
  ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
  EIP=0000fff0 EFL=00000002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
  ES =0000 00000000 0000ffff 0000f300
  CS =f000 000f0000 0000ffff 0000f300
      ^    ^        ^        ^
      |    base     limit    flags
      selector
  SS =0000 00000000 0000ffff 0000f300
  DS =0000 00000000 0000ffff 0000f300
  FS =0000 00000000 0000ffff 0000f300
  GS =0000 00000000 0000ffff 0000f300
  LDT=0000 00000000 0000ffff 00008200
  TR =0000 feffd000 00002088 00008b00
  GDT=     00000000 0000ffff
  IDT=     00000000 0000ffff
  CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
  DR0=0000000000000000 DR1=0000000000000000
  DR2=0000000000000000 DR3=0000000000000000
  DR6=00000000ffff0ff0 DR7=0000000000000400
  EFER=0000000000000000
  FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
  FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
  FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
  FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
  FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
  XMM00=00000000000000000000000000000000
  XMM01=00000000000000000000000000000000
  XMM02=00000000000000000000000000000000
  XMM03=00000000000000000000000000000000
  XMM04=00000000000000000000000000000000
  XMM05=00000000000000000000000000000000
  XMM06=00000000000000000000000000000000
  XMM07=00000000000000000000000000000000
(1) The three high nibbles of CS base are lost.
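To make (1) concrete, here is a minimal C sketch (not QEMU code; the helper name first_fetch_addr is hypothetical) of where the first instruction fetch lands with the reset base 0xffff0000 versus the dumped base 0x000f0000, using EIP=0000fff0 from the dump above:

```c
#include <stdint.h>

/*
 * Hypothetical helper: linear address of the first instruction fetch,
 * i.e. the cached CS base plus the instruction pointer.
 */
uint32_t first_fetch_addr(uint32_t cs_base, uint32_t eip)
{
    return cs_base + eip;
}
```

With base 0xffff0000 this gives 0xfffffff0; with base 0x000f0000 it gives 0x000ffff0, which is below 1MB, where the state of the PAMs matters.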
Furthermore, the flags value is (Intel SDM Vol.3A, 3.4.5):
  1 11  1 0011 00000000
  P DPL S type base 23:16
  ^ ^   ^
  | |   descriptor type (1 == code or data segment,
  | |   0 == system segment), DESC_S_MASK
  | descriptor privilege level (3 == least privileged)
  segment present, DESC_P_MASK
The "type" field depends on the S bit (here 1 == code/data). 0011b means (see 3.4.5.1):
  0   0 1 1
  D/C E W A
      C R A
  ^   ^ ^ ^
  |   | | accessed, DESC_A_MASK
  |   | |
  |   | for data: 0==r/o, 1==r/w
  |   | for code: 0==exec/only, 1==exec/read, DESC_R_MASK
  |   |
  |   for data: 1==expand down
  |   for code: 1==conforming
  |
  0 == data, 1 == code, DESC_CS_MASK
The type dumped by "info registers" is "data segment, expand up, read/write, accessed".
I believe the D/C bit (bit 11) should be set, and then 1011b would mean "code segment, non-conforming, exec/read, accessed".
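The decoding above can be checked mechanically. The following is a small C sketch, not taken from QEMU: the SEG_* macros and helper names are local to this example, and the bit positions are assumed from the layout in SDM 3.4.5 as it appears in the dumped flags word (bits 8..15).

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative names; positions follow the flags word shown by "info registers". */
#define SEG_A          (1u << 8)   /* accessed */
#define SEG_RW         (1u << 9)   /* data: writable, code: readable */
#define SEG_EC         (1u << 10)  /* data: expand down, code: conforming */
#define SEG_CODE       (1u << 11)  /* 0 == data, 1 == code */
#define SEG_S          (1u << 12)  /* 1 == code or data, 0 == system */
#define SEG_DPL_SHIFT  13
#define SEG_P          (1u << 15)  /* segment present */

unsigned seg_dpl(uint32_t flags)  { return (flags >> SEG_DPL_SHIFT) & 3u; }
unsigned seg_type(uint32_t flags) { return (flags >> 8) & 0xfu; }

/* Nonzero only for a non-system segment with the code bit set. */
int seg_is_code(uint32_t flags)
{
    return (flags & SEG_S) != 0 && (flags & SEG_CODE) != 0;
}

/* Print the decoded fields of a code/data segment's flags word. */
void seg_print(uint32_t flags)
{
    printf("P=%u DPL=%u S=%u type=%x (%s%s%s%s)\n",
           (flags & SEG_P) != 0, seg_dpl(flags), (flags & SEG_S) != 0,
           seg_type(flags),
           seg_is_code(flags) ? "code" : "data",
           (flags & SEG_EC) ? (seg_is_code(flags) ? ", conforming" : ", expand down") : "",
           (flags & SEG_RW) ? (seg_is_code(flags) ? ", exec/read" : ", read/write") : "",
           (flags & SEG_A) ? ", accessed" : "");
}
```

For the dumped value 0x0000f300, seg_type() yields 0x3 and seg_is_code() yields 0; OR-ing in SEG_CODE gives type 0xb, the code segment encoding expected here.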
(2) x86_cpu_reset() does pass DESC_CS_MASK for R_CS, but it doesn't seem to be present in the dumped value.
I have no idea what's going on, but vmx_set_segment() in the kernel has a bunch of hacks for CS && selector == 0xf000 && base == 0xffff0000, and it seems to be host-processor dependent. E.g. from commit b246dd5d:
    /*
     * Fix segments for real mode guest in hosts that don't have
     * "unrestricted_mode" or it was disabled.
     * This is done to allow migration of the guests from hosts with
     * unrestricted guest like Westmere to older host that don't have
     * unrestricted guest like Nehelem.
     */
    if (vmx->rmode.vm86_active) {
        switch (seg) {
        case VCPU_SREG_CS:
            vmcs_write32(GUEST_CS_AR_BYTES, 0xf3);
            vmcs_write32(GUEST_CS_LIMIT, 0xffff);
            if (vmcs_readl(GUEST_CS_BASE) == 0xffff0000)
                vmcs_writel(GUEST_CS_BASE, 0xf0000);
            vmcs_write16(GUEST_CS_SELECTOR,
                         vmcs_readl(GUEST_CS_BASE) >> 4);
            break;
Also in init_vmcb() [arch/x86/kvm/svm.c] I can see (from commit d92899a0):
    /*
     * cs.base should really be 0xffff0000, but vmx can't handle that, so
     * be consistent with it.
     *
     * Replace when we have real mode working for vmx.
     */
    save->cs.base = 0xf0000;
Going back to vmx, vmx_vcpu_reset() [arch/x86/kvm/vmx.c]:
    /*
     * GUEST_CS_BASE should really be 0xffff0000, but VT vm86 mode
     * insists on having GUEST_CS_BASE == GUEST_CS_SELECTOR << 4.  Sigh.
     */
    if (kvm_vcpu_is_bsp(&vmx->vcpu)) {
        vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
        vmcs_writel(GUEST_CS_BASE, 0x000f0000);
    } else {
        vmcs_write16(GUEST_CS_SELECTOR, vmx->vcpu.arch.sipi_vector << 8);
        vmcs_writel(GUEST_CS_BASE, vmx->vcpu.arch.sipi_vector << 12);
    }
The leading comment and the main logic date back to commit 6aa8b732 ([PATCH] kvm: userspace interface).
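The constraint named in that comment (base == selector << 4) can be illustrated with a short C sketch. This is not kernel code; the helper names vm86_base and vm86_representable are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Under the vm86 constraint, the segment base is implied by the selector. */
uint32_t vm86_base(uint16_t selector)
{
    return (uint32_t)selector << 4;
}

/*
 * Nonzero if some 16-bit selector satisfies base == selector << 4:
 * the base must be 16-byte aligned and at most 0xffff0.
 */
int vm86_representable(uint32_t base)
{
    return (base & 0xfu) == 0 && base <= 0x000ffff0u;
}
```

Selector 0xf000 implies base 0x000f0000, whereas 0xffff0000 is not representable under this constraint. The non-BSP branch is consistent with it as well: a selector of vector << 8 implies a base of vector << 12.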
(3) I wanted to ask you whether your laptop CPU is "more modern" than your workstation CPU, but from your other email I guess they're indeed different.
Laszlo