Hi Kevin,
On Fri, 12 Feb 2016 21:49:04 -0500 "Kevin O'Connor" kevin@koconnor.net wrote:
On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:
Intel IGD makes use of memory allocated and marked reserved by the BIOS as a stolen memory range. For the most part, guest drivers don't make use of this, but our achilles heel is the vBIOS. The vBIOS programs the device to use the host stolen memory range and it's used in the pre-boot environment. Generally the guest won't have access to the host stolen memory area, so these accesses either land in VM memory or unassigned space and generate IOMMU faults. By allocating this range in SeaBIOS and programming it into the device, QEMU (via vfio) can make sure this guest allocated stolen memory range is used instead.
What does "vBIOS" mean in this context? Is it the video device option rom or something else?
vBIOS = video BIOS, you're correct, it's just the video device option ROM.
Signed-off-by: Alex Williamson alex.williamson@redhat.com
src/fw/pciinit.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c index 92170d5..c1ad5d4 100644 --- a/src/fw/pciinit.c +++ b/src/fw/pciinit.c @@ -260,7 +260,7 @@ static void ich9_smbus_setup(struct pci_device *dev, void *arg) static void intel_igd_opregion_setup(struct pci_device *dev, void *arg) { struct romfile_s *file = romfile_find("etc/igd-opregion");
- void *opregion;
void *opregion, *bdsm; u16 bdf = dev->bdf;
if (!file || !file->size)
@@ -281,6 +281,17 @@ static void intel_igd_opregion_setup(struct pci_device *dev, void *arg)
dprintf(1, "Intel IGD OpRegion enabled on %02x:%02x.%x\n", pci_bdf_to_bus(bdf), pci_bdf_to_dev(bdf), pci_bdf_to_fn(bdf));
- bdsm = memalign_high(1024 * 1024, 1024 * 1024);
- if (!bdsm) {
warn_noalloc();
return;
- }
The "high" memory pool is not a good fit for such a large allocation. For so much space, I'd use memalign_tmphigh() followed by e820_add(addr, size, E820_RESERVED).
Ok, as you saw on IRC I was looking for alternatives, I can easily switch to this.
- pci_config_writel(bdf, 0x5C, cpu_to_le32((u32)bdsm));
- dprintf(1, "Allocated 1MB reserved memory for Intel IGD stolen memory at "
"0x%08x\n", (u32)bdsm);
}
Does this make sense to do unconditionally in the firmware whenever the device is present? The SeaBIOS release schedule is not in sync with QEMU, so changes in the firmware tend to take longer to deploy if something needs to be done differently in the future.
What happens if this register is not set? Does anything go wrong if register 0xFC is set, but 0x5C is not (eg, due to allocation failure).
It's not entirely unconditional, it's tagged onto the end of the OpRegion setup, so we know that we're at least dealing with a QEMU that has that support. We could link it to the PCI ROM BAR for IGD being present since this is necessary to support the code in that ROM, but that doesn't give us any idea if the QEMU support is there. Maybe we could do both, OpRegion support and ROM BAR present.
The write to 0x5C is used by QEMU code that traps writes to the device I/O port BAR and replaces the host stolen memory address with the new guest address generated here. 0x5C is initialized to 0x0 by kernel vfio code, so we can detect whether it has been written. If not written, QEMU has no place to redirect to for stolen memory and it will either overlap VM memory or an unassigned area. The former may corrupt VM memory, the latter throws host errors. We could in QEMU halt with a hardware error if 0x5C hasn't been programmed.
One alternative I was considering was to simply use the unused BAR5 slot on IGD to expose a fake 1MB, 32bit MMIO BAR. SeaBIOS would program this normally, QEMU would back it with its own allocation and we could easily find the address of it when we need it. The downside being that the PCI BAR can be disabled and remapped, but for the small window where we need it, I'm not sure that matters. It would certainly be more transparent to the VM BIOS. It would also make it much easier to change the size if we found some devices need more/less/none space. I initially thought the fake BAR wasn't a great fix, but maybe its advantages outweigh its downsides. I'll give it a shot. Thanks,
Alex