Hi all,
I've run into an issue where the 64-bit address window allocated for one of the PCI-e host devices overlaps with a BIOS reserved range of fd00000000-ffffffffff, making that device unusable.
4000000000-7bfffffffff : PCI Bus 0000:00
  4000000000-bfffffffff : PCI Bus 0000:0c
    4000000000-7fffffffff : 0000:0c:00.0
    8000000000-80001fffff : 0000:0c:00.0
  c000000000-13fffffffff : PCI Bus 0000:0b
    c000000000-ffffffffff : 0000:0b:00.0
      fd00000000-ffffffffff : Reserved
    10000000000-100001fffff : 0000:0b:00.0

8000000000-7ffffffffff : PCI Bus 0000:00
  8000000000-ffffffffff : PCI Bus 0000:0c
    8000000000-bfffffffff : 0000:0c:00.0
    c000000000-c0001fffff : 0000:0c:00.0
    fd00000000-ffffffffff : Reserved
  10000000000-17fffffffff : PCI Bus 0000:0b
    10000000000-13fffffffff : 0000:0b:00.0
    14000000000-140001fffff : 0000:0b:00.0
I found that increasing the system RAM of the VM to ~256G changes the address window and avoids the issue, and with some digging I think I found the root cause.
To begin, QEMU marks this address range as reserved here, and this is passed to SeaBIOS via "etc/e820".
https://gitlab.com/qemu-project/qemu/-/blob/master/hw/i386/pc.c#L865
Separately, QEMU passes "etc/reserved-memory-end" to SeaBIOS here; it is derived from machine->device_memory.
https://gitlab.com/qemu-project/qemu/-/blob/master/hw/i386/pc.c#L1007
In SeaBIOS, "etc/e820" is consumed here, which sets RamSizeOver4G using only the E820_RAM entries, ignoring any E820_RESERVED entries.
https://gitlab.com/qemu-project/seabios/-/blob/master/src/fw/paravirt.c#L782
Later, "etc/reserved-memory-end" and RamSizeOver4G are used to determine the start of the PCI-e address window.
https://gitlab.com/qemu-project/seabios/-/blob/master/src/fw/pciinit.c#L1138
I think either QEMU should set etc/reserved-memory-end to be after both physical memory and the reserved ranges, or SeaBIOS needs to check both etc/e820 and etc/reserved-memory-end. But I'm not sure which would be the correct move, or how to patch them.
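To make the idea concrete, here is a minimal standalone sketch of a window-start calculation that also steps past reserved ranges above 4G. It is not SeaBIOS or QEMU code: `struct e820_sketch`, `window_start_ram_only` and `window_start_with_reserved` are hypothetical names, the "RAM only" function is just my reading of the behaviour described above, and any alignment the real code applies is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define E820_RAM      1
#define E820_RESERVED 2
#define FOUR_GIB      0x100000000ULL

/* Hypothetical stand-in for an e820 map entry; not the SeaBIOS struct. */
struct e820_sketch {
    uint64_t start;
    uint64_t size;
    uint32_t type;
};

static uint64_t max_u64(uint64_t a, uint64_t b)
{
    return a > b ? a : b;
}

/* Behaviour as described above: only RAM above 4G and
 * "etc/reserved-memory-end" influence where the window starts. */
uint64_t window_start_ram_only(uint64_t reserved_memory_end,
                               uint64_t ram_over_4g)
{
    return max_u64(reserved_memory_end, FOUR_GIB + ram_over_4g);
}

/* Assumed alternative: additionally move the start past the end of
 * every E820_RESERVED entry that ends above 4G. */
uint64_t window_start_with_reserved(uint64_t reserved_memory_end,
                                    uint64_t ram_over_4g,
                                    const struct e820_sketch *map,
                                    size_t count)
{
    uint64_t start = window_start_ram_only(reserved_memory_end, ram_over_4g);

    for (size_t i = 0; i < count; i++) {
        if (map[i].type != E820_RESERVED)
            continue;
        uint64_t end = map[i].start + map[i].size;
        if (end > FOUR_GIB)
            start = max_u64(start, end);
    }
    return start;
}
```

With a reserved entry covering fd00000000-ffffffffff (start 0xfd00000000, size 0x300000000), the second function returns at least 0x10000000000, so nothing would be placed on top of the reserved range. Whether pushing the whole window up like this is acceptable, or whether the allocator should instead skip over the hole, is exactly the part I'm unsure about.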
Regards,
Yunxiang Li (Teddy)
Currently, PCI regions on busses[0] may be migrated to the
64-bit MMIO space. This can cause reads and writes of the
device config space to go wrong.
Example:
A modern virtio device mapped into the 64-bit MMIO space will have
its mode set to VP_ACCESS_PCICFG, but the real device cap is
VIRTIO_PCI_CAP_COMMON_CFG, so we cannot access the cap
correctly.
If this is a virtio blk/scsi device used as the system disk, the VM
will not boot from the device.
A simple solution is to make devices use the 32-bit address
space as much as possible.
This patch changes the placement of the PCI BARs accordingly.
[1] commit 0e21548b15e2 ("virtio: pci cfg access")
Signed-off-by: hulang <hulang13(a)huawei.com>
---
src/fw/pciinit.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index b3e359d7..2e9544dc 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -1100,8 +1100,12 @@ static void pci_region_map_entries(struct pci_bus *busses, struct pci_region *r)
{
struct hlist_node *n;
struct pci_region_entry *entry;
+ u64 r_end = r->base + pci_region_sum(r);
+
hlist_for_each_entry_safe(entry, n, &r->list, node) {
u64 addr = r->base;
+ if (addr + entry->size >= r_end)
+ continue;
r->base += entry->size;
if (entry->bar == -1)
// Update bus base address if entry is a bridge region
@@ -1114,6 +1118,8 @@ static void pci_region_map_entries(struct pci_bus *busses, struct pci_region *r)
static void pci_bios_map_devices(struct pci_bus *busses)
{
+ int bus;
+
if (pci_bios_init_root_regions_io(busses))
panic("PCI: out of I/O address space\n");
@@ -1122,6 +1128,15 @@ static void pci_bios_map_devices(struct pci_bus *busses)
struct pci_region r64_mem, r64_pref;
r64_mem.list.first = NULL;
r64_pref.list.first = NULL;
+
+ // try map pci region to 32bit
+ for (bus = 0; bus<=MaxPCIBus; bus++) {
+ int type;
+ for (type = 0; type < PCI_REGION_TYPE_COUNT; type++)
+ pci_region_map_entries(busses, &busses[bus].r[type]);
+ }
+
+ // map remaining pci region to 64bit
pci_region_migrate_64bit_entries(&busses[0].r[PCI_REGION_TYPE_MEM],
&r64_mem);
pci_region_migrate_64bit_entries(&busses[0].r[PCI_REGION_TYPE_PREFMEM],
@@ -1166,7 +1181,6 @@ static void pci_bios_map_devices(struct pci_bus *busses)
pcimem64_start = 0;
}
// Map regions on each device.
- int bus;
for (bus = 0; bus<=MaxPCIBus; bus++) {
int type;
for (type = 0; type < PCI_REGION_TYPE_COUNT; type++)
--
2.33.0
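
The placement policy described in the commit message (try the 32-bit window first, fall back to the 64-bit window for what is left) can be illustrated in isolation. The following is a minimal standalone sketch under stated assumptions, not the patched SeaBIOS code: `struct bar_sketch`, `place_in_window` and `place_32_then_64` are hypothetical names, the windows are passed in by the caller, and BAR alignment and bridge windows are ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified BAR request; not struct pci_region_entry. */
struct bar_sketch {
    uint64_t size;    /* bytes requested */
    bool     is64;    /* BAR can decode 64-bit addresses */
    bool     placed;  /* set once an address has been assigned */
    uint64_t addr;    /* assigned address, valid only if placed */
};

/* First-fit pass over one window [base, end). Entries that are already
 * placed, that do not fit in the remaining space, or (when need64 is
 * set) that cannot decode 64-bit addresses are skipped.
 * Returns the number of entries placed in this pass. */
static size_t place_in_window(struct bar_sketch *bars, size_t n,
                              uint64_t base, uint64_t end, bool need64)
{
    size_t count = 0;
    uint64_t cur = base;

    for (size_t i = 0; i < n; i++) {
        struct bar_sketch *b = &bars[i];
        if (b->placed || (need64 && !b->is64))
            continue;
        if (b->size > end - cur)
            continue;
        b->addr = cur;
        b->placed = true;
        cur += b->size;
        count++;
    }
    return count;
}

/* Try the 32-bit window first, then the 64-bit window.
 * Returns the number of entries that could not be placed at all. */
size_t place_32_then_64(struct bar_sketch *bars, size_t n,
                        uint64_t lo_base, uint64_t lo_end,
                        uint64_t hi_base, uint64_t hi_end)
{
    size_t placed = place_in_window(bars, n, lo_base, lo_end, false);
    placed += place_in_window(bars, n, hi_base, hi_end, true);
    return n - placed;
}
```

In this sketch an entry that ends exactly at the end of a window is still accepted, and only 64-bit capable entries that did not fit below 4G are moved to the high window.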
Best Regards!
胡 浪 Hu Lang
Huawei Cloud Computing Company, Adecco, Software Development Engineer