The series fixes some issues when more than one primary root bus is present.
The first patch scans the whole bus range to find the extra root buses. The second patch extends memory and IO mapping to cover the buses found.
Marcel Apfelbaum (2):
  fw/pci: scan all buses if extraroots romfile is present
  fw/pci: map memory and IO regions for multiple pci root buses
 src/fw/pciinit.c | 123 +++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 88 insertions(+), 35 deletions(-)
If there are extra primary root buses, scanning bus 0's subtree is not enough. Scan the whole bus range.
Signed-off-by: Marcel Apfelbaum <marcel.a@redhat.com>
---
 src/fw/pciinit.c | 9 +++++++++
 1 file changed, 9 insertions(+)
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index fd5dfb9..a5f6505 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -499,8 +499,17 @@ pci_bios_init_bus_rec(int bus, u8 *pci_bus)
 static void
 pci_bios_init_bus(void)
 {
+    u8 extraroots = romfile_loadint("etc/extra-pci-roots", 0);
     u8 pci_bus = 0;
+
     pci_bios_init_bus_rec(0 /* host bus */, &pci_bus);
+
+    if (extraroots) {
+        while (pci_bus < 0xff) {
+            pci_bus++;
+            pci_bios_init_bus_rec(pci_bus, &pci_bus);
+        }
+    }
 }
Removed the assumption that the system has only one primary root bus. When mapping memory and IO regions, go over all buses, skipping secondary and absent buses.
Signed-off-by: Marcel Apfelbaum <marcel.a@redhat.com>
---
 src/fw/pciinit.c | 114 ++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 79 insertions(+), 35 deletions(-)
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index a5f6505..97b1af6 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -763,43 +763,80 @@ static int pci_bios_init_root_regions_io(struct pci_bus *bus)
      *                    where they want, on older versions its fixed ]
      *   c000 - ffff    free, traditionally used for pci io
      */
-    struct pci_region *r_io = &bus->r[PCI_REGION_TYPE_IO];
-    u64 sum = pci_region_sum(r_io);
-    if (sum < 0x4000) {
-        /* traditional region is big enougth, use it */
-        r_io->base = 0xc000;
-    } else if (sum < pci_io_low_end - 0x1000) {
-        /* use the larger region at 0x1000 */
-        r_io->base = 0x1000;
-    } else {
-        /* not enouth io address space -> error out */
-        return -1;
+
+    u64 sum_pci = 0;  /* c000 - ffff */
+    u64 sum_free = 0; /* 1000 - 9fff */
+    int i;
+
+    for (i = 0; i <= MaxPCIBus; i++) {
+        struct pci_region *r_io = &bus[i].r[PCI_REGION_TYPE_IO];
+        u64 sum = pci_region_sum(r_io);
+        if (bus[i].bus_dev || !sum)
+            continue;
+
+        sum_pci = ALIGN(sum_pci, sum);
+        sum_free = ALIGN(sum_free, sum);
+        if (sum + sum_pci < 0x4000) {
+            /* traditional region is big enough, use it */
+            r_io->base = 0xc000 + sum_pci;
+            sum_pci += sum;
+        } else if (sum < pci_io_low_end - 0x1000 - sum_free) {
+            /* use the larger region at 0x1000 */
+            r_io->base = 0x1000 + sum_free;
+            sum_free += sum;
+        } else {
+            /* not enough io address space -> error out */
+            return -1;
+        }
+
+        if (sum) {
+            dprintf(1, "PCI: BUS %d IO: %4llx - %4llx\n",
+                    i, r_io->base, r_io->base + sum - 1);
+        }
     }
-    dprintf(1, "PCI: IO: %4llx - %4llx\n", r_io->base, r_io->base + sum - 1);
     return 0;
 }
 static int pci_bios_init_root_regions_mem(struct pci_bus *bus)
 {
-    struct pci_region *r_end = &bus->r[PCI_REGION_TYPE_PREFMEM];
-    struct pci_region *r_start = &bus->r[PCI_REGION_TYPE_MEM];
-
-    if (pci_region_align(r_start) < pci_region_align(r_end)) {
-        // Swap regions to improve alignment.
-        r_end = r_start;
-        r_start = &bus->r[PCI_REGION_TYPE_PREFMEM];
-    }
-    u64 sum = pci_region_sum(r_end);
-    u64 align = pci_region_align(r_end);
-    r_end->base = ALIGN_DOWN((pcimem_end - sum), align);
-    sum = pci_region_sum(r_start);
-    align = pci_region_align(r_start);
-    r_start->base = ALIGN_DOWN((r_end->base - sum), align);
-
-    if ((r_start->base < pcimem_start) ||
-        (r_start->base > pcimem_end))
-        // Memory range requested is larger than available.
-        return -1;
+    u64 region_end = pcimem_end;
+    int i;
+
+    for (i = 0; i <= MaxPCIBus; i++) {
+        struct pci_region *r_end = &bus[i].r[PCI_REGION_TYPE_PREFMEM];
+        struct pci_region *r_start = &bus[i].r[PCI_REGION_TYPE_MEM];
+
+        if (pci_region_align(r_start) < pci_region_align(r_end)) {
+            // Swap regions to improve alignment.
+            r_end = r_start;
+            r_start = &bus[i].r[PCI_REGION_TYPE_PREFMEM];
+        }
+
+        u64 sum = pci_region_sum(r_end);
+        if (bus[i].bus_dev || !sum)
+            continue;
+
+        u64 align = pci_region_align(r_end);
+        r_end->base = ALIGN_DOWN((region_end - sum), align);
+
+        dprintf(1, "PCI: BUS %d MEM: %4llx - %4llx\n",
+                i, r_end->base, r_end->base + sum - 1);
+
+        sum = pci_region_sum(r_start);
+        align = pci_region_align(r_start);
+        r_start->base = ALIGN_DOWN((r_end->base - sum), align);
+        region_end = r_start->base;
+
+        dprintf(1, "PCI: BUS %d MEM: %4llx - %4llx\n",
+                i, r_start->base, r_start->base + sum - 1);
+
+        if ((r_start->base < pcimem_start) ||
+            (r_start->base > pcimem_end))
+            // Memory range requested is larger than available.
+            return -1;
+
+    }
+
     return 0;
 }
@@ -864,12 +901,19 @@ static void pci_bios_map_devices(struct pci_bus *busses)
     dprintf(1, "PCI: 32: %016llx - %016llx\n", pcimem_start, pcimem_end);
     if (pci_bios_init_root_regions_mem(busses)) {
         struct pci_region r64_mem, r64_pref;
+        int i;
         r64_mem.list.first = NULL;
         r64_pref.list.first = NULL;
-        pci_region_migrate_64bit_entries(&busses[0].r[PCI_REGION_TYPE_MEM],
-                                         &r64_mem);
-        pci_region_migrate_64bit_entries(&busses[0].r[PCI_REGION_TYPE_PREFMEM],
-                                         &r64_pref);
+
+        for (i = 0; i <= MaxPCIBus; i++) {
+            if (busses[i].bus_dev)
+                continue;
+
+            pci_region_migrate_64bit_entries(&busses[i].r[PCI_REGION_TYPE_MEM],
+                                             &r64_mem);
+            pci_region_migrate_64bit_entries(&busses[i].r[PCI_REGION_TYPE_PREFMEM],
+                                             &r64_pref);
+        }
 
         if (pci_bios_init_root_regions_mem(busses))
             panic("PCI: out of 32bit address space\n");
On Mo, 2014-11-24 at 11:47 +0200, Marcel Apfelbaum wrote:
Removed the assumption that the system has only one primary root bus. When mapping memory and IO regions, go over all buses, skipping secondary and absent buses.
I guess there are qemu patches adding support for multiple pci busses somewhere?
    for (i = 0; i <= MaxPCIBus; i++) {
        struct pci_region *r_io = &bus[i].r[PCI_REGION_TYPE_IO];
        u64 sum = pci_region_sum(r_io);
        if (bus[i].bus_dev || !sum)
            continue;
        sum_pci = ALIGN(sum_pci, sum);
        sum_free = ALIGN(sum_free, sum);
        if (sum + sum_pci < 0x4000) {
            /* traditional region is big enough, use it */
            r_io->base = 0xc000 + sum_pci;
            sum_pci += sum;
        } else if (sum < pci_io_low_end - 0x1000 - sum_free) {
            /* use the larger region at 0x1000 */
            r_io->base = 0x1000 + sum_free;
            sum_free += sum;
        } else {
            /* not enough io address space -> error out */
            return -1;
        }
Hmm, that assigns the io regions in bus order. I think it would be better to integrate this with the packing logic we already have: regions are sorted by size for best packing. Regions behind pci bridges are grouped together. I think we could group regions belonging to a pci bus in a similar way.
cheers, Gerd
On Mon, 2014-11-24 at 13:28 +0100, Gerd Hoffmann wrote:
On Mo, 2014-11-24 at 11:47 +0200, Marcel Apfelbaum wrote:
Removed the assumption that the system has only one primary root bus. When mapping memory and IO regions, go over all buses, skipping secondary and absent buses.
Hi Gerd, Thanks for the review.
I guess there are qemu patches adding support for multiple pci busses somewhere?
Yes, an RFC will be released soon on the QEMU mailing list. It is a work in progress and I am currently trying to decide whether to wait for a fully functional series or release it sooner for comments.
    for (i = 0; i <= MaxPCIBus; i++) {
        struct pci_region *r_io = &bus[i].r[PCI_REGION_TYPE_IO];
        u64 sum = pci_region_sum(r_io);
        if (bus[i].bus_dev || !sum)
            continue;
        sum_pci = ALIGN(sum_pci, sum);
        sum_free = ALIGN(sum_free, sum);
        if (sum + sum_pci < 0x4000) {
            /* traditional region is big enough, use it */
            r_io->base = 0xc000 + sum_pci;
            sum_pci += sum;
        } else if (sum < pci_io_low_end - 0x1000 - sum_free) {
            /* use the larger region at 0x1000 */
            r_io->base = 0x1000 + sum_free;
            sum_free += sum;
        } else {
            /* not enough io address space -> error out */
            return -1;
        }
Hmm, that assigns the io regions in bus order. I think it would be better to integrate this with the packing logic we already have: regions are sorted by size for best packing. Regions behind pci bridges are grouped together. I think we could group regions belonging to a pci bus in a similar way.
You are right, but only partially. Let me explain (as I understand it). The devices' IO regions behind the bus are *already* packed (sorted by size). The buses themselves are indeed not sorted by their total IO space. The same happens in pci_bios_init_root_regions_mem with the MMIO regions.
It seems that what I have to do is simply replace the current bus enumeration,

    for (i = 0; i <= MaxPCIBus; i++) {

with one that goes over the buses sorted in decreasing order of their total mem/IO consumption.
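Just to make it concrete, a rough untested sketch for the IO case (bus_order[] and the sort are made-up illustration, not existing code):

    /* untested sketch: collect the root buses, then visit them in
     * decreasing order of total IO instead of bus-number order */
    int bus_order[256], nroots = 0, i, j;

    for (i = 0; i <= MaxPCIBus; i++) {
        if (bus[i].bus_dev || !pci_region_sum(&bus[i].r[PCI_REGION_TYPE_IO]))
            continue;                        /* skip secondary/empty buses */
        bus_order[nroots++] = i;
    }
    for (i = 1; i < nroots; i++) {           /* insertion sort, largest first */
        int cur = bus_order[i];
        u64 cur_sum = pci_region_sum(&bus[cur].r[PCI_REGION_TYPE_IO]);
        for (j = i; j > 0; j--) {
            if (pci_region_sum(&bus[bus_order[j-1]].r[PCI_REGION_TYPE_IO])
                >= cur_sum)
                break;
            bus_order[j] = bus_order[j-1];
        }
        bus_order[j] = cur;
    }
    /* ... then run the existing base-assignment loop over
     * bus_order[0..nroots-1] instead of 0..MaxPCIBus ... */

The mem loop would get the same treatment.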
Do you agree?
Also, it seems cleaner to add a patch on top of this (and re-sending the series again) since while not optimal, it is still correct.
Thanks, Marcel
cheers, Gerd
On Mo, 2014-11-24 at 15:38 +0200, Marcel Apfelbaum wrote:
On Mon, 2014-11-24 at 13:28 +0100, Gerd Hoffmann wrote:
On Mo, 2014-11-24 at 11:47 +0200, Marcel Apfelbaum wrote:
Removed the assumption that the system has only one primary root bus. When mapping memory and IO regions, go over all buses, skipping secondary and absent buses.
Hi Gerd, Thanks for the review.
I guess there are qemu patches adding support for multiple pci busses somewhere?
Yes, an RFC will be released soon on the QEMU mailing list. It is a work in progress and I am currently trying to decide whether to wait for a fully functional series or release it sooner for comments.
Early rfc would be good I guess. First, I'd have something to play with when looking closer at this patch series ;)
Second, I suspect there might be some non-trivial interactions with acpi (pci region declaration), so early review is probably helpful to make sure the design goes in the right direction.
Hmm, that assigns the io regions in bus order. I think it would be better to integrate this with the packing logic we already have: regions are sorted by size for best packing. Regions behind pci bridges are grouped together. I think we could group regions belonging to a pci bus in a similar way.
You are right, but only partially. Let me explain (as I understand it). The devices' IO regions behind the bus are *already* packed (sorted by size). The buses themselves are indeed not sorted by their total IO space. The same happens in pci_bios_init_root_regions_mem with the MMIO regions.
It seems that what I have to do is simply replace the current bus enumeration,

    for (i = 0; i <= MaxPCIBus; i++) {

with one that goes over the buses sorted in decreasing order of their total mem/IO consumption.
Do you agree?
I think I would try to reuse the existing code which does the same for bridges. Reuse "struct pci_bus" to add one more level (and maybe rename the struct), then have the resource propagation code in pci_bios_check_devices() do the job.
That will need some reorganization, because the simple "struct pci_bus *busses" array indexed by bus number will not work any more. But I think that should be done nevertheless. Your patch numbers all busses across pci domains incrementally, which I think is not how pci busses are numbered; each pci domain has its private bus number namespace.
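Something like this maybe (rough sketch only, the new field and its name are made up):

    struct pci_bus {
        struct pci_region r[PCI_REGION_TYPE_COUNT];
        struct pci_device *bus_dev;     /* NULL for root buses */
        struct pci_bus *parent;         /* made up: the bus we propagate
                                           resources into, NULL at the top */
    };

Then pci_bios_check_devices() could walk parent pointers instead of indexing busses[] by bus number.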
cheers, Gerd
On Mon, 2014-11-24 at 15:28 +0100, Gerd Hoffmann wrote:
On Mo, 2014-11-24 at 15:38 +0200, Marcel Apfelbaum wrote:
On Mon, 2014-11-24 at 13:28 +0100, Gerd Hoffmann wrote:
On Mo, 2014-11-24 at 11:47 +0200, Marcel Apfelbaum wrote:
Removed the assumption that the system has only one primary root bus. When mapping memory and IO regions, go over all buses, skipping secondary and absent buses.
Hi Gerd, Thanks for the review.
I guess there are qemu patches adding support for multiple pci busses somewhere?
Yes, an RFC will be released soon on the QEMU mailing list. It is a work in progress and I am currently trying to decide whether to wait for a fully functional series or release it sooner for comments.
Early rfc would be good I guess. First, I'd have something to play with when looking closer at this patch series ;)
Second, I suspect there might be some non-trivial interactions with acpi (pci region declaration), so early review is probably helpful to make sure the design goes in the right direction.
Got it! Sending the RFC soon :)
Hmm, that assigns the io regions in bus order. I think it would be better to integrate this with the packing logic we already have: regions are sorted by size for best packing. Regions behind pci bridges are grouped together. I think we could group regions belonging to a pci bus in a similar way.
You are right, but only partially. Let me explain (as I understand it). The devices' IO regions behind the bus are *already* packed (sorted by size). The buses themselves are indeed not sorted by their total IO space. The same happens in pci_bios_init_root_regions_mem with the MMIO regions.
It seems that what I have to do is simply replace the current bus enumeration,

    for (i = 0; i <= MaxPCIBus; i++) {

with one that goes over the buses sorted in decreasing order of their total mem/IO consumption.
Do you agree?
I think I would try to reuse the existing code which does the same for bridges. Reuse "struct pci_bus" to add one more level (and maybe rename the struct), then have the resource propagation code in pci_bios_check_devices() do the job.
Thanks for the idea. I'll try following it.
That will need some reorganization, because the simple "struct pci_bus *busses" array indexed by bus number will not work any more. But I think that should be done nevertheless. Your patch numbers all busses across pci domains incrementally, which I think is not how pci busses are numbered; each pci domain has its private bus number namespace.
The solution does not cover multiple domains. It only covers multiple root buses in the default domain. (and this is what I am interested in)
The operating systems can support multiple pci root buses in the same domain like this:
 - The BIOS assigns each pci primary bus a bus-number range within 0x00-0xff.
 - The ACPI tables inform the OS of the bus ranges for each primary bus (and the IO/mem ranges for each, of course).
The current code uses the busses array indexed by bus number (as you said). Patch 1/2 leverages that by creating separate ranges for each primary bus. Let's say we have 2 primary buses having numbers 0 and 10, each with 2 bridges. The existing code will give range 1-2-3 to the first primary bus and 10-11-12 to the second one.
You are right; it will make more sense after seeing the QEMU RFC for it. I'll get to it.
Thanks, Marcel
cheers, Gerd
On Mon, Nov 24, 2014 at 03:28:52PM +0100, Gerd Hoffmann wrote:
On Mo, 2014-11-24 at 15:38 +0200, Marcel Apfelbaum wrote:
On Mon, 2014-11-24 at 13:28 +0100, Gerd Hoffmann wrote:
Hmm, that assigns the io regions in bus order. I think it would be better to integrate this with the packing logic we already have: regions are sorted by size for best packing. Regions behind pci bridges are grouped together. I think we could group regions belonging to a pci bus in a similar way.
You are right, but only partially. Let me explain (as I understand it). The devices' IO regions behind the bus are *already* packed (sorted by size). The buses themselves are indeed not sorted by their total IO space. The same happens in pci_bios_init_root_regions_mem with the MMIO regions.
It seems that what I have to do is simply replace the current bus enumeration,

    for (i = 0; i <= MaxPCIBus; i++) {

with one that goes over the buses sorted in decreasing order of their total mem/IO consumption.
Do you agree?
I think I would try to reuse the existing code which does the same for bridges. Reuse "struct pci_bus" to add one more level (and maybe rename the struct), then have the resource propagation code in pci_bios_check_devices() do the job.
That will need some reorganization, because the simple "struct pci_bus *busses" array indexed by bus number will not work any more.
Why not just pretend that the extra root PCI buses are children of bus zero (for resource sizing and mapping purposes).
That is, where pci_bios_check_devices() checks for "if (!s->bus_dev) continue;" have it instead continue to process the bus, but have it assume "parent = &busses[0];". Any code that looks at bus->bus_dev would also have to check that it is non-null, but I think much of the existing code might just work.
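In rough (untested) form, something like this; the surrounding loop is elided and the variable names are from memory, so they may not match the code exactly:

    struct pci_bus *s = &busses[secondary_bus];   /* loop already skips bus 0 */
    struct pci_bus *parent;
    if (!s->bus_dev)
        // extra root bus: size and map it as if it hangs off bus 0
        parent = &busses[0];
    else
        parent = &busses[pci_bdf_to_bus(s->bus_dev->bdf)];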
-Kevin
On Mon, Nov 24, 2014 at 12:21:31PM -0500, Kevin O'Connor wrote:
On Mon, Nov 24, 2014 at 03:28:52PM +0100, Gerd Hoffmann wrote:
I think I would try to reuse the existing code which does the same for bridges. Reuse "struct pci_bus" to add one more level (and maybe rename the struct), then have the resource propagation code in pci_bios_check_devices() do the job.
That will need some reorganization, because the simple "struct pci_bus *busses" array indexed by bus number will not work any more.
Why not just pretend that the extra root PCI buses are children of bus zero (for resource sizing and mapping purposes).
Thinking about this further, an even easier route may be to just place all devices on the extra root PCI buses in busses[0]. If I understand it correctly, when a system has more than one PCI root bus, all the root buses snoop all mem and io accesses, so there isn't a requirement for the io/mem regions to be contiguous for that bus. So, if all the devices are thrown into busses[0], I think it may just work.
Specifically, I'm thinking about something like:
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -668,6 +668,9 @@ static int pci_bios_check_devices(struct pci_bus *busses)
             busses[pci->secondary_bus].bus_dev = pci;
 
         struct pci_bus *bus = &busses[pci_bdf_to_bus(pci->bdf)];
+        if (!bus->bus_dev)
+            // Treat devices on root buses as if they are on the 0 root bus
+            bus = &busses[0];
         int i;
         for (i = 0; i < PCI_NUM_REGIONS; i++) {
             if ((pci->class == PCI_CLASS_BRIDGE_PCI) &&
-Kevin
On Mon, 2014-11-24 at 13:01 -0500, Kevin O'Connor wrote:
On Mon, Nov 24, 2014 at 12:21:31PM -0500, Kevin O'Connor wrote:
On Mon, Nov 24, 2014 at 03:28:52PM +0100, Gerd Hoffmann wrote:
I think I would try to reuse the existing code which does the same for bridges. Reuse "struct pci_bus" to add one more level (and maybe rename the struct), then have the resource propagation code in pci_bios_check_devices() do the job.
That will need some reorganization, because the simple "struct pci_bus *busses" array indexed by bus number will not work any more.
Why not just pretend that the extra root PCI buses are children of bus zero (for resource sizing and mapping purposes).
Thinking about this further, an even easier route may be to just place all devices on the extra root PCI buses in busses[0]. If I understand it correctly, when a system has more than one PCI root bus, all the root buses snoop all mem and io accesses, so there isn't a requirement for the io/mem regions to be contiguous for that bus.
While from the host bridges' perspective you are right, from the OS point of view we have a problem, at least for ACPI based ones. They want disjoint IO/mem ranges for each primary root bus. (See: http://www.acpi.info/acpi_faq.htm)
I have a POC on a private branch with a working version, and Windows especially needs the disjoint regions.
However, we can do the above and break the _CRS into little pieces, one for each device. This will work, but maybe it is not elegant and we may end up with a lot of problems on the QEMU side. What do you think?
Thanks for the review! Marcel
So, if all the devices are thrown into busses[0], I think it may just work.
Specifically, I'm thinking about something like:
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -668,6 +668,9 @@ static int pci_bios_check_devices(struct pci_bus *busses)
             busses[pci->secondary_bus].bus_dev = pci;
 
         struct pci_bus *bus = &busses[pci_bdf_to_bus(pci->bdf)];
+        if (!bus->bus_dev)
+            // Treat devices on root buses as if they are on the 0 root bus
+            bus = &busses[0];
         int i;
         for (i = 0; i < PCI_NUM_REGIONS; i++) {
             if ((pci->class == PCI_CLASS_BRIDGE_PCI) &&
-Kevin
On Mon, Nov 24, 2014 at 09:38:38PM +0200, Marcel Apfelbaum wrote:
On Mon, 2014-11-24 at 13:01 -0500, Kevin O'Connor wrote:
On Mon, Nov 24, 2014 at 12:21:31PM -0500, Kevin O'Connor wrote:
On Mon, Nov 24, 2014 at 03:28:52PM +0100, Gerd Hoffmann wrote:
I think I would try to reuse the existing code which does the same for bridges. Reuse "struct pci_bus" to add one more level (and maybe rename the struct), then have the resource propagation code in pci_bios_check_devices() do the job.
That will need some reorganization, because the simple "struct pci_bus *busses" array indexed by bus number will not work any more.
Why not just pretend that the extra root PCI buses are children of bus zero (for resource sizing and mapping purposes).
Thinking about this further, an even easier route may be to just place all devices on the extra root PCI buses in busses[0]. If I understand it correctly, when a system has more than one PCI root bus, all the root buses snoop all mem and io accesses, so there isn't a requirement for the io/mem regions to be contiguous for that bus.
While from the host bridges' perspective you are right, from the OS point of view we have a problem, at least for ACPI based ones. They want disjoint IO/mem ranges for each primary root bus. (See: http://www.acpi.info/acpi_faq.htm)
I have a POC on a private branch with a working version, and Windows especially needs the disjoint regions.
However, we can do the above and break the _CRS into little pieces, one for each device. This will work, but maybe it is not elegant and we may end up with a lot of problems on the QEMU side. What do you think?
If it's better to group the allocations for a given root bus together, then I'd just go with my first email - pretend each extra root bus is a child of bus zero for allocation purposes, and then audit the code so that it works with bus->bus_dev == NULL.
I do wonder why it's better to have extra PCI root buses instead of using PCI-to-PCI bridges, but I guess I'll find out when I read your RFC.
-Kevin
On Mon, 2014-11-24 at 15:18 -0500, Kevin O'Connor wrote:
On Mon, Nov 24, 2014 at 09:38:38PM +0200, Marcel Apfelbaum wrote:
On Mon, 2014-11-24 at 13:01 -0500, Kevin O'Connor wrote:
On Mon, Nov 24, 2014 at 12:21:31PM -0500, Kevin O'Connor wrote:
On Mon, Nov 24, 2014 at 03:28:52PM +0100, Gerd Hoffmann wrote:
I think I would try to reuse the existing code which does the same for bridges. Reuse "struct pci_bus" to add one more level (and maybe rename the struct), then have the resource propagation code in pci_bios_check_devices() do the job.
That will need some reorganization, because the simple "struct pci_bus *busses" array indexed by bus number will not work any more.
Why not just pretend that the extra root PCI buses are children of bus zero (for resource sizing and mapping purposes).
Thinking about this further, an even easier route may be to just place all devices on the extra root PCI buses in busses[0]. If I understand it correctly, when a system has more than one PCI root bus, all the root buses snoop all mem and io accesses, so there isn't a requirement for the io/mem regions to be contiguous for that bus.
While from the host bridges' perspective you are right, from the OS point of view we have a problem, at least for ACPI based ones. They want disjoint IO/mem ranges for each primary root bus. (See: http://www.acpi.info/acpi_faq.htm)
I have a POC on a private branch with a working version, and Windows especially needs the disjoint regions.
However, we can do the above and break the _CRS into little pieces, one for each device. This will work, but maybe it is not elegant and we may end up with a lot of problems on the QEMU side. What do you think?
If it's better to group the allocations for a given root bus together, then I'd just go with my first email - pretend each extra root bus is a child of bus zero for allocation purposes, and then audit the code so that it works with bus->bus_dev == NULL.
I'll try it, thanks!
I do wonder why it's better to have extra PCI root buses instead of using PCI-to-PCI bridges, but I guess I'll find out when I read your RFC.
In short, guests with multiple NUMA nodes (mapped from hosts with multiple NUMA nodes) can group together only their CPU and memory, but not the PCI devices. The reason is that the operating systems can associate only one NUMA node per PCI primary bus.
Extra PCI root buses will give this flexibility and will be really handy, especially with pass-through devices.
The RFC is coming soon.
Thanks again for your interest :), Marcel
-Kevin