On Mon, Jun 15, 2015 at 07:39:18AM +1000, Benjamin Herrenschmidt wrote:
On Sun, 2015-06-14 at 20:06 +0200, Michael S. Tsirkin wrote:
As I understand it, the use case for multiple PCI roots is large servers that process a lot of IO.
For better or worse, guest OSes assume that NUMA locality can only be specified for PCI roots.
So the use case is to specify NUMA locality for virtual devices.
In addition, I'd add that on ppc "pseries", the standard way of hotplugging devices is to hotplug PCI roots (i.e., a virtual host bridge with the new device(s) below it).
I'm not aware of real-world hardware with hot-pluggable root buses. Should it come about, some kind of OS-visible spec would be needed for the OS to identify and enumerate newly added buses, and I suppose we could figure out how to handle it once that type of thing happens.
On IBM ppc systems, both exist: real HW hot-pluggable roots (mostly on older systems, GX-based drawers with PCI host bridges in them; nowadays we tend to use PCIe cable-card-based drawers), and, in virtualized systems, hot plugging root bridges, which is the standard way PowerVM (aka pHyp) does hotplug and which we need to support in qemu/pseries at some point.
But more importantly, if the sort is by the bus number, then how is it better than just using the bus number directly?
PCI bus number makes no sense. Any root can have the whole range of bus numbers 0...255, and the bus number assignment is under control of the guest anyway. Or are you talking about a different bus number? (I picked up the conversation halfway...)
There are x86 systems with multiple separate PCI root buses where one can access the PCI config space of all the buses using the same 0x0cf8 IO space. During system setup, each of the PCI root buses is configured to respond only to PCI config accesses within its own range of bus numbers. So if "root1" is configured for bus ids 64-128, then it will only forward a request if the bus id in the request falls within 64-128.
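To make the routing above concrete, here is a rough sketch of how a request written to 0x0cf8 could be matched against per-root bus ranges. The field layout is the standard PCI configuration mechanism #1 address format; the RootBus type, the route_config_access() helper and the "root0" range are made up for illustration, and none of this is taken from the qemu code.

```python
from typing import List, NamedTuple, Optional, Tuple


class RootBus(NamedTuple):
    """Hypothetical description of one PCI root bus and its bus-number window."""
    name: str
    first_bus: int  # inclusive
    last_bus: int   # inclusive


def decode_config_address(addr: int) -> Tuple[bool, int, int, int, int]:
    """Split a 32-bit CONFIG_ADDRESS value (the one written to 0x0cf8)."""
    enable = bool((addr >> 31) & 0x1)   # bit 31: enable config cycle
    bus = (addr >> 16) & 0xFF           # bits 23-16: bus number
    device = (addr >> 11) & 0x1F        # bits 15-11: device number
    function = (addr >> 8) & 0x7        # bits 10-8: function number
    register = addr & 0xFC              # bits 7-2: dword-aligned register offset
    return enable, bus, device, function, register


def route_config_access(addr: int, roots: List[RootBus]) -> Optional[str]:
    """Return the name of the root whose bus range claims this access, if any."""
    enable, bus, _dev, _fn, _reg = decode_config_address(addr)
    if not enable:
        return None
    for root in roots:
        if root.first_bus <= bus <= root.last_bus:
            return root.name
    return None  # no root claims this bus number


# Assumed setup: "root1" uses the 64-128 range from the text, "root0" is invented.
ROOTS = [RootBus("root0", 0, 63), RootBus("root1", 64, 128)]

# Enable bit set, bus 64, device 3, function 0, register 0x10.
EXAMPLE_ADDR = 0x80000000 | (64 << 16) | (3 << 11) | (0 << 8) | 0x10
```

With these assumed ranges, route_config_access(EXAMPLE_ADDR, ROOTS) picks "root1", while a request for a bus number outside every configured range is claimed by no root.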
I suspect in your PPC example that the separate root buses all had separate io/memory space as well and thus were completely separate. (That is, they don't share the equivalent of IO 0x0cf8.) If so, that's different from how the x86 qemu code and the x86 systems I was discussing above work.
-Kevin