If I may chime in here (I worked at Intel back when this issue was first encountered while porting coreboot to Jacobsville)…  The way I ended up making it work was to introduce another object in devicetree (which I called a root bridge) to model the concept of the PCI stack.  In the picture below from Arthur these would be PCI host bridge 0 and 1.  I called them root bridges because the PCI spec describes a host bridge as the path the CPU takes to get to the PCI ‘domain’, in our use of the term.

I had worked on a much earlier project (not coreboot) called Jasper Forest where we had 2 separate CPU packages with a bus between them, each with a separate set of PCI busses below it.  The CPUs were in the same coherency domain and each could access the busses below the other.  I considered this to be 2 host bridges, because there were 2 separate ways the CPUs could reach the PCI domain: one via their own direct connection, and one crossing the CPU-CPU bus, with the sibling CPU directing the accesses and responses back appropriately.  The PCI domain in this case was ‘pre-split’, allocating busses 0-0x7F to the first CPU (the one with the DMI connection to the PCH) and busses 0x80-0xFD to the other.  (Each CPU also had a bus number allocated to its own C-bus, 0xFF for the first and 0xFE for the second, which was used early in the boot process to configure the bus split between them.)

After the first boot cycle completed and all resources were gathered for each host bridge, it was determined whether each had enough resources to map in everything required under it, or whether a rebalance needed to happen.  If a rebalance had to occur (one side needed more memory or I/O space), NVRAM variables were set and a reset occurred so the BIOS could set up the split according to the variables.  This way only the first boot (or a boot after new devices were added to open slots) would be the long one.


Maybe the above description helps with making some choices about how to do things, maybe not; as Mariusz said, there are multiple (and probably always better) ways of doing things.  Drawing on my Jasper Forest experience, introducing a root bridge object to devicetree gave a nice way to describe the system logically, and gave the implementor a mechanism to pre-allocate resources so we wouldn’t have to go through the possible reboot-to-rebalance cycle described above.  The stacks (root bridges) in Xeon may be able to handle changing the decoding at runtime (with certain limitations like bus quiescence), unlike the Jasper Forest example above, where changing the initial split of resources between CPUs required a reset.  Using devicetree to describe the resources was my solution to making enumeration faster and simpler, at the expense of flexibility.  But since these were mostly networking-type platforms, they were fairly static in nature, so it wasn’t really thought to be an issue at the time.  (These are the same stacks as are used in Xeon-SP today, going back to what, for example, Skylake-D used.)  I left Intel a few years ago, before that work was completed.
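For illustration, such a root bridge object in devicetree might have looked roughly like this (hypothetical syntax: the `root_bridge` keyword, chip path, and bus ranges are my stand-ins, not actual upstream coreboot syntax):

```
chip soc/intel/jacobsville              # hypothetical chip path
	device domain 0 on
		root_bridge 0 on        # stack 0, pre-allocated buses 0x00-0x0F
			device pci 00.0 on end
		end
		root_bridge 1 on        # stack 1, pre-allocated buses 0x10-0x1F
			device pci 00.0 on end
		end
	end
end
```

The point of the extra layer is that the implementor can write the per-stack resource pre-allocation down statically, instead of discovering it at runtime and rebalancing across reboots.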


Fast forward: while doing some work porting coreboot to Skylake-D early last year, I recalled some of the difficulties in communicating the PCI domain enumerated in coreboot to Tianocore via ACPI.  I remembered that it may well have been because the stacks are considered host bridges, since we could then describe in ASL that each stack had a separate and invariable (as far as Tianocore was concerned) set of resources.  I think I had actually done it that way, extending the ACPI code in coreboot to generate a PCI host bridge device whenever the new root bridge object was encountered in the devicetree.  In the Skylake-D work (which was eventually dropped) I ran into a problem where, if not all the memory space in the system was allocated or reserved (meaning holes were left in the memory space) and a device under a stack wasn’t allocated resources because the stack’s window wasn’t large enough, Linux would assume those holes were subtractively decoded to the PCI domain and try to stick the device’s resources in there.  Another thing that added some complexity was that each stack had its own IOAPIC, and in non-APIC mode all virtual legacy wire interrupts had to be forwarded down to the stack with the PCH before interrupts got back to the CPU.
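As a sketch, describing one stack as its own host bridge in ASL might look like this (the device name, bus numbers, and resource window are all made-up examples, not taken from the actual Skylake-D port):

```asl
/* Hypothetical example: one Device per stack, exposed as a separate
 * PCI host bridge with a fixed _CRS window. */
Device (PC01)   // stack 1 as its own host bridge
{
    Name (_HID, EisaId ("PNP0A08"))   // PCI Express host bridge
    Name (_CID, EisaId ("PNP0A03"))
    Name (_UID, 1)
    Name (_BBN, 0x16)                 // stack's root bus number (example)
    Name (_CRS, ResourceTemplate ()
    {
        WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
            0x0000, 0x0016, 0x0021, 0x0000, 0x000C)   // buses 0x16-0x21
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed,
            NonCacheable, ReadWrite,
            0x00000000, 0x90000000, 0x9FFFFFFF, 0x00000000, 0x10000000)
    })
}
```

Because each window is declared as fixed, Tianocore (or any OS) sees the per-stack split as invariable, which was the intent.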


Not sure if any of this helps or if it just sounds like rambling, but I thought some of these thoughts might be helpful for design decisions made in the future.  Personally I liked the idea of having the stacks understood in devicetree, but there were some drawbacks as well.  One potential drawback is whether the stack implementation in the hardware is flexible enough for what you might like to do in devicetree as far as assigning bus ranges, etc.  A stack’s maximum bus number is determined by the starting bus number of the next stack in line.
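That last constraint is worth spelling out: each stack decodes from its own base bus number up to (but not including) the base of the next stack, so the per-stack ranges fall out of the base numbers alone.  A minimal C sketch (helper and struct names are mine, not coreboot code):

```c
#include <assert.h>
#include <stddef.h>

struct stack_bus_range {
	unsigned int base;	/* first bus decoded by this stack */
	unsigned int limit;	/* last bus decoded by this stack */
};

/*
 * Derive each stack's bus range from the per-stack starting bus
 * numbers. Bases must be strictly increasing, since the stack bus
 * decoders take precedence in order; the last stack decodes up to
 * last_bus. Returns -1 if the bases are not increasing.
 */
static int compute_stack_ranges(const unsigned int base[], size_t n,
				unsigned int last_bus,
				struct stack_bus_range out[])
{
	for (size_t i = 0; i < n; i++) {
		if (i > 0 && base[i] <= base[i - 1])
			return -1;
		out[i].base = base[i];
		out[i].limit = (i + 1 < n) ? base[i + 1] - 1 : last_bus;
	}
	return 0;
}
```

So assigning a larger bus range to one stack necessarily pushes the base of every stack after it upward.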


Some further info regarding the stacks which may influence any future designs…  Intel regards a stack as being ‘implemented’ when it decodes a PCIe root bridge below it.  Stack-based SoC designs may not implement all stacks, as they may have different PCIe requirements.  The thing to understand (at least on the current generation of stack-based designs) is that the devices/functions that used to be part of what was called the Uncore (memory controllers, CSI/UPI bus configuration, etc.) are now spread across devices 8-31 of the stacks’ root bus numbers.  The exception is stack 0, which only has uncore device 8 on it, because it also carries the DMI decode to the PCH complex, which has (and can only have) devices 9-31 on it.  So, while a stack may be ‘unimplemented’, it still needs a bus number if the uncore devices on it need to be accessible (or at least one that doesn’t collide with other bus assignments if they aren’t needed).  For example, the uncore integrated memory controller device(s) are now on stack 2, devices 8-13 (SKX/CSX platforms).  Stack 2 needs a bus number assigned to it (via a register in stack 0 dev 8) in order to access the IMC registers.  By default this bus number is 2, and the stack bus decoders take precedence, so stack bus numbering must increase per stack.  The kind of thing that can’t happen in this case at early boot is trying to ‘fake’ a bus number decoding, say for some device under a PCIe root bridge on the PCH: you wouldn’t be able to set up a PCIe root bridge subordinate/secondary bus decoding to reach it until you’ve changed the stack bus numbering from the power-on default.


One of the upshots of this new scheme (and probably the reason it was done this way in the first place) is that none of the uncore devices now use any MMIO resources for internal registers.  When more register space is needed, it will be in MMCFG space.
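The MMCFG (PCIe ECAM) addressing itself is fixed by the PCIe spec: 4 KiB of config space per function, 8 functions per device, 32 devices per bus, so a register's address follows directly from bus/device/function/offset.  A small sketch (the base address is a made-up example value):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address of a config-space register in MMCONFIG (PCIe ECAM) space.
 * Layout per the PCIe spec: offset = bus << 20 | dev << 15 |
 * func << 12 | register offset.
 */
static uint64_t mmcfg_reg_addr(uint64_t mmcfg_base, unsigned int bus,
			       unsigned int dev, unsigned int func,
			       unsigned int offset)
{
	return mmcfg_base + ((uint64_t)bus << 20) + (dev << 15) +
	       (func << 12) + offset;
}
```

This is why the uncore devices can grow register space without consuming separate MMIO BARs: everything stays within the (bus, device, function) slot they already occupy in ECAM.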


From: Lance Zhao <lance.zhao@gmail.com>
Sent: Friday, March 18, 2022 12:06 AM
To: Nico Huber <nico.h@gmx.de>
Cc: Arthur Heymans <arthur@aheymans.xyz>; coreboot <coreboot@coreboot.org>
Subject: [coreboot] Re: Multi domain PCI resource allocation: How to deal with multiple root busses on one domain


Stack idea is from https://www.intel.com/content/www/us/en/developer/articles/technical/utilizing-the-intel-xeon-processor-scalable-family-iio-performance-monitoring-events.html.


In Linux, a domain is sometimes the same as a "segment"; I am not sure whether current coreboot on xeon_sp covers the multiple-segment case yet.


On Friday, March 18, 2022 at 02:50, Nico Huber <nico.h@gmx.de> wrote:

Hi Arthur,

On 17.03.22 19:03, Arthur Heymans wrote:
> Now my question is the following:
> On some Stacks there are multiple root busses, but the resources need to be
> allocated on the same window. My initial idea was to add those root busses
> as separate struct bus in the domain->link_list. However currently the
> allocator assumes only one bus on domains (and bridges).
> In the code you'll see a lot of things like
>
> for (child = domain->link_list->children; child; child = child->sibling)
>       ....

this is correct, we often (if not always by now) ignore that `link_list`
is a list itself and only walk the children of the first entry.

>
> This is fine if there is only one bus on the domain.
> Looping over link_list->next, struct bus'ses is certainly an option here,
> but I was told that having only one bus here was a design decision on the
> allocator v4 rewrite. I'm not sure how common that assumption is in the
> tree, so things could be broken in awkward ways.

I wouldn't say it was a design choice, probably rather a convenience
choice. The old concepts around multiple buses directly downstream of
a single device seemed inconsistent, AFAICT. And at the time the
allocator v4 was written it seemed unnecessary to keep compatibility around.

That doesn't mean we can't bring it back, of course. There is at least
one alternative, though.

The currently common case looks like this:


          PCI bus 0
             |
             v

  domain 0 --.
             |-- PCI 00:00.0
             |
             |-- PCI 00:02.0
             |
             :


Now we could have multiple PCI buses directly below the domain. But
instead of modelling this with the `link_list`, we could also model
it with an abstract "host" bus below the domain device and another
layer of "host bridge" devices in between:

          host bus
             |
             v

  domain 0 --.
             |-- PCI host bridge 0 --.
             |                       |-- PCI 00:00.0
             |                       |
             |                       `-- PCI 00:02.0
             |
             |
             |-- PCI host bridge 1 --.
             |                       |-- PCI 16:00.0
             |                       |
             |                       :
             :


I guess this would reduce complexity in generic code at the expense of
more data structures (devices) to manage. OTOH, if we'd make a final
decision for such a model, we could also get rid of the `link_list`.
Basically, setting in stone that we only allow one bus downstream of
any device node.

I'm not fully familiar with the hierarchy on Xeon-SP systems. Would
this be an adequate solution? Also, does the term `stack` map to our
`domain` 1:1 or are there differences?

Nico
_______________________________________________
coreboot mailing list -- coreboot@coreboot.org
To unsubscribe send an email to coreboot-leave@coreboot.org