[SeaBIOS] [RFC PATCH v2 0/4] Allow RedHat PCI bridges reserve more buses than necessary during init

Marcel Apfelbaum marcel at redhat.com
Wed Jul 26 18:22:42 CEST 2017


On 26/07/2017 18:20, Laszlo Ersek wrote:
> On 07/26/17 08:48, Marcel Apfelbaum wrote:
>> On 25/07/2017 18:46, Laszlo Ersek wrote:
> 
> [snip]
> 
>>> (2) Bus range reservation, and hotplugging bridges. What's the
>>> motivation? Our recommendations in "docs/pcie.txt" suggest flat
>>> hierarchies.
>>>
>>
>> It remains flat. You have one single PCIe-PCI bridge plugged
>> into a PCIe Root Port, no deep nesting.
>>
>> The reason is to be able to support legacy PCI devices without
>> "committing" to a DMI-PCI bridge in advance (keep Q35 without
>> legacy hw).
>>
>> The only way to support PCI devices in Q35 is to have them cold-plugged
>> into the pcie.0 bus, which is good, but not enough for expanding
>> Q35's usability in order to eventually make it the default
>> QEMU x86 machine (I know this is another discussion and I am in
>> the minority, at least for now).
>>
>> The plan is:
>> Start the Q35 machine as usual, but have one of the PCIe Root Ports
>> include hints for the firmware needed to support legacy PCI devices
>> (IO port range, extra bus, ...).
>>
>> Once a PCI device is needed you have 2 options:
>> 1. Cold-plug a PCIe-PCI bridge into a PCIe Root Port and the PCI
>>     device into the bridge.
>> 2. Hotplug a PCIe-PCI bridge into a PCIe Root Port and then hotplug
>>     a PCI device into the bridge.
>

Hi Laszlo,

> Thank you for the explanation, it makes the intent a lot clearer.
> 
> However, what does the hot-pluggability of the PCIe-PCI bridge buy us?
> In other words, what does it buy us when we do not add the PCIe-PCI
> bridge immediately at guest startup, as an integrated device?
>
> Why is it a problem to "commit" in advance? I understand that we might
> not like the DMI-PCI bridge (due to it being legacy), but what speaks
> against cold-plugging the PCIe-PCI bridge either as an integrated device
> in pcie.0 (assuming that is permitted), or cold-plugging the PCIe-PCI
> bridge in a similarly cold-plugged PCIe root port?
> 

We want to keep Q35 clean; in most cases we don't want any
legacy PCI stuff unless it is specifically required.
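
To make option 2 of the plan quoted above concrete, the flow could
look roughly like this from the QEMU monitor (a sketch only; the
device name is the PCIe-PCI bridge proposed for this effort, and rp1
is assumed to be a cold-plugged PCIe Root Port carrying the firmware
hints):

    (qemu) device_add pcie-pci-bridge,id=br1,bus=rp1
    (qemu) device_add e1000,id=nic1,bus=br1,addr=0x1

Until the first device_add, no legacy PCI hardware is present in the
machine, which is exactly the "clean by default" property we want.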

> I mean, in the cold-plugged case, you use up two bus numbers at the
> most, one for the root port, and another for the PCIe-PCI bridge. In the
> hot-plugged case, you have to start with the cold-plugged root port just
> the same (so that you can communicate the bus number reservation *at
> all*), and then reserve (= use up in advance) the bus number, the IO
> space, and the MMIO space(s). I don't see the difference; hot-plugging
> the PCIe-PCI bridge (= not committing in advance) doesn't seem to save
> any resources.
> 

It's not about resources; it's more about the usage model.

> I guess I would see a difference if we reserved more than one bus number
> in the hotplug case, namely in order to support recursive hotplug under
> the PCIe-PCI bridge. But, you confirmed that we intend to keep the flat
> hierarchy (i.e. the exercise is only for enabling legacy PCI endpoints,
> not for recursive hotplug). The PCIe-PCI bridge isn't a device that
> does anything at all on its own, so why not just coldplug it? Its
> resources have to be reserved in advance anyway.
> 

Even if we prefer flat hierarchies, we should allow a sane nested
bridge configuration, so we will sometimes reserve more than one bus.
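
As an illustration (the property name below is hypothetical, not
necessarily what the series will end up with), a Root Port could be
told at startup to keep a few extra buses behind it for nested
bridges:

    -device pcie-root-port,id=rp1,bus=pcie.0,addr=0x4,bus-reserve=4

Firmware would then program the port's subordinate bus number so that
4 extra bus numbers stay available for bridges hotplugged later.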

> So, thus far I would say "just cold-plug the PCIe-PCI bridge at startup,
> possibly even make it an integrated device, and then you don't need to
> reserve bus numbers (and other apertures)".
> 
> Where am I wrong?
> 

Nothing is wrong; I am just looking for feature parity between Q35
and PC. Users may want to continue using [nested] PCI bridges, and
we want the Q35 machine to be used by more users so that it
becomes reliable faster, while keeping it clean by default.

We had a discussion on this matter at last year's KVM Forum,
and the hot-pluggable PCIe-PCI bridge was the general consensus.
As a bonus, we get the PCI firmware hints capability.

> [snip]
> 
>>> (4) Whether the reservation size should be absolute or relative (raised
>>> by Gerd). IIUC, Gerd suggests that the absolute aperture size should be
>>> specified (as a minimum), not the increment / reservation for hotplug
>>> purposes.
>>>
>>> The Platform Initialization Specification, v1.6, downloadable at
>>> <http://www.uefi.org/specs>, writes the following under
>>>
>>>     EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
>>>
>>> in whose implementation I will have to parse the values from the
>>> capability structure, and return the appropriate representation to the
>>> platform-independent PciBusDxe driver (i.e., the enumeration /
>>> allocation agent):
>>>
>>>> The padding is returned in the form of ACPI (2.0 & 3.0) resource
>>>> descriptors. The exact definition of each of the fields is the same as
>>>> in the
>>>> EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL.SubmitResources()
>>>> function. See the section 10.8 for the definition of this function.
>>>>
>>>> The PCI bus driver is responsible for adding this resource request to
>>>> the resource requests by the physical PCI devices. If Attributes is
>>>> EfiPaddingPciBus, the padding takes effect at the PCI bus level. If
>>>> Attributes is EfiPaddingPciRootBridge, the required padding takes
>>>> effect at the root bridge level. For details, see the definition of
>>>> EFI_HPC_PADDING_ATTRIBUTES in "Related Definitions" below.
>>>
>>> Emphasis on "*adding* this resource request to the resource requests by
>>> the physical PCI devices".
>>>
>>> However... After checking some OVMF logs, it seems that in practice,
>>> PciBusDxe does exactly what Gerd suggests.
>>>
>>> Currently OVMF returns a constant 2MB MMIO padding and a constant 512B
>>> IO padding for all hotplug-capable bridges (including PCI Express root
>>> ports), from GetResourcePadding(). For example, for the following QEMU
>>> command line fragment:
>>>
>>>     -device pxb-pcie,bus_nr=64,id=rootbr1,numa_node=1,bus=pcie.0,addr=0x3 \
>>>     \
>>>     -device ioh3420,id=rootbr1-port1,bus=rootbr1,addr=0x1,slot=3 \
>>>     -device e1000e,bus=rootbr1-port1,netdev=net0 \
>>>     \
>>>     -device ioh3420,id=rootbr1-port2,bus=rootbr1,addr=0x2,slot=4 \
>>>     -device e1000e,bus=rootbr1-port2,netdev=net1 \
>>>
>>> PciBusDxe produces the following log (extract):
>>>
>>>> PciBus: Resource Map for Root Bridge PciRoot(0x40)
>>>> Type =   Io16; Base = 0x8000;      Length = 0x2000;    Alignment = 0xFFF
>>>>    Base = 0x8000;      Length = 0x1000;    Alignment = 0xFFF;      Owner = PPB [40|02|00:**]
>>>>    Base = 0x9000;      Length = 0x1000;    Alignment = 0xFFF;      Owner = PPB [40|01|00:**]
>>>> Type =  Mem32; Base = 0x98600000;  Length = 0x400000;  Alignment = 0x1FFFFF
>>>>    Base = 0x98600000;  Length = 0x200000;  Alignment = 0x1FFFFF;   Owner = PPB [40|02|00:**]
>>>>    Base = 0x98800000;  Length = 0x200000;  Alignment = 0x1FFFFF;   Owner = PPB [40|01|00:**]
>>>>
>>>> PciBus: Resource Map for Bridge [40|01|00]
>>>> Type =   Io16; Base = 0x9000;      Length = 0x1000;    Alignment = 0xFFF
>>>>    Base = Padding;     Length = 0x200;     Alignment = 0x1FF
>>>>    Base = 0x9000;      Length = 0x20;      Alignment = 0x1F;       Owner = PCI [41|00|00:18]
>>>> Type =  Mem32; Base = 0x98800000;  Length = 0x200000;  Alignment = 0x1FFFFF
>>>>    Base = Padding;     Length = 0x200000;  Alignment = 0x1FFFFF
>>>>    Base = 0x98800000;  Length = 0x20000;   Alignment = 0x1FFFF;    Owner = PCI [41|00|00:14]
>>>>    Base = 0x98820000;  Length = 0x20000;   Alignment = 0x1FFFF;    Owner = PCI [41|00|00:10]
>>>>    Base = 0x98840000;  Length = 0x4000;    Alignment = 0x3FFF;     Owner = PCI [41|00|00:1C]
>>>>
>>>> PciBus: Resource Map for Bridge [40|02|00]
>>>> Type =   Io16; Base = 0x8000;      Length = 0x1000;    Alignment = 0xFFF
>>>>    Base = Padding;     Length = 0x200;     Alignment = 0x1FF
>>>>    Base = 0x8000;      Length = 0x20;      Alignment = 0x1F;       Owner = PCI [42|00|00:18]
>>>> Type =  Mem32; Base = 0x98600000;  Length = 0x200000;  Alignment = 0x1FFFFF
>>>>    Base = Padding;     Length = 0x200000;  Alignment = 0x1FFFFF
>>>>    Base = 0x98600000;  Length = 0x20000;   Alignment = 0x1FFFF;    Owner = PCI [42|00|00:14]
>>>>    Base = 0x98620000;  Length = 0x20000;   Alignment = 0x1FFFF;    Owner = PCI [42|00|00:10]
>>>>    Base = 0x98640000;  Length = 0x4000;    Alignment = 0x3FFF;     Owner = PCI [42|00|00:1C]
>>>
>>> This means that
>>> - both ioh3420 root ports get a padding of 512B for IO and 2MB of MMIO,
>>
>> So the 512B of IO will be reserved even if IO handling is not enabled
>> in the bridge? (I am asking because I might have a similar issue with
>> other guest code.)
> 
> I apologize for being unclear about the "distribution of work" between
> edk2's generic PciBusDxe driver and OVMF's platform-specific
> PciHotPlugInitDxe driver (which provides the
> EFI_PCI_HOT_PLUG_INIT_PROTOCOL, of which GetResourcePadding() is a
> member function).
>
> Basically, PciBusDxe is a platform-independent, universal driver, that
> calls into "hooks", optionally provided by the platform, at specific
> points in the enumeration / allocation.
> 
> EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() is such a platform
> specific hook, and OVMF is "reasonably free" to provide PciBusDxe with
> reservation hints from it, as OVMF sees fit, for any specific hotplug
> controller. (Of course OVMF's freedom is limited by two factors: first
> by the information that QEMU exposes to firmware, and second by the
> "imperative" that in GetResourcePadding() we really shouldn't look at
> any *other* PCI(e) controllers than the exact hotplug controller that
> PciBusDxe is asking about.)
> 

Sounds good so far :)
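
For reference, edk2's MdePkg declares the hook roughly like this (the
padding itself is returned through Padding as a list of ACPI resource
descriptors):

    typedef
    EFI_STATUS
    (EFIAPI *EFI_GET_HOT_PLUG_PADDING)(
      IN  EFI_PCI_HOT_PLUG_INIT_PROTOCOL  *This,
      IN  EFI_DEVICE_PATH_PROTOCOL        *HpcDevicePath,
      IN  UINT64                          HpcPciAddress,
      OUT EFI_HPC_STATE                   *HpcState,
      OUT VOID                            **Padding,
      OUT EFI_HPC_PADDING_ATTRIBUTES      *Attributes
      );

So PciBusDxe asks about one hotplug controller at a time, which
matches the constraint above of not looking at other controllers.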

> With that in mind:
> 
> *Currently*, OVMF advises PciBusDxe to "reserve 512B of IO space" for
> *any* hotplug controller. (From the above log, we can see that on the
> root bridge, these 512B are then rounded up to 4KB.)
> 

OK

> *In the future*, OVMF should replace the constant 512B with the
> following logic: consult the IO base/limit registers of the subject
> hotplug controller, and if they are zero, then return "reserve no IO
> space for this hotplug controller" to PciBusDxe. If the base/limit
> registers are nonzero, OVMF would say "reserve 4KB IO space", or even
> propagate the value from the capability.
> 

Exactly
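
A minimal sketch of that future logic in C (ReadConfigByte() stands
in for the platform's PCI config space accessor and is not a real
edk2 function; the offsets are the standard type 1 header ones):

    #define PCI_IO_BASE   0x1C    /* type 1 (bridge) header offset */
    #define PCI_IO_LIMIT  0x1D

    /* Return the IO padding (in bytes) to request from PciBusDxe for
       one hotplug controller, based on its IO base/limit registers. */
    STATIC UINT64
    IoPaddingForBridge (
      IN UINT64 HpcPciAddress
      )
    {
      UINT8 IoBase  = ReadConfigByte (HpcPciAddress, PCI_IO_BASE);
      UINT8 IoLimit = ReadConfigByte (HpcPciAddress, PCI_IO_LIMIT);

      /* QEMU left both registers at zero: the port does not take part
         in IO forwarding, so request no IO reservation for it. */
      if (IoBase == 0 && IoLimit == 0) {
        return 0;
      }

      /* Otherwise reserve one 4KB bridge IO window granule (or a
         value propagated from the capability). */
      return SIZE_4KB;
    }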

> [snip]
> 
>>> The PI spec says,
>>>
>>>> [...] For all the root HPCs and the nonroot HPCs, call
>>>> EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() to obtain the
>>>> amount of overallocation and add that amount to the requests from the
>>>> physical devices. Reprogram the bus numbers by taking into account the
>>>> bus resource padding information. [...]
>>>
>>> However, according to my interpretation of the source code, PciBusDxe
>>> does not consider bus number padding for non-root HPCs (which are "all"
>>> HPCs on QEMU).
>>>
>>
>> Theoretically speaking, it is possible to change the behavior, right?
> 
> Not just theoretically; in the past I have changed PciBusDxe -- it
> wouldn't identify QEMU's hotplug controllers (root port, downstream port
> etc) appropriately, and I managed to get some patches in. It's just that
> the less we understand the current code and the more intrusive/extensive
> the change is, the harder it is to sell the *idea*. PciBusDxe is
> platform-independent and shipped on many a physical system too.
> 

Understood, but from your explanation it sounds like the existing
callback sites (hooks) are enough.

>>> So, I agree that we can add a bus number range field to the capability,
>>> but I'm fairly sure it won't work with edk2's PciBusDxe without major
>>> surgery. (Given that we haven't ever considered hot-plugging entire
>>> bridges until now, i.e., anything that would require bus number
>>> reservation, I think we can live with such a limitation when using OVMF,
>>> for an indefinite time to come.)
>>>
>>
>> Since OVMF will still support cold-plug, I think it is OK for now.
>> Once the feature gets into SeaBIOS I will open a BZ for tracking.
> 
> Please do that (at that time); it will certainly need a dedicated
> discussion on edk2-devel.
> 

Sure, I'll start a discussion on edk2-devel as soon as we
finish the feature on the QEMU/SeaBIOS side.
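
In the meantime, here is a rough sketch of the kind of vendor-specific
capability we have in mind for the hints, in SeaBIOS-style C (field
names and sizes are illustrative only, not the final layout from the
series):

    struct redhat_resource_reserve_cap {
        u8  cap_id;       /* PCI_CAP_ID_VNDR (0x09) */
        u8  next;         /* next capability pointer */
        u8  len;          /* length of this capability */
        u8  type;         /* identifies this vendor capability */
        u32 bus_reserve;  /* extra subordinate buses to reserve */
        u64 io_reserve;   /* IO window to reserve, in bytes */
        u32 mem_reserve;  /* non-prefetchable MMIO window, in bytes */
        u64 pref_reserve; /* prefetchable MMIO window, in bytes */
    };

SeaBIOS (and OVMF) would read this from the bridge's config space
during enumeration and size the windows accordingly.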

> (Also, your statement about cold-plug being viable for at least a while
> is consistent with my questions / implications near the top: what does
> the hot-pluggability of the PCIe-PCI bridge buy us ultimately?)
> 

Please see above.

Thanks,
Marcel

> Thanks!
> Laszlo
> 



