Currently, PCI bridges are assigned a bus number range at system init, based on the devices plugged in at that time. That is why, when one wants to hotplug another bridge later, it needs a child bus number, which the parent is unable to provide (speaking about virtual devices). The suggested workaround is to have a vendor-specific capability in Red Hat PCI bridges that contains the number of additional buses to reserve during the BIOS PCI init. This capability is intended purely for QEMU->SeaBIOS usage.

Considering all of the above, this series is directly connected with the QEMU RFC series (v2) "Generic PCIE-PCI Bridge".

Although the new PCI capability is supposed to eventually carry various limits along with the number of buses to reserve, for now only its full layout is proposed; only the bus_reserve field is actually used by QEMU and the BIOS. Using the limits is left for a later implementation, since the main goal of this series is to provide the firmware-side support needed for PCIE-PCI bridge hotplug.
Changes v1->v2:
1. New #define for Red Hat vendor added (addresses Konrad's comment).
2. Refactored pci_find_capability function (addresses Marcel's comment).
3. Capability reworked:
   - data type added;
   - reserve space in the structure for IO, memory and prefetchable memory limits.
Aleksandr Bezzubikov (4):
  pci: refactor pci_find_capability to get bdf as the first argument instead of the whole pci_device
  pci: add RedHat vendor ID
  pci: add QEMU-specific PCI capability structure
  pci: enable RedHat PCI bridges to reserve additional buses on PCI init
 src/fw/pciinit.c    | 18 ++++++++++++++----
 src/hw/pci_cap.h    | 23 +++++++++++++++++++++++
 src/hw/pci_ids.h    |  2 ++
 src/hw/pcidevice.c  | 12 ++++++------
 src/hw/pcidevice.h  |  2 +-
 src/hw/virtio-pci.c |  4 ++--
 6 files changed, 48 insertions(+), 13 deletions(-)
 create mode 100644 src/hw/pci_cap.h
Refactor the pci_find_capability function to take a bdf argument instead of a whole pci_device*, since bdf is the only field the function actually needs. It greatly helps when we have a bdf but no pci_device.
Signed-off-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
---
 src/fw/pciinit.c    |  4 ++--
 src/hw/pcidevice.c  | 12 ++++++------
 src/hw/pcidevice.h  |  2 +-
 src/hw/virtio-pci.c |  4 ++--
 4 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index 08221e6..864954f 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -762,7 +762,7 @@ static int pci_bus_hotplug_support(struct pci_bus *bus, u8 pcie_cap)
         return downstream_port && slot_implemented;
     }
 
-    shpc_cap = pci_find_capability(bus->bus_dev, PCI_CAP_ID_SHPC, 0);
+    shpc_cap = pci_find_capability(bus->bus_dev->bdf, PCI_CAP_ID_SHPC, 0);
     return !!shpc_cap;
 }
 
@@ -844,7 +844,7 @@ static int pci_bios_check_devices(struct pci_bus *busses)
              */
            parent = &busses[0];
        int type;
-        u8 pcie_cap = pci_find_capability(s->bus_dev, PCI_CAP_ID_EXP, 0);
+        u8 pcie_cap = pci_find_capability(s->bus_dev->bdf, PCI_CAP_ID_EXP, 0);
        int hotplug_support = pci_bus_hotplug_support(s, pcie_cap);
        for (type = 0; type < PCI_REGION_TYPE_COUNT; type++) {
            u64 align = (type == PCI_REGION_TYPE_IO) ?
diff --git a/src/hw/pcidevice.c b/src/hw/pcidevice.c
index cfebf66..d01e27b 100644
--- a/src/hw/pcidevice.c
+++ b/src/hw/pcidevice.c
@@ -134,25 +134,25 @@ pci_find_init_device(const struct pci_device_id *ids, void *arg)
     return NULL;
 }
 
-u8 pci_find_capability(struct pci_device *pci, u8 cap_id, u8 cap)
+u8 pci_find_capability(u16 bdf, u8 cap_id, u8 cap)
 {
     int i;
-    u16 status = pci_config_readw(pci->bdf, PCI_STATUS);
+    u16 status = pci_config_readw(bdf, PCI_STATUS);
 
     if (!(status & PCI_STATUS_CAP_LIST))
         return 0;
 
     if (cap == 0) {
         /* find first */
-        cap = pci_config_readb(pci->bdf, PCI_CAPABILITY_LIST);
+        cap = pci_config_readb(bdf, PCI_CAPABILITY_LIST);
     } else {
         /* find next */
-        cap = pci_config_readb(pci->bdf, cap + PCI_CAP_LIST_NEXT);
+        cap = pci_config_readb(bdf, cap + PCI_CAP_LIST_NEXT);
     }
     for (i = 0; cap && i <= 0xff; i++) {
-        if (pci_config_readb(pci->bdf, cap + PCI_CAP_LIST_ID) == cap_id)
+        if (pci_config_readb(bdf, cap + PCI_CAP_LIST_ID) == cap_id)
             return cap;
-        cap = pci_config_readb(pci->bdf, cap + PCI_CAP_LIST_NEXT);
+        cap = pci_config_readb(bdf, cap + PCI_CAP_LIST_NEXT);
     }
 
     return 0;
diff --git a/src/hw/pcidevice.h b/src/hw/pcidevice.h
index 354b549..adcc75a 100644
--- a/src/hw/pcidevice.h
+++ b/src/hw/pcidevice.h
@@ -69,7 +69,7 @@ int pci_init_device(const struct pci_device_id *ids
                     , struct pci_device *pci, void *arg);
 struct pci_device *pci_find_init_device(const struct pci_device_id *ids
                                         , void *arg);
-u8 pci_find_capability(struct pci_device *pci, u8 cap_id, u8 cap);
+u8 pci_find_capability(u16 bdf, u8 cap_id, u8 cap);
 void pci_enable_busmaster(struct pci_device *pci);
 u16 pci_enable_iobar(struct pci_device *pci, u32 addr);
 void *pci_enable_membar(struct pci_device *pci, u32 addr);
diff --git a/src/hw/virtio-pci.c b/src/hw/virtio-pci.c
index e5c2c33..4e33033 100644
--- a/src/hw/virtio-pci.c
+++ b/src/hw/virtio-pci.c
@@ -381,7 +381,7 @@ fail:
 
 void vp_init_simple(struct vp_device *vp, struct pci_device *pci)
 {
-    u8 cap = pci_find_capability(pci, PCI_CAP_ID_VNDR, 0);
+    u8 cap = pci_find_capability(pci->bdf, PCI_CAP_ID_VNDR, 0);
     struct vp_cap *vp_cap;
     const char *mode;
     u32 offset, base, mul;
@@ -479,7 +479,7 @@ void vp_init_simple(struct vp_device *vp, struct pci_device *pci)
                 vp_cap->cap, type, vp_cap->bar, addr, offset, mode);
         }
 
-        cap = pci_find_capability(pci, PCI_CAP_ID_VNDR, cap);
+        cap = pci_find_capability(pci->bdf, PCI_CAP_ID_VNDR, cap);
     }
 
     if (vp->common.cap && vp->notify.cap && vp->isr.cap && vp->device.cap) {
Hi Alexandr,
On 23/07/2017 1:11, Aleksandr Bezzubikov wrote:
Refactor pci_find_capability function to get bdf instead of a whole pci_device* as the only necessary field for this function is still bdf. It greatly helps when we have bdf but not pci_device.
You can drop the last sentence. Other than that:
Reviewed-by: Marcel Apfelbaum marcel@redhat.com
Thanks, Marcel
Signed-off-by: Aleksandr Bezzubikov zuban32s@gmail.com
src/fw/pciinit.c | 4 ++-- src/hw/pcidevice.c | 12 ++++++------ src/hw/pcidevice.h | 2 +- src/hw/virtio-pci.c | 4 ++-- 4 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c index 08221e6..864954f 100644 --- a/src/fw/pciinit.c +++ b/src/fw/pciinit.c @@ -762,7 +762,7 @@ static int pci_bus_hotplug_support(struct pci_bus *bus, u8 pcie_cap) return downstream_port && slot_implemented; }
- shpc_cap = pci_find_capability(bus->bus_dev, PCI_CAP_ID_SHPC, 0);
- shpc_cap = pci_find_capability(bus->bus_dev->bdf, PCI_CAP_ID_SHPC, 0); return !!shpc_cap; }
@@ -844,7 +844,7 @@ static int pci_bios_check_devices(struct pci_bus *busses) */ parent = &busses[0]; int type;
u8 pcie_cap = pci_find_capability(s->bus_dev, PCI_CAP_ID_EXP, 0);
u8 pcie_cap = pci_find_capability(s->bus_dev->bdf, PCI_CAP_ID_EXP, 0); int hotplug_support = pci_bus_hotplug_support(s, pcie_cap); for (type = 0; type < PCI_REGION_TYPE_COUNT; type++) { u64 align = (type == PCI_REGION_TYPE_IO) ?
diff --git a/src/hw/pcidevice.c b/src/hw/pcidevice.c index cfebf66..d01e27b 100644 --- a/src/hw/pcidevice.c +++ b/src/hw/pcidevice.c @@ -134,25 +134,25 @@ pci_find_init_device(const struct pci_device_id *ids, void *arg) return NULL; }
-u8 pci_find_capability(struct pci_device *pci, u8 cap_id, u8 cap) +u8 pci_find_capability(u16 bdf, u8 cap_id, u8 cap) { int i;
- u16 status = pci_config_readw(pci->bdf, PCI_STATUS);
u16 status = pci_config_readw(bdf, PCI_STATUS);
if (!(status & PCI_STATUS_CAP_LIST)) return 0;
if (cap == 0) { /* find first */
cap = pci_config_readb(pci->bdf, PCI_CAPABILITY_LIST);
cap = pci_config_readb(bdf, PCI_CAPABILITY_LIST); } else { /* find next */
cap = pci_config_readb(pci->bdf, cap + PCI_CAP_LIST_NEXT);
cap = pci_config_readb(bdf, cap + PCI_CAP_LIST_NEXT); } for (i = 0; cap && i <= 0xff; i++) {
if (pci_config_readb(pci->bdf, cap + PCI_CAP_LIST_ID) == cap_id)
if (pci_config_readb(bdf, cap + PCI_CAP_LIST_ID) == cap_id) return cap;
cap = pci_config_readb(pci->bdf, cap + PCI_CAP_LIST_NEXT);
cap = pci_config_readb(bdf, cap + PCI_CAP_LIST_NEXT); } return 0;
diff --git a/src/hw/pcidevice.h b/src/hw/pcidevice.h index 354b549..adcc75a 100644 --- a/src/hw/pcidevice.h +++ b/src/hw/pcidevice.h @@ -69,7 +69,7 @@ int pci_init_device(const struct pci_device_id *ids , struct pci_device *pci, void *arg); struct pci_device *pci_find_init_device(const struct pci_device_id *ids , void *arg); -u8 pci_find_capability(struct pci_device *pci, u8 cap_id, u8 cap); +u8 pci_find_capability(u16 bdf, u8 cap_id, u8 cap); void pci_enable_busmaster(struct pci_device *pci); u16 pci_enable_iobar(struct pci_device *pci, u32 addr); void *pci_enable_membar(struct pci_device *pci, u32 addr); diff --git a/src/hw/virtio-pci.c b/src/hw/virtio-pci.c index e5c2c33..4e33033 100644 --- a/src/hw/virtio-pci.c +++ b/src/hw/virtio-pci.c @@ -381,7 +381,7 @@ fail:
void vp_init_simple(struct vp_device *vp, struct pci_device *pci) {
- u8 cap = pci_find_capability(pci, PCI_CAP_ID_VNDR, 0);
- u8 cap = pci_find_capability(pci->bdf, PCI_CAP_ID_VNDR, 0); struct vp_cap *vp_cap; const char *mode; u32 offset, base, mul;
@@ -479,7 +479,7 @@ void vp_init_simple(struct vp_device *vp, struct pci_device *pci) vp_cap->cap, type, vp_cap->bar, addr, offset, mode); }
cap = pci_find_capability(pci, PCI_CAP_ID_VNDR, cap);
cap = pci_find_capability(pci->bdf, PCI_CAP_ID_VNDR, cap); } if (vp->common.cap && vp->notify.cap && vp->isr.cap && vp->device.cap) {
On Sun, Jul 23, 2017 at 01:11:47AM +0300, Aleksandr Bezzubikov wrote:
Refactor pci_find_capability function to get bdf instead of a whole pci_device* as the only necessary field for this function is still bdf. It greatly helps when we have bdf but not pci_device.
Signed-off-by: Aleksandr Bezzubikov zuban32s@gmail.com
src/fw/pciinit.c | 4 ++-- src/hw/pcidevice.c | 12 ++++++------ src/hw/pcidevice.h | 2 +- src/hw/virtio-pci.c | 4 ++-- 4 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c index 08221e6..864954f 100644 --- a/src/fw/pciinit.c +++ b/src/fw/pciinit.c @@ -762,7 +762,7 @@ static int pci_bus_hotplug_support(struct pci_bus *bus, u8 pcie_cap) return downstream_port && slot_implemented; }
- shpc_cap = pci_find_capability(bus->bus_dev, PCI_CAP_ID_SHPC, 0);
- shpc_cap = pci_find_capability(bus->bus_dev->bdf, PCI_CAP_ID_SHPC, 0); return !!shpc_cap;
}
@@ -844,7 +844,7 @@ static int pci_bios_check_devices(struct pci_bus *busses) */ parent = &busses[0]; int type;
u8 pcie_cap = pci_find_capability(s->bus_dev, PCI_CAP_ID_EXP, 0);
u8 pcie_cap = pci_find_capability(s->bus_dev->bdf, PCI_CAP_ID_EXP, 0); int hotplug_support = pci_bus_hotplug_support(s, pcie_cap); for (type = 0; type < PCI_REGION_TYPE_COUNT; type++) { u64 align = (type == PCI_REGION_TYPE_IO) ?
diff --git a/src/hw/pcidevice.c b/src/hw/pcidevice.c index cfebf66..d01e27b 100644 --- a/src/hw/pcidevice.c +++ b/src/hw/pcidevice.c @@ -134,25 +134,25 @@ pci_find_init_device(const struct pci_device_id *ids, void *arg) return NULL; }
-u8 pci_find_capability(struct pci_device *pci, u8 cap_id, u8 cap) +u8 pci_find_capability(u16 bdf, u8 cap_id, u8 cap)
Thanks. If you respin this series, please also move the pci_find_capability() function from pcidevice.c to pci.c. (If the function no longer takes a pci_device, it does not belong in pcidevice.c.)
-Kevin
Signed-off-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
---
 src/hw/pci_ids.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/hw/pci_ids.h b/src/hw/pci_ids.h
index 4ac73b4..db2e694 100644
--- a/src/hw/pci_ids.h
+++ b/src/hw/pci_ids.h
@@ -2263,6 +2263,8 @@
 #define PCI_DEVICE_ID_KORENIX_JETCARDF0 0x1600
 #define PCI_DEVICE_ID_KORENIX_JETCARDF1 0x16ff
 
+#define PCI_VENDOR_ID_REDHAT            0x1b36
+
 #define PCI_VENDOR_ID_TEKRAM            0x1de1
 #define PCI_DEVICE_ID_TEKRAM_DC290      0xdc29
On 23/07/2017 1:11, Aleksandr Bezzubikov wrote:
Signed-off-by: Aleksandr Bezzubikov zuban32s@gmail.com
src/hw/pci_ids.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/src/hw/pci_ids.h b/src/hw/pci_ids.h index 4ac73b4..db2e694 100644 --- a/src/hw/pci_ids.h +++ b/src/hw/pci_ids.h @@ -2263,6 +2263,8 @@ #define PCI_DEVICE_ID_KORENIX_JETCARDF0 0x1600 #define PCI_DEVICE_ID_KORENIX_JETCARDF1 0x16ff
+#define PCI_VENDOR_ID_REDHAT 0x1b36
- #define PCI_VENDOR_ID_TEKRAM 0x1de1 #define PCI_DEVICE_ID_TEKRAM_DC290 0xdc29
I suggest merging this patch with patch 4, which uses it.
Thanks, Marcel
On PCI init, PCI bridge devices may need some extra information: the number of buses to reserve and the IO, memory and prefetchable memory limits. QEMU can provide this with a special vendor-specific PCI capability.

This capability is intended to be used only for Red Hat PCI bridges, i.e. in cooperation with QEMU.

The sizes of the limit fields match those in the PCI Type 1 Configuration Space Header; the number of buses to reserve occupies only 1 byte, since that is the size of the Subordinate Bus Number register.
Signed-off-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
---
 src/hw/pci_cap.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
 create mode 100644 src/hw/pci_cap.h
diff --git a/src/hw/pci_cap.h b/src/hw/pci_cap.h
new file mode 100644
index 0000000..1382b0b
--- /dev/null
+++ b/src/hw/pci_cap.h
@@ -0,0 +1,23 @@
+#ifndef _PCI_CAP_H
+#define _PCI_CAP_H
+
+#include "types.h"
+
+struct vendor_pci_cap {
+    u8 id;
+    u8 next;
+    u8 len;
+};
+
+struct redhat_pci_bridge_cap {
+    struct vendor_pci_cap hdr;
+    u8 bus_res;
+    u32 pref_lim_upper;
+    u16 pref_lim;
+    u16 mem_lim;
+    u16 io_lim_upper;
+    u8 io_lim;
+    u8 padd;
+};
+
+#endif /* _PCI_CAP_H */
On Sun, Jul 23, 2017 at 01:11:49AM +0300, Aleksandr Bezzubikov wrote:
On PCI init PCI bridge devices may need some extra info about bus number to reserve, IO, memory and prefetchable memory limits. QEMU can provide this with special vendor-specific PCI capability.
This capability is intended to be used only for Red Hat PCI bridges, i.e. QEMU cooperation.
Sizes of limits match ones from PCI Type 1 Configuration Space Header, number of buses to reserve occupies only 1 byte since it is the size of Subordinate Bus Number register.
Signed-off-by: Aleksandr Bezzubikov zuban32s@gmail.com
src/hw/pci_cap.h | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) create mode 100644 src/hw/pci_cap.h
diff --git a/src/hw/pci_cap.h b/src/hw/pci_cap.h new file mode 100644 index 0000000..1382b0b --- /dev/null +++ b/src/hw/pci_cap.h @@ -0,0 +1,23 @@ +#ifndef _PCI_CAP_H +#define _PCI_CAP_H
+#include "types.h"
+struct vendor_pci_cap {
- u8 id;
- u8 next;
- u8 len;
+};
+struct redhat_pci_bridge_cap {
- struct vendor_pci_cap hdr;
You want to add some kind of identifier here after the header, such that more capabilities can be added in future without breaking this one.
- u8 bus_res;
- u32 pref_lim_upper;
- u16 pref_lim;
- u16 mem_lim;
- u16 io_lim_upper;
- u8 io_lim;
- u8 padd;
Please add documentation.
+};
+#endif /* _PCI_CAP_H */
2.7.4
2017-07-23 5:44 GMT+03:00 Michael S. Tsirkin mst@redhat.com:
On Sun, Jul 23, 2017 at 01:11:49AM +0300, Aleksandr Bezzubikov wrote:
On PCI init PCI bridge devices may need some extra info about bus number to reserve, IO, memory and prefetchable memory limits. QEMU can provide this with special vendor-specific PCI capability.
This capability is intended to be used only for Red Hat PCI bridges, i.e. QEMU cooperation.
Sizes of limits match ones from PCI Type 1 Configuration Space Header, number of buses to reserve occupies only 1 byte since it is the size of Subordinate Bus Number register.
Signed-off-by: Aleksandr Bezzubikov zuban32s@gmail.com
src/hw/pci_cap.h | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) create mode 100644 src/hw/pci_cap.h
diff --git a/src/hw/pci_cap.h b/src/hw/pci_cap.h new file mode 100644 index 0000000..1382b0b --- /dev/null +++ b/src/hw/pci_cap.h @@ -0,0 +1,23 @@ +#ifndef _PCI_CAP_H +#define _PCI_CAP_H
+#include "types.h"
+struct vendor_pci_cap {
- u8 id;
- u8 next;
- u8 len;
+};
+struct redhat_pci_bridge_cap {
- struct vendor_pci_cap hdr;
Hi Michael,
Thanks for the quick reply.
You want to add some kind of identifier here after the header, such that more capabilities can be added in future without breaking this one.
You mean to distinguish different vendor-specific capabilities? Agreed if so, will add it in the next version.
- u8 bus_res;
- u32 pref_lim_upper;
- u16 pref_lim;
- u16 mem_lim;
- u16 io_lim_upper;
- u8 io_lim;
- u8 padd;
Please add documentation.
+};
+#endif /* _PCI_CAP_H */
2.7.4
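As an illustration of the identifier discussed in this exchange, here is a minimal sketch of how the reworked capability header might look; the type field name, its placement, and its value are assumptions for illustration only, not part of this series:

/* Hypothetical sketch only -- the concrete layout was still to be
 * defined in a later revision of this series. */
#include "types.h" // u8

#define REDHAT_CAP_TYPE_QEMU_RESERVE 1 /* assumed identifier value */

struct redhat_vendor_cap_hdr {
    u8 id;   /* PCI_CAP_ID_VNDR */
    u8 next; /* next capability pointer */
    u8 len;  /* total capability length */
    u8 type; /* distinguishes Red Hat vendor-specific capabilities */
};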
On Sun, Jul 23, 2017 at 01:11:49AM +0300, Aleksandr Bezzubikov wrote:
On PCI init PCI bridge devices may need some extra info about bus number to reserve, IO, memory and prefetchable memory limits. QEMU can provide this with special vendor-specific PCI capability.
This capability is intended to be used only for Red Hat PCI bridges, i.e. QEMU cooperation.
Sizes of limits match ones from PCI Type 1 Configuration Space Header, number of buses to reserve occupies only 1 byte since it is the size of Subordinate Bus Number register.
Signed-off-by: Aleksandr Bezzubikov zuban32s@gmail.com
src/hw/pci_cap.h | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) create mode 100644 src/hw/pci_cap.h
diff --git a/src/hw/pci_cap.h b/src/hw/pci_cap.h new file mode 100644 index 0000000..1382b0b --- /dev/null +++ b/src/hw/pci_cap.h @@ -0,0 +1,23 @@ +#ifndef _PCI_CAP_H +#define _PCI_CAP_H
+#include "types.h"
+struct vendor_pci_cap {
- u8 id;
- u8 next;
- u8 len;
+};
Thanks. If you respin this series, please add this header to the src/fw/ directory instead of src/hw/. Also, I'd prefer to avoid a "pci_" prefix on the header as it makes it seem similar to the existing pci_regs.h and pci_ids.h headers which are a bit different - how about src/fw/dev-pci.h instead?
-Kevin
2017-07-23 19:30 GMT+03:00 Kevin O'Connor kevin@koconnor.net:
On Sun, Jul 23, 2017 at 01:11:49AM +0300, Aleksandr Bezzubikov wrote:
On PCI init PCI bridge devices may need some extra info about bus number to reserve, IO, memory and prefetchable memory limits. QEMU can provide this with special vendor-specific PCI capability.
This capability is intended to be used only for Red Hat PCI bridges, i.e. QEMU cooperation.
Sizes of limits match ones from PCI Type 1 Configuration Space Header, number of buses to reserve occupies only 1 byte since it is the size of Subordinate Bus Number register.
Signed-off-by: Aleksandr Bezzubikov zuban32s@gmail.com
src/hw/pci_cap.h | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) create mode 100644 src/hw/pci_cap.h
diff --git a/src/hw/pci_cap.h b/src/hw/pci_cap.h new file mode 100644 index 0000000..1382b0b --- /dev/null +++ b/src/hw/pci_cap.h @@ -0,0 +1,23 @@ +#ifndef _PCI_CAP_H +#define _PCI_CAP_H
+#include "types.h"
+struct vendor_pci_cap {
- u8 id;
- u8 next;
- u8 len;
+};
Thanks.
Thanks for the review.
If you respin this series, please add this header to the src/fw/ directory instead of src/hw/. Also, I'd prefer to avoid a "pci_" prefix on the header as it makes it seem similar to the existing pci_regs.h and pci_ids.h headers which are a bit different - how about src/fw/dev-pci.h instead?
-Kevin
No objections, I will do that in v3 - except for the 'pci_' prefix: it's still a PCI capability, isn't it?
In the case of Red Hat PCI bridges, reserve additional buses; their number is provided in a vendor-specific capability.
Signed-off-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
---
 src/fw/pciinit.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index 864954f..f05a8b9 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -15,6 +15,7 @@
 #include "hw/pcidevice.h" // pci_probe_devices
 #include "hw/pci_ids.h" // PCI_VENDOR_ID_INTEL
 #include "hw/pci_regs.h" // PCI_COMMAND
+#include "hw/pci_cap.h" // qemu_pci_cap
 #include "list.h" // struct hlist_node
 #include "malloc.h" // free
 #include "output.h" // dprintf
@@ -578,9 +579,18 @@ pci_bios_init_bus_rec(int bus, u8 *pci_bus)
         pci_bios_init_bus_rec(secbus, pci_bus);
 
         if (subbus != *pci_bus) {
+            u8 res_bus = 0;
+            if (pci_config_readw(bdf, PCI_VENDOR_ID) == PCI_VENDOR_ID_REDHAT) {
+                u8 cap = pci_find_capability(bdf, PCI_CAP_ID_VNDR, 0);
+                if (cap) {
+                    res_bus = pci_config_readb(bdf,
+                                  cap + offsetof(struct redhat_pci_bridge_cap,
+                                                 bus_res));
+                }
+            }
             dprintf(1, "PCI: subordinate bus = 0x%x -> 0x%x\n",
-                    subbus, *pci_bus);
-            subbus = *pci_bus;
+                    subbus, *pci_bus + res_bus);
+            subbus = *pci_bus + res_bus;
         } else {
             dprintf(1, "PCI: subordinate bus = 0x%x\n", subbus);
         }
On Sun, Jul 23, 2017 at 01:11:50AM +0300, Aleksandr Bezzubikov wrote:
In case of Red Hat PCI bridges reserve additional buses, which number is provided in a vendor-specific capability.
Signed-off-by: Aleksandr Bezzubikov zuban32s@gmail.com
src/fw/pciinit.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c index 864954f..f05a8b9 100644 --- a/src/fw/pciinit.c +++ b/src/fw/pciinit.c @@ -15,6 +15,7 @@ #include "hw/pcidevice.h" // pci_probe_devices #include "hw/pci_ids.h" // PCI_VENDOR_ID_INTEL #include "hw/pci_regs.h" // PCI_COMMAND +#include "hw/pci_cap.h" // qemu_pci_cap #include "list.h" // struct hlist_node #include "malloc.h" // free #include "output.h" // dprintf @@ -578,9 +579,18 @@ pci_bios_init_bus_rec(int bus, u8 *pci_bus) pci_bios_init_bus_rec(secbus, pci_bus);
if (subbus != *pci_bus) {
u8 res_bus = 0;
if (pci_config_readw(bdf, PCI_VENDOR_ID) == PCI_VENDOR_ID_REDHAT) {
Check device ID as well.
u8 cap = pci_find_capability(bdf, PCI_CAP_ID_VNDR, 0);
There could be multiple vendor capabilities. You want to scan them all.
if (cap) {
res_bus = pci_config_readb(bdf,
cap + offsetof(struct redhat_pci_bridge_cap,
bus_res));
You might want to add sanity checks e.g. overflow, and capability length.
Also, if all you use is offsetof, don't bother with a struct, just add some defines.
}
} dprintf(1, "PCI: subordinate bus = 0x%x -> 0x%x\n",
subbus, *pci_bus);
subbus = *pci_bus;
subbus, *pci_bus + res_bus);
subbus = *pci_bus + res_bus;
So you take all present devices and add reserved ones - is that it? If so it looks like this will steal extra buses each time you add a child bus and reboot.
} else { dprintf(1, "PCI: subordinate bus = 0x%x\n", subbus); }
-- 2.7.4
2017-07-23 5:49 GMT+03:00 Michael S. Tsirkin mst@redhat.com:
On Sun, Jul 23, 2017 at 01:11:50AM +0300, Aleksandr Bezzubikov wrote:
In case of Red Hat PCI bridges reserve additional buses, which number is
provided
in a vendor-specific capability.
Signed-off-by: Aleksandr Bezzubikov zuban32s@gmail.com
src/fw/pciinit.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c index 864954f..f05a8b9 100644 --- a/src/fw/pciinit.c +++ b/src/fw/pciinit.c @@ -15,6 +15,7 @@ #include "hw/pcidevice.h" // pci_probe_devices #include "hw/pci_ids.h" // PCI_VENDOR_ID_INTEL #include "hw/pci_regs.h" // PCI_COMMAND +#include "hw/pci_cap.h" // qemu_pci_cap #include "list.h" // struct hlist_node #include "malloc.h" // free #include "output.h" // dprintf @@ -578,9 +579,18 @@ pci_bios_init_bus_rec(int bus, u8 *pci_bus) pci_bios_init_bus_rec(secbus, pci_bus);
if (subbus != *pci_bus) {
u8 res_bus = 0;
if (pci_config_readw(bdf, PCI_VENDOR_ID) ==
PCI_VENDOR_ID_REDHAT) {
Check device ID as well.
Not sure what ID/IDs we want to see here exactly. Now it can be only pcie-root-port (0xC then), but
u8 cap = pci_find_capability(bdf, PCI_CAP_ID_VNDR, 0);
There could be multiple vendor capabilities. You want to scan them all.
if (cap) {
res_bus = pci_config_readb(bdf,
cap + offsetof(struct redhat_pci_bridge_cap,
bus_res));
You might want to add sanity checks e.g. overflow, and capability length.
Also, if all you use is offsetof, don't bother with a struct, just add some defines.
OK, will move this to defines.
}
} dprintf(1, "PCI: subordinate bus = 0x%x -> 0x%x\n",
subbus, *pci_bus);
subbus = *pci_bus;
subbus, *pci_bus + res_bus);
subbus = *pci_bus + res_bus;
So you take all present devices and add reserved ones - is that it? If so it looks like this will steal extra buses each time you add a child bus and reboot.
You're right, that problem exists. I will think about what can be done here.
} else { dprintf(1, "PCI: subordinate bus = 0x%x\n", subbus); }
-- 2.7.4
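To make the review suggestions in this exchange concrete (scan every vendor-specific capability, check the device ID and capability length, and use plain defines instead of offsetof), here is a rough SeaBIOS-style sketch; the QEMU_PCI_CAP_* names, the type byte and its value, and the helper itself are assumptions, not code from this series:

/* Assumed offsets inside the Red Hat vendor-specific capability. */
#define QEMU_PCI_CAP_LEN      2  /* standard vendor cap length byte */
#define QEMU_PCI_CAP_TYPE     3  /* assumed: identifier after id/next/len */
#define QEMU_PCI_CAP_BUS_RES  4  /* assumed: number of buses to reserve */
#define QEMU_PCI_CAP_MIN_LEN  5

#define QEMU_PCI_CAP_TYPE_BUS_RES 1 /* assumed identifier value */

/* Return the number of extra buses a Red Hat bridge asks to reserve,
 * or 0 if no usable capability is found. */
static u8 pci_bridge_get_bus_reserve(u16 bdf)
{
    if (pci_config_readw(bdf, PCI_VENDOR_ID) != PCI_VENDOR_ID_REDHAT)
        return 0;
    /* A device-ID check (for whichever bridge devices actually carry
     * the capability) could be added here, as requested in review. */
    u8 cap = pci_find_capability(bdf, PCI_CAP_ID_VNDR, 0);
    while (cap) {
        u8 len  = pci_config_readb(bdf, cap + QEMU_PCI_CAP_LEN);
        u8 type = pci_config_readb(bdf, cap + QEMU_PCI_CAP_TYPE);
        if (len >= QEMU_PCI_CAP_MIN_LEN && type == QEMU_PCI_CAP_TYPE_BUS_RES)
            return pci_config_readb(bdf, cap + QEMU_PCI_CAP_BUS_RES);
        cap = pci_find_capability(bdf, PCI_CAP_ID_VNDR, cap);
    }
    return 0;
}

The caller in pci_bios_init_bus_rec() would still need to make sure that *pci_bus plus the returned value does not overflow the 0xff bus-number space, which is the overflow check mentioned above.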
On 07/23/17 00:11, Aleksandr Bezzubikov wrote:
Now PCI bridges get a bus range number on a system init, basing on currently plugged devices. That's why when one wants to hotplug another bridge, it needs his child bus, which the parent is unable to provide (speaking about virtual device). The suggested workaround is to have vendor-specific capability in Red Hat PCI bridges that contains number of additional bus to reserve on BIOS PCI init. So this capability is intented only for pure QEMU->SeaBIOS usage.
Considering all aforesaid, this series is directly connected with QEMU RFC series (v2) "Generic PCIE-PCI Bridge".
Although the new PCI capability is supposed to contain various limits along with bus number to reserve, now only its full layout is proposed, but only bus_reserve field is used in QEMU and BIOS. Limits usage is still a subject for implementation as now the main goal of this series to provide necessary support from the firmware side to PCIE-PCI bridge hotplug.
Changes v1->v2:
- New #define for Red Hat vendor added (addresses Konrad's comment).
- Refactored pci_find_capability function (addresses Marcel's comment).
- Capability reworked:
- data type added;
- reserve space in a structure for IO, memory and prefetchable memory limits.
Aleksandr Bezzubikov (4): pci: refactor pci_find_capapibilty to get bdf as the first argument instead of the whole pci_device pci: add RedHat vendor ID pci: add QEMU-specific PCI capability structure pci: enable RedHat PCI bridges to reserve additional buses on PCI init
src/fw/pciinit.c | 18 ++++++++++++++---- src/hw/pci_cap.h | 23 +++++++++++++++++++++++ src/hw/pci_ids.h | 2 ++ src/hw/pcidevice.c | 12 ++++++------ src/hw/pcidevice.h | 2 +- src/hw/virtio-pci.c | 4 ++-- 6 files changed, 48 insertions(+), 13 deletions(-) create mode 100644 src/hw/pci_cap.h
Coming back from PTO, it's hard for me to follow up on all the comments that have been made across the v1 and v2 of this RFC series, so I'll just provide a brain dump here:
(1) Mentioned by Michael: documentation. That's the most important part. I haven't seen the QEMU patches, so perhaps they already include documentation. If not, please start this work by adding a detailed description to QEMU's docs/ or docs/specs/.
There are a number of preexistent documents that might be related, just search docs/ for filenames with "pci" in them.
(2) Bus range reservation, and hotplugging bridges. What's the motivation? Our recommendations in "docs/pcie.txt" suggest flat hierarchies.
If this use case is really necessary, I think it should be covered in "docs/pcie.txt". In particular it has a consequence for PXB as well (search "pcie.txt" for "bus_nr") -- if users employ extra root buses, then the bus number partitions that they specify must account for any bridges that they plan to hot-plug (and for the bus range reservations on the cold-plugged bridges behind those extra root buses).
(3) Regarding the contents and the format of the capability structure, I wrote up my thoughts earlier in
https://bugzilla.redhat.com/show_bug.cgi?id=1434747#c8
Let me quote it here for ease of commenting:
(In reply to Gerd Hoffmann from comment #7)
So, now that the generic ports are there we can go on figure how to handle this best. I still think the best way to communicate window size hints would be to use a vendor specific pci capability (instead of setting the desired size on reset). The information will always be available then and we don't run into initialization order issues.
This seems good to me -- I can't promise 100% without actually trying, but I think I should be able to parse the capability list in config space for this hint, in the GetResourcePadding() callback.
I propose that we try to handle this issue "holistically", together with bug 1434740. We need a method that provides controls for both IO and MMIO:
For IO, we need a mechanism that can prevent *both* firmware *and* Linux from reserving IO for PCI Express ports. I think Marcel's approach in bug 1344299 is sufficient, i.e., set the IO base/limit registers of the bridge to 0 for disabling IO support. And, if not disabled, just go with the default 4KB IO reservation (for both PCI Express ports and legacy PCI bridges, as the latter is documented in the guidelines).
For MMIO, the vendor specific capability structure should work something like this:
if the capability is missing, reserve 2MB, 32-bit, non-prefetchable,
otherwise, the capability structure should consist of 3 fields (reservation sizes):
- uint32_t non_prefetchable_32,
- uint32_t prefetchable_32,
- uint64_t prefetchable_64,
of prefetchable_32 and prefetchable_64, at most one may be nonzero (they are mutually exclusive, and they can be both zero),
whenever a field is 0, that kind of reservation is not needed.
This would now have to be extended with bus range reservation.
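Expressed as a C struct (using SeaBIOS-style types purely for illustration; the field names and exact packing are not settled anywhere in this thread), the layout described above, extended with the bus range reservation, might look like:

struct qemu_bridge_reserve_cap_sketch {
    u8  id;                   /* PCI_CAP_ID_VNDR */
    u8  next;
    u8  len;
    u8  bus_reserve;          /* extra subordinate buses to reserve */
    u32 non_prefetchable_32;  /* non-prefetchable MMIO to reserve, 32-bit */
    u32 prefetchable_32;      /* prefetchable MMIO, 32-bit; mutually      */
    u64 prefetchable_64;      /* exclusive with prefetchable_64           */
};
/* A zero field means "no reservation of that kind"; if the capability is
 * absent altogether, the default above (2MB, 32-bit, non-prefetchable)
 * would apply. */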
(4) Whether the reservation size should be absolute or relative (raised by Gerd). IIUC, Gerd suggests that the absolute aperture size should be specified (as a minimum), not the increment / reservation for hotplug purposes.
The Platform Initialization Specification, v1.6, downloadable at http://www.uefi.org/specs, writes the following under
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
in whose implementation I will have to parse the values from the capability structure, and return the appropriate representation to the platform-independent PciBusDxe driver (i.e., the enumeration / allocation agent):
The padding is returned in the form of ACPI (2.0 & 3.0) resource descriptors. The exact definition of each of the fields is the same as in the EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL.SubmitResources() function. See the section 10.8 for the definition of this function.
The PCI bus driver is responsible for adding this resource request to the resource requests by the physical PCI devices. If Attributes is EfiPaddingPciBus, the padding takes effect at the PCI bus level. If Attributes is EfiPaddingPciRootBridge, the required padding takes effect at the root bridge level. For details, see the definition of EFI_HPC_PADDING_ATTRIBUTES in "Related Definitions" below.
Emphasis on "*adding* this resource request to the resource requests by the physical PCI devices".
However... After checking some OVMF logs, it seems that in practice, PciBusDxe does exactly what Gerd suggests.
Currently OVMF returns a constant 2MB MMIO padding and a constant 512B IO padding for all hotplug-capable bridges (including PCI Express root ports), from GetResourcePadding(). For example, for the following QEMU command line fragment:
-device pxb-pcie,bus_nr=64,id=rootbr1,numa_node=1,bus=pcie.0,addr=0x3 \ \ -device ioh3420,id=rootbr1-port1,bus=rootbr1,addr=0x1,slot=3 \ -device e1000e,bus=rootbr1-port1,netdev=net0 \ \ -device ioh3420,id=rootbr1-port2,bus=rootbr1,addr=0x2,slot=4 \ -device e1000e,bus=rootbr1-port2,netdev=net1 \
PciBusDxe produces the following log (extract):
PciBus: Resource Map for Root Bridge PciRoot(0x40)
Type = Io16; Base = 0x8000; Length = 0x2000; Alignment = 0xFFF
   Base = 0x8000; Length = 0x1000; Alignment = 0xFFF; Owner = PPB [40|02|00:**]
   Base = 0x9000; Length = 0x1000; Alignment = 0xFFF; Owner = PPB [40|01|00:**]
Type = Mem32; Base = 0x98600000; Length = 0x400000; Alignment = 0x1FFFFF
   Base = 0x98600000; Length = 0x200000; Alignment = 0x1FFFFF; Owner = PPB [40|02|00:**]
   Base = 0x98800000; Length = 0x200000; Alignment = 0x1FFFFF; Owner = PPB [40|01|00:**]

PciBus: Resource Map for Bridge [40|01|00]
Type = Io16; Base = 0x9000; Length = 0x1000; Alignment = 0xFFF
   Base = Padding; Length = 0x200; Alignment = 0x1FF
   Base = 0x9000; Length = 0x20; Alignment = 0x1F; Owner = PCI [41|00|00:18]
Type = Mem32; Base = 0x98800000; Length = 0x200000; Alignment = 0x1FFFFF
   Base = Padding; Length = 0x200000; Alignment = 0x1FFFFF
   Base = 0x98800000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [41|00|00:14]
   Base = 0x98820000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [41|00|00:10]
   Base = 0x98840000; Length = 0x4000; Alignment = 0x3FFF; Owner = PCI [41|00|00:1C]

PciBus: Resource Map for Bridge [40|02|00]
Type = Io16; Base = 0x8000; Length = 0x1000; Alignment = 0xFFF
   Base = Padding; Length = 0x200; Alignment = 0x1FF
   Base = 0x8000; Length = 0x20; Alignment = 0x1F; Owner = PCI [42|00|00:18]
Type = Mem32; Base = 0x98600000; Length = 0x200000; Alignment = 0x1FFFFF
   Base = Padding; Length = 0x200000; Alignment = 0x1FFFFF
   Base = 0x98600000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [42|00|00:14]
   Base = 0x98620000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [42|00|00:10]
   Base = 0x98640000; Length = 0x4000; Alignment = 0x3FFF; Owner = PCI [42|00|00:1C]
This means that
- both ioh3420 root ports get a padding of 512B for IO and 2MB for MMIO,
- the actual resources for each e1000e NIC on the respective ioh3420 root port are allocated *within* the padding,
- because each MMIO padding is larger than the actual resource needs of the respective e1000e NIC, it is the MMIO paddings that are ultimately considered on the root bridge,
- additionally, the 0x200 IO paddings are rounded up to 0x1000 on the root bridge.
So, it seems that PciBusDxe
- conflicts with the Platform Initialization spec, but
- at the same time it seems to do what Gerd is suggesting.
(5) Returning to bus range reservation... This looks problematic with edk2.
* Some background first:
In the Platform Init spec, two kinds of hotplug controllers are distinguished: "root" and "non-root". This naming is confusing because in this context, "root" simply means
cannot be found by normal enumeration, or needs special initialization when found before recursing it
whereas "non-root" means
can be found by normal enumeration and needs no special initialization before recursing it
In this context, QEMU does not provide any "root" hotplug controllers, because all the controllers that support hotplug -- i.e., can *accept* a hot-plugged device -- *can* be found with enumeration, and need no platform-specific initialization callback:
- PCI Express root ports
- PCI Express downstream ports
- PCI-PCI Bridges
Now, if a platform provides "root" (in the above context) hotplug controllers, then the platform's
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetRootHpcList()
method has to return them. (Importantly, this function has to return the list of "root HPCs" *without* accessing PCI config space, using "apriori" knowledge.) Accordingly to the QEMU situation, OVMF returns an empty list here.
* Why this is a problem:
For "non-root" (in the above sense) hotplug controllers, PciBusDxe considers the IO and MMIO paddings, but it does *not* consider bus number paddings, even if EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() returns some. (As far as I can see from the source -- I have never tested it, obviously.) For QEMU, this means all of the hotplug controllers.
For "root" (in the above sense) hotplug controllers, PciBusDxe considers bus number paddings. However, on QEMU, this set of controllers is empty.
The PI spec says,
[...] For all the root HPCs and the nonroot HPCs, call EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() to obtain the amount of overallocation and add that amount to the requests from the physical devices. Reprogram the bus numbers by taking into account the bus resource padding information. [...]
However, according to my interpretation of the source code, PciBusDxe does not consider bus number padding for non-root HPCs (which are "all" HPCs on QEMU).
So, I agree that we can add a bus number range field to the capability, but I'm fairly sure it won't work with edk2's PciBusDxe without major surgery. (Given that we haven't ever considered hot-plugging entire bridges until now, i.e., anything that would require bus number reservation, I think we can live with such a limitation when using OVMF, for an indefinite time to come.)
Thanks Laszlo
On 25/07/2017 18:46, Laszlo Ersek wrote:
On 07/23/17 00:11, Aleksandr Bezzubikov wrote:
Now PCI bridges get a bus range number on a system init, basing on currently plugged devices. That's why when one wants to hotplug another bridge, it needs his child bus, which the parent is unable to provide (speaking about virtual device). The suggested workaround is to have vendor-specific capability in Red Hat PCI bridges that contains number of additional bus to reserve on BIOS PCI init. So this capability is intented only for pure QEMU->SeaBIOS usage.
Considering all aforesaid, this series is directly connected with QEMU RFC series (v2) "Generic PCIE-PCI Bridge".
Although the new PCI capability is supposed to contain various limits along with bus number to reserve, now only its full layout is proposed, but only bus_reserve field is used in QEMU and BIOS. Limits usage is still a subject for implementation as now the main goal of this series to provide necessary support from the firmware side to PCIE-PCI bridge hotplug.
Changes v1->v2:
- New #define for Red Hat vendor added (addresses Konrad's comment).
- Refactored pci_find_capability function (addresses Marcel's comment).
- Capability reworked:
- data type added;
- reserve space in a structure for IO, memory and prefetchable memory limits.
Aleksandr Bezzubikov (4): pci: refactor pci_find_capapibilty to get bdf as the first argument instead of the whole pci_device pci: add RedHat vendor ID pci: add QEMU-specific PCI capability structure pci: enable RedHat PCI bridges to reserve additional buses on PCI init
src/fw/pciinit.c | 18 ++++++++++++++---- src/hw/pci_cap.h | 23 +++++++++++++++++++++++ src/hw/pci_ids.h | 2 ++ src/hw/pcidevice.c | 12 ++++++------ src/hw/pcidevice.h | 2 +- src/hw/virtio-pci.c | 4 ++-- 6 files changed, 48 insertions(+), 13 deletions(-) create mode 100644 src/hw/pci_cap.h
Coming back from PTO, it's hard for me to follow up on all the comments that have been made across the v1 and v2 of this RFC series, so I'll just provide a brain dump here:
Hi Laszlo, Thanks for the review.
(1) Mentioned by Michael: documentation. That's the most important part. I haven't seen the QEMU patches, so perhaps they already include documentation. If not, please start this work with adding a detailed description do QEMU's docs/ or docs/specs/.
I agree, it is time. Aleksandr, please be sure to document the PCIE-PCI bridge in docs/pcie.txt.
There are a number of preexistent documents that might be related, just search docs/ for filenames with "pci" in them.
(2) Bus range reservation, and hotplugging bridges. What's the motivation? Our recommendations in "docs/pcie.txt" suggest flat hierarchies.
It remains flat. You have one single PCIE-PCI bridge plugged into a PCIe Root Port, no deep nesting.
The reason is to be able to support legacy PCI devices without "committing" to a DMI-PCI bridge in advance (keeping Q35 free of legacy hw).
The only way to support PCI devices in Q35 is to have them cold-plugged into the pcie.0 bus, which is good, but not enough for expanding the Q35 usability in order to make it eventually the default QEMU x86 machine (I know this is another discussion and I am in minority, at least for now).
The plan is: start the Q35 machine as usual, but one of the PCIe Root Ports includes hints for the firmware needed to support legacy PCI devices (IO port range, extra buses, ...).
Once a PCI device is needed, you have 2 options:
1. Cold-plug a PCIe-PCI bridge into a PCIe Root Port and plug the PCI device into the bridge.
2. Hotplug a PCIe-PCI bridge into a PCIe Root Port and then hotplug a PCI device into the bridge.
If this use case is really necessary, I think it should be covered in "docs/pcie.txt". In particular it has a consequence for PXB as well (search "pcie.txt" for "bus_nr") -- if users employ extra root buses, then the bus number partitions that they specify must account for any bridges that they plan to hot-plug (and for the bus range reservations on the cold-plugged bridges behind those extra root buses).
Agreed about the doc.
(3) Regarding the contents and the format of the capability structure, I wrote up my thoughts earlier in
https://bugzilla.redhat.com/show_bug.cgi?id=1434747#c8
Let me quote it here for ease of commenting:
(In reply to Gerd Hoffmann from comment #7)
So, now that the generic ports are there we can go on figure how to handle this best. I still think the best way to communicate window size hints would be to use a vendor specific pci capability (instead of setting the desired size on reset). The information will always be available then and we don't run into initialization order issues.
This seems good to me -- I can't promise 100% without actually trying, but I think I should be able to parse the capability list in config space for this hint, in the GetResourcePadding() callback.
I propose that we try to handle this issue "holistically", together with bug 1434740. We need a method that provides controls for both IO and MMIO:
- For IO, we need a mechanism that can prevent *both* firmware *and* Linux from reserving IO for PCI Express ports. I think Marcel's approach in bug 1344299 is sufficient, i.e., set the IO base/limit registers of the bridge to 0 for disabling IO support. And, if not disabled, just go with the default 4KB IO reservation (for both PCI Express ports and legacy PCI bridges, as the latter is documented in the guidelines).
I have some patches on this; however, I remember guests misbehaving, though maybe it is a bug on my part... However, we can use the IO hint to reserve IO windows smaller than 4KB.
For MMIO, the vendor specific capability structure should work something like this:
if the capability is missing, reserve 2MB, 32-bit, non-prefetchable,
otherwise, the capability structure should consist of 3 fields (reservation sizes):
- uint32_t non_prefetchable_32,
- uint32_t prefetchable_32,
- uint64_t prefetchable_64,
of prefetchable_32 and prefetchable_64, at most one may be nonzero (they are mutually exclusive, and they can be both zero),
whenever a field is 0, that kind of reservation is not needed.
This would now have to be extended with bus range reservation.
Agreed, please be aware that as part of Alexander's project he will prepare the capability, but implement only the bus-numbers reservation part.
(4) Whether the reservation size should be absolute or relative (raised by Gerd). IIUC, Gerd suggests that the absolute aperture size should be specified (as a minimum), not the increment / reservation for hotplug purposes.
The Platform Initialization Specification, v1.6, downloadable at http://www.uefi.org/specs, writes the following under
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
in whose implementation I will have to parse the values from the capability structure, and return the appropriate representation to the platform-independent PciBusDxe driver (i.e., the enumeration / allocation agent):
The padding is returned in the form of ACPI (2.0 & 3.0) resource descriptors. The exact definition of each of the fields is the same as in the EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL.SubmitResources() function. See the section 10.8 for the definition of this function.
The PCI bus driver is responsible for adding this resource request to the resource requests by the physical PCI devices. If Attributes is EfiPaddingPciBus, the padding takes effect at the PCI bus level. If Attributes is EfiPaddingPciRootBridge, the required padding takes effect at the root bridge level. For details, see the definition of EFI_HPC_PADDING_ATTRIBUTES in "Related Definitions" below.
Emphasis on "*adding* this resource request to the resource requests by the physical PCI devices".
However... After checking some OVMF logs, it seems that in practice, PciBusDxe does exactly what Gerd suggests.
Currently OVMF returns a constant 2MB MMIO padding and a constant 512B IO padding for all hotplug-capable bridges (including PCI Express root ports), from GetResourcePadding(). For example, for the following QEMU command line fragment:
-device pxb-pcie,bus_nr=64,id=rootbr1,numa_node=1,bus=pcie.0,addr=0x3 \ \ -device ioh3420,id=rootbr1-port1,bus=rootbr1,addr=0x1,slot=3 \ -device e1000e,bus=rootbr1-port1,netdev=net0 \ \ -device ioh3420,id=rootbr1-port2,bus=rootbr1,addr=0x2,slot=4 \ -device e1000e,bus=rootbr1-port2,netdev=net1 \
PciBusDxe produces the following log (extract):
PciBus: Resource Map for Root Bridge PciRoot(0x40) Type = Io16; Base = 0x8000; Length = 0x2000; Alignment = 0xFFF Base = 0x8000; Length = 0x1000; Alignment = 0xFFF; Owner = PPB [40|02|00:**] Base = 0x9000; Length = 0x1000; Alignment = 0xFFF; Owner = PPB [40|01|00:**] Type = Mem32; Base = 0x98600000; Length = 0x400000; Alignment = 0x1FFFFF Base = 0x98600000; Length = 0x200000; Alignment = 0x1FFFFF; Owner = PPB [40|02|00:**] Base = 0x98800000; Length = 0x200000; Alignment = 0x1FFFFF; Owner = PPB [40|01|00:**]
PciBus: Resource Map for Bridge [40|01|00] Type = Io16; Base = 0x9000; Length = 0x1000; Alignment = 0xFFF Base = Padding; Length = 0x200; Alignment = 0x1FF Base = 0x9000; Length = 0x20; Alignment = 0x1F; Owner = PCI [41|00|00:18] Type = Mem32; Base = 0x98800000; Length = 0x200000; Alignment = 0x1FFFFF Base = Padding; Length = 0x200000; Alignment = 0x1FFFFF Base = 0x98800000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [41|00|00:14] Base = 0x98820000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [41|00|00:10] Base = 0x98840000; Length = 0x4000; Alignment = 0x3FFF; Owner = PCI [41|00|00:1C]
PciBus: Resource Map for Bridge [40|02|00] Type = Io16; Base = 0x8000; Length = 0x1000; Alignment = 0xFFF Base = Padding; Length = 0x200; Alignment = 0x1FF Base = 0x8000; Length = 0x20; Alignment = 0x1F; Owner = PCI [42|00|00:18] Type = Mem32; Base = 0x98600000; Length = 0x200000; Alignment = 0x1FFFFF Base = Padding; Length = 0x200000; Alignment = 0x1FFFFF Base = 0x98600000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [42|00|00:14] Base = 0x98620000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [42|00|00:10] Base = 0x98640000; Length = 0x4000; Alignment = 0x3FFF; Owner = PCI [42|00|00:1C]
This means that
- both ioh3420 root ports get a padding of 512B for IO and 2MB fo MMIO,
So the 512B of IO will be reserved even if IO handling is not enabled in the bridge? (I am asking because I might have a similar issue with other guest code.)
- the actual resources for each e1000e NIC on the respective ioh3420 root port are allocated *within* the padding,
- because each MMIO padding is larger than the actual resource needs of the respective e1000e NIC, it is the MMIO paddings that are ultimately considered on the root bridge,
- additionally, the 0x200 IO paddings are rounded up to 0x1000 on the root bridge.
So, it seems that PciBusDxe
- conflicts with the Platform Initialization spec, but
- at the same time it seems to do what Gerd is suggesting.
(5) Returning to bus range reservation... This looks problematic with edk2.
- Some background first:
In the Platform Init spec, two kinds of hotplug controllers are distinguished: "root" and "non-root". This naming is confusing because in this context, "root" simply means
cannot be found by normal enumeration, or needs special initialization when found before recursing it
whereas "non-root" means
can be found by normal enumeration and needs no special initialization before recursing it
In this context, QEMU does not provide any "root" hotplug controllers, because all the controllers that support hotplug -- i.e., can *accept* a hot-plugged device -- *can* be found with enumeration, and need no platform-specific initialization callback:
- PCI Express root ports
- PCI Express downstream ports
- PCI-PCI Bridges
Now, if a platform provides "root" (in the above context) hotplug controllers, then the platform's
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetRootHpcList()
method has to return them. (Importantly, this function has to return the list of "root HPCs" *without* accessing PCI config space, using "apriori" knowledge.) Accordingly to the QEMU situation, OVMF returns an empty list here.
- Why this is a problem:
For "non-root" (in the above sense) hotplug controllers, PciBusDxe considers the IO and MMIO paddings, but it does *not* consider bus number paddings, even if EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() returns some. (As far as I can see from the source -- I have never tested it, obviously.) For QEMU, this means all of the hotplug controllers.
For "root" (in the above sense) hotplug controllers, PciBusDxe considers bus number paddings. However, on QEMU, this set of controllers is empty.
The PI spec says,
[...] For all the root HPCs and the nonroot HPCs, call EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() to obtain the amount of overallocation and add that amount to the requests from the physical devices. Reprogram the bus numbers by taking into account the bus resource padding information. [...]
However, according to my interpretation of the source code, PciBusDxe does not consider bus number padding for non-root HPCs (which are "all" HPCs on QEMU).
Theoretically speaking, it is possible to change the behavior, right?
So, I agree that we can add a bus number range field to the capability, but I'm fairly sure it won't work with edk2's PciBusDxe without major surgery. (Given that we haven't ever considered hot-plugging entire bridges until now, i.e., anything that would require bus number reservation, I think we can live with such a limitation when using OVMF, for an indefinite time to come.)
Since OVMF will still support cold-plug, I think it is OK for now. Once the feature gets into SeaBIOS I will open a BZ for tracking.
Thanks, Marcel
Thanks Laszlo
On 07/26/17 08:48, Marcel Apfelbaum wrote:
On 25/07/2017 18:46, Laszlo Ersek wrote:
[snip]
(2) Bus range reservation, and hotplugging bridges. What's the motivation? Our recommendations in "docs/pcie.txt" suggest flat hierarchies.
It remains flat. You have one single PCIE-PCI bridge plugged into a PCIe Root Port, no deep nesting.
The reason is to be able to support legacy PCI devices without "committing" with a DMI-PCI bridge in advance. (Keep Q35 without) legacy hw.
The only way to support PCI devices in Q35 is to have them cold-plugged into the pcie.0 bus, which is good, but not enough for expanding the Q35 usability in order to make it eventually the default QEMU x86 machine (I know this is another discussion and I am in minority, at least for now).
The plan is: Start Q35 machine as usual, but one of the PCIe Root Ports includes hints for firmware needed t support legacy PCI devices. (IO Ports range, extra bus,...)
Once a pci device is needed you have 2 options:
- Plug a PCIe-PCI bridge into a PCIe Root Port and the PCI device in the bridge.
- Hotplug a PCIe-PCI bridge into a PCIe Root Port and then hotplug a PCI device into the bridge.
Thank you for the explanation, it makes the intent a lot clearer.
However, what does the hot-pluggability of the PCIe-PCI bridge buy us? In other words, what does it buy us when we do not add the PCIe-PCI bridge immediately at guest startup, as an integrated device?
Why is it a problem to "commit" in advance? I understand that we might not like the DMI-PCI bridge (due to it being legacy), but what speaks against cold-plugging the PCIe-PCI bridge either as an integrated device in pcie.0 (assuming that is permitted), or cold-plugging the PCIe-PCI bridge in a similarly cold-plugged PCIe root port?
I mean, in the cold-plugged case, you use up two bus numbers at the most, one for the root port, and another for the PCIe-PCI bridge. In the hot-plugged case, you have to start with the cold-plugged root port just the same (so that you can communicate the bus number reservation *at all*), and then reserve (= use up in advance) the bus number, the IO space, and the MMIO space(s). I don't see the difference; hot-plugging the PCIe-PCI bridge (= not committing in advance) doesn't seem to save any resources.
I guess I would see a difference if we reserved more than one bus number in the hotplug case, namely in order to support recursive hotplug under the PCIe-PCI bridge. But, you confirmed that we intend to keep the flat hierarchy (ie the exercise is only for enabling legacy PCI endpoints, not for recursive hotplug). The PCIe-PCI bridge isn't a device that does anything at all on its own, so why not just coldplug it? Its resources have to be reserved in advance anyway.
So, thus far I would say "just cold-plug the PCIe-PCI bridge at startup, possibly even make it an integrated device, and then you don't need to reserve bus numbers (and other apertures)".
Where am I wrong?
[snip]
(4) Whether the reservation size should be absolute or relative (raised by Gerd). IIUC, Gerd suggests that the absolute aperture size should be specified (as a minimum), not the increment / reservation for hotplug purposes.
The Platform Initialization Specification, v1.6, downloadable at http://www.uefi.org/specs, writes the following under
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
in whose implementation I will have to parse the values from the capability structure, and return the appropriate representation to the platform-independent PciBusDxe driver (i.e., the enumeration / allocation agent):
The padding is returned in the form of ACPI (2.0 & 3.0) resource descriptors. The exact definition of each of the fields is the same as in the EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL.SubmitResources() function. See the section 10.8 for the definition of this function.
The PCI bus driver is responsible for adding this resource request to the resource requests by the physical PCI devices. If Attributes is EfiPaddingPciBus, the padding takes effect at the PCI bus level. If Attributes is EfiPaddingPciRootBridge, the required padding takes effect at the root bridge level. For details, see the definition of EFI_HPC_PADDING_ATTRIBUTES in "Related Definitions" below.
Emphasis on "*adding* this resource request to the resource requests by the physical PCI devices".
However... After checking some OVMF logs, it seems that in practice, PciBusDxe does exactly what Gerd suggests.
Currently OVMF returns a constant 2MB MMIO padding and a constant 512B IO padding for all hotplug-capable bridges (including PCI Express root ports), from GetResourcePadding(). For example, for the following QEMU command line fragment:
-device pxb-pcie,bus_nr=64,id=rootbr1,numa_node=1,bus=pcie.0,addr=0x3 \ \ -device ioh3420,id=rootbr1-port1,bus=rootbr1,addr=0x1,slot=3 \ -device e1000e,bus=rootbr1-port1,netdev=net0 \ \ -device ioh3420,id=rootbr1-port2,bus=rootbr1,addr=0x2,slot=4 \ -device e1000e,bus=rootbr1-port2,netdev=net1 \
PciBusDxe produces the following log (extract):
PciBus: Resource Map for Root Bridge PciRoot(0x40) Type = Io16; Base = 0x8000; Length = 0x2000; Alignment = 0xFFF Base = 0x8000; Length = 0x1000; Alignment = 0xFFF; Owner = PPB [40|02|00:**] Base = 0x9000; Length = 0x1000; Alignment = 0xFFF; Owner = PPB [40|01|00:**] Type = Mem32; Base = 0x98600000; Length = 0x400000; Alignment = 0x1FFFFF Base = 0x98600000; Length = 0x200000; Alignment = 0x1FFFFF; Owner = PPB [40|02|00:**] Base = 0x98800000; Length = 0x200000; Alignment = 0x1FFFFF; Owner = PPB [40|01|00:**]
PciBus: Resource Map for Bridge [40|01|00] Type = Io16; Base = 0x9000; Length = 0x1000; Alignment = 0xFFF Base = Padding; Length = 0x200; Alignment = 0x1FF Base = 0x9000; Length = 0x20; Alignment = 0x1F; Owner = PCI [41|00|00:18] Type = Mem32; Base = 0x98800000; Length = 0x200000; Alignment = 0x1FFFFF Base = Padding; Length = 0x200000; Alignment = 0x1FFFFF Base = 0x98800000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [41|00|00:14] Base = 0x98820000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [41|00|00:10] Base = 0x98840000; Length = 0x4000; Alignment = 0x3FFF; Owner = PCI [41|00|00:1C]
PciBus: Resource Map for Bridge [40|02|00] Type = Io16; Base = 0x8000; Length = 0x1000; Alignment = 0xFFF Base = Padding; Length = 0x200; Alignment = 0x1FF Base = 0x8000; Length = 0x20; Alignment = 0x1F; Owner = PCI [42|00|00:18] Type = Mem32; Base = 0x98600000; Length = 0x200000; Alignment = 0x1FFFFF Base = Padding; Length = 0x200000; Alignment = 0x1FFFFF Base = 0x98600000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [42|00|00:14] Base = 0x98620000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [42|00|00:10] Base = 0x98640000; Length = 0x4000; Alignment = 0x3FFF; Owner = PCI [42|00|00:1C]
This means that
- both ioh3420 root ports get a padding of 512B for IO and 2MB fo MMIO,
So the 512 IO will be reserved even if the IO handling is not enabled in the bridge? (I am asking because a similar issue I might have with other guest code.)
I apologize for being unclear about the "distribution of work" between edk2's generic PciBusDxe driver and OVMF's platform-specific PciHotPlugInitDxe driver (which provides the EFI_PCI_HOT_PLUG_INIT_PROTOCOL, of which GetResourcePadding() is a member function).
Basically, PciBusDxe is a platform-independent, universal driver, that calls into "hooks", optionally provided by the platform, at specific points in the enumeration / allocation.
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() is such a platform specific hook, and OVMF is "reasonably free" to provide PciBusDxe with reservation hints from it, as OVMF sees fit, for any specific hotplug controller. (Of course OVMF's freedom is limited by two factors: first by the information that QEMU exposes to firmware, and second by the "imperative" that in GetResourcePadding() we really shouldn't look at any *other* PCI(e) controllers than the exact hotplug controller that PciBusDxe is asking about.)
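(For reference, the hook in question has roughly the following shape -- quoted from memory of the PI spec / edk2's MdePkg PciHotPlugInit.h, so treat the exact typedef name as an assumption rather than a verbatim citation:)

typedef
EFI_STATUS
(EFIAPI *EFI_PCI_HOT_PLUG_INIT_GET_RESOURCE_PADDING)(
  IN  EFI_PCI_HOT_PLUG_INIT_PROTOCOL  *This,
  IN  EFI_DEVICE_PATH_PROTOCOL        *HpcDevicePath,
  IN  UINT64                          HpcPciAddress,
  OUT EFI_HPC_STATE                   *HpcState,
  OUT VOID                            **Padding,    /* ACPI resource descriptors */
  OUT EFI_HPC_PADDING_ATTRIBUTES      *Attributes   /* bus level vs. root bridge level */
  );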
With that in mind:
*Currently*, OVMF advises PciBusDxe to "reserve 512B of IO space" for *any* hotplug controller. (From the above log, we can see that on the root bridge, this 512B is then rounded up to 4KB.)
*In the future*, OVMF should replace the constant 512B with the following logic: consult the IO base/limit registers of the subject hotplug controller, and if they are zero, then return "reserve no IO space for this hotplug controller" to PciBusDxe. If the base/limit registers are nonzero, OVMF would say "reserve 4KB IO space", or even propagate the value from the capability.
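(A minimal sketch of that future decision, in C -- SuggestedIoPadding() and its inputs are hypothetical; IoBase/IoLimit stand for the values read from the hotplug bridge's config space at offsets 0x1C/0x1D, and the descriptor plumbing around the result is elided:)

/* Sketch only: the "future" IO padding heuristic described above. */
STATIC UINT64
SuggestedIoPadding (
  IN UINT8  IoBase,   /* bridge config space offset 0x1C */
  IN UINT8  IoLimit   /* bridge config space offset 0x1D */
  )
{
  if (IoBase == 0 && IoLimit == 0) {
    return 0;         /* IO disabled on this hotplug controller: reserve nothing */
  }
  return SIZE_4KB;    /* otherwise reserve one minimal IO window (4KB) */
}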
[snip]
The PI spec says,
[...] For all the root HPCs and the nonroot HPCs, call EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() to obtain the amount of overallocation and add that amount to the requests from the physical devices. Reprogram the bus numbers by taking into account the bus resource padding information. [...]
However, according to my interpretation of the source code, PciBusDxe does not consider bus number padding for non-root HPCs (which are "all" HPCs on QEMU).
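(For concreteness, if PciBusDxe did honor it, bus number padding would travel in the same form as the IO/MMIO padding: an ACPI address space descriptor with a bus-type resource. A purely illustrative example, using edk2's EFI_ACPI_ADDRESS_SPACE_DESCRIPTOR and omitting the terminating end-tag descriptor:)

/* Illustrative only: "reserve one extra bus behind this hotplug bridge". */
STATIC EFI_ACPI_ADDRESS_SPACE_DESCRIPTOR mBusPadding = {
  0x8A,                        /* Desc: QWORD address space descriptor */
  0x2B,                        /* Len                                  */
  ACPI_ADDRESS_SPACE_TYPE_BUS, /* ResType: bus number range            */
  0,                           /* GenFlag                              */
  0,                           /* SpecificFlag                         */
  0,                           /* AddrSpaceGranularity                 */
  0,                           /* AddrRangeMin                         */
  0,                           /* AddrRangeMax                         */
  0,                           /* AddrTranslationOffset                */
  1                            /* AddrLen: one additional bus          */
};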
Theoretically speaking, it is possible to change the behavior, right?
Not just theoretically; in the past I have changed PciBusDxe -- it wouldn't identify QEMU's hotplug controllers (root port, downstream port etc) appropriately, and I managed to get some patches in. It's just that the less we understand the current code and the more intrusive/extensive the change is, the harder it is to sell the *idea*. PciBusDxe is platform-independent and shipped on many a physical system too.
So, I agree that we can add a bus number range field to the capability, but I'm fairly sure it won't work with edk2's PciBusDxe without major surgery. (Given that we haven't ever considered hot-plugging entire bridges until now, i.e., anything that would require bus number reservation, I think we can live with such a limitation when using OVMF, for an indefinite time to come.)
Since OVMF will still support cold-plug, I think it is OK for now. Once the feature gets into SeaBIOS I will open a BZ for tracking.
Please do that (at that time); it will certainly need a dedicated discussion on edk2-devel.
(Also, your statement about cold-plug being viable for at least a while is consistent with my questions / implications near the top: what does the hot-pluggability of the PCIe-PCI bridge buy us ultimately?)
Thanks! Laszlo
On 26/07/2017 18:20, Laszlo Ersek wrote:
On 07/26/17 08:48, Marcel Apfelbaum wrote:
On 25/07/2017 18:46, Laszlo Ersek wrote:
[snip]
(2) Bus range reservation, and hotplugging bridges. What's the motivation? Our recommendations in "docs/pcie.txt" suggest flat hierarchies.
It remains flat. You have one single PCIE-PCI bridge plugged into a PCIe Root Port, no deep nesting.
The reason is to be able to support legacy PCI devices without "committing" to a DMI-PCI bridge in advance (keeping Q35 without legacy hw).
The only way to support PCI devices in Q35 is to have them cold-plugged into the pcie.0 bus, which is good, but not enough for expanding the Q35 usability in order to make it eventually the default QEMU x86 machine (I know this is another discussion and I am in minority, at least for now).
The plan is: start the Q35 machine as usual, but one of the PCIe Root Ports includes hints for the firmware needed to support legacy PCI devices (IO port range, extra bus, ...).
Once a PCI device is needed, you have 2 options:
- Plug a PCIe-PCI bridge into a PCIe Root Port and the PCI device into the bridge (a command-line sketch follows after these options).
- Hotplug a PCIe-PCI bridge into a PCIe Root Port and then hotplug a PCI device into the bridge.
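For illustration, the cold-plug variant (the first option above) could look roughly like this on the QEMU command line; the pcie-pci-bridge device name is the one proposed in the "Generic PCIE-PCI Bridge" series, and the ids plus the e1000 endpoint are made up for the example:

-device ioh3420,id=rp1,bus=pcie.0,addr=0x4,slot=5 \
-device pcie-pci-bridge,id=legacy-br,bus=rp1 \
-device e1000,bus=legacy-br,addr=0x1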
Hi Laszlo,
Thank you for the explanation, it makes the intent a lot clearer.
However, what does the hot-pluggability of the PCIe-PCI bridge buy us? In other words, what does it buy us when we do not add the PCIe-PCI bridge immediately at guest startup, as an integrated device?
Why is it a problem to "commit" in advance? I understand that we might
not like the DMI-PCI bridge (due to it being legacy), but what speaks against cold-plugging the PCIe-PCI bridge either as an integrated device in pcie.0 (assuming that is permitted), or cold-plugging the PCIe-PCI bridge in a similarly cold-plugged PCIe root port?
We want to keep Q35 clean, and for most cases we don't want any legacy PCI stuff if not especially required.
I mean, in the cold-plugged case, you use up two bus numbers at the most, one for the root port, and another for the PCIe-PCI bridge. In the hot-plugged case, you have to start with the cold-plugged root port just the same (so that you can communicate the bus number reservation *at all*), and then reserve (= use up in advance) the bus number, the IO space, and the MMIO space(s). I don't see the difference; hot-plugging the PCIe-PCI bridge (= not committing in advance) doesn't seem to save any resources.
It's not about resources, more about the usage model.
I guess I would see a difference if we reserved more than one bus number in the hotplug case, namely in order to support recursive hotplug under the PCIe-PCI bridge. But, you confirmed that we intend to keep the flat hierarchy (ie the exercise is only for enabling legacy PCI endpoints, not for recursive hotplug). The PCIe-PCI bridge isn't a device that does anything at all on its own, so why not just coldplug it? Its resources have to be reserved in advance anyway.
Even if we prefer flat hierarchies, we should allow a sane nested-bridge configuration, so we will sometimes reserve more than one.
So, thus far I would say "just cold-plug the PCIe-PCI bridge at startup, possibly even make it an integrated device, and then you don't need to reserve bus numbers (and other apertures)".
Where am I wrong?
Nothing wrong, I am just looking for feature parity between Q35 and PC. Users may want to continue using [nested] PCI bridges, and we want the Q35 machine to be used by more users so that it becomes reliable faster, while keeping it clean by default.
We had a discussion on this matter at last year's KVM forum, and the hot-pluggable PCIe-PCI bridge was the general consensus. As a bonus we get the PCI firmware hints capability.
[snip]
(4) Whether the reservation size should be absolute or relative (raised by Gerd). IIUC, Gerd suggests that the absolute aperture size should be specified (as a minimum), not the increment / reservation for hotplug purposes.
The Platform Initialization Specification, v1.6, downloadable at http://www.uefi.org/specs, writes the following under
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
in whose implementation I will have to parse the values from the capability structure, and return the appropriate representation to the platform-independent PciBusDxe driver (i.e., the enumeration / allocation agent):
The padding is returned in the form of ACPI (2.0 & 3.0) resource descriptors. The exact definition of each of the fields is the same as in the EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL.SubmitResources() function. See the section 10.8 for the definition of this function.
The PCI bus driver is responsible for adding this resource request to the resource requests by the physical PCI devices. If Attributes is EfiPaddingPciBus, the padding takes effect at the PCI bus level. If Attributes is EfiPaddingPciRootBridge, the required padding takes effect at the root bridge level. For details, see the definition of EFI_HPC_PADDING_ATTRIBUTES in "Related Definitions" below.
Emphasis on "*adding* this resource request to the resource requests by the physical PCI devices".
However... After checking some OVMF logs, it seems that in practice, PciBusDxe does exactly what Gerd suggests.
Currently OVMF returns a constant 2MB MMIO padding and a constant 512B IO padding for all hotplug-capable bridges (including PCI Express root ports), from GetResourcePadding(). For example, for the following QEMU command line fragment:
-device pxb-pcie,bus_nr=64,id=rootbr1,numa_node=1,bus=pcie.0,addr=0x3 \
  \
-device ioh3420,id=rootbr1-port1,bus=rootbr1,addr=0x1,slot=3 \
-device e1000e,bus=rootbr1-port1,netdev=net0 \
  \
-device ioh3420,id=rootbr1-port2,bus=rootbr1,addr=0x2,slot=4 \
-device e1000e,bus=rootbr1-port2,netdev=net1 \
PciBusDxe produces the following log (extract):
PciBus: Resource Map for Root Bridge PciRoot(0x40)
Type = Io16; Base = 0x8000; Length = 0x2000; Alignment = 0xFFF
   Base = 0x8000; Length = 0x1000; Alignment = 0xFFF; Owner = PPB [40|02|00:**]
   Base = 0x9000; Length = 0x1000; Alignment = 0xFFF; Owner = PPB [40|01|00:**]
Type = Mem32; Base = 0x98600000; Length = 0x400000; Alignment = 0x1FFFFF
   Base = 0x98600000; Length = 0x200000; Alignment = 0x1FFFFF; Owner = PPB [40|02|00:**]
   Base = 0x98800000; Length = 0x200000; Alignment = 0x1FFFFF; Owner = PPB [40|01|00:**]

PciBus: Resource Map for Bridge [40|01|00]
Type = Io16; Base = 0x9000; Length = 0x1000; Alignment = 0xFFF
   Base = Padding; Length = 0x200; Alignment = 0x1FF
   Base = 0x9000; Length = 0x20; Alignment = 0x1F; Owner = PCI [41|00|00:18]
Type = Mem32; Base = 0x98800000; Length = 0x200000; Alignment = 0x1FFFFF
   Base = Padding; Length = 0x200000; Alignment = 0x1FFFFF
   Base = 0x98800000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [41|00|00:14]
   Base = 0x98820000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [41|00|00:10]
   Base = 0x98840000; Length = 0x4000; Alignment = 0x3FFF; Owner = PCI [41|00|00:1C]

PciBus: Resource Map for Bridge [40|02|00]
Type = Io16; Base = 0x8000; Length = 0x1000; Alignment = 0xFFF
   Base = Padding; Length = 0x200; Alignment = 0x1FF
   Base = 0x8000; Length = 0x20; Alignment = 0x1F; Owner = PCI [42|00|00:18]
Type = Mem32; Base = 0x98600000; Length = 0x200000; Alignment = 0x1FFFFF
   Base = Padding; Length = 0x200000; Alignment = 0x1FFFFF
   Base = 0x98600000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [42|00|00:14]
   Base = 0x98620000; Length = 0x20000; Alignment = 0x1FFFF; Owner = PCI [42|00|00:10]
   Base = 0x98640000; Length = 0x4000; Alignment = 0x3FFF; Owner = PCI [42|00|00:1C]
This means that
- both ioh3420 root ports get a padding of 512B for IO and 2MB for MMIO,
So the 512B IO will be reserved even if IO handling is not enabled in the bridge? (I am asking because I might have a similar issue with other guest code.)
I apologize for being unclear about the "distribution of work" between edk2's generic PciBusDxe driver and OVMF's platform-specific PciHotPlugInitDxe driver (which provides the EFI_PCI_HOT_PLUG_INIT_PROTOCOL, of which GetResourcePadding() is a member function).
Basically, PciBusDxe is a platform-independent, universal driver, that
calls into "hooks", optionally provided by the platform, at specific points in the enumeration / allocation.
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() is such a platform specific hook, and OVMF is "reasonably free" to provide PciBusDxe with reservation hints from it, as OVMF sees fit, for any specific hotplug controller. (Of course OVMF's freedom is limited by two factors: first by the information that QEMU exposes to firmware, and second by the "imperative" that in GetResourcePadding() we really shouldn't look at any *other* PCI(e) controllers than the exact hotplug controller that PciBusDxe is asking about.)
Until now sounds good :)
With that in mind:
*Currently*, OVMF advises PciBusDxe to "reserve 512B of IO space" for *any* hotplug controller. (From the above log, we can see that on the root bridge, this 512B is then rounded up to 4KB.)
OK
*In the future*, OVMF should replace the constant 512B with the following logic: consult the IO base/limit registers of the subject hotplug controller, and if they are zero, then return "reserve no IO space for this hotplug controller" to PciBusDxe. If the base/limit registers are nonzero, OVMF would say "reserve 4KB IO space", or even propagate the value from the capability.
Exactly
[snip]
The PI spec says,
[...] For all the root HPCs and the nonroot HPCs, call EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() to obtain the amount of overallocation and add that amount to the requests from the physical devices. Reprogram the bus numbers by taking into account the bus resource padding information. [...]
However, according to my interpretation of the source code, PciBusDxe does not consider bus number padding for non-root HPCs (which are "all" HPCs on QEMU).
Theoretically speaking, it is possible to change the behavior, right?
Not just theoretically; in the past I have changed PciBusDxe -- it wouldn't identify QEMU's hotplug controllers (root port, downstream port etc) appropriately, and I managed to get some patches in. It's just that the less we understand the current code and the more intrusive/extensive the change is, the harder it is to sell the *idea*. PciBusDxe is platform-independent and shipped on many a physical system too.
Understood, but from your explanation it sounds like the existing callback sites (hooks) are enough.
So, I agree that we can add a bus number range field to the capability, but I'm fairly sure it won't work with edk2's PciBusDxe without major surgery. (Given that we haven't ever considered hot-plugging entire bridges until now, i.e., anything that would require bus number reservation, I think we can live with such a limitation when using OVMF, for an indefinite time to come.)
Since OVMF will still support cold-plug, I think it is OK for now. Once the feature gets into SeaBIOS I will open a BZ for tracking.
Please do that (at that time); it will certainly need a dedicated discussion on edk2-devel.
Sure, I'll start a discussion at the edk2 level as soon as we finish the feature on QEMU/SeaBIOS.
(Also, your statement about cold-plug being viable for at least a while is consistent with my questions / implications near the top: what does the hot-pluggability of the PCIe-PCI bridge buy us ultimately?)
Please see above.
Thanks, Marcel
Thanks! Laszlo
On 07/26/17 18:22, Marcel Apfelbaum wrote:
On 26/07/2017 18:20, Laszlo Ersek wrote:
[snip]
However, what does the hot-pluggability of the PCIe-PCI bridge buy us? In other words, what does it buy us when we do not add the PCIe-PCI bridge immediately at guest startup, as an integrated device?
Why is it a problem to "commit" in advance? I understand that we might
not like the DMI-PCI bridge (due to it being legacy), but what speaks against cold-plugging the PCIe-PCI bridge either as an integrated device in pcie.0 (assuming that is permitted), or cold-plugging the PCIe-PCI bridge in a similarly cold-plugged PCIe root port?
We want to keep Q35 clean, and for most cases we don't want any legacy PCI stuff if not especially required.
I mean, in the cold-plugged case, you use up two bus numbers at the most, one for the root port, and another for the PCIe-PCI bridge. In the hot-plugged case, you have to start with the cold-plugged root port just the same (so that you can communicate the bus number reservation *at all*), and then reserve (= use up in advance) the bus number, the IO space, and the MMIO space(s). I don't see the difference; hot-plugging the PCIe-PCI bridge (= not committing in advance) doesn't seem to save any resources.
It's not about resources, more about the usage model.
I guess I would see a difference if we reserved more than one bus number in the hotplug case, namely in order to support recursive hotplug under the PCIe-PCI bridge. But, you confirmed that we intend to keep the flat hierarchy (ie the exercise is only for enabling legacy PCI endpoints, not for recursive hotplug). The PCIe-PCI bridge isn't a device that does anything at all on its own, so why not just coldplug it? Its resources have to be reserved in advance anyway.
Even if we prefer flat hierarchies, we should allow a sane nested-bridge configuration, so we will sometimes reserve more than one.
So, thus far I would say "just cold-plug the PCIe-PCI bridge at startup, possibly even make it an integrated device, and then you don't need to reserve bus numbers (and other apertures)".
Where am I wrong?
Nothing wrong, I am just looking for feature parity between Q35 and PC. Users may want to continue using [nested] PCI bridges, and we want the Q35 machine to be used by more users so that it becomes reliable faster, while keeping it clean by default.
We had a discussion on this matter at last year's KVM forum, and the hot-pluggable PCIe-PCI bridge was the general consensus.
OK. I don't want to question or go back on that consensus now; I'd just like to point out that all that you describe (nested bridges, and enabling legacy PCI with PCIe-PCI bridges, *on demand*) is still possible with cold-plugging.
I.e., the default setup of Q35 does not need to include legacy PCI bridges. It's just that the pre-launch configuration effort for a Q35 user to *reserve* resources for legacy PCI is the exact same as the pre-launch configuration effort to *actually cold-plug* the bridge.
[snip]
The PI spec says,
[...] For all the root HPCs and the nonroot HPCs, call EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() to obtain the amount of overallocation and add that amount to the requests from the physical devices. Reprogram the bus numbers by taking into account the bus resource padding information. [...]
However, according to my interpretation of the source code, PciBusDxe does not consider bus number padding for non-root HPCs (which are "all" HPCs on QEMU).
Theoretically speaking, it is possible to change the behavior, right?
Not just theoretically; in the past I have changed PciBusDxe -- it wouldn't identify QEMU's hotplug controllers (root port, downstream port etc) appropriately, and I managed to get some patches in. It's just that the less we understand the current code and the more intrusive/extensive the change is, the harder it is to sell the *idea*. PciBusDxe is platform-independent and shipped on many a physical system too.
Understood, but from your explanation it sounds like the existing callback sites (hooks) are enough.
That's the problem: they don't appear to be, if you consider bus number reservations. The existing callback sites seem fine regarding IO and MMIO, but the only callback site that honors bus number reservation is limited to "root" (in the previously defined sense) hotplug controllers.
So this is something that will need investigation, and my most recent queries into the "hotplug preparation" parts of PciBusDxe indicate that those parts are quite... "forgotten". :) I guess this might be because on physical systems the level of PCI(e) hotpluggery that we plan to do is likely unheard of :)
Thanks! Laszlo
On 26/07/2017 21:31, Laszlo Ersek wrote:
On 07/26/17 18:22, Marcel Apfelbaum wrote:
On 26/07/2017 18:20, Laszlo Ersek wrote:
[snip]
However, what does the hot-pluggability of the PCIe-PCI bridge buy us? In other words, what does it buy us when we do not add the PCIe-PCI bridge immediately at guest startup, as an integrated device?
Why is it a problem to "commit" in advance? I understand that we might
not like the DMI-PCI bridge (due to it being legacy), but what speaks against cold-plugging the PCIe-PCI bridge either as an integrated device in pcie.0 (assuming that is permitted), or cold-plugging the PCIe-PCI bridge in a similarly cold-plugged PCIe root port?
We want to keep Q35 clean, and for most cases we don't want any legacy PCI stuff if not especially required.
I mean, in the cold-plugged case, you use up two bus numbers at the most, one for the root port, and another for the PCIe-PCI bridge. In the hot-plugged case, you have to start with the cold-plugged root port just the same (so that you can communicate the bus number reservation *at all*), and then reserve (= use up in advance) the bus number, the IO space, and the MMIO space(s). I don't see the difference; hot-plugging the PCIe-PCI bridge (= not committing in advance) doesn't seem to save any resources.
It's not about resources, more about the usage model.
I guess I would see a difference if we reserved more than one bus number in the hotplug case, namely in order to support recursive hotplug under the PCIe-PCI bridge. But, you confirmed that we intend to keep the flat hierarchy (ie the exercise is only for enabling legacy PCI endpoints, not for recursive hotplug). The PCIe-PCI bridge isn't a device that does anything at all on its own, so why not just coldplug it? Its resources have to be reserved in advance anyway.
Even if we prefer flat hierarchies, we should allow a sane nested-bridge configuration, so we will sometimes reserve more than one.
So, thus far I would say "just cold-plug the PCIe-PCI bridge at startup, possibly even make it an integrated device, and then you don't need to reserve bus numbers (and other apertures)".
Where am I wrong?
Nothing wrong, I am just looking for feature parity between Q35 and PC. Users may want to continue using [nested] PCI bridges, and we want the Q35 machine to be used by more users so that it becomes reliable faster, while keeping it clean by default.
We had a discussion on this matter at last year's KVM forum, and the hot-pluggable PCIe-PCI bridge was the general consensus.
OK. I don't want to question or go back on that consensus now; I'd just like to point out that all that you describe (nested bridges, and enabling legacy PCI with PCIe-PCI bridges, *on demand*) is still possible with cold-plugging.
I.e., the default setup of Q35 does not need to include legacy PCI bridges. It's just that the pre-launch configuration effort for a Q35 user to *reserve* resources for legacy PCI is the exact same as the pre-launch configuration effort to *actually cold-plug* the bridge.
[snip]
The PI spec says,
[...] For all the root HPCs and the nonroot HPCs, call EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() to obtain the amount of overallocation and add that amount to the requests from the physical devices. Reprogram the bus numbers by taking into account the bus resource padding information. [...]
However, according to my interpretation of the source code, PciBusDxe does not consider bus number padding for non-root HPCs (which are "all" HPCs on QEMU).
Theoretically speaking, it is possible to change the behavior, right?
Not just theoretically; in the past I have changed PciBusDxe -- it wouldn't identify QEMU's hotplug controllers (root port, downstream port etc) appropriately, and I managed to get some patches in. It's just that the less we understand the current code and the more intrusive/extensive the change is, the harder it is to sell the *idea*. PciBusDxe is platform-independent and shipped on many a physical system too.
Understood, but from your explanation it sounds like the existing callback sites (hooks) are enough.
That's the problem: they don't appear to be, if you consider bus number reservations. The existing callback sites seem fine regarding IO and MMIO, but the only callback site that honors bus number reservation is limited to "root" (in the previously defined sense) hotplug controllers.
So this is something that will need investigation, and my most recent queries into the "hotplug preparation" parts of PciBusDxe indicate that those parts are quite... "forgotten". :) I guess this might be because on physical systems the level of PCI(e) hotpluggery that we plan to do is likely unheard of :)
I admit it is possible that it looks a little "crazy" on bare metal, but as long as we "color inside the lines" we are allowed to push it a little :)
Thanks, Marcel
Thanks! Laszlo
On Wed, Jul 26, 2017 at 07:22:42PM +0300, Marcel Apfelbaum wrote:
On 26/07/2017 18:20, Laszlo Ersek wrote:
On 07/26/17 08:48, Marcel Apfelbaum wrote:
On 25/07/2017 18:46, Laszlo Ersek wrote:
[snip]
(2) Bus range reservation, and hotplugging bridges. What's the motivation? Our recommendations in "docs/pcie.txt" suggest flat hierarchies.
It remains flat. You have one single PCIE-PCI bridge plugged into a PCIe Root Port, no deep nesting.
The reason is to be able to support legacy PCI devices without "committing" to a DMI-PCI bridge in advance (keeping Q35 without legacy hw).
The only way to support PCI devices in Q35 is to have them cold-plugged into the pcie.0 bus, which is good, but not enough for expanding the Q35 usability in order to make it eventually the default QEMU x86 machine (I know this is another discussion and I am in minority, at least for now).
The plan is: start the Q35 machine as usual, but one of the PCIe Root Ports includes hints for the firmware needed to support legacy PCI devices (IO port range, extra bus, ...).
Once a PCI device is needed, you have 2 options:
- Plug a PCIe-PCI bridge into a PCIe Root Port and the PCI device into the bridge.
- Hotplug a PCIe-PCI bridge into a PCIe Root Port and then hotplug a PCI device into the bridge.
Hi Laszlo,
Thank you for the explanation, it makes the intent a lot clearer.
However, what does the hot-pluggability of the PCIe-PCI bridge buy us? In other words, what does it buy us when we do not add the PCIe-PCI bridge immediately at guest startup, as an integrated device?
Why is it a problem to "commit" in advance? I understand that we might
not like the DMI-PCI bridge (due to it being legacy), but what speaks against cold-plugging the PCIe-PCI bridge either as an integrated device in pcie.0 (assuming that is permitted), or cold-plugging the PCIe-PCI bridge in a similarly cold-plugged PCIe root port?
We want to keep Q35 clean, and for most cases we don't want any legacy PCI stuff if not especially required.
BTW, what are the PCI devices that we actually need?
On 26/07/2017 21:49, Michael S. Tsirkin wrote:
On Wed, Jul 26, 2017 at 07:22:42PM +0300, Marcel Apfelbaum wrote:
On 26/07/2017 18:20, Laszlo Ersek wrote:
On 07/26/17 08:48, Marcel Apfelbaum wrote:
On 25/07/2017 18:46, Laszlo Ersek wrote:
[snip]
(2) Bus range reservation, and hotplugging bridges. What's the motivation? Our recommendations in "docs/pcie.txt" suggest flat hierarchies.
It remains flat. You have one single PCIE-PCI bridge plugged into a PCIe Root Port, no deep nesting.
The reason is to be able to support legacy PCI devices without "committing" to a DMI-PCI bridge in advance (keeping Q35 without legacy hw).
The only way to support PCI devices in Q35 is to have them cold-plugged into the pcie.0 bus, which is good, but not enough for expanding the Q35 usability in order to make it eventually the default QEMU x86 machine (I know this is another discussion and I am in minority, at least for now).
The plan is: start the Q35 machine as usual, but one of the PCIe Root Ports includes hints for the firmware needed to support legacy PCI devices (IO port range, extra bus, ...).
Once a PCI device is needed, you have 2 options:
- Plug a PCIe-PCI bridge into a PCIe Root Port and the PCI device into the bridge.
- Hotplug a PCIe-PCI bridge into a PCIe Root Port and then hotplug a PCI device into the bridge.
Hi Laszlo,
Thank you for the explanation, it makes the intent a lot clearer.
However, what does the hot-pluggability of the PCIe-PCI bridge buy us? In other words, what does it buy us when we do not add the PCIe-PCI bridge immediately at guest startup, as an integrated device?
Why is it a problem to "commit" in advance? I understand that we might
not like the DMI-PCI bridge (due to it being legacy), but what speaks against cold-plugging the PCIe-PCI bridge either as an integrated device in pcie.0 (assuming that is permitted), or cold-plugging the PCIe-PCI bridge in a similarly cold-plugged PCIe root port?
We want to keep Q35 clean, and for most cases we don't want any legacy PCI stuff if not especially required.
BTW, what are the PCI devices that we actually need?
It's not about what we need; if Q35 becomes a "transition" machine, any existing emulated PCI device is fair game, since we would also want to run pc configurations on Q35.
Thanks, Marcel