The virt machine in the NEMU project has the ambition to be a platform that has no emulated legacy hardware. This patch series enables support for that machine type in Seabios.
The impact that this has for a Seabios port is that the CMOS is not available to query details of CPUs or memory configuration; instead this patch series modifies the code that queries those details to prefer those from QEMU FW CFG over CMOS.
I've tested this patch series with pc, Q35 and virt as part of our automated testing for NEMU.
Rob
Add the PCI ID used by the host bridge from the virt machine as a supported PCI ID in the detection code.
Signed-off-by: Rob Bradford robert.bradford@intel.com
---
 src/fw/paravirt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/fw/paravirt.c b/src/fw/paravirt.c
index 0770c47..597bd1c 100644
--- a/src/fw/paravirt.c
+++ b/src/fw/paravirt.c
@@ -95,6 +95,9 @@ static void qemu_detect(void)
     case 0x29c0:
         dprintf(1, "Running on QEMU (q35)\n");
         break;
+    case 0x0d57:
+        dprintf(1, "Running on QEMU (virt)\n");
+        break;
     default:
         dprintf(1, "Running on QEMU (unknown nb: %04x:%04x)\n", v, d);
         break;
Our BIOS region is always read-write so there is no need to program the PAM or any other mechanism to allow the BIOS to continue.
Signed-off-by: Rob Bradford robert.bradford@intel.com
---
 src/fw/shadow.c  | 5 +++++
 src/hw/pci_ids.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/src/fw/shadow.c b/src/fw/shadow.c
index 4c627a8..4b909bd 100644
--- a/src/fw/shadow.c
+++ b/src/fw/shadow.c
@@ -142,6 +142,11 @@ make_bios_writable(void)
             ShadowBDF = bdf;
             return;
         }
+        if (vendor == PCI_VENDOR_ID_INTEL
+            && device == PCI_DEVICE_ID_INTEL_VIRT) {
+            ShadowBDF = bdf;
+            return;
+        }
     }
     dprintf(1, "Unable to unlock ram - bridge not found\n");
 }
diff --git a/src/hw/pci_ids.h b/src/hw/pci_ids.h
index 1096461..49c27f2 100644
--- a/src/hw/pci_ids.h
+++ b/src/hw/pci_ids.h
@@ -2528,6 +2528,7 @@
 #define PCI_DEVICE_ID_INTEL_IXP4XX 0x8500
 #define PCI_DEVICE_ID_INTEL_IXP2800 0x9004
 #define PCI_DEVICE_ID_INTEL_S21152BB 0xb152
+#define PCI_DEVICE_ID_INTEL_VIRT 0x0d57
 
 #define PCI_VENDOR_ID_SCALEMP 0x8686
 #define PCI_DEVICE_ID_SCALEMP_VSMP_CTL 0x1010
Split detection of QEMU FW CFG into two phases: detection and parsing of complex structures. The qemu_cfg_preinit() function does not require malloc() to be available.
Signed-off-by: Rob Bradford robert.bradford@intel.com
---
 src/fw/paravirt.c | 11 +++++++++--
 src/fw/paravirt.h |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/fw/paravirt.c b/src/fw/paravirt.c
index 597bd1c..7044378 100644
--- a/src/fw/paravirt.c
+++ b/src/fw/paravirt.c
@@ -124,6 +124,8 @@ qemu_preinit(void)
         kvm_detect();
     }
 
+    qemu_cfg_preinit();
+
     // On emulators, get memory size from nvram.
     u32 rs = ((rtc_read(CMOS_MEM_EXTMEM2_LOW) << 16)
               | (rtc_read(CMOS_MEM_EXTMEM2_HIGH) << 24));
@@ -571,8 +573,7 @@ struct QemuCfgFile {
     char name[56];
 };
 
-void qemu_cfg_init(void)
-{
+void qemu_cfg_preinit(void) {
     if (!runningOnQEMU())
         return;
 
@@ -595,6 +596,12 @@ void qemu_cfg_init(void)
         dprintf(1, "QEMU fw_cfg DMA interface supported\n");
         cfg_dma_enabled = 1;
     }
+}
+
+void qemu_cfg_init(void)
+{
+    if (!qemu_cfg_enabled())
+        return;
 
     // Populate romfiles for legacy fw_cfg entries
     qemu_cfg_legacy();
diff --git a/src/fw/paravirt.h b/src/fw/paravirt.h
index a14d83e..2c4792a 100644
--- a/src/fw/paravirt.h
+++ b/src/fw/paravirt.h
@@ -54,6 +54,7 @@ int qemu_cfg_dma_enabled(void);
 void qemu_preinit(void);
 void qemu_platform_setup(void);
 void qemu_cfg_init(void);
+void qemu_cfg_preinit(void);
 
 u16 qemu_get_present_cpus_count(void);
 int qemu_cfg_write_file(void *src, struct romfile_s *file, u32 offset, u32 len);
If there is a QEMU FW CFG variable for memory available then always use it instead of CMOS. We cannot extract the values from E820 tables yet as that code assumes a working malloc.
Signed-off-by: Rob Bradford robert.bradford@intel.com
---
 src/fw/paravirt.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/src/fw/paravirt.c b/src/fw/paravirt.c
index 7044378..26e0132 100644
--- a/src/fw/paravirt.c
+++ b/src/fw/paravirt.c
@@ -105,6 +105,8 @@ static void qemu_detect(void)
     kvm_detect();
 }
 
+static void qemu_cfg_read_entry(void *buf, int e, int len);
+
 void
 qemu_preinit(void)
 {
@@ -126,22 +128,31 @@ qemu_preinit(void)
 
     qemu_cfg_preinit();
 
-    // On emulators, get memory size from nvram.
-    u32 rs = ((rtc_read(CMOS_MEM_EXTMEM2_LOW) << 16)
-              | (rtc_read(CMOS_MEM_EXTMEM2_HIGH) << 24));
-    if (rs)
-        rs += 16 * 1024 * 1024;
-    else
-        rs = (((rtc_read(CMOS_MEM_EXTMEM_LOW) << 10)
-               | (rtc_read(CMOS_MEM_EXTMEM_HIGH) << 18))
-              + 1 * 1024 * 1024);
-    RamSize = rs;
-    e820_add(0, rs, E820_RAM);
+    // Prefer QEMU FW CFG entry over CMOS for initial RAM sizes
+    if (qemu_cfg_enabled()) {
+        qemu_cfg_read_entry(&RamSize, 0x03, sizeof(RamSize));
+        if (RamSize > 0) {
+            e820_add(0, RamSize, E820_RAM);
+            dprintf(1, "RamSize: 0x%08x [fw_cfg]\n", RamSize);
+        }
+    }
+
+    if (RamSize == 0) {
+        u32 rs = ((rtc_read(CMOS_MEM_EXTMEM2_LOW) << 16)
+                  | (rtc_read(CMOS_MEM_EXTMEM2_HIGH) << 24));
+        if (rs)
+            rs += 16 * 1024 * 1024;
+        else
+            rs = (((rtc_read(CMOS_MEM_EXTMEM_LOW) << 10)
+                   | (rtc_read(CMOS_MEM_EXTMEM_HIGH) << 18))
+                  + 1 * 1024 * 1024);
+        RamSize = rs;
+        e820_add(0, rs, E820_RAM);
+        dprintf(1, "RamSize: 0x%08x [cmos]\n", RamSize);
+    }
 
     /* reserve 256KB BIOS area at the end of 4 GB */
     e820_add(0xfffc0000, 256*1024, E820_RESERVED);
-
-    dprintf(1, "RamSize: 0x%08x [cmos]\n", RamSize);
 }
 
 #define MSR_IA32_FEATURE_CONTROL 0x0000003a
On Thu, Nov 29, 2018 at 05:37:45PM +0000, Rob Bradford wrote:
If there is a QEMU FW CFG variable for memory available then always use it instead of CMOS. We cannot extract the values from E820 tables yet as that code assumes a working malloc.
Failed in testing.
Try a guest with 4G RAM, seabios falls back to cmos.
Try a guest with q35 and 7G RAM, seabios thinks it has 3G of low mem even though it actually has 2G only.
+    // Prefer QEMU FW CFG entry over CMOS for initial RAM sizes
+    if (qemu_cfg_enabled()) {
+        qemu_cfg_read_entry(&RamSize, 0x03, sizeof(RamSize));
+        if (RamSize > 0) {
+            e820_add(0, RamSize, E820_RAM);
+            dprintf(1, "RamSize: 0x%08x [fw_cfg]\n", RamSize);
+        }
+    }
You are losing the high bits here, RamSize is a 32bit variable.
Another problem is that you only get the total amount of memory here, not the mapping.
I think there is no way around scanning the e820 table (etc/e820 fw_cfg file) for the ram entry with the zero start address to figure the amount of ram you have below 4G.
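For reference, both test failures above are consistent with 32-bit truncation. A minimal sketch, assuming the fw_cfg entry at key 0x03 holds a 64-bit little-endian byte count and that reading it into a u32 keeps only the low four bytes. `read_into_u32` is a hypothetical stand-in for that read, not a SeaBIOS function:

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1024ULL * 1024 * 1024)

/* Hypothetical stand-in for reading a 64-bit fw_cfg value into a
 * 32-bit variable on a little-endian machine: only the low 32 bits
 * survive. */
static uint32_t read_into_u32(uint64_t fw_cfg_value)
{
    return (uint32_t)fw_cfg_value;
}

/* Self-check of the two cases from the test report. */
static void truncation_examples(void)
{
    /* 4G total: the low 32 bits are zero, so the CMOS fallback runs. */
    assert(read_into_u32(4 * GiB) == 0);
    /* 7G total: the low 32 bits are 0xC0000000, i.e. 3G of "low mem". */
    assert(read_into_u32(7 * GiB) == 0xC0000000u);
}
```

Even with a 64-bit variable the value would still be the total, so the q35 case (2G below 4G, 5G above) could not be recovered from this entry alone.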
cheers, Gerd
On Wed, 2018-12-05 at 10:18 +0100, Gerd Hoffmann wrote:
On Thu, Nov 29, 2018 at 05:37:45PM +0000, Rob Bradford wrote:
If there is a QEMU FW CFG variable for memory available then always use it instead of CMOS. We cannot extract the values from E820 tables yet as that code assumes a working malloc.
Failed in testing.
Thanks for trying it out.
Try a guest with 4G RAM, seabios falls back to cmos.
Try a guest with q35 and 7G RAM, seabios thinks it has 3G of low mem even though it actually has 2G only.
Ah, that is a really big problem I missed. I played around with various size combinations but I was only interested in what was happening in the booted kernel.
Another problem is that you only get the total amount of memory here, not the mapping.
I think there is no way around scanning the e820 table (etc/e820 fw_cfg file) for the ram entry with the zero start address to figure the amount of ram you have below 4G.
The two options I see are: either modify the E820 code to make it usable without the assumption that malloc is available, or alternatively add another FW config variable that has the low mem size, which is all you care about in terms of early firmware as you're going to get all the fine detail from the E820 table.
I'll take a look at the former as I think that would be preferable.
Cheers,
Rob
cheers, Gerd
Hi,
The two options I see are: either modify the E820 code to make it usable without the assumption that malloc is available
Or add a simplified version for early boot. You only need to find one entry and get one value from it, this should be doable with all state on the stack.
Modifying the e820 code isn't really possible I think, it also creates the e820 table for the OS, which must be stored somewhere. So this is hardly doable without working malloc.
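A minimal sketch of such a simplified early-boot scan, keeping a single entry on the stack. The entry layout (u64 address, u64 length, u32 type, packed) is an assumption about the e820 fw_cfg file format, and `byte_stream`, `stream_read` and `find_low_ram` are hypothetical names; in firmware the reader would be replaced by sequential fw_cfg reads:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define E820_RAM 1

/* Assumed layout of one entry in the e820 fw_cfg file. */
struct e820_entry {
    uint64_t address;
    uint64_t length;
    uint32_t type;
} __attribute__((packed));

/* Hypothetical sequential reader standing in for fw_cfg data reads. */
struct byte_stream {
    const uint8_t *data;
    size_t size;
    size_t pos;
};

/* Copy the next len bytes into buf; return -1 when the data runs out. */
static int stream_read(struct byte_stream *s, void *buf, size_t len)
{
    assert(s != NULL && s->pos <= s->size);
    if (s->size - s->pos < len)
        return -1;
    memcpy(buf, s->data + s->pos, len);
    s->pos += len;
    return 0;
}

/* Return the length of the RAM region that starts at address 0,
 * or 0 if no such entry exists. No heap allocation is needed. */
static uint64_t find_low_ram(struct byte_stream *s)
{
    struct e820_entry e; /* the only state: one entry on the stack */

    while (stream_read(s, &e, sizeof(e)) == 0) {
        if (e.type == E820_RAM && e.address == 0)
            return e.length;
    }
    return 0;
}
```

The full table for the OS would still be built later by the existing malloc-based code; this only answers the question of how much RAM sits below 4G.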
HTH, Gerd
If there is a CPU count from QEMU FW CFG then use that over the value from the CMOS.
Signed-off-by: Rob Bradford robert.bradford@intel.com
---
 src/fw/paravirt.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/fw/paravirt.c b/src/fw/paravirt.c
index 26e0132..a9e5f2c 100644
--- a/src/fw/paravirt.c
+++ b/src/fw/paravirt.c
@@ -419,12 +419,12 @@ qemu_get_present_cpus_count(void)
     u16 smp_count = 0;
     if (qemu_cfg_enabled()) {
         qemu_cfg_read_entry(&smp_count, QEMU_CFG_NB_CPUS, sizeof(smp_count));
+        if (smp_count > 0) {
+            return smp_count;
+        }
     }
-    u16 cmos_cpu_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
-    if (smp_count < cmos_cpu_count) {
-        smp_count = cmos_cpu_count;
-    }
-    return smp_count;
+
+    return rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
 }
 
 struct e820_reservation {
On Thu, Nov 29, 2018 at 05:37:41PM +0000, Rob Bradford wrote:
The virt machine in the NEMU project has the ambition to be a platform that has no emulated legacy hardware. This patch series enables support for that machine type in Seabios.
What exactly does "no legacy hardware" mean? Does that include lpc devices such as pit and pic?
What will seabios use as time source with -M virt? Seems there is no pmtimer ...
The impact that this has for a Seabios port is that the CMOS is not available to query details of CPUs or memory configuration; instead this patch series modifies the code that queries those details to prefer those from QEMU FW CFG over CMOS.
Doing that makes sense anyway, I'd say these three patches can be merged. The other two should wait until the qemu patches are merged, to be sure the pci ids are the final ones.
cheers, Gerd
Hi Gerd, thanks for your prompt reply,
On Fri, 2018-11-30 at 14:48 +0100, Gerd Hoffmann wrote:
On Thu, Nov 29, 2018 at 05:37:41PM +0000, Rob Bradford wrote:
The virt machine in the NEMU project has the ambition to be a platform that has no emulated legacy hardware. This patch series enables support for that machine type in Seabios.
What exactly does "no legacy hardware" mean? Does that include lpc devices such as pit and pic?
That is correct, the host Linux kernel does provide some emulation of devices as part of KVM though.
What will seabios use as time source with -M virt? Seems there is no pmtimer ...
That is correct. There do not appear to be any negative side effects from the lack of a time source when running Seabios against virt.
(In the OVMF port I implement MicroSleep/NanoSleep with a KVM pvclock and TSC based solution, but I'm currently evaluating whether this can be a no-op, as sleeping in firmware is usually part of programming physical hardware, which is not required with the virt machine type.)
The impact that this has for a Seabios port is that the CMOS is not available to query details of CPUs or memory configuration; instead this patch series modifies the code that queries those details to prefer those from QEMU FW CFG over CMOS.
Doing that makes sense anyway, I'd say these three patches can be merged. The other two should wait until the qemu patches are merged, to be sure the pci ids are the final ones.
This PCI ID (0x8086 / 0x0d57) is from our internal registry which is allocated exclusively for the use of the PCI host bridge on the virt platform.
cheers, Gerd
Cheers,
Rob
On Fri, Nov 30, 2018 at 02:46:57PM +0000, Rob Bradford wrote:
Hi Gerd, thanks for your prompt reply,
On Fri, 2018-11-30 at 14:48 +0100, Gerd Hoffmann wrote:
On Thu, Nov 29, 2018 at 05:37:41PM +0000, Rob Bradford wrote:
The virt machine in the NEMU project has the ambition to be a platform that has no emulated legacy hardware. This patch series enables support for that machine type in Seabios.
What exactly does "no legacy hardware" mean? Does that include lpc devices such as pit and pic?
That is correct, the host Linux kernel does provide some emulation of devices as part of KVM though.
Which is configurable though (-M kernel_irqchip=on|off|split).
What will seabios use as time source with -M virt? Seems there is no pmtimer ...
That is correct. There do not appear to be any negative side effects from the lack of a time source when running Seabios against virt.
Well, if there is no pmtimer seabios will use tsc, calibrated using pit (see src/hw/timer.c).
That works ok most of the time, but sometimes it doesn't. When booting a guest on a loaded host it may happen that the vcpu gets scheduled away while running the calibration loop and the calibration can be *way* off because of that.
pmtimer is simple and reliable (fixed frequency, so no calibration needed). So, with that not being available having some other reliable time source would be very useful. kvmclock maybe?
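To illustrate the "no calibration needed" point: the ACPI PM timer ticks at a fixed 3.579545 MHz, so elapsed time follows directly from a tick delta. A minimal sketch, assuming a 24-bit counter (a 32-bit variant also exists); the helper names are made up for illustration and are not SeaBIOS functions:

```c
#include <assert.h>
#include <stdint.h>

#define PMTIMER_HZ   3579545u    /* fixed ACPI PM timer frequency */
#define PMTIMER_MASK 0x00ffffffu /* assumption: 24-bit counter */

/* Ticks elapsed between two counter reads, tolerating one wrap. */
static uint32_t pmtimer_elapsed(uint32_t start, uint32_t now)
{
    return (now - start) & PMTIMER_MASK;
}

/* Convert a tick delta to microseconds; no calibration loop involved. */
static uint64_t pmtimer_ticks_to_us(uint32_t ticks)
{
    assert(ticks <= PMTIMER_MASK);
    return (uint64_t)ticks * 1000000u / PMTIMER_HZ;
}
```

Because the frequency is a constant, a vcpu being scheduled away mid-measurement only delays the read; it cannot skew the conversion factor the way it can with a calibrated TSC.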
(In the OVMF port I implement MicroSleep/NanoSleep with a KVM pvclock and TSC based solution, but I'm currently evaluating whether this can be a no-op, as sleeping in firmware is usually part of programming physical hardware, which is not required with the virt machine type.)
Programming the virtual hardware maybe doesn't need exact timing. But you still need working timeouts for anything which can trigger actual I/O on the host. Read a block from disk. Wait for a DHCP response.
Or the three seconds we wait for a keypress with "-boot menu=on".
So, I don't think there is a way around having a reliable time source even on virtual hardware. Also note that you might have physical hardware in a virtual machine (via pci passthrough for example).
Doing that makes sense anyway, I'd say these three patches can be merged. The other two should wait until the qemu patches are merged, to be sure the pci ids are the final ones.
This PCI ID (0x8086 / 0x0d57) is from our internal registry which is allocated exclusively for the use of the PCI host bridge on the virt platform.
Good.
General rule is still that we usually merge seabios support after the qemu patches actually landed in qemu master.
cheers, Gerd
Hi Gerd,
On Mon, 2018-12-03 at 08:27 +0100, Gerd Hoffmann wrote:
On Fri, Nov 30, 2018 at 02:46:57PM +0000, Rob Bradford wrote:
Hi Gerd, thanks for your prompt reply,
On Fri, 2018-11-30 at 14:48 +0100, Gerd Hoffmann wrote:
On Thu, Nov 29, 2018 at 05:37:41PM +0000, Rob Bradford wrote:
The virt machine in the NEMU project has the ambition to be a platform that has no emulated legacy hardware. This patch series enables support for that machine type in Seabios.
What exactly does "no legacy hardware" mean? Does that include lpc devices such as pit and pic?
That is correct, the host Linux kernel does provide some emulation of devices as part of KVM though.
Which is configurable though (-M kernel_irqchip=on|off|split).
What will seabios use as time source with -M virt? Seems there is no pmtimer ...
That is correct. There do not appear to be any negative side effects from the lack of a time source when running Seabios against virt.
Well, if there is no pmtimer seabios will use tsc, calibrated using pit (see src/hw/timer.c).
That works ok most of the time, but sometimes it doesn't. When booting a guest on a loaded host it may happen that the vcpu gets scheduled away while running the calibration loop and the calibration can be *way* off because of that.
pmtimer is simple and reliable (fixed frequency, so no calibration needed). So, with that not being available having some other reliable time source would be very useful. kvmclock maybe?
On the virt machine we don't use the kernel provided PIT emulation as we don't call the KVM_CREATE_PIT[2] ioctls. Looking through the debug output I can indeed see that Seabios is using the TSC/PIT code path. However as we don't have a PIT present it's understandably getting the frequency calculation wrong:
As can be seen from the debug output...
"""
tsc calibrate start=61956912 end=61961793 diff=4881
CPU Mhz=2
init timer
"""
As you can see, these numbers are wrong :-) so I definitely need to find a better solution. As I mentioned below, in the OVMF port I added a version that uses the KVM clock to calculate the TSC frequency; that is definitely workable for Seabios too. However, I'm also thinking about proposing adding the (counter only) ACPI timer to the virt platform, as that would not add much code and would significantly simplify our firmware changes.
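The 2 MHz figure can be reproduced from the logged diff. A rough sketch, assuming the calibration measures the TSC delta across 0x800 PIT ticks at the standard 1193182 Hz PIT rate; both the window size and the helper name are assumptions for illustration, not taken from this thread:

```c
#include <assert.h>
#include <stdint.h>

#define PIT_TICK_RATE   1193182u /* standard PIT input clock in Hz */
#define CALIBRATE_COUNT 0x800u   /* assumed calibration window in PIT ticks */

/* Estimate the TSC frequency from the TSC delta observed while the
 * PIT was expected to count CALIBRATE_COUNT ticks. */
static uint64_t tsc_hz_from_calibration(uint64_t tsc_diff)
{
    /* Guard against overflow in the 64-bit multiplication. */
    assert(tsc_diff <= UINT64_MAX / PIT_TICK_RATE);
    return tsc_diff * PIT_TICK_RATE / CALIBRATE_COUNT;
}
```

With diff=4881 this gives roughly 2.8 MHz, which truncates to the "CPU Mhz=2" in the log. With no PIT behind the port reads, the wait loop exits almost immediately, so the measured window is far shorter than the arithmetic assumes.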
(In the OVMF port I implement MicroSleep/NanoSleep with a KVM pvclock and TSC based solution, but I'm currently evaluating whether this can be a no-op, as sleeping in firmware is usually part of programming physical hardware, which is not required with the virt machine type.)
Programming the virtual hardware maybe doesn't need exact timing. But you still need working timeouts for anything which can trigger actual I/O on the host. Read a block from disk. Wait for a DHCP response.
Or the three seconds we wait for a keypress with "-boot menu=on".
So, I don't think there is a way around having a reliable time source even on virtual hardware. Also note that you might have physical hardware in a virtual machine (via pci passthrough for example).
This feedback makes sense.
Doing that makes sense anyway, I'd say these three patches can be merged. The other two should wait until the qemu patches are merged, to be sure the pci ids are the final ones.
This PCI ID (0x8086 / 0x0d57) is from our internal registry which is allocated exclusively for the use of the PCI host bridge on the virt platform.
Good.
General rule is still that we usually merge seabios support after the qemu patches actually landed in qemu master.
Understood. What do you think about patches 3-5, which don't have any virt machine specific details? Do you still think they could be merged?
Cheers,
Rob
cheers, Gerd