I'm posting this to get opinions on one of the possible approaches for where to map hotplug memory.
This patch assumes that the space for hotplug memory is located right after the RamSizeOver4G region and that QEMU will provide a romfile specifying where it ends, so that the BIOS knows from which base to start mapping 64-bit PCI devices.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 src/fw/pciinit.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index b29db99..62f8d4e 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -18,6 +18,8 @@
 #include "paravirt.h" // RamSize
 #include "string.h" // memset
 #include "util.h" // pci_setup
+#include "byteorder.h" // le64_to_cpu
+#include "romfile.h" // romfile_loadint

 #define PCI_DEVICE_MEM_MIN   0x1000
 #define PCI_BRIDGE_IO_MIN    0x1000
@@ -764,6 +766,8 @@ static void pci_bios_map_devices(struct pci_bus *busses)
 {
     if (pci_bios_init_root_regions(busses)) {
         struct pci_region r64_mem, r64_pref;
+        u64 base64 = le64_to_cpu(romfile_loadint("etc/mem64-end",
+                                     0x100000000ULL + RamSizeOver4G));
         r64_mem.list.first = NULL;
         r64_pref.list.first = NULL;
         pci_region_migrate_64bit_entries(&busses[0].r[PCI_REGION_TYPE_MEM],
@@ -779,7 +783,7 @@ static void pci_bios_map_devices(struct pci_bus *busses)
         u64 align_mem = pci_region_align(&r64_mem);
         u64 align_pref = pci_region_align(&r64_pref);

-        r64_mem.base = ALIGN(0x100000000LL + RamSizeOver4G, align_mem);
+        r64_mem.base = ALIGN(base64, align_mem);
         r64_pref.base = ALIGN(r64_mem.base + sum_mem, align_pref);
         pcimem64_start = r64_mem.base;
         pcimem64_end = r64_pref.base + sum_pref;
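For reference, the QEMU-side counterpart could be as small as the following sketch. It assumes QEMU's fw_cfg file API (fw_cfg_add_file(), cpu_to_le64()); the function name and its mem64_end parameter are illustrative, and the value is stored little-endian to match the le64_to_cpu() read above.

    /* Hypothetical QEMU-side sketch: publish the end of the hotplug memory
     * area as a little-endian u64 romfile.  fw_cfg keeps a pointer to the
     * data rather than copying it, so the buffer must outlive registration. */
    static void expose_mem64_end(FWCfgState *fw_cfg, uint64_t mem64_end)
    {
        uint64_t *val = g_malloc(sizeof(*val));

        *val = cpu_to_le64(mem64_end);
        fw_cfg_add_file(fw_cfg, "etc/mem64-end", val, sizeof(*val));
    }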
On Wed, 2013-10-09 at 14:23 +0200, Igor Mammedov wrote:
I'm posting this to get opinions on one of the possible approaches for where to map hotplug memory.
This patch assumes that the space for hotplug memory is located right after the RamSizeOver4G region and that QEMU will provide a romfile specifying where it ends, so that the BIOS knows from which base to start mapping 64-bit PCI devices.
We should think about both pci hotplug and memory hotplug while we're at it.
Today the 64bit pci window is mapped right above high memory and is sized (in acpi tables) according to what is needed to map the devices present at boot.
The effect is that there is no extra address space for 64bit bars of hotplugged pci devices. And the window is also in the way when it comes to memory hotplug.
Given that some windows versions don't like large 64bit windows, we should make the window size configurable.
The window location can either be made configurable too, or we simply place it at the top of the address space, with "address space" being what the cpu can address according to cpuinfo.
Current qemu reports this by default:
$ cat /proc/cpuinfo
model name      : QEMU Virtual CPU version 1.5.3
address sizes   : 40 bits physical, 48 bits virtual
40 address lines allow 1TB, so we would place the window just below 1TB.
Comments?
Gerd
On Wed, 09 Oct 2013 15:12:08 +0200 Gerd Hoffmann kraxel@redhat.com wrote:
On Wed, 2013-10-09 at 14:23 +0200, Igor Mammedov wrote:
I'm posting this to get opinions on one of the possible approaches for where to map hotplug memory.
This patch assumes that the space for hotplug memory is located right after the RamSizeOver4G region and that QEMU will provide a romfile specifying where it ends, so that the BIOS knows from which base to start mapping 64-bit PCI devices.
We should think about both pci hotplug and memory hotplug while we're at it.
Today the 64bit pci window is mapped right above high memory and is sized (in acpi tables) according to what is needed to map the devices present at boot.
The effect is that there is no extra address space for 64bit bars of hotplugged pci devices. And the window is also in the way when it comes to memory hotplug.
Given that some windows versions don't like large 64bit windows, we should make the window size configurable.
So far, from the QEMU side, it's only partially configurable (the memory region mapping, not the ACPI window) via the {i440FX-pcihost|q35-pcihost}.pci-hole64-size property.
The window location can either be made configurable too, or we simply place it at the top of the address space, with "address space" being what the cpu can address according to cpuinfo.
An earlier attempt by Michael to push complete PCI window placement info to SeaBIOS via an "etc/pci-info" romfile was rejected in favor of letting SeaBIOS program the windows at hardcoded locations (32-bit / behind high memory), with a 64-bit window size (in ACPI) that covers all present devices but doesn't account for future PCI hotplug either.
That behavior is maintained in his "ACPI in QEMU" series, see http://patchwork.ozlabs.org/patch/281032/ (acpi_get_pci_info() -> i440fx_pcihost_get_pci_hole64_end() -> pci_bus_get_w64_range()), whose result is then embedded in the ACPI table. So the end result stays the same as before (no usable 64-bit PCI window for hotplug).
But the 64-bit PCI window size, which QEMU caps at an insanely large legacy 62 bits (the memory region size), is somewhat orthogonal to freeing space for memory hotplug below it.
Current qemu reports this by default:
$ cat /proc/cpuinfo
model name      : QEMU Virtual CPU version 1.5.3
address sizes   : 40 bits physical, 48 bits virtual
40 address lines allow 1TB, so we would place the window just below 1TB.
Comments?
More to the point, if the OS supports/enforces a 1TB physical address space, the RAM and the 64-bit PCI hole are going to contend for it. QEMU could abort on startup if they both don't fit into the CPU-supported address space, but I don't see what else it could do.
The proposed patch favors RAM over the 64-bit PCI hole and moves the hole behind the possible RAM, which in the present state of QEMU potentially leaves the rest of the address space, up to 62 bits, for the hole. It has the drawback that one can't get a working VM if QEMU is started in memory hotplug mode with an old BIOS plus PCI devices that require 64-bit bars; otherwise it's backward compatible.
PS: As for remedying BSODs caused by huge _CRS sizes of a particular RAM device / PCI window, that might be solved by splitting one big chunk into several smaller ones; at least that works for the RAM device.
Hi,
So far, from the QEMU side, it's only partially configurable (the memory region mapping, not the ACPI window) via the {i440FX-pcihost|q35-pcihost}.pci-hole64-size property.
/me looks.
Hmm, so the pci-hole64 memory region basically covers the whole non-memory area, leaving no free space.
The window location can either be made configurable too, or we simply place it at the top of the address space, with "address space" being what the cpu can address according to cpuinfo.
An earlier attempt by Michael to push complete PCI window placement info to SeaBIOS via an "etc/pci-info" romfile was rejected in favor of letting SeaBIOS program the windows at hardcoded locations (32-bit / behind high memory), with a 64-bit window size (in ACPI) that covers all present devices but doesn't account for future PCI hotplug either.
Correct. The ACPI tables should reflect what SeaBIOS has programmed, to avoid nasty dependencies between seabios and qemu.
The same should apply to pci-hole64 IMO.
That behavior is maintained in his "ACPI in QEMU" series, see http://patchwork.ozlabs.org/patch/281032/ (acpi_get_pci_info() -> i440fx_pcihost_get_pci_hole64_end() -> pci_bus_get_w64_range()), whose result is then embedded in the ACPI table. So the end result stays the same as before (no usable 64-bit PCI window for hotplug).
Yes. And if we change seabios to do something else qemu nicely adapts to that, without requiring us to update things in lockstep.
But the 64-bit PCI window size, which QEMU caps at an insanely large legacy 62 bits (the memory region size), is somewhat orthogonal to freeing space for memory hotplug below it.
Yep. So seabios should leave some free address space for memory hotplug. And if we change seabios to map the 64bit pci bars somewhere else we should also allow for a larger 64bit pci window to get some address space for pci hotplug.
If we can do that without hints from qemu I'd prefer that.
40 address lines allow 1TB, so we would place the window just below 1TB.
Comments?
More to the point, if the OS supports/enforces a 1TB physical address space, the RAM and the 64-bit PCI hole are going to contend for it. QEMU could abort on startup if they both don't fit into the CPU-supported address space, but I don't see what else it could do.
Yes.
The proposed patch favors RAM over the 64-bit PCI hole and moves the hole behind the possible RAM, which in the present state of QEMU potentially leaves the rest of the address space, up to 62 bits, for the hole.
So you'd end up with the 64bit hole being above the address space the virtual cpu claims to support. Not exactly nice either. Maybe things work nevertheless, maybe not ...
Both cases can easily be fixed by just using a cpu with enough physical address lines to fit everything in, so I don't think we should bother too much about this corner case.
Just in case this wasn't clear: my idea is that seabios figures out the address space size at runtime, so the 1TB would NOT be hard-coded; it just served as an example with the current default qemu cpu.
So with my idea the address space would have all RAM at the bottom (well, starting at 4g). All PCI devices at the top. Free space for hotplug in between. RAM can grow up. PCI space can grow down.
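A minimal SeaBIOS-side sketch of that idea, assuming the cpuid() helper from x86.h and the ALIGN_DOWN macro; the 36-bit fallback for CPUs without leaf 0x80000008 is an illustrative choice, not what a final patch would necessarily do:

    /* Physical address width at runtime: CPUID 0x80000008, EAX[7:0]. */
    static u64 top_of_phys_addr_space(void)
    {
        u32 eax, ebx, ecx, edx;
        cpuid(0x80000000, &eax, &ebx, &ecx, &edx);
        if (eax < 0x80000008)
            return 1ULL << 36;              /* no leaf -> assume 36 bits */
        cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
        return 1ULL << (eax & 0xff);        /* e.g. 40 bits -> 1TB */
    }

    /* In pci_bios_map_devices(): place the 64bit windows at the top,
     * growing down; RAM grows up from 4g, and the gap in between stays
     * free for memory hotplug. */
    u64 top = top_of_phys_addr_space();
    r64_pref.base = ALIGN_DOWN(top - sum_pref, align_pref);
    r64_mem.base = ALIGN_DOWN(r64_pref.base - sum_mem, align_mem);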
Note that qemu can make the 64bit pci window in the acpi tables larger than what is actually used by the mapped bars, to make room for hotplugging, without any help from seabios (once the acpi table generation patches are merged). So with the current seabios (bars mapped right above memory) it can set the end address higher. When seabios starts mapping the pci bars high it can set the start address lower.
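On the QEMU side that padding might look like this sketch, reusing the pci_bus_get_w64_range() helper mentioned earlier in the thread; PCI_W64_HOTPLUG_HEADROOM is a made-up constant, not an existing define:

    /* Report a 64bit _CRS window larger than what the BIOS actually
     * mapped, leaving headroom for hotplugged devices. */
    #define PCI_W64_HOTPLUG_HEADROOM (32ULL << 30)    /* hypothetical 32G */

    static void get_acpi_w64(PCIBus *bus, uint64_t *start, uint64_t *end)
    {
        Range w64;

        pci_bus_get_w64_range(bus, &w64);
        *start = w64.begin;                          /* what seabios mapped */
        *end = w64.end + PCI_W64_HOTPLUG_HEADROOM;   /* grow the end upward */
    }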
Does anyone have a use case not handled by this approach?
It has the drawback that one can't get a working VM if QEMU is started in memory hotplug mode with an old BIOS plus PCI devices that require 64-bit bars; otherwise it's backward compatible.
Yes. Updating seabios will be needed to use memory hotplug together with 64bit pci no matter how we tackle the issue.
On Thu, Oct 10, 2013 at 12:56:23PM +0200, Gerd Hoffmann wrote:
Hi,
So far, from the QEMU side, it's only partially configurable (the memory region mapping, not the ACPI window) via the {i440FX-pcihost|q35-pcihost}.pci-hole64-size property.
/me looks.
Hmm, so the pci-hole64 memory region basically covers the whole non-memory area, leaving no free space.
This is kind of derived from the PIIX spec, although of course it did not discuss 64 bit memory.
The window location can either be made configurable too, or we simply place it at the top of the address space, with "address space" being what the cpu can address according to cpuinfo.
An earlier attempt by Michael to push complete PCI window placement info to SeaBIOS via an "etc/pci-info" romfile was rejected in favor of letting SeaBIOS program the windows at hardcoded locations (32-bit / behind high memory), with a 64-bit window size (in ACPI) that covers all present devices but doesn't account for future PCI hotplug either.
Correct. The ACPI tables should reflect what SeaBIOS has programmed, to avoid nasty dependencies between seabios and qemu.
The same should apply to pci-hole64 IMO.
That behavior is maintained in his "ACPI in QEMU" series, see http://patchwork.ozlabs.org/patch/281032/ (acpi_get_pci_info() -> i440fx_pcihost_get_pci_hole64_end() -> pci_bus_get_w64_range()), whose result is then embedded in the ACPI table. So the end result stays the same as before (no usable 64-bit PCI window for hotplug).
Yes. And if we change seabios to do something else qemu nicely adapts to that, without requiring us to update things in lockstep.
But the 64-bit PCI window size, which QEMU caps at an insanely large legacy 62 bits (the memory region size), is somewhat orthogonal to freeing space for memory hotplug below it.
Yep. So seabios should leave some free address space for memory hotplug. And if we change seabios to map the 64bit pci bars somewhere else we should also allow for a larger 64bit pci window to get some address space for pci hotplug.
If we can do that without hints from qemu I'd prefer that.
I think the simplest way to do all this is simply to tell seabios that we have more memory. seabios already programs 64 bit BARs higher than memory.
No new interface seems necessary.
40 address lines allow 1TB, so we would place the window just below 1TB.
Comments?
More to the point, if the OS supports/enforces a 1TB physical address space, the RAM and the 64-bit PCI hole are going to contend for it. QEMU could abort on startup if they both don't fit into the CPU-supported address space, but I don't see what else it could do.
Yes.
The proposed patch favors RAM over the 64-bit PCI hole and moves the hole behind the possible RAM, which in the present state of QEMU potentially leaves the rest of the address space, up to 62 bits, for the hole.
So you'd end up with the 64bit hole being above the address space the virtual cpu claims to support. Not exactly nice either. Maybe things work nevertheless, maybe not ...
Both cases can easily be fixed by just using a cpu with enough physical address lines to fit everything in, so I don't think we should bother too much about this corner case.
Just in case this wasn't clear: my idea is that seabios figures out the address space size at runtime, so the 1TB would NOT be hard-coded; it just served as an example with the current default qemu cpu.
So with my idea the address space would have all RAM at the bottom (well, starting at 4g). All PCI devices at the top. Free space for hotplug in between. RAM can grow up. PCI space can grow down.
Note that qemu can make the 64bit pci window in the acpi tables larger than what is actually used by the mapped bars, to make room for hotplugging, without any help from seabios (once the acpi table generation patches are merged). So with the current seabios (bars mapped right above memory) it can set the end address higher. When seabios starts mapping the pci bars high it can set the start address lower.
Does anyone have a use case not handled by this approach?
I think the issue is with legacy guests. E.g. if the VCPU claims to support 50 bits of physical memory, do we put high PCI memory at 1 << 50? If yes, old guests which expect at most 40 bits will not be able to use it.
It has the drawback that one can't get a working VM if QEMU is started in memory hotplug mode with an old BIOS plus PCI devices that require 64-bit bars; otherwise it's backward compatible.
Yes. Updating seabios will be needed to use memory hotplug together with 64bit pci no matter how we tackle the issue.
Hi,
I think the simplest way to do all this is simply to tell seabios that we have more memory. seabios already programs 64 bit BARs higher than memory.
Hmm? As I understand it, Igor just wants some address space for memory hotplug. So there wouldn't be memory there (yet). And telling seabios there is memory although there isn't will make seabios place wrong info into the e820 tables. Not going to fly.
I think the issue is with legacy guests. E.g. if the VCPU claims to support 50 bits of physical memory, do we put high PCI memory at 1 << 50? If yes, old guests which expect at most 40 bits will not be able to use it.
Hmm. Sure such guests exist? Note this is about physical address lines, not virtual address space (where you might need an additional level of pagetables to fully use it, which is not something we could expect old guests to be able to handle).
cheers, Gerd
On Thu, Oct 10, 2013 at 02:14:16PM +0200, Gerd Hoffmann wrote:
Hi,
I think the simplest way to do all this is simply to tell seabios that we have more memory. seabios already programs 64 bit BARs higher than memory.
Hmm? As I understand it, Igor just wants some address space for memory hotplug. So there wouldn't be memory there (yet). And telling seabios there is memory although there isn't will make seabios place wrong info into the e820 tables. Not going to fly.
True. Maybe we should get some smbios stuff from qemu too.
I think the issue is with legacy guests. E.g. if the VCPU claims to support 50 bits of physical memory, do we put high PCI memory at 1 << 50? If yes, old guests which expect at most 40 bits will not be able to use it.
Hmm. Sure such guests exist?
I wouldn't be surprised. At least some windows guests crash if you try to tell them your system has too much physical memory (e.g. 2^48).
Note this is about physical address lines, not virtual address space (where you might need an additional level of pagetables to fully use it, which is not something we could expect old guests to be able to handle).
Hi,
I think the issue is with legacy guests. E.g. if the VCPU claims to support 50 bits of physical memory, do we put high PCI memory at 1 << 50? If yes, old guests which expect at most 40 bits will not be able to use it.
Hmm. Sure such guests exist?
I wouldn't be surprised. At least some windows guests crash if you try to tell them your system has too much physical memory (e.g. 2^48).
Ok, so there is not really a way around making the location configurable. The size isn't needed; qemu can handle that on its own.
Guess we can just go with Igor's approach then. "etc/mem64-end" is a pretty bad name to say "please map 64bit pci bars here" though.
cheers, Gerd
On Thu, 10 Oct 2013 14:42:07 +0200 Gerd Hoffmann kraxel@redhat.com wrote:
Hi,
I think the issue is with legacy guests. E.g. if the VCPU claims to support 50 bits of physical memory, do we put high PCI memory at 1 << 50? If yes, old guests which expect at most 40 bits will not be able to use it.
Hmm. Sure such guests exist?
I wouldn't be surprised. At least some windows guests crash if you try to tell them your system has too much physical memory (e.g. 2^48).
Ok, so there is not really a way around making the location configurable. The size isn't needed; qemu can handle that on its own.
Guess we can just go with Igor's approach then. "etc/mem64-end" is a pretty bad name to say "please map 64bit pci bars here" though.
The reasoning behind it was to tell the BIOS where RAM ends and let it decide what to do with that information.
But we could go the other way around and use the "etc/pci-info" file that Michael proposed earlier; it is already committed into QEMU and provides the start/end of the 32/64-bit PCI windows in QEMU's view. We could use pci-info.w64.start as the base for 64-bit bars. If that's good enough, I'll amend my patch to use it.
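If that route were taken, the SeaBIOS read might look like the sketch below. The struct layout (two start/end pairs of little-endian u64 values) is an assumption derived from the description above, not a verified definition of "etc/pci-info":

    /* Assumed "etc/pci-info" layout; read via the SeaBIOS romfile API. */
    struct pci_info {
        u64 w32_start, w32_end;
        u64 w64_start, w64_end;
    };

    static u64 pci_info_w64_start(u64 fallback)
    {
        struct pci_info info;
        struct romfile_s *f = romfile_find("etc/pci-info");
        if (!f || f->size < sizeof(info)
            || f->copy(f, &info, sizeof(info)) < 0)
            return fallback;
        return le64_to_cpu(info.w64_start);
    }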
Hi,
Guess we can just go with Igor's approach then. "etc/mem64-end" is a pretty bad name to say "please map 64bit pci bars here" though.
The reasoning behind it was to tell the BIOS where RAM ends and let it decide what to do with that information.
But we could go the other way around and use the "etc/pci-info" file that Michael proposed earlier; it is already committed into QEMU and provides the start/end of the 32/64-bit PCI windows in QEMU's view. We could use pci-info.w64.start as the base for 64-bit bars.
We only need a single value from pci-info, so I'd suggest dropping pci-info in favor of a file you can read using romfile_loadint.
cheers, Gerd
On Thu, 10 Oct 2013 15:21:55 +0200 Gerd Hoffmann kraxel@redhat.com wrote:
Hi,
Guess we can just go with Igor's approach then. "etc/mem64-end" is a pretty bad name to say "please map 64bit pci bars here" though.
The reasoning behind it was to tell the BIOS where RAM ends and let it decide what to do with that information.
But we could go the other way around and use the "etc/pci-info" file that Michael proposed earlier; it is already committed into QEMU and provides the start/end of the 32/64-bit PCI windows in QEMU's view. We could use pci-info.w64.start as the base for 64-bit bars.
We only need a single value from pci-info, so I'd suggest dropping pci-info in favor of a file you can read using romfile_loadint.
Ok, then would "etc/pcimem64-start" be suitable, or maybe you have a better suggestion?
Michael, while we're at it, could we safely rip out "etc/pci-info" from QEMU? It is disabled by pc_compat_1_6() in 1.6 but will appear again in 1.7 if we don't remove it.
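With that rename, the SeaBIOS hunk from the original patch would only change the file name; same semantics, sketched:

    u64 base64 = le64_to_cpu(romfile_loadint("etc/pcimem64-start",
                                             0x100000000ULL + RamSizeOver4G));
    ...
    r64_mem.base = ALIGN(base64, align_mem);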
Hi,
Ok, then would "etc/pcimem64-start" be suitable, or maybe you have a better suggestion?
Looks good to me.
cheers, Gerd
On Thu, 10 Oct 2013 15:21:32 +0300 "Michael S. Tsirkin" mst@redhat.com wrote:
On Thu, Oct 10, 2013 at 02:14:16PM +0200, Gerd Hoffmann wrote:
Hi,
I think the simplest way to do all this is simply to tell seabios that we have more memory. seabios already programs 64 bit BARs higher than memory.
Hmm? As I understand it, Igor just wants some address space for memory hotplug. So there wouldn't be memory there (yet). And telling seabios there is memory although there isn't will make seabios place wrong info into the e820 tables. Not going to fly.
True. Maybe we should get some smbios stuff from qemu too.
I think the issue is with legacy guests. E.g. if the VCPU claims to support 50 bits of physical memory, do we put high PCI memory at 1 << 50? If yes, old guests which expect at most 40 bits will not be able to use it.
Hmm. Sure such guests exist?
I wouldn't be surprised. At least some windows guests crash if you try to tell them your system has too much physical memory (e.g. 2^48).
Confirmed; the same happened when a memory device was mapped too high, though I can't recall the windows version.
Note this is about physical address lines, not virtual address space (where you might need an additional level of pagetables to fully use it, which is not something we could expect old guests to be able to handle).
On Wed, Oct 09, 2013 at 02:23:04PM +0200, Igor Mammedov wrote:
I'm posting this to get opinions on one of the possible approaches for where to map hotplug memory.
This patch assumes that the space for hotplug memory is located right after the RamSizeOver4G region and that QEMU will provide a romfile specifying where it ends, so that the BIOS knows from which base to start mapping 64-bit PCI devices.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Well there are two things bios does with RamSizeOver4G: determine where to map PCI devices, and fill in smbios.
I wonder whether we should fill in smbios from qemu too; that would let us side-step the issue and just make RamSizeOver4G larger.
Let's see how the ACPI patchset fares first ...
 src/fw/pciinit.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index b29db99..62f8d4e 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -18,6 +18,8 @@
 #include "paravirt.h" // RamSize
 #include "string.h" // memset
 #include "util.h" // pci_setup
+#include "byteorder.h" // le64_to_cpu
+#include "romfile.h" // romfile_loadint

 #define PCI_DEVICE_MEM_MIN   0x1000
 #define PCI_BRIDGE_IO_MIN    0x1000
@@ -764,6 +766,8 @@ static void pci_bios_map_devices(struct pci_bus *busses)
 {
     if (pci_bios_init_root_regions(busses)) {
         struct pci_region r64_mem, r64_pref;
+        u64 base64 = le64_to_cpu(romfile_loadint("etc/mem64-end",
+                                     0x100000000ULL + RamSizeOver4G));
         r64_mem.list.first = NULL;
         r64_pref.list.first = NULL;
         pci_region_migrate_64bit_entries(&busses[0].r[PCI_REGION_TYPE_MEM],
@@ -779,7 +783,7 @@ static void pci_bios_map_devices(struct pci_bus *busses)
         u64 align_mem = pci_region_align(&r64_mem);
         u64 align_pref = pci_region_align(&r64_pref);

-        r64_mem.base = ALIGN(0x100000000LL + RamSizeOver4G, align_mem);
+        r64_mem.base = ALIGN(base64, align_mem);
         r64_pref.base = ALIGN(r64_mem.base + sum_mem, align_pref);
         pcimem64_start = r64_mem.base;
         pcimem64_end = r64_pref.base + sum_pref;
--
1.8.3.1