Hi,
This series aims to fix an issue of booting a guest with many PVSCSI disk controllers (>=16). Currently, booting such a guest result in exhausting available memory in high-zone which leads to guest crash.
The 1st patch improve PVSCSI memory consumption by removing an unnecessary contraint of allocating a page-aligned memory.
The 2nd patch modifies SeaBIOS high-zone allocator to fallback to low-zone in case high-zone is exhausted. This is possible because high-zone is used only by 32-bit code which don't care if allocation is from high-zone or low-zone and just uses high-zone because it has more space. However, in case there is no more memory in high-zone, it makes sense to utilize remaining memory in low-zone to still be able to boot guest successfully.
Regards, -Liran
In contrast to other allocations made by pvscsi_init_rings(), ring_desc is only used internally by SeaBIOS (not passed to device-controller) and there is not restriction which force it to be page aligned.
Reviewed-by: Mark Kanda mark.kanda@oracle.com Signed-off-by: Liran Alon liran.alon@oracle.com --- src/hw/pvscsi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/hw/pvscsi.c b/src/hw/pvscsi.c index e0ea33cd751c..9d7d68d80964 100644 --- a/src/hw/pvscsi.c +++ b/src/hw/pvscsi.c @@ -167,7 +167,7 @@ pvscsi_init_rings(void *iobase, struct pvscsi_ring_dsc_s **ring_dsc) { struct PVSCSICmdDescSetupRings cmd = {0,};
- struct pvscsi_ring_dsc_s *dsc = memalign_high(PAGE_SIZE, sizeof(*dsc)); + struct pvscsi_ring_dsc_s *dsc = malloc_high(sizeof(*dsc)); if (!dsc) { warn_noalloc(); return;
On Tue, Nov 13, 2018 at 05:53:40PM +0200, Liran Alon wrote:
In contrast to other allocations made by pvscsi_init_rings(), ring_desc is only used internally by SeaBIOS (not passed to device-controller) and there is not restriction which force it to be page aligned.
Thanks. This patch is fine, but we're planning to make a SeaBIOS release in a few days - I'd look to commit this change post release.
-Kevin
On 13 Nov 2018, at 20:10, Kevin O'Connor kevin@koconnor.net wrote:
On Tue, Nov 13, 2018 at 05:53:40PM +0200, Liran Alon wrote:
In contrast to other allocations made by pvscsi_init_rings(), ring_desc is only used internally by SeaBIOS (not passed to device-controller) and there is not restriction which force it to be page aligned.
Thanks. This patch is fine, but we're planning to make a SeaBIOS release in a few days - I'd look to commit this change post release.
-Kevin
I would prefer it will enter this release but I’m fine if it doesn’t make it. As long as it is eventually committed.
Thanks, -Liran
On 13 Nov 2018, at 20:10, Kevin O'Connor kevin@koconnor.net wrote:
On Tue, Nov 13, 2018 at 05:53:40PM +0200, Liran Alon wrote:
In contrast to other allocations made by pvscsi_init_rings(), ring_desc is only used internally by SeaBIOS (not passed to device-controller) and there is not restriction which force it to be page aligned.
Thanks. This patch is fine, but we're planning to make a SeaBIOS release in a few days - I'd look to commit this change post release.
-Kevin
Gentle ping on committing this now after v1.12.0 was released.
-Liran
On Tue, Nov 13, 2018 at 05:53:40PM +0200, Liran Alon wrote:
In contrast to other allocations made by pvscsi_init_rings(), ring_desc is only used internally by SeaBIOS (not passed to device-controller) and there is not restriction which force it to be page aligned.
Thanks. I committed this change.
-Kevin
When 32-bit code needs to allocate memory which remains permanent after completion of the initialization phase, it doesn't care if it is allocated from high-zone or low-zone.
Currently all these cases just attempt to allocate from high-zone as low-zone memory is scarce.
However, in cases we don't have enough space left in high-zone, it makes sense to use low-zone as a fallback.
This commit fixes an issue of a VM which failed to boot when configured with 16 PVSCSI disk controllers. The VM failed to boot because pvscsi_init_rings() failed to allocate ring_dsc when high-zone didn't had enough space for it.
Reviewed-by: Mark Kanda mark.kanda@oracle.com Signed-off-by: Liran Alon liran.alon@oracle.com --- src/malloc.h | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/src/malloc.h b/src/malloc.h index 960a7f8000f8..d8bf5092dce0 100644 --- a/src/malloc.h +++ b/src/malloc.h @@ -31,7 +31,10 @@ static inline void *malloc_low(u32 size) { return _malloc(&ZoneLow, size, MALLOC_MIN_ALIGN); } static inline void *malloc_high(u32 size) { - return _malloc(&ZoneHigh, size, MALLOC_MIN_ALIGN); + void *ret = _malloc(&ZoneHigh, size, MALLOC_MIN_ALIGN); + if (ret) + return ret; + return malloc_low(size); } static inline void *malloc_fseg(u32 size) { return _malloc(&ZoneFSeg, size, MALLOC_MIN_ALIGN); @@ -52,7 +55,10 @@ static inline void *memalign_low(u32 align, u32 size) { return _malloc(&ZoneLow, size, align); } static inline void *memalign_high(u32 align, u32 size) { - return _malloc(&ZoneHigh, size, align); + void *ret = _malloc(&ZoneHigh, size, align); + if (ret) + return ret; + return memalign_low(align, size); } static inline void *memalign_tmplow(u32 align, u32 size) { return _malloc(&ZoneTmpLow, size, align);
On Tue, Nov 13, 2018 at 05:53:41PM +0200, Liran Alon wrote:
When 32-bit code needs to allocate memory which remains permanent after completion of the initialization phase, it doesn't care if it is allocated from high-zone or low-zone.
Currently all these cases just attempt to allocate from high-zone as low-zone memory is scarce.
However, in cases we don't have enough space left in high-zone, it makes sense to use low-zone as a fallback.
Thanks, but I fear this change would have poor secondary effects. There is only a tiny amount of ram available in the permanent "low" area of ram (typically less that 64K). If the code has exhausted the permanent "high" ram (typically 256K) then it is likely to rapidly exhaust the low ram and cause a worse failure when code that requires a tiny amount of "low" space fails.
Typical work arounds are to extend the amount of high ram available (see BUILD_MAX_HIGHTABLE) or to use malloc_tmphigh() and manually reserve the space in the e820 map.
-Kevin
On 13 Nov 2018, at 20:16, Kevin O'Connor kevin@koconnor.net wrote:
On Tue, Nov 13, 2018 at 05:53:41PM +0200, Liran Alon wrote:
When 32-bit code needs to allocate memory which remains permanent after completion of the initialization phase, it doesn't care if it is allocated from high-zone or low-zone.
Currently all these cases just attempt to allocate from high-zone as low-zone memory is scarce.
However, in cases we don't have enough space left in high-zone, it makes sense to use low-zone as a fallback.
Thanks, but I fear this change would have poor secondary effects. There is only a tiny amount of ram available in the permanent "low" area of ram (typically less that 64K). If the code has exhausted the permanent "high" ram (typically 256K) then it is likely to rapidly exhaust the low ram and cause a worse failure when code that requires a tiny amount of "low" space fails.
I disagree with “likely”.
As I see it, the caller which allocates from high-zone doesn’t care if it is satisfied eventually from high-zone or low-zone and if we got to the rare scenario where we exhausted all high-zone memory and low-zone memory can save us, why not try? This could save the guest of being able to boot on some scenarios (As the scenario I have encountered when attaching many PVSCSI controllers). And that seems better to me than to fail on allocation on SeaBIOS which is a hard thing to debug for the common user which is not a SeaBIOS developer.
Also, why will failures on allocation on low-zone should be worse than failures on allocation on high-zone? From the code I see that all allocations failures eventually call warn_noalloc() which should make it very easy to debug for a SeaBIOS developer. This is true both for allocations from high-zone and allocations from low-zone.
Typical work arounds are to extend the amount of high ram available (see BUILD_MAX_HIGHTABLE) or to use malloc_tmphigh() and manually reserve the space in the e820 map.
I think this patch is good regardless if modifying BUILD_MAX_HIGHTABLE can be served as a workaround. (Also if it is a typical workaround, why isn’t it configurable as a Makefile parameter?)
I find it odd to allocate from malloc_tmphigh() and manually reserve the space in the e820 map. As it basically undermines the definition of the Temporary High-Zone which should not be reserved from OS and therefore must not be accessed after SeaBIOS initialisation phase.
Also, it is kinda hard to put the finger on which code you would change to use malloc_tmphigh() like this all the time. Will you do this change for example in pvscsi.c code? It seems odd to me that pvscsi code should deal with such low-level memory handling instead of all this being transparently handled in the malloc_*() functions.
-Liran
-Kevin
Hi,
There is only a tiny amount of ram available in the permanent "low" area of ram (typically less that 64K). If the code has exhausted the permanent "high" ram (typically 256K) then it is likely to rapidly exhaust the low ram and cause a worse failure when code that requires a tiny amount of "low" space fails.
I disagree with “likely”.
So it currently fails with 16 controllers. This patch brings it how far? 20 controllers? It's not like this patch scales things up much.
Typical work arounds are to extend the amount of high ram available (see BUILD_MAX_HIGHTABLE) or to use malloc_tmphigh() and manually reserve the space in the e820 map.
I think this patch is good regardless if modifying BUILD_MAX_HIGHTABLE can be served as a workaround. (Also if it is a typical workaround, why isn’t it configurable as a Makefile parameter?)
Making this configurable (via kconfig) makes sense indeed.
I find it odd to allocate from malloc_tmphigh() and manually reserve the space in the e820 map. As it basically undermines the definition of the Temporary High-Zone which should not be reserved from OS and therefore must not be accessed after SeaBIOS initialisation phase.
pmm (see src/pmm.c) does this for large allocations.
Also, it is kinda hard to put the finger on which code you would change to use malloc_tmphigh() like this all the time. Will you do this change for example in pvscsi.c code?
The logic in pmm could be moved to malloc.c. Probably needs some refinements then, the current code is meant to handle rare and very large (megabytes) allocations. In case we run out of space on small allocations it probably makes sense to grab a block from the tmphigh zone, reserve it in the e820 map and add it to the high zone pool instead. So effectively the high zone can grow dynamically.
cheers, Gerd
On 14 Nov 2018, at 8:28, Gerd Hoffmann kraxel@redhat.com wrote:
Hi,
There is only a tiny amount of ram available in the permanent "low" area of ram (typically less that 64K). If the code has exhausted the permanent "high" ram (typically 256K) then it is likely to rapidly exhaust the low ram and cause a worse failure when code that requires a tiny amount of "low" space fails.
I disagree with “likely”.
So it currently fails with 16 controllers. This patch brings it how far? 20 controllers? It's not like this patch scales things up much.
That’s true.
Typical work arounds are to extend the amount of high ram available (see BUILD_MAX_HIGHTABLE) or to use malloc_tmphigh() and manually reserve the space in the e820 map.
I think this patch is good regardless if modifying BUILD_MAX_HIGHTABLE can be served as a workaround. (Also if it is a typical workaround, why isn’t it configurable as a Makefile parameter?)
Making this configurable (via kconfig) makes sense indeed.
Ok. Will create such a patch.
I find it odd to allocate from malloc_tmphigh() and manually reserve the space in the e820 map. As it basically undermines the definition of the Temporary High-Zone which should not be reserved from OS and therefore must not be accessed after SeaBIOS initialisation phase.
pmm (see src/pmm.c) does this for large allocations.
Also, it is kinda hard to put the finger on which code you would change to use malloc_tmphigh() like this all the time. Will you do this change for example in pvscsi.c code?
The logic in pmm could be moved to malloc.c. Probably needs some refinements then, the current code is meant to handle rare and very large (megabytes) allocations. In case we run out of space on small allocations it probably makes sense to grab a block from the tmphigh zone, reserve it in the e820 map and add it to the high zone pool instead. So effectively the high zone can grow dynamically.
If you agree to do the logic you suggested inside malloc_high() internally without caller being bothered with these details, then I agree this is a solution that scales better and I will create such a patch.
Thanks! -Liran
cheers, Gerd