Hello coreboot community,
One of coreboot's goals (according to the home page) is to be lightning fast! I'm currently working on improving the boot time for the AMD Cezanne platform. One place where we spend a decent amount of time is reading from SPI flash. We have the SPI speed/modes set to the optimal settings for our platforms, but there is still room for improvement.
One feature the AMD platforms provide is a SPI DMA controller. It can perform reads of up to 64 KiB from SPI flash to RAM without involving the CPU, which opens the door to parallelizing SPI reads with coreboot execution. This effectively reduces SPI read time to zero as far as the BSP is concerned (assuming you can preload early enough).
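To make that concrete, here is a rough sketch of how a read could overlap with other work. The names below (spi_dma_start_read, spi_dma_is_busy, do_other_init_work) are made up for illustration and are not the actual driver interface:

```
/*
 * Hypothetical sketch -- illustrative names only, not the real
 * Cezanne SPI DMA driver interface.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

void spi_dma_start_read(void *dest, uint32_t flash_offset, size_t size);
bool spi_dma_is_busy(void);
void do_other_init_work(void);	/* stand-in for unrelated BSP work */

static void overlap_example(void *buf)
{
	/* Kick off a <= 64 KiB SPI-to-RAM transfer; the CPU is not involved. */
	spi_dma_start_read(buf, 0x100000 /* arbitrary offset */, 64 * 1024);

	/* The BSP keeps executing coreboot while the transfer runs. */
	do_other_init_work();

	/* Only spin if the data is needed before the transfer has finished. */
	while (spi_dma_is_busy())
		;
}
```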
Another difference with the latest AMD SoCs (Picasso, Cezanne) is that RAM is available in bootblock. This means we can preload compressed stages into RAM before they are required, at the cost of a larger RAM footprint.
The question is, how do we model these asynchronous operations, how is data ownership handled, and how does the BSP know the operation is done?
I have proposed a Futures API [here](https://review.coreboot.org/c/coreboot/+/56047) that I think addresses all the questions. I think it is simple to understand, hard for consumers to misuse, and pretty straightforward to implement.
The patch train also adds a `readat_async` operation to `rdev`, adds `cbfs_load_async`, and implements APOB (i.e., MRC) preloading and compressed payload preloading. So far this has saved over 30 ms. I still have the goal of preloading the VBIOS, uCode, and ramstage, and I want to do all of this while adding minimal complexity to the synchronous path.
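For illustration, the consumer side ends up looking roughly like this. The signatures below (`struct future`, `future_wait`, and this particular `cbfs_load_async` prototype) are simplified stand-ins, not the exact API in the patches:

```
/*
 * Simplified sketch with made-up signatures -- see the patch train for
 * the real futures API and cbfs_load_async().
 */
#include <stddef.h>
#include <types.h>

struct future;	/* opaque handle for an in-flight async operation */

struct future *cbfs_load_async(const char *name, void *buf, size_t size);
enum cb_err future_wait(struct future *f);	/* block until it completes */

static enum cb_err preload_payload_example(void *buf, size_t size)
{
	/* Start the SPI DMA read; this call returns immediately. */
	struct future *f = cbfs_load_async("fallback/payload", buf, size);

	/* ... unrelated init work runs here while the transfer proceeds ... */

	/* Wait only where the data is actually consumed; if the transfer
	   already finished, this costs (almost) nothing. */
	return future_wait(f);
}
```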
I'm curious to see what the community thinks, and welcome any feedback.
Thanks, Raul
Raul Rangel wrote:
I'm currently working on improving the boot time for the AMD Cezanne platform.
..
Another difference with the latest AMD SoCs (Picasso, Cezanne) is that RAM is available in bootblock.
As I understand it, the PSP has both trained RAM and copied firmware from SPI to RAM by the time x86 comes out of reset.
Is that accurate, false, or only partially accurate?
If not fully accurate then how are you accessing the SPI chip? I guess it can't be memory mapped?
One place where we spend a decent amount of time is reading from SPI flash. We have the SPI speed/modes set to the optimal settings for our platforms, but there is still room for improvement.
Please provide numbers?
The question is, how do we model these asynchronous operations, how is data ownership handled, and how does the BSP know the operation is done?
I haven't yet reviewed your API proposal, but I find it an absolutely horrible idea to create a *general* API for asynchronous operations in coreboot, because - as you recognize - it can easily be misused to the great detriment of the codebase, which already suffers chronically from such trivial problems as copy-paste-itis. Don't do it.
There is zero incentive for developers to improve the source beyond making it work for their deadline; more complexity *will* create more problems. (I understand that you have good intentions proposing this change!)
I'm curious to see what the community thinks, and welcome any feedback.
A special purpose DMA API is another matter to me, because it's very well defined. It could still be useful beyond x86 and I think a blocking "wait-for-DMA-to-finish" API is easy to understand, easy to implement and to use, and sufficient since runtime flow is serial, especially when measuring.
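Something along these lines would be enough in my view - the names are invented just to show the shape of such an API:

```
/*
 * Illustration only -- invented names, not an existing coreboot API.
 * A narrow, special-purpose DMA interface with a blocking wait.
 */
#include <stddef.h>
#include <stdint.h>

void spi_dma_start_read(void *dest, uint32_t flash_offset, size_t size);
void spi_dma_wait_for_finish(void);	/* block until the transfer is done */

static void load_blob(void *dest, uint32_t offset, size_t size)
{
	spi_dma_start_read(dest, offset, size);
	/* Serial flow: simply wait here; nothing to schedule or track. */
	spi_dma_wait_for_finish();
}
```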
//Peter
On Wed, Jul 7, 2021 at 12:49 PM Peter Stuge peter@stuge.se wrote:
Raul Rangel wrote:
I'm currently working on improving the boot time for the AMD Cezanne platform.
..
Another difference with the latest AMD SoCs (Picasso, Cezanne) is that RAM is available in bootblock.
As I understand it, the PSP has both trained RAM and copied firmware from SPI to RAM by the time x86 comes out of reset.
Is that accurate, false, or only partially accurate?
It's partially accurate. The PSP only loads the bootblock into RAM, so coreboot still needs to access the SPI flash to load everything else.
One place where we spend a decent amount of time is reading from SPI flash. We have the SPI speed/modes set to the optimal settings for our platforms, but there is still room for improvement.
Please provide numbers?
Sorry, that sentence didn't come out how I wanted. I was just saying that we could improve the boot time by exploring other avenues.
The question is, how do we model these asynchronous operations, how is data ownership handled, and how does the BSP know the operation is done?
I haven't yet reviewed your API proposal, but I find it an absolutely horrible idea to create a *general* API for asynchronous operations in coreboot, because - as you recognize - it can easily be misused to the great detriment of the codebase, which already suffers chronically from such trivial problems as copy-paste-itis. Don't do it.
There is zero incentive for developers to improve the source beyond making it work for their deadline; more complexity *will* create more problems. (I understand that you have good intentions proposing this change!)
There has been a lot of work in refactoring the AMD codebases and fixing things throughout the stack. We have reduced a good amount of copy/paste when bringing up a new AMD SoC. So I don't agree that we are all writing abandonware. Please have a look at the proposal before making such hasty judgments.
I'm curious to see what the community thinks, and welcome any feedback.
A special purpose DMA API is another matter to me, because it's very well defined. It could still be useful beyond x86 and I think a blocking "wait-for-DMA-to-finish" API is easy to understand, easy to implement and to use, and sufficient since runtime flow is serial, especially when measuring.
That was my original thought, but there are plenty of other things that can be modeled as asynchronous operations. Stuffing it into the rdev API felt wrong.
Thanks, Raul
p.s.,
Just to add some more data. I performed an experiment where I switched the payload compression from LZMA to LZ4. Since we no longer pay the SPI read cost (thanks async!), we essentially load the payload for free:
## Payload LZMA compression (current)

```
Name              Offset    Type        Size
fallback/payload  0x224dc0  simple elf  131417 (128 KiB)
```

```
 90:starting to load payload    1,123,962 (13)
 15:starting LZMA decompress    1,123,967 (5)
 16:finished LZMA decompress    1,131,201 (7,234)
```

## Payload LZ4 compression

```
Name              Offset    Type        Size
fallback/payload  0x224dc0  simple elf  173233 (169 KiB)
```

```
 90:starting to load payload    1,130,877 (12)
 17:starting LZ4 decompress     1,130,882 (5)
 18:finished LZ4 decompress     1,131,091 (209)
```

```
Payload size increase:
  KiB: 169 - 128 = 41 KiB
  %:   41 / 128 ≈ 32%

Decompression time decrease:
  us: 209 - 7234 = -7025 us
  %:  -7025 / 7234 ≈ -97%
```
Pulling some of the discussion we already had inside Google out to this wider audience: I am strongly against parallelization mechanisms that pull this much extra complexity into existing code (particularly the already very complicated CBFS framework). I think https://review.coreboot.org/c/coreboot/+/56049 is a perfect example of the maintainability drawbacks of this approach: large new chunks of code have to be woven down through the whole stack, from CBFS into the rdev layer and finally into the platform SPI driver, just to be able to execute a single read operation asynchronously. The CBFS patch isn't even complete; more code will be needed to support other primitives (e.g. cbfs_map()), decompression, etc. And it duplicates a lot of the existing flow from the synchronous code in _cbfs_alloc().
I think if we want a parallelization mechanism, we should converge on a single one that can solve a variety of current and future use cases and can be applied optionally where individual platforms need it, without having to grow tentacles into every other part of coreboot. And I think the much better model for that would be cooperative multithreading (e.g. setting up a second stack and execution context and having the CPU ping-pong back and forth between them whenever one of them is idle waiting on hardware). In fact, coreboot already has all of this implemented in src/lib/threads.c (CONFIG_COOP_MULTITASKING), although it seems that nobody has really used it in a while.
The clear advantage of cooperative multithreading is that almost none of the code, especially not the in-between layers like CBFS and rdev, needs to be modified to support it. You can create a thread and send it off to, say, begin loading the payload while the rest of the boot state machine is still running platform initialization code, and it can run through all the CBFS motions (including verification and whatever other optional features are enabled) without any of that code having been written specifically with asynchronous operation in mind. You still need to make sure that the two threads don't both try to access shared resources at the same time, but you need to do that with futures too. In the file loading case, it would probably be enough to put a few critical sections in the SPI driver itself, with no need to pull locking up into all the platform-agnostic framework layers.
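As a rough sketch (the signatures below are hypothetical; the real interface lives in src/lib/threads.c):

```
/*
 * Rough sketch only -- hypothetical signatures; see src/lib/threads.c
 * (CONFIG_COOP_MULTITASKING) for the actual cooperative threading code.
 */
#include <types.h>

struct thread_handle { int dummy; };	/* placeholder for the real handle */

int thread_run(struct thread_handle *handle, enum cb_err (*func)(void *), void *arg);
enum cb_err thread_join(struct thread_handle *handle);
void payload_load(void);		/* the normal synchronous loading path */
void run_platform_init(void);		/* stand-in for the rest of the boot flow */

static struct thread_handle payload_thread;

static enum cb_err preload_payload(void *unused)
{
	/* Goes through the ordinary synchronous CBFS path (verification,
	   decompression, ...) -- none of that code has to know about async. */
	payload_load();
	return CB_SUCCESS;
}

static void boot_example(void)
{
	/* Send a second execution context off to load the payload... */
	thread_run(&payload_thread, preload_payload, NULL);

	/* ...while the main context keeps running platform init. Whenever
	   either side blocks on hardware, it yields to the other. */
	run_platform_init();

	/* Rendezvous before jumping to the payload. */
	thread_join(&payload_thread);
}
```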
The cost of cooperative multithreading is mostly just that you need the space for a second stack. In ramstage that should be a given anyway, and in romstage it can probably still be easily done on most (particularly recent) platforms. For an optional performance enhancement feature, I think that trade-off makes sense.
So I've spun a patch train that uses threads: https://review.coreboot.org/c/coreboot/+/56225/1
Let me know what you think.
Thanks!
Thanks, I think it looks pretty great! That solves exactly the concerns I had with the other implementation.
On Mon, Jul 12, 2021 at 3:06 PM Raul Rangel rrangel@chromium.org wrote:
So I've spun a patch train that uses threads: https://review.coreboot.org/c/coreboot/+/56225/1
Let me know what you think.
Thanks!