Dear coreboot folks,
With 128 GB of RAM consisting of eight 16 GB modules, coreboot takes over a minute to get to the payload on the Asus KGPE-D16 even without serial console enabled [1]. This is not much faster than the vendor firmware.
Please note that the timings below are incorrect. Comparing them with the timings from SeaBIOS’ `scripts/readserial.py`, I’d say the time values need to be multiplied by two. (Somebody mentioned that the reason might be `cbmem -t` using a scaling factor to convert the timestamps to seconds, and that this factor might be wrong on this board.)
```
$ more asus/kgpe-d16/4.5-1093-g308aeff/2017-03-01T16_03_07Z/coreboot_timestamps.txt
21 entries total:

   0:1st timestamp                                  24,384
   1:start of rom stage                             25,061 (676)
   2:before ram initialization                      913,502 (888,441)
   3:after ram initialization                       35,548,889 (34,635,386)
   4:end of romstage                                35,642,960 (94,070)
   8:starting to load ramstage                      35,647,351 (4,391)
  15:starting LZMA decompress (ignore for x86)      35,647,872 (520)
  16:finished LZMA decompress (ignore for x86)      35,695,864 (47,991)
   9:finished loading ramstage                      35,696,312 (447)
  10:start of ramstage                              35,696,893 (581)
  30:device enumeration                             35,696,897 (3)
  40:device configuration                           36,639,627 (942,730)
  50:device enable                                  36,644,848 (5,221)
  60:device initialization                          36,646,012 (1,163)
  70:device setup done                              37,044,848 (398,836)
  75:cbmem post                                     37,044,850 (1)
  80:write tables                                   37,044,851 (1)
  85:finalize chips                                 37,053,950 (9,099)
  90:load payload                                   37,324,647 (270,697)
  15:starting LZMA decompress (ignore for x86)      37,325,042 (395)
  16:finished LZMA decompress (ignore for x86)      37,349,321 (24,278)
  99:selfboot jump                                  37,349,328 (7)
```
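To make the suspected factor-of-two error concrete, here is a minimal sketch (all names invented for illustration, not coreboot’s actual code) of how a raw TSC tick count is typically converted to the microsecond values `cbmem -t` prints: elapsed ticks divided by a ticks-per-microsecond factor. If the tick frequency is detected at twice the real rate, every printed value comes out half as large.

```
/*
 * Illustrative sketch only: ticks_to_usecs() and tick_freq_mhz are
 * invented names. A misdetected tick frequency halves (or doubles)
 * every value that gets printed.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t ticks_to_usecs(uint64_t ticks, uint64_t tick_freq_mhz)
{
        /* elapsed ticks divided by ticks per microsecond */
        return ticks / tick_freq_mhz;
}

int main(void)
{
        uint64_t raw = 71097778000ULL; /* hypothetical raw tick count */

        /* correct frequency (say 1000 MHz): ~71.1 s */
        printf("%llu us\n", (unsigned long long)ticks_to_usecs(raw, 1000));
        /* frequency misdetected as 2000 MHz: ~35.5 s, half the real time */
        printf("%llu us\n", (unsigned long long)ticks_to_usecs(raw, 2000));
        return 0;
}
```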
I think most of the time is spent in RAM initialization.
1. Do board owners with similar amount of memory (independent of the board) have similar numbers?
2. What are the ways to improve that? Is it possible? For example, can the modules be probed in parallel (if that isn’t done already)?
Thanks,
Paul
[1] https://review.coreboot.org/cgit/board-status.git/commit/?id=4b4b7ab5865b15a...
On 03/02/2017 01:18 PM, Paul Menzel via coreboot wrote:
[…]
I think most of the time is spent in RAM initialization.
- Do board owners with similar amount of memory (independent of the board) have similar numbers?
- What are the ways to improve that? Is it possible? For example, can the modules be probed in parallel (if that isn’t done already)?
The issue isn't probing; the delays come both from DRAM training (DDR3 training is quite complex and involves repeatedly streaming pseudorandom data to and from the modules at full speed) and from the mandatory clearing of the ECC check bits.
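To illustrate why the ECC clear is expensive: the check bits are only valid after every protected location has been written once, so all 128 GB must be streamed through the memory controller before use. A minimal sketch of the idea follows, written as a plain CPU loop; real AMD northbridges typically offload this to a hardware memory-clear engine, but the bandwidth cost is the same. At an assumed ~10 GB/s of effective write bandwidth, 128 GB alone accounts for on the order of ten seconds.

```
/*
 * Illustrative only: every location is written once so the controller
 * stores valid ECC check bits. Real hardware uses the northbridge's
 * memory-clear engine rather than a CPU loop like this one.
 */
#include <stdint.h>
#include <stddef.h>

static void ecc_init_range(volatile uint64_t *base, size_t bytes)
{
        for (size_t i = 0; i < bytes / sizeof(uint64_t); i++)
                base[i] = 0; /* each write also stores fresh check bits */
}
```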
The only way to feasibly decrease boot time would be to run the DRAM training for each CPU package (and possibly for each memory controller, though I don't think that's a good idea) in parallel. This, in turn, ties into previous discussions on whether coreboot, and in particular coreboot's romstage, should even attempt to provide a multi-tasking environment; i.e., does the added complexity provide a significant enough benefit to justify the maintenance overhead?
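A pseudocode sketch of what per-package parallel training could look like; `run_on_node()`, `wait_for_node()` and `train_dram_for_node()` are hypothetical, since romstage currently provides no such dispatch mechanism, which is exactly the complexity trade-off being raised.

```
/* Hypothetical primitives a romstage scheduler would have to provide. */
void run_on_node(int node, void (*fn)(int node));
void wait_for_node(int node);
void train_dram_for_node(int node);

/* Sketch: each package trains its own node's DRAM concurrently,
 * instead of the BSP doing all nodes serially. */
void raminit_all_nodes(int node_count)
{
        for (int node = 0; node < node_count; node++)
                run_on_node(node, train_dram_for_node);

        for (int node = 0; node < node_count; node++)
                wait_for_node(node);
}
```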
--
Timothy Pearson
Raptor Engineering
+1 (415) 727-8645 (direct line)
+1 (512) 690-0200 (switchboard)
https://www.raptorengineering.com
Paul Menzel via coreboot <coreboot@coreboot.org> writes:
I think most of the time is spent in RAM initialization.
- Do board owners with similar amount of memory (independent of the board) have similar numbers?
- What are the ways to improve that? Is it possible? For example, can the modules be probed in parallel (if that isn’t done already)?
I'm not the right person to answer this since I don't know this code/hardware that well, but on modern Intel hardware the native code uses the MRC cache to store DRAM training results and restore them on subsequent boots (and on resume from suspend) if no change in the DIMM configuration is detected.
Maybe something like this could also be applied here (or maybe it is already the case, since the code includes support for accessing SPI flash)?
Kind regards,
Arthur
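For reference, a sketch of the scheme Arthur describes, with invented names (this is neither Intel's MRC cache code nor the amdmct implementation): key the saved training results to the DIMM population, for instance a hash of all SPDs, and reuse them only when the key still matches.

```
#include <stdint.h>
#include <stdbool.h>

#define TRAINING_RESULTS_SIZE 1024 /* illustrative size */

struct training_cache {
        uint32_t spd_hash; /* identifies the DIMM population */
        uint8_t results[TRAINING_RESULTS_SIZE];
};

/* Hypothetical helper that reprograms the controller registers. */
void apply_training_results(const uint8_t *results);

bool try_restore_training(const struct training_cache *cached,
                          uint32_t current_spd_hash)
{
        if (cached->spd_hash != current_spd_hash)
                return false; /* DIMMs changed: fall back to full training */
        apply_training_results(cached->results);
        return true;
}
```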
On 03/02/2017 01:30 PM, Arthur Heymans wrote:
Paul Menzel via coreboot <coreboot@coreboot.org> writes:
[…]
I'm not the right person to answer this since I don't know this code/hardware that well, but on modern Intel hardware the native code uses the MRC cache to store DRAM training results and restore them on subsequent boots (and on resume from suspend) if no change in the DIMM configuration is detected.
Maybe something like this could also be applied here (or maybe it is already the case, since the code includes support for accessing SPI flash)?
Yes, this is already implemented as an option, and it does a fairly decent job of reducing training overhead to almost nothing, but the ECC clear overhead remains. Ideally both training and the ECC clear would be parallelized, but, as noted before, romstage is a very limited environment, and I'm not sure the cost/benefit ratio is there to implement this feature right now. I'd feel somewhat more confident if there were broader support for parallel tasking in coreboot in general, instead of having to create a northbridge-specific system like the old K8 raminit.
--
Timothy Pearson
Raptor Engineering
+1 (415) 727-8645 (direct line)
+1 (512) 690-0200 (switchboard)
https://www.raptorengineering.com
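Putting Timothy's two points together in one hedged sketch, building on the hypothetical helpers above (`train_all_nodes()` and `ecc_clear_all_memory()` are likewise invented): even when the cached training data is restored successfully, the ECC clear still runs, so the restore path alone cannot remove that part of the delay.

```
#include <stdint.h>
#include <stdbool.h>

struct training_cache; /* from the earlier sketch */
bool try_restore_training(const struct training_cache *cached,
                          uint32_t current_spd_hash);
void train_all_nodes(void);      /* hypothetical: full DDR3 training */
void ecc_clear_all_memory(void); /* hypothetical: mandatory ECC init */

void raminit(const struct training_cache *cached, uint32_t spd_hash)
{
        if (!try_restore_training(cached, spd_hash))
                train_all_nodes(); /* slow path: full training */

        ecc_clear_all_memory();    /* runs on both paths */
}
```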
Dear Arthur, dear Timothy,
On Thursday, 2017-03-02 at 13:38 -0600, Timothy Pearson wrote:
[…]
Yes, this is already implemented as an option, and it does a fairly decent job of reducing training overhead to almost nothing,
Interesting. What option is that?
Also, besides the file `s3nv` I don’t see anything else in CBFS. Where is the training data cached?
[…]
Thanks,
Paul
On 03/02/2017 02:17 PM, Paul Menzel wrote:
[…]
Interesting. What option is that?
Also, besides the file `s3nv` I don’t see anything else in CBFS. Where is the training data cached?
That's it. The cache is mandatory for S3 resume, and optional at boot.
That being said, the pathways are present, but they are deactivated due to historical instability and are not tied to an NVRAM variable at this time. If you want to test, you'll need to edit the source file `src/northbridge/amd/amdmct/mct_ddr3/mct_d.c`; near line 2730 you'll see a FIXME comment and this line:
```
allow_config_restore = 0;
```
Comment that line out and recompile to test. I strongly suggest running memtest across multiple warm and cold boots (and reboots) before determining the functionality is stable enough for use.
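Concretely, the suggested experiment amounts to this one-line change (shown as a sketch of the diff; the surrounding FIXME context is omitted):

```
-        allow_config_restore = 0;
+        /* allow_config_restore = 0; */
```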
--
Timothy Pearson
Raptor Engineering
+1 (415) 727-8645 (direct line)
+1 (512) 690-0200 (switchboard)
https://www.raptorengineering.com