Hi all,
I'm working on a fam10 tree for supermicro h8dmr. I'm using CBFS.
It boots, but I'm struggling with some extreme slowness during boot. In particular, the memset function in src/lib/memset.c takes *minutes* to clear 1.2MB of RAM. A little further on, CBFS does a memcpy which takes another 20 or 30 seconds:
Stage: load fallback/coreboot_ram @ 2097152/1245184 bytes, enter @ 200000
....LOOOOOONG pause....
Stage: after memset
on-stack variables at 00ffbec8 and 00ffbed4
cbfs_decompress: algo: 0
cbfs_decompress: uncompressed
....another lengthy pause....
cbfs_decompress: memcpy from 0xffbecc to 0xffbed0 for 0x2d304 bytes done
Stage: done loading.
The first, lengthy pause is new; it is apparently caused by something introduced between r4368 and r4440.
The second pause was there already in r4368.
I understand this may have something to do with MTRRs - looking at the logs it seems MTRRs are not set up until well after CBFS has dealt with coreboot_ram.
This box has 32GB of ram, in case that makes a difference.
Any suggestions?
Thanks, Ward.
Following up on this - Patrick helped in IRC this evening, and we came to the conclusion that it's probably *not* an MTRR issue, since we figured out the code seems to set MTRRs properly.
We found that out by adding an extra MTRR over the flash chip, which did not change anything.
The system boots fairly normally after the slowdowns, and appears to work normally. It sets three MTRRs further in the bootup process:
reg00: base=0x00000000 ( 0MB), size=32768MB: write-back, count=1
reg01: base=0x800000000 (32768MB), size= 512MB: write-back, count=1
reg02: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1
Any thoughts on something else I should look at to debug this?
Thanks, Ward.
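As a sanity check on the three MTRRs listed above, here is a minimal sketch of the arithmetic. It is not coreboot code: the struct, the table and the function names are made up for illustration, and it assumes an uncachable range simply overrides any write-back range it overlaps. Under that assumption the write-back coverage works out to 32768MB, which matches the 32GB installed (presumably the 512MB hole below 4GB is what reg01 makes up for above 32GB).

```c
#include <stdint.h>

/* Hypothetical description of one variable MTRR, in MB for readability. */
struct mtrr_range {
	uint64_t base_mb;
	uint64_t size_mb;
	int write_back;		/* 1 = write-back, 0 = uncachable */
};

/* Values copied from the reg00..reg02 listing above. */
static const struct mtrr_range reported_ranges[] = {
	{     0, 32768, 1 },	/* reg00 */
	{ 32768,   512, 1 },	/* reg01 */
	{  3584,   512, 0 },	/* reg02 */
};

/* Size in MB of the overlap between two ranges, 0 if they are disjoint. */
static uint64_t overlap_mb(const struct mtrr_range *a,
			   const struct mtrr_range *b)
{
	uint64_t start = a->base_mb > b->base_mb ? a->base_mb : b->base_mb;
	uint64_t end_a = a->base_mb + a->size_mb;
	uint64_t end_b = b->base_mb + b->size_mb;
	uint64_t end = end_a < end_b ? end_a : end_b;

	return end > start ? end - start : 0;
}

/*
 * Total MB that end up write-back once the uncachable ranges are
 * subtracted. Assumes write-back ranges do not overlap each other and
 * uncachable ranges do not overlap each other.
 */
uint64_t effective_wb_mb(const struct mtrr_range *r, int n)
{
	uint64_t total = 0;

	for (int i = 0; i < n; i++) {
		if (!r[i].write_back)
			continue;
		total += r[i].size_mb;
		for (int j = 0; j < n; j++)
			if (!r[j].write_back)
				total -= overlap_mb(&r[i], &r[j]);
	}
	return total;
}
```

Calling effective_wb_mb(reported_ranges, 3) gives the write-back total for the listing. This only checks that the final layout is plausible; it says nothing about what the MTRRs look like at the point where CBFS loads coreboot_ram.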
On Sun, Jul 19, 2009 at 09:23:21PM -0400, Ward Vandewege wrote:
Following up on this - Patrick helped in IRC this evening, and we came to the conclusion that it's probably *not* an MTRR issue, since we figured out the code seems to set MTRRs properly.
I wonder what else could cause it to be so slow? It's especially surprising for the memset, which is pretty simple. Does it use movnti for that?
We found out after adding an extra MTRR over the flash chip, which did not change anything.
Did you disable and re-enable the cache so that the settings take effect?
I guess I would:
1. Add some little benchmark loops reading/writing different areas
   a. read ROM & time it
   b. read from RAM (cached area) and time it
   c. read from RAM (non-cached area)
   d. write to RAM (cached area)
   ...
2. disable MTRRs to see if it would go even slower.
Sorry that's not much help, but I don't have a fam10 box to try things on.
Thanks, Myles
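The benchmark loops suggested above could look roughly like the sketch below. It is plain hosted C using clock() so it can be tried anywhere; the function names are invented for this example and none of it is existing coreboot code. In the firmware itself the timing would more likely come from the TSC and the result would go to the console.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/*
 * Sum every byte of a region. The pointer is volatile so each read
 * really happens, and the sum is returned so it can be checked.
 */
uint32_t read_region(const volatile uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += p[i];
	return sum;
}

/* Store the same byte to every location in a region. */
void write_region(volatile uint8_t *p, size_t len, uint8_t value)
{
	for (size_t i = 0; i < len; i++)
		p[i] = value;
}

/* Seconds of CPU time spent reading the region once. */
double time_read(const volatile uint8_t *p, size_t len)
{
	clock_t start = clock();

	(void)read_region(p, len);
	return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Seconds of CPU time spent writing the region once. */
double time_write(volatile uint8_t *p, size_t len, uint8_t value)
{
	clock_t start = clock();

	write_region(p, len, value);
	return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The idea would be to call time_read() on a chunk of ROM, on RAM that should be cached and on RAM that should not be, then time_write() on the RAM areas, and compare the numbers. If the "cached" and "non-cached" figures come out the same, that points at caching not being in effect for that range.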
On Tue, Jul 21, 2009 at 06:25:38AM -0600, Myles Watson wrote:
Following up on this - Patrick helped in IRC this evening, and we came to the conclusion that it's probably *not* an MTRR issue, since we figured out the code seems to set MTRRs properly.
I wonder what else could cause it to be so slow? It's especially surprising for the memset, which is pretty simple. Does it use movnti for that?
It's actually just a plain byte-by-byte assignment in C, see src/lib/memset.c.
We found out after adding an extra MTRR over the flash chip, which did not change anything.
Did you disable and re-enable the cache so that the settings take effect?
Hmm, we tried adding it here
src/cpu/amd/car/clear_init_ram.c
in function set_init_ram_access, which already sets an mtrr.
This gets called just before CAR is disabled I think.
And then we found the mtrr set in
src/cpu/amd/car/cache_as_ram.inc
which looks like it *should* do the right thing. But that's assembler of course. I don't suppose there's a way to print debug info from right there?
I guess I would:
- Add some little benchmark loops reading/writing different areas
  a. read ROM & time it
  b. read from RAM (cached area) and time it
  c. read from RAM (non-cached area)
  d. write to RAM (cached area)
  ...
- disable MTRRs to see if it would go even slower.
Sorry that's not much help, but I don't have a fam10 box to try things on.
Thanks - will see if I can try some of these things.
Thanks, Ward.
On Tue, Jul 21, 2009 at 06:25:38AM -0600, Myles Watson wrote:
Following up on this - Patrick helped in IRC this evening, and we came to the conclusion that it's probably *not* an MTRR issue, since we figured out the code seems to set MTRRs properly.
I wonder what else could cause it to be so slow? It's especially surprising for the memset, which is pretty simple. Does it use movnti for that?
It's actually just a plain byte-by-byte assignment in C, see src/lib/memset.c.
It would be interesting to see whether it gets 4x faster if you make it 4 bytes at a time.
We found out after adding an extra MTRR over the flash chip, which did not change anything.
Did you disable and re-enable the cache so that the settings take effect?
Hmm, we tried adding it here
src/cpu/amd/car/clear_init_ram.c
in function set_init_ram_access, which already sets an mtrr.
I always wondered about that one.
The thing that makes it hard to debug is that it will read back correctly even if it hasn't taken effect.
Thanks - will see if I can try some of these things.
Good luck, Myles
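For the "4 bytes at a time" experiment, a minimal sketch might look like this. memset_bytes is written in the spirit of the plain byte loop described above, not copied from src/lib/memset.c, and memset_words is a hypothetical replacement. Both names are made up here. Whether the word version is anywhere near 4x faster would depend on whether the destination is actually cached.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time fill: one store per byte. */
void *memset_bytes(void *s, int c, size_t n)
{
	unsigned char *p = s;

	while (n--)
		*p++ = (unsigned char)c;
	return s;
}

/*
 * Fill using aligned 32-bit stores where possible. Assumes the
 * destination may legitimately be accessed as uint32_t, which holds
 * for raw RAM in firmware but is an assumption of this sketch.
 */
void *memset_words(void *s, int c, size_t n)
{
	unsigned char *p = s;
	/* Replicate the fill byte into all four bytes of a word. */
	uint32_t word = 0x01010101u * (unsigned char)c;

	/* Leading bytes until the pointer is 4-byte aligned. */
	while (n && ((uintptr_t)p & 3)) {
		*p++ = (unsigned char)c;
		n--;
	}
	/* Bulk of the region, one 32-bit store per four bytes. */
	while (n >= 4) {
		*(uint32_t *)p = word;
		p += 4;
		n -= 4;
	}
	/* Trailing bytes that do not fill a whole word. */
	while (n--)
		*p++ = (unsigned char)c;
	return s;
}
```

Timing both versions over the same 1.2MB region would show whether the cost is per store (the word version should then be clearly faster) or dominated by something else entirely.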
On Tue, Jul 21, 2009 at 08:23:22AM -0600, Myles Watson wrote:
Good luck,
So - you're not going to believe this. Compiler issue. I was compiling with gcc (Ubuntu 4.3.3-5ubuntu4) 4.3.3 on 32 bit.
I noticed that about one in every 10 burn/boot cycles or so, the slowness would not be there.
So I switched back to gcc-3.4 (GCC) 3.4.6 (Ubuntu 3.4.6-8ubuntu2) on 32 bit, and it's gone altogether, every time.
Is anyone else using gcc 4.3 (32 bit) to compile coreboot?
Thanks! Ward.