Eric,
I tried calling cache_on() before cpufixup() in cpu.c to fix the "slow ecc clear" problem. It still works that way. Is there any reason we can't do this?
Ollie
Li-Ta Lo ollie@lanl.gov writes:
Eric,
I tried calling cache_on() before cpufixup() in cpu.c to fix the "slow ecc clear" problem. It still works that way. Is there any reason we can't do this?
Hmm. I don't see this. I have the cache on and things clear quite quickly. I may have a slightly different calling order than the standard tree.
Ollie, can you compare what is checked into the tree with the last release I did for Lightning? It works there...
Eric
On Fri, 2004-03-26 at 01:06, Eric W. Biederman wrote:
Hmm. I don't see this. I have the cache on and things clear quite quickly. I may have a slightly different calling order than the standard tree.
Ollie, can you compare what is checked into the tree with the last release I did for Lightning? It works there...
In LNXI tree it's like:
    /* Turn on caching if we haven't already */
    cache_on(mem);

    display_cpuid();
    mtrr_check();
#if 1
    /* some cpus need a fixup done. This is the hook for doing that. */
    cpufixup(mem);
#endif
and in the CVS tree it's:
    /* some cpus need a fixup done. This is the hook for doing that. */
    cpufixup(mem);

    /* Turn on caching if we haven't already */
    cache_on(mem);

    display_cpuid();
    mtrr_check();
Ollie
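The difference between the two trees is only the call order. A minimal sketch of that difference follows, assuming (as the thread suggests, but does not state outright) that the slow ECC clear is reached through cpufixup(). The `_stub` functions and the two `*_order` helpers are hypothetical stand-ins, not the real LinuxBIOS routines, and they model nothing but whether caching was already on when the fixup ran.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the real routines; only the call order
 * is modeled here, not what the routines actually do. */
static bool cache_enabled = false;
static bool fixup_ran_with_cache = false;

static void cache_on_stub(void)
{
	cache_enabled = true;
}

static void cpufixup_stub(void)
{
	/* Record whether caching was already on when the fixup ran. */
	fixup_ran_with_cache = cache_enabled;
}

/* Order reported for the LNXI tree: cache first, then the fixup. */
bool lnxi_order(void)
{
	cache_enabled = false;
	cache_on_stub();
	cpufixup_stub();
	return fixup_ran_with_cache;
}

/* Order reported for the CVS tree: fixup first, then the cache. */
bool cvs_order(void)
{
	cache_enabled = false;
	cpufixup_stub();
	cache_on_stub();
	return fixup_ran_with_cache;
}
```

In this model only the LNXI order runs the fixup with caching enabled, which would be consistent with the clear being fast there and slow in the CVS tree.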
On Fri, 2004-03-26 at 01:06, Eric W. Biederman wrote:
Hmm. I don't see this. I have the cache on and things clear quite quickly. I may have a slightly different calling order than the standard tree.
Ollie, can you compare what is checked into the tree with the last release I did for Lightning? It works there...
BTW, why are you using inline asm to clear the memory? Isn't it just one line in C to clear the memory? Why not use a string instruction instead of mov?
Ollie
Eric
Li-Ta Lo ollie@lanl.gov writes:
BTW, why are you using inline asm to clear the memory?
Mostly it was just a copy from the romcc code. And I have a little distrust of compilers when I want a tight loop.
Isn't it just one line in C to clear the memory? Why not use a string instruction instead of mov?
A string mov instruction is slower.
Eric
On Fri, 2004-03-26 at 17:46, Eric W. Biederman wrote:
Li-Ta Lo ollie@lanl.gov writes:
BTW, why are you using inline asm to clear the memory?
Mostly it was just a copy from the romcc code. And I have a little distrust of compilers when I want a tight loop.
Isn't it just one line in C to clear the memory? Why not use a string instruction instead of mov?
A string mov instruction is slower.
I just tried with a stosl version. It is not slower at all. We are using stosl in the asm code to clear stack and bss, so why not use it here too?
Ollie
Li-Ta Lo ollie@lanl.gov writes:
On Fri, 2004-03-26 at 17:46, Eric W. Biederman wrote:
Li-Ta Lo ollie@lanl.gov writes:
Isn't it just one line in C to clear the memory? Why not use a string instruction instead of mov?
A string mov instruction is slower.
I just tried with a stosl version. It is not slower at all.
Usually stosl is coded in microcode... I much prefer to work with real instructions. Since memory bandwidth is the bottleneck, it should not matter much. stosl actually runs faster with the cache disabled (because it needs fewer instruction fetches), which is another good reason not to use it.
stosl is also not conducive to breaking the loop up or doing other odd things if something goes wrong. A mov-based loop is more straightforward to modify if it comes to that.
A slightly better question is why I am not using a loop built on the sse registers, which would probably be the ideal case. I don't have a good answer to that, but I seem to recall some documented errata. Plus, we are memory bandwidth limited, not instruction dispatch limited, so the exact loop does not matter a lot.
Given the memory bandwidth limitation, we would probably not see a performance penalty if we coded the loop in C. However, this piece of code is the dominant speed factor when booting, so I don't want to take chances.
We are using stosl in the asm code to clear stack and bss, so why not use it here too?
20/30K versus gigabytes. Stack and bss clearing simply do not matter speed-wise.
Eric
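For reference, here is a minimal sketch of what the two C alternatives under discussion might look like: the "one line in C" and a store-only word loop in the spirit of the mov-based loop. The function names are hypothetical and this is not the code in the tree. It only shows the shape of the loop; what instructions the compiler actually emits, and how fast they run, would have to be checked on the target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The "one line in C": let the library pick the instructions. */
static inline void clear_with_memset(void *start, size_t bytes)
{
	memset(start, 0, bytes);
}

/* A store-only word loop, closer in spirit to a mov-based loop.
 * The volatile pointer keeps the compiler from merging or dropping
 * the stores, and the C code never loads from the destination.
 * Assumes `start` is suitably aligned and `bytes` is a multiple of
 * sizeof(uint32_t); any remainder is left untouched. */
static inline void clear_with_word_loop(void *start, size_t bytes)
{
	volatile uint32_t *p = start;
	size_t words = bytes / sizeof(uint32_t);
	size_t i;

	for (i = 0; i < words; i++)
		p[i] = 0;
}
```

Keeping the loop in its own small function also leaves one obvious place to break the loop up or swap in a different instruction sequence later.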
On Fri, 26 Mar 2004, Li-Ta Lo wrote:
I just tried with a stosl version. It is not slower at all. We are using stosl in the asm code to clear stack and bss, so why not use it here too?
Pure performance is not the only goal. Readability is a goal too. This code is in a C function and it is not essential that it be assembly, so it should become C.
ron
On 26 Mar 2004, Eric W. Biederman wrote:
Mostly it was just a copy from the romcc code. And I have a little distrust of compilers when I want a tight loop.
For clarity, and to keep our other 3rd parties happy, that's going to become C. Hope nobody minds.
ron
ron minnich rminnich@lanl.gov writes:
On 26 Mar 2004, Eric W. Biederman wrote:
Mostly it was just a copy from the romcc code. And I have a little distrust of compilers when I want a tight loop.
For clarity, and to keep our other 3rd parties happy, that's going to become C. Hope nobody minds.
Fine, I think.
But please break it out into its own separate inline function.
It has two very strong requirements:
1) That we never trigger a hardware read on the addresses we are clearing before we have triggered a hardware write. The fact that I set up the area as uncached but write-combining ensures this.
2) That this code runs very fast. It needs to be able to run at 6.4GB/s when you have dual channel PC3200 DDR installed. This is one of the reasons we run it on a per cpu basis. The loop can only run at 4.0GB/s from the other cpu.
When we don't need the fine grained control of assembly to meet these requirements, I don't have a problem with writing the code in C. When we do need the control, we must be able to write it.
In part, booting faster means booting more correctly, because the hardware transitions from some half set up state to completely set up more quickly.
Eric
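The 6.4GB/s figure follows from the memory type: one PC3200 channel peaks at 3200 MB/s, so two channels give 6400 MB/s. A small sketch of that arithmetic is below. The helper names are hypothetical, and the numbers are theoretical peaks, not measurements.

```c
/* Peak bandwidth of one PC3200 (DDR-400) channel in MB/s:
 * 400 million transfers per second * 8 bytes per transfer. */
#define PC3200_MB_PER_SEC 3200.0

/* Theoretical peak bandwidth in MB/s for a number of PC3200 channels. */
double peak_mb_per_sec(int channels)
{
	return channels * PC3200_MB_PER_SEC;
}

/* Seconds needed to write `megabytes` at a rate of `mb_per_sec`. */
double clear_seconds(double megabytes, double mb_per_sec)
{
	return megabytes / mb_per_sec;
}
```

As an illustration with an assumed 4000 MB of memory, clear_seconds(4000.0, peak_mb_per_sec(2)) is 0.625 seconds at the local 6.4GB/s rate, against 1.0 second at the 4.0GB/s available from the other cpu. The 20/30K of stack and bss, by comparison, takes on the order of microseconds at either rate.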
On 28 Mar 2004, Eric W. Biederman wrote:
But please break it out into its own separate inline function.
It has two very strong requirements.
- That we never trigger a hardware read on the addresses we are clearing before we have triggered a hardware write. The fact that I set up the area as uncached but write-combining ensures this.
- That this code runs very fast. It needs to be able to run at 6.4GB/s when you have dual channel PC3200 DDR installed. This is one of the reasons we run it on a per cpu basis. The loop can only run at 4.0GB/s from the other cpu.
ok, if it stays assembly this explanation will be put in with it.
ron