Hi!
The BKDG rev. 3.08 for AMD Family 0Fh states that it is possible to use a CAR area with a size of 64K in section 13.16 "Cache Initialization For General Storage During Boot". It also says that during DRAM training CAR size must be reduced. For DDR training, 256 cache lines with L1 cache tag indexes 00h-FFh are reserved and must not be used as CAR. The text then refers to the AMD64 Arch Programmers Manual Vol. 2 for more details on L1 function. However, I couldn't find any explanation why L1 cache tag indexes 00h-FFh correspond to address space C0000h-C3FFFh when fixed size MTRRs are active.
Explanations would be very welcome.
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
Hi!
The BKDG rev. 3.08 for AMD Family 0Fh states that it is possible to use a CAR area with a size of 64K in section 13.16 "Cache Initialization For General Storage During Boot". It also says that during DRAM training CAR size must be reduced. For DDR training, 256 cache lines with L1 cache tag indexes 00h-FFh are reserved and must not be used as CAR. The text then refers to the AMD64 Arch Programmers Manual Vol. 2 for more details on L1 function. However, I couldn't find any explanation why L1 cache tag indexes 00h-FFh correspond to address space C0000h-C3FFFh when fixed size MTRRs are active.
I may be misunderstanding your question, but I don't think that tag indexes 00h-FFh have to correspond to C0000h-C3FFFh. I'm also not positive that they must be tag indexes 00h-FFh. I think they could be at the end as long as the tags are contiguous. This comment refers to DDR training needing the space to hold test patterns for DQS eye finding during memory training. See northbridge\amd\amdk8\raminit_f_dqs.c TrainDQSRdWrPos(). For coreboot, it looks like the test patterns are just pushed onto the stack. For AMD BIOS code, this is not the case and they are put into the cache at a set location. (I think this is easier for the AGESA asm code to handle.)
Marc
Marc, I have been trying (and failing) to find a nice writeup on what training is and how it works. Have you seen one? It would be nice to have a writeup for people to see.
thanks
ron
The BKDG is the only documentation. We are looking into other ways to document this because it is so complicated but I don't expect anything in the near future. Marc
ron minnich wrote:
Marc, I have been trying (and failing) to find a nice writeup on what training is and how it works. Have you seen one? It would be nice to have a writeup for people to see.
thanks
ron
On Jan 15, 2008 1:22 PM, Marc Jones <Marc.Jones@amd.com> wrote:
The BKDG is the only documentation. We are looking into other ways to document this because it is so complicated but I don't expect anything in the near future.
Training occurs in more places than just memory. For example, on some supercomputers, the networks use training very similar to DDR training. I was trying to explain what training is to people, but I could not, because my own understanding is so limited.
It's a shame that something so important is not documented in more places. I just can't find anything. Note that I'm NOT picking on AMD -- they have more docs than anyone!
thanks
ron
On 15.01.2008 18:54, Marc Jones wrote:
Carl-Daniel Hailfinger wrote:
The BKDG rev. 3.08 for AMD Family 0Fh states that it is possible to use a CAR area with a size of 64K in section 13.16 "Cache Initialization For General Storage During Boot". It also says that during DRAM training CAR size must be reduced. For DDR training, 256 cache lines with L1 cache tag indexes 00h-FFh are reserved and must not be used as CAR. The text then refers to the AMD64 Arch Programmers Manual Vol. 2 for more details on L1 function. However, I couldn't find any explanation why L1 cache tag indexes 00h-FFh correspond to address space C0000h-C3FFFh when fixed size MTRRs are active.
I may be misunderstanding your question but I don't think that tag indexes 00h-ffh have to correspond to C0000h-C3FFFh. I'm also not positive that they must be tag indexes 00h-ffh. I think that they could be on the end as long as the tags are contiguous.
Good to know. Can you make sure such a sentence gets added to the BKDG in its various versions?
This comment refers to DDR training needing the space to hold test patterns for DQS eye finding during memory training. See northbridge\amd\amdk8\raminit_f_dqs.c TrainDQSRdWrPos().
Thanks. It seems I have to reread the code a few times to fully understand its structure. But I have spotted something peculiar in the code of TrainDQSRdWrPos() in src/northbridge/amd/amdk8/raminit_f_dqs.c
	Errors = 0;
	channel = 0;
	while( (channel<2) && (!Errors)) {
		print_debug_dqs("\tTrainDQSRdWrPos: 1 channel ",channel, 1);
		for(DQSWrDelay = 0; DQSWrDelay < 48; DQSWrDelay++) {
			unsigned err;
			SetDQSDelayAllCSR(ctrl, channel, DQS_WRITEDIR, DQSWrDelay);
			print_debug_dqs("\t\tTrainDQSRdWrPos: 21 DQSWrDelay ", DQSWrDelay, 2);
			err= TrainReadDQS(ctrl, channel, pattern, buf_a, dqs_delay_a, sysinfo);
			print_debug_dqs("\t\tTrainDQSRdWrPos: 22 err ",err, 2);
			if(err == 0) break;
-------------> Now we set "Errors"
			Errors |= err;
		}
		print_debug_dqs("\tTrainDQSRdWrPos: 3 DQSWrDelay ", DQSWrDelay, 1);
		if(DQSWrDelay < 48) {
-------------> Now we overwrite "Errors" in case the for loop above ever had err == 0.
			Errors = TrainWriteDQS(ctrl, channel, pattern, buf_a, dqs_delay_a, sysinfo);
			print_debug_dqs("\tTrainDQSRdWrPos: 4 Errors ", Errors, 1);
		}
		channel++;
		if(!is_Width128){
			//FIXME: 64MuxMode??
			channel++; // skip channel if 64-bit mode
		}
	}
As I understand the logic of the snippet above, we look for a DQSWrDelay which does not give any errors with TrainReadDQS. Then we don't care about errors for other values of DQSWrDelay and use the current value of DQSWrDelay to run TrainWriteDQS. If TrainReadDQS failed for all values of DQSWrDelay, we return the bitwise OR of all error conditions we had for all values of DQSWrDelay. Does that really make sense?
For coreboot, it looks like the test patterns are just pushed onto the stack.
Indeed. So we are completely free to place CAR anywhere we want with any size we want (subject to L2 size restrictions).
For AMD BIOS code, this is not the case and they are put into the cache at a set location. (I think that this is easier for the AGESA asm code to handle that way).
I see.
Thanks for pointing me to the code. I shall add good comments to that code snippet once I have more time.
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
As I understand the logic of the snippet above, we look for a DQSWrDelay which does not give any errors with TrainReadDQS. Then we don't care about errors for other values of DQSWrDelay and use the current value of DQSWrDelay to run TrainWriteDQS. If TrainReadDQS failed for all values of DQSWrDelay, we return the bitwise OR of all error conditions we had for all values of DQSWrDelay. Does that really make sense?
Any bit set means fail. Caller checks !=0. I think it is fine. I guess it could be translated to a pass/fail. If there is no passing case, the reason doesn't really matter. The real errors are reported in TrainDQSPos(). Does that answer your question?
Marc
On 16.01.2008 21:58, Marc Jones wrote:
Carl-Daniel Hailfinger wrote:
As I understand the logic of the snippet above, we look for a DQSWrDelay which does not give any errors with TrainReadDQS. Then we don't care about errors for other values of DQSWrDelay and use the current value of DQSWrDelay to run TrainWriteDQS. If TrainReadDQS failed for all values of DQSWrDelay, we return the bitwise OR of all error conditions we had for all values of DQSWrDelay. Does that really make sense?
Any bit set means fail. Caller checks !=0. I think it is fine. I guess it could be translated to a pass/fail. If there is no passing case, the reason doesn't really matter. The real errors are reported in TrainDQSPos(). Does that answer your question?
Yes, that indeed does explain the code. Thanks!
Regards, Carl-Daniel
Hi,
I can only explain in general which addresses are "forbidden".
L1 cache is 64KB 2-way associative with 64B block size.
This means 6 bits for data (to address the 64Bytes) and for the cache index we have 65536 / ( 2 * 64) = 512 rows so we have 9 bits for this, rest is TAG. So the 32 bit memory address would look like this when cut:
bits 31..15: TAG | bits 14..6: INDEX | bits 5..0: block addr.
So the "forbidden" addresses are all those that hit the first 256 cache lines: the top index bit (bit 14) is 0, while bits 0 to 13 (the block address plus the low 8 index bits) can take any value.
(X means four don't-care bits, x means one don't-care bit)
xxxx x000 0000 00xx xxxx to xxxx x011 1111 11xx xxxx
It turns out that anything in the address range XXXX0000 - XXXX3FFF will hit the first 256 cache lines (and, since bit 15 already belongs to the tag, so will XXXX8000 - XXXXBFFF). Marc suggests that this is a SW issue and not a HW issue, so perhaps we can ignore this ;) but anyway, it is always good to explain why these addresses are affected. I hope there is no mistake in my calculations.
Thanks,
Rudolf
On 15.01.2008 20:36, Rudolf Marek wrote:
I can only explain in general which addresses are "forbidden".
L1 cache is 64KB 2-way associative with 64B block size.
This means 6 bits for data (to address the 64Bytes) and for the cache index we have 65536 / ( 2 * 64) = 512 rows so we have 9 bits for this, rest is TAG. So the 32 bit memory address would look like this when cut:
bits 31..15: TAG | bits 14..6: INDEX | bits 5..0: block addr.
So the "forbidden" addresses are all those that hit the first 256 cache lines: the top index bit (bit 14) is 0, while bits 0 to 13 (the block address plus the low 8 index bits) can take any value.
(X means four don't-care bits, x means one don't-care bit)
xxxx x000 0000 00xx xxxx to xxxx x011 1111 11xx xxxx
It turns out that anything in the address range XXXX0000 - XXXX3FFF will hit the first 256 cache lines (and, since bit 15 already belongs to the tag, so will XXXX8000 - XXXXBFFF). Marc suggests that this is a SW issue and not a HW issue, so perhaps we can ignore this ;) but anyway, it is always good to explain why these addresses are affected. I hope there is no mistake in my calculations.
Great explanation, thanks! Something like this should appear in the BKDG.
Regards, Carl-Daniel