-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Hi all,
I looked a bit into how it works. It may help to read http://en.wikipedia.org/wiki/CPU_cache first and then continue here :)
I was particularly curious because we do a writeback-to-writeback (WB-WB) copy of data from CAR to RAM (to copy the stack and sysinfo, which must cause L1 evictions), we also do DQS memory training (which writes to RAM during CAR), and we use the cache to cache the ROM too.
This means that not only L1 is used; we must be using L2 too. Here are some notes on why I think it works :)
Here is what I found:
The AMD L2 cache is exclusive, which means it only contains data evicted from the L1 caches. In other words, the same data is never in both caches. I could not find any info on whether this also holds for the Icache, that is, whether evicted icache lines get moved to L2 or not. They should, but it does not seem to happen during CAR.
L1 Data cache: Size: 64 KB, 2-way associative, lines per tag = 1, line size = 64 bytes.
L1 Instruction cache: Size: 64 KB, 2-way associative, lines per tag = 1, line size = 64 bytes.
L2 cache: Size: 1024 KB (512 KB/core), 16-way associative, lines per tag = 1, line size = 64 bytes.
Here is the basic math for calculating the cache organization:
Line size: tells how many bytes are stored in one cache line (this exploits the spatial locality of data). Here it is 64 bytes, so address bits 5:0 are used as the offset within the line.
Index: selects the cache line within each way; its width tells how many cache lines each way has.
The level of associativity tells how many addresses which compete for the same index can be stored in the cache simultaneously.
For L1 we have: 64*1024 / 64 / 2 = 512, which is the number of cache lines per way (the number of indexes). We have 2 "arrays" (associativity is 2), each with 512 lines of 64 bytes, for a total size of 64 KB. The index is therefore taken from address bits 14:6. The rest of the address is used as the tag (the tag, together with the cache line index, identifies the actual location of the data in memory). One can say that all addresses with the same bits 14:6 compete for the same index. The associativity is 2, so a contiguous 64 KB range (16 address bits) fills the whole cache.
For L2 the size here is 512 KB and the associativity is 16. Each way is 512 KB / 16 = 32 KB, and 32 KB / 64 bytes = 512 indexes (lines per way), so address bits 14:6 form the index. The rest is the tag.
The CAR idea on AMD is just to use the cache and never cause an eviction from the L2 cache to main memory (which is not functioning yet).
Steps:
0) enable the cache and WB MTRRs for any ranges
1) all lines are invalid; validate them with a dummy read exactly as big as the max L1 cache. For the instruction cache an instruction fetch is enough.
2) the dummy read region can now be used to store data. It is simply an arbitrary address range, 0-64KB max.
3) caching of ROM works too because:
a) The MTRR for the ROM is set (currently only for part of it). It could be WP type but we use WB; no harm here because we do not modify any code ;)
b) The L1 instruction cache is filled from the flash chip directly (remember L2 is an exclusive cache on AMD).
c) If the L1 instr cache is not evicted into L2, then on a cache miss the L1 line is simply invalidated and refilled from the flash ROM. I tried to check this using performance counters but there is no counter for this. This is the uninteresting case because it does not complicate anything.
d) If the L1 instr cache does get evicted into L2 (which I don't know if is true), then we can run into the following:
I) no L1 data cache lines were evicted into L2. Again not an interesting case because nothing goes wrong.
II) we have some L1 data cache evicted into L2. This really happens in our CAR!

    print_debug("Copying data from cache to RAM -- switching to use RAM as stack...");
    memcopy((void *)((CONFIG_RAMTOP)-CONFIG_DCACHE_RAM_SIZE), (void *)CONFIG_DCACHE_RAM_BASE, CONFIG_DCACHE_RAM_SIZE);

It happens here because we copy from the CAR region to RAM while CAR is still running. Both regions are WB, so we must evict some L1 cache lines for sure, and performance counters confirm this. You may say this is not an issue because RAM is running normally, but, for example, when we resume from S3 we cannot overwrite random memory with our CAR... I think these evictions so far happen only here, and things still work nicely. Here is why:
We have at most 64KB of dirty data. We can spread it into L2 nicely and still have a lot of free space, even on systems where we have a 128KB L2. In this case there are no evictions to system memory because the data can stay in L2.
Now let's go back: what if the CPU instruction cache gets evicted into L2? This could cause problems because L1 data cache data and random L1 instr cache code would be competing for space in L2.
I think it works here because dirty data is evicted with the lowest priority. I think that if all ways of an index are full, a way holding "clean" data is invalidated first. This saves the day for us because it guarantees that our L1 data will never fall out of the cache, unless we exceed the L2 cache size with dirty data.
So far we examined the ROM caching and the oversized L1 handling. But the memory training writes to not yet initialized RAM. How does it work here?
I checked, and the memory write uses an instruction which bypasses the caches. The read uses the cache, but it invalidates the cache line afterwards. Again, because we have at most L1-size of dirty data and L2 is big enough, it does not spoil the party and nothing gets evicted back to the non-functioning memory.
The last thing which worries me is speculative fills which can be done by the CPU. I think they are disabled because the bit for probe fills is 0. Fam 11h, which has better documentation for using L2 as general storage, needs some other bits toggled so that it does not do extra speculation. The Fam 10h documentation describes only L1 CAR, and older families also describe L1-only CAR. In our code we practically use L2 in all cases.
What we could do is program a performance counter for L2 writebacks to system memory at the beginning of CAR and, when disabling CAR, check that it is still zero. This will tell us if we did something nasty.
We could also avoid the WB-WB copy of the CAR area. I tried a WB-UC copy and then we have 0 evictions from L1, which is fine (I did some experiments in January, see the "AMD CAR questions" email).
Uhh, it's a long email, it took like an hour to write. Please tell me if you think that it really works this way.
Thanks, Rudolf