[coreboot] AMD CAR II

Rudolf Marek r.marek at assembler.cz
Fri May 7 20:10:53 CEST 2010


Hi all,

I examined a bit how this actually works. It may help to read
http://en.wikipedia.org/wiki/CPU_cache first and then continue here :)


I was particularly curious because we do a writeback-to-writeback copy of data
from CAR to RAM (to copy the stack and sysinfo, which must cause L1 evictions),
we do DQS memory training (which writes to RAM while CAR is running), and we use
the cache to cache the ROM too.

This means not only the L1 is used; we must be using the L2 too. Here are some
notes on why I think it still works :)

Here is what I found:

The AMD L2 cache is exclusive, meaning it only contains data evicted from the L1
caches; the same data is never in both caches at once. I could not find any info
on whether this also holds for the instruction cache, i.e. whether evicted icache
lines get moved into the L2. They should, but it does not seem to happen during CAR.

L1 Data cache:
	Size: 64KB	2-way associative.
	lines per tag=1	line size=64 bytes.
L1 Instruction cache:
	Size: 64KB	2-way associative.
	lines per tag=1	line size=64 bytes.

512KB/core

L2 cache:
	Size: 1024KB	16-way associative.
	lines per tag=1	line size=64 bytes.

Here is the basic math for calculating the cache organization:

Line size => tells how many bytes are stored in one cache line (this exploits the
spatial locality of data). Here it is 64 bytes, so address bits 5:0 select the byte
within a line.

Index: selects which set of the cache a line goes into; the number of sets tells
how many index bits are needed.

The level of associativity tells how many addresses competing for the same index
can be stored in the cache simultaneously.

For the L1 we have: 64*1024 / 64 / 2 = 512 sets (cache line indexes). There are 2
ways (assoc is 2), each holding 512 lines of 64 bytes (32KB per way), for a total
of 64KB. The index is therefore taken from address bits 14:6. The rest of the
address is used as the tag (the tag, together with the index, identifies the
actual location of the data in memory). One can say that all addresses sharing
bits 14:6 compete for the same set; with an associativity of 2, every contiguous
64KB (2^16 bytes) of address space fills the whole cache.

For the L2 here it is 512KB with an associativity of 16. Each way is 512KB / 16 =
32KB, and 32KB / 64 bytes = 512 sets, so again address bits 14:6 form the index
and the rest is the tag.
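
To make the bit slicing concrete, here is a small standalone C example (my own
illustration, not coreboot code) that splits an address into tag/index/offset for
the geometries above:

#include <stdint.h>
#include <stdio.h>

/* Split an address into tag / set index / offset for a cache with the
 * given geometry.  Purely illustrative. */
static void decompose(uint64_t addr, unsigned size, unsigned ways, unsigned line)
{
	unsigned sets   = size / (ways * line);           /* e.g. 64KB / (2 * 64) = 512 */
	uint64_t offset = addr % line;                    /* bits 5:0 for 64-byte lines */
	uint64_t index  = (addr / line) % sets;           /* bits 14:6 for 512 sets */
	uint64_t tag    = addr / ((uint64_t)line * sets); /* everything above bit 14 */

	printf("addr %#llx -> tag %#llx, set %llu, offset %llu\n",
	       (unsigned long long)addr, (unsigned long long)tag,
	       (unsigned long long)index, (unsigned long long)offset);
}

int main(void)
{
	/* L1D: 64KB, 2-way, 64-byte lines -> 512 sets, index bits 14:6 */
	decompose(0xFFFF0040, 64 * 1024, 2, 64);
	/* L2: 512KB, 16-way, 64-byte lines -> 512 sets, index bits 14:6 */
	decompose(0xFFFF0040, 512 * 1024, 16, 64);
	return 0;
}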

The CAR idea on AMD is simply to use the caches while never causing an eviction
from the L2 cache to main memory (which is not functioning yet).

Step 0) Enable the cache and set up WB MTRRs for the ranges we want to use.
1) All lines start out invalid; validate them with a dummy read over a region at
most as big as the L1 data cache. For the instruction cache an instruction fetch
is enough.
2) The dummy-read region can now be used to store data - it is simply an
arbitrary address range, 64KB at most. (A C sketch of steps 0-2 follows below.)
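
For illustration, here is roughly what steps 0-2 do, written out in C (the real
coreboot code is assembly; the constants and names here are made up for the sketch):

#include <stdint.h>

/* Hypothetical stand-ins for CONFIG_DCACHE_RAM_BASE / _SIZE. */
#define CAR_BASE 0x000C0000UL
#define CAR_SIZE 0x00010000UL	/* 64KB - at most the size of the L1 D-cache */

static void car_validate_lines(void)
{
	volatile uint8_t *p = (volatile uint8_t *)CAR_BASE;
	uint32_t i;

	/* Step 0 (not shown): a variable MTRR marks CAR_BASE..+CAR_SIZE as WB
	 * and the cache is enabled (CR0.CD/NW cleared). */

	/* Step 1: one read per 64-byte line makes that line valid in the L1
	 * D-cache.  DRAM does not work yet, so whatever comes back is junk;
	 * we only need valid lines so that later writes hit the cache instead
	 * of causing fills or evictions. */
	for (i = 0; i < CAR_SIZE; i += 64)
		(void)p[i];

	/* Step 2: CAR_BASE..CAR_BASE+CAR_SIZE can now hold the stack and
	 * sysinfo, as long as nothing forces a writeback to RAM. */
}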

3) Caching of the ROM works too, because:

a) An MTRR for the ROM is set (currently only for part of it). It could be of WP
type, but we use WB; no harm here because we do not modify any code ;)
b) The L1 instruction cache is filled from the flash chip directly (remember, the
L2 is an exclusive cache on AMD).
c) If L1 instruction cache lines are not evicted into the L2, then on a cache miss
the L1 line is simply invalidated and refilled from the flash ROM. I tried to
check this using performance counters, but there is no counter for it. This is the
uninteresting case because it does not complicate anything.

d) If L1 instruction cache lines do get evicted into the L2 (which I don't know to
be the case), then we can run into the following:

I) No L1 data cache lines have been evicted into the L2 - again not an interesting
case, because nothing goes wrong.

II) Some L1 data cache lines have been evicted into the L2. This really happens in our CAR!
print_debug("Copying data from cache to RAM -- switching to use RAM as stack...");
memcopy((void *)((CONFIG_RAMTOP)-CONFIG_DCACHE_RAM_SIZE), (void
*)CONFIG_DCACHE_RAM_BASE, CONFIG_DCACHE_RAM_SIZE);

The evictions happen here because we copy from the CAR region to RAM while CAR is
still running. Both regions are WB, so we must evict some L1 cache lines for sure,
and performance counters confirm this. You may say this is not an issue because
RAM is working normally at this point, but for example while resuming from S3 we
must not overwrite random memory with our CAR contents... I think these evictions
so far happen only here, and things still work; here is why:

We have at most 64KB of dirty data; it spreads into the L2 nicely and still leaves
a lot of free space, even on systems with only a 128KB L2. In this case there are
no evictions to the system, because the data can stay in the L2.
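
A quick worked example of that capacity argument for the smallest-L2 case,
assuming the 64KB CAR region is contiguous and naturally aligned (my numbers):

  128KB L2, 16-way, 64-byte lines:  128KB / (16 * 64) = 128 sets
  64KB of dirty CAR data:           64KB / 64 = 1024 lines
  lines landing in each set:        1024 / 128 = 8 of the 16 ways

So even in the worst case the dirty CAR data fills only half of each L2 set, and
the remaining 8 ways stay free.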

Now let's go back: what if the CPU instruction cache does get evicted into the L2?
That could cause problems, because the L2 would then hold both L1 data cache
contents and random L1 instruction cache code competing for the same space.

I think it still works because dirty data is evicted with the lowest priority: if
all ways of a set are full, a way holding "clean" data is invalidated first. This
saves the day for us, because it guarantees that our L1 data will never fall out
of the cache - only if we exceed the L2 cache size with dirty data.

So far we have examined the ROM caching and the handling of spills out of the L1.
But the memory training writes to not-yet-initialized RAM. How does that work?

I checked, and the memory writes use an instruction which bypasses the caches. The
reads use the cache, but the cache line is invalidated afterwards. Again, because
we have at most an L1's worth of dirty data and the L2 is big enough, it does not
spoil the party and nothing gets evicted back to the non-functioning memory.
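
Just to illustrate the kind of accesses involved (this is not the actual raminit
code; I am assuming SSE2 is available, which gives us non-temporal stores and
CLFLUSH), the pattern is roughly:

#include <stdint.h>
#include <emmintrin.h>	/* SSE2: _mm_stream_si32, _mm_clflush, _mm_mfence */

/* Write a test pattern to not-yet-trained RAM without allocating a cache
 * line, read it back, and immediately throw the line out of the cache so
 * nothing can ever be written back to the non-functioning DRAM. */
static uint32_t dqs_test_location(uint32_t *addr, uint32_t pattern)
{
	/* Non-temporal store: goes out through the write-combining buffers
	 * and bypasses L1/L2 entirely. */
	_mm_stream_si32((int *)addr, (int)pattern);
	_mm_mfence();			/* make sure the store left the core */

	uint32_t readback = *addr;	/* this read does allocate a line... */
	_mm_clflush(addr);		/* ...so flush/invalidate it right away */
	_mm_mfence();

	return readback;
}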

The last thing which worries me are speculative fills, which the CPU may perform
on its own. I think they are disabled because the bit for probe fills is 0.
Family 11h, which has better documented use of the L2 for general storage, needs
some other bits toggled to suppress extra speculation. The Fam10h documentation
describes only L1 CAR, and the older families likewise describe L1-only CAR. In
our code we practically use the L2 in all cases.

What we could do is program a performance counter for L2 writebacks to the system
at the beginning of CAR, and check at CAR disable time that it is still zero. That
would tell us whether we did something nasty.
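
A sketch of that check using coreboot's rdmsr/wrmsr helpers; the event and unit
mask values are from my reading of the K8 BKDG (event 0x7F "L2 fill/writeback",
unit mask bit 1 = writebacks to the system), so please double-check them for the
family in question:

#include <cpu/x86/msr.h>

#define PERF_CTL0	0xC0010000	/* PerfEvtSel0 */
#define PERF_CTR0	0xC0010004	/* PerfCtr0 */

/* Start counting L2 writebacks to the system at the top of CAR setup. */
static void car_start_wb_counter(void)
{
	msr_t msr;

	msr.lo = 0;
	msr.hi = 0;
	wrmsr(PERF_CTR0, msr);		/* clear the counter */

	/* event 0x7F, unit mask 0x02 (writebacks to system),
	 * count in OS mode (bit 17), enable (bit 22) */
	msr.lo = 0x7F | (0x02 << 8) | (1 << 17) | (1 << 22);
	msr.hi = 0;
	wrmsr(PERF_CTL0, msr);
}

/* At CAR disable time: if this is nonzero, we leaked something to RAM. */
static int car_did_something_nasty(void)
{
	msr_t msr = rdmsr(PERF_CTR0);
	return (msr.lo | msr.hi) != 0;
}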

We could also avoid the WB-to-WB copy of the CAR area. I tried a WB-to-UC copy,
and then we get 0 evictions from L1, which is fine (I did some experiments in
January; see the "AMD CAR questions" email).

Uhh, it's a long email and took about an hour to write; please tell me if you
think it really works this way.

Thanks,
Rudolf
