I keep trying to get Cache-As-Ram working with a dual Xeon P4/HT board, and have a few simple(?) questions about cache behaviour. I may have misunderstood some aspects of the cache logic here.
As a test setup, my coreboot.rom image now includes the cache_as_ram.inc from intel/model_6ex, and I link it with a ROMCC-compiled romstage for easier debugging. Except for the cache-based stack, I got this to work nicely.
The Xeon P4/HT CPUs installed on the mainboard have 8 kB L1, 512 kB L2 and 1024 kB L3. All levels share code and data, and the cache-line is 64 bytes. Below I have ignored the existence of L3.
Problem 1: L2 does not store cache-line.
It may be that the L2 cache is currently not enabled at all. The reference code uses MSR 0x11e to explicitly enable L2, but this MSR does not exist on the Xeon P4, and accessing it actually halts the CPU. I did not find any controls besides CR0 and the L3 disable MSR that affect the cache.
As a test procedure, I have defined a 16 kB cache region over non-existing MMIO on the system bus, just below the FWH decode range. DCACHE_RAM_SIZE=0x4000 and DCACHE_RAM_BASE=0xffafc000. All reads there return 0xFF's and any writes are ignored.
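The region values above can be sanity-checked on a host machine. The following is a minimal sketch, not the coreboot code: the struct and function names are hypothetical, and the 36-bit physical address width is an assumption about this CPU. It checks that the region meets the usual variable-MTRR constraints (power-of-two size, base aligned to size) and computes the PHYSBASE/PHYSMASK values a write-back variable MTRR would carry for it.

```c
#include <stdint.h>

/* Values quoted in the post. */
#define DCACHE_RAM_BASE 0xffafc000u
#define DCACHE_RAM_SIZE 0x4000u

/* Architectural encodings for variable-range MTRRs. */
#define MTRR_TYPE_WRBACK     6u          /* memory type: write-back */
#define MTRR_PHYS_MASK_VALID (1u << 11)  /* valid bit in the PHYSMASK register */

/* Assumption: the CPU implements 36 physical address bits. */
#define PHYS_ADDR_BITS 36

/* Hypothetical container for the two 64-bit MSR values. */
struct mtrr_pair {
    uint64_t physbase;
    uint64_t physmask;
};

/* A variable MTRR needs a power-of-two size (at least 4 kB)
 * and a base address aligned to that size. */
static int car_region_is_valid(uint32_t base, uint32_t size)
{
    return size >= 0x1000u &&
           (size & (size - 1)) == 0 &&
           (base & (size - 1)) == 0;
}

/* Compute the register values for a write-back region. */
static struct mtrr_pair car_mtrr_values(uint32_t base, uint32_t size)
{
    uint64_t addr_mask = ((uint64_t)1 << PHYS_ADDR_BITS) - 1;
    struct mtrr_pair p;

    p.physbase = (uint64_t)base | MTRR_TYPE_WRBACK;
    p.physmask = (~((uint64_t)size - 1) & addr_mask) | MTRR_PHYS_MASK_VALID;
    return p;
}
```

With the values from the post, the region ends at 0xffb00000 and the pair works out to 0xffafc006 / 0xfffffc800 under the 36-bit assumption.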
The MTRR is set up for write-back. The state of CR0.CD (cache disable bit) seems to have no effect on this test.
For every dword in the range, starting from BASE, I first read it from the system bus, hoping to get valid cache-lines into L2 to hit later on. Then I write each dword in the range with its own address. When reading back, again starting from BASE, only the last 8 kB (except for one cache-line) returns the contents I wrote.
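The three passes described above could be sketched in C roughly as follows. This is an illustration only: the function name is hypothetical, the actual test presumably ran from ROMCC-compiled romstage code rather than from a helper like this, and the address is truncated to 32 bits so the sketch also runs on a 64-bit host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: run the read / write-address / read-back test
 * over a region and return how many dwords still hold what was written. */
static size_t car_pattern_test(volatile uint32_t *base, size_t size_bytes)
{
    size_t n = size_bytes / sizeof(uint32_t);
    size_t good = 0;
    size_t i;

    /* Pass 1: read every dword, hoping to allocate cache lines. */
    for (i = 0; i < n; i++)
        (void)base[i];

    /* Pass 2: write each dword with its own (32-bit) address. */
    for (i = 0; i < n; i++)
        base[i] = (uint32_t)(uintptr_t)&base[i];

    /* Pass 3: read back from the start and count surviving dwords. */
    for (i = 0; i < n; i++)
        if (base[i] == (uint32_t)(uintptr_t)&base[i])
            good++;

    return good;
}
```

On ordinary RAM every dword survives. For the result described above (the last 8 kB minus one 64-byte line), the count would come out as (8192 - 64) / 4 = 2032 of 4096 dwords.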
My conclusion is that when a modified cache-line is de-allocated from L1, there is a write miss on L2 and the write is lost on the system bus. Is this allowed or typical behaviour, given that under normal operation the cache-line would be stored in DRAM?
Is the minimum amount of RAM required for romstage 32 kB (STACK_SIZE = 0x8000)? If so, L1 alone cannot handle it?
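The size question comes down to simple arithmetic on the figures given in this post. A small sketch with hypothetical helper names makes the cache-line counts explicit:

```c
#include <stdint.h>

/* Figures quoted in the post. */
#define CACHE_LINE_SIZE 64u
#define L1_SIZE         (8u * 1024u)
#define L2_SIZE         (512u * 1024u)
#define DCACHE_RAM_SIZE 0x4000u   /* 16 kB test region */
#define STACK_SIZE      0x8000u   /* 32 kB romstage stack */

/* Number of cache lines needed to cover a region (rounded up). */
static unsigned lines_needed(unsigned bytes)
{
    return (bytes + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;
}

/* True if a region of the given size could fit entirely in a cache. */
static int fits_in(unsigned region_bytes, unsigned cache_bytes)
{
    return region_bytes <= cache_bytes;
}
```

L1 holds 128 lines, the 16 kB test region needs 256 and a 32 kB stack needs 512, so neither fits in L1 alone, while both would fit in the 512 kB L2 if it can be enabled.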
Problem 2: Can I skip cache-fill from L2?
This question is only relevant if I cannot enable L2 and the 8 kB L1 would be enough for Cache-As-Ram (which I doubt).
I have dirty (exclusive?) cache-lines stored in L1, and the code reaches an instruction that requires writing dirty lines to the system bus. Examples of such opcodes: inb, outb, mov to CR0, wrmsr to an MTRR.
I assume that with one of these opcodes, any modified cache-lines are written to the system bus. The cache-line data remains valid in L1, but the state is probably changed (exclusive -> shared?). The first write access to such a line will cause a fill from the next-level cache (L2) or the system bus. In my case L2 might be disabled, so the cache-line then contains all 0xFF's except for the data from our store instruction. Further writes to the same line do not cause a new fill.
Is there a way to avoid the fill from L2 on the first write, and re-use the valid data on the (shared?) line? I thought the no-fill mode (CR0.NW) would do this for me.
Problem 3: Cache re-allocation policy?
Cache-lines for the stack must remain in L1 while the XIP ROM lines can be thrown out whenever necessary. Generally, do dirty cache-lines remain in L1 as long as there are non-dirty cache-lines that require less effort to re-allocate?
Kyösti
Hi,
Just a quick note. There are performance MSRs which have L2 fill counters. I wanted to know how the fills/misses/etc. work on AMD, so I used those counters to see what is actually going on.
http://oprofile.sourceforge.net/docs
has a list of events.
> Problem 3: Cache re-allocation policy?
> Cache-lines for the stack must remain in L1 while the XIP ROM lines can be thrown out whenever necessary. Generally, do dirty cache-lines remain in L1 as long as there are non-dirty cache-lines that require less effort to re-allocate?
I think non-dirty (unmodified) lines are just discarded (XIP ROM) or moved to L2.
The CR0.NW mode differs across Intel CPUs, so check the documentation (the architecture manuals) for what it does in your case. To get CAR working I would simply follow the BIOS with SerialICE.
Thanks, Rudolf
On Tue, 2011-11-08 at 09:33 +0100, Rudolf Marek wrote:
> Hi,
> Just a quick note. There are performance MSRs which have L2 fill counters. I wanted to know how the fills/misses/etc. work on AMD, so I used those counters to see what is actually going on.
> http://oprofile.sourceforge.net/docs
> has a list of events.
Good tip! I should be able to use events BSQ_cache_reference and BSQ_allocation to count cache hits and misses.
> The CR0.NW mode differs across Intel CPUs, so check the documentation (the architecture manuals) for what it does in your case. To get CAR working I would simply follow the BIOS with SerialICE.
Well, for Xeon CR0.NW is a don't care bit and it's always in no-fill mode.
Part of my Problem 2 was that inb()/outb() also flush modified cache-lines from L1 to the (disabled?) L2 and the system bus. I think with SerialICE I could not keep a single valid cache-line in L1, as this would involve serial I/O between any two store-dword instructions.
I just realised that attempts to use a compressed ramstage on this mainboard have always halted on un-LZMA, while for the payload un-LZMA is OK. Until now I thought it was a compiler issue, with ROMCC and GCC creating slightly different machine code, but maybe I should push the microcode update before any cache use. While there are quite a few errata for these CPUs, I did not identify any of them with the symptoms I saw.
Thanks, Kyösti