Hi,
The unrv2b uncompression algorithm appears to behave very badly in the absence of a proper cache. Instrumentation indicates something like ~180 i-stream cacheline requests for _every_ d-stream cacheline request during decompression. The following patch copies the unrv2b function to DRAM before performing the actual decompression, reducing the run-time from 12.5s to 50ms in my environment. The patch is somewhat rough, and assumes that (1) the unrv2b function is less than 1k in size, and (2) that placing a copy of the function just after the decompress destination is unproblematic. It works well for me, though.
Signed-off-by: Arne Georg Gleditsch arne.gleditsch@numascale.com
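For readers following along, the approach the patch describes can be sketched roughly as follows. This is an illustrative outline, not the actual patch: the names (`copy_decompressor_to_ram`, `CODE_COPY_SIZE`) and the scratch-buffer handling are assumptions, not real coreboot symbols.

```c
#include <stdint.h>
#include <string.h>

/* Assumption (1) from the mail: the decompressor is under 1 KiB. */
#define CODE_COPY_SIZE 1024

typedef void (*decompress_fn)(const uint8_t *src, uint8_t *dst);

/* Copy the function's code out of ROM into a RAM scratch buffer and
 * return a callable pointer to the copy.  Casting between data and
 * function pointers is not portable ISO C, but is common practice in
 * firmware, where the code controls its own memory layout. */
static decompress_fn copy_decompressor_to_ram(decompress_fn rom_fn,
                                              uint8_t *scratch)
{
    memcpy(scratch, (const void *)rom_fn, CODE_COPY_SIZE);
    return (decompress_fn)scratch;
}

/* In the patch, 'scratch' would sit just after the decompress
 * destination (assumption (2)), and the caller would then invoke the
 * returned RAM-resident copy instead of the ROM-resident original,
 * so the hot decompression loop no longer fetches its i-stream from
 * uncached ROM. */
```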
Hi,
Awesome improvement!
Arne Georg Gleditsch wrote:
The unrv2b uncompression algorithm appears to behave very badly in the absence of a proper cache.
But why does this improve when the algorithm runs out of RAM?
Could ROM accesses be a factor?
The following patch copies the unrv2b function to DRAM before performing the actual decompression, reducing the run-time from 12.5s to 50ms in my environment. The patch is somewhat rough, and assumes that (1) the unrv2b function is less than 1k in size, and (2) that placing a copy of the function just after the decompress destination is unproblematic. It works well for me, though.
Signed-off-by: Arne Georg Gleditsch arne.gleditsch@numascale.com
Acked-by: Peter Stuge peter@stuge.se
Peter Stuge peter@stuge.se writes:
Arne Georg Gleditsch wrote:
The unrv2b uncompression algorithm appears to behave very badly in the absence of a proper cache.
But why does this improve when the algorithm runs out of RAM?
Could ROM accesses be a factor?
Yes, at least in the sense that running from ROM means we're running from an uncached memory region. I assume you'd notice much the same running from uncached DRAM as well. copy_and_run is one of the few things that run after cache-as-ram has been disabled but before our code has been copied to proper DRAM, so it is especially sensitive regarding code footprint.
Arne Georg Gleditsch wrote:
Peter Stuge peter@stuge.se writes:
Arne Georg Gleditsch wrote:
The unrv2b uncompression algorithm appears to behave very badly in the absence of a proper cache.
But why does this improve when the algorithm runs out of RAM?
Could ROM accesses be a factor?
Yes, at least in the sense that running from ROM means we're running from an uncached memory region. I assume you'd notice much the same running from uncached DRAM as well. copy_and_run is one of the few things that run after cache-as-ram has been disabled but before our code has been copied to proper DRAM, so it is especially sensitive regarding code footprint.
I have thought about this a while back and have wanted to make a change. Disabling CAR should fixup the stack etc but for performance reasons we should setup/leave ROM and RAM caching enabled on the BSP. If you are interested in looking at that I think it would be great.
Marc
Marc Jones Marc.Jones@amd.com writes:
I have thought about this a while back and have wanted to make a change. Disabling CAR should fixup the stack etc but for performance reasons we should setup/leave ROM and RAM caching enabled on the BSP. If you are interested in looking at that I think it would be great.
I considered that, but this being v2 I went with the unsophisticated approach, presuming it would be easier to debug and less intrusive a change. I might be wrong about that, of course, but when the naive modifications turned out to be sufficient I didn't investigate further. (I believe the time window where this is a problem is fairly small?)
How is the transition from CAR to regular cached DRAM done in v3? (I must admit to not being terribly hip with the v3 code base; is it approaching a stage where it might make sense to add something like the Tyan s2912 port to it?)
Marc Jones wrote:
Arne Georg Gleditsch wrote:
Peter Stuge peter@stuge.se writes:
Arne Georg Gleditsch wrote:
The unrv2b uncompression algorithm appears to behave very badly in the absence of a proper cache.
But why does this improve when the algorithm runs out of RAM?
Could ROM accesses be a factor?
Yes, at least in the sense that running from ROM means we're running from an uncached memory region. I assume you'd notice much the same running from uncached DRAM as well. copy_and_run is one of the few things that run after cache-as-ram has been disabled but before our code has been copied to proper DRAM, so it is especially sensitive regarding code footprint.
I have thought about this a while back and have wanted to make a change. Disabling CAR should fixup the stack etc but for performance reasons we should setup/leave ROM and RAM caching enabled on the BSP. If you are interested in looking at that I think it would be great.
What CPU/chipset is this? On quite a few of them, the ROM stays cacheable all the time, afaik.
Stefan Reinauer stepan@coresystems.de writes:
I have thought about this a while back and have wanted to make a change. Disabling CAR should fixup the stack etc but for performance reasons we should setup/leave ROM and RAM caching enabled on the BSP. If you are interested in looking at that I think it would be great.
What CPU/chipset is this? On quite a few of them, the ROM stays cacheable all the time, afaik.
I've seen this performance issue on my Tyan test rig; s2912 mainboard, mcp55 southbridge and 83xx Opterons. From the code in the serengeti_cheetah_fam10 target I'd expect the same behavior to manifest there.
Arne Georg Gleditsch wrote:
Stefan Reinauer stepan@coresystems.de writes:
I have thought about this a while back and have wanted to make a change. Disabling CAR should fixup the stack etc but for performance reasons we should setup/leave ROM and RAM caching enabled on the BSP. If you are interested in looking at that I think it would be great.
What CPU/chipset is this? On quite a few of them, the ROM stays cacheable all the time, afaik.
I've seen this performance issue on my Tyan test rig; s2912 mainboard, mcp55 southbridge and 83xx Opterons. From the code in the serengeti_cheetah_fam10 target I'd expect the same behavior to manifest there.
I think that this is in all K8 and fam10 disable_car code. The copy to memory is probably good. Running from the rom and decompressing from the rom is going to thrash. It may affect some cpu/chipset combinations more than others.
Marc
On 30.10.2008, at 15:03, Marc Jones marc.jones@amd.com wrote:
I think that this is in all K8 and fam10 disable_car code. The copy to memory is probably good. Running from the rom and decompressing from the rom is going to thrash. It may affect some cpu/chipset combinations more than others.
Marc
So what's the reason for that?
Stefan Reinauer wrote:
On 30.10.2008, at 15:03, Marc Jones marc.jones@amd.com wrote:
I think that this is in all K8 and fam10 disable_car code. The copy to memory is probably good. Running from the rom and decompressing from the rom is going to thrash. It may affect some cpu/chipset combinations more than others.
Marc
So what's the reason for that?
Off the top of my head:
1. ROM access speed: LPC and/or SPI may be run at different speeds (faster on some platforms).
2. Domains crossed: HT -> maybe PCI/PCIe -> PCI/PCIe -> LPC (fewer is better).
3. Prefetch and code alignment: being a few bytes off can cause the prefetching to thrash.
It shouldn't thrash as badly if the ROM is being cached, but I am not sure that it is, or that it is set up correctly.
Marc
On 30.10.2008 23:33, Marc Jones wrote:
Stefan Reinauer wrote:
On 30.10.2008, at 15:03, Marc Jones marc.jones@amd.com wrote:
I think that this is in all K8 and fam10 disable_car code. The copy to memory is probably good. Running from the rom and decompressing from the rom is going to thrash. It may affect some cpu/chipset combinations more than others.
Marc
So what's the reason for that?
off the top of my head:
1. ROM access speed: LPC and/or SPI may be run at different speeds (faster on some platforms).
2. Domains crossed: HT -> maybe PCI/PCIe -> PCI/PCIe -> LPC (fewer is better).
3. Prefetch and code alignment: being a few bytes off can cause the prefetching to thrash.
It shouldn't thrash as badly if the ROM is being cached, but I am not sure that it is, or that it is set up correctly.
A system using SPI for its ROM at the usual speed without caching has a per-byte latency of ~3 microseconds. That translates to per-instruction latencies of roughly 6 microseconds for simple instructions accessing data in the ROM. That's roughly 500 times slower than an instruction in I-Cache accessing data in RAM.
Regards, Carl-Daniel
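Spelling out the arithmetic behind those figures as a sanity check (these are rough estimates from the mail, not measurements; the ~2 bytes per simple instruction and the ~12 ns cached figure are the values the stated numbers imply):

```c
/* Back-of-the-envelope check of the slowdown estimate above.
 * All inputs are rough estimates taken from the mail, not
 * measured values. */
static double rom_slowdown(void)
{
    double ns_per_rom_byte = 3000.0; /* ~3 us per byte over uncached SPI */
    double bytes_per_insn  = 2.0;    /* a short, simple x86 instruction  */
    double insn_ns_rom = ns_per_rom_byte * bytes_per_insn; /* ~6000 ns  */
    double insn_ns_ram = 12.0;       /* cached instruction, data in RAM */
    return insn_ns_rom / insn_ns_ram; /* ~500x slower from ROM          */
}
```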
Carl-Daniel Hailfinger c-d.hailfinger.devel.2006@gmx.net writes:
On 30.10.2008 23:33, Marc Jones wrote:
Stefan Reinauer wrote:
On 30.10.2008, at 15:03, Marc Jones marc.jones@amd.com wrote:
I think that this is in all K8 and fam10 disable_car code. The copy to memory is probably good. Running from the rom and decompressing from the rom is going to thrash. It may affect some cpu/chipset combinations more than others.
Marc
So what's the reason for that?
off the top of my head:
1. ROM access speed: LPC and/or SPI may be run at different speeds (faster on some platforms).
2. Domains crossed: HT -> maybe PCI/PCIe -> PCI/PCIe -> LPC (fewer is better).
3. Prefetch and code alignment: being a few bytes off can cause the prefetching to thrash.
It shouldn't thrash as badly if the ROM is being cached, but I am not sure that it is, or that it is set up correctly.
A system using SPI for its ROM at the usual speed without caching has a per-byte latency of ~3 microseconds. That translates to per-instruction latencies of roughly 6 microseconds for simple instructions accessing data in the ROM. That's roughly 500 times slower than an instruction in I-Cache accessing data in RAM.
Historically all of this was handled by setting up an MTRR over the ROM chip. I think the type was write-through, and I think the config option was uncompressed ROM size.
So if you are seeing slow decompression performance I'm guessing you can fix it with just a few config fiddles. But look at the early mtrr setup to be certain.
Eric
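The setup Eric describes amounts to programming a variable-range MTRR of type write-through over the ROM. A sketch of the register values such a setup would compute follows; the 1 MiB ROM size, the 40-bit physical address width, and the helper names are illustrative assumptions, and the actual `wrmsr` to MTRRphysBase/MTRRphysMask is omitted since it requires ring 0.

```c
#include <stdint.h>

#define MTRR_TYPE_WT         4ULL               /* write-through memory type */
#define MTRR_PHYS_MASK_VALID (1ULL << 11)       /* valid bit in the mask MSR */
#define PHYS_ADDR_MASK       ((1ULL << 40) - 1) /* assume 40 physical bits   */

/* Value for MTRRphysBase: the range base ORed with the memory type. */
static uint64_t mtrr_base(uint64_t rom_base)
{
    return rom_base | MTRR_TYPE_WT;
}

/* Value for MTRRphysMask: the range size (a power of two) turned
 * into an address mask, plus the valid bit. */
static uint64_t mtrr_mask(uint64_t rom_size)
{
    return (~(rom_size - 1) & PHYS_ADDR_MASK) | MTRR_PHYS_MASK_VALID;
}
```

For example, a 1 MiB ROM mapped at the top of 4 GiB (base 0xFFF00000) would give a base value of 0xFFF00004 and a mask value of 0xFFFFF00800.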