Hi,
The unrv2b uncompression algorithm appears to behave very badly in the absence of a proper cache. Instrumentation indicates something like ~180 i-stream cacheline requests for _every_ d-stream cacheline request during decompression. The following patch copies the unrv2b function to DRAM before performing the actual decompression, reducing the run-time from 12.5s to 50ms in my environment. The patch is somewhat rough, and assumes that (1) the unrv2b function is less than 1k in size, and (2) that placing a copy of the function just after the decompress destination is unproblematic. It works well for me, though.
Signed-off-by: Arne Georg Gleditsch arne.gleditsch@numascale.com
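For readers following along, the approach the patch describes can be sketched roughly as follows. This is an illustrative outline, not the actual patch: the names (`copy_decompressor_to_ram`, `CODE_COPY_SIZE`) and the scratch-buffer handling are assumptions, not real coreboot symbols.

```c
#include <stdint.h>
#include <string.h>

/* Assumption (1) from the mail: the decompressor is under 1 KiB. */
#define CODE_COPY_SIZE 1024

typedef void (*decompress_fn)(const uint8_t *src, uint8_t *dst);

/* Copy the function's code out of ROM into a RAM scratch buffer and
 * return a callable pointer to the copy.  Casting between data and
 * function pointers is not portable ISO C, but is common practice in
 * firmware, where the code controls its own memory layout. */
static decompress_fn copy_decompressor_to_ram(decompress_fn rom_fn,
                                              uint8_t *scratch)
{
    memcpy(scratch, (const void *)rom_fn, CODE_COPY_SIZE);
    return (decompress_fn)scratch;
}

/* In the patch, 'scratch' would sit just after the decompress
 * destination (assumption (2)), and the caller would then invoke the
 * returned RAM-resident copy instead of the ROM-resident original,
 * so the hot decompression loop no longer fetches its i-stream from
 * uncached ROM. */
```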
Hi,
Awesome improvement!
Arne Georg Gleditsch wrote:
The unrv2b uncompression algorithm appears to behave very badly in the absence of a proper cache.
But why does this improve when the algorithm runs out of RAM?
Could ROM accesses be a factor?
The following patch copies the unrv2b function to DRAM before performing the actual decompression, reducing the run-time from 12.5s to 50ms in my environment. The patch is somewhat rough, and assumes that (1) the unrv2b function is less than 1k in size, and (2) that placing a copy of the function just after the decompress destination is unproblematic. It works well for me, though.
Signed-off-by: Arne Georg Gleditsch arne.gleditsch@numascale.com
Acked-by: Peter Stuge peter@stuge.se
Peter Stuge peter@stuge.se writes:
Arne Georg Gleditsch wrote:
The unrv2b uncompression algorithm appears to behave very badly in the absence of a proper cache.
But why does this improve when the algorithm runs out of RAM?
Could ROM accesses be a factor?
Yes, at least in the sense that running from ROM means we're running from an uncached memory region. I assume you'd notice much the same running from uncached DRAM as well. copy_and_run is one of the few things that run after cache-as-ram has been disabled but before our code has been copied to proper DRAM, so it is especially sensitive regarding code footprint.
Arne Georg Gleditsch wrote:
Peter Stuge peter@stuge.se writes:
Arne Georg Gleditsch wrote:
The unrv2b uncompression algorithm appears to behave very badly in the absence of a proper cache.
But why does this improve when the algorithm runs out of RAM?
Could ROM accesses be a factor?
Yes, at least in the sense that running from ROM means we're running from an uncached memory region. I assume you'd notice much the same running from uncached DRAM as well. copy_and_run is one of the few things that run after cache-as-ram has been disabled but before our code has been copied to proper DRAM, so it is especially sensitive regarding code footprint.
I have thought about this a while back and have wanted to make a change. Disabling CAR should fixup the stack etc but for performance reasons we should setup/leave ROM and RAM caching enabled on the BSP. If you are interested in looking at that I think it would be great.
Marc
Marc Jones Marc.Jones@amd.com writes:
I have thought about this a while back and have wanted to make a change. Disabling CAR should fixup the stack etc but for performance reasons we should setup/leave ROM and RAM caching enabled on the BSP. If you are interested in looking at that I think it would be great.
I considered that, but this being v2 I went with the unsophisticated approach, presuming it would be easier to debug and less intrusive a change. I might be wrong about that, of course, but when the naive modifications turned out to be sufficient I didn't investigate further. (I believe the time window where this is a problem is fairly small?)
How is the transition from CAR to regular cached DRAM done in v3? (I must admit to not being terribly hip with the v3 code base; is it approaching a stage where it might make sense to add something like the Tyan s2912 port to it?)
Marc Jones wrote:
Arne Georg Gleditsch wrote:
Peter Stuge peter@stuge.se writes:
Arne Georg Gleditsch wrote:
The unrv2b uncompression algorithm appears to behave very badly in the absence of a proper cache.
But why does this improve when the algorithm runs out of RAM?
Could ROM accesses be a factor?
Yes, at least in the sense that running from ROM means we're running from an uncached memory region. I assume you'd notice much the same running from uncached DRAM as well. copy_and_run is one of the few things that run after cache-as-ram has been disabled but before our code has been copied to proper DRAM, so it is especially sensitive regarding code footprint.
I have thought about this a while back and have wanted to make a change. Disabling CAR should fixup the stack etc but for performance reasons we should setup/leave ROM and RAM caching enabled on the BSP. If you are interested in looking at that I think it would be great.
What CPU/chipset is this? On quite a few of them, the ROM stays cacheable all the time, afaik.
Stefan Reinauer stepan@coresystems.de writes:
I have thought about this a while back and have wanted to make a change. Disabling CAR should fixup the stack etc but for performance reasons we should setup/leave ROM and RAM caching enabled on the BSP. If you are interested in looking at that I think it would be great.
What CPU/chipset is this? On quite a few of them, the ROM stays cacheable all the time, afaik.
I've seen this performance issue on my Tyan test rig; s2912 mainboard, mcp55 southbridge and 83xx Opterons. From the code in the serengeti_cheetah_fam10 target I'd expect the same behavior to manifest there.
Arne Georg Gleditsch wrote:
Stefan Reinauer stepan@coresystems.de writes:
I have thought about this a while back and have wanted to make a change. Disabling CAR should fixup the stack etc but for performance reasons we should setup/leave ROM and RAM caching enabled on the BSP. If you are interested in looking at that I think it would be great.
What CPU/chipset is this? On quite a few of them, the ROM stays cacheable all the time, afaik.
I've seen this performance issue on my Tyan test rig; s2912 mainboard, mcp55 southbridge and 83xx Opterons. From the code in the serengeti_cheetah_fam10 target I'd expect the same behavior to manifest there.
I think that this is in all K8 and fam10 disable_car code. The copy to memory is probably good. Running from the rom and decompressing from the rom is going to thrash. It may affect some cpu/chipset combinations more than others.
Marc
On 30.10.2008, at 15:03, Marc Jones marc.jones@amd.com wrote:
I think that this is in all K8 and fam10 disable_car code. The copy to memory is probably good. Running from the rom and decompressing from the rom is going to thrash. It may affect some cpu/chipset combinations more than others.
Marc
So what's the reason for that?
Stefan Reinauer wrote:
On 30.10.2008, at 15:03, Marc Jones marc.jones@amd.com wrote:
I think that this is in all K8 and fam10 disable_car code. The copy to memory is probably good. Running from the rom and decompressing from the rom is going to thrash. It may affect some cpu/chipset combinations more than others.
Marc
So what's the reason for that?
Off the top of my head:
1. ROM access speed: LPC and/or SPI may be run at different speeds (faster on some platforms).
2. Domains crossed: HT -> maybe PCI/PCIe -> PCI/PCIe -> LPC (fewer is better).
3. Prefetch and code alignment: being a few bytes off can cause the prefetching to thrash.
It shouldn't thrash as badly if the ROM is being cached, but I am not sure that it is, or that it is set up correctly.
Marc
On 30.10.2008 23:33, Marc Jones wrote:
Stefan Reinauer wrote:
On 30.10.2008, at 15:03, Marc Jones marc.jones@amd.com wrote:
I think that this is in all K8 and fam10 disable_car code. The copy to memory is probably good. Running from the rom and decompressing from the rom is going to thrash. It may affect some cpu/chipset combinations more than others.
Marc
So what's the reason for that?
off the top of my head:
1. ROM access speed: LPC and/or SPI may be run at different speeds (faster on some platforms).
2. Domains crossed: HT -> maybe PCI/PCIe -> PCI/PCIe -> LPC (fewer is better).
3. Prefetch and code alignment: being a few bytes off can cause the prefetching to thrash.
It shouldn't thrash as badly if the ROM is being cached, but I am not sure that it is, or that it is set up correctly.
A system using SPI for its ROM at the usual speed without caching has a per-byte latency of ~3 microseconds. That translates to per-instruction latencies of roughly 6 microseconds for simple instructions accessing data in the ROM. That's roughly 500 times slower than an instruction in I-Cache accessing data in RAM.
Regards, Carl-Daniel
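Spelling out the arithmetic behind those figures as a sanity check (these are rough estimates from the mail, not measurements; the ~2 bytes per simple instruction and the ~12 ns cached figure are the values the stated numbers imply):

```c
/* Back-of-the-envelope check of the slowdown estimate above.
 * All inputs are rough estimates taken from the mail, not
 * measured values. */
static double rom_slowdown(void)
{
    double ns_per_rom_byte = 3000.0; /* ~3 us per byte over uncached SPI */
    double bytes_per_insn  = 2.0;    /* a short, simple x86 instruction  */
    double insn_ns_rom = ns_per_rom_byte * bytes_per_insn; /* ~6000 ns  */
    double insn_ns_ram = 12.0;       /* cached instruction, data in RAM */
    return insn_ns_rom / insn_ns_ram; /* ~500x slower from ROM          */
}
```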
Carl-Daniel Hailfinger c-d.hailfinger.devel.2006@gmx.net writes:
On 30.10.2008 23:33, Marc Jones wrote:
Stefan Reinauer wrote:
On 30.10.2008, at 15:03, Marc Jones marc.jones@amd.com wrote:
I think that this is in all K8 and fam10 disable_car code. The copy to memory is probably good. Running from the rom and decompressing from the rom is going to thrash. It may affect some cpu/chipset combinations more than others.
Marc
So what's the reason for that?
off the top of my head:
1. ROM access speed: LPC and/or SPI may be run at different speeds (faster on some platforms).
2. Domains crossed: HT -> maybe PCI/PCIe -> PCI/PCIe -> LPC (fewer is better).
3. Prefetch and code alignment: being a few bytes off can cause the prefetching to thrash.
It shouldn't thrash as badly if the ROM is being cached, but I am not sure that it is, or that it is set up correctly.
A system using SPI for its ROM at the usual speed without caching has a per-byte latency of ~3 microseconds. That translates to per-instruction latencies of roughly 6 microseconds for simple instructions accessing data in the ROM. That's roughly 500 times slower than an instruction in I-Cache accessing data in RAM.
Historically all of this was handled by setting up an MTRR over the ROM chip. I think the type was write-through, and I think the config option was uncompressed ROM size.
So if you are seeing slow decompression performance I'm guessing you can fix it with just a few config fiddles. But look at the early mtrr setup to be certain.
Eric
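The setup Eric describes amounts to programming a variable-range MTRR of type write-through over the ROM. A sketch of the register values such a setup would compute follows; the 1 MiB ROM size, the 40-bit physical address width, and the helper names are illustrative assumptions, and the actual `wrmsr` to MTRRphysBase/MTRRphysMask is omitted since it requires ring 0.

```c
#include <stdint.h>

#define MTRR_TYPE_WT         4ULL               /* write-through memory type */
#define MTRR_PHYS_MASK_VALID (1ULL << 11)       /* valid bit in the mask MSR */
#define PHYS_ADDR_MASK       ((1ULL << 40) - 1) /* assume 40 physical bits   */

/* Value for MTRRphysBase: the range base ORed with the memory type. */
static uint64_t mtrr_base(uint64_t rom_base)
{
    return rom_base | MTRR_TYPE_WT;
}

/* Value for MTRRphysMask: the range size (a power of two) turned
 * into an address mask, plus the valid bit. */
static uint64_t mtrr_mask(uint64_t rom_size)
{
    return (~(rom_size - 1) & PHYS_ADDR_MASK) | MTRR_PHYS_MASK_VALID;
}
```

For example, a 1 MiB ROM mapped at the top of 4 GiB (base 0xFFF00000) would give a base value of 0xFFF00004 and a mask value of 0xFFFFF00800.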