Now that we have addressed the CAR/MMIO ROM copy issue (thanks Scott and Arne), I wanted to revisit the other change that Arne proposed for memset.
http://article.gmane.org/gmane.linux.bios/57707
Some people didn't like the architecture specific code for memset. I don't have a problem with it, but it did bring up a few questions.
1. Why isn't gcc generating the rep stosb itself? This should be a standard optimization.
2. Why do we have functions for memset, memcpy, etc, when gcc has builtins that could be used? Is this a remnant from romcc?
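For reference, the architecture-specific memset under discussion boils down to a rep stosb loop. A minimal sketch of that approach using GNU inline asm (an illustration of the idea, not Arne's actual patch):

#include <stddef.h>

/* Architecture-specific memset built on rep stosb (x86, GNU inline asm).
 * The whole fill loop runs in microcode, so no loop instructions are
 * fetched per byte stored. Illustration only. */
void *memset(void *dst, int c, size_t n)
{
        void *d = dst;

        asm volatile("rep stosb"
                     : "+D" (d), "+c" (n)
                     : "a" (c)
                     : "memory");
        return dst;
}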
Marc
On 10.09.2010 00:45, Marc Jones wrote:
Now that we have addressed the CAR/MMIO ROM copy issue (thanks Scott and Arne), I wanted to revisit the other change that Arne proposed for memset.
http://article.gmane.org/gmane.linux.bios/57707
Some people didn't like the architecture specific code for memset. I don't have a problem with it, but it did bring up a few questions.
- Why isn't gcc generating the rep stosb itself? This should be a standard optimization.
- Why do we have functions for memset, memcpy, etc, when gcc has builtins that could be used? Is this a remnant from romcc?
Please note that the AMD and Intel manuals explicitly mention that some instructions which would be used by a fast memcpy will bypass the cache, and as such are unsuitable to copy CAR contents to RAM.
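To illustrate the kind of instruction the manuals are talking about: a fast memcpy for large blocks is typically built on SSE non-temporal stores, and those stores deliberately go around the cache. A sketch, assuming SSE2, 16-byte-aligned pointers and a length that is a multiple of 64 bytes (illustration only, not coreboot code):

#include <emmintrin.h>
#include <stddef.h>

/* Large-block copy using SSE2 non-temporal stores. _mm_stream_si128()
 * emits movntdq, which writes around the cache hierarchy -- the
 * cache-bypassing behaviour mentioned above. Requires 16-byte-aligned
 * pointers and a length that is a multiple of 64 bytes. */
static void copy_nt(void *dst, const void *src, size_t len)
{
        __m128i *d = dst;
        const __m128i *s = src;

        for (; len >= 64; len -= 64, d += 4, s += 4) {
                _mm_stream_si128(d + 0, _mm_load_si128(s + 0));
                _mm_stream_si128(d + 1, _mm_load_si128(s + 1));
                _mm_stream_si128(d + 2, _mm_load_si128(s + 2));
                _mm_stream_si128(d + 3, _mm_load_si128(s + 3));
        }
        _mm_sfence();   /* make the weakly-ordered stores globally visible */
}

The "aligned NT xmm" entries in the benchmark numbers further down presumably come from loops of this family.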
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
Please note that the AMD and Intel manuals explicitly mention that some instructions which would be used by a fast memcpy will bypass the cache, and as such are unsuitable to copy CAR contents to RAM.
Another area to watch for when using optimized memory functions is SMI handling code. The processor saves most registers before running SMI code, but not the XMM registers.
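A sketch of the consequence: SMI code that wants to call an XMM-based memcpy or memset has to preserve the interrupted context's FP/SSE state itself, for example with fxsave/fxrstor. The save area and helper names below are made up for illustration:

#include <stdint.h>

/* FXSAVE/FXRSTOR need a 512-byte, 16-byte-aligned save area. */
struct fxsave_area {
        uint8_t bytes[512];
} __attribute__((aligned(16)));

static struct fxsave_area smm_fpu_state;

static inline void smi_save_fpu(void)
{
        asm volatile("fxsave %0" : "=m" (smm_fpu_state));
}

static inline void smi_restore_fpu(void)
{
        asm volatile("fxrstor %0" : : "m" (smm_fpu_state));
}

/* Usage inside the SMI handler (sketch):
 *
 *      smi_save_fpu();
 *      ... call the XMM-using memcpy/memset here ...
 *      smi_restore_fpu();
 */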
Here are some benchmark results for an AMD family 10h desktop processor with DDR2 memory. This shows that for AMD processors rep movs and rep stos performance is not great, though better than a byte loop. Note that this test is for operations on large blocks; for small blocks, rep or a byte loop is probably best due to less setup overhead.
======= dram, memset, aligned data =======
rep stos                              3,614 MB/s
optimized memset                      6,220 MB/s
compiler memset                       3,622 MB/s
byte loop memset                        871 MB/s

======= dram, memset, unaligned data =======
rep stos                              3,101 MB/s
optimized memset                      6,229 MB/s
compiler memset                       3,621 MB/s
byte loop memset                        871 MB/s

======= dram, memcpy, aligned data =======
rep movs                              1,754 MB/s
optimized memcpy                      3,979 MB/s
compiler memcpy                       1,865 MB/s
unaligned xmm method 1                2,293 MB/s
byte loop memcpy                        855 MB/s
aligned NT xmm unrolled 4             3,233 MB/s
aligned NT xmm                        2,924 MB/s
aligned NT xmm unrolled 4 reverse     3,235 MB/s
aligned NT xmm unrolled 4 prefetch    3,965 MB/s
aligned NT xmm unrolled 8             3,217 MB/s
aligned NT xmm unrolled 8 reverse     2,641 MB/s
aligned NT xmm unrolled 8 prefetch    3,954 MB/s
aligned NT xmm unrolled 16            2,920 MB/s
aligned NT xmm unrolled 16 prefetch   3,861 MB/s

======= dram, memcpy, unaligned data =======
rep movs                              1,509 MB/s
optimized memcpy                      3,997 MB/s
compiler memcpy                       1,628 MB/s
unaligned xmm method 1                2,242 MB/s
byte loop memcpy                        854 MB/s
In this report: http://article.gmane.org/gmane.linux.bios/57707, Arne may have been encountering the ClLinesToNbDis issue (assuming the memset code was running from flash). Switching to rep movs would greatly improve performance because unlike a byte loop, rep movs loops in microcode which does not cause continuous flash memory accesses.
Thanks, Scott
Scott Duplichan wrote:
for AMD processors rep movs and rep stos performance is not great, though better than a byte loop.
The Intel 386 through Pentium era was when I last did performance-critical code like this. A 32-bit mov, mov, add, add, dec ecx, jnz loop was the standard back then; rep was for size optimization, not performance.
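Roughly this, for reference — a sketch assuming 32-bit x86, a non-zero count and a length that is a multiple of four bytes:

#include <stddef.h>

/* Classic 386-era dword copy: load, store, advance both pointers,
 * decrement the count, loop. 32-bit x86 only; count must be non-zero. */
static void copy_dwords(void *dst, const void *src, size_t dwords)
{
        unsigned int tmp;

        asm volatile(
                "1:\n\t"
                "movl (%[s]), %[t]\n\t"         /* mov: load a dword */
                "movl %[t], (%[d])\n\t"         /* mov: store it */
                "addl $4, %[s]\n\t"             /* add: advance source */
                "addl $4, %[d]\n\t"             /* add: advance destination */
                "decl %[c]\n\t"                 /* dec ecx */
                "jnz 1b"                        /* jnz: loop */
                : [s] "+r" (src), [d] "+r" (dst), [c] "+r" (dwords),
                  [t] "=&r" (tmp)
                :
                : "cc", "memory");
}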
//Peter
"Scott Duplichan" scott@notabs.org writes:
In this report: http://article.gmane.org/gmane.linux.bios/57707, Arne may have been encountering the ClLinesToNbDis issue (assuming the memset code was running from flash). Switching to rep movs would greatly improve performance because unlike a byte loop, rep movs loops in microcode which does not cause continuous flash memory accesses.
This was my assumption as well. After fixing the ClLinesToNbDis setting, I have removed the rep stosb code from my tree, and so far I've not observed the pathological memset behaviour that caused me to put it in in the first place. (As mentioned earlier, this was never altogether deterministic; I'm assuming some critical part of the original memset loop needed to straddle cache lines or something for it to manifest.)
Arne Georg Gleditsch wrote:
As mentioned earlier, this was never altogether deterministic; I'm assuming some critical part of the original memset loop needed to straddle cache lines or something for it to manifest.
Interesting point about memcpy straddling a cache line boundary. It got me thinking about what the DediProg em100 trace function shows when booting from SPI flash. With SPI, the SB initially reads a dword at a time. If the processor is not caching code, a byte loop memcpy would trigger multiple dword reads from the flash chip for every byte copied. If BIOS sets SB option PrefetchEnSPIFromHost, then the SB will switch to cache line reads, and cache the last line read. Since a byte loop memcpy fits in a cache line, it seems conceivable that memcpy performance would be good unless the function straddles a cache line boundary. I am not sure what the situation is with LPC flash.
Anyway, I noticed coreboot is not setting the AMD SB bit PrefetchEnSPIFromHost. For big payloads, setting this bit could cut boot time by eliminating overhead when reading big chunks from SPI flash memory.
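Something along these lines, purely as a sketch — the SPI base address, register offset and bit position below are placeholders, not values from the register reference guide, and would need to be looked up before use:

#include <stdint.h>

/* All three values below are placeholders: the real SPI base address is
 * programmed into the LPC bridge, and the real offset and bit position of
 * PrefetchEnSPIFromHost come from the southbridge register reference. */
#define SPI_BASE_ADDRESS                0xfec10000UL    /* placeholder */
#define SPI_CNTRL_OFFSET                0x00            /* placeholder */
#define PREFETCH_EN_SPI_FROM_HOST       (1UL << 24)     /* placeholder */

static void sb_enable_spi_prefetch(void)
{
        volatile uint32_t *reg =
                (volatile uint32_t *)(SPI_BASE_ADDRESS + SPI_CNTRL_OFFSET);

        *reg |= PREFETCH_EN_SPI_FROM_HOST;
}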
Thanks, Scott
On Sat, Sep 11, 2010 at 9:34 AM, Scott Duplichan scott@notabs.org wrote:
Interesting point about memcpy straddling a cache line boundary. It got me thinking about what the DediProg em100 trace function shows when booting from SPI flash. With SPI, the SB initially reads a dword at a time. If the processor is not caching code, a byte loop memcpy would trigger multiple dword reads from the flash chip for every byte copied. If BIOS sets SB option PrefetchEnSPIFromHost, then the SB will switch to cache line reads, and cache the last line read. Since a byte loop memcpy fits in a cache line, it seems conceivable that memcpy performance would be good unless the function straddles a cache line boundary. I am not sure what the situation is with LPC flash.
Anyway, I noticed coreboot is not setting the AMD SB bit PrefetchEnSPIFromHost. For big payloads, setting this bit could cut boot time by eliminating overhead when reading big chunks from SPI flash memory.
Oh, we should do that.
But that doesn't really explain why gcc doesn't do a rep stos or rep movs (which should hit the cache). That should be an easy optimization for gcc. It also doesn't address why coreboot has its own functions when we could use gcc intrinsics that should be optimized for the architecture they are built for.
Marc
Marc Jones wrote:
doesn't really explain why gcc doesn't do a rep stos or rep movs (which should hit the cache). That should be an easy optimization for gcc.
Except I don't think it's an optimization performance-wise. But if you enable -Os then I would expect it to use rep stosb.
It also doesn't address why coreboot has its own functions when we could use gcc intrinsics that should be optimized for the architecture they are built for.
Good point! I guess we rolled our own to be less dependent on gcc. I think it would be OK to use gcc's implementations though.
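A sketch of what using gcc's implementations could look like — the macro mapping below is just an illustration, not an actual coreboot patch. One caveat: gcc may still emit a call to the plain function when it decides not to expand a builtin inline, so the C fallbacks cannot simply be deleted:

/* Map the string functions onto gcc's builtins and let the compiler
 * expand them as it sees fit for the target. */
#define memset(d, c, n)         __builtin_memset((d), (c), (n))
#define memcpy(d, s, n)         __builtin_memcpy((d), (s), (n))
#define memmove(d, s, n)        __builtin_memmove((d), (s), (n))
#define memcmp(a, b, n)         __builtin_memcmp((a), (b), (n))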
//Peter
On Sat, Sep 11, 2010 at 1:59 PM, Peter Stuge peter@stuge.se wrote:
Marc Jones wrote:
doesn't really explain why gcc doesn't do a rep stos or rep movs (which should hit the cache). That should be an easy optimization for gcc.
Except I don't think it's an optimization performance-wise. But if you enable -Os then I would expect it to use rep stosb.
It is an optimization over a byte copy, which is what the code does. We don't have optimized mem functions.
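For reference, the generic byte loop being compared against looks roughly like this (a sketch, not necessarily coreboot's exact source):

#include <stddef.h>

/* Plain byte-at-a-time copy -- the baseline the rep and xmm variants
 * are being measured against. */
void *memcpy(void *dst, const void *src, size_t n)
{
        char *d = dst;
        const char *s = src;

        while (n--)
                *d++ = *s++;
        return dst;
}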
It also doesn't address why coreboot has its own functions when we could use gcc intrinsics that should be optimized for the architecture they are built for.
Good point! I guess we rolled our own to be less dependent on gcc. I think it would be OK to use gcc's implementations though.
Good point, but most compilers have intrinsics for these functions. Do we need a compiler intrinsics layer?
Marc
On 12.09.2010 22:16, Marc Jones wrote:
Good point! I guess we rolled our own to be less dependent on gcc. I think it would be OK to use gcc's implementations though.
Good point, but most compilers have intrinsics for these functions. Do we need a compiler intrinsics layer?
I guess the original statement is to be read as "less dependent on particular instances of gcc" (i.e. distro compilers, bugs in certain gcc versions).
coreboot is already way too GNU-specific in its use of binutils and gcc features for supporting other compilers to make sense. The gcc intrinsics should be relatively stable, so no "layer" is necessary, I think.
Regards, Patrick