-----Original Message-----
From: coreboot-bounces@coreboot.org [mailto:coreboot-bounces@coreboot.org] On Behalf Of Carl-Daniel Hailfinger
Sent: Thursday, September 09, 2010 06:10 PM
To: Marc Jones
Cc: Coreboot
Subject: Re: [coreboot] rfc - gcc builtins and memset memcpy memmove memcmp
]On 10.09.2010 00:45, Marc Jones wrote:
]> Now that we have addressed the CAR/MMIO ROM copy issue (thanks Scott
]> and Arne), I wanted to revisit the other change that Arne proposed for
]> memset.
]>
]> http://article.gmane.org/gmane.linux.bios/57707
]>
]> Some people didn't like the architecture specific code for memset. I
]> don't have a problem with it, but it did bring up a few questions.
]>
]> 1. Why isn't gcc generating the rep stosb itself? This should be a
]> standard optimization.
]>
]> 2. Why do we have functions for memset, memcpy, etc, when gcc has
]> builtins that could be used? Is this a remnant from romcc?
]>
]
]Please note that the AMD and Intel manuals explicitly mention that some
]instructions which would be used by a fast memcpy will bypass the cache,
]and as such are unsuitable to copy CAR contents to RAM.
Another area to watch for when using optimized memory functions is SMI handling code. The processor saves most registers before running SMI code, but not the XMM registers.
]Regards,
]Carl-Daniel
]
]--
]http://www.hailfinger.org/
Here are some benchmark results for an AMD family 10h desktop processor. This system uses DDR2 memory. The results show that for AMD processors rep movs and rep stos performance is not great, though it is better than a byte loop. Note that this test covers operations on large blocks. For small blocks, rep or a byte loop is probably best because of the lower setup overhead.
======= dram, memset, aligned data =======
rep stos                                  3,614 MB/s
optimized memset                          6,220 MB/s
compiler memset                           3,622 MB/s
byte loop memset                            871 MB/s

======= dram, memset, unaligned data =======
rep stos                                  3,101 MB/s
optimized memset                          6,229 MB/s
compiler memset                           3,621 MB/s
byte loop memset                            871 MB/s

======= dram, memcpy, aligned data =======
rep movs                                  1,754 MB/s
optimized memcpy                          3,979 MB/s
compiler memcpy                           1,865 MB/s
unaligned xmm method 1                    2,293 MB/s
byte loop memcpy                            855 MB/s
aligned NT xmm unrolled 4                 3,233 MB/s
aligned NT xmm                            2,924 MB/s
aligned NT xmm unrolled 4 reverse         3,235 MB/s
aligned NT xmm unrolled 4 prefetch        3,965 MB/s
aligned NT xmm unrolled 8                 3,217 MB/s
aligned NT xmm unrolled 8 reverse         2,641 MB/s
aligned NT xmm unrolled 8 prefetch        3,954 MB/s
aligned NT xmm unrolled 16                2,920 MB/s
aligned NT xmm unrolled 16 prefetch       3,861 MB/s

======= dram, memcpy, unaligned data =======
rep movs                                  1,509 MB/s
optimized memcpy                          3,997 MB/s
compiler memcpy                           1,628 MB/s
unaligned xmm method 1                    2,242 MB/s
byte loop memcpy                            854 MB/s
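
For anyone who wants to collect numbers of this kind on their own hardware, here is a minimal sketch of a throughput harness for memset variants. It is not the code that produced the tables above. The names mb_per_s, byte_loop_memset and measure_memset are made up for illustration, MB is assumed to mean 10^6 bytes, and clock() is assumed to be precise enough for large blocks. The offset parameter is one simple way to get an unaligned destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any function with the standard memset signature can be measured. */
typedef void *(*memset_fn)(void *dst, int c, size_t n);

/* Convert bytes and seconds to MB/s (assumption: 1 MB = 10^6 bytes).
 * Returns 0.0 when the timer was too coarse to register any time. */
static double mb_per_s(double bytes, double seconds)
{
    return seconds > 0.0 ? bytes / seconds / 1e6 : 0.0;
}

/* Reference byte loop. The volatile pointer keeps the compiler from
 * turning the loop back into a call to memset. */
static void *byte_loop_memset(void *dst, int c, size_t n)
{
    volatile unsigned char *p = dst;
    while (n--)
        *p++ = (unsigned char)c;
    return dst;
}

/* Fill a block_size-byte region 'repeats' times and report MB/s.
 * 'offset' shifts the start address; malloc returns aligned memory,
 * so a nonzero offset such as 1 gives an unaligned destination.
 * Returns -1.0 if the buffer cannot be allocated. */
static double measure_memset(memset_fn fn, size_t block_size,
                             size_t offset, int repeats)
{
    assert(block_size > 0);

    unsigned char *buf = malloc(block_size + offset);
    if (buf == NULL)
        return -1.0;

    /* Touch every page once so page faults are not part of the timing. */
    fn(buf, 0, block_size + offset);

    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        fn(buf + offset, i & 0xFF, block_size);
    clock_t end = clock();

    /* Read one byte back so the stores cannot be discarded. */
    volatile unsigned char sink = buf[offset + block_size - 1];
    (void)sink;

    free(buf);
    return mb_per_s((double)block_size * repeats,
                    (double)(end - start) / CLOCKS_PER_SEC);
}
```

A call such as measure_memset(memset, 64u << 20, 0, 16) would measure the C library memset on an aligned 64 MB block. The block should be much larger than the cache, otherwise the result reflects cache bandwidth rather than DRAM bandwidth.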
In this report (http://article.gmane.org/gmane.linux.bios/57707), Arne may have been encountering the ClLinesToNbDis issue (assuming the memset code was running from flash). In that case, switching to rep stos would greatly improve performance because, unlike a byte loop, a rep string instruction loops in microcode, which does not cause continuous flash memory accesses.
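
To make the difference concrete, here is a rough sketch of the two memset styles being compared. This is not Arne's patch and not coreboot's implementation. The function names are invented for illustration, and the inline assembly assumes GCC-style syntax on x86 with the direction flag clear (which the usual calling conventions guarantee on function entry). On other targets the sketch falls back to the byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte loop: the loop body is fetched again on every iteration, which
 * is expensive when the code executes uncached from flash. */
static void *memset_byte_loop(void *dst, int c, size_t n)
{
    volatile unsigned char *p = dst;   /* volatile: keep it a real loop */
    while (n--)
        *p++ = (unsigned char)c;
    return dst;
}

/* rep stosb: a single instruction is fetched once and the repeat
 * count is handled inside the processor. */
static void *memset_rep_stos(void *dst, int c, size_t n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    void *d = dst;
    __asm__ volatile("rep stosb"
                     : "+D"(d), "+c"(n)  /* destination and count are updated */
                     : "a"(c)            /* fill byte is taken from AL */
                     : "memory");
    return dst;
#else
    return memset_byte_loop(dst, c, n);  /* portable fallback */
#endif
}

/* Returns 1 when both variants write exactly the same bytes as the
 * C library memset for a fill of n bytes (n must be at most 256). */
static int memset_variants_agree(unsigned char value, size_t n)
{
    unsigned char a[256], b[256], ref[256];

    assert(n <= sizeof(ref));
    memset(a, 0, sizeof(a));
    memset(b, 0, sizeof(b));
    memset(ref, 0, sizeof(ref));

    memset_byte_loop(a, value, n);
    memset_rep_stos(b, value, n);
    memset(ref, value, n);

    return memcmp(a, ref, sizeof(ref)) == 0 &&
           memcmp(b, ref, sizeof(ref)) == 0;
}
```

Both variants produce the same result. The only difference that matters for the flash case is how many instruction fetches are needed per byte stored.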
Thanks,
Scott