[coreboot] rfc - gcc builtins and memset memcpy memmove memcmp

Fri Sep 10 03:30:02 CEST 2010

-----Original Message-----
From: coreboot-bounces at coreboot.org [mailto:coreboot-bounces at coreboot.org] On Behalf Of Carl-Daniel Hailfinger
Sent: Thursday, September 09, 2010 06:10 PM
To: Marc Jones
Cc: Coreboot
Subject: Re: [coreboot] rfc - gcc builtins and memset memcpy memmove memcmp

]On 10.09.2010 00:45, Marc Jones wrote:
]> Now that we have addressed the CAR/MMIO ROM copy issue (thanks Scott
]> and Arne), I wanted to revisit the other change that Arne proposed for
]> memset.
]>
]> http://article.gmane.org/gmane.linux.bios/57707
]>
]> Some people didn't like the architecture specific code for memset. I
]> don't have a problem with it, but it did bring up a few questions.
]>
]> 1. Why isn't gcc generating the rep stosb itself? This should be a
]> standard optimization.
]>
]> 2. Why do we have functions for memset, memcpy, etc, when gcc has
]> builtins that could be used? Is this a remnant from romcc?
]>   
]
]Please note that the AMD and Intel manuals explicitly mention that some
]instructions which would be used by a fast memcpy will bypass the cache,
]and as such are unsuitable to copy CAR contents to RAM.

Another area to watch when using optimized memory functions is
SMI handling code. The processor saves most registers on SMI
entry, but not the XMM registers, so an SMI handler that uses
SSE-based memory functions has to save and restore them itself.

]Regards,
]Carl-Daniel
]
]-- 
]http://www.hailfinger.org/

Here are some benchmark results for an AMD family 10h desktop
processor with DDR2 memory. They show that for this AMD
processor rep movs and rep stos reach only about half the
throughput of the optimized routines, though they are clearly
faster than a byte loop. Note that this test is for operations
on large blocks. For small blocks, rep or a byte loop is
probably best because there is less setup overhead.
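
To illustrate the small-block versus large-block trade-off, here is a
minimal sketch in C. It is not the code that was benchmarked. The names
memset_bytes and memset_sized and the 64-byte threshold are made up for
this example, and the real crossover point would have to be measured on
the target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed value for illustration only; tune by measurement. */
#define SMALL_BLOCK_THRESHOLD 64

/* Plain byte loop: almost no setup cost, low throughput on large blocks. */
void *memset_bytes(void *dst, int c, size_t n)
{
	unsigned char *p = dst;

	while (n--)
		*p++ = (unsigned char)c;
	return dst;
}

/* Small blocks take the byte loop; large blocks use 8-byte stores. */
void *memset_sized(void *dst, int c, size_t n)
{
	unsigned char *p = dst;
	uint64_t pattern;

	if (n < SMALL_BLOCK_THRESHOLD)
		return memset_bytes(dst, c, n);

	/* Head: byte stores until the pointer is 8-byte aligned. */
	while (n && ((uintptr_t)p & 7)) {
		*p++ = (unsigned char)c;
		n--;
	}

	/* Body: replicate the fill byte into all 8 bytes of a word. */
	pattern = 0x0101010101010101ULL * (unsigned char)c;
	while (n >= 8) {
		memcpy(p, &pattern, 8);	/* compiles to a single 8-byte store */
		p += 8;
		n -= 8;
	}

	/* Tail: remaining 0..7 bytes. */
	while (n--)
		*p++ = (unsigned char)c;
	return dst;
}
```

The 8-byte store loop is only a stand-in for a real optimized routine;
the point is the size check in front of it.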

======= dram, memset, aligned data =======
rep stos                                         3,614 MB/s
optimized memset                                 6,220 MB/s
compiler memset                                  3,622 MB/s
byte loop memset                                   871 MB/s

======= dram, memset, unaligned data =======
rep stos                                         3,101 MB/s
optimized memset                                 6,229 MB/s
compiler memset                                  3,621 MB/s
byte loop memset                                   871 MB/s

======= dram, memcpy, aligned data =======
rep movs                                         1,754 MB/s
optimized memcpy                                 3,979 MB/s
compiler memcpy                                  1,865 MB/s
unaligned xmm method 1                           2,293 MB/s
byte loop memcpy                                   855 MB/s
aligned NT xmm unrolled 4                        3,233 MB/s
aligned NT xmm                                   2,924 MB/s
aligned NT xmm unrolled 4 reverse                3,235 MB/s
aligned NT xmm unrolled 4  prefetch              3,965 MB/s
aligned NT xmm unrolled 8                        3,217 MB/s
aligned NT xmm unrolled 8 reverse                2,641 MB/s
aligned NT xmm unrolled 8  prefetch              3,954 MB/s
aligned NT xmm unrolled 16                       2,920 MB/s
aligned NT xmm unrolled 16 prefetch              3,861 MB/s

======= dram, memcpy, unaligned data =======
rep movs                                         1,509 MB/s
optimized memcpy                                 3,997 MB/s
compiler memcpy                                  1,628 MB/s
unaligned xmm method 1                           2,242 MB/s
byte loop memcpy                                   854 MB/s
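
For readers unfamiliar with the "aligned NT xmm unrolled 4 prefetch"
row, the following is a guess at the general shape of such a loop using
SSE2 intrinsics. It is not the benchmarked code. The function name
memcpy_nt_unrolled4, the 256-byte prefetch distance, and the alignment
and size restrictions are assumptions made for the sketch. The NT
(non-temporal) stores are the kind of cache-bypassing instruction
Carl-Daniel warns about above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical sketch. Requirements (assumed, not checked):
 * dst and src are 16-byte aligned and n is a multiple of 64.
 */
void *memcpy_nt_unrolled4(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
	__m128i *d = (__m128i *)dst;
	const __m128i *s = (const __m128i *)src;
	size_t i;

	for (i = 0; i < n / 64; i++) {
		/* Prefetch hint 256 bytes ahead (assumed distance). */
		_mm_prefetch((const char *)((uintptr_t)s + 256), _MM_HINT_NTA);

		/* Four aligned 16-byte loads ... */
		__m128i x0 = _mm_load_si128(s + 0);
		__m128i x1 = _mm_load_si128(s + 1);
		__m128i x2 = _mm_load_si128(s + 2);
		__m128i x3 = _mm_load_si128(s + 3);

		/* ... and four non-temporal stores that bypass the cache. */
		_mm_stream_si128(d + 0, x0);
		_mm_stream_si128(d + 1, x1);
		_mm_stream_si128(d + 2, x2);
		_mm_stream_si128(d + 3, x3);

		s += 4;
		d += 4;
	}
	/* Make the NT stores globally visible before returning. */
	_mm_sfence();
#else
	/* Fallback for builds without SSE2. */
	memcpy(dst, src, n);
#endif
	return dst;
}
```

Because the stores bypass the cache, a routine like this would not be
suitable for copying CAR contents to RAM, and it uses XMM registers, so
the SMI caveat above applies as well.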

In the report at
http://article.gmane.org/gmane.linux.bios/57707
Arne may have been encountering the ClLinesToNbDis issue
(assuming the memset code was running from flash). In that
case, switching to rep stos would greatly improve performance
because, unlike a byte loop, rep stos loops in microcode, which
does not cause continuous instruction fetches from flash.
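
For memset the relevant string instruction is rep stos. A minimal
sketch of such a memset with GCC inline assembly could look like the
following. The name memset_rep_stos is made up here; this is not Arne's
patch or coreboot's actual memset, and the byte-loop branch exists only
so the example builds on non-x86 targets.

```c
#include <stddef.h>

/* Hypothetical example; assumes GCC-style inline asm and DF = 0. */
void *memset_rep_stos(void *dst, int c, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
	void *d = dst;

	/*
	 * rep stosb stores AL to [(E/R)DI] and repeats (E/R)CX times.
	 * The whole loop is a single instruction, so it is fetched once
	 * instead of once per byte.
	 */
	__asm__ volatile("rep stosb"
			 : "+D"(d), "+c"(n)
			 : "a"(c)
			 : "memory");
#else
	unsigned char *p = dst;

	while (n--)
		*p++ = (unsigned char)c;
#endif
	return dst;
}
```

It is called like the standard function, for example
memset_rep_stos(buf, 0, sizeof(buf)).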

Thanks,
Scott