Myles Watson <mylesgw@gmail.com> writes:
I was having trouble with stack corruption. Using memset (C) instead of clear_memory (asm) speeds it up by almost a factor of 2 for a 1M region.
TSC difference with clear_memory 0xFA884D
TSC difference with memset 0x826742
That's odd. I just recently sent a patch to the list ("ulzma delay") that did pretty much the opposite, as I was seeing really bad performance for the C memset function on my Opteron (Istanbul) boxes. memset would take minutes to do what ran in a handful of ms using "rep stosb", by all accounts because of instruction cache thrashing.
I see clear_memory was using "stosl", but apart from that it looks very similar to the variant I ended up with to improve performance.
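To make the comparison concrete, here is a rough sketch of the two flavours, assuming GCC-style inline asm on x86. The function names fill_stosb and zero_stosl are made up for illustration; this is neither the actual clear_memory code nor the exact patch I posted, and the plain C fallback is only there so the sketch builds on other targets.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

/* Byte-wise fill: "rep stosb" stores AL at [(E/R)DI], (E/R)CX times. */
static void *fill_stosb(void *dst, int c, size_t n)
{
#if SKETCH_X86
	void *d = dst;
	/* cld makes sure the string op walks upwards. */
	__asm__ volatile("cld; rep stosb"
			 : "+D"(d), "+c"(n)
			 : "a"(c)
			 : "memory", "cc");
#else
	unsigned char *p = dst;
	while (n--)
		*p++ = (unsigned char)c;
#endif
	return dst;
}

/* Dword-wise zeroing: "rep stosl" stores EAX four bytes at a time,
 * so the count is in dwords and any 1-3 byte tail is handled separately. */
static void zero_stosl(void *dst, size_t n)
{
#if SKETCH_X86
	size_t dwords = n / 4;
	size_t tail = n % 4;
	__asm__ volatile("cld; rep stosl"
			 : "+D"(dst), "+c"(dwords)
			 : "a"(0)
			 : "memory", "cc");
	/* dst now points just past the last dword written. */
	__asm__ volatile("cld; rep stosb"
			 : "+D"(dst), "+c"(tail)
			 : "a"(0)
			 : "memory", "cc");
#else
	unsigned char *p = dst;
	while (n--)
		*p++ = 0;
#endif
}
```

Either way the loop body is a single instruction, so the instruction footprint stays tiny compared with a compiled C loop.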
Could you see if you experience stack corruption with the "rep stosb" patch I posted for memset as well? I'd like to see that go in, but of course it's a problem if it results in a performance degradation on other platforms. Perhaps we could enable it only for the platforms where instruction footprint or instruction fetching is known to be an issue, i.e. fam10?
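Something along these lines is what I have in mind for the per-platform switch. It is only a sketch: USE_REP_STOSB_MEMSET is a hypothetical macro standing in for whatever build option we would actually add, and memset_sketch is named that way to avoid clashing with the real memset.

```c
#include <stddef.h>

/* Hypothetical switch; a real tree would set this from its build
 * configuration for the affected platforms (e.g. fam10) only. */
#ifndef USE_REP_STOSB_MEMSET
#define USE_REP_STOSB_MEMSET 0
#endif

void *memset_sketch(void *s, int c, size_t n)
{
#if USE_REP_STOSB_MEMSET && defined(__GNUC__) && \
	(defined(__i386__) || defined(__x86_64__))
	/* Small-footprint path: one string instruction does the fill. */
	void *d = s;
	__asm__ volatile("cld; rep stosb"
			 : "+D"(d), "+c"(n)
			 : "a"(c)
			 : "memory", "cc");
#else
	/* Default path: plain C loop, as on platforms where it is faster. */
	unsigned char *p = s;
	while (n--)
		*p++ = (unsigned char)c;
#endif
	return s;
}
```

With the switch off by default, platforms where the C version performs better would see no change.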