Myles Watson <mylesgw@gmail.com> writes:
I was having trouble with stack corruption. Using memset (C) instead of clear_memory (asm) speeds it up by almost a factor of 2 for a 1M region.
TSC difference with clear_memory 0xFA884D
TSC difference with memset 0x826742
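As a quick sanity check on the "almost a factor of 2" figure, the two TSC differences can be divided directly. The following is only an arithmetic sketch; the helper name tsc_speedup is made up for illustration and is not part of coreboot.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not coreboot code): ratio of two TSC deltas.
 * A result above 1.0 means the second measurement was the faster one. */
static double tsc_speedup(uint64_t slow_ticks, uint64_t fast_ticks)
{
	assert(fast_ticks != 0);	/* avoid dividing by zero */
	return (double)slow_ticks / (double)fast_ticks;
}
```

With the numbers quoted above, tsc_speedup(0xFA884D, 0x826742) is 16418893 / 8546114, or roughly 1.92.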
That's odd. I recently sent a patch to the list ("ulzma delay") that did pretty much the opposite, because I was seeing really bad performance from the C memset function on my Opteron (Istanbul) boxes. memset would take minutes to do what took a handful of milliseconds using "rep stosb", by all accounts because of instruction cache thrashing.
I see clear_memory was using "stosl", but apart from that it looks very similar to the variant I ended up with to improve performance.
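For readers following along, a memset built on "rep stosb" might look roughly like the sketch below. This is not the patch that was posted, and the name memset_rep_stosb is hypothetical. It assumes GCC-style inline asm on x86 and falls back to a plain byte loop elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; an illustrative sketch, not the posted patch. */
static void *memset_rep_stosb(void *dst, int c, size_t len)
{
	void *ret = dst;

	assert(dst != NULL || len == 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/* rep stosb stores AL at [EDI/RDI] and repeats ECX/RCX times.
	 * The direction flag is assumed clear, as the ABI requires. */
	__asm__ volatile("rep stosb"
			 : "+D"(dst), "+c"(len)
			 : "a"(c)
			 : "memory");
#else
	/* Portable fallback: byte-by-byte store. */
	unsigned char *p = dst;
	while (len--)
		*p++ = (unsigned char)c;
#endif
	return ret;
}
```

The whole fill is a single instruction with a repeat prefix, so the instruction footprint stays tiny, which is the property that matters if instruction fetches are slow. A "stosl" variant would store four bytes per iteration instead of one, but needs the length handled in dword units.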
Once caching works correctly for fam10, maybe you'll see similar performance numbers?
Could you see if you experience stack corruption with the "rep stosb" patch I posted for memset as well?
It's hard to tell if you experience stack corruption... unless it bites you. There are a lot of places on the stack where it won't matter if it gets corrupted. I don't have a good way to test that.
I'd like to see that go in, but of course it's a problem if it results in a performance degradation on other platforms. Perhaps we could enable it only for the platforms where instruction footprint/fetches are known to be an issue, i.e. fam10?
The best thing would be to fix caching on fam10. Of course, if that's not feasible for some reason, then adding asm just for that architecture could be the way to go. In general, the more we can keep it in straight C, the better for me.
Thanks,
Myles