I was having trouble with stack corruption. Using memset (C) instead of clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.
TSC difference with clear_memory 0xFA884D
TSC difference with memset 0x826742
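As a quick check of the "almost a factor of 2" figure, the two TSC deltas can be converted to decimal and divided. This is only a sketch of the arithmetic; the macro and function names are made up for illustration and are not part of the patch.

```c
#include <stdint.h>

/* TSC deltas as quoted in the message above. */
#define TSC_CLEAR_MEMORY 0xFA884DULL /* 16,418,893 ticks */
#define TSC_MEMSET       0x826742ULL /*  8,546,114 ticks */

/*
 * Ratio slow/fast in hundredths, using integer math only
 * (handy where floating point is not available).
 */
static uint64_t speedup_x100(uint64_t slow, uint64_t fast)
{
	return (slow * 100) / fast;
}
```

Calling speedup_x100(TSC_CLEAR_MEMORY, TSC_MEMSET) gives 192, i.e. roughly 1.92x, which matches the description.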
Anyway, this patch removes a couple of files that don't need to exist anymore, given that only K8 was using clear_memory.
Signed-off-by: Myles Watson <mylesgw@gmail.com>
Thanks, Myles
On 11.03.2010 22:14, Myles Watson wrote:
I was having trouble with stack corruption. Using memset (C) instead of clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.
TSC difference with clear_memory 0xFA884D
TSC difference with memset 0x826742
Interesting that the C version is faster in this case.
Anyway, this patch removes a couple of files that don't need to exist anymore, given that only K8 was using clear_memory.
Yay :-)
Signed-off-by: Myles Watson <mylesgw@gmail.com>
Acked-by: Patrick Georgi <patrick.georgi@coresystems.de>
On 11.03.2010 22:14, Myles Watson wrote:
I was having trouble with stack corruption. Using memset (C) instead of clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.
I should have said it got rid of the memory corruption I was seeing, too :)
TSC difference with clear_memory 0xFA884D
TSC difference with memset 0x826742
Interesting that the C version is faster in this case.
Yeah. Maybe you could do a lot better with a different asm implementation. You'd hope that the compiler would have a pretty good routine for clearing memory, though. I didn't dig into it to see what the real difference was.
Signed-off-by: Myles Watson <mylesgw@gmail.com>
Acked-by: Patrick Georgi <patrick.georgi@coresystems.de>
Rev 5201.
Thanks, Myles
Myles Watson <mylesgw@gmail.com> writes:
I was having trouble with stack corruption. Using memset (C) instead of clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.
TSC difference with clear_memory 0xFA884D
TSC difference with memset 0x826742
That's odd. I just recently sent a patch to the list ("ulzma delay") that did pretty much the opposite, as I was seeing really bad performance for the C memset function on my Opteron (Istanbul) boxes. memset would take minutes to do what ran in a handful of ms using "rep stosb", by all accounts because of instruction cache thrashing.
I see clear_memory was using "stosl", but apart from that it looks very similar to the variant I ended up with to improve performance.
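For readers comparing the two string-store variants mentioned here, the following is a minimal sketch, not the code from either patch. The function names are hypothetical, GCC-style inline asm is assumed, the direction flag is assumed clear as the usual ABIs require, and a plain C fallback is used on non-x86 targets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-granular fill: "rep stosb" stores AL to [EDI/RDI], ECX/RCX times. */
static void *fill_bytes(void *dst, int c, size_t n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	void *d = dst;
	__asm__ volatile("rep stosb"
			 : "+D"(d), "+c"(n)
			 : "a"(c)
			 : "memory");
#else
	unsigned char *p = dst;
	while (n--)
		*p++ = (unsigned char)c;
#endif
	return dst;
}

/* Dword-granular fill: "rep stosl" stores EAX, so the count is in 32-bit units. */
static void *fill_dwords(void *dst, uint32_t pattern, size_t ndwords)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	void *d = dst;
	__asm__ volatile("rep stosl"
			 : "+D"(d), "+c"(ndwords)
			 : "a"(pattern)
			 : "memory");
#else
	unsigned char *p = dst;
	while (ndwords--) {
		memcpy(p, &pattern, sizeof(pattern));
		p += sizeof(pattern);
	}
#endif
	return dst;
}
```

The only functional difference in this sketch is the store width: the dword variant needs a length that is a multiple of four bytes, while the byte variant accepts any length.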
Could you see if you experience stack corruption with the "rep stosb" patch I posted for memset as well? I'd like to see that go in, but of course it's a problem if it results in a performance degradation on other platforms. Perhaps we could enable it only for the platforms where instruction footprint/fetches is known to be an issue, i.e. fam10?
Myles Watson <mylesgw@gmail.com> writes:
I was having trouble with stack corruption. Using memset (C) instead of clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.
TSC difference with clear_memory 0xFA884D
TSC difference with memset 0x826742
That's odd. I just recently sent a patch to the list ("ulzma delay") that did pretty much the opposite, as I was seeing really bad performance for the C memset function on my Opteron (Istanbul) boxes. memset would take minutes to do what ran in a handful of ms using "rep stosb", by all accounts because of instruction cache thrashing.
I see clear_memory was using "stosl", but apart from that it looks very similar to the variant I ended up with to improve performance.
Once caching works correctly for fam10, maybe you'll see similar performance numbers?
Could you see if you experience stack corruption with the "rep stosb" patch I posted for memset as well?
It's hard to tell if you experience stack corruption... unless it bites you. There are a lot of places on the stack where it won't matter if it gets corrupted. I don't have a good way to test that.
I'd like to see that go in, but of course it's a problem if it results in a performance degradation on other platforms. Perhaps we could enable it only for the platforms where instruction footprint/fetches is known to be an issue, i.e. fam10?
The best thing would be to fix caching on fam10. Of course if that's not feasible for some reason, then adding asm just for that architecture could be the way to go. In general, the more we can keep it straight C, the better for me.
Thanks, Myles
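To illustrate the "straight C by default, asm only where needed" option discussed above, here is a rough sketch of a compile-time switch. The macro HYPOTHETICAL_FAM10_ASM_MEMSET and the function name sketch_memset are invented for this example; a real tree would presumably drive the choice from its own build configuration.

```c
#include <stddef.h>

/* Hypothetical switch, off by default so the plain C path is used. */
#ifndef HYPOTHETICAL_FAM10_ASM_MEMSET
#define HYPOTHETICAL_FAM10_ASM_MEMSET 0
#endif

/* Named sketch_memset so it does not clash with the C library's memset. */
static void *sketch_memset(void *dst, int c, size_t n)
{
#if HYPOTHETICAL_FAM10_ASM_MEMSET && defined(__GNUC__) && \
	(defined(__i386__) || defined(__x86_64__))
	/* Opt-in path: small instruction footprint via "rep stosb". */
	void *d = dst;
	__asm__ volatile("rep stosb"
			 : "+D"(d), "+c"(n)
			 : "a"(c)
			 : "memory");
#else
	/* Default path: plain C, left to the compiler to optimize. */
	unsigned char *p = dst;
	while (n--)
		*p++ = (unsigned char)c;
#endif
	return dst;
}
```

Both paths have the same observable behavior, so the switch would only affect code size and speed, not callers.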