Hi,
In the same way as unrv2b used to, ulzma exhibits very poor instruction-fetch behavior on my Opteron CPUs on a Tyan S2912 board. I'm not entirely sure what causes this, and despite spending significant time digging into it, I have not been able to fix it by adjusting the MTRRs or other cache settings. So I've taken the easy route and patched ulzma to copy LzmaDecode to the stack before executing it. This brings the running time for decompressing the fallback stage down from nearly two minutes to 50 ms here.
I'm also seeing odd performance behavior from memset. It is not consistent, and it only started appearing at some point during my development work here. I assume it has something to do with alignment, but frankly I was not in the mood to start debugging that as well. I've included a patch that reduces memset to a single "rep stosb" on x86, which eliminates the worst-case behavior I was seeing (on the order of minutes spent in cbfs_load_stage).
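For illustration, a memset reduced to "rep stosb" can be sketched in C with GCC-style extended inline assembly. This is a minimal sketch, not the patch itself: the name sketch_memset is hypothetical (chosen to avoid clashing with the C library's memset), and it assumes a GCC- or Clang-compatible compiler and that the direction flag is clear on entry, as the x86 calling conventions require. On non-x86 targets it falls back to a plain byte loop.

```c
#include <stddef.h>

/* Hypothetical name; a sketch of a memset that is a single "rep stosb"
 * on x86. Assumes DF (direction flag) is clear, so stores go upward. */
void *sketch_memset(void *s, int c, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
	void *dst = s;
	/* rep stosb stores AL at [E/RDI], E/RCX times, advancing E/RDI.
	 * "+D" and "+c" tell the compiler both registers are modified;
	 * "memory" tells it the buffer contents changed behind its back. */
	__asm__ __volatile__("rep stosb"
			     : "+D"(dst), "+c"(n)
			     : "a"(c)
			     : "memory");
#else
	/* Portable fallback for non-x86 builds: plain byte loop. */
	unsigned char *p = s;
	while (n--)
		*p++ = (unsigned char)c;
#endif
	return s;
}
```

Handing the whole store sequence to the CPU sidesteps any alignment-dependent path in a hand-written word loop, which fits the suspicion above that alignment is involved. It trades possible peak throughput for predictable behavior.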
Signed-off-by: Arne Georg Gleditsch <arne.gleditsch@numascale.com>