Attention is currently required from: Paul Menzel.
Julius Werner has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/80254?usp=email )
Change subject: commonlib: Add assembly optimization for ipchksum() on arm64 ......................................................................
Patch Set 3:
(2 comments)
Commit Message:
https://review.coreboot.org/c/coreboot/+/80254/comment/541599d8_0ad72bee : PS2, Line 10: to take advantage of larger load : sizes and the add-with-carry instruction
Excuse my ignorance, but why can’t the compilers do that themselves?
It's very hard to write C code for this in the first place because the C language has no means to describe the carry that gets shifted out of the end of a data type. The best I can come up with is to check it afterwards like this ``` for (size_t i = 0; i < size; i++) { uint64_t new_sum = result + *data++; if (new_sum < result) new_sum++; result = new_sum; } ``` and interestingly the compiler does make something clever out of that on both x86 ``` 10: 48 03 04 d7 add (%rdi,%rdx,8),%rax 14: 48 83 d0 00 adc $0x0,%rax 18: 48 83 c2 01 add $0x1,%rdx 1c: 48 39 d6 cmp %rdx,%rsi 1f: 75 ef jne 10 <sum+0x10> (File Offset: 0x50) ``` and Arm ``` 8: f8637804 ldr x4, [x0, x3, lsl #3] c: 91000463 add x3, x3, #0x1 10: ab040042 adds x2, x2, x4 14: 9a823442 cinc x2, x2, cs // cs = hs, nlast 18: eb03003f cmp x1, x3 1c: 54ffff61 b.ne 8 <sum+0x8> (File Offset: 0x48) // b.any ``` but it's not really as clever as my code. In general I found that for very tight loops when every instruction can make a significant performance difference, GCC still doesn't really get as good as handcrafted assembly. (Also, when writing code that so explicitly tries to bait a specific instruction, one could argue that just writing the assembly directly is actually more readable.)
https://review.coreboot.org/c/coreboot/+/80254/comment/f6d7ec59_1c24940a : PS2, Line 12: by more than 20x
Thank you for documenting this. […]
They're not really worth mentioning (a couple of microseconds) because the coreboot tables that get checksummed by this are pretty small (at least on the platform I tested). I did this mostly for fun tbh, not for serious gain. I think some other platforms are also using that checksum for other things so it might help them a bit more (we used to use it for MRC cache where the impact was actually significant, but now that was replaced with xxhash already).