Nico Huber has posted comments on this change by Rishika Raj. ( https://review.coreboot.org/c/coreboot/+/83540?usp=email )
Change subject: soc/intel/mtl: Increase CAR_STACK_SIZE by 31KB for coreboot compatibility
......................................................................
Patch Set 10:
(1 comment)
Commit Message:
https://review.coreboot.org/c/coreboot/+/83540/comment/dfd76e34_815bdc4e?usp... :
PS5, Line 7: soc/intel/meteorlake: Increase CAR STACK_SIZE by 31KB to meet coreboot requirements

Sorry for the delay.
I'm not sure how much static analysis can do there: coreboot has a bunch of code paths that depend on runtime values (and a few recursive functions), so I doubt it's really possible to reliably determine stack use at build time. If you know of a solution that can do that, it would of course be great if you could integrate it into the CI.
Right, I only had the romstage case in mind; hopefully that's free of recursion. I haven't found "the" tool yet, I only did a quick search.
We do use `-Wstack-usage=1536` on non-x86 boards, and maybe we should do something like that on x86 as well, but I believe it only checks each function on its own.
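To illustrate why a per-function limit isn't sufficient on its own, here is a small sketch. The function names are coreboot-flavored but the byte counts are invented for illustration, not measurements:

```python
# Sketch: why a per-function stack-usage check (like GCC's -Wstack-usage=1536)
# can miss overflows caused by deep call chains. All numbers are made up.

PER_FUNCTION_LIMIT = 1536  # bytes, as with -Wstack-usage=1536

# Hypothetical per-function stack usage (bytes) along one call chain.
call_chain = {
    "romstage_main": 1200,
    "mainboard_romstage_entry": 900,
    "fsp_memory_init": 1400,
    "do_fsp_post_memory_init": 1100,
}

# Each function individually passes the per-function check ...
assert all(usage <= PER_FUNCTION_LIMIT for usage in call_chain.values())

# ... but the whole chain can still overflow a small stack.
total = sum(call_chain.values())
print(f"worst-case chain usage: {total} bytes")  # 4600 bytes here
```

So the warning catches individual stack hogs, but bounding the whole stack still requires reasoning over the call graph.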
GCC seems to provide enough information with `-fcallgraph-info=su`, in a somewhat parseable `.ci` file. Probably doable with a simple awk script, but that could take half a day, and it seems worth looking for existing tools first. (I say "enough" information because it would be an overestimation: it reports the maximum stack usage per function, not per path, i.e. if separate paths through a function have different stack usages, you would get the sum of the local stack usage as if everything was allocated at the top, AIUI.)
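As a rough sketch of what such a script could look like: the `.ci` files are in VCG graph format, with node labels carrying the per-function stack usage and edges giving the calls. The sample below is hand-written to mimic that format; verify the regexes against what your GCC version actually emits before relying on this:

```python
# Sketch: parse -fcallgraph-info=su output (.ci, VCG format) and walk the
# call graph to estimate the worst-case stack depth per call path.
import re
from collections import defaultdict

# Hand-written sample mimicking the .ci format; not real GCC output.
SAMPLE_CI = """\
graph: { title: "romstage.c"
node: { title: "romstage_main" label: "romstage_main\\nromstage.c:10:6\\n1200 bytes (static)" }
node: { title: "fsp_memory_init" label: "fsp_memory_init\\nromstage.c:40:6\\n1400 bytes (static)" }
node: { title: "helper" label: "helper\\nromstage.c:70:6\\n300 bytes (static)" }
edge: { sourcename: "romstage_main" targetname: "fsp_memory_init" }
edge: { sourcename: "fsp_memory_init" targetname: "helper" }
}
"""

NODE_RE = re.compile(r'node: \{ title: "([^"]+)" label: "[^"]*?(\d+) bytes')
EDGE_RE = re.compile(r'edge: \{ sourcename: "([^"]+)" targetname: "([^"]+)"')

def max_stack(ci_text, root):
    """Deepest cumulative stack usage reachable from `root`."""
    usage = {name: int(n) for name, n in NODE_RE.findall(ci_text)}
    calls = defaultdict(list)
    for src, dst in EDGE_RE.findall(ci_text):
        calls[src].append(dst)

    def depth(fn, seen):
        if fn in seen:  # recursion: no static bound possible
            return float("inf")
        deepest = max((depth(c, seen | {fn}) for c in calls[fn]), default=0)
        return usage.get(fn, 0) + deepest

    return depth(root, frozenset())

print(max_stack(SAMPLE_CI, "romstage_main"))  # 2900 for the sample above
```

Note this inherits the overestimation mentioned above: summing per-function maxima along a path assumes every function hits its worst case simultaneously, and indirect calls (which `.ci` files flag separately) are not handled here.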
There is code in `romstage_main()` to place and check stack canaries, and I assume it should fire on an overflow like the one we suspect, but it just writes a BIOS_DEBUG message to the console and nothing else. My suspicion is that the FSP only takes the path that really makes it use its whole 512K under rare and unreproducible circumstances (e.g. related to memory training), and we've never managed to capture the failing boot on a UART in our case. All we are seeing is corrupted vboot data structures persisted in the TPM that look as if they were created by coreboot writing back vboot state from memory that had been overwritten with a bunch of 0xff.
For FSP, I think something like one canary per 4KiB could make sense, so we could quickly get an idea of how much it usually uses. Of course that will never tell us the worst case, but I think it's interesting enough to know. I'll probably look into that.
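The idea can be sketched as follows, simulated in Python rather than the C it would actually be: paint a known pattern across the stack region before handing off, then afterwards find the deepest clobbered canary. The constants are illustrative, not coreboot's:

```python
# Sketch: estimate peak stack usage with one canary per 4 KiB.
# Paint canaries before the run, then see how deep they were clobbered.
STACK_SIZE = 512 * 1024   # pretend stack region in CAR
CANARY_STEP = 4 * 1024
CANARY = 0xDEADBEEF

# Model one canary word per 4 KiB slot; the stack grows downward, so
# slot 0 sits at the deepest (lowest) address.
slots = [CANARY] * (STACK_SIZE // CANARY_STEP)  # 128 canaries

def simulate_usage(used_bytes):
    """Pretend code overwrote the top `used_bytes` of the stack."""
    first_clobbered = (STACK_SIZE - used_bytes) // CANARY_STEP
    for i in range(first_clobbered, len(slots)):
        slots[i] = 0

simulate_usage(200 * 1024)

# Peak-usage estimate: distance from the top of the stack down to the
# deepest clobbered canary, with 4 KiB granularity.
intact = sum(1 for v in slots if v == CANARY)
estimate = STACK_SIZE - intact * CANARY_STEP
print(f"estimated peak stack usage: {estimate // 1024} KiB")
```

In the real thing the canaries would be written right before calling into the FSP and scanned right after it returns, and the estimate only reflects that particular boot, not the worst case.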
I am not against moving the UPD and other big things off the stack either, if someone wants to do that. (I've also been thinking about other ways to harden against stack overflows, e.g. moving the stack to the end of the CAR space and letting it grow downwards into unallocated space, but I likely won't have time to finish that work any time soon, unfortunately.) For now, we're just trying to do the simplest thing that should reasonably ensure we won't have a problem on the Meteorlake platform.