I'd like to propose that we set up a guideline for when coreboot should use die().
My belief, and proposal for the guideline is that coreboot should *ONLY* die in 2 situations:
1) The system cannot continue booting. Examples: * coreboot needs to jump somewhere to continue, but doesn’t have a location to jump to. * The system needs to reset to continue, but reset failed or isn’t implemented. * Memory init failed. (SPD can't be read, Failed to initialize memory controller, etc.)
2) There’s a known security issue if booting continues. Even this shouldn’t be a hard die(), because there should still be a manual process to boot the system so that the firmware can be updated. (we can work that out later) Examples: * HASH table verification failed! * Can't verify flash protections!
It looks like most die() calls are valid, but there are currently cases where coreboot dies for a “non-fatal” reason. The case I’m currently looking at is when running an FSP, and the FSP ID is not the expected ID. https://review.coreboot.org/c/coreboot/+/69412
My opinion is that coreboot should put out a warning, but jump to the FSP anyway. If it fails, it’s the FSP’s responsibility to return a failure code to coreboot to die. (More likely, it would die internally without returning, which is no worse than coreboot dying.
coreboot has no idea what the FSP is going to do and isn’t responsible for it. This is a case of “there looks like there’s a problem, proceed with caution”, but is being treated as a situation where coreboot cannot continue.
Other examples where maybe we shouldn’t be dying: * die("EHCI controller (00:1a.7) not listed in devicetree.\n"); We don’t need the EHCI controller to boot, at least not at this point.
* die("RAM init: Decoded SPD DRAM freq is slower than the controller minimum!"); So try with the controller minimum and die if that fails.
* die("Onboard SPDs must match each other"); Pick an SPD and try initializing memory with it. Die if that fails.
* die("IGD is disabled!"); So try to boot without graphics…
* die("Invalid GPIO pad number\n"); So put out a warning and skip the GPIO. An assert would be reasonable - a hard die() isn’t, in my opinion.
* die("BUG: Increase ehci_dbg_info reserve in CAR"); Use an assert, bail out of ehci_dbg.
coreboot doesn't even make ASSERT fatal by default because we don't have a good way to communicate errors to the user. This proposal aligns with that.
I've added this topic to the leadership meeting minutes for tomorrow. Let me know what you think.
Thanks! Martin