Hi

A lot of these examples date back from when it was hard to distinguish error messages from info in logs.
To die on improper configuration is the next best thing to a buildtime static assert to make sure configuration is correct.
Fixing a failed boot on hitting a die() is easier than figuring out why something is not working as intended.
For instance you indeed might succeed in booting with a certain dram freq but it fails later due to instabilities...
That is way harder to figure out than not being able to boot due to malformed SPD.

About not continuing to boot for security reasons: that really depends on the use case.
Not continuing a boot on an invalid hash is really a valid use case.

Now that errors are easier to distinguish from info in a log, a different strategy could be approached.
I personally think halting the boot is still not a bad tool to inform developers about some improper configuration that one
should fix sooner rather than later in order to avoid problems.

Arthur

On Tue, Dec 13, 2022 at 7:34 PM Martin Roth via coreboot <coreboot@coreboot.org> wrote:
I'd like to propose that we set up a guideline for when coreboot should use die().

My belief, and proposal for the guideline is that coreboot should *ONLY* die in 2 situations:

1) The system cannot continue booting.  Examples:
 * coreboot needs to jump somewhere to continue, but doesn’t have a location to jump to.
 * The system needs to reset to continue, but reset failed or isn’t implemented.
 * Memory init failed. (SPD can't be read, Failed to initialize memory controller, etc.)

2) There’s a known security issue if booting continues.  Even this shouldn’t be a hard die(), because there should still be a manual process to boot the system so that the firmware can be updated.  (we can work that out later) Examples:
 * HASH table verification failed!
 * Can't verify flash protections!


It looks like most die() calls are valid, but there are currently cases where coreboot dies for a “non-fatal” reason.  The case I’m currently looking at is when running an FSP, and the FSP ID is not the expected ID. https://review.coreboot.org/c/coreboot/+/69412

 My opinion is that coreboot should put out a warning, but jump to the FSP anyway.  If it fails, it’s the FSP’s responsibility to return a failure code to coreboot to die. (More likely, it would die internally without returning, which is no worse than coreboot dying.

coreboot has no idea what the FSP is going to do and isn’t responsible for it.  This is a case of “there looks like there’s a problem, proceed with caution”, but is being treated as a situation where coreboot cannot continue.


Other examples where maybe we shouldn’t be dying:
 * die("EHCI controller (00:1a.7) not listed in devicetree.\n");
     We don’t need the EHCI controller to boot, at least not at this point.

 * die("RAM init: Decoded SPD DRAM freq is slower than the controller minimum!");
     So try with the controller minimum and die if that fails.

 * die("Onboard SPDs must match each other");
     Pick an SPD and try initializing memory with it.  Die if that fails.

 * die("IGD is disabled!");
     So try to boot without graphics…

 * die("Invalid GPIO pad number\n");
     So put out a warning and skip the GPIO.  An assert would be reasonable - a hard die() isn’t, in my opinion.

 * die("BUG: Increase ehci_dbg_info reserve in CAR");
     Use an assert, bail out of ehci_dbg.


coreboot doesn't even make ASSERT fatal by default because we don't have a good way to communicate errors to the user.  This proposal aligns with that.


I've added this topic to the leadership meeting minutes for tomorrow.  Let me know what you think.

Thanks!
Martin
_______________________________________________
coreboot mailing list -- coreboot@coreboot.org
To unsubscribe send an email to coreboot-leave@coreboot.org