From the leadership meeting:
There was general agreement about creating a guideline to only die when coreboot cannot continue. This is strictly a guideline, and is up to the developers to do what makes sense.
This means that if it's possible to continue booting, we should do our best. When possible, print a warning and try to continue.
Notes:
[Angel] We agree with the reasoning, especially because what CB:55690 fixed was quite embarrassing: it broke booting on the T440p until CB;55687 fixed it by adding one line... However, we’d like to ask everyone to be careful when considering removing die()’s in RAM init. Memory init is arcane magic, and it can fail in the most subtle of ways (see CB:48550 for a rather embarrassing example, and CB:47680 for the fix to a long-standing Sandy/Ivy NRI bug that could cause instability). The two errors shown here are about the installed memory modules: if a computer worked before, these errors can only occur if the installed memory modules are changed (or if the SPD data got corrupted somehow, which is even worse). In our honest opinion, it’s best for the user to immediately let them know that their newly-installed memory isn’t going to work properly rather than to risk memory-related instability for the sake of booting. We’d prefer to keep the “controller minimum freq” die() because it stops the RAM from being overclocked (we could add a Kconfig option to continue anyway, though), but we agree with reworking the “SPD mismatch” code as described, and we even volunteer to do so if we can test it. On a somewhat related note, it’s a good idea to avoid using die() within memory init code, but to propagate errors to the caller. This allows retrying memory init with different parameters or disabling failing channels, which increases the chances of booting.
[Angel] TL;DR the 2nd example is fine, the rest are OK. Consider using WARN_ON and BUG_ON macros (assert, but allow handling the result in regular C code). For instance: if (BUG_ON(did_error_happen())) handle_error();
[Angel] TO DO: Add the aforementioned macros to coreboot.
[Angel] TO DO: Figure out how to enable FATAL_ASSERTS for 9elements QA.
[David] "Boots, not bricks" - Need a recovery path if at all possible. Use asserts - so fatal asserts will halt the system and if fatal-asserts are not set, the system will continue. * coreboot needs better document on asserts and error handling. * Julius will try to work on this. Angel can help. * Martin to add documentation to the header files so that information about assert() and die() for IDEs that support this feature.
* We need a way to report failures for security issues to later in the boot process when that information can be given to the user or to reboot to whatever recovery mode is implemented. This needs to be designed and discussed separately.
Use asserts - so fatal asserts will halt the system and if fatal-asserts are not set, the system will continue.
- coreboot needs better document on asserts and error handling.
- Julius will try to work on this. Angel can help.
- Martin to add documentation to the header files so that information about assert() and die() for IDEs that support this feature.
After thinking about this for a bit I thought it would make most sense to add it to the coding style (instead of a separate document), and combine it all into one section about general error handling guidelines. Here's a draft proposal, let me know what you think: https://review.coreboot.org/70775