I'd like to propose that we set up a guideline for when coreboot should use die().
My belief, and proposal for the guideline is that coreboot should *ONLY* die in 2 situations:
1) The system cannot continue booting. Examples: * coreboot needs to jump somewhere to continue, but doesn’t have a location to jump to. * The system needs to reset to continue, but reset failed or isn’t implemented. * Memory init failed. (SPD can't be read, Failed to initialize memory controller, etc.)
2) There’s a known security issue if booting continues. Even this shouldn’t be a hard die(), because there should still be a manual process to boot the system so that the firmware can be updated. (we can work that out later) Examples: * HASH table verification failed! * Can't verify flash protections!
It looks like most die() calls are valid, but there are currently cases where coreboot dies for a “non-fatal” reason. The case I’m currently looking at is when running an FSP, and the FSP ID is not the expected ID. https://review.coreboot.org/c/coreboot/+/69412
My opinion is that coreboot should put out a warning, but jump to the FSP anyway. If it fails, it’s the FSP’s responsibility to return a failure code to coreboot to die. (More likely, it would die internally without returning, which is no worse than coreboot dying.
coreboot has no idea what the FSP is going to do and isn’t responsible for it. This is a case of “there looks like there’s a problem, proceed with caution”, but is being treated as a situation where coreboot cannot continue.
Other examples where maybe we shouldn’t be dying: * die("EHCI controller (00:1a.7) not listed in devicetree.\n"); We don’t need the EHCI controller to boot, at least not at this point.
* die("RAM init: Decoded SPD DRAM freq is slower than the controller minimum!"); So try with the controller minimum and die if that fails.
* die("Onboard SPDs must match each other"); Pick an SPD and try initializing memory with it. Die if that fails.
* die("IGD is disabled!"); So try to boot without graphics…
* die("Invalid GPIO pad number\n"); So put out a warning and skip the GPIO. An assert would be reasonable - a hard die() isn’t, in my opinion.
* die("BUG: Increase ehci_dbg_info reserve in CAR"); Use an assert, bail out of ehci_dbg.
coreboot doesn't even make ASSERT fatal by default because we don't have a good way to communicate errors to the user. This proposal aligns with that.
I've added this topic to the leadership meeting minutes for tomorrow. Let me know what you think.
Thanks! Martin
Hi
A lot of these examples date back from when it was hard to distinguish error messages from info in logs. To die on improper configuration is the next best thing to a buildtime static assert to make sure configuration is correct. Fixing a failed boot on hitting a die() is easier than figuring out why something is not working as intended. For instance you indeed might succeed in booting with a certain dram freq but it fails later due to instabilities... That is way harder to figure out than not being able to boot due to malformed SPD.
About not continuing to boot for security reasons: that really depends on the use case. Not continuing a boot on an invalid hash is really a valid use case.
Now that errors are easier to distinguish from info in a log, a different strategy could be approached. I personally think halting the boot is still not a bad tool to inform developers about some improper configuration that one should fix sooner rather than later in order to avoid problems.
Arthur
On Tue, Dec 13, 2022 at 7:34 PM Martin Roth via coreboot < coreboot@coreboot.org> wrote:
I'd like to propose that we set up a guideline for when coreboot should use die().
My belief, and proposal for the guideline is that coreboot should *ONLY* die in 2 situations:
- The system cannot continue booting. Examples:
- coreboot needs to jump somewhere to continue, but doesn’t have a
location to jump to.
- The system needs to reset to continue, but reset failed or isn’t
implemented.
- Memory init failed. (SPD can't be read, Failed to initialize memory
controller, etc.)
- There’s a known security issue if booting continues. Even this
shouldn’t be a hard die(), because there should still be a manual process to boot the system so that the firmware can be updated. (we can work that out later) Examples:
- HASH table verification failed!
- Can't verify flash protections!
It looks like most die() calls are valid, but there are currently cases where coreboot dies for a “non-fatal” reason. The case I’m currently looking at is when running an FSP, and the FSP ID is not the expected ID. https://review.coreboot.org/c/coreboot/+/69412
My opinion is that coreboot should put out a warning, but jump to the FSP anyway. If it fails, it’s the FSP’s responsibility to return a failure code to coreboot to die. (More likely, it would die internally without returning, which is no worse than coreboot dying.
coreboot has no idea what the FSP is going to do and isn’t responsible for it. This is a case of “there looks like there’s a problem, proceed with caution”, but is being treated as a situation where coreboot cannot continue.
Other examples where maybe we shouldn’t be dying:
die("EHCI controller (00:1a.7) not listed in devicetree.\n"); We don’t need the EHCI controller to boot, at least not at this point.
die("RAM init: Decoded SPD DRAM freq is slower than the controller
minimum!"); So try with the controller minimum and die if that fails.
die("Onboard SPDs must match each other"); Pick an SPD and try initializing memory with it. Die if that fails.
die("IGD is disabled!"); So try to boot without graphics…
die("Invalid GPIO pad number\n"); So put out a warning and skip the GPIO. An assert would be
reasonable - a hard die() isn’t, in my opinion.
- die("BUG: Increase ehci_dbg_info reserve in CAR"); Use an assert, bail out of ehci_dbg.
coreboot doesn't even make ASSERT fatal by default because we don't have a good way to communicate errors to the user. This proposal aligns with that.
I've added this topic to the leadership meeting minutes for tomorrow. Let me know what you think.
Thanks! Martin _______________________________________________ coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org
On Wed, Dec 14, 2022 at 3:42 AM Arthur Heymans arthur@aheymans.xyz wrote:
Hi
A lot of these examples date back from when it was hard to distinguish error messages from info in logs. To die on improper configuration is the next best thing to a buildtime static assert to make sure configuration is correct. Fixing a failed boot on hitting a die() is easier than figuring out why something is not working as intended. For instance you indeed might succeed in booting with a certain dram freq but it fails later due to instabilities... That is way harder to figure out than not being able to boot due to malformed SPD.
About not continuing to boot for security reasons: that really depends on the use case. Not continuing a boot on an invalid hash is really a valid use case.
Now that errors are easier to distinguish from info in a log, a different strategy could be approached. I personally think halting the boot is still not a bad tool to inform developers about some improper configuration that one should fix sooner rather than later in order to avoid problems.
While I agree with the sentiment, in practice this tends to affect end users more than developers, and without the ability to obtain serial/debug output, the end user is left with a brick rather than a device that boots but has an error in the log. I agree with Martin here, failing to boot the board should be an option of last resort, or at the very least, a configurable option that defaults to off
Arthur
On Tue, Dec 13, 2022 at 7:34 PM Martin Roth via coreboot < coreboot@coreboot.org> wrote:
I'd like to propose that we set up a guideline for when coreboot should use die().
My belief, and proposal for the guideline is that coreboot should *ONLY* die in 2 situations:
- The system cannot continue booting. Examples:
- coreboot needs to jump somewhere to continue, but doesn’t have a
location to jump to.
- The system needs to reset to continue, but reset failed or isn’t
implemented.
- Memory init failed. (SPD can't be read, Failed to initialize memory
controller, etc.)
- There’s a known security issue if booting continues. Even this
shouldn’t be a hard die(), because there should still be a manual process to boot the system so that the firmware can be updated. (we can work that out later) Examples:
- HASH table verification failed!
- Can't verify flash protections!
It looks like most die() calls are valid, but there are currently cases where coreboot dies for a “non-fatal” reason. The case I’m currently looking at is when running an FSP, and the FSP ID is not the expected ID. https://review.coreboot.org/c/coreboot/+/69412
My opinion is that coreboot should put out a warning, but jump to the FSP anyway. If it fails, it’s the FSP’s responsibility to return a failure code to coreboot to die. (More likely, it would die internally without returning, which is no worse than coreboot dying.
coreboot has no idea what the FSP is going to do and isn’t responsible for it. This is a case of “there looks like there’s a problem, proceed with caution”, but is being treated as a situation where coreboot cannot continue.
Other examples where maybe we shouldn’t be dying:
- die("EHCI controller (00:1a.7) not listed in devicetree.\n"); We don’t need the EHCI controller to boot, at least not at this
point.
- die("RAM init: Decoded SPD DRAM freq is slower than the controller
minimum!"); So try with the controller minimum and die if that fails.
die("Onboard SPDs must match each other"); Pick an SPD and try initializing memory with it. Die if that fails.
die("IGD is disabled!"); So try to boot without graphics…
die("Invalid GPIO pad number\n"); So put out a warning and skip the GPIO. An assert would be
reasonable - a hard die() isn’t, in my opinion.
- die("BUG: Increase ehci_dbg_info reserve in CAR"); Use an assert, bail out of ehci_dbg.
coreboot doesn't even make ASSERT fatal by default because we don't have a good way to communicate errors to the user. This proposal aligns with that.
I've added this topic to the leadership meeting minutes for tomorrow. Let me know what you think.
Thanks! Martin _______________________________________________ coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org
coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org
I think these guidelines seem reasonable in principle but we'd have to see what that means in detail in each case. For example, some cases of "cannot continue to boot" are obvious (can't boot without DRAM), but there may be others where the problematic feature seems less critical, but it's still not easy to decide how to continue in the case that's currently covered with a die(). I would say as long as "continuing" is as simple as directly returning from the function or skipping a few lines, we can apply this rule -- but we shouldn't interpret it to mean that you have to implement a lot of extra code to work around a problem or roll back state to before it attempted to do the thing that failed, if we don't think that error path is likely to be useful in practice.
For security issues, I would like to be very strict and say that any situation that _might_ indicate that any configured-in security feature cannot provide one of the guarantees it was designed to provide _must_ die or otherwise abort booting or return to a safe fallback path (like vboot recovery). This is very important to those coreboot users that are really relying on those security features in their products (e.g. ChromeOS). If we really have users that want to enable security features but would prefer their device to boot less secure rather than not at all, then we'll have to introduce a Kconfig to distinguish that preference.
For the cases where we remove a die() because of this, we should add an assert() instead so that people can still use CONFIG_FATAL_ASSERTS=y to get back the old behavior.
On Wed, Dec 14, 2022 at 4:17 PM Matt DeVillier matt.devillier@gmail.com wrote:
On Wed, Dec 14, 2022 at 3:42 AM Arthur Heymans arthur@aheymans.xyz wrote:
Hi
A lot of these examples date back from when it was hard to distinguish error messages from info in logs. To die on improper configuration is the next best thing to a buildtime static assert to make sure configuration is correct. Fixing a failed boot on hitting a die() is easier than figuring out why something is not working as intended. For instance you indeed might succeed in booting with a certain dram freq but it fails later due to instabilities... That is way harder to figure out than not being able to boot due to malformed SPD.
About not continuing to boot for security reasons: that really depends on the use case. Not continuing a boot on an invalid hash is really a valid use case.
Now that errors are easier to distinguish from info in a log, a different strategy could be approached. I personally think halting the boot is still not a bad tool to inform developers about some improper configuration that one should fix sooner rather than later in order to avoid problems.
While I agree with the sentiment, in practice this tends to affect end users more than developers, and without the ability to obtain serial/debug output, the end user is left with a brick rather than a device that boots but has an error in the log. I agree with Martin here, failing to boot the board should be an option of last resort, or at the very least, a configurable option that defaults to off
Arthur
On Tue, Dec 13, 2022 at 7:34 PM Martin Roth via coreboot coreboot@coreboot.org wrote:
I'd like to propose that we set up a guideline for when coreboot should use die().
My belief, and proposal for the guideline is that coreboot should *ONLY* die in 2 situations:
- The system cannot continue booting. Examples:
- coreboot needs to jump somewhere to continue, but doesn’t have a location to jump to.
- The system needs to reset to continue, but reset failed or isn’t implemented.
- Memory init failed. (SPD can't be read, Failed to initialize memory controller, etc.)
- There’s a known security issue if booting continues. Even this shouldn’t be a hard die(), because there should still be a manual process to boot the system so that the firmware can be updated. (we can work that out later) Examples:
- HASH table verification failed!
- Can't verify flash protections!
It looks like most die() calls are valid, but there are currently cases where coreboot dies for a “non-fatal” reason. The case I’m currently looking at is when running an FSP, and the FSP ID is not the expected ID. https://review.coreboot.org/c/coreboot/+/69412
My opinion is that coreboot should put out a warning, but jump to the FSP anyway. If it fails, it’s the FSP’s responsibility to return a failure code to coreboot to die. (More likely, it would die internally without returning, which is no worse than coreboot dying.
coreboot has no idea what the FSP is going to do and isn’t responsible for it. This is a case of “there looks like there’s a problem, proceed with caution”, but is being treated as a situation where coreboot cannot continue.
Other examples where maybe we shouldn’t be dying:
die("EHCI controller (00:1a.7) not listed in devicetree.\n"); We don’t need the EHCI controller to boot, at least not at this point.
die("RAM init: Decoded SPD DRAM freq is slower than the controller minimum!"); So try with the controller minimum and die if that fails.
die("Onboard SPDs must match each other"); Pick an SPD and try initializing memory with it. Die if that fails.
die("IGD is disabled!"); So try to boot without graphics…
die("Invalid GPIO pad number\n"); So put out a warning and skip the GPIO. An assert would be reasonable - a hard die() isn’t, in my opinion.
die("BUG: Increase ehci_dbg_info reserve in CAR"); Use an assert, bail out of ehci_dbg.
coreboot doesn't even make ASSERT fatal by default because we don't have a good way to communicate errors to the user. This proposal aligns with that.
I've added this topic to the leadership meeting minutes for tomorrow. Let me know what you think.
Thanks! Martin _______________________________________________ coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org
coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org
coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org
Thanks to everyone for their feedback. Let's pick this up in the leadership meeting that starts in 20 minutes.
I'll update this thread after the meeting.
Martin
On Wed, Dec 14, 2022 at 9:17 AM Julius Werner jwerner@chromium.org wrote:
I think these guidelines seem reasonable in principle but we'd have to see what that means in detail in each case. For example, some cases of "cannot continue to boot" are obvious (can't boot without DRAM), but there may be others where the problematic feature seems less critical, but it's still not easy to decide how to continue in the case that's currently covered with a die(). I would say as long as "continuing" is as simple as directly returning from the function or skipping a few lines, we can apply this rule -- but we shouldn't interpret it to mean that you have to implement a lot of extra code to work around a problem or roll back state to before it attempted to do the thing that failed, if we don't think that error path is likely to be useful in practice.
For security issues, I would like to be very strict and say that any situation that _might_ indicate that any configured-in security feature cannot provide one of the guarantees it was designed to provide _must_ die or otherwise abort booting or return to a safe fallback path (like vboot recovery). This is very important to those coreboot users that are really relying on those security features in their products (e.g. ChromeOS). If we really have users that want to enable security features but would prefer their device to boot less secure rather than not at all, then we'll have to introduce a Kconfig to distinguish that preference.
For the cases where we remove a die() because of this, we should add an assert() instead so that people can still use CONFIG_FATAL_ASSERTS=y to get back the old behavior.
On Wed, Dec 14, 2022 at 4:17 PM Matt DeVillier matt.devillier@gmail.com wrote:
On Wed, Dec 14, 2022 at 3:42 AM Arthur Heymans arthur@aheymans.xyz wrote:
Hi
A lot of these examples date back from when it was hard to distinguish error messages from info in logs. To die on improper configuration is the next best thing to a buildtime static assert to make sure configuration is correct. Fixing a failed boot on hitting a die() is easier than figuring out why something is not working as intended. For instance you indeed might succeed in booting with a certain dram freq but it fails later due to instabilities... That is way harder to figure out than not being able to boot due to malformed SPD.
About not continuing to boot for security reasons: that really depends on the use case. Not continuing a boot on an invalid hash is really a valid use case.
Now that errors are easier to distinguish from info in a log, a different strategy could be approached. I personally think halting the boot is still not a bad tool to inform developers about some improper configuration that one should fix sooner rather than later in order to avoid problems.
While I agree with the sentiment, in practice this tends to affect end users more than developers, and without the ability to obtain serial/debug output, the end user is left with a brick rather than a device that boots but has an error in the log. I agree with Martin here, failing to boot the board should be an option of last resort, or at the very least, a configurable option that defaults to off
Arthur
On Tue, Dec 13, 2022 at 7:34 PM Martin Roth via coreboot coreboot@coreboot.org wrote:
I'd like to propose that we set up a guideline for when coreboot should use die().
My belief, and proposal for the guideline is that coreboot should *ONLY* die in 2 situations:
- The system cannot continue booting. Examples:
- coreboot needs to jump somewhere to continue, but doesn’t have a location to jump to.
- The system needs to reset to continue, but reset failed or isn’t implemented.
- Memory init failed. (SPD can't be read, Failed to initialize memory controller, etc.)
- There’s a known security issue if booting continues. Even this shouldn’t be a hard die(), because there should still be a manual process to boot the system so that the firmware can be updated. (we can work that out later) Examples:
- HASH table verification failed!
- Can't verify flash protections!
It looks like most die() calls are valid, but there are currently cases where coreboot dies for a “non-fatal” reason. The case I’m currently looking at is when running an FSP, and the FSP ID is not the expected ID. https://review.coreboot.org/c/coreboot/+/69412
My opinion is that coreboot should put out a warning, but jump to the FSP anyway. If it fails, it’s the FSP’s responsibility to return a failure code to coreboot to die. (More likely, it would die internally without returning, which is no worse than coreboot dying.
coreboot has no idea what the FSP is going to do and isn’t responsible for it. This is a case of “there looks like there’s a problem, proceed with caution”, but is being treated as a situation where coreboot cannot continue.
Other examples where maybe we shouldn’t be dying:
die("EHCI controller (00:1a.7) not listed in devicetree.\n"); We don’t need the EHCI controller to boot, at least not at this point.
die("RAM init: Decoded SPD DRAM freq is slower than the controller minimum!"); So try with the controller minimum and die if that fails.
die("Onboard SPDs must match each other"); Pick an SPD and try initializing memory with it. Die if that fails.
die("IGD is disabled!"); So try to boot without graphics…
die("Invalid GPIO pad number\n"); So put out a warning and skip the GPIO. An assert would be reasonable - a hard die() isn’t, in my opinion.
die("BUG: Increase ehci_dbg_info reserve in CAR"); Use an assert, bail out of ehci_dbg.
coreboot doesn't even make ASSERT fatal by default because we don't have a good way to communicate errors to the user. This proposal aligns with that.
I've added this topic to the leadership meeting minutes for tomorrow. Let me know what you think.
Thanks! Martin _______________________________________________ coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org
coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org
coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org
coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org