hi Julius,
I very much appreciate you taking the time to elaborate on the issues we'd face trying to pair a "mismatched" RO/RW combination. Thoughtful, well articulated discussions like this are a big part of what makes our community a great one IMO.
-Matt
On Wed, Aug 21, 2024 at 6:29 PM Julius Werner jwerner@chromium.org wrote:
[resent with correct sender address]
Hi,
I wanted to pick up the discussion from the leadership meeting again here because I think it's easier to show my point with a few examples. If you want an old coreboot RO build to work together with a ToT RW build, then you don't just care about the vboot workbuffer matching... *everything* that is shared in any way between bootblock/verstage and romstage/ramstage/payload needs to be kept in sync or risks introducing dangerous and not necessarily immediately obvious errors.
One major issue is the pre-RAM memory layout. Take for example this simple one-line CL: https://review.coreboot.org/83680 . Would most reviewers expect that it makes RW firmware built after that point incompatible with RO firmware built before? Well, it does, because it increases the size of a memory region here
https://review.coreboot.org/plugins/gitiles/coreboot/+/refs/heads/main/src/a... which then moves the offsets of all subsequent memory regions downwards accordingly. Some of these regions contain buffers that were passed by the bootblock to later stages (e.g. CBFS_MCACHE, FMAP_CACHE), and when the romstage tries to reuse those buffers it suddenly finds different data at the same offsets.
We actually had to figure out how to support this for an RW update to one of our shipped Chromebooks, and it looks like this:
https://chromium-review.googlesource.com/c/chromiumos/third_party/coreboot/+... . It needs an ugly, complicated assembly hack to copy bytes around from the locations where RO firmware left them into the locations where RW firmware expects them. Since one of these regions is the stack, you can't even do that in C code because you might already be trampling over the data you're still trying to save once you enter a C function. And this patch is just tailored to this one particular case (and platform), it's not a universal solution — if we were resizing a different region, or swapping the order of two regions, that would require a completely different (maybe impossible) workaround.
Now, we could try to replace our simple memory region management with something more complicated that comes with a built-in tagged directory (kinda like a second CBMEM in pre-RAM). Then, if every single access to one of those regions went through that directory, it would automatically find the right location. We would need to write the code to parse that directory in assembly for every architecture, since the stack setup is one of the things that uses them. But even then, that doesn't solve the resize problem: if we e.g. find a rare stack overflow in the FSP that requires us to have a bigger romstage stack, then we can't just reuse the directory the bootblock has written, we need to actually change it to fix that bug and move all those regions around to fit the new requirements. And what if migrating from the old to the new layout on the fly is not possible? If we flip the order of two sections, there might not be enough free scratch space to temporarily save the contents of one section while we move the contents of the other into its place. (Note that car.ld is only the shared pre-RAM layout for Intel platforms. Every Arm and (recent) AMD platform has their own separate layout like this, and some of them tend to be very tight. We regularly have to land patches like https://review.coreboot.org/78970 to make some sections larger or smaller so that things still fit on certain platforms. Each of those changes would lead to one of these situations where suddenly the romstage needs to move everything into a different place from where the bootblock left it because it needs space for new things that the older code didn't account for.)
There are plenty of other dependencies between RO and RW stages, many of them very abstract and platform specific. For example, on Arm SoCs we often have a clock tree where a few shared top-level PLLs get multiplexed onto many peripheral devices that each have their own pre-divisors. The PLLs are usually set up by bootblock code (because some of them are needed very early), and the peripheral divisors are usually configured by the driver that initializes that peripheral. If you e.g. initialize an I2C controller in ramstage to 400KHz, the I2C driver can calculate what divider to set based on the target frequency and a #define constant of what the top-level PLL was set to. But what if for some reason the top-level PLL frequency needs to change? Today we'd submit a patch to change that #define and all the peripheral driver code will continue to calculate accurate divisor values. But if you then try to combine that RW ramstage with an RO bootblock built before that change, suddenly your divisor will be calculated with an incorrect value and your I2C clock will either be too fast or too slow, resulting in maybe the peripheral not working, or maybe your boot speed just regressing in a way nobody will ever track down, or maybe the bus running slightly too fast so that on 99.9% of devices it will still work but on 0.1% of devices you suddenly start having an awful heisenbug that will be a royal pain to track down.
There can be a near-infinite number of scenarios like this on various platforms where code running in a later stage depends on some hardware initialization that code running in an earlier stage did, and I see no practical way to track them all even if we did try to implement explicit backwards-compatibility for all of them. vboot is an optional feature in coreboot and most people aren't even using it. How are we going to get every developer/reviewer to care about whether a change they're about to make could subtly break compatibility for a feature they've never used?
And even if we did find a way to perfectly track every possible dependency and wanted to implement backwards-compatibility for all of them, what is the cost for doing so (both the effort of writing it, and the amount of extra romstage code we need to carry around in our binaries)? Not every incompatibility is easy to fix up once you detect it. Let's go back to the vboot workbuffer, for example: the workbuffer contains among other things a couple of data structures that have been read from the TPM and contain information relevant at various points in the boot flow. Right now, two of those (secdata_firmware and secdata_kernel) get loaded in verstage, while the third (secdata_fwmp) only gets loaded in our payload. But we have been discussing that we might want to (for various reasons) change this in the future so that secdata_fwmp is also loaded in verstage. Let's say we make that change, and we want to stay backwards compatible to RO firmware built before it. That means that early in romstage, before any new code that depends on secdata_fwmp runs, we'll need to run some code that checks the workbuffer version and fixes this incompatibility if it is too old. But in order to be able to load secdata_fwmp, we'll need to link the entire TPM driver stack into romstage which otherwise wouldn't be needed there on many platforms. So we are bloating our binary size in a potentially space-restricted pre-RAM stage just to be able to fix an issue that most people are never going to care about. Multiply that by the dozens of possible incompatibilities that will accumulate over the years.
I totally understand that it would be great if we could just build a system where you can just run any RO and RW together and have them work it out, but I really don't think that's feasible in practice. vboot was not designed to support that, for good reason. And if we tried to just get started on adding compatibility code for specific issues that people have on a specific platform right now to "see how far we get", I don't think it's going to work out in the long term and it will just lead to disruption and extra code complexity right now that's eventually not going to pay off for anyone. It may work for the one platform you care about for a while but sooner or later there'll be another compatibility break that you can't easily mitigate. And if people start to rely on this working then they may get angry if someone else makes a change that is going to break compatibility on their platform in a way that can't be fixed, which I think would be a bad thing, because we do need to be able to develop new features and fix architectural issues (both in vboot and in other code that can cause RO->RW dependencies) unburdened by legacy code from years ago.