On Thu, Aug 22, 2024 at 1:30 AM Julius Werner <jwerner@chromium.org> wrote:

[resent with correct sender address]

Hi,

I wanted to pick up the discussion from the leadership meeting again
here because I think it's easier to show my point with a few examples.
If you want an old coreboot RO build to work together with a ToT RW
build, then you don't just care about the vboot workbuffer matching...
*everything* that is shared in any way between bootblock/verstage and
romstage/ramstage/payload needs to be kept in sync or risks
introducing dangerous and not necessarily immediately obvious errors.

One major issue is the pre-RAM memory layout. Take for example this
simple one-line CL: https://review.coreboot.org/83680 . Would most
reviewers expect that it makes RW firmware built after that point
incompatible with RO firmware built before? Well, it does, because it
increases the size of a memory region here
https://review.coreboot.org/plugins/gitiles/coreboot/+/refs/heads/main/src/arch/x86/car.ld#34
which then moves the offsets of all subsequent memory regions
downwards accordingly. Some of these regions contain buffers that were
passed by the bootblock to later stages (e.g. CBFS_MCACHE,
FMAP_CACHE), and when the romstage tries to reuse those buffers it
suddenly finds different data at the same offsets.

We actually had to figure out how to support this for an RW update to
one of our shipped Chromebooks, and it looks like this:
https://chromium-review.googlesource.com/c/chromiumos/third_party/coreboot/+/5732763/3/src/arch/x86/assembly_entry.S
. It needs an ugly, complicated assembly hack to copy bytes around
from the locations where RO firmware left them into the locations
where RW firmware expects them. Since one of these regions is the
stack, you can't even do that in C code because you might already be
trampling over the data you're still trying to save once you enter a C
function. And this patch is just tailored to this one particular case
(and platform), it's not a universal solution — if we were resizing a
different region, or swapping the order of two regions, that would
require a completely different (maybe impossible) workaround.

Now, we could try to replace our simple memory region management with
something more complicated that comes with a built-in tagged directory
(kinda like a second CBMEM in pre-RAM). Then, if every single access
to one of those regions went through that directory, it would
automatically find the right location. We would need to write the code
to parse that directory in assembly for every architecture, since the
stack setup is one of the things that uses them. But even then, that
doesn't solve the resize problem: if we e.g. find a rare stack
overflow in the FSP that requires us to have a bigger romstage stack,
then we can't just reuse the directory the bootblock has written, we
need to actually change it to fix that bug and move all those regions
around to fit the new requirements. And what if migrating from the old
to the new layout on the fly is not possible? If we flip the order of
two sections, there might not be enough free scratch space to
temporarily save the contents of one section while we move the
contents of the other into its place. (Note that car.ld is only the
shared pre-RAM layout for Intel platforms. Every Arm and (recent) AMD
platform has their own separate layout like this, and some of them
tend to be very tight. We regularly have to land patches like
https://review.coreboot.org/78970 to make some sections larger or
smaller so that things still fit on certain platforms. Each of those
changes would lead to one of these situations where suddenly the
romstage needs to move everything into a different place from where
the bootblock left it because it needs space for new things that the
older code didn't account for.)

There are plenty of other dependencies between RO and RW stages, many
of them very abstract and platform specific. For example, on Arm SoCs
we often have a clock tree where a few shared top-level PLLs get
multiplexed onto many peripheral devices that each have their own
pre-divisors. The PLLs are usually set up by bootblock code (because
some of them are needed very early), and the peripheral divisors are
usually configured by the driver that initializes that peripheral. If
you e.g. initialize an I2C controller in ramstage to 400KHz, the I2C
driver can calculate what divider to set based on the target frequency
and a #define constant of what the top-level PLL was set to. But what
if for some reason the top-level PLL frequency needs to change? Today
we'd submit a patch to change that #define and all the peripheral
driver code will continue to calculate accurate divisor values. But if
you then try to combine that RW ramstage with an RO bootblock built
before that change, suddenly your divisor will be calculated with an
incorrect value and your I2C clock will either be too fast or too
slow, resulting in maybe the peripheral not working, or maybe your
boot speed just regressing in a way nobody will ever track down, or
maybe the bus running slightly too fast so that on 99.9% of devices it
will still work but on 0.1% of devices you suddenly start having an
awful heisenbug that will be a royal pain to track down.

There can be a near-infinite number of scenarios like this on various
platforms where code running in a later stage depends on some hardware
initialization that code running in an earlier stage did, and I see no
practical way to track them all even if we did try to implement
explicit backwards-compatibility for all of them. vboot is an optional
feature in coreboot and most people aren't even using it. How are we
going to get every developer/reviewer to care about whether a change
they're about to make could subtly break compatibility for a feature
they've never used?

And even if we did find a way to perfectly track every possible
dependency and wanted to implement backwards-compatibility for all of
them, what is the cost for doing so (both the effort of writing it,
and the amount of extra romstage code we need to carry around in our
binaries)? Not every incompatibility is easy to fix up once you detect
it. Let's go back to the vboot workbuffer, for example: the workbuffer
contains among other things a couple of data structures that have been
read from the TPM and contain information relevant at various points
in the boot flow. Right now, two of those (secdata_firmware and
secdata_kernel) get loaded in verstage, while the third (secdata_fwmp)
only gets loaded in our payload. But we have been discussing that we
might want to (for various reasons) change this in the future so that
secdata_fwmp is also loaded in verstage. Let's say we make that
change, and we want to stay backwards compatible to RO firmware built
before it. That means that early in romstage, before any new code that
depends on secdata_fwmp runs, we'll need to run some code that checks
the workbuffer version and fixes this incompatibility if it is too
old. But in order to be able to load secdata_fwmp, we'll need to link
the entire TPM driver stack into romstage which otherwise wouldn't be
needed there on many platforms. So we are bloating our binary size in
a potentially space-restricted pre-RAM stage just to be able to fix an
issue that most people are never going to care about. Multiply that by
the dozens of possible incompatibilities that will accumulate over the
years.

I totally understand that it would be great if we could just build a
system where you can just run any RO and RW together and have them
work it out, but I really don't think that's feasible in practice.
vboot was not designed to support that, for good reason. And if we
tried to just get started on adding compatibility code for specific
issues that people have on a specific platform right now to "see how
far we get", I don't think it's going to work out in the long term and
it will just lead to disruption and extra code complexity right now
that's eventually not going to pay off for anyone. It may work for the
one platform you care about for a while but sooner or later there'll
be another compatibility break that you can't easily mitigate. And if
people start to rely on this working then they may get angry if
someone else makes a change that is going to break compatibility on
their platform in a way that can't be fixed, which I think would be a
bad thing, because we do need to be able to develop new features and
fix architectural issues (both in vboot and in other code that can
cause RO->RW dependencies) unburdened by legacy code from years ago.
_______________________________________________
coreboot mailing list -- coreboot@coreboot.org
To unsubscribe send an email to coreboot-leave@coreboot.org