Julius Werner has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/38423 )
Change subject: cbfs: Add metadata cache ......................................................................
Patch Set 18:
(1 comment)
https://review.coreboot.org/c/coreboot/+/38423/10/src/lib/cbfs.c File src/lib/cbfs.c:
https://review.coreboot.org/c/coreboot/+/38423/10/src/lib/cbfs.c@28 PS10, Line 28: if (err == CB_CBFS_CACHE_FULL)
There are other ways such that RW_SECTION_A/B only has the next stage in its cbfs with another c […]
Sorry, let me just take a step back here because I feel like we're talking past each other a bit. I want to make sure that we're on the same page on a few fundamental things:
1. I am implementing two possible "levels" of security here: the one where only CONFIG_CBFS_VERIFICATION is enabled (every file and every bit of metadata is verified at least once per boot, before it is used), and the one where in addition to that CONFIG_TOCTOU_SAFETY is also enabled (every file and every bit of metadata is verified every time it is loaded, before those loaded bytes are used). It is up to the user which one to select, and selecting only the first one makes certain things faster, but both of those options are supposed to provide exactly the kinds of security guarantees they promise, with no holes, gaps or left-open windows. I am not planning to leave any security guarantee to be "added later" here (I mean, the code isn't all written yet but I have a plan of how it should look and I'm not going to mark those options as menuconfig-available and ready for use until it is), and if you think there's actually a guarantee that my design promises but doesn't quite keep in some edge case, please let me know.
2. Chrome OS is currently planning to *not* enable CONFIG_TOCTOU_SAFETY in the near-term (next couple of years). Our verification anchor scheme using Dauntless is not TOCTOU-safe, so there's no point trying to provide that guarantee in coreboot if there were already wide-open TOCTOU attack opportunities earlier in the boot flow. This was agreed upon with the security team when we designed RO verification. That's why when I'm giving examples specific to the Chrome OS case, I refer to how the code behaves with CONFIG_TOCTOU_SAFETY disabled. (In the case of an RW lookup falling back to flash because the mcache is full, you *only* need to verify all the metadata again when CONFIG_TOCTOU_SAFETY is enabled, because it was already verified once when the RW mcache was built. In Chrome OS the option would be disabled so it wouldn't have to do that and it would do a simpler, faster lookup with early termination. The code for this is sketched out in https://review.coreboot.org/c/coreboot/+/41120/10/src/lib/cbfs.c line 43 -- there's only a TODO comment there right now because the vboot integration isn't written yet, but the intention is that the metadata_hash variable passed to cbfs_lookup() will only not be NULL if CONFIG_TOCTOU_SAFETY is enabled.)
3. The whole point of the mcache is *only* to be a performance improvement, *nothing* else. None of the security guarantees provided by this design fundamentally rely on the mcache in any way. The system is just as secure when CONFIG_NO_CBFS_MCACHE is enabled, and lookups are just as secure when they fall back to flash because the RW mcache is full. The only thing you pay is extra time for reloading or reverifying metadata again. (CONFIG_TOCTOU_SAFETY currently depends on the mcache in Kconfig, and requires the RO mcache to be big enough to fit all files. That is a practical simplification for me, not a fundamental requirement. I could work around that requirement but to do that I would have to introduce a separate memlayout section to pass the RO metadata hash anchor forward to later stages. I don't expect there to really be a practical need for that specific configuration to work, so I decided to simplify things and save some implementation/maintenance effort by making the RO mcache a requirement for CONFIG_TOCTOU_SAFETY. This is *only* the RO mcache, you can set the RW mcache size to zero and still have a perfectly secure, perfectly TOCTOU-safe system that simply reverifies all the metadata on every RW file access.)
4. I agree with you that RO/RW coupling should be minimized in the sense that we should always strive to avoid limiting our options of what we can do with a future RW update. That some coupling must exist is, of course, a necessity, and I think you're underestimating it if you only think about the data persisted by vboot. We have a ton of memlayout sections that are shared between RO and RW and cannot be independently resized by the latter (the existing CBFS_CACHE, timestamps, CBMEM console, FMAP cache, the ROMSTAGE area, stack, page tables, etc.), so the mcache would just become one more out of many there. The important part, I think, is to make sure that on the rare chance where we find ourselves wanting to increase that area with an RW update but can't, this won't be catastrophic for us (i.e. prevent us from shipping that update or create some sort of security risk). I think my fallback path which still provides all the same security guarantees (with basically no extra maintenance overhead because it's reusing all the same code) for only a small performance penalty is the right solution to provide that certainty, and we have the ability (by enforcing minimum requirements through FAFT) to tune the risk of that performance penalty ever even coming into play to basically be as small as we want it to be.
Following these considerations I still believe that the model I'm presenting here is the best solution. I think it optimizes for low complexity and maintenance burden and for being fast in the common case, while guaranteeing both safety (i.e. not getting locked out of the ability to update) and security for all edge cases. I think the cost of causing a cache overflow is low (a small performance impact, nothing else), and the risk of getting there in the first place is tiny and can be individually adjusted by each user/platform depending on how worried they are about that performance impact and how much free SRAM/CAR space they can spare. (It should also be noted, of course, that platform constraints still put other restrictions on what an RW update can do, because there's only so much SRAM/CAR space to go around. Generally we try to make good use of the available space we have anyway, so on an SoC with a lot of SRAM we'd probably allocate more spare space to the CBFS mcache in the first place because there's no point in wasting it if we have some left over, decreasing the chance that mcache space would ever become a problem with an update. If a platform defines its mcache to be very tight to begin with there's probably a good reason for that, and even if the RW update was able to freely resize the cache it might not be able to free up enough space to fit what it needs to (even if there is another section that could be cannibalized a bit, that might not help if it's not adjacent to the mcache). In that scenario I'd much rather be able to take a small performance hit and keep on booting without mcache than to have a system that cannot be made to work at all if I can't find enough space.)
I do not like the approach of adding separate CBFS sections for the romstage because I think it would add a lot of complexity where I don't believe it is necessary, and it would break some of the backwards-compatibility I'm carefully trying to maintain here. I think those are serious problems and I don't see them standing in any relation to what I still really believe to be a non-issue (a rare chance at a small performance penalty in edge cases). The issue Furquan mentioned is a perfect example of a problem that will magically go away when this design is fully implemented, and I think we agreed on the doc now that they will just implement a stopgap solution for that which can later be discarded. (Fundamentally I'm not going to preclude the ability to use special CBFS sections and I'll make sure there are APIs for that -- e.g. I think the SMMSTORE code needs it, and on the payload side we have RW_LEGACY -- but doing that always adds complexity and that extra complexity needs to be justified somehow, it's only warranted when there is no simpler way to achieve the same goals. I really don't see that justification here.)
If you really think this needs to go in a different direction let's maybe set up a VC to talk it through, that might be easier than discussing it here?