Aaron Durbin has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/38423 )
Change subject: cbfs: Add metadata cache ......................................................................
Patch Set 16:
(1 comment)
https://review.coreboot.org/c/coreboot/+/38423/10/src/lib/cbfs.c File src/lib/cbfs.c:
https://review.coreboot.org/c/coreboot/+/38423/10/src/lib/cbfs.c@28 PS10, Line 28: if (err == CB_CBFS_CACHE_FULL)
I think the fallback is critical for RW updates though? If we ever get into a situation where we need to add more files in an RW update, and that starts overflowing the cache, we'd be screwed otherwise. The cache size has to be hardcoded in RO because verstage already needs to build the cache to load the RW romstage.
Or move the cache once we're in RW? I don't feel like it's a good policy to count on not hitting the cache. That seems not ideal to me. I understand where you are coming from for future proofing, but it seems like not a great crutch. I'll think on it some more, but I don't think we should count on that situation being our 1st and only option.
Note that the cache full error message is "CBFS ERROR: mcache overflow, ...", and any firmware log string with the string "ERROR" in it will make our suspend_stress_test script complain. (This, sadly, is the most reliable method I've found over the years to mark a non-fatal error condition in a way that bring-up teams will actually notice, and that's why I always tell people to start the really bad error messages with "ERROR". I agree that's terrible and we could really use a more reliable mechanism to report non-fatal but problematic errors, like including the log-level in the CBMEM console so we could at least scan for CB_ERR or something... but it was never something I had time to do myself.) If we want something more robust we can of course also add a FAFT test that checks for that specific message, and that also adds a bunch of dummy files to trigger it intentionally and confirm the device can still boot in an overflow scenario. But I think we'll need to implement that sort of policy in our release testing flow, so that we still have the option to manually waive it for a later RW update if that's the only way to solve a critical issue.
I wonder if we should funnel more information into eventlog or some other way to get structured messages/events through for the reasons you've stated. That's obviously for the future, though.