You're going where we were in 2000. Many of us have been here before. We'd have killed to have the 40K of space you have on the ARM. We were doing this with either just the register set or a 16-register scratchpad.
And we solved it with a streaming abstraction. So the argument that we're somehow x86-centric because we don't deal with small memory in the bootblock is baloney. If you want to argue that we got sloppy for other reasons, point taken. But in the case of DoC at least, we had a very small memory window and that was it. It was NOT directly addressable.
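For anyone who hasn't seen the idea, here is a rough sketch of what a streaming read over a small window can look like. This is NOT the old DoC code. Every name in it (window_select, struct stream, stream_read) and both sizes are made up for illustration, and the "flash" is just an array so the thing runs on a host.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SIZE 16u  /* hypothetical window size, tiny on purpose */
#define FLASH_SIZE  64u  /* hypothetical device size */

/* Stand-in for the device; real flash of this kind is not addressable. */
static uint8_t fake_flash[FLASH_SIZE];
/* The only region the CPU can actually read. */
static uint8_t window[WINDOW_SIZE];

/* Hypothetical helper: make one page of flash visible in the window. */
static void window_select(size_t page)
{
	memcpy(window, fake_flash + page * WINDOW_SIZE, WINDOW_SIZE);
}

struct stream {
	size_t pos;   /* next byte offset to read from the device */
	size_t size;  /* total bytes available */
};

/* Copy up to len bytes into buf, moving the window as needed.
 * Returns the number of bytes actually copied. */
static size_t stream_read(struct stream *s, void *buf, size_t len)
{
	uint8_t *out = buf;
	size_t done = 0;

	while (done < len && s->pos < s->size) {
		size_t page = s->pos / WINDOW_SIZE;
		size_t off = s->pos % WINDOW_SIZE;
		size_t chunk = WINDOW_SIZE - off;

		/* Clamp to what the caller asked for and what is left. */
		if (chunk > len - done)
			chunk = len - done;
		if (chunk > s->size - s->pos)
			chunk = s->size - s->pos;

		window_select(page);
		memcpy(out + done, window + off, chunk);
		done += chunk;
		s->pos += chunk;
	}
	return done;
}
```

The point is that callers only ever see stream_read(); nothing above it needs to know or care whether the backing store is memory mapped.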
I think the real issue is that when cbfs came along (2007) we had hit a point where all flash was addressable on the platforms we had. The idea that flash would not be memory addressable is, well, so brain dead that I guess we thought it would never happen. So of course it happened.
But there are several ways to deal with that problem, and I'm not convinced you're there yet. I don't have time to look, but the original cbfs design could function with all the headers at the front, data following. Maybe we need to get back to that.
ron