Thanks for the details. I agree that adding 48-bit support is better than PA!=VA, but I really think you should implement the L0 table instead of messing with the granule size. This should really not be hard to do since our paging code is already recursive... just add an L0_ADDR_SHIFT and use it instead of L1_ADDR_SHIFT in mmu_init() and get_pte(), and then duplicate the code block handling L1 in init_xlat_table() for L0. That should be everything you need.
You can see that L1 is essentially already implemented as optional right now through those BITS_PER_VA > L1_ADDR_SHIFT tests, even though that never matters at the moment with BITS_PER_VA hardcoded to 33. You should be able to do the same thing for L0 (and maybe make L1 non-optional instead, since I don't think we anticipate ever not needing it at this point).
Changing the granule size would not only break the assumption that pages are 4K (which we probably rely on in a bunch of code), it would also make it impossible to map at smaller boundaries, which you sometimes need to do (e.g. SRAM areas don't always start or end at 64K boundaries). On top of that, it makes your page tables incredibly bloated (64K *each*, and you usually need at least three), which would essentially make this unusable on SRAM-constrained systems. Adding L0 only costs you one extra 4K table and is much more flexible.
It would be nice if you could make this change in both coreboot and libpayload, since we're trying to keep the two implementations in sync.