What you're asking for is a parallelized or multicore coreboot IIUC.
We've done this before. I believe it was yhlu who implemented the multicore DRAM startup on K8, ca. 2005. I implemented a proof-of-concept multicore capability in coreboot in 2012. It was dead simple and based on work we did in the NIX kernel: a very basic fork/join model. Instead of halting after SMP startup, APs entered a state where they waited for work. It worked. It was not well received at the time. Maybe it's time to take a look at it again.
For your CAR case, all cores would have to finish before you moved into the DRAM stage. Is that really a problem? Based on your note, I don't think you need a model as complex as the one in Linux, with tasklets, schedulers, and such. A simple space-shared model ought to be sufficient.
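To make the fork/join idea concrete, here is a rough userspace sketch. It is not the 2012 patch and not coreboot code: pthreads stand in for APs, and every name in it (ap_slot, ap_fork, ap_join_all, fake_init, NUM_APS) is made up for illustration. Each "AP" spins waiting for work instead of halting, the "BSP" hands each one a single work item, and the join is the point where all cores must have finished before the next stage.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_APS 3 /* assumption: threads stand in for application processors */

typedef void (*work_fn)(void *arg);

enum { AP_IDLE, AP_RUN, AP_EXIT };

/* One mailbox per AP: the BSP posts work, the AP marks it idle when done. */
struct ap_slot {
	atomic_int state;
	work_fn fn;
	void *arg;
};

static struct ap_slot slots[NUM_APS];

/* AP side: rather than halting after startup, wait for work. */
static void *ap_wait_for_work(void *p)
{
	struct ap_slot *s = p;

	for (;;) {
		int st = atomic_load_explicit(&s->state, memory_order_acquire);

		if (st == AP_EXIT)
			return NULL;
		if (st == AP_RUN) {
			s->fn(s->arg);
			atomic_store_explicit(&s->state, AP_IDLE,
					      memory_order_release);
		} else {
			sched_yield(); /* a real AP might pause or mwait here */
		}
	}
}

/* BSP side: fork gives one work item to one idle AP (space-shared). */
static void ap_fork(int ap, work_fn fn, void *arg)
{
	slots[ap].fn = fn;
	slots[ap].arg = arg;
	atomic_store_explicit(&slots[ap].state, AP_RUN, memory_order_release);
}

/* BSP side: join returns only once every AP is idle again. */
static void ap_join_all(void)
{
	for (int i = 0; i < NUM_APS; i++)
		while (atomic_load_explicit(&slots[i].state,
					    memory_order_acquire) != AP_IDLE)
			sched_yield();
}

/* Hypothetical work item: stands in for some per-core init step. */
static void fake_init(void *arg)
{
	*(int *)arg = 1;
}

/* Runs one fork/join round and returns how many work items completed. */
int run_fork_join_demo(void)
{
	pthread_t threads[NUM_APS];
	int done[NUM_APS] = { 0 };
	int count = 0;

	for (int i = 0; i < NUM_APS; i++) {
		atomic_store(&slots[i].state, AP_IDLE);
		int rc = pthread_create(&threads[i], NULL, ap_wait_for_work,
					&slots[i]);
		assert(rc == 0);
		(void)rc;
	}

	for (int i = 0; i < NUM_APS; i++)
		ap_fork(i, fake_init, &done[i]);
	ap_join_all(); /* all cores finish before the next stage starts */

	for (int i = 0; i < NUM_APS; i++) {
		count += done[i];
		atomic_store_explicit(&slots[i].state, AP_EXIT,
				      memory_order_release);
		pthread_join(threads[i], NULL);
	}
	return count;
}
```

Calling run_fork_join_demo() from a main() and building with -pthread is enough to try it. There is no scheduler and no queue: each core owns exactly one item until the join, which is about all the space-shared model needs.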
Further, adurbin's concurrency (thread) model is a very nice API.
ron