Hi there,
tl;dr: We are considering adding early parallel code execution in coreboot. We need to discuss how this can be done.
Nowadays we see firmware getting more complicated. At the same time, CPUs do not necessarily keep up. Furthermore, recent increases in performance can largely be attributed to parallelism and stuffing more cores on the die rather than sheer single-core computing power. However, firmware typically runs on just one CPU and is effectively barred from all the parallelism goodies available to OS software.
For example, Apollolake is struggling to finish firmware boot with all the bells and whistles (vboot, TPM and our friendly, ever-vigilant TXE) under one second. Interestingly, a great deal of the tasks that need to be done are not even computation-bound; they are IO-bound. In the case of the SDHCI work below, it is possible to train the eMMC link to switch from the default low-frequency single data rate mode (sdr50) to the high-frequency dual data rate mode (hs400). This link training increases eMMC throughput by a factor of 15-20. As a result, the time it takes to load the kernel in depthcharge goes down from 130ms to 10ms. However, the training sequence requires constant, frequent CPU attention. As a result, it doesn't make any sense to turn on the higher-frequency modes serially, because you don't get any net win. We also experimented with starting this work in the current MPinit code. Unfortunately, it starts pretty late in the game, and we do not get enough parallel time to reap a meaningful benefit.
In order to address this problem we can do the following things:
1. Add a scheduler, early or not
2. Add early MPinit code
For [1], I am aware of one scheduler discussion in 2013, but that was a long time ago and things may have moved a bit. I do not want to be a necromancer and reanimate an old discussion, but does anybody see it as a useful/viable thing to do?
For [2], we have been working on a prototype for Apollolake that does pre-memory MPinit. We've gotten to a stage where we can run C code on another core before DRAM is up (please do not try that at home, because you'd need custom experimental ucode). However, there are many questions about what model to use and how to create infrastructure to run code in parallel at such an early stage. Shall we just add a "run this (mini) stage on this core" concept? Or shall we add tasklet/worklet structures that would allow code to run and, when the migration to DRAM happens, have the infrastructure take care of managing the context and potentially resuming it? One problem is that code running in CAR needs to stop by the time the system is ready to tear down CAR and migrate to DRAM. We don't want to delay that by waiting on such a task to complete. At the same time, certain tasks may have widely fluctuating run times, so you would want to continue them. It may actually be possible to do just that if we use the same address space for CAR and DRAM. But come to think of it, this is just the tip of the iceberg, and there are plenty of other issues we would need to deal with.
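To make the question more concrete, here is a rough sketch of what the tasklet variant could look like. All names here (early_task, run_on_ap, and so on) are made up purely for illustration - nothing like this exists in coreboot today, and AP bring-up and memory-ordering details are glossed over:

#include <stddef.h>
#include <stdint.h>

/* Work item handed to an AP while the BSP keeps going. */
struct early_task {
	void (*fn)(void *arg);		/* work to run on the AP */
	void *arg;			/* driver-specific context */
	volatile uint32_t done;		/* set by the AP once fn returns */
};

/* Assumed primitive: wake the given AP and have it call t->fn(t->arg). */
void run_on_ap(unsigned int cpu, struct early_task *t);

static struct early_task emmc_task;

static void emmc_early_init(void *arg)
{
	(void)arg;
	/* e.g. the CMD1 loop and HS400 link training discussed below */
}

void start_early_storage_init(void)
{
	emmc_task.fn = emmc_early_init;
	emmc_task.arg = NULL;
	emmc_task.done = 0;
	run_on_ap(1, &emmc_task);	/* hand the work to core 1 */
}

/*
 * Called right before CAR teardown. The open question above is whether
 * to block here, cancel the task, or migrate its context to DRAM.
 */
void wait_for_early_tasks(void)
{
	while (!emmc_task.done)
		;
}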
Does any of that make sense? Perhaps somebody has thought of this before? Let's see what other ways there may be to deal with this challenge.
thanks Andrey
On 01/25/2017 03:16 PM, Guvendik, Bora wrote:
Port the sdhci and mmc drivers from depthcharge to coreboot. The purpose is to speed up boot time by starting storage initialization on another CPU in parallel. On the Apollolake systems we checked, we found that the CPU can take up to 300ms sending CMD1s to the HW, so we can avoid this delay by parallelizing.
- Why not add this parallelization in the payload instead?
There is potentially more time to parallelize things in coreboot. Payload execution is much faster, so we don't get much parallel execution time.
- Why not send CMD1 once in coreboot to trigger power-up and let the HW initialize using only one CPU?
The JEDEC spec requires the CPU to keep sending CMD1s while the hardware is busy (section 6.4.3). We tested with real-world hardware, and it indeed didn't work with a single CMD1.
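For reference, the loop in question is roughly the following. This is only a sketch: mmc_send_cmd() and mdelay() stand in for whatever the real driver provides, and the OCR bit follows JESD84-B51:

#include <stdint.h>

#define MMC_CMD_SEND_OP_COND	1		/* CMD1 */
#define OCR_POWERUP_DONE	(1u << 31)	/* cleared while the device is still busy */

/* Assumed primitives; the real driver has its own equivalents. */
int mmc_send_cmd(unsigned int cmd, uint32_t arg, uint32_t *response);
void mdelay(unsigned int msecs);

static int emmc_wait_power_up(uint32_t ocr_arg, unsigned int timeout_ms)
{
	uint32_t ocr = 0;

	for (unsigned int elapsed = 0; elapsed < timeout_ms; elapsed++) {
		if (mmc_send_cmd(MMC_CMD_SEND_OP_COND, ocr_arg, &ocr))
			return -1;	/* command failed */
		if (ocr & OCR_POWERUP_DONE)
			return 0;	/* device finished power-up */
		mdelay(1);		/* still busy: the CPU stays in this loop and repeats CMD1 */
	}
	return -1;			/* timed out; in practice this loop can run for hundreds of ms */
}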
- Why did you port the driver from depthcharge?
I wanted to use a driver that is already proven, to avoid bugs. It is also easier to apply patches back and forth.
https://review.coreboot.org/#/c/18105
Thanks
Bora
Hello Andrey,
Does any of that make sense? Perhaps somebody has thought of this before? Let's see what other ways there may be to deal with this challenge.
No, it does not. What you are proposing, in fact, is to turn the boot loader into a quasi-OS (by adding a scheduler) with HW multithreading sans MMU (actually, dealing with two HW threads, for now). And you have chosen coreboot to implement this.
I would suggest that what you are proposing first be done in a true BIOS, so IBVs can work on this proposal and see how BIOS boot-up time improves from this parallelism. Besides, BIOS is much slower (UEFI BIOSes boot in the range of 30 seconds) and should be faster. And... BIOS is closed source, so there is a major business task that goes with it: project management and a few million USD to be spent on this project, paid for by Intel and the Intel BIOS vendors. ;-)
Besides, one never knows what the next challenge will be (repeating your words: *"...this is just the tip of the iceberg, and there are plenty of other issues we would need to deal with."*).
Very soon you'll run into shared HW resources, and then you'll need to implement semaphores, atomic operations, and God knows what else!?
My two cents (after all, this is only me, Zoran, an independent self-contributor), Zoran
Andrey Petrov wrote:
We are considering adding early parallel code execution in coreboot. We need to discuss how this can be done.
No - first we need to discuss *if* this should be done.
Nowadays we see firmware getting more complicated.
Sorry, but that's nonsense. Indeed MSFT is pushing more and more complicated requirements into the EFI/UEFI ecosystem, but that's their problem, not a universal one.
Your colleague wants to speed up boot time by moving storage driver code from the payload into coreboot proper, but in fact this goes directly against the design goals of coreboot, so here's a refresh:
* coreboot has *minimal* platform (think buses, not peripherals) initialization code
* A payload does everything further.
For example, Apollolake is struggling to finish firmware boot with all the bells and whistles (vboot, TPM and our friendly, ever-vigilant TXE) under one second. Interestingly, a great deal of the tasks that need to be done are not even computation-bound; they are IO-bound.
How much of that time is spent in the FSP?
scheduler
..
how to create infrastructure to run code in parallel at such an early stage
I think you are going in completely the wrong direction.
You want a scheduler, but that very clearly does not belong in coreboot.
Shall we just add a "run this (mini) stage on this core" concept? Or shall we add tasklet/worklet structures
Neither. The correct engineering solution is very simple - adapt FSP to fit into coreboot, instead of trying to do things the other way around.
This means that your scheduler lives in the payload. There is already precedent - SeaBIOS already implements multitasking.
this is just the tip of the iceberg
That's exactly why it has no place within coreboot, but belongs in a payload.
//Peter
Hi,
On 02/13/2017 06:05 AM, Peter Stuge wrote:
Andrey Petrov wrote:
Nowadays we see firmware getting more complicated.
Sorry, but that's nonsense. Indeed MSFT is pushing more and more complicated requirements into the EFI/UEFI ecosystem, but that's their problem, not a universal one.
I wish it were only MSFT. Chrome OS systems do a lot of work early on that is CPU-intensive, and there is waiting on secure hardware as well. Then there is the IO problem that the original patch tries to address.
Your colleague wants to speed up boot time by moving storage driver code from the payload into coreboot proper, but in fact this goes directly against the design goals of coreboot, so here's a refresh:
coreboot has *minimal* platform (think buses, not peripherals) initialization code
A payload does everything further.
This is a nice and clean design, no doubt about it. However, it is serial.
Another design goal of coreboot is to be fast. Do "be fast" and "be parallel" conflict?
For example, Apollolake is struggling to finish firmware boot with all the bells and whistles (vboot, TPM and our friendly, ever-vigilant TXE) under one second. Interestingly, a great deal of the tasks that need to be done are not even computation-bound; they are IO-bound.
How much of that time is spent in the FSP?
FSP is about 250ms grand total. However, that is not all that great if you compare it to the IO needed to load the kernel over SDHCI (130ms) and to initialize the eMMC device itself (100-300ms). Not to mention other IO-bound tasks that could very well be started in parallel early.
how to create infrastructure to run code in parallel at such an early stage
I think you are going in completely the wrong direction.
You want a scheduler, but that very clearly does not belong in coreboot.
Actually I am just interested in getting things to boot faster. It can be scheduling or parallel execution on secondary HW threads.
Shall we just add a "run this (mini) stage on this core" concept? Or shall we add tasklet/worklet structures
Neither. The correct engineering solution is very simple - adapt FSP to fit into coreboot, instead of trying to do things the other way around.
FSP definitely needs a lot of love to be more usable, I couldn't agree more. But if hardware needs to be waited on and your initialization process is serial, you will end up wasting time on polling while you could be doing something else.
This means that your scheduler lives in the payload. There is already precedent - SeaBIOS already implements multitasking.
Unfortunately, that is way too late to even make a dent in overall boot time.
Andrey
What you're asking for is a parallelized or multicore coreboot, IIUC.
We've done this before. I believe it was yhlu who implemented the multicore DRAM startup on K8 ca. 2005 or so. I implemented a proof-of-concept multi-core capability in coreboot in 2012. It was dead simple and based on work we did in the NIX kernel: a very basic fork/join model. Instead of halting after SMP startup, APs entered a state where they waited for work. It worked. It was not well received at the time. Maybe it's time to take a look at it again.
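For reference, that model is roughly this (names here are made up, not the actual 2012 code; barriers and per-CPU bring-up are omitted):

#include <stddef.h>

#define MAX_CPUS 4			/* stand-in for the platform's CPU count */

/* One mailbox per AP; fn == NULL means "no work pending". */
struct ap_mailbox {
	void (*volatile fn)(void *);
	void *arg;
};

static struct ap_mailbox mailbox[MAX_CPUS];

/* What each AP runs instead of halting after SMP startup. */
void ap_wait_for_work(unsigned int cpu)
{
	struct ap_mailbox *mb = &mailbox[cpu];

	while (1) {
		while (mb->fn == NULL)
			;		/* idle until the BSP posts work */
		mb->fn(mb->arg);
		mb->fn = NULL;		/* completion signal back to the BSP */
	}
}

/* BSP: "fork" work onto an AP... */
void ap_fork(unsigned int cpu, void (*fn)(void *), void *arg)
{
	mailbox[cpu].arg = arg;
	mailbox[cpu].fn = fn;		/* written last; acts as the go flag */
}

/* ...and "join" at a point where everything must be done, e.g. before CAR teardown. */
void ap_join(unsigned int cpu)
{
	while (mailbox[cpu].fn != NULL)
		;
}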
For your CAR case, all cores would have to finish before you moved into the DRAM stage. Is that really a problem? Based on your note, I don't think you need such a complex model as found in Linux, with tasklets and schedulers and such. A simple space-shared model ought to be sufficient.
Further, adurbin's concurrency (thread) model is a very nice API.
ron
On Mon, Feb 13, 2017 at 8:05 AM, Peter Stuge peter@stuge.se wrote:
Andrey Petrov wrote:
We are considering adding early parallel code execution in coreboot. We need to discuss how this can be done.
No - first we need to discuss *if* this should be done.
Nowadays we see firmware getting more complicated.
Sorry, but that's nonsense. Indeed MSFT is pushing more and more complicated requirements into the EFI/UEFI ecosystem, but that's their problem, not a universal one.
Your colleague wants to speed up boot time by moving storage driver code from the payload into coreboot proper, but in fact this goes directly against the design goals of coreboot, so here's a refresh:
coreboot has *minimal* platform (think buses, not peripherals) initialization code
A payload does everything further.
There's an inherent sequence point between coreboot and the payload. All of coreboot needs to complete prior to handing off execution to the payload. Everyone's boot-up process differs, but boot speed is something Chrome OS cares about very much. That's one of the reasons coreboot has been enlisted for Chrome OS products. By maintaining that delineation, boot speed can very much suffer. Pushing work out to another piece of software doesn't inherently reduce the total amount of work to be done. That's the current dilemma. Do we just throw our hands up and say things will continue to be slower? Or do we come up with solutions to the current problems we're seeing?
On 02/13/2017 01:19 AM, Andrey Petrov wrote: <snip>
For [2], we have been working on a prototype for Apollolake that does pre-memory MPinit. We've gotten to a stage where we can run C code on another core before DRAM is up (please do not try that at home, because you'd need custom experimental ucode).
In addition to the very valid points raised by others on this list, this note in particular is concerning. Whenever we start talking about microcode, we're talking about yet another magic black box that coreboot has no control over and cannot maintain. Adding global functionality that is so system specific in practice as to rely on microcode feature support is not something I ever want to see, unless perhaps the relevant portions of the microcode are open and maintainable by the coreboot project.
In a nutshell, this proposal would make it even harder for any low-level coreboot development on these systems to take place outside of Intel, and as one of the main coreboot contractors this "soft lockdown" is something we are strongly opposed to. Furthermore, I suggest looking at the AMD K8 memory init code -- some basic parallelism was introduced for memory clear, but in the end the improved boot speed was not a "killer feature" and had the side effect of making the code difficult to maintain, leaving the K8 support permanently broken as of this writing.
--
Timothy Pearson
Raptor Engineering
+1 (415) 727-8645 (direct line)
+1 (512) 690-0200 (switchboard)
https://www.raptorengineering.com
Hi,
On 02/13/2017 10:22 AM, Timothy Pearson wrote:
<snip>
For [2], we have been working on a prototype for Apollolake that does pre-memory MPinit. We've gotten to a stage where we can run C code on another core before DRAM is up (please do not try that at home, because you'd need custom experimental ucode).
In addition to the very valid points raised by others on this list, this note in particular is concerning. Whenever we start talking about microcode, we're talking about yet another magic black box that coreboot has no control over and cannot maintain. Adding global functionality that is so system specific in practice as to rely on microcode feature support is not something I ever want to see, unless perhaps the relevant portions of the microcode are open and maintainable by the coreboot project.
I am just talking about BIOS shadowing. This is a pretty standard feature, just that not every SoC implements it by default. Naturally, we would only be adding new code if it became publicly available. I believe shadowing works on many existing CPUs, so no, it is not "use this custom NDA-only ucode" to get the stuff working.
Andrey
2017-02-13 8:19 GMT+01:00 Andrey Petrov andrey.petrov@intel.com:
tl;dr: We are considering adding early parallel code execution in coreboot. We need to discuss how this can be done.
It's reasonable to discuss the "if" first.
Nowadays we see firmware getting more complicated.
The coreboot mantra isn't just "boot fast", but also "boot simple".
On your "scheduler or MPinit" question, _if_ we have to go down that route: I'd prefer a cooperative threaded single core scheduler, for one simple reason: it's easier to reason about the correctness of code that only ever ceases control at well-defined yield points. As you said, those tasks are not CPU bound. We also don't need experimental ucode for that even when running threads in CAR ;-)
Patrick
On 13.02.2017 08:19, Andrey Petrov wrote:
For example, Apollolake is struggling to finish firmware boot with all the bells and whistles (vboot, TPM and our friendly, ever-vigilant TXE) under one second.
Can you provide exhaustive figures on which parts of this system's boot process take how long? That would make it easier to reason about where "parallelism" would provide a benefit.
In order to address this problem we can do following things:
- Add scheduler, early or not
Yes, but it really doesn't fit into the coreboot idea, IMHO.
- Add early MPinit code
No? Um, at best very limited (by the number of threads the hardware supports).
For [2], we have been working on a prototype for Apollolake that does pre-memory MPinit. We've gotten to a stage where we can run C code on another core before DRAM is up (please do not try that at home, because you'd need custom experimental ucode). However, there are many questions about what model to use and how to create infrastructure to run code in parallel at such an early stage. Shall we just add a "run this (mini) stage on this core" concept? Or shall we add tasklet/worklet structures that would allow code to run and, when the migration to DRAM happens, have the infrastructure take care of managing the context and potentially resuming it? One problem is that code running in CAR needs to stop by the time the system is ready to tear down CAR and migrate to DRAM. We don't want to delay that by waiting on such a task to complete. At the same time, certain tasks may have widely fluctuating run times, so you would want to continue them. It may actually be possible to do just that if we use the same address space for CAR and DRAM. But come to think of it, this is just the tip of the iceberg, and there are plenty of other issues we would need to deal with.
Sounds very scary, as if it would never fit, no matter how hard you push. If you really think we should do something in parallel across coreboot stages, it might be time to redesign the whole thing across stages.
As long as there is a concept involving romstage/ramstage, we should keep it to one thing in romstage: getting DRAM up. If this needs a clumsy blob, then accept its time penalty.
Does any of that make sense? Perhaps somebody has thought of this before? Let's see what other ways there may be to deal with this challenge.
3. Design a driver architecture that doesn't suffer from io-waiting
This is something I kept in mind for payloads for some time now, but it could also apply to later coreboot stages: Instead of busy waiting for i/o, a driver could yield execution until it's called again. Obviously, this only helps if there is more than one driver running in "parallel". But it scales much better than one virtual core per driver...
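As a sketch (hypothetical, not existing coreboot code), an eMMC bring-up could look like a small state machine that does one bounded chunk of work per call and tells the caller whether it needs to be called again:

#include <stdbool.h>

/*
 * Hypothetical eMMC bring-up as a state machine: one short, non-blocking
 * step per call instead of a CMD1/tuning busy loop. The real register
 * accesses are elided; only the control flow is shown.
 */
enum emmc_state { EMMC_SEND_CMD1, EMMC_AWAIT_READY, EMMC_TUNE_HS400, EMMC_READY };

struct emmc_ctx {
	enum emmc_state state;
};

/* Returns true once the device is fully initialized; each call is cheap,
 * so the boot flow can do other work between calls. */
bool emmc_poll(struct emmc_ctx *c)
{
	switch (c->state) {
	case EMMC_SEND_CMD1:
		/* issue CMD1 without waiting for the busy bit to clear */
		c->state = EMMC_AWAIT_READY;
		return false;
	case EMMC_AWAIT_READY:
		/* check OCR: either loop back to EMMC_SEND_CMD1 or move on */
		c->state = EMMC_TUNE_HS400;
		return false;
	case EMMC_TUNE_HS400:
		/* perform one chunk of HS400 link training */
		c->state = EMMC_READY;
		return false;
	case EMMC_READY:
	default:
		return true;
	}
}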
Another idea just popped up: Performing "background" tasks in udelay() / mdelay() implementations ;)
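A sketch of that idea (again hypothetical names; now_us() is an assumed monotonic timer): drivers register a small poll hook, and the delay loop runs the hooks instead of pure busy-waiting:

#include <stdint.h>

#define MAX_BG_TASKS 4

static void (*bg_task[MAX_BG_TASKS])(void);
static unsigned int bg_tasks;

void bg_task_register(void (*poll)(void))
{
	if (bg_tasks < MAX_BG_TASKS)
		bg_task[bg_tasks++] = poll;
}

/* Assumed: a monotonic microsecond timestamp source. */
uint64_t now_us(void);

void udelay(unsigned int usecs)
{
	uint64_t deadline = now_us() + usecs;

	do {
		for (unsigned int i = 0; i < bg_tasks; i++)
			bg_task[i]();	/* each hook must itself be short and non-blocking */
	} while (now_us() < deadline);
}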
I guess there are many more, maybe some viable, approaches to solve it with only one thread of execution.
Anyway, I'd rather see this parallelism in payloads. Another thought: If there is something in coreboot that really slows booting down, maybe that could be moved into the payload?
Nico
On Mon, Feb 13, 2017 at 11:17 AM Nico Huber nico.h@gmx.de wrote:
Another idea just popped up: Performing "background" tasks in udelay() / mdelay() implementations ;)
that is adurbin's threading model. I really like it.
A lot of times, concurrency will get you just as far as ||ism without the nastiness.
But if we're going to make a full-up kernel for ROM, my suggestion is we could start with a real kernel, perhaps Linux. We could then rename coreboot to, say, LinuxBIOS.
ron
On Mon, Feb 13, 2017 at 1:16 PM, Nico Huber nico.h@gmx.de wrote:
<snip>
Anyway, I'd rather see this parallelism in payloads. Another thought: If there is something in coreboot that really slows booting down, maybe that could be moved into the payload?
I don't think things are as simple as that with the current solution for these platforms. FSP very much complicates things because the execution context is lost on the transfer. But it's actually worse than that, because resource allocation is dependent on the presence of PCI devices. If those disappear or appear after resource allocation, then the IO map is not so hot. Things are definitely tightly coupled, so it's not clear to me that punting everything to a payload makes everything better.
FWIW, I've provided feedback on FSP and its current deficiencies. However, FSP does allow one to ship products without having to deal with UEFI as the firmware solution, and to ensure all the correct hardware tuning is done, since that's the only place Intel supports documenting/maintaining correct initialization sequences. It's definitely a predicament if one wants to continue shipping products on new hardware.
Hi,
On 02/13/2017 11:16 AM, Nico Huber wrote:
On 13.02.2017 08:19, Andrey Petrov wrote:
For example, Apollolake is struggling to finish firmware boot with all the bells and whistles (vboot, TPM and our friendly, ever-vigilant TXE) under one second.
Can you provide exhaustive figures on which parts of this system's boot process take how long? That would make it easier to reason about where "parallelism" would provide a benefit.
Such data is available. Here is a boot chart I drew a few months back: http://imgur.com/a/huyPQ
I color-coded different work types. (Some blocks are coded incorrectly, please bear with me.)
So what we can see is that everything is serial and there is a great deal of waiting. For that specific SDHCI case, you can see the "Storage device initialization" that happens in depthcharge. That is the CMD1 that you need to keep sending to the controller. As you can see, it completes in 130ms. Unfortunately, you really can't just send CMD1 and go about your business. You need to poll the readiness status and keep sending CMD1 again and again. Also, it is not always 130ms. It tends to vary, and the worst case we have seen was over 300ms. Another one is "kernel read", which is pure IO and takes 132ms. If you invest some 300ms in training the link to HS400 (which has to happen on every boot on every board), you can read it in just 10ms. Naturally, you can't see HS400 in the picture, because enabling it late in the boot flow would be counterproductive.
That's essentially the motivation for why we are looking into starting this CMD1 and HS400 link training as early as possible. However, fixing this particular issue is just a "per-platform" fix. I was hoping we could come up with a model that adds parallelism as a generic, reusable feature, not just a quick hack.
Andrey
So what we can see is that everything is serial and there is a great deal of waiting. For that specific SDHCI case, you can see the "Storage device initialization" that happens in depthcharge. That is the CMD1 that you need to keep sending to the controller. As you can see, it completes in 130ms. Unfortunately, you really can't just send CMD1 and go about your business. You need to poll the readiness status and keep sending CMD1 again and again. Also, it is not always 130ms. It tends to vary, and the worst case we have seen was over 300ms.
Do you actually have an eMMC part that requires repeating CMD1 within a certain bounded time interval? What happens if you violate that? Does it just not progress initialization or does it actually fail in some way?
I can't find any official documentation suggesting that this is really required. JESD84-B51 just says (6.4.3): "The busy bit in the CMD1 response can be used by a device to tell the host that it is still working on its power-up/reset procedure (e.g., downloading the register information from memory field) and is not ready yet for communication. In this case the host must repeat CMD1 until the busy bit is cleared." This suggests that the only point of the command is polling for readiness.
Another one is "kernel read", which is pure IO and takes 132ms. If you invest some 300ms in training the link (has to happen on every boot on every board) to HS400 you can read it in just 10ms. Naturally you can't see HS400 in the picture because enabling it late in the boot flow would be counter productive.
Have you considered implementing HS400-ES (enhanced strobe) support in your host controllers? That feature allows you to run at HS400 speeds immediately without any tuning (by essentially turning the clock master around and having the device pulse its own clock when it's sending data IIRC). We've had great success improving boot speed with that on a different Chrome OS platform. This won't help you for your current generation of SoCs yet, but at least it should resolve the tuning issue in the long run as this feature becomes more standard (so this issue shouldn't actually get worse and worse in the future... it should go away again).