On Tue, Feb 14, 2017 at 1:07 PM, Patrick Georgi <pgeorgi@google.com> wrote:
2017-02-14 17:12 GMT+01:00 Aaron Durbin via coreboot <coreboot@coreboot.org>:
For an optimized boot flow, all the pieces of work that need to be done pretty much have to be closely coupled. One needs to globally optimize the full sequence.
Like initializing slow hardware even before RAM init (as long as it's just an initial command)? How about using PIT/IRQ0 plus some tiny register-only interrupt routine to do trivial register wrangling (we do have register scripts)?
I don't think I properly understand your suggestion. For this particular eMMC case are you suggesting taking the PIT interrupt and doing the next piece of work in it?
that we seem to absolutely need to maintain boot speeds. Is Chrome OS going against the tide of coreboot wanting to solve those sorts of issues?
The problem is that two basic qualities collide here: speed and simplicity. The effect is that people ask to stop for a second and reconsider the options. MP init and parallelism have been the "go to" solution for all performance-related issues of the last 10 years, but they're not without cost. Questioning this approach doesn't mean that we shouldn't go there at all, just that the obvious answers might not lead to simple solutions.
As Andrey stated elsewhere, we're far from CPU bound.
Agreed. But our work is chunked very coarsely. I think the other-CPU path is an attempt to work around the coarseness of the work steps in the dependency chain.
For his concrete example: does eMMC init fail if you ping it more often than every 10ms? It had better not: you already stated that it's hard to guarantee those 10ms, so there needs to be some spare room. We could look at the largest chunk of the init process that could be restructured to implement cooperative multithreading on a single core for as many tasks as possible, to cut down on all those udelays (or even mdelays). Maybe we could even build a compiler plugin to ensure at compile time that the resulting code is well behaved (loops either have low bounds or yield, and yield()/sched()/... aren't called within critical sections)...
That's a possibility, but you have to solve the case for each combination of hardware present and/or per platform. Building up the dependency chain is the most important piece, and from there ensuring that execution context is not lost for longer than a set amount of time. We're miles away from that, since everything runs to completion serially right now.
Once we leave that scheduling to physics (i.e. enabling multicore operation), all bets are off (or we have to synchronize the execution to a degree that we could just as well do it manually). That's a lot of complexity just to have 8 times the CPU power for the same amount of IO-bound tasks.
Patrick