Peter,
Thank you so much for doing such thorough research and measurements! And for investing so much of your time and effort into improving flashrom! This is so impressive <3
To the point: I read everything in full, and I have no objections. I am wondering, since this is a big chunk of work, how we might plan it into phases, but this can be discussed later.
Also interested to know what other people think. -- Anastasia
On Mon, Mar 25, 2024 at 5:15 PM Peter Marheine pmarheine@chromium.org wrote:
Proposed changes in https://review.coreboot.org/c/flashrom/+/80806 have gotten me thinking about the problems with flashrom's current delay implementation, following which I've got a few ideas about how it can be improved. If other developers agree that these changes are reasonable, I think I'll go ahead and work on them.
Although I've measured some data, I haven't yet attempted to quantify the overall performance effect of these changes on flashrom, because that would require implementing them. I do expect to gather that data for an actual implementation.
## Proposals
- Remove support for delaying inside flashrom using a calibrated busy-loop. Delete `myusec_delay` and `myusec_calibrate_delay`.
- Remove probing for OS timer resolution. Delete `clock_check_res` and `measure_os_delay_resolution`. Always use `clock_gettime`-based timing for busy-loops, or `gettimeofday` if `clock_gettime` is unavailable.
- Make the threshold in `default_delay` that chooses between sleeping and busy-looping configurable at build time. Reduce the default from 100ms to 100μs.
- Automatically attempt to run flashrom at realtime priority when supported by the system, to minimize timing error.
## Rationale
Each of these proposed changes fits in with the others to simplify how flashrom delays while waiting for work to complete, but their reasons are largely independent.
### Calibrated busy-loop removal
Flashrom's calibrated busy-loop is as old as flashrom itself, with its principle of operation first being established before 2003; the first commit to the flashrom repository included a version of `myusec_calibrate_delay` which attempted to count the number of iterations of an empty loop that the CPU could execute in 250ms. That initial version searched by doubling its delay time for each attempt, so delays could have been up to twice the length specified.
On modern systems, it is a grave mistake to assume that any CPU-based calibration process can accurately measure real time by counting the number of instructions executed (or the number of loop iterations completed). Runtime frequency changes have become much more common in the intervening 20 years, and if the CPU frequency changes while flashrom is running, the calibration becomes wrong.
On systems where all processors have the same maximum frequency, delay loop calibration tends to be fairly accurate: the current implementation calibrates for 1 second, which is at least as long as most systems take to reach their maximum clock speed under load. Short delays could be much longer than expected if the CPU clocked back down after calibration, but the calibrated delay would generally be close to correct at maximum speed, so it would rarely delay for less time than expected.
Today's CPUs are even more heterogeneous, however. ARM machines have used heterogeneous multiprocessors since about 2011, and recent x86 systems from both Intel and AMD can also have different maximum clock speeds for different cores. On these machines, a delay loop calibrated on a CPU with a lower maximum frequency could delay for too short a time when run on a different CPU with a higher maximum speed. Even SMT systems are affected: a logical CPU may run faster when the sibling thread on its physical core is idle!
Although it may be possible to compensate for heterogeneous multiprocessors (for example by calibrating delays for each CPU in the system, or by pinning flashrom to a single CPU using OS APIs), doing so would either increase runtime costs further or add to the maintenance burden of the flashrom maintainers. Because it is difficult to ensure a calibrated delay loop is actually correct, we should never use one.
### Ignoring OS-reported timer resolution
On Linux, `clock_getres` is documented to lie sometimes: a "tickless" kernel can generally report times with nanosecond resolution, but `clock_getres` reports resolution of around 1 millisecond. This confounds flashrom, causing it to use the calibrated delay loop instead of OS timing.
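To illustrate, a minimal standalone check (assuming Linux and `CLOCK_MONOTONIC`) can print the resolution `clock_getres` reports next to the smallest step `clock_gettime` actually observes:

```c
#include <stdio.h>
#include <time.h>

/* Compare the resolution clock_getres() reports with the smallest
 * nonzero step clock_gettime() actually observes. On some kernel
 * configurations the former is far coarser than the latter. */
int main(void)
{
	struct timespec res, a, b;

	clock_getres(CLOCK_MONOTONIC, &res);
	printf("reported resolution: %ld ns\n",
	       (long)(res.tv_sec * 1000000000L + res.tv_nsec));

	/* Read the clock back-to-back until the value changes. */
	clock_gettime(CLOCK_MONOTONIC, &a);
	do {
		clock_gettime(CLOCK_MONOTONIC, &b);
	} while (a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec);

	printf("observed step: %ld ns\n",
	       (long)((b.tv_sec - a.tv_sec) * 1000000000L
	              + (b.tv_nsec - a.tv_nsec)));
	return 0;
}
```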
Because a CPU busy-loop can never be assumed to be accurately calibrated on modern machines without significant effort, the reported timer resolution is useless as a signal for choosing a busy-loop instead: there is no trustworthy busy-loop to fall back to. We should instead use whatever OS timing is available.
`clock_gettime` is preferred because it offers nanosecond resolution (versus `gettimeofday`'s microsecond resolution) and provides clock options that are never discontinuous (for example due to daylight saving time changes or leap seconds); it has been mandatory since POSIX.1-2008. `gettimeofday` is specified as early as System V r4 but is deprecated in POSIX.1-2008; some systems that we wish to support may lack `clock_gettime`, so it remains a useful fallback. Between these two functions, all platforms that flashrom wishes to support should be covered.
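A sketch of what the remaining clock-based busy-wait could look like under these assumptions (`HAVE_CLOCK_GETTIME` is an illustrative build-system define, and `now_us`/`busy_wait_us` are hypothetical names, not flashrom's actual API):

```c
#include <time.h>
#include <sys/time.h>

/* Return a monotonically increasing timestamp in microseconds.
 * HAVE_CLOCK_GETTIME is an illustrative build-system define. */
static unsigned long long now_us(void)
{
#ifdef HAVE_CLOCK_GETTIME
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (unsigned long long)ts.tv_sec * 1000000ULL
	       + (unsigned long long)ts.tv_nsec / 1000;
#else
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return (unsigned long long)tv.tv_sec * 1000000ULL
	       + (unsigned long long)tv.tv_usec;
#endif
}

/* Busy-wait by polling the OS clock rather than counting loop
 * iterations, so CPU frequency changes cannot skew the delay. */
static void busy_wait_us(unsigned int usecs)
{
	const unsigned long long deadline = now_us() + usecs;
	while (now_us() < deadline)
		;
}
```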
### Reduced sleep thresholds
On modern machines, the current 100 millisecond maximum busy-wait time is an eternity (and until recently it was 1 second, which is even more ridiculous). We can save a lot of power by giving up the CPU more readily, but choosing a threshold is somewhat challenging.
Modern x86 systems have a typical context switch latency (suspending one process, deferring to the kernel, and restarting another process) of around 2 microseconds, so this is a reasonable starting point for the maximum time to busy-wait: it's unlikely that another process can do anything useful in less than 2 microseconds, or that the OS can suspend the CPU to save energy in that time.
A survey of delay calls within flashrom shows they span a large range, from 1 microsecond up to a whole second. By inspection, 10 microseconds seems to be a common delay when polling for completion of some operation, as sketched below. A few delay users document how long the polled operation is usually expected to take, but most don't, so it's difficult to tell how accurate the delays actually need to be in order to minimize wasted time.
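Such polling typically has the following shape (an illustrative sketch; `read_status_register`, `delay_us`, and `STATUS_BUSY` are hypothetical stand-ins, not flashrom's actual API):

```c
/* Hypothetical stand-ins for programmer-specific code. */
extern unsigned char read_status_register(void); /* read the chip's status byte */
extern void delay_us(unsigned int usecs);        /* the delay entry point */

#define STATUS_BUSY 0x01 /* hypothetical busy bit */

/* Wait for the chip to clear its busy flag, polling every 10 us. */
static int wait_until_ready(unsigned int timeout_us)
{
	while (read_status_register() & STATUS_BUSY) {
		if (timeout_us < 10)
			return -1; /* timed out */
		delay_us(10);      /* the common 10 us poll interval */
		timeout_us -= 10;
	}
	return 0;
}
```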
#### Wakeup latency measurements
To evaluate how much timing slop might be introduced by sleeping instead of busy-looping for short delays (less than 1 millisecond), I measured how much longer a program actually takes to resume from `nanosleep` than the requested wait time, on a selection of Linux machines. The systems were:
- Mediatek Kompanio 500 (MT8173). Four ARMv7 cores running at up to 2 GHz.
- Intel Xeon Gold 6154 (Xeon). Two sockets, each with 18 physical cores running 36 threads.
- Intel Celeron N3350 (APL). Two x86 cores running at 1.1 GHz, up to 2.4 GHz.
- AMD A4-9120C (AMD). Two x86 cores at 1.6 GHz, up to 2.4 GHz.
- Intel Core Ultra 5 115U (MTL). Eight x86 cores in three performance tiers at 0.7-1.5 GHz, up to 2.1-4.2 GHz.
I ran four tests, each 16k times, getting the elapsed time for each iteration by calling `clock_gettime` immediately before and after the operation under test. The minimum, maximum, and mean times for each test were recorded. On all systems the timer resolution reported by `clock_getres` was 1 nanosecond. The tests were:
- nop: do nothing
- sched_yield: call sched_yield() to attempt to run a different task without specifying a wait time
- nanosleep 1u: call nanosleep() with a requested wait time of 1 microsecond
- nanosleep 100u: call nanosleep() with a requested wait time of 100 microseconds
The nop case provides baseline information on how long it takes just to measure an operation's duration. sched_yield approximates syscall and scheduling overhead, under the assumption that the OS will run its scheduler to see if another process wants the CPU. If one does, the measured delay could be much longer because that other process may get the CPU first.
The two nanosleep cases allow us to see how long a process that sleeps actually suspends for, and looking at two different delays is useful in case the OS tends to get sloppier if we request longer delays.
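A minimal sketch of this measurement method (not the exact harness used), shown here for the 1-microsecond `nanosleep` case:

```c
#include <stdio.h>
#include <time.h>

#define ITERATIONS 16384 /* 16k runs per test, as described above */

/* Nanoseconds elapsed between two clock_gettime() readings. */
static long long elapsed_ns(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000000LL
	       + (b->tv_nsec - a->tv_nsec);
}

int main(void)
{
	const struct timespec req = { .tv_sec = 0, .tv_nsec = 1000 }; /* 1 us */
	struct timespec before, after;
	long long min = -1, max = 0, total = 0;

	for (int i = 0; i < ITERATIONS; i++) {
		clock_gettime(CLOCK_MONOTONIC, &before);
		nanosleep(&req, NULL); /* the operation under test */
		clock_gettime(CLOCK_MONOTONIC, &after);

		long long ns = elapsed_ns(&before, &after);
		total += ns;
		if (min < 0 || ns < min)
			min = ns;
		if (ns > max)
			max = ns;
	}

	printf("mean %.2f us, min %.2f us, max %.2f us\n",
	       total / (double)ITERATIONS / 1000.0,
	       min / 1000.0, max / 1000.0);
	return 0;
}
```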
In the following table, measured times are in microseconds, listed as mean / min / max.
| Machine / Test | nop | sched_yield | nanosleep 1u | nanosleep 100u |
|----------------|---------------------|-------------------|--------------------|-----------------------|
| AMD | 0.1 / 0.06 / 38.3 | 0.6 / 0.6 / 100.1 | 64.5 / 6.4 / 2151 | 106.3 / 104.2 / 9619 |
| MT8173 | 0.1 / 0.07 / 16.7 | 1.8 / 1.5 / 251.9 | 82.4 / 7.8 / 3051 | 195.9 / 109.3 / 5616 |
| APL | 0.05 / 0.04 / 14.3 | 0.7 / 0.6 / 51.3 | 57.2 / 7.1 / 1236 | 217.6 / 108.4 / 10430 |
| Xeon | 0.02 / 0.02 / 0.07 | 1.7 / 1.4 / 73.2 | 57.8 / 11.3 / 618 | 157.8 / 106.9 / 1299 |
| MTL | 0.05 / 0.03 / 101.9 | 0.4 / 0.4 / 7.1 | 53.0 / 4.6 / 139 | 151.7 / 101.5 / 245 |
Latency in excess of the requested wait time is fairly consistent: on average about 60 microseconds, and within as little as 10% of the requested time for 100-microsecond delays. Very short delays tend to fare worse, and the average case is notably better on systems with more CPU cores, probably because with fewer cores the CPU is more likely to be given to another process, causing long delays while that process runs or the OS performs some uninterruptible operation.
These numbers are similar to the ones reported in https://review.coreboot.org/c/flashrom/+/80808, which lends credibility to this measurement method. High-resolution timers there improved minimum latency slightly, but as we'll see next, different scheduling improves the situation more.
I also ran the tests with realtime scheduling policy (as in `sched_setscheduler(SCHED_FIFO)`) using `chrt --fifo 1` to run at the minimum priority within the FIFO real-time class. This is expected to reduce latency both by giving this process priority over all other processes on the system, and by allowing the OS to use a simpler scheduling algorithm when ultimately choosing to restart the process under test.
In this table the nop and sched_yield tests are omitted because they are not as informative as the actual nanosleep numbers.
| Machine / Test | nanosleep 1u | nanosleep 100u |
|----------------|----------------------|------------------------|
| AMD | 9.2 / 4.8 / 18.6 | 129.1 / 104.7 / 512.2 |
| MT8173 | 13.5 / 6.8 / 1505 | 117.2 / 106.7 / 1022 |
| APL | 5.4 / 4.8 / 45.2 | 106.3 / 104.2 / 452.3 |
| Xeon | 4.4 / 4.2 / 60.2 | 104.9 / 103.0 / 164.9 |
| MTL | 2.4 / 2.1 / 74.2 | 102.4 / 101.2 / 123.0 |
At real-time priority, 1-microsecond sleeps appear to pay little more than the cost of a context switch, and 100-microsecond sleeps are more consistent than they were at normal priority: the average case is now accurate to better than 10% on the faster machines, with the slower ones not doing much worse.
#### Practical sleep thresholds
The existing code in flashrom that uses `clock_getres` rejects OS timers as a timing source if the reported resolution is coarser than 100 nanoseconds. Since the delay functions operate with microsecond precision, this suggests that it targets 10% accuracy for short waits and significantly better than that for long ones.
Because we see that 10% accuracy is feasible on most systems for 100 microsecond sleeps, this seems like a reasonable minimum time to sleep for. Even on systems where accuracy is worse than 10%, use of `sleep`-style APIs guarantees that delays will never be less than the specified time: this approach will never malfunction, but unexpectedly long delays could make flash programming take longer than it would otherwise.
Not all users or distributors will agree with this decision. Some very slow or old machines might achieve much worse precision such that it's considered worthwhile to busy-wait more aggressively (in order to complete programming more quickly), while some users might prefer to take every opportunity to save power by always sleeping. As a result, I believe this threshold should be made a compile-time option defaulting to 100 microseconds (although the final choice for the default can easily be adjusted).
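As a sketch, the threshold could be wired up roughly as follows (`CONFIG_SLEEP_THRESHOLD_US` is a hypothetical name for the build-time option, and `internal_sleep`/`busy_wait_us` stand in for the real sleep and busy-wait paths):

```c
/* Hypothetical build-time knob; 100 us by default per the discussion above. */
#ifndef CONFIG_SLEEP_THRESHOLD_US
#define CONFIG_SLEEP_THRESHOLD_US 100
#endif

/* Stand-ins for the real sleep and busy-wait paths. */
extern void internal_sleep(unsigned int usecs);
extern void busy_wait_us(unsigned int usecs);

void default_delay(unsigned int usecs)
{
	if (usecs >= CONFIG_SLEEP_THRESHOLD_US)
		internal_sleep(usecs); /* yield the CPU; may overshoot a little */
	else
		busy_wait_us(usecs);   /* short delays: poll the OS clock */
}
```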
### Use of real-time scheduling
The accuracy of `sleep` APIs can be improved by using real-time scheduling classes as specified by POSIX, but enabling that for a given process usually requires special privileges. Many users run flashrom as the root user so it has the ability to make itself realtime, but there are also many applications where it need not run with root privileges.
Because realtime priority is not required for correct operation, flashrom should attempt to make itself realtime but continue even if unable to do so. An informational message should be emitted if that fails so users can be aware that granting flashrom additional privileges may yield better performance.
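A minimal sketch of that opt-in attempt, assuming POSIX `sched_setscheduler` (the function name and message wording are illustrative):

```c
#include <sched.h>
#include <stdio.h>

/* Try to move the current process into the lowest-priority slot of the
 * FIFO real-time class. Failure is expected when running without the
 * needed privileges, and is reported but never fatal. */
static void try_enable_realtime(void)
{
	struct sched_param param = {
		.sched_priority = sched_get_priority_min(SCHED_FIFO),
	};

	if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)
		fprintf(stderr, "Note: could not enable realtime scheduling; "
		                "delays may be less accurate.\n");
}
```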
In libflashrom, realtime scheduling should not be enabled by default because it is very difficult to reason about the possible effects of changing a host process' scheduling class. Programs using flashrom as a library should manage priority themselves.