Proposed changes in https://review.coreboot.org/c/flashrom/+/80806 have gotten me thinking about the problems with flashrom's current delay implementation, following which I've got a few ideas about how it can be improved. If other developers agree that these changes are reasonable, I think I'll go ahead and work on them.
Although I've measured some data, I haven't yet attempted to quantify the overall performance effect of these changes on flashrom, because that would require implementing them. I do expect to gather that data for an actual implementation.
## Proposals
1. Remove support for delaying inside flashrom using a calibrated busy-loop. Delete `myusec_delay` and `myusec_calibrate_delay`.
2. Remove probing for OS timer resolution. Delete `clock_check_res` and `measure_os_delay_resolution`. Always use `clock_gettime`-based timing for busy-loops, or `gettimeofday` if `clock_gettime` is unavailable.
3. Make the threshold in `default_delay` that chooses between sleeping and busy-looping configurable at build time. Reduce the default from 100ms to 100μs.
4. Automatically attempt to run flashrom at realtime priority when supported by the system, to minimize timing error.
## Rationale
Each of these proposed changes fits in with the others to simplify how flashrom delays while waiting for work to complete, but their reasons are largely independent.
### Calibrated busy-loop removal
Flashrom's calibrated busy-loop is as old as flashrom itself, with its principle of operation first being established before 2003; the first commit to the flashrom repository included a version of `myusec_calibrate_delay` which attempted to count the number of iterations of an empty loop that the CPU could execute in 250ms. That initial version searched by doubling its delay time for each attempt, so delays could have been up to twice the length specified.
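For illustration, the scheme looks roughly like this (a reconstruction from the description above, not the historical code; the function names and the use of `CLOCK_MONOTONIC` are mine):

```c
#include <time.h>

/* Run an empty loop, doubling the iteration count until it takes at least
 * 250ms, and use that count to derive delays afterwards. */
static void empty_loop(unsigned long count)
{
	volatile unsigned long i = count;
	while (i--)
		;	/* volatile keeps the compiler from deleting the loop */
}

static unsigned long calibrate_loops_per_250ms(void)
{
	unsigned long count = 1024;
	struct timespec start, end;

	for (;;) {
		clock_gettime(CLOCK_MONOTONIC, &start);
		empty_loop(count);
		clock_gettime(CLOCK_MONOTONIC, &end);

		long elapsed_us = (end.tv_sec - start.tv_sec) * 1000000L
				+ (end.tv_nsec - start.tv_nsec) / 1000;
		if (elapsed_us >= 250000)
			return count;	/* may overshoot by up to 2x, hence delays up to twice as long */
		count *= 2;
	}
}
```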
On modern systems, it is a grave mistake to assume that any CPU-based calibration process can accurately measure real time by counting the number of instructions executed (or the number of loop iterations completed). Runtime frequency changes have become much more common in the intervening 20 years, and if CPU frequency changes while flashrom is running then calibration becomes wrong.
On systems where all processors have the same maximum frequency, delay loop calibration tends to be fairly accurate because the current implementation calibrates for 1 second and this is at least as long as most systems take to reach their maximum clock speed under load. Short delays could be much longer than expected if the CPU clocked back down after calibration, but the calibrated delay would generally be close to correct at maximum speed so it would be unlikely to delay for less time than expected.
Today's CPUs are even more heterogeneous, however. ARM machines have used heterogeneous multiprocessors since about 2011, and recent x86 systems from both Intel and AMD can also have different maximum clock speeds for different cores. On these machines, a delay loop calibrated on a CPU with a lower maximum frequency could delay for too short a time when run on a different CPU with a higher maximum speed. Even SMT systems with a dynamic thread scheduling policy for each physical core might exhibit higher performance when a sibling thread is idle!
Although it may be possible to compensate for heterogeneous multiprocessors (for example by calibrating delays for each CPU in the system, or by pinning flashrom to a single CPU using OS APIs), doing so would either increase runtime costs further or add to the maintenance burden of the flashrom maintainers. Because it is difficult to ensure a calibrated delay loop is actually correct, we should never use one.
### Ignoring OS-reported timer resolution
On Linux, `clock_getres` is documented to lie sometimes: a "tickless" kernel can generally report times with nanosecond resolution, but `clock_getres` reports resolution of around 1 millisecond. This confounds flashrom, causing it to use the calibrated delay loop instead of OS timing.
Because a CPU busy-loop can never be assumed to be truly calibrated on modern machines without significant effort, reported timer resolution is useless as a signal to fall back to a busy-loop (there is no calibrated busy-loop to fall back to). We should instead use whatever OS timing is available.
`gettimeofday` has been specified since at least System V r4 but is deprecated in POSIX.1-2008, while `clock_gettime` is mandatory since POSIX.1-2008. `clock_gettime` is preferred because it offers nanosecond resolution (versus `gettimeofday`'s microsecond resolution) and provides clock options that are never discontinuous (for example due to daylight saving time or leap seconds), but some systems that we wish to support may lack it. Between these two functions, all platforms that flashrom wishes to support should be covered.
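A minimal sketch of what such a clock-based busy-wait could look like (the `HAVE_CLOCK_GETTIME` feature macro and these function names are illustrative, not existing flashrom identifiers):

```c
#include <time.h>
#include <sys/time.h>

/* Return the current time in nanoseconds from the best available source. */
static unsigned long long now_ns(void)
{
#if defined(HAVE_CLOCK_GETTIME)
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (unsigned long long)ts.tv_sec * 1000000000ull + ts.tv_nsec;
#else
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return (unsigned long long)tv.tv_sec * 1000000000ull + tv.tv_usec * 1000ull;
#endif
}

static void busy_wait_us(unsigned int usecs)
{
	const unsigned long long deadline = now_ns() + (unsigned long long)usecs * 1000;

	while (now_ns() < deadline)
		;	/* spin, but let the OS clock (not instruction counting) decide when to stop */
}
```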
### Reduced sleep thresholds
On modern machines, the current 100 millisecond maximum busy-wait time is an eternity (and until recently it was 1 second, which is even more ridiculous). We can save a lot of power by giving up the CPU more readily, but choosing a threshold is somewhat challenging.
Modern x86 systems have a typical context switch latency (suspending one process, deferring to the kernel, and resuming another) of around 2 microseconds, so this is a reasonable starting point for the maximum time to busy-wait: it's unlikely that another process can do anything useful in less than 2 microseconds, or that the OS can suspend the CPU to save energy in that time.
A survey of delay calls within flashrom shows that they span a large range, from 1 microsecond up to a whole second. By inspection, 10 microseconds is a common delay when polling for completion of some operation. A few delay call sites document how long the polled operation usually takes, but most don't, so it's difficult to tell how accurate the delays actually need to be in order to minimize wasted time.
#### Wakeup latency measurements
To evaluate how much timing slop might be introduced by sleeping instead of busy-looping for short delays (less than 1 millisecond), I measured how much longer a program actually takes to resume from `nanosleep` than the requested wait time on a selection of Linux machines. The systems were:
* Mediatek Kompanio 500 (MT8173). Four ARMv7 cores running at up to 2 GHz.
* Intel Xeon Gold 6154 (Xeon). Two sockets, each with 18 physical cores running 36 threads.
* Intel Celeron N3350 (APL). Two x86 cores running at 1.1 GHz, up to 2.4 GHz.
* AMD A4-9120C (AMD). Two x86 cores at 1.6 GHz, up to 2.4 GHz.
* Intel Core Ultra 5 115U (MTL). Eight x86 cores in three performance tiers at 0.7-1.5 GHz, up to 2.1-4.2 GHz.
I ran each of four tests 16k times, measuring the elapsed time of each iteration by calling `clock_gettime` immediately before and after the operation under test. The minimum, maximum, and mean times for each test were recorded. On all systems the timer resolution reported by `clock_getres` was 1 nanosecond.
* nop: do nothing
* sched_yield: call sched_yield() to attempt to run a different task without specifying a wait time
* nanosleep 1u: call nanosleep() with a requested wait time of 1 microsecond
* nanosleep 100u: call nanosleep() with a requested wait time of 100 microseconds
The nop case provides baseline information on how long it takes to measure the time it takes an operation to complete. sched_yield approximates syscall and scheduling overhead, under the assumption that the OS will run its scheduler to see if another process wants the CPU. If it does, the delay time could be greatly increased because another process may get the CPU.
The two nanosleep cases allow us to see how long a process that sleeps actually suspends for, and looking at two different delays is useful in case the OS tends to get sloppier if we request longer delays.
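A harness equivalent to the one described might look like the following (a simplified sketch, not the exact benchmark code; names and output format are mine):

```c
#include <stdio.h>
#include <time.h>

#define ITERATIONS (16 * 1024)

static double elapsed_us(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1e6 + (b->tv_nsec - a->tv_nsec) / 1e3;
}

int main(void)
{
	const struct timespec req = { .tv_sec = 0, .tv_nsec = 100 * 1000 };	/* 100us */
	struct timespec before, after;
	double min = 1e30, max = 0, sum = 0;

	for (int i = 0; i < ITERATIONS; i++) {
		clock_gettime(CLOCK_MONOTONIC, &before);
		nanosleep(&req, NULL);	/* swap in a nop or sched_yield() for the other tests */
		clock_gettime(CLOCK_MONOTONIC, &after);

		const double us = elapsed_us(&before, &after);
		if (us < min)
			min = us;
		if (us > max)
			max = us;
		sum += us;
	}
	printf("mean %.1f / min %.2f / max %.1f us\n", sum / ITERATIONS, min, max);
	return 0;
}
```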
In the following table, measured times are in microseconds, listed as mean / min / max.
| Machine / Test | nop | sched_yield | nanosleep 1u | nanosleep 100u |
|----------------|---------------------|-------------------|--------------------|-----------------------|
| AMD | 0.1 / 0.06 / 38.3 | 0.6 / 0.6 / 100.1 | 64.5 / 6.4 / 2151 | 106.3 / 104.2 / 9619 |
| MT8173 | 0.1 / 0.07 / 16.7 | 1.8 / 1.5 / 251.9 | 82.4 / 7.8 / 3051 | 195.9 / 109.3 / 5616 |
| APL | 0.05 / 0.04 / 14.3 | 0.7 / 0.6 / 51.3 | 57.2 / 7.1 / 1236 | 217.6 / 108.4 / 10430 |
| Xeon | 0.02 / 0.02 / 0.07 | 1.7 / 1.4 / 73.2 | 57.8 / 11.3 / 618 | 157.8 / 106.9 / 1299 |
| MTL | 0.05 / 0.03 / 101.9 | 0.4 / 0.4 / 7.1 | 53.0 / 4.6 / 139 | 151.7 / 101.5 / 245 |
Latency in excess of the requested wait time is fairly consistent: on average about 60 microseconds, and within as little as 10% of the requested time for 100 microsecond delays. Very short delays tend to fare worse, and the average case is notably better on systems with more CPU cores, probably because on smaller systems the CPU is more often given away, causing long delays while another process runs or the OS performs some uninterruptible operation.
These numbers are similar to the ones reported in https://review.coreboot.org/c/flashrom/+/80808, which lends credibility to this measurement method. High-resolution timers there improved minimum latency slightly, but as we'll see next, different scheduling improves the situation more.
---
I also ran the tests with realtime scheduling policy (as in `sched_setscheduler(SCHED_FIFO)`) using `chrt --fifo 1` to run at the minimum priority within the FIFO real-time class. This is expected to reduce latency both by giving this process priority over all other processes on the system, and by allowing the OS to use a simpler scheduling algorithm when ultimately choosing to restart the process under test.
In this table the nop and sched_yield tests are omitted because they are not as informative as the actual nanosleep numbers.
| Machine / Test | nanosleep 1u | nanosleep 100u |
|----------------|----------------------|------------------------|
| AMD | 9.2 / 4.8 / 18.6 | 129.1 / 104.7 / 512.2 |
| MT8173 | 13.5 / 6.8 / 1505 | 117.2 / 106.7 / 1022 |
| APL | 5.4 / 4.8 / 45.2 | 106.3 / 104.2 / 452.3 |
| Xeon | 4.4 / 4.2 / 60.2 | 104.9 / 103.0 / 164.9 |
| MTL | 2.4 / 2.1 / 74.2 | 102.4 / 101.2 / 123.0 |
At real-time priority, 1-microsecond sleeps appear to pay little more than the cost of a context switch, and 100-microsecond sleeps are more consistent than they were at normal priority: the average case is now accurate to better than 10% on the faster machines, with the slower ones not doing much worse.
#### Practical sleep thresholds
The existing code in flashrom that uses `clock_getres` rejects OS timers as a timing source if the reported resolution is coarser than 100 nanoseconds. Since the delay functions operate with microsecond precision, this suggests a target of 10% accuracy for short waits and significantly better than that for long ones.
Because we see that 10% accuracy is feasible on most systems for 100 microsecond sleeps, this seems like a reasonable minimum time to sleep for. Even on systems where accuracy is worse than 10%, use of `sleep`-style APIs guarantees that delays will never be less than the specified time: this approach will never malfunction, but unexpectedly long delays could make flash programming take longer than it would otherwise.
Not all users or distributors will agree with this decision. Some very slow or old machines might achieve much worse precision such that it's considered worthwhile to busy-wait more aggressively (in order to complete programming more quickly), while some users might prefer to take every opportunity to save power by always sleeping. As a result, I believe this threshold should be made a compile-time option defaulting to 100 microseconds (although the final choice for the default can easily be adjusted).
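To make the idea concrete, the dispatch could look something like this sketch (the macro name `DELAY_MIN_SLEEP_US` is illustrative, not an existing build option, and `busy_wait_us` refers to the clock-based spin sketched earlier):

```c
#include <time.h>

/* Build-time threshold for choosing sleep over busy-waiting. */
#ifndef DELAY_MIN_SLEEP_US
#define DELAY_MIN_SLEEP_US 100	/* proposed default: sleep for waits of 100us or more */
#endif

extern void busy_wait_us(unsigned int usecs);

void default_delay(unsigned int usecs)
{
	if (usecs >= DELAY_MIN_SLEEP_US) {
		struct timespec req = {
			.tv_sec = usecs / 1000000,
			.tv_nsec = (usecs % 1000000) * 1000L,
		};
		nanosleep(&req, NULL);	/* never wakes early (barring signals), possibly late */
	} else {
		busy_wait_us(usecs);
	}
}
```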
#### Use of real-time scheduling
The accuracy of `sleep` APIs can be improved by using real-time scheduling classes as specified by POSIX, but enabling that for a given process usually requires special privileges. Many users run flashrom as the root user so it has the ability to make itself realtime, but there are also many applications where it need not run with root privileges.
Because realtime priority is not required for correct operation, flashrom should attempt to make itself realtime but continue even if unable to do so. An informational message should be emitted if that fails so users can be aware that granting flashrom additional privileges may yield better performance.
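On Linux this could be as simple as the following sketch (the function name and message wording are illustrative):

```c
#include <sched.h>
#include <stdio.h>

/* Request the FIFO realtime class at its minimum priority, but carry on
 * (with a note to the user) if the OS refuses, e.g. for lack of privileges. */
static void try_raise_priority(void)
{
	struct sched_param param = {
		.sched_priority = sched_get_priority_min(SCHED_FIFO),
	};

	if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)
		fprintf(stderr, "Note: could not enable realtime scheduling; "
			"delays may be less accurate.\n");
}
```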
In libflashrom, realtime scheduling should not be enabled by default because it is very difficult to reason about the possible effects of changing a host process' scheduling class. Programs using flashrom as a library should manage priority themselves.
Peter,
Thank you so much for doing such thorough research and measurements! And for investing so much of your time and effort into improving flashrom! This is so impressive <3
To the point, I read everything in full, and I have no objections. I am wondering, since this is a big chunk of work, how to maybe plan it into phases: but this can be discussed later.
Also interested to know what other people think? -- Anastasia
On Wed, 27 Mar 2024 at 11:20, Anastasia Klimchuk aklm@chromium.org wrote:
Also interested to know what other people think?
I also agree with Peter, the "calibrated busy-loop" worked great on 16 bit DOS, but with modern operating systems and CPUs it just doesn't make sense.
Richard
I did find that DJGPP documents its clock resolution to be only 50ms (since DOS only provides a 20 Hz timer), so only using clock-based delays would be a very large (~500x) performance hit on DJGPP. I think we should keep a calibrated delay loop for DOS only, since we can expect other OSes to have much better timing support (and it seems less likely that a system will be using CPU frequency scaling of any kind when running DOS).
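Concretely, the carve-out could be guarded on the DJGPP toolchain's predefined `__DJGPP__` macro, along these lines (`myusec_delay` is the existing calibrated loop; `clock_based_delay` and the dispatch shape are illustrative):

```c
extern void myusec_delay(unsigned int usecs);		/* calibrated busy-loop */
extern void clock_based_delay(unsigned int usecs);	/* clock_gettime/nanosleep path */

void internal_delay(unsigned int usecs)
{
#if defined(__DJGPP__)	/* predefined by the DJGPP toolchain */
	myusec_delay(usecs);	/* ~50ms OS timer resolution makes clock-based delays useless */
#else
	clock_based_delay(usecs);
#endif
}
```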
As for performance changes from sleeping more aggressively, I've done some experimentation on an Intel system using the ichspi programmer and found that even reducing the sleep threshold to 10 microseconds doesn't affect overall performance, because almost every delay that gets used is 10 microseconds or less. Some grepping through the source tree suggests that delays tend to be either 1-10 microseconds or multiple milliseconds; for instance, almost every chip sets its probe delay to none, and only four have a probe delay of more than 10 microseconds. Since there are very few situations where we actually end up sleeping even with a 100 microsecond sleep threshold, I don't think it's important to use realtime priority. Especially given the risks involved (possibly making the system completely unresponsive if an error causes flashrom to get stuck), I no longer believe we should automatically try to use it, leaving it up to callers to select a high priority if desired.
I am wondering, since this is a big chunk of work, how to maybe plan it into phases: but this can be discussed later.
It actually doesn't seem like very much work to me. Removing the calibrated loop is pretty easy (I've already written a change that does that), and the rest is mostly fiddling to ensure the program can still build on weird/old platforms like DJGPP. You can expect a short chain of patches in fairly short order, unless I encounter unexpected difficulties. :)
On Wed, 27 Mar 2024 at 23:06, Peter Marheine pmarheine@chromium.org wrote:
I think we should keep a calibrated delay loop for DOS only
I'm sceptical anyone is actually using flashrom on DOS. Personally I'd drop it as a build target.
Richard.
I'm sceptical anyone is actually using flashrom on DOS. Personally I'd drop it as a build target.
It is still used, and we have build instructions. Someone recently built flashrom for DOS following the instructions. So even if it's not a very common use case, it's still a use case. We are not dropping it for now :)
You can expect a short chain of patches in fairly short order, unless I encounter unexpected difficulties. :)
Sounds great!
On Thu, 28 Mar 2024 at 10:11, Anastasia Klimchuk aklm@chromium.org wrote:
Someone recently built flashrom for DOS following the instructions.
I'm sure that's true; but do you want to keep around such legacy code (increasing the maintenance load) in the codebase forever? Do you want to test it for build-time and runtime regressions before you do each official release?
Richard
On 27.03.24 at 14:01, Richard Hughes wrote:
I also agree with Peter, the "calibrated busy-loop" worked great on 16 bit DOS, but with modern operating systems and CPUs it just doesn't make sense.
Well, it theoretically may not make sense, but the busy-loop waits were sometimes the only reasonable waiting solution on older non-SPI (LPC/FWH/parallel) flash chips.
For example, some chips use toggle bit detection to check for a finished write. In theory, toggle-bit works like this: read the same location over and over until the designated toggle bit stops changing, then you know your write is done. Now, some Winbond chips (AFAIR; it's been some years) have an additional requirement that is even mentioned in the datasheet: reread too quickly, and the bit will not toggle, falsely giving the impression that the write is done. On other chips, rereading too slowly would also fool the write-is-done detection. Those old machines hosting affected flash chips usually do not have any highly dynamic frequency scaling nor P-cores/E-cores, so the concerns about "doesn't make sense" don't apply here.
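In pseudo-C, such a toggle-bit poll looks roughly like this (an illustrative sketch of JEDEC DQ6-style polling, not flashrom's actual code; `chip_readb` here is a stand-in for the real chip accessor):

```c
#include <stdint.h>

extern uint8_t chip_readb(uintptr_t addr);

static void wait_for_write_done(uintptr_t status_addr)
{
	uint8_t prev = chip_readb(status_addr);

	for (;;) {
		const uint8_t cur = chip_readb(status_addr);
		/* DQ6 (bit 6) toggles on consecutive reads while a write is in
		 * progress; two equal reads mean the write finished. As noted
		 * above, some chips additionally require the two reads to be
		 * neither too close together nor too far apart, which is where
		 * delay accuracy matters. */
		if (((prev ^ cur) & 0x40) == 0)
			return;
		prev = cur;
	}
}
```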
You're of course free to declare those chips obsolete, but in that case it may make sense to drop support for them altogether (i.e. refuse write or remove from database, but keep an incompatibility notice) instead of giving people a false hope. Silent breakage may be OK for higher-level software, but it should never be intentionally done for software which can brick your machine.
That said, flashrom for DOS users still exist.
The easiest option for getting rid of all those concerns is to compile a SPI-only non-DOS flashrom and make any delay calibrations dependent on (non-SPI OR DOS). That won't simplify the code, though.
Dropping the DOS code and the non-SPI code is another option. I'm not advocating for that, but if you want to clean up code, that's the simplest way to get rid of all the various warts.
As a side note, it would be interesting to know what the main usage of flashrom is nowadays. I'd expect it to be mostly Chromebooks, possibly some other embedded boards. Focusing on the 99% might yield enormous cleanup opportunities.
Regards, Carl-Daniel
do you want to keep around such legacy code (increasing the maintenance load) in the codebase forever?
No, I don't want to keep such code around forever to maintain, mainly because it won't be needed forever. But dropping support for DOS will be a separate effort (with dedicated threads and announcements). I don't think it should be like "while we are here, let's also drop DOS"; it's a bigger effort than "while we are here".
For example, some chips use the toggle bit detection to check for a finished write.
Do you know whether such chips are marked in the flashchips definitions? How do we know which ones are like this? I understand from your words that this is a subset of the older non-SPI (LPC/FWH/parallel) flash chips, but which ones?
Also relevant to delays: are those the same chips that need extra delay 1s? For context, comments here https://review.coreboot.org/c/flashrom/+/80807 Do you maybe remember how to find them in flashchips?
As I understand, there are some small(?) number of older(?) chips which need special treatment on delays, it would be so helpful to find out which ones exactly.
Focusing on the 99% might yield enormous cleanup opportunities.
Yes, that's a really good observation. I have such thoughts in the background.
For completeness, the proposed change is at https://review.coreboot.org/c/flashrom/+/81545
I've tested it myself (and added a unit test) and am fairly confident everything works well, but if anybody has a "weirder" machine than an x86 PC to test on, additional coverage would be helpful.
I've tested compilation of https://review.coreboot.org/c/flashrom/+/81545 and reading, erasing, writing flash on a FreeBSD -CURRENT amd64 machine (ASUS F2A85-M). Seems to work as expected.