Proposed changes in https://review.coreboot.org/c/flashrom/+/80806 have gotten me thinking about the problems with flashrom's current delay implementation, following which I've got a few ideas about how it can be improved. If other developers agree that these changes are reasonable, I think I'll go ahead and work on them.
Although I've measured some data, I haven't yet attempted to quantify the overall performance effect of these changes on flashrom, because that would require implementing them. I do expect to gather that data for an actual implementation.
## Proposals
1. Remove support for delaying inside flashrom using a calibrated busy-loop. Delete `myusec_delay` and `myusec_calibrate_delay`.
2. Remove probing for OS timer resolution. Delete `clock_check_res` and `measure_os_delay_resolution`. Always use `clock_gettime`-based timing for busy-loops, or `gettimeofday` if `clock_gettime` is unavailable.
3. Make the threshold in `default_delay` that chooses between sleeping and busy-looping configurable at build time. Reduce the default from 100ms to 100μs.
4. Automatically attempt to run flashrom at realtime priority when supported by the system, to minimize timing error.
## Rationale
Each of these proposed changes fits in with the others to simplify how flashrom delays while waiting for work to complete, but their reasons are largely independent.
### Calibrated busy-loop removal
Flashrom's calibrated busy-loop is as old as flashrom itself, with its principle of operation first being established before 2003; the first commit to the flashrom repository included a version of `myusec_calibrate_delay` which attempted to count the number of iterations of an empty loop that the CPU could execute in 250ms. That initial version searched by doubling its delay time for each attempt, so delays could have been up to twice the length specified.
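For illustration, the scheme looks roughly like this (a reconstruction from the description above, not the historical code; the function names and the use of `CLOCK_MONOTONIC` are mine):

```c
#include <time.h>

/* Run an empty loop, doubling the iteration count until it takes at least
 * 250ms, and use that count to derive delays afterwards. */
static void empty_loop(unsigned long count)
{
	volatile unsigned long i = count;
	while (i--)
		;	/* volatile keeps the compiler from deleting the loop */
}

static unsigned long calibrate_loops_per_250ms(void)
{
	unsigned long count = 1024;
	struct timespec start, end;

	for (;;) {
		clock_gettime(CLOCK_MONOTONIC, &start);
		empty_loop(count);
		clock_gettime(CLOCK_MONOTONIC, &end);

		long elapsed_us = (end.tv_sec - start.tv_sec) * 1000000L
				+ (end.tv_nsec - start.tv_nsec) / 1000;
		if (elapsed_us >= 250000)
			return count;	/* may overshoot by up to 2x, hence delays up to twice as long */
		count *= 2;
	}
}
```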
On modern systems, it is a grave mistake to assume that any CPU-based calibration process can accurately measure real time by counting the number of instructions executed (or the number of loop iterations completed). Runtime frequency changes have become much more common in the intervening 20 years, and if CPU frequency changes while flashrom is running then calibration becomes wrong.
On systems where all processors have the same maximum frequency, delay loop calibration tends to be fairly accurate because the current implementation calibrates for 1 second and this is at least as long as most systems take to reach their maximum clock speed under load. Short delays could be much longer than expected if the CPU clocked back down after calibration, but the calibrated delay would generally be close to correct at maximum speed so it would be unlikely to delay for less time than expected.
Today's CPUs are even more heterogeneous, however. ARM machines have used heterogeneous multiprocessors since about 2011, and recent x86 systems from both Intel and AMD can also have different maximum clock speeds for different cores. On these machines, a delay loop calibrated on a CPU with a lower maximum frequency could delay for too short a time when run on a different CPU with a higher maximum speed. Even SMT systems with a dynamic thread scheduling policy for each physical core might exhibit higher performance when a sibling thread is idle!
Although it may be possible to compensate for heterogeneous multiprocessors (for example by calibrating delays for each CPU in the system, or by pinning flashrom to a single CPU using OS APIs), doing so would either increase runtime costs further or add to the maintenance burden of the flashrom maintainers. Because it is difficult to ensure a calibrated delay loop is actually correct, we should never use one.
### Ignoring OS-reported timer resolution
On Linux, `clock_getres` is documented to lie sometimes: a "tickless" kernel can generally report times with nanosecond resolution, but `clock_getres` reports resolution of around 1 millisecond. This confounds flashrom, causing it to use the calibrated delay loop instead of OS timing.
Because a CPU busy-loop can never be assumed to be truly calibrated on modern machines without significant effort, reported timer resolution is useless as a signal to fall back to a busy-loop (there is no calibrated busy-loop to fall back to). We should instead use whatever OS timing is available.
`gettimeofday` has been specified since at least System V r4 but is deprecated in POSIX.1-2008, while `clock_gettime` is mandatory since POSIX.1-2008. `clock_gettime` is preferred because it offers nanosecond resolution (versus `gettimeofday`'s microsecond resolution) and provides clock options that are never discontinuous (for example due to daylight saving time or leap seconds), but some systems that we wish to support may lack it. Between these two functions, all platforms that flashrom wishes to support should be covered.
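A minimal sketch of what such a clock-based busy-wait could look like (the `HAVE_CLOCK_GETTIME` feature macro and these function names are illustrative, not existing flashrom identifiers):

```c
#include <time.h>
#include <sys/time.h>

/* Return the current time in nanoseconds from the best available source. */
static unsigned long long now_ns(void)
{
#if defined(HAVE_CLOCK_GETTIME)
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (unsigned long long)ts.tv_sec * 1000000000ull + ts.tv_nsec;
#else
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return (unsigned long long)tv.tv_sec * 1000000000ull + tv.tv_usec * 1000ull;
#endif
}

static void busy_wait_us(unsigned int usecs)
{
	const unsigned long long deadline = now_ns() + (unsigned long long)usecs * 1000;

	while (now_ns() < deadline)
		;	/* spin, but let the OS clock (not instruction counting) decide when to stop */
}
```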
### Reduced sleep thresholds
On modern machines, the current 100 millisecond maximum busy-wait time is an eternity (and until recently it was 1 second, which is even more ridiculous). We can save a lot of power by giving up the CPU more readily, but choosing a threshold is somewhat challenging.
Modern x86 systems have a typical context switch latency (suspending one process, deferring to the kernel, and resuming another) of around 2 microseconds, so this is a reasonable starting point for the maximum time to busy-wait: it's unlikely that another process can do anything useful in less than 2 microseconds, or that the OS can suspend the CPU to save energy in that time.
A survey of delay calls within flashrom shows that they span a large range, from 1 microsecond up to a whole second. By inspection, 10 microseconds is a common delay when polling for completion of some operation. A few delay call sites document how long the polled operation usually takes, but most don't, so it's difficult to tell how accurate the delays actually need to be in order to minimize wasted time.
#### Wakeup latency measurements
To evaluate how much timing slop might be introduced by sleeping instead of busy-looping for short delays (less than 1 millisecond), I measured how much longer a program actually takes to resume from `nanosleep` than the requested wait time on a selection of Linux machines. The systems were:
* Mediatek Kompanio 500 (MT8173). Four ARMv7 cores running at up to 2 GHz.
* Intel Xeon Gold 6154 (Xeon). Two sockets, each with 18 physical cores running 36 threads.
* Intel Celeron N3350 (APL). Two x86 cores running at 1.1 GHz, up to 2.4 GHz.
* AMD A4-9120C (AMD). Two x86 cores at 1.6 GHz, up to 2.4 GHz.
* Intel Core Ultra 5 115U (MTL). Eight x86 cores in three performance tiers at 0.7-1.5 GHz, up to 2.1-4.2 GHz.
I ran each of four tests 16k times, measuring the elapsed time of each iteration by calling `clock_gettime` immediately before and after the operation under test. The minimum, maximum, and mean times for each test were recorded. On all systems the timer resolution reported by `clock_getres` was 1 nanosecond.
* nop: do nothing
* sched_yield: call sched_yield() to attempt to run a different task without specifying a wait time
* nanosleep 1u: call nanosleep() with a requested wait time of 1 microsecond
* nanosleep 100u: call nanosleep() with a requested wait time of 100 microseconds
The nop case provides baseline information on how long it takes to measure the time it takes an operation to complete. sched_yield approximates syscall and scheduling overhead, under the assumption that the OS will run its scheduler to see if another process wants the CPU. If it does, the delay time could be greatly increased because another process may get the CPU.
The two nanosleep cases allow us to see how long a process that sleeps actually suspends for, and looking at two different delays is useful in case the OS tends to get sloppier if we request longer delays.
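A harness equivalent to the one described might look like the following (a simplified sketch, not the exact benchmark code; names and output format are mine):

```c
#include <stdio.h>
#include <time.h>

#define ITERATIONS (16 * 1024)

static double elapsed_us(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1e6 + (b->tv_nsec - a->tv_nsec) / 1e3;
}

int main(void)
{
	const struct timespec req = { .tv_sec = 0, .tv_nsec = 100 * 1000 };	/* 100us */
	struct timespec before, after;
	double min = 1e30, max = 0, sum = 0;

	for (int i = 0; i < ITERATIONS; i++) {
		clock_gettime(CLOCK_MONOTONIC, &before);
		nanosleep(&req, NULL);	/* swap in a nop or sched_yield() for the other tests */
		clock_gettime(CLOCK_MONOTONIC, &after);

		const double us = elapsed_us(&before, &after);
		if (us < min)
			min = us;
		if (us > max)
			max = us;
		sum += us;
	}
	printf("mean %.1f / min %.2f / max %.1f us\n", sum / ITERATIONS, min, max);
	return 0;
}
```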
In the following table, measured times are in microseconds, listed as mean / min / max.
| Machine / Test | nop | sched_yield | nanosleep 1u | nanosleep 100u |
|----------------|---------------------|-------------------|--------------------|-----------------------|
| AMD | 0.1 / 0.06 / 38.3 | 0.6 / 0.6 / 100.1 | 64.5 / 6.4 / 2151 | 106.3 / 104.2 / 9619 |
| MT8173 | 0.1 / 0.07 / 16.7 | 1.8 / 1.5 / 251.9 | 82.4 / 7.8 / 3051 | 195.9 / 109.3 / 5616 |
| APL | 0.05 / 0.04 / 14.3 | 0.7 / 0.6 / 51.3 | 57.2 / 7.1 / 1236 | 217.6 / 108.4 / 10430 |
| Xeon | 0.02 / 0.02 / 0.07 | 1.7 / 1.4 / 73.2 | 57.8 / 11.3 / 618 | 157.8 / 106.9 / 1299 |
| MTL | 0.05 / 0.03 / 101.9 | 0.4 / 0.4 / 7.1 | 53.0 / 4.6 / 139 | 151.7 / 101.5 / 245 |
Latency in excess of the requested wait time is fairly consistent: on average about 60 microseconds, and within as little as 10% of the requested time for 100 microsecond delays. Very short delays tend to fare worse, and the average case is notably better on systems with more CPU cores, probably because on smaller systems the CPU is more often given away, causing long delays while another process runs or the OS performs some uninterruptible operation.
These numbers are similar to the ones reported in https://review.coreboot.org/c/flashrom/+/80808, which lends credibility to this measurement method. High-resolution timers there improved minimum latency slightly, but as we'll see next, different scheduling improves the situation more.
---
I also ran the tests with realtime scheduling policy (as in `sched_setscheduler(SCHED_FIFO)`) using `chrt --fifo 1` to run at the minimum priority within the FIFO real-time class. This is expected to reduce latency both by giving this process priority over all other processes on the system, and by allowing the OS to use a simpler scheduling algorithm when ultimately choosing to restart the process under test.
In this table the nop and sched_yield tests are omitted because they are not as informative as the actual nanosleep numbers.
| Machine / Test | nanosleep 1u | nanosleep 100u |
|----------------|----------------------|------------------------|
| AMD | 9.2 / 4.8 / 18.6 | 129.1 / 104.7 / 512.2 |
| MT8173 | 13.5 / 6.8 / 1505 | 117.2 / 106.7 / 1022 |
| APL | 5.4 / 4.8 / 45.2 | 106.3 / 104.2 / 452.3 |
| Xeon | 4.4 / 4.2 / 60.2 | 104.9 / 103.0 / 164.9 |
| MTL | 2.4 / 2.1 / 74.2 | 102.4 / 101.2 / 123.0 |
At real-time priority, 1-microsecond sleeps appear to pay little more than the cost of a context switch, and 100-microsecond sleeps are more consistent than they were at normal priority: the average case is now accurate to better than 10% on the faster machines, with the slower ones not doing much worse.
#### Practical sleep thresholds
The existing code in flashrom that uses `clock_getres` rejects OS timers as a timing source if the reported resolution is coarser than 100 nanoseconds. Since the delay functions operate with microsecond precision, this suggests a target of 10% accuracy for short waits and significantly better than that for long ones.
Because we see that 10% accuracy is feasible on most systems for 100 microsecond sleeps, this seems like a reasonable minimum time to sleep for. Even on systems where accuracy is worse than 10%, use of `sleep`-style APIs guarantees that delays will never be less than the specified time: this approach will never malfunction, but unexpectedly long delays could make flash programming take longer than it would otherwise.
Not all users or distributors will agree with this decision. Some very slow or old machines might achieve much worse precision such that it's considered worthwhile to busy-wait more aggressively (in order to complete programming more quickly), while some users might prefer to take every opportunity to save power by always sleeping. As a result, I believe this threshold should be made a compile-time option defaulting to 100 microseconds (although the final choice for the default can easily be adjusted).
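To make the idea concrete, the dispatch could look something like this sketch (the macro name `DELAY_MIN_SLEEP_US` is illustrative, not an existing build option, and `busy_wait_us` refers to the clock-based spin sketched earlier):

```c
#include <time.h>

/* Build-time threshold for choosing sleep over busy-waiting. */
#ifndef DELAY_MIN_SLEEP_US
#define DELAY_MIN_SLEEP_US 100	/* proposed default: sleep for waits of 100us or more */
#endif

extern void busy_wait_us(unsigned int usecs);

void default_delay(unsigned int usecs)
{
	if (usecs >= DELAY_MIN_SLEEP_US) {
		struct timespec req = {
			.tv_sec = usecs / 1000000,
			.tv_nsec = (usecs % 1000000) * 1000L,
		};
		nanosleep(&req, NULL);	/* never wakes early (barring signals), possibly late */
	} else {
		busy_wait_us(usecs);
	}
}
```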
#### Use of real-time scheduling
The accuracy of `sleep` APIs can be improved by using real-time scheduling classes as specified by POSIX, but enabling that for a given process usually requires special privileges. Many users run flashrom as the root user so it has the ability to make itself realtime, but there are also many applications where it need not run with root privileges.
Because realtime priority is not required for correct operation, flashrom should attempt to make itself realtime but continue even if unable to do so. An informational message should be emitted if that fails so users can be aware that granting flashrom additional privileges may yield better performance.
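On Linux this could be as simple as the following sketch (the function name and message wording are illustrative):

```c
#include <sched.h>
#include <stdio.h>

/* Request the FIFO realtime class at its minimum priority, but carry on
 * (with a note to the user) if the OS refuses, e.g. for lack of privileges. */
static void try_raise_priority(void)
{
	struct sched_param param = {
		.sched_priority = sched_get_priority_min(SCHED_FIFO),
	};

	if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)
		fprintf(stderr, "Note: could not enable realtime scheduling; "
			"delays may be less accurate.\n");
}
```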
In libflashrom, realtime scheduling should not be enabled by default because it is very difficult to reason about the possible effects of changing a host process' scheduling class. Programs using flashrom as a library should manage priority themselves.
Peter,
Thank you so much for doing such thorough research and measurements! And for investing so much of your time and effort into improving flashrom! This is so impressive <3
To the point, I read everything in full, and I have no objections. I am wondering, since this is a big chunk of work, how to maybe plan it into phases: but this can be discussed later.
Also interested to know what other people think? -- Anastasia
On Wed, 27 Mar 2024 at 11:20, Anastasia Klimchuk aklm@chromium.org wrote:
Also interested to know what other people think?
I also agree with Peter, the "calibrated busy-loop" worked great on 16 bit DOS, but with modern operating systems and CPUs it just doesn't make sense.
Richard
I did find that DJGPP documents its clock resolution to be only 50ms (since DOS only provides a 20 Hz timer), so only using clock-based delays would be a very large (~500x) performance hit on DJGPP. I think we should keep a calibrated delay loop for DOS only, since we can expect other OSes to have much better timing support (and it seems less likely that a system will be using CPU frequency scaling of any kind when running DOS).
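Concretely, the carve-out could be guarded on the DJGPP toolchain's predefined `__DJGPP__` macro, along these lines (`myusec_delay` is the existing calibrated loop; `clock_based_delay` and the dispatch shape are illustrative):

```c
extern void myusec_delay(unsigned int usecs);		/* calibrated busy-loop */
extern void clock_based_delay(unsigned int usecs);	/* clock_gettime/nanosleep path */

void internal_delay(unsigned int usecs)
{
#if defined(__DJGPP__)	/* predefined by the DJGPP toolchain */
	myusec_delay(usecs);	/* ~50ms OS timer resolution makes clock-based delays useless */
#else
	clock_based_delay(usecs);
#endif
}
```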
As for performance changes from sleeping more aggressively, I've done some experimentation on an Intel system using the ichspi programmer and found that even reducing the sleep threshold to 10 microseconds doesn't affect overall performance, because almost every delay that gets used is 10 microseconds or less. Some grepping through the source tree suggests that delays tend to be either 1-10 microseconds or multiple milliseconds; for instance, almost every chip sets its probe delay to none, and only four have a probe delay of more than 10 microseconds. Since there are very few situations where we actually end up sleeping even with a 100 microsecond sleep threshold, I don't think it's important to use realtime priority. Especially given the risks involved (possibly making the system completely unresponsive if an error causes flashrom to get stuck), I no longer believe we should automatically try to use it, leaving it up to callers to select a high priority if desired.
I am wondering, since this is a big chunk of work, how to maybe plan it into phases: but this can be discussed later.
It actually doesn't seem like very much work to me. Removing the calibrated loop is pretty easy (I've already written a change that does that), and the rest is mostly fiddling to ensure the program can still build on weird/old platforms like DJGPP. You can expect a short chain of patches in fairly short order, unless I encounter unexpected difficulties. :)
On Wed, 27 Mar 2024 at 23:06, Peter Marheine pmarheine@chromium.org wrote:
I think we should keep a calibrated delay loop for DOS only
I'm sceptical anyone is actually using flashrom on DOS. Personally I'd drop it as a build target.
Richard.
I'm sceptical anyone is actually using flashrom on DOS. Personally I'd drop it as a build target.
It is still used, and we have build instructions. Someone recently built flashrom for DOS following the instructions. So even if it's not a very common use case, it's still a use case. We are not dropping it for now :)
You can expect a short chain of patches in fairly short order, unless I encounter unexpected difficulties. :)
Sounds great!
On Thu, 28 Mar 2024 at 10:11, Anastasia Klimchuk aklm@chromium.org wrote:
Someone recently built flashrom for DOS following the instructions.
I'm sure that's true; but do you want to keep around such legacy code (increasing the maintenance load) in the codebase forever? Do you want to test it for build-time and runtime regressions before you do each official release?
Richard
On 27.03.24 at 14:01, Richard Hughes wrote:
I also agree with Peter, the "calibrated busy-loop" worked great on 16 bit DOS, but with modern operating systems and CPUs it just doesn't make sense.
Well, it theoretically may not make sense, but the busy-loop waits were sometimes the only reasonable waiting solution on older non-SPI (LPC/FWH/parallel) flash chips.
For example, some chips use toggle bit detection to check for a finished write. In theory, toggle-bit works like this: read the same location over and over until the designated toggle bit stops changing, then you know your write is done. Now, some Winbond chips (AFAIR; it's been some years) have an additional requirement that is even mentioned in the datasheet: reread too quickly, and the bit will not toggle, falsely giving the impression that the write is done. On other chips, rereading too slowly would also fool the write-is-done detection. Those old machines hosting affected flash chips usually do not have any highly dynamic frequency scaling nor P-cores/E-cores, so the concerns about "doesn't make sense" don't apply here.
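In pseudo-C, such a toggle-bit poll looks roughly like this (an illustrative sketch of JEDEC DQ6-style polling, not flashrom's actual code; `chip_readb` here is a stand-in for the real chip accessor):

```c
#include <stdint.h>

extern uint8_t chip_readb(uintptr_t addr);

static void wait_for_write_done(uintptr_t status_addr)
{
	uint8_t prev = chip_readb(status_addr);

	for (;;) {
		const uint8_t cur = chip_readb(status_addr);
		/* DQ6 (bit 6) toggles on consecutive reads while a write is in
		 * progress; two equal reads mean the write finished. As noted
		 * above, some chips additionally require the two reads to be
		 * neither too close together nor too far apart, which is where
		 * delay accuracy matters. */
		if (((prev ^ cur) & 0x40) == 0)
			return;
		prev = cur;
	}
}
```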
You're of course free to declare those chips obsolete, but in that case it may make sense to drop support for them altogether (i.e. refuse write or remove from database, but keep an incompatibility notice) instead of giving people a false hope. Silent breakage may be OK for higher-level software, but it should never be intentionally done for software which can brick your machine.
That said, flashrom for DOS users still exist.
The easiest option for getting rid of all those concerns is to compile a SPI-only non-DOS flashrom and make any delay calibrations dependent on (non-SPI OR DOS). That won't simplify the code, though.
Dropping the DOS code and the non-SPI code is another option. I'm not advocating for that, but if you want to clean up code, that's the simplest way to get rid of all the various warts.
As a side note, it would be interesting to know what the main usage of flashrom is nowadays. I'd expect it to be mostly Chromebooks, possibly some other embedded boards. Focusing on the 99% might yield enormous cleanup opportunities.
Regards, Carl-Daniel
do you want to keep around such legacy code (increasing the maintenance load) in the codebase forever?
No, I don't want to keep such code around forever to maintain, mainly because it won't be needed forever. But dropping support for DOS will be a separate effort (with dedicated threads and announcements). I don't think it should be like "while we are here, let's also drop DOS"; it's a bigger effort than "while we are here".
For example, some chips use the toggle bit detection to check for a finished write.
Do you know whether such chips are marked in the flashchips definitions? How do we know which ones are like this? I understand from your words that this is a subset of the older non-SPI (LPC/FWH/parallel) flash chips, but which ones?
Also relevant to delays: are those the same chips that need extra delay 1s? For context, comments here https://review.coreboot.org/c/flashrom/+/80807 Do you maybe remember how to find them in flashchips?
As I understand, there are some small(?) number of older(?) chips which need special treatment on delays, it would be so helpful to find out which ones exactly.
Focusing on the 99% might yield enormous cleanup opportunities.
Yes, that's a really good observation. I have such thoughts in the background.
For completeness, the proposed change is at https://review.coreboot.org/c/flashrom/+/81545
I've tested it myself (and added a unit test) and am fairly confident everything works well, but if anybody has a "weirder" machine than an x86 PC to test on, additional coverage would be helpful.
I've tested compilation of https://review.coreboot.org/c/flashrom/+/81545 and reading, erasing, writing flash on a FreeBSD -CURRENT amd64 machine (ASUS F2A85-M). Seems to work as expected.