A couple of comments on all of this:
> After having processed the read command, it seems the hard disk reports "not busy" before "there is data" but that's the wrong order according to the ATA-3 working draft that I'm using for reference. (From 1997 but it's the latest I've found available at no cost. T13 d2008r7b)
All of the ATA drafts, up to and including ATA-8, are available for free download from t13.org, as they have been for several years. ATA-3 has actually been withdrawn. ATA-5 and ATA-6 are usually the best references because they are the most readable. ATA-7 and ATA-8 include SATA, which makes them more convoluted.
IDE UDMA is always going to be faster than PIO. If your intent is only to speed up CF, then it doesn't really matter, since CF cards that support UDMA or MDMA are still pretty rare. If you want to make all filo loads faster, you may want to look into IDE DMA. It really isn't much harder to program than IDE PIO.
If you don't want to do that much work, you may want to at least try to use rep insd when you can. It is faster. All relatively modern IDE controllers support it just fine. (Some older controllers do not, however.)
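For what it's worth, here is a rough sketch of what a rep insd read could look like with GCC-style inline assembly. The names (pio_dword_count, port_insl, ata_read_sector32) are made up for illustration and are not taken from filo, and I'm assuming the usual 512-byte sector. It needs I/O privilege, so it only makes sense in firmware or kernel context.

```c
#include <stddef.h>
#include <stdint.h>

#define ATA_SECTOR_BYTES 512u

/* Number of 32-bit reads needed for `bytes`, or 0 if `bytes` is not a
 * multiple of 4 (in which case fall back to 16-bit transfers). */
static size_t pio_dword_count(size_t bytes)
{
    return (bytes % 4 == 0) ? bytes / 4 : 0;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Read `count` dwords from I/O port `port` into `buf` with a single
 * "rep insd" (spelled "rep insl" in AT&T syntax). */
static inline void port_insl(uint16_t port, void *buf, size_t count)
{
    __asm__ volatile("rep insl"
                     : "+D"(buf), "+c"(count)
                     : "d"(port)
                     : "memory");
}

/* Hypothetical helper: read one sector from the data register at
 * `data_port` (0x1F0 on the legacy primary channel) once the drive
 * has raised DRQ. */
static inline void ata_read_sector32(uint16_t data_port, void *buf)
{
    port_insl(data_port, buf, pio_dword_count(ATA_SECTOR_BYTES));
}
#endif
```

On a controller that does not handle 32-bit data port reads, the same loop with "rep insw" and twice the count would be the fallback.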
In any IDE driver, polling "Status" is usually bad form. You should poll "Alternate Status" until BSY goes away, wait for the other bits to get set properly, then read Status once to clear the interrupt. Polling Status will sometimes lead to spurious interrupts, which *should* be innocuous, but can sometimes have confusing effects.
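As a minimal sketch of that polling order (not filo's actual code): the register reads go through callbacks so the logic can be tried off hardware, and struct ata_regs, ata_wait_data, and the fake_drive stand-in are all hypothetical names. On real hardware the callbacks would be inb() on the Alternate Status port (0x3F6 on the legacy primary channel) and the Status port (0x1F7). The poll limit is an assumption too; a real driver would use a timer.

```c
#include <stdint.h>

#define ATA_SR_ERR 0x01  /* error */
#define ATA_SR_DRQ 0x08  /* data request: drive has data ready */
#define ATA_SR_BSY 0x80  /* busy: other bits are not valid yet */

struct ata_regs {
    uint8_t (*read_alt_status)(void *ctx); /* does not clear the interrupt */
    uint8_t (*read_status)(void *ctx);     /* clears the pending interrupt */
    void *ctx;
};

/* Returns 0 when data is ready, -1 on a drive error, -2 on timeout. */
static int ata_wait_data(const struct ata_regs *r, unsigned max_polls)
{
    int settled = 0;

    /* Poll Alternate Status until BSY is clear and DRQ or ERR is set. */
    for (unsigned i = 0; i < max_polls; i++) {
        uint8_t alt = r->read_alt_status(r->ctx);
        if (alt & ATA_SR_BSY)
            continue;
        if (alt & (ATA_SR_DRQ | ATA_SR_ERR)) {
            settled = 1;
            break;
        }
    }
    if (!settled)
        return -2; /* timed out; Status was never touched */

    /* Read Status exactly once to clear the interrupt. */
    uint8_t st = r->read_status(r->ctx);
    if (st & ATA_SR_ERR)
        return -1;
    return (st & ATA_SR_DRQ) ? 0 : -1;
}

/* Stand-in for a drive, for exercising the logic without hardware. */
struct fake_drive {
    unsigned busy_polls;   /* Alternate Status reads that still report BSY */
    uint8_t  final_status; /* value reported once BSY has cleared */
    unsigned status_reads; /* how many times Status itself was read */
};

static uint8_t fake_alt_status(void *ctx)
{
    struct fake_drive *d = ctx;
    if (d->busy_polls > 0) {
        d->busy_polls--;
        return ATA_SR_BSY;
    }
    return d->final_status;
}

static uint8_t fake_status(void *ctx)
{
    struct fake_drive *d = ctx;
    d->status_reads++;
    return d->busy_polls > 0 ? ATA_SR_BSY : d->final_status;
}
```

With a fake_drive that stays busy for a few polls and then reports DRQ, ata_wait_data returns 0 and status_reads ends up at exactly 1, which is the point: Status gets read once, after everything has settled.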