What about introducing a per-programmer delay function pointer instead? Your programmer could buffer everything (delays, writes) until the first read, then send them as a batch.
AFAICS this would solve almost all issues and be a good first step forward. The only remaining issue would be fast reads, but that is easy to address by replacing chip_readl (32 bits) with chip_readn (n bytes).
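
A minimal sketch of the idea, under stated assumptions: the struct layout, the queue size, and every name except chip_readn are hypothetical and not the existing API, and the "device" is just an in-memory array standing in for a real programmer. It only shows the control flow: delays and writes are queued through per-programmer function pointers, and the first read flushes them as one batch before fetching n bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical; this is not the existing programmer API. */

enum op_type { OP_WRITE, OP_DELAY };

struct queued_op {
	enum op_type type;
	uint32_t addr;      /* OP_WRITE only */
	uint8_t val;        /* OP_WRITE only */
	unsigned int usecs; /* OP_DELAY only */
};

#define QUEUE_MAX 64
#define FAKE_CHIP_SIZE 256

struct programmer {
	/* Per-programmer hooks, including the delay function. */
	void (*delay)(struct programmer *p, unsigned int usecs);
	void (*chip_writeb)(struct programmer *p, uint8_t val, uint32_t addr);
	void (*chip_readn)(struct programmer *p, uint8_t *buf, uint32_t addr, size_t len);
	struct queued_op queue[QUEUE_MAX];
	size_t queued;
	unsigned int batches_sent;          /* number of flushes, for illustration */
	uint8_t fake_chip[FAKE_CHIP_SIZE];  /* stand-in for the remote device */
};

/* Send everything queued so far as one batch (here: replay it locally). */
static void flush_queue(struct programmer *p)
{
	size_t i;

	if (p->queued == 0)
		return;
	for (i = 0; i < p->queued; i++) {
		const struct queued_op *op = &p->queue[i];
		if (op->type == OP_WRITE)
			p->fake_chip[op->addr % FAKE_CHIP_SIZE] = op->val;
		/* OP_DELAY: a real device would wait op->usecs on its own. */
	}
	p->queued = 0;
	p->batches_sent++;
}

static void enqueue(struct programmer *p, struct queued_op op)
{
	if (p->queued == QUEUE_MAX)
		flush_queue(p); /* queue full: send early */
	p->queue[p->queued++] = op;
}

static void buffered_delay(struct programmer *p, unsigned int usecs)
{
	struct queued_op op = { OP_DELAY, 0, 0, usecs };
	enqueue(p, op);
}

static void buffered_writeb(struct programmer *p, uint8_t val, uint32_t addr)
{
	struct queued_op op = { OP_WRITE, addr, val, 0 };
	enqueue(p, op);
}

/* A read needs the result now, so flush the batch first, then read n bytes. */
static void buffered_readn(struct programmer *p, uint8_t *buf, uint32_t addr, size_t len)
{
	size_t i;

	flush_queue(p);
	for (i = 0; i < len; i++)
		buf[i] = p->fake_chip[(addr + i) % FAKE_CHIP_SIZE];
}

void buffered_programmer_init(struct programmer *p)
{
	memset(p, 0, sizeof(*p));
	p->delay = buffered_delay;
	p->chip_writeb = buffered_writeb;
	p->chip_readn = buffered_readn;
}
```

With something like this, generic code would call p->delay() and p->chip_writeb() as usual, and a programmer that does not need batching could simply point those hooks at direct implementations.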
Regards, Carl-Daniel