What about introducing a per-programmer delay function pointer instead? Your programmer could buffer everything (delays, writes) until the first read, then send them as a batch.
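To make the suggestion concrete, here is a rough sketch of what a per-programmer delay hook with batching could look like. All names (`struct programmer`, `buffered_delay`, `QUEUE_CAP`, and so on) are made up for illustration and are not the project's actual API; the "send" step is only simulated with counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not an existing API. */
enum op_kind { OP_WRITE, OP_DELAY };

struct queued_op {
    enum op_kind kind;
    uint32_t arg;   /* address for writes, microseconds for delays */
    uint8_t value;  /* data byte for writes, unused for delays */
};

#define QUEUE_CAP 64

struct programmer {
    /* Per-programmer delay hook instead of one global delay function. */
    void (*delay)(struct programmer *p, unsigned int usecs);
    struct queued_op queue[QUEUE_CAP];
    size_t queued;        /* operations waiting to be sent */
    size_t batches_sent;  /* stand-in for real transmissions */
    size_t ops_sent;
};

/* Send everything queued so far as one batch (simulated here). */
static void flush_queue(struct programmer *p)
{
    if (p->queued == 0)
        return;
    p->ops_sent += p->queued;
    p->batches_sent++;
    p->queued = 0;
}

static void enqueue(struct programmer *p, struct queued_op op)
{
    if (p->queued == QUEUE_CAP)
        flush_queue(p);  /* queue full: send early */
    assert(p->queued < QUEUE_CAP);
    p->queue[p->queued++] = op;
}

/* Delays are recorded, not executed on the host. */
static void buffered_delay(struct programmer *p, unsigned int usecs)
{
    enqueue(p, (struct queued_op){ OP_DELAY, usecs, 0 });
}

static void buffered_write(struct programmer *p, uint32_t addr, uint8_t val)
{
    enqueue(p, (struct queued_op){ OP_WRITE, addr, val });
}

/* The first read forces the batch out, then performs the read. */
static uint8_t buffered_read(struct programmer *p, uint32_t addr)
{
    flush_queue(p);
    (void)addr;
    return 0xff;  /* placeholder; a real programmer would return chip data */
}
```

With `.delay = buffered_delay`, a sequence of writes and delays just accumulates in the queue, and the first `buffered_read()` sends it in one go.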
The only problem I see is that, for this to work properly, I would need to keep the buffer on the programmer side. Otherwise the gaps between packets could become too large (e.g. because of network packet loss), even when streaming an already buffered batch from the computer. I'm not sure the AVR's 1k SRAM would be enough to hold even the low-level description of a single page load: address + data = 4 bytes, plus 1 byte for the operation, gives 5 * 256 = 1280 bytes. Of course the addresses could be "compressed" (they are sequential), but it still doesn't sound right: even a simple op + byte encoding would be 512+ bytes. We could compress down to op + n + n bytes ... that could work. Not sure.
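For comparison, here is a small sketch of the three encodings and their sizes for one 256-byte page. The opcode value and the function names are invented for illustration. I also assume the count byte stores n - 1, so that a full run of 256 bytes still fits in a single byte; that detail is my assumption, not something already agreed on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FLASH_PAGE_SIZE 256u  /* page size used in the estimate above */
#define OP_WRITE_RUN 0x01u    /* hypothetical opcode value */

/* One record per byte: 1 op byte + 4 bytes of address and data. */
static size_t size_full_records(size_t n) { return n * 5; }

/* One record per byte: 1 op byte + 1 data byte, address implied. */
static size_t size_op_plus_byte(size_t n) { return n * 2; }

/* One run: 1 op byte + 1 count byte + n data bytes. */
static size_t size_run(size_t n) { return 2 + n; }

/*
 * Encode n sequential data bytes as "op + n + n bytes".
 * The count byte holds n - 1 (assumption), so n may be 1..256.
 * Returns the number of bytes written, or 0 if the input is out of
 * range or the output buffer is too small.
 */
static size_t encode_run(uint8_t *out, size_t cap,
                         const uint8_t *data, size_t n)
{
    if (n == 0 || n > FLASH_PAGE_SIZE || cap < size_run(n))
        return 0;
    out[0] = OP_WRITE_RUN;
    out[1] = (uint8_t)(n - 1);
    memcpy(out + 2, data, n);
    assert(out[1] + 1u == n);  /* count byte round-trips */
    return size_run(n);
}
```

Under these assumptions a full page costs 1280 bytes as full records, 512 bytes as op + byte pairs, and 258 bytes as a single run, so only the last variant fits in 1k of SRAM with room left for everything else.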
I accidentally sent this only to Carl-Daniel at first, sorry.