I am not sure if this was discussed before, but I would suggest skipping the programming of blocks/areas that are to be programmed to the erased state. E.g. if half of the input file consists of 0xFF areas, we could roughly halve the programming time.
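A rough sketch of what I mean, not tied to any existing code: it assumes the erased state is 0xFF and that the chip has already been erased before programming starts. The names block_is_erased, write_image_skip_erased and the write_block_fn callback are made up for illustration, and demo_write_block is only a stand-in for a real block writer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ERASED_VALUE 0xFF /* assumption: erased flash reads back as 0xFF */

/* Returns 1 if every byte in buf equals the erased value, 0 otherwise. */
static int block_is_erased(const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i] != ERASED_VALUE)
			return 0;
	return 1;
}

/* Hypothetical callback that programs one block; returns 0 on success. */
typedef int (*write_block_fn)(size_t offset, const uint8_t *data, size_t len);

/*
 * Programs the image block by block, skipping blocks that consist only of
 * the erased value. Returns the number of blocks actually programmed,
 * or -1 if the callback reports an error.
 */
static long write_image_skip_erased(const uint8_t *image, size_t image_len,
				    size_t block_size, write_block_fn write_block)
{
	long written = 0;

	assert(block_size > 0);
	for (size_t off = 0; off < image_len; off += block_size) {
		size_t left = image_len - off;
		size_t len = left < block_size ? left : block_size;

		if (block_is_erased(image + off, len))
			continue; /* nothing to program here */
		if (write_block(off, image + off, len) != 0)
			return -1;
		written++;
	}
	return written;
}

/* Stand-in writer for demonstration: only counts the bytes it is given. */
static size_t bytes_programmed;

static int demo_write_block(size_t offset, const uint8_t *data, size_t len)
{
	(void)offset;
	(void)data;
	bytes_programmed += len;
	return 0;
}
```

The skip is only safe if the target block really is in the erased state on the chip, so it would have to be combined with an erase (or a check) beforehand.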
And on chips where read is much faster than erase (e.g. SST 25VF...) we can speed up erasing if we do a read before the erase and only erase those blocks that are not already in the erased state.
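The read-before-erase idea could look roughly like this. Again only a sketch under assumptions: the erased state is 0xFF, the chip size is a multiple of the erase block size, and struct chip_ops, erase_dirty_blocks and the fake_* helpers are hypothetical names; the fake_* functions just model a tiny chip in memory so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ERASED_VALUE 0xFF /* assumption: erased flash reads back as 0xFF */

/* Hypothetical chip access hooks; both return 0 on success. */
struct chip_ops {
	int (*read)(size_t offset, uint8_t *buf, size_t len);
	int (*erase_block)(size_t offset, size_t len);
};

/*
 * Reads each erase block first and erases it only if it contains at least
 * one byte that is not in the erased state. scratch must hold block_size
 * bytes. Returns the number of blocks erased, or -1 on a read/erase error.
 */
static long erase_dirty_blocks(const struct chip_ops *ops, size_t chip_size,
			       size_t block_size, uint8_t *scratch)
{
	long erased = 0;

	assert(block_size > 0 && chip_size % block_size == 0);
	for (size_t off = 0; off < chip_size; off += block_size) {
		size_t i;

		if (ops->read(off, scratch, block_size) != 0)
			return -1;
		for (i = 0; i < block_size; i++)
			if (scratch[i] != ERASED_VALUE)
				break;
		if (i == block_size)
			continue; /* block is already erased, skip it */
		if (ops->erase_block(off, block_size) != 0)
			return -1;
		erased++;
	}
	return erased;
}

/* In-memory stand-in for a 16-byte chip, for demonstration only. */
static uint8_t fake_flash[16];

static int fake_read(size_t offset, uint8_t *buf, size_t len)
{
	memcpy(buf, fake_flash + offset, len);
	return 0;
}

static int fake_erase(size_t offset, size_t len)
{
	memset(fake_flash + offset, ERASED_VALUE, len);
	return 0;
}
```

Whether this is a win depends on the chip: the extra read only pays off where reading a block is much cheaper than erasing it.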
Regards, Helge