If I really wanted to push the CPU-power envelope, I'd get a refurbished PowerMac. I still might. The cluster would give me an aggregate 2500 MFLOPS, while a previous-model Mac would give me 3500 for about the
Is this supposed to be funny, or do you really believe S. Jobs's jokes?
I currently have a G3, on which I benchmarked the main component of my current algorithm, obtaining 0.5 MFLOPS/MHz.
I then disassembled the code and noted that the FP load, multiply-add, and store instructions could be replaced by their AltiVec equivalents, which operate on four times as many operands and are equally fast. I even checked the dispatcher rules to make sure that would not be a bottleneck: the G4+ is able to dispatch a vector multiply-add and a vector load/store, plus an integer operation (say, pointer arithmetic) and a branch if required, all in the same clock cycle.
Thus I obtained an effective performance figure of 2.0 MFLOPS/MHz (the measured 0.5 times the four-wide vectors), versus an Athlon XP figure of 0.5 (for both x87 and SSE) and a Pentium 4 figure of 0.4 (for SSE). This is not hype - this is me reading the documentation and doing the maths.
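For anyone who wants to check the arithmetic, here is a minimal Python sketch of it. The per-MHz figures are the ones quoted in this post; the function and variable names are made up for illustration, and the model assumes the vector instructions really do issue at the same rate as the scalar ones, as argued above.

```python
def vectorised_rate(scalar_rate, lanes):
    """Scale a measured scalar rate by the vector width.

    Assumes each vector instruction issues as fast as the scalar
    instruction it replaces, so throughput grows with the lane count.
    """
    return scalar_rate * lanes


g3_scalar = 0.5      # MFLOPS/MHz, measured on the G3 (figure from the post)
altivec_lanes = 4    # operands per AltiVec instruction (figure from the post)
g4_estimate = vectorised_rate(g3_scalar, altivec_lanes)

# Comparison figures quoted in the post, in MFLOPS/MHz.
others = {"Athlon XP (x87/SSE)": 0.5, "Pentium 4 (SSE)": 0.4}

print(f"G4+ estimate: {g4_estimate:.1f} MFLOPS/MHz")
for name, rate in others.items():
    ratio = g4_estimate / rate
    print(f"{name}: {rate:.1f} MFLOPS/MHz, G4+ estimate is {ratio:.1f}x per MHz")
```

These are per-MHz ratios only; turning them into absolute MFLOPS means multiplying by each chip's clock speed, which is not done here.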
Note that all figures assume the working set fits in cache. I believe I can ensure this, and it's also considerably easier to achieve with a Mac's 1 MB L3 cache.
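A rough way to sanity-check the cache assumption is to add up the bytes the inner loop touches and compare that with the 1 MB L3. The sketch below is only that: the three-array layout (two inputs, one output) and the use of 4-byte single-precision floats are my assumptions for illustration, and it ignores everything else that competes for cache (code, stack, other processes, associativity effects).

```python
L3_BYTES = 1 * 1024 * 1024   # 1 MB L3, the figure from the post
FLOAT_BYTES = 4              # assumption: single-precision floats


def working_set_bytes(elements_per_array, arrays, bytes_per_element=FLOAT_BYTES):
    """Total bytes touched if every array is the same length."""
    return elements_per_array * arrays * bytes_per_element


def fits_in_cache(working_set, cache_bytes=L3_BYTES):
    """True if the working set is no larger than the cache."""
    return working_set <= cache_bytes


def max_elements_per_array(arrays, cache_bytes=L3_BYTES, bytes_per_element=FLOAT_BYTES):
    """Largest equal array length whose total still fits in the cache."""
    return cache_bytes // (arrays * bytes_per_element)


# Hypothetical layout: two input arrays and one output array.
arrays = 3
limit = max_elements_per_array(arrays)
print(f"Up to {limit} floats per array for {arrays} arrays in {L3_BYTES} bytes")
print("Fits at the limit:", fits_in_cache(working_set_bytes(limit, arrays)))
```

In practice I would leave a healthy margin below that limit rather than run right up against it.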