Thus I obtained an effective performance figure of 2.0 MFLOPS/MHz, versus an Athlon-XP figure of 0.5 (for both x87 and SSE) and a Pentium-4 figure of 0.4 (for SSE). This is not hype - this is me reading the documentation and doing the maths.
This is an estimate, not a benchmark. It would be interesting to see what the Apple C compiler would make of your code. Your code seems to map to AltiVec very well, which is pretty rare.
My algorithm is made up of standard matrix operations, such as multiplies and inversions. By transposing one of the matrices beforehand, the individual operations are mostly sequential in terms of memory access and uniform in terms of operation, which means it does map very well to vector code.
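A minimal sketch of the transposition idea described above, in Python purely for illustration. The actual code is not shown in this thread, so the function names and the list-of-lists layout here are assumptions. With the second matrix transposed up front, the inner loop becomes a plain dot product over two rows, each read front to back, which is the access pattern that suits a vector unit.

```python
def transpose(m):
    """Return the transpose of a row-major list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul_transposed(a, b):
    """Multiply a (n x k) by b (k x m), walking both operands row by row."""
    bt = transpose(b)  # one-off cost, paid before the inner loops
    result = []
    for row_a in a:
        out_row = []
        for row_bt in bt:
            # Both rows are read sequentially: a uniform multiply-accumulate.
            acc = 0.0
            for x, y in zip(row_a, row_bt):
                acc += x * y
            out_row.append(acc)
        result.append(out_row)
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product = matmul_transposed(a, b)
print(product)
```

Python itself gains nothing from this layout; the point is only to show the loop shape that a C compiler or hand-written vector code would be working with.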
It would probably also benefit from 3DNow! on the Athlon, but I understand x86 assembly code is a lot hairier, and certainly my compiler is unable to generate 3DNow! code by itself. As a result, I've been unable to determine how much benefit I might see.
Have G4s with MHz ratings equivalent to the Athlons been sold lately? It seems Athlons are clocked at twice the speed of G4s.
What is the cost of a refurbished G4 system vs. a bunch of recent Athlons? I think the MFLOPS/$ benchmark is more interesting.
For comparable prices, I can get an 867MHz dual G4, or three single Athlon-XPs at 1666MHz. Given that I can get four times the performance per clock per CPU with the G4, it's a net win - as I posted earlier, 3500 against 2500 MFLOPS. I'm still concerned about the potential of having to use Windows software though, and that's likely to sway the final decision.
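The 3500 against 2500 figures can be reproduced from the numbers quoted in this thread. This is a back-of-envelope sketch, assuming the per-clock estimates above hold and that throughput scales perfectly with CPU count, which a real cluster would not quite achieve. The function name is made up for illustration.

```python
def aggregate_mflops(cpus, clock_mhz, mflops_per_mhz):
    """Estimated total MFLOPS, assuming perfect scaling across CPUs."""
    return cpus * clock_mhz * mflops_per_mhz


# One dual 867MHz G4 at the estimated 2.0 MFLOPS/MHz.
g4_total = aggregate_mflops(cpus=2, clock_mhz=867, mflops_per_mhz=2.0)

# Three single 1666MHz Athlon-XPs at the estimated 0.5 MFLOPS/MHz.
athlon_total = aggregate_mflops(cpus=3, clock_mhz=1666, mflops_per_mhz=0.5)

print(g4_total, athlon_total)
```

The results, 3468 and 2499 MFLOPS, round to the 3500 and 2500 quoted.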
Another worthwhile comparison might be in terms of MFLOPS per watt, and that would be particularly relevant to larger clusters, where power consumption and heat become significant environmental factors.
I understand a 1GHz G4+ draws around 15-20W maximum, which I'd assume is while running AltiVec code (and would go down with the vector units not in use). By comparison, Athlons seem to average around 50W, depending on the particular core type and clock speed, and are unable to shut down functional units that are not in use.
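As a rough illustration of what MFLOPS per watt would look like with these numbers: the sketch below combines the per-clock estimates from earlier in the thread with the power figures just given. Pairing the 50W Athlon average with the 1666MHz part is an assumption made here for the sake of the arithmetic, and none of this is a measurement.

```python
def mflops_per_watt(clock_mhz, mflops_per_mhz, watts):
    """Estimated MFLOPS delivered per watt for a single CPU."""
    return clock_mhz * mflops_per_mhz / watts


# 1GHz G4+ at 2.0 MFLOPS/MHz, at both ends of the 15-20W range.
g4_worst = mflops_per_watt(1000, 2.0, 20)
g4_best = mflops_per_watt(1000, 2.0, 15)

# Athlon-XP at 0.5 MFLOPS/MHz; 1666MHz at 50W is an assumed pairing.
athlon = mflops_per_watt(1666, 0.5, 50)

print(g4_worst, g4_best, athlon)
```

On these assumptions the G4 lands at roughly 100 to 133 MFLOPS/W against about 17 for the Athlon.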
On Tue, Feb 25, 2003, Jonathan Morton wrote:
Thus I obtained an effective performance figure of 2.0 MFLOPS/MHz, versus an Athlon-XP figure of 0.5 (for both x87 and SSE) and a Pentium-4 figure of 0.4 (for SSE). This is not hype - this is me reading the documentation and doing the maths.
This is an estimate, not a benchmark. It would be interesting to see what the Apple C compiler would make of your code. Your code seems to map to AltiVec very well, which is pretty rare.
My algorithm is made up of standard matrix operations, such as multiplies and inversions. By transposing one of the matrices beforehand, the individual operations are mostly sequential in terms of memory access and uniform in terms of operation, which means it does map very well to vector code.
Hi,
Have you looked into ATLAS for matrix operations? It gets a very high percentage of peak performance on the ia32 class (as well as on other architectures), and may very well change the assumptions you are making. We use it extensively on our 50 node (100 AthlonMP 2000, 100Gb RAM) cluster to do Density Functional Theory. Just FYI, FFTW (the Fastest Fourier Transform in the West) is also a self-optimizing codebase that gives very good performance. Both are easily found via Google.
HTH, Daniel