On Tue, Feb 25, 2003, Jonathan Morton wrote:
Thus I obtained an effective performance figure of 2.0 MFLOPS/MHz, versus an Athlon-XP figure of 0.5 (for both x87 and SSE) and a Pentium-4 figure of 0.4 (for SSE). This is not hype - this is me reading the documentation and doing the maths.
This is an estimate, not a benchmark. It would be interesting to see what the Apple C compiler would make of your code. Your code seems to map to AltiVec very well; that is pretty rare.
My algorithm is made up of standard matrix operations, such as multiplies and inversions. By transposing one of the matrices beforehand, I make the individual operations mostly sequential in terms of memory access and uniform in terms of operation, which means the algorithm does map very well to vector code.
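For readers following along, here is a minimal C sketch of that transposed-multiply layout. It is not the poster's actual code; the single-precision type, square N x N matrices, and function names are my own assumptions for illustration. The point is that once B has been transposed, both operands in the inner loop are walked with unit stride, which is what makes the multiply vectorize well.

#include <stddef.h>

/* Transpose B into Bt once, so the multiply below can read rows of Bt
   sequentially instead of striding down columns of B. */
void transpose(size_t n, const float *b, float *bt)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            bt[j * n + i] = b[i * n + j];
}

/* C = A * B, with B already transposed into Bt.  The inner loop reads
   a row of A and a row of Bt, so both accesses have stride 1. */
void matmul_bt(size_t n, const float *a, const float *bt, float *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    }
}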
Hi,
Have you looked into ATLAS for matrix operations? It achieves a very high percentage of peak performance on IA-32 (as well as on other architectures), and may very well change the assumptions you are making. We use it extensively on our 50-node cluster (100 Athlon MP 2000 CPUs, 100 GB of RAM) to do Density Functional Theory calculations. Just FYI, FFTW (the Fastest Fourier Transform in the West) is also a self-optimizing codebase that gives very good performance. Both are easily found via Google.
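To give a feel for what using ATLAS looks like in practice, here is a rough sketch using the standard CBLAS interface that ATLAS provides; the row-major layout, double precision, and square matrix sizes are my own choices for illustration, not anything from the thread:

#include <cblas.h>   /* supplied by ATLAS, or any BLAS with the C interface */

/* C = A * B for row-major double-precision n-by-n matrices.
   The single dgemm call replaces a hand-written triple loop and lets the
   tuned ATLAS kernel handle blocking and vectorization. */
void matmul_blas(int n, const double *a, const double *b, double *c)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, a, n,
                     b, n,
                0.0, c, n);
}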
HTH, Daniel