Re: Low cost cluster

25 Feb 2003


      ...
...
> > If I really wanted to push the CPU-power envelope, I'd get a refurbished
> > PowerMac.  I still might.  The cluster would give me an aggregate 2500
> > MFLOPS, while a previous-model Mac would give me 3500 for about the
>
> Is this supposed to be funny, or do you really believe S. Jobs's jokes?

I currently have a G3, on which I benchmarked the main component of my 
current algorithm, obtaining 0.5 MFLOPS/MHz.
I then disassembled the code and noted that the FP load, multiply-add, 
and store instructions could be replaced by their AltiVec 
equivalents, which operate on four times the number of operands and 
are equally fast.  I even checked the dispatcher rules to make sure 
that would not be a bottleneck - the G4+ is able to dispatch a vector 
multiply-add and a vector load/store, plus an integer operation (say, 
pointer arithmetic) and a branch if required, all in the same clock 
cycle.
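
As a rough illustration of the substitution described above, here is a small Python model of a multiply-add kernel in scalar form and in four-lanes-at-a-time form. The actual kernel is not shown in this post, so the operation y[i] = a[i] * b[i] + c[i], the function names and the multiple-of-four length restriction are all assumptions made for the sketch. It only counts multiply-add instructions issued; it does not model dispatch or timing.

```python
# Illustrative model only: the real kernel is not shown in the post.
# Assumed operation: y[i] = a[i] * b[i] + c[i] on float data.

LANES = 4  # an AltiVec register holds four 32-bit floats


def madd_scalar(a, b, c):
    """One multiply-add per element; returns (result, multiply-adds issued)."""
    out = [x * y + z for x, y, z in zip(a, b, c)]
    return out, len(out)


def madd_vec4(a, b, c):
    """Same arithmetic, but one multiply-add instruction per group of four."""
    if len(a) % LANES:
        raise ValueError("length must be a multiple of 4 in this sketch")
    out, issued = [], 0
    for i in range(0, len(a), LANES):
        group = zip(a[i:i + LANES], b[i:i + LANES], c[i:i + LANES])
        out.extend(x * y + z for x, y, z in group)
        issued += 1  # one vector multiply-add covers the whole group
    return out, issued
```

Both functions return the same values; the vector form issues a quarter as many multiply-adds, which is where the factor of four comes from if each vector instruction costs the same as its scalar counterpart.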
Thus I obtained a projected effective performance figure of 2.0 MFLOPS/MHz 
(four times the measured 0.5), versus an Athlon XP figure of 0.5 (for both 
x87 and SSE) and a Pentium 4 figure of 0.4 (for SSE).  This is not hype - 
this is me reading the documentation and doing the maths.
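
The arithmetic behind these figures can be written out as a short Python sketch. The per-MHz numbers are the ones quoted above; the helper name mflops and its clock_mhz parameter are hypothetical, since no clock speeds are given in this post.

```python
VECTOR_WIDTH = 4    # operands per AltiVec instruction, as stated above
G3_MEASURED = 0.5   # MFLOPS/MHz, measured on the G3

# Per-MHz figures from the post; the G4 entry is a projection, not a benchmark.
per_mhz = {
    "G4 (AltiVec, projected)": G3_MEASURED * VECTOR_WIDTH,
    "Athlon XP (x87 or SSE)": 0.5,
    "Pentium 4 (SSE)": 0.4,
}


def mflops(per_mhz_figure, clock_mhz):
    """Scale a per-MHz figure by a clock speed (clock_mhz is supplied by the caller)."""
    return per_mhz_figure * clock_mhz
```

Multiplying each entry by the clock speed of a specific machine gives an absolute MFLOPS estimate, valid only under the in-cache assumption noted below.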
Note that all figures assume the working set fits in cache.  I 
believe I can ensure this, and it's also considerably easier to 
achieve with a Mac's 1MB L3.
-- 
--------------------------------------------------------------
from:     Jonathan "Chromatix" Morton
mail:     chromi@chromatix.demon.co.uk
website:  http://www.chromatix.uklinux.net/
tagline:  The key to knowledge is not to rely on people to teach you it.
