"STEPHAN,YANN (HP-France,ex1)" yann_stephan@hp.com writes:
I love this one:
Less buggy <== Currently the BIOS does so many things in order to
support all OSes that the code is huge and still written in assembly language.
I would say more debuggable than a normal BIOS.
If you have something that will log the serial console output, it is also immensely more debuggable. The key is that the serial port is trivial to program and can be initialized before RAM, in fact before just about everything else.
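To illustrate how little the serial port needs, here is a minimal sketch of bringing up a standard 16550-style UART with five register writes. The outb() below is a hypothetical stand-in that only records the writes so the sketch runs on an ordinary host. On real hardware it would be the port-write instruction, and real pre-RAM code cannot rely on a stack, so take this as an illustration of the register sequence rather than the actual LinuxBIOS code. The 1.8432 MHz input clock and the 0x3f8 base address are the usual PC values, assumed here.

```c
#include <assert.h>
#include <stdint.h>

/* Standard 16550 register offsets from the port base. */
enum {
    UART_DATA = 0,  /* data register, or divisor low byte when DLAB=1 */
    UART_IER  = 1,  /* interrupt enable, or divisor high byte when DLAB=1 */
    UART_LCR  = 3   /* line control */
};

#define UART_LCR_8N1   0x03u     /* 8 data bits, no parity, 1 stop bit */
#define UART_LCR_DLAB  0x80u     /* divisor latch access bit */
#define UART_CLOCK_HZ  1843200u  /* usual PC UART clock (assumption) */

/* Hypothetical host-side stand-in for port I/O: records each write. */
struct io_log {
    uint16_t port[16];
    uint8_t  value[16];
    unsigned count;
};
struct io_log io_log;

void outb(uint8_t value, uint16_t port)
{
    if (io_log.count < 16) {
        io_log.port[io_log.count]  = port;
        io_log.value[io_log.count] = value;
        io_log.count++;
    }
}

/* Program baud rate and 8N1 framing; polled output only, no interrupts. */
void uart_init(uint16_t base, uint32_t baud)
{
    assert(baud != 0 && baud <= UART_CLOCK_HZ / 16u);
    uint16_t divisor = (uint16_t)(UART_CLOCK_HZ / (16u * baud));

    outb(UART_LCR_DLAB, (uint16_t)(base + UART_LCR));            /* expose divisor latch */
    outb((uint8_t)(divisor & 0xffu), (uint16_t)(base + UART_DATA)); /* divisor low byte */
    outb((uint8_t)(divisor >> 8), (uint16_t)(base + UART_IER));     /* divisor high byte */
    outb(UART_LCR_8N1, (uint16_t)(base + UART_LCR));             /* 8N1, DLAB cleared */
    outb(0x00, (uint16_t)(base + UART_IER));                     /* interrupts off */
}
```

Calling uart_init(0x3f8, 115200) programs a divisor of 1. Nothing in the sequence depends on memory, PCI, or any other device having been set up first.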
There are two great challenges when debugging a BIOS: 1) Identical hardware behaves differently. 2) Hardware bugs have to be worked around in software.
A classic case of identical hardware behaving differently is that hardware designers often do not initialize registers (like PCI BARs) at reset, so the registers just clamp onto whatever value they floated to. This has caused the LinuxBIOS PCI code to think read-write BARs are read-only. This kind of thing is annoying because the problem only shows up when you have a large number of systems using LinuxBIOS.
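One plausible way an uninitialized BAR can fool a probe is sketched below. This is an assumption for illustration, not necessarily the actual LinuxBIOS bug. The struct fake_bar and its accessors are hypothetical stand-ins for PCI configuration-space access, so the example runs on a host. The naive check compares the value after writing all ones against whatever was there at reset. The more robust check writes all ones and then all zeros, so the reset value never enters the comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 32-bit BAR in config space. */
struct fake_bar {
    uint32_t value;     /* current register contents */
    uint32_t writable;  /* bits the hardware lets software change */
};

uint32_t bar_read(const struct fake_bar *bar)
{
    return bar->value;
}

void bar_write(struct fake_bar *bar, uint32_t v)
{
    /* Read-only bits keep their value; writable bits take the new one. */
    bar->value = (bar->value & ~bar->writable) | (v & bar->writable);
}

/* Naive: trusts the reset value. Fails if the BAR floated to all ones. */
int bar_is_writable_naive(struct fake_bar *bar)
{
    uint32_t before = bar_read(bar);
    bar_write(bar, 0xffffffffu);
    uint32_t after = bar_read(bar);
    bar_write(bar, before);
    return after != before;
}

/* Robust: a bit is writable if it reads 1 after writing ones
 * and 0 after writing zeros, whatever it held at reset. */
uint32_t bar_writable_bits(struct fake_bar *bar)
{
    uint32_t saved = bar_read(bar);
    bar_write(bar, 0xffffffffu);
    uint32_t ones = bar_read(bar);
    bar_write(bar, 0x00000000u);
    uint32_t zeros = bar_read(bar);
    bar_write(bar, saved);
    return ones & ~zeros;
}

/* The decoded size is the lowest writable address bit. */
uint32_t bar_size(uint32_t writable_bits)
{
    assert(writable_bits != 0);
    return writable_bits & (~writable_bits + 1u);
}
```

With a 1 MB BAR that happened to float to 0xfff00008 at reset, the naive check sees no change and reports the BAR as read-only, while bar_writable_bits() still recovers the mask 0xfff00000. A board whose register happened to come up as zero would pass either check, which is why this sort of problem only appears across many systems.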
With the Supermicro P4DPR in the MCR cluster, I am currently tracking a number of issues that don't reproduce on every motherboard and whose fixes are actually workarounds for hardware bugs. After small-scale testing, the most efficient way for me to see whether a bug has been fixed is to flash the BIOS onto the cluster.
Then I can review the serial console boot logs and see where and how the BIOS hung, which tells me how well my fixes worked.
The most interesting fix relates to the Intel P64H2 and its erratum describing a race between a good power signal and a clock signal. When the chip gets the clock signal before it gets good power, the PCI buses do strange things, typically (but not always) locking up when the PCI bus is scanned.
The initial workaround was to reboot the machine multiple times, hoping to avoid the race. Just recently I was able to implement the Intel-recommended workaround, which involves stopping and starting the reference clock. The PLL that generates the reference clock is accessible on the SMBus, but its I2C interface is buggy. So I had to add code to detect when the PLL's I2C interface is not responding and continue on with life.
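The "detect and continue" logic can be sketched roughly as follows. Everything here is hypothetical: struct fake_pll models a clock generator that may or may not acknowledge SMBus writes, and the PLL_REF_STOP / PLL_REF_RUN values are made-up placeholders, not the real part's register encoding. The point is only the shape of the code: bounded retries, a clear result, and no hang when the device stays silent.

```c
#include <assert.h>
#include <stdint.h>

#define SMBUS_OK      0
#define SMBUS_NO_ACK  (-1)

/* Placeholder control values; the real PLL's encoding is not shown here. */
#define PLL_REF_STOP  0x00u
#define PLL_REF_RUN   0x01u

/* Hypothetical model of the clock generator on the SMBus. */
struct fake_pll {
    int      responsive;  /* 0 models the buggy interface ignoring us */
    uint8_t  control;     /* last value the device accepted */
    unsigned attempts;    /* how many writes were tried */
};

enum pll_result { PLL_CLOCK_CYCLED, PLL_NOT_RESPONDING };

int smbus_write_byte(struct fake_pll *pll, uint8_t value)
{
    pll->attempts++;
    if (!pll->responsive)
        return SMBUS_NO_ACK;
    pll->control = value;
    return SMBUS_OK;
}

/* Stop and restart the reference clock, giving up after max_tries
 * unacknowledged writes instead of hanging the boot. */
enum pll_result pll_cycle_ref_clock(struct fake_pll *pll, unsigned max_tries)
{
    unsigned i;

    assert(max_tries > 0);

    for (i = 0; i < max_tries; i++)
        if (smbus_write_byte(pll, PLL_REF_STOP) == SMBUS_OK)
            break;
    if (i == max_tries)
        return PLL_NOT_RESPONDING;  /* clock untouched; caller carries on */

    /* The clock is stopped now, so keep trying to restart it. */
    for (i = 0; i < max_tries; i++)
        if (smbus_write_byte(pll, PLL_REF_RUN) == SMBUS_OK)
            return PLL_CLOCK_CYCLED;
    return PLL_NOT_RESPONDING;
}
```

A caller would print which result it got to the serial console and then proceed to the PCI bus scan either way, so the boot log shows whether the workaround was actually applied on that node.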
I have been able to verify the proper function of this code with serial console log messages. In addition, when there is bad hardware or a BIOS issue, the bug report isn't that the node refuses to boot. The bug report is that the node dies doing X, which is much, much easier to start with when debugging a problem.
Eric