Need help with LinuxBIOS speech

Eric W. Biederman ebiederman at lnxi.com
Sun Sep 15 15:49:01 CEST 2002


"STEPHAN,YANN (HP-France,ex1)" <yann_stephan at hp.com> writes:

> I love this one:
> 
>     Less buggy <== Currently the BIOS does so many things in order to
> support all OSes that the code is huge and still written in assembly
> language.
> 
> 
> I would say more debuggable than a normal BIOS.

If you have something that will log the serial console output, it is
also immensely more debuggable.  The key is that the serial port
is trivial and can be initialized before RAM, in fact before just
about everything else.
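
As a rough illustration of how little a serial port needs: a
16550-style UART comes up with a handful of register writes. The
sketch below is not the LinuxBIOS code. The COM1 base 0x3f8, the
function names, and the outb_sim/inb_sim helpers (which write to an
array so the example runs in user space) are all assumptions, and it
ignores the extra constraints of real code that runs without RAM.

```c
#include <assert.h>
#include <stdint.h>

#define UART_BASE     0x3f8u   /* assumed: legacy COM1 I/O base */
#define UART_MAX_BAUD 115200u  /* 1.8432 MHz UART clock / 16 */

/* 16550 register offsets */
#define UART_DATA 0  /* transmit register, or divisor low byte if DLAB=1 */
#define UART_IER  1  /* interrupt enable, or divisor high byte if DLAB=1 */
#define UART_FCR  2  /* FIFO control */
#define UART_LCR  3  /* line control */
#define UART_LSR  5  /* line status */

#define LCR_8N1  0x03  /* 8 data bits, no parity, 1 stop bit */
#define LCR_DLAB 0x80  /* divisor latch access bit */
#define LSR_THRE 0x20  /* transmit holding register empty */

/* Simulated UART state; real firmware would use port I/O instead. */
static uint8_t  sim_regs[8];
static uint8_t  sim_dll, sim_dlm;
static char     sim_out[128];
static unsigned sim_len;

static void outb_sim(uint8_t val, uint16_t port)
{
    unsigned reg = (port - UART_BASE) & 7u;
    int dlab = sim_regs[UART_LCR] & LCR_DLAB;

    if (reg == UART_DATA && dlab) {
        sim_dll = val;
    } else if (reg == UART_IER && dlab) {
        sim_dlm = val;
    } else if (reg == UART_DATA) {
        if (sim_len < sizeof(sim_out) - 1)
            sim_out[sim_len++] = (char)val;
    } else {
        sim_regs[reg] = val;
    }
}

static uint8_t inb_sim(uint16_t port)
{
    unsigned reg = (port - UART_BASE) & 7u;
    /* The simulated transmitter is always ready. */
    return reg == UART_LSR ? LSR_THRE : sim_regs[reg];
}

/* Program the baud rate and 8N1 framing; no memory setup is needed. */
void uart_init(unsigned baud)
{
    assert(baud != 0 && baud <= UART_MAX_BAUD);
    unsigned divisor = UART_MAX_BAUD / baud;

    outb_sim(0x00, UART_BASE + UART_IER);               /* polled, no IRQs */
    outb_sim(0x01, UART_BASE + UART_FCR);               /* enable FIFOs */
    outb_sim(LCR_DLAB | LCR_8N1, UART_BASE + UART_LCR); /* open divisor latch */
    outb_sim((uint8_t)(divisor & 0xff), UART_BASE + UART_DATA);
    outb_sim((uint8_t)(divisor >> 8), UART_BASE + UART_IER);
    outb_sim(LCR_8N1, UART_BASE + UART_LCR);            /* close latch */
}

void uart_putc(char c)
{
    while (!(inb_sim(UART_BASE + UART_LSR) & LSR_THRE))
        ;  /* wait until the transmitter can take a byte */
    outb_sim((uint8_t)c, UART_BASE + UART_DATA);
}

void uart_puts(const char *s)
{
    for (; *s; s++) {
        if (*s == '\n')
            uart_putc('\r');
        uart_putc(*s);
    }
}
```

Calling uart_init(115200) followed by uart_puts() is all it takes to
get a first message out, which is why console output can start before
almost anything else has been set up.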

There are two great challenges when debugging a BIOS:
1) Identical hardware works differently.
2) Hardware bugs have to be worked around in software.

A classic case of identical hardware working differently is that
hardware designers often do not initialize registers (like PCI BARs)
at reset, and the registers just clamp onto whatever value they
floated to.  This has caused the LinuxBIOS PCI code to think
read-write BARs are read-only.  This kind of thing is annoying because
the problem only shows up when you have a large number of systems
using LinuxBIOS.
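
The message does not say exactly how the PCI code was fooled. One
plausible way, shown here purely as an illustration, is a probe that
compares the register against its power-on value: if the writable
bits happened to float to all ones, writing all ones changes nothing
and the BAR looks read-only. The sim_bar structure and both function
names are hypothetical, and the BAR is simulated so the example runs
anywhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated 32-bit BAR: only the bits set in rw_mask can be written. */
struct sim_bar {
    uint32_t value;    /* current contents (may be garbage after reset) */
    uint32_t rw_mask;  /* which bits the hardware lets software change */
};

static uint32_t bar_read(const struct sim_bar *b)
{
    return b->value;
}

static void bar_write(struct sim_bar *b, uint32_t v)
{
    b->value = (b->value & ~b->rw_mask) | (v & b->rw_mask);
}

/* Fragile probe: trusts the reset value as a baseline. */
bool bar_is_writable_naive(struct sim_bar *b)
{
    uint32_t before = bar_read(b);
    bar_write(b, 0xffffffffu);
    uint32_t after = bar_read(b);
    bar_write(b, before);           /* restore original contents */
    return after != before;
}

/* Sturdier probe: compares two values it wrote itself, so the
 * result does not depend on what the register floated to. */
bool bar_is_writable(struct sim_bar *b)
{
    uint32_t saved = bar_read(b);
    bar_write(b, 0xffffffffu);
    uint32_t ones = bar_read(b);
    bar_write(b, 0x00000000u);
    uint32_t zeros = bar_read(b);
    bar_write(b, saved);            /* restore original contents */
    return (ones ^ zeros) != 0;
}
```

With a BAR whose writable bits floated to all ones, the naive probe
reports it as read-only while the second probe still sees it as
writable. On boards where the same register happened to come up as
zero, both probes agree, which is how identical hardware ends up
behaving differently.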

With the Supermicro P4DPR in the MCR cluster I am currently tracking a
number of issues that don't reproduce on every motherboard and that
actually need workarounds for hardware bugs.  After small-scale
testing, the most efficient way for me to see whether a bug has been
fixed is to flash the BIOS onto the cluster.

Then I can review the serial console boot logs and see where and how
the BIOS hung, which tells me how well my fixes worked.
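
When many nodes are involved, the interesting part of each captured
log is usually its final line, since that is the last thing the BIOS
printed before it hung. A minimal sketch of pulling that line out of
a log buffer follows; the function name is hypothetical and nothing
here reflects the tooling actually used on the cluster.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the last line of `log` into `out`, skipping trailing line
 * breaks. At most outsz - 1 characters are copied and the result is
 * always NUL-terminated. Returns the number of characters copied.
 */
size_t last_log_line(const char *log, char *out, size_t outsz)
{
    size_t end = strlen(log);
    size_t start, len;

    if (outsz == 0)
        return 0;

    /* Ignore any CR/LF characters at the very end of the capture. */
    while (end > 0 && (log[end - 1] == '\n' || log[end - 1] == '\r'))
        end--;

    /* Walk back to the start of that final line. */
    start = end;
    while (start > 0 && log[start - 1] != '\n')
        start--;

    len = end - start;
    if (len >= outsz)
        len = outsz - 1;
    memcpy(out, log + start, len);
    out[len] = '\0';
    return len;
}
```

Given a capture that ends with a made-up line such as
"Scanning PCI bus 2", the function returns exactly that line, and
collecting it across all nodes shows at a glance where each one
stopped.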

The most interesting fix relates to the Intel P64H2 and its erratum
involving a race between a good power signal and a clock signal.  When
it gets the clock signal before it gets good power, the PCI buses do
strange things, typically (but not always) locking up when the PCI bus
is scanned.

The initial workaround was to reboot the machine multiple times,
hoping to avoid the race.  Just recently I was able to implement the
Intel-recommended workaround, which involves stopping and starting the
reference clock.  The PLL that generates the reference clock is
accessible on the SMBus, but its I2C interface is buggy.
So I had to add code to detect when the PLL's I2C interface is not
responding and continue on with life.
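
The shape of that "detect and carry on" logic can be sketched as
follows. Everything specific here is an assumption made for
illustration: the PLL is simulated, the single enable bit, the retry
count, and the function names are invented, and the real register
layout of the clock chip is not given in this message.

```c
#include <stdbool.h>
#include <stdint.h>

#define PLL_CLK_ENABLE 0x01  /* hypothetical: bit 0 gates the ref clock */
#define PLL_RETRIES    3     /* assumed retry budget */

/* Simulated clock generator sitting on the SMBus. */
struct sim_pll {
    bool     responsive;  /* false models the flaky I2C interface */
    uint8_t  ctrl;        /* hypothetical control register */
    unsigned toggles;     /* how many times the clock gate changed */
};

/* One bus write: returns 0 if the device answered, -1 if it did not. */
int pll_write(struct sim_pll *p, uint8_t val)
{
    if (!p->responsive)
        return -1;
    if ((p->ctrl ^ val) & PLL_CLK_ENABLE)
        p->toggles++;
    p->ctrl = val;
    return 0;
}

/* Bounded retries, so a dead device can never hang the boot. */
int pll_write_retry(struct sim_pll *p, uint8_t val)
{
    for (int i = 0; i < PLL_RETRIES; i++) {
        if (pll_write(p, val) == 0)
            return 0;
    }
    return -1;
}

/*
 * Stop and restart the reference clock. Returns true if the restart
 * was performed, false if the PLL did not respond and the step was
 * skipped so that booting can continue. A real implementation would
 * also have to consider a failure between the stop and the start.
 */
bool restart_ref_clock(struct sim_pll *p)
{
    if (pll_write_retry(p, 0x00) != 0)
        return false;                       /* give up, keep booting */
    if (pll_write_retry(p, PLL_CLK_ENABLE) != 0)
        return false;
    return true;
}
```

The point of returning a status instead of looping forever is that
the caller can print a message on the serial console and move on,
leaving a record in the boot log either way.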

I have been able to verify the proper function of this code with
serial console log messages.  In addition, when there is bad hardware
or a BIOS issue, the bug report isn't that the node refuses to boot.
The bug report is that the node dies doing X, which is much, much
easier to start with when debugging a problem.

Eric




