[coreboot] K8 machine with (a lot of) ram from 2 vendors keeps resetting

Thu Sep 3 23:06:03 CEST 2009

Hi Ward,

On Wed, Sep 2, 2009 at 8:03 PM, Ward Vandewege<ward at gnu.org> wrote:
> Hi Marc,
>
> On Wed, Sep 02, 2009 at 04:03:45PM -0600, Marc Jones wrote:
>> > With the proprietary BIOS, this setup works perfectly. That said, the manual
>> > for the board does say it's not recommended to mix memory types. Our system
>> > integrator mentions that when 16 banks are used, the memory runs at maximum
>> > 533MHz.
>>
>> As we talked about, it looks like the memory is sized correctly and
>> the next thing to try is forcing the memory speed slower. The speed
>> limitation for 8 dual rank DIMMs (16 banks) might be spec'd or an
>> erratum. A check might need to go into the main k8 mem init code.
>
> I've forced the speed down to 533MHz with this patch:
>
> --- src/northbridge/amd/amdk8/raminit_f.c (revision 4625)
> +++ src/northbridge/amd/amdk8/raminit_f.c (working copy)
> @@ -1811,6 +1811,10 @@
>    }
>    min_latency = 3;
>
> +   // Force minimum cycle time to 3.75ns (i.e. 266MHz)
> +   min_cycle_time = 0x375;
> +
> +   printk_raminit("1 bios_cycle_time: %08x\n", bios_cycle_time);
>    printk_raminit("1 min_cycle_time: %08x\n", min_cycle_time);
>
>    /* Compute the least latency with the fastest clock supported
>
> Which appears to do the right thing, but the behavior is unchanged. Here's a
> boot log prior to the patch:
>
>  http://ward.vandewege.net/coreboot/h8dme/minicom-20090902-with-64G-samsung-on-cpu1-kingston-on-cpu2-before-533MHz-limit.cap
>
> and here's after the patch:
>
>  http://ward.vandewege.net/coreboot/h8dme/minicom-20090902-with-64G-samsung-on-cpu1-kingston-on-cpu2-after-533MHz-limit.cap
>
> This is with all samsung ram on CPU1 and all kingston ram on CPU2, as you
> suggested on irc.
>
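
As a sanity check on the value in the patch, here is a small sketch of the
arithmetic. It assumes the hex digits of min_cycle_time are meant to be read
as decimal hundredths of a nanosecond (which is what the patch comment
implies), so 0x375 is 3.75 ns, a 266MHz clock, and a 533 DDR2 data rate. The
helper names are made up for illustration and are not taken from raminit_f.c.

```c
/* Assumption: each hex digit of the encoded value is a decimal digit
 * (0-9), and the whole value is in units of 0.01 ns. 0x375 -> 3.75 ns. */
static unsigned cycle_time_to_ps(unsigned encoded)
{
	unsigned hundredths = 0;
	unsigned scale = 1;

	while (encoded) {
		hundredths += (encoded & 0xf) * scale;	/* lowest digit first */
		scale *= 10;
		encoded >>= 4;
	}
	return hundredths * 10;		/* 0.01 ns == 10 ps */
}

/* Memory clock in MHz, truncated (3750 ps -> 266). Returns 0 for 0. */
static unsigned clock_mhz(unsigned encoded)
{
	unsigned ps = cycle_time_to_ps(encoded);

	return ps ? 1000000u / ps : 0;
}

/* DDR data rate: two transfers per clock (3750 ps -> 533). */
static unsigned ddr_rate(unsigned encoded)
{
	unsigned ps = cycle_time_to_ps(encoded);

	return ps ? 2000000u / ps : 0;
}
```

With these, clock_mhz(0x375) gives 266 and ddr_rate(0x375) gives 533, which
is consistent with the "533MHz" limit mentioned for 16 banks.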

I don't think that this is a RAM matching problem. Each CPU/MC has
completely matched RAM in this setup. The memory is being sized
correctly, and I had hoped that slowing it down would help. The next
step is to run a memory test or instrument the memory clear to see how
it fails. A failure on a boundary would indicate an addressing problem;
something more random would indicate timing. But it is a bit of
guesswork from here. Comparing the MC PCI registers (functions 1 and 2)
against the legacy BIOS might reveal something as well.
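
To make the boundary-versus-random distinction concrete, here is a rough
sketch of the kind of test I mean. It is not existing coreboot code; the
function names, the enum, and the "contiguous run starting on an aligned
index" heuristic are all assumptions for illustration. In firmware, base
would point at the physical range under test rather than an ordinary buffer.

```c
#include <stddef.h>
#include <stdint.h>

enum failure_kind { FAIL_NONE, FAIL_BOUNDARY, FAIL_SCATTERED };

/* Write each word's own index into it, read everything back, and record
 * the index of each mismatching word. Returns the number of indices
 * recorded (at most max_bad). */
static size_t address_pattern_test(volatile uint32_t *base, size_t words,
				   size_t *bad, size_t max_bad)
{
	size_t i, n = 0;

	for (i = 0; i < words; i++)
		base[i] = (uint32_t)i;
	for (i = 0; i < words; i++)
		if (base[i] != (uint32_t)i && n < max_bad)
			bad[n++] = i;
	return n;
}

/* Heuristic only: if the first failure sits on a multiple of align and
 * the failures form one contiguous run, suspect addressing. Anything
 * else is treated as scattered, which points more toward timing. */
static enum failure_kind classify_failures(const size_t *bad, size_t n,
					   size_t align)
{
	size_t i;

	if (n == 0)
		return FAIL_NONE;
	if (align == 0 || bad[0] % align != 0)
		return FAIL_SCATTERED;
	for (i = 1; i < n; i++)
		if (bad[i] != bad[i - 1] + 1)
			return FAIL_SCATTERED;
	return FAIL_BOUNDARY;
}
```

The align value would be chosen to match whatever boundary is suspected,
for example the size of one chip select or one node's memory, in words.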

The actual reset is probably a triple fault, which probably started
with an opcode exception. We can instrument the exception handler if
you get really stuck.

Marc

-- 
http://marcjonesconsulting.com