Hi folks,
as we know, the KGPE-D16 tends to hang during PCI init, especially if the serial console is enabled. (Timothy mentioned that he did not observe failures with the console debug level lowered; for me, however, this did not help.) Typical symptoms look like the following:
ERROR: PNP: 002e.b 70 irq size: 0x0000000001 not assigned
After discussing the issue with Timothy and running a (huge) number of experiments in different settings, it looks to me like the issue *does* seem to be memory/clock related. I tried various memory configurations and found some interesting correlations between the memory configuration and the failure rate. I also found another trigger which makes the hang *much* more likely.
For testing, I used the current coreboot master with the proposed revert which made the MC4 errors go away. After some experimenting, I found two settings which made the PCI hangs mostly go away:
1. Setting minimum memory voltage to 1.5V (probably unrelated)
2. Setting maximum memory clock down to 400 MHz (DDR3-800 instead of DDR3-1600)
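For reference, the relation between the DDR3 speed grade and the memory clock in item 2 can be sketched in a few lines of Python. This is only the standard DDR arithmetic (two transfers per I/O clock cycle, 64-bit data bus per channel); the function names are made up for illustration and are not coreboot options.

```python
def ddr3_io_clock_mhz(data_rate_mts: int) -> float:
    """I/O bus clock in MHz for a DDR3-<data_rate_mts> module.

    DDR transfers data on both clock edges, so the clock is half the
    transfer rate (DDR3-800 -> 400 MHz, DDR3-1600 -> 800 MHz).
    """
    return data_rate_mts / 2


def peak_bandwidth_mb_s(data_rate_mts: int, bus_width_bits: int = 64) -> int:
    """Theoretical peak bandwidth per channel in MB/s (assumes a 64-bit bus)."""
    return data_rate_mts * bus_width_bits // 8


for grade in (800, 1600):
    print(f"DDR3-{grade}: {ddr3_io_clock_mhz(grade):.0f} MHz clock, "
          f"{peak_bandwidth_mb_s(grade)} MB/s peak per channel")
```

So capping the clock at 400 MHz halves the theoretical per-channel bandwidth compared to running the same modules at DDR3-1600.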
With these settings, the number of PCI hangs dropped considerably across a number of different configurations (all using two Opteron 6276 CPUs). The numbers are (#hangs / #boot attempts):
1xCK0 (in slot A2): warm (0/8), cold (0/3) => no failures!
1xYK0 (in slot A2): warm (0/6), cold (0/3) => no failures!
8xYK0 (in all orange slots): warm (1/8), cold (1/5) => rare
16xYK0 (in all slots): warm (3/8), cold (1/5) => more likely
Specs of the memory modules (all from Samsung):
CK0: 8GB DDR3-1600, 1.5V
YK0: 8GB DDR3L-1600, 1.35V
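To make the counts above easier to compare, here is a small Python sketch that turns them into hang rates. The counts are copied from the list above; the dictionary keys and function names are just labels chosen for this example. With so few boots per configuration, the percentages are only a rough indication.

```python
# (hangs, boot attempts) per configuration, copied from the list above
results = {
    "1xCK0 (A2)":     {"warm": (0, 8), "cold": (0, 3)},
    "1xYK0 (A2)":     {"warm": (0, 6), "cold": (0, 3)},
    "8xYK0 (orange)": {"warm": (1, 8), "cold": (1, 5)},
    "16xYK0 (all)":   {"warm": (3, 8), "cold": (1, 5)},
}


def hang_rate(hangs: int, attempts: int) -> float:
    """Fraction of boot attempts that ended in a hang."""
    return hangs / attempts


def combined_rate(entry: dict) -> float:
    """Hang rate over warm and cold boots taken together."""
    hangs = sum(h for h, _ in entry.values())
    attempts = sum(a for _, a in entry.values())
    return hangs / attempts


for name, entry in results.items():
    warm = hang_rate(*entry["warm"])
    cold = hang_rate(*entry["cold"])
    print(f"{name:15s} warm {warm:6.1%}  cold {cold:6.1%}  "
          f"overall {combined_rate(entry):6.1%}")
```

Pooling warm and cold boots hides any difference between the two, so the per-type rates are printed alongside the overall one.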
However, there is one more, quite severe issue: if I set the following option, I get a hang in roughly 70% of the boot attempts:
Chipset => Enable PCI-E MMCONFIG Support
I wouldn't care too much about leaving this disabled. Unfortunately, I am getting a "force reboot" of the kernel each time it tries to initialize the nouveau driver. I'm not sure whether this is really related to this option or to the fact that I'm running the RDIMMs at such low settings, but we'll have to find out.
Cheers, Daniel