Hi folks,
as reported, the KGPE-D16 was mostly unusable for me in my 2x Opteron 6276 + 128 GB RAM configuration as it simply did not boot reliably - even with serial console debugging disabled completely. After experimenting with various config options and comparing my best "known half-working" config from earlier attempts, I finally found out that the hangs were related to the configuration and not to a specific coreboot version.
I attached the configs showing my current "reliable" setup (which survived 10 cold and 10 warm reboots without a single hang!) and one of the previous "unreliable" setups, which often needed several cold boots to boot successfully even once. Several options might be responsible for these hangs. Personally, I believe what helps is disabling the serial console completely rather than just disabling debug output to the serial console.
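To narrow down which options differ between the two attached configs, a small script along these lines could help. This is only a sketch: it assumes both files use the usual Kconfig `.config` syntax (`CONFIG_FOO=value` and `# CONFIG_FOO is not set`), the function names are made up for illustration, and the sample option values below are illustrative, not the contents of the attached files.

```python
def parse_config(text):
    """Parse Kconfig .config text into {option: value}; unset options map to 'n'."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_FOO is not set" -> CONFIG_FOO is disabled
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(first, second):
    """Return {option: (value_in_first, value_in_second)} for differing options.

    An option missing from one file is treated like an unset one ('n').
    """
    names = sorted(set(first) | set(second))
    return {
        name: (first.get(name, "n"), second.get(name, "n"))
        for name in names
        if first.get(name, "n") != second.get(name, "n")
    }


def diff_config_files(first_path, second_path):
    """Convenience wrapper that reads two .config files from disk."""
    with open(first_path) as f1, open(second_path) as f2:
        return diff_configs(parse_config(f1.read()), parse_config(f2.read()))


# Illustrative sample data only (NOT the attached configs).
sample_reliable = "# CONFIG_CONSOLE_SERIAL is not set\nCONFIG_SQUELCH_EARLY_SMP=y\n"
sample_unreliable = "CONFIG_CONSOLE_SERIAL=y\nCONFIG_SQUELCH_EARLY_SMP=y\n"

for name, (a, b) in diff_configs(parse_config(sample_reliable),
                                 parse_config(sample_unreliable)).items():
    print(f"{name}: reliable={a} unreliable={b}")
```

With the real files, `diff_config_files("config-reliable", "config-unreliable")` would list the candidate options; the file names here are assumed.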
As requested earlier, I also measured the boot time from pressing the power button to the "grub beep" in my 128 GB RAM configuration. Here are the results:
vendor BIOS, unoptimized, with iPXE setup: 59s
coreboot, current, with the "reliable" config: 73s
coreboot, Jan 17 2017, with the "reliable" config: 91s
coreboot, current, with the "unreliable" config: 131s
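For an easier comparison, the measurements can be put relative to the vendor BIOS. The following is just a quick sketch of that arithmetic; the dictionary keys and the helper name are shorthand chosen for this example.

```python
# Boot times in seconds (power button to "grub beep"), copied from the list above.
boot_times_s = {
    "vendor BIOS, unoptimized, iPXE setup": 59,
    "coreboot current, reliable config": 73,
    "coreboot Jan 17 2017, reliable config": 91,
    "coreboot current, unreliable config": 131,
}

BASELINE = "vendor BIOS, unoptimized, iPXE setup"


def relative_to(times, baseline):
    """Return each boot time as a multiple of the baseline entry."""
    base = times[baseline]
    return {name: seconds / base for name, seconds in times.items()}


ratios = relative_to(boot_times_s, BASELINE)
for name, ratio in ratios.items():
    extra = boot_times_s[name] - boot_times_s[BASELINE]
    print(f"{name}: {boot_times_s[name]}s ({ratio:.2f}x vendor, {extra:+d}s)")
```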
I assume that further investigation of the root cause could help to locate the real bug (e.g. in the setup of the serial console). Still, I hope that having a "working-good" config will be useful for people suffering from the same issue as I did. For me, this setup is still far from what I expected (memory is clocked too low and idle power consumption is 170W instead of 90W), but at least the machine now boots reliably every time.
Cheers, Daniel
On 03/12/2017 07:58 PM, Daniel Kulesz via coreboot wrote:
> [...]
Could you verify something for me? In internal tests it looks like setting CONFIG_SQUELCH_EARLY_SMP resolves the hang with the serial console enabled, but I need secondary verification of this due to the intermittent nature of the problem. You seem to be hardest hit by the bug so your system should make a good test case.
Thanks!
-- 
Timothy Pearson
Raptor Engineering
+1 (415) 727-8645 (direct line)
+1 (512) 690-0200 (switchboard)
https://www.raptorengineering.com
Hi Timothy,
On Tue, 21 Mar 2017 10:32:46 -0500 Timothy Pearson <tpearson@raptorengineering.com> wrote:
> Could you verify something for me? In internal tests it looks like setting CONFIG_SQUELCH_EARLY_SMP resolves the hang with the serial console enabled, but I need secondary verification of this due to the intermittent nature of the problem. You seem to be hardest hit by the bug so your system should make a good test case.
>
> Thanks!
Unfortunately, I already had this option (CONFIG_SQUELCH_EARLY_SMP) enabled when using the "config-unreliable" from my initial posting. So it looks like this setting does not stop the hangs.
Cheers, Daniel
After further external analysis, done once the KGPE-D16 was placed on our test stand for OpenBMC development, the intermittent hang resulting in a hard system lockup (requiring a full standby power cycle) appears to be fixed by this patch series on Gerrit:
https://review.coreboot.org/#/c/19280/
Can you please test and confirm?
Thanks!
-- 
Timothy Pearson
Raptor Engineering
+1 (415) 727-8645 (direct line)
+1 (512) 690-0200 (switchboard)
https://www.raptorengineering.com
On Thu, 13 Apr 2017 17:17:03 -0500 Timothy Pearson <tpearson@raptorengineering.com> wrote:
> After further external analysis once the KGPE-D16 was placed on our test stand for OpenBMC development, the intermittent hang resulting in hard system lockup (requiring a full standby power cycle) appears to be fixed in this patch series on Gerrit:
>
> https://review.coreboot.org/#/c/19280/
>
> Can you please test and confirm?
>
> Thanks!
Sure thing: I applied the patch on top of current coreboot-master and compiled the ROM using the "config-unreliable" from my initial posting. Then I did a series of five warm reboots and watched the output on the serial console. Result: no more hangs! (Previously, it took around 20 attempts to get even one successful boot.)
Great job, and many thanks!
Cheers, Daniel