GM45 S3 resume issues - coreboot

11 Nov 2015

Hi,
I've been looking into S3 resume on GM45 mainboards, which often fails
in rather interesting ways.
Many X200 units resume on every attempt (barring any panics in Linux, of
course).  Some resume only about half of the time.  And some, perhaps
most, T400 units never resume at all.
These systems fail to resume in one of the following ways:
  * S3 resume (indicated by the SLP_TYP bit) is detected, SLP_TYP is
    cleared, DRAM receive-enable calibration fails with a timing
    under/overflow, the system resets, and coreboot boots normally into
    the payload (with the sleep LED still on) because SLP_TYP is now
    unset.  See x200-resume-fail-receive-enable-calibration.log and
    t400-resume-fail-receive-enable-calibration.log.
  * S3 resume is detected, SLP_TYP is cleared, raminit and the rest of
    romstage complete without error, but then something between the
    southbridge's smm_init() and cpu_initialize() hangs (maybe the
    system is stuck in SMM).  See x200-resume-fail-smm-hang.log and
    t400-resume-fail-smm-hang.log.
  * S3 resume is detected, SLP_TYP is cleared, romstage completes, but
    something within smm_init() hangs before dumping (possibly while
    clearing [1]) TCO1_STS bits.  See t400-resume-fail-tco-hang.log.
There are a couple of other ways in which I've seen S3 resume fail, but
these are the most common.
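
For reference, the detect-and-clear step that all three failure modes
share can be sketched roughly as below.  This is not the coreboot code:
the helper names are made up for illustration, and the S3 encoding (5)
is an assumption that should be checked against the ICH9 datasheet.
Only the field mask, 7 << 10, is taken from the romstage code quoted in
the patch further down.

```c
#include <assert.h>
#include <stdint.h>

/* SLP_TYP is the 3-bit field that romstage masks with (7 << 10). */
#define SLP_TYP_SHIFT 10
#define SLP_TYP_MASK  (7u << SLP_TYP_SHIFT)
/* Assumption: 5 is the S3 (suspend-to-RAM) encoding on ICH9. */
#define SLP_TYP_S3    5u

/* Hypothetical helper: extract SLP_TYP from a PM1_CNT value. */
static uint32_t pm1_cnt_slp_typ(uint32_t pm1_cnt)
{
	return (pm1_cnt & SLP_TYP_MASK) >> SLP_TYP_SHIFT;
}

/* Hypothetical helper: does this PM1_CNT value indicate an S3 resume? */
static int pm1_cnt_is_s3(uint32_t pm1_cnt)
{
	return pm1_cnt_slp_typ(pm1_cnt) == SLP_TYP_S3;
}

/* Hypothetical helper: zero SLP_TYP and keep every other bit. */
static uint32_t pm1_cnt_clear_slp_typ(uint32_t pm1_cnt)
{
	return pm1_cnt & ~SLP_TYP_MASK;
}

/* Usage example on a made-up register value (S3 encoding plus bit 0). */
static void pm1_cnt_example(void)
{
	uint32_t pm1_cnt = (SLP_TYP_S3 << SLP_TYP_SHIFT) | 1u;

	assert(pm1_cnt_is_s3(pm1_cnt));
	pm1_cnt = pm1_cnt_clear_slp_typ(pm1_cnt);
	assert(!pm1_cnt_is_s3(pm1_cnt));
}
```

The point is that once the field has been cleared, a later reset can no
longer be recognized as a resume, which is why the first failure mode
ends up in the payload.
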
I thought of working around the first issue (clearing SLP_TYP, resetting
due to a raminit error, then booting into the payload) by clearing
SLP_TYP near the end of the romstage main() (after raminit).  So I tried
the following patch:
---

diff --git a/src/mainboard/lenovo/x200/romstage.c b/src/mainboard/lenovo/x200/romstage.c
index 86a973f..915baf2 100644
--- a/src/mainboard/lenovo/x200/romstage.c
+++ b/src/mainboard/lenovo/x200/romstage.c
@@ -103,10 +103,6 @@ void main(unsigned long bist)
 #if CONFIG_HAVE_ACPI_RESUME
    	printk(BIOS_DEBUG, "Resume from S3 detected.\n");
    	s3resume = 1;
-		/* Clear SLP_TYPE. This will break stage2 but
-		 * we care for that when we get there.
-		 */
-		outl(pm1_cnt & ~(7 << 10), DEFAULT_PMBASE + 0x04);
 #else
    	printk(BIOS_DEBUG, "Resume from S3 detected, but disabled.\n");
 #endif
@@ -190,6 +186,11 @@ void main(unsigned long bist)
    	/* Magic for S3 resume */
    	pci_write_config32(PCI_DEV(0, 0, 0), D0F0_SKPD, SKPAD_ACPI_S3_MAGIC);
+
+		/* Clear SLP_TYPE. This will break stage2 but
+		 * we care for that when we get there.
+		 */
+		outl(pm1_cnt & ~(7 << 10), DEFAULT_PMBASE + 0x04);
    } else {
    	/* Magic for S3 resume */
    	pci_write_config32(PCI_DEV(0, 0, 0), D0F0_SKPD, SKPAD_NORMAL_BOOT_MAGIC);
---
But that just made these errors even more frequent.  Trying to resume
from S3 put the system into a reset loop with receive-enable calibration
errors (see x200-patched-resume-fail-receive-enable-loop.log).  So
instead of rebooting into the payload or hanging, the system just resets
forever.
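
To make the difference between the two orderings concrete, here is a toy
model of the boot flow.  It is only a sketch under a strong assumption:
that raminit fails every time romstage takes the resume path (in
reality the failures are intermittent on some units).  All names are
hypothetical; nothing here is coreboot code.

```c
#include <assert.h>
#include <stdbool.h>

enum clear_point { CLEAR_BEFORE_RAMINIT, CLEAR_AFTER_RAMINIT };
enum outcome { BOOTED_PAYLOAD, RESUMED, RESET_LOOP };

struct result {
	enum outcome outcome;
	int resets;
};

/*
 * Toy model: SLP_TYP survives a reset unless romstage clears it first.
 * raminit_fails_on_resume is the simplifying assumption described above.
 * The model gives up and reports RESET_LOOP after max_resets resets.
 */
static struct result simulate(enum clear_point when,
			      bool raminit_fails_on_resume, int max_resets)
{
	bool slp_typ_s3 = true;	/* left set by the OS when entering S3 */
	struct result r = { RESET_LOOP, 0 };

	assert(max_resets > 0);

	while (r.resets < max_resets) {
		bool s3resume = slp_typ_s3;	/* romstage detects resume */

		if (s3resume && when == CLEAR_BEFORE_RAMINIT)
			slp_typ_s3 = false;	/* original ordering */

		if (s3resume && raminit_fails_on_resume) {
			r.resets++;		/* calibration error, reset */
			continue;
		}

		if (s3resume && when == CLEAR_AFTER_RAMINIT)
			slp_typ_s3 = false;	/* patched ordering */

		r.outcome = s3resume ? RESUMED : BOOTED_PAYLOAD;
		return r;
	}
	return r;
}
```

With the failure assumption enabled, simulate(CLEAR_BEFORE_RAMINIT, true,
10) ends in BOOTED_PAYLOAD after one reset, while
simulate(CLEAR_AFTER_RAMINIT, true, 10) never leaves the loop and
reports RESET_LOOP.  That matches what I see before and after the patch.
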
So, it seems that whenever the system boots with SLP_TYP set, something
is very likely to go wrong in coreboot and cause either a reset or a
hang.
Does anyone have any ideas why this might be?
[1]: Tangentially, I noticed that the i82801ix reset_tco_status() says
     "Don't clear BOOT_STS before SECOND_TO_STS" when it clears
     BOOT_STS.  In the next two lines it clears BOOT_STS if set.  It
     never clears SECOND_TO_STS.  Is this a bug?  However, according to
     the ICH9 datasheet, there is no SECOND_TO_STS bit in TCO1_STS (the
     high bits of that register are reserved).
Thanks,
-- 
Patrick "P. J." McDermott
  http://www.pehjota.net/
Lead Developer, ProteanOS
  http://www.proteanos.com/