Hello Michał, Dear coreboot community,
Is there any update on this issue since the last message?
My situation is exactly the same, both with our custom TGL-UP3/LP4x board and with the Intel TGL-UP3-LP4x RVP. I can provide more details on this issue that came out of my effort to resolve it.
The only change to the public code is that I removed the hard dependencies on chromeec (commented out EC calls from ec.c and forced board_id to TGL_UP3_LP4_MICRON) in order to use the original Intel EC binary that comes inside the reference UEFI image (I have no interest in EC development on my board and had problems with building chromeec for the RVP).
Observations:
- No matter whether I set or unset CONFIG_USE_INTEL_FSP_MP_INIT or CONFIG_USE_INTEL_FSP_TO_CALL_COREBOOT_PUBLISH_MP_PPI, I still get the same behavior ("Clearing pending MCEs... [reset]").
- Watch out for DCI: when enabled (e.g. partially, in FSP-M and not in FSP-S), it can make some assertions fail in debug FSP or cause resets with release FSP even before encountering the core issue.
- The gdb stub is currently broken for platforms that set IDT_IN_EVERY_STAGE=y. See my previous thread "GDB stub & bootblock dependencies (CONFIG_IDT_IN_EVERY_STAGE=y)" for a possible solution (sorry, can't upstream code from Siemens yet).
The output is similar to what Michał Żygowski already wrote before (see the attachment for a full version):
....
Clearing SMI status registers
SMI_STS: PM1
PM1_STS: TMROF
TCO_STS: INTRD_DET
GPE0 STD STS: BATLOW
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7b000000, cpu = 0
In relocation handler: CPU 0
New SMBASE=0x7b000000 IEDBASE=0x7b400000
Writing SMRR. base = 0x7b000006, mask=0xff800c00
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7afff400, cpu = 3
In relocation handler: CPU 3
New SMBASE=0x7afff400 IEDBASE=0x7b400000
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7afff800, cpu = 2
In relocation handler: CPU 2
New SMBASE=0x7afff800 IEDBASE=0x7b400000
Writing SMRR. base = 0x7b000006, mask=0xff800c00
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7afffc00, cpu = 1
In relocation handler: CPU 1
New SMBASE=0x7afffc00 IEDBASE=0x7b400000
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7affec00, cpu = 5
In relocation handler: CPU 5
New SMBASE=0x7affec00 IEDBASE=0x7b400000
Writing SMRR. base = 0x7b000006, mask=0xff800c00
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7afff000, cpu = 4
In relocation handler: CPU 4
New SMBASE=0x7afff000 IEDBASE=0x7b400000
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7affe400, cpu = 7
In relocation handler: CPU 7
New SMBASE=0x7affe400 IEDBASE=0x7b400000
Writing SMRR. base = 0x7b000006, mask=0xff800c00
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7affe800, cpu = 6
In relocation handler: CPU 6
New SMBASE=0x7affe800 IEDBASE=0x7b400000
Relocation complete.
Initializing CPU #0
CPU: vendor Intel device 806c1
CPU: family 06, model 8c, stepping 01
Clearing out pending MCEs
[ here comes the reset ]
The defconfig is also attached. The coreboot version noted here is rather old because there has been a problem with SPD data availability for the RVP, and this was the only time it passed FSP meminit. Nevertheless, the issue can still be reproduced on the current version on the custom TGL-UP3/LP4x board (for which I unfortunately cannot provide sources).
On the custom board with the same CPU/DRAM configuration, where the failure also occurs, I tried to skip the mca_configure() call in src/soc/intel/tigerlake/cpu.c, but the failure just moves to the LAPIC setup that follows it. Adding waiting loops before the mca_configure() call did not prevent the resets, which suggests that the cause is probably not timing-dependent. Adding more debug output to the mca_configure() function in src/soc/intel/common/block/cpu/cpulib.c showed that the reset occurs exactly at the wrmsr call that writes {0xffffffff, 0xffffffff} to one of the MCE banks in order to clear it (the bank number tends to be 4, but not every time). GBLRST_CAUSE is always 00000000 00000000 after the reset.
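To illustrate the kind of per-bank tracing meant here, below is a minimal host-side sketch, not the actual code from cpulib.c. The MSR numbers (IA32_MCG_CAP = 0x179, IA32_MC0_CTL = 0x400, IA32_MC0_STATUS = 0x401, four MSRs per bank) are the architectural ones. The array-backed rdmsr()/wrmsr() and the name mca_configure_traced() are hypothetical stand-ins so the loop can run outside firmware, and the write order (zero to STATUS, then all-ones to CTL) is an assumption about a typical MCA init loop.

```c
#include <stdint.h>
#include <stdio.h>

#define IA32_MCG_CAP            0x179
#define IA32_MCG_CAP_COUNT_MASK 0xff
#define IA32_MC0_CTL            0x400
#define IA32_MC0_STATUS         0x401
#define MSRS_PER_BANK           4

typedef struct { uint32_t lo, hi; } msr_t;

/* Hypothetical mock MSR space; on real hardware these would be the
 * rdmsr/wrmsr instructions. */
msr_t fake_msrs[0x1000];

msr_t rdmsr(unsigned int index) { return fake_msrs[index]; }
void wrmsr(unsigned int index, msr_t value) { fake_msrs[index] = value; }

/* Walk all MCA banks, printing a line before and after every MSR write so
 * that the last line seen on the console identifies the faulting access.
 * Returns the number of banks processed. */
unsigned int mca_configure_traced(void)
{
	const msr_t zero = { 0, 0 };
	const msr_t ones = { 0xffffffff, 0xffffffff };
	unsigned int num_banks = rdmsr(IA32_MCG_CAP).lo & IA32_MCG_CAP_COUNT_MASK;
	unsigned int i;

	for (i = 0; i < num_banks; i++) {
		unsigned int status_msr = IA32_MC0_STATUS + i * MSRS_PER_BANK;
		unsigned int ctl_msr = IA32_MC0_CTL + i * MSRS_PER_BANK;
		msr_t st = rdmsr(status_msr);

		printf("bank %u: STATUS before = %08x %08x\n", i,
		       (unsigned int)st.hi, (unsigned int)st.lo);
		wrmsr(status_msr, zero);	/* assumed: clear pending status */
		printf("bank %u: STATUS cleared, writing CTL\n", i);
		wrmsr(ctl_msr, ones);		/* assumed: enable all error sources */
		printf("bank %u: CTL written\n", i);
	}
	return num_banks;
}
```

With output like this, a reset right after "STATUS cleared, writing CTL" for a given bank would point at the all-ones write rather than at the status clear.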
According to the public Intel SDM (#325462), volume 2D, page 6-14, section "Operation in a Uni-Processor Platform", there is an algorithm described in Fortran-like pseudocode which corresponds to the actual implementation of mca_configure() in coreboot:
FOR I = 0 to IA32_MCG_CAP.COUNT-1 DO
    IF (IA32_MC[I]_STATUS = uncorrectable error)
        THEN #GP(0);
I don't know how to verify whether the cause of the reset is a #GP (general-protection exception) raised by the wrmsr. As mentioned, GBLRST_CAUSE is always 0 after the reset occurs on the custom board.
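One check that could at least be logged before the write is whether any bank already reports an uncorrectable error, mirroring the quoted pseudocode. The sketch below is only an illustration: it takes the 64-bit IA32_MCi_STATUS values as a plain array so it runs on a host, the function name first_uncorrectable_bank() is hypothetical, and reading "uncorrectable error" as VAL (bit 63) and UC (bit 61) both being set is my assumption about what the SDM condition means.

```c
#include <stdint.h>

/* Architectural bit positions in IA32_MCi_STATUS. */
#define MCI_STATUS_VAL (1ULL << 63)	/* register contents are valid */
#define MCI_STATUS_UC  (1ULL << 61)	/* error was not corrected */

/* Returns the index of the first bank whose status reports a valid,
 * uncorrected error, or -1 if no bank does.
 * status[i] holds the 64-bit value read from IA32_MCi_STATUS. */
int first_uncorrectable_bank(const uint64_t *status, unsigned int count)
{
	unsigned int i;

	for (i = 0; i < count; i++) {
		/* Assumed meaning of "uncorrectable error": VAL and UC set. */
		if ((status[i] & MCI_STATUS_VAL) && (status[i] & MCI_STATUS_UC))
			return (int)i;
	}
	return -1;
}
```

In firmware the array would be filled by reading each bank's status MSR before anything is written. A non-negative result there would show that an uncorrectable error was already latched before the failing wrmsr.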
Thanks for any ideas. Have a nice weekend.
Regards, Jan
Jan Samek Siemens, s.r.o. ADV D EU CZ AE AC 7 jan.samek@siemens.com