TigerLake RVP TCSS init failure

Michal Zygowski

1 Feb 2021

4:48 p.m.

Dear coreboot community,

I have encountered a problem with silicon init on the Tiger Lake RVP platform. I managed to resolve the previous issues with memory initialization and am now hitting an error in TCSS init. The FSP asserts on the IOM ready check, which reads 0. The configuration selects CONFIG_USE_INTEL_FSP_MP_INIT (without the MP PPI service).

When CONFIG_USE_INTEL_FSP_TO_CALL_COREBOOT_PUBLISH_MP_PPI is selected, FSP-S returns without error (at least from one of its phases, I guess), but the platform resets after clearing MCEs in coreboot's CPU init:

CPU: vendor Intel device 806c0
CPU: family 06, model 8c, stepping 00
Clearing out pending MCEs
Setting up local APIC... apic_id: 0x00 done.
Turbo is available but hidden
Turbo is available and visible
CPU #0 initialized
Initializing CPU #2
Initializing CPU #6
Initializing CPU #7
CPU: vendor Intel device 806c0
CPU: family 06, model 8c, stepping 00
CPU: vendor Intel device 806c0
CPU: family 06, model 8c, stepping 00
Clearing out pending MCEs
Cl
(the reset occurs here)

Any ideas what may cause these issues? When I clean this up, I will upstream the DDR4 variant of TGL UP3 RVP.

-- Michał Żygowski Firmware Engineer https://3mdeb.com | @3mdeb_com

Michal Zygowski

9 Feb

4:38 p.m.

Any ideas what may be wrong?

I can share more details/logs if needed.

-- Michał Żygowski Firmware Engineer https://3mdeb.com | @3mdeb_com

Guillermo Placencia

8:32 p.m.

Samek, Jan

20 Aug

3:22 p.m.

Hello Michał, Dear coreboot community,

Is there any update on this issue since the last message?

My situation is exactly the same with our custom TGL-UP3/LP4x board as well as with the Intel TGL-UP3-LP4x RVP. I can provide more details on this issue from my efforts to resolve it.

The only change to the public code is that I removed the hard dependencies on chromeec (commented out EC calls from ec.c and forced board_id to TGL_UP3_LP4_MICRON) in order to use the original Intel EC binary that comes inside the reference UEFI image (I have no interest in EC development on my board and had problems with building chromeec for the RVP).

Observations:
- No matter whether I set or unset CONFIG_USE_INTEL_FSP_MP_INIT or CONFIG_USE_INTEL_FSP_TO_CALL_COREBOOT_PUBLISH_MP_PPI, I still get the same behavior ("Clearing pending MCEs... [reset]").
- Watch out for DCI: when enabled (e.g. partially, in FSP-M and not in FSP-S), it can make some assertions fail in debug FSP or cause resets with release FSP even before encountering the core issue.
- The gdb stub is currently broken for platforms that set IDT_IN_EVERY_STAGE=y; see my previous thread "GDB stub & bootblock dependencies (CONFIG_IDT_IN_EVERY_STAGE=y)" for a possible solution (sorry, can't upstream code from Siemens yet).

The output is similar to what Michał Żygowski already wrote before (see the attachment for a full version):

....
Clearing SMI status registers
SMI_STS: PM1
PM1_STS: TMROF
TCO_STS: INTRD_DET
GPE0 STD STS: BATLOW
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7b000000, cpu = 0
In relocation handler: CPU 0
New SMBASE=0x7b000000 IEDBASE=0x7b400000
Writing SMRR. base = 0x7b000006, mask=0xff800c00
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7afff400, cpu = 3
In relocation handler: CPU 3
New SMBASE=0x7afff400 IEDBASE=0x7b400000
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7afff800, cpu = 2
In relocation handler: CPU 2
New SMBASE=0x7afff800 IEDBASE=0x7b400000
Writing SMRR. base = 0x7b000006, mask=0xff800c00
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7afffc00, cpu = 1
In relocation handler: CPU 1
New SMBASE=0x7afffc00 IEDBASE=0x7b400000
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7affec00, cpu = 5
In relocation handler: CPU 5
New SMBASE=0x7affec00 IEDBASE=0x7b400000
Writing SMRR. base = 0x7b000006, mask=0xff800c00
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7afff000, cpu = 4
In relocation handler: CPU 4
New SMBASE=0x7afff000 IEDBASE=0x7b400000
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7affe400, cpu = 7
In relocation handler: CPU 7
New SMBASE=0x7affe400 IEDBASE=0x7b400000
Writing SMRR. base = 0x7b000006, mask=0xff800c00
Relocation complete.
smm_do_relocation : curr_smbase 0x30000 perm_smbase 0x7affe800, cpu = 6
In relocation handler: CPU 6
New SMBASE=0x7affe800 IEDBASE=0x7b400000
Relocation complete.
Initializing CPU #0
CPU: vendor Intel device 806c1
CPU: family 06, model 8c, stepping 01
Clearing out pending MCEs
[ here comes the reset ]

The defconfig is also attached. The coreboot version noted here is rather old because there has been a problem with SPD data availability for the RVP, and this was the only time it passed FSP meminit. Nevertheless, the issue can still be reproduced with the current version on the custom TGL-UP3/LP4X board (for which I unfortunately cannot provide sources).

On the custom board with the same CPU/DRAM configuration, where the failure also occurs, I tried to skip the mca_configure() call in src/soc/intel/tigerlake/cpu.c, but the failure just moves to the LAPIC setup that follows it. Adding wait loops before the mca_configure() call did not prevent the resets, which suggests that the cause is not timing-dependent. Adding more debug output to the mca_configure() function in src/soc/intel/common/block/cpu/cpulib.c showed that the reset occurs exactly at the wrmsr call that writes {0xffffffff, 0xffffffff} to one of the MCE banks in order to clear it (the bank number tends to be 4, but not every time). GBLRST_CAUSE is always 00000000 00000000 after the reset.
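For anyone adding similar debug output, here is a minimal sketch of how a bank number maps to its MSR addresses. It assumes only the architectural layout from the Intel SDM (IA32_MCG_CAP at 0x179 with the bank count in bits 7:0, and four MSRs per bank starting at 0x400). The helper names are hypothetical and are not coreboot's own, and the MCG_CAP value in the self-check is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural MSR numbers from the Intel SDM. */
#define IA32_MCG_CAP 0x179u
#define IA32_MC0_CTL 0x400u

/* Number of machine-check banks: IA32_MCG_CAP bits 7:0. */
static inline unsigned int mc_bank_count(uint64_t mcg_cap)
{
	return (unsigned int)(mcg_cap & 0xff);
}

/* Each bank owns four consecutive MSRs: CTL, STATUS, ADDR, MISC. */
static inline uint32_t mc_ctl_msr(unsigned int bank)
{
	return IA32_MC0_CTL + 4 * bank;
}

static inline uint32_t mc_status_msr(unsigned int bank)
{
	return mc_ctl_msr(bank) + 1;
}

/* Hypothetical self-check; the MCG_CAP value is invented for illustration. */
static inline void mc_msr_self_check(void)
{
	assert(mc_ctl_msr(0) == 0x400);
	assert(mc_ctl_msr(4) == 0x410);      /* bank 4, the one seen most often */
	assert(mc_status_msr(4) == 0x411);
	assert(mc_bank_count(0x0c0a) == 10);
}
```

With this layout, a reset on bank 4 would correspond to a write to MSR 0x410; printing the MSR number next to the bank index makes that easy to correlate with a debugger trace.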

According to the public Intel SDM (#325462), volume 2D, page 6-14, section "Operation in a Uni-Processor Platform", there is an algorithm described in Fortran-like pseudocode which corresponds to the actual implementation of mca_configure() in coreboot:

FOR I = 0 to IA32_MCG_CAP.COUNT-1 DO
    IF (IA32_MC[I]_STATUS = uncorrectable error)
        THEN #GP(0);

I don't know how to verify whether the reset is caused by the general protection exception (#GP) that wrmsr can raise. As mentioned, GBLRST_CAUSE is always 0 after the reset occurs on the custom board.
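One way to narrow this down might be to read each bank's IA32_MCi_STATUS before touching it and check whether it already holds a valid uncorrected error. The sketch below only decodes a status value; it assumes the architectural bit positions from the SDM (VAL = 63, OVER = 62, UC = 61, EN = 60, PCC = 57, MCA error code in bits 15:0). The function names and the sample values in the note are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural IA32_MCi_STATUS flag bits (Intel SDM). */
#define MC_STATUS_VAL  (1ULL << 63) /* register contents are valid */
#define MC_STATUS_OVER (1ULL << 62) /* an earlier error was overwritten */
#define MC_STATUS_UC   (1ULL << 61) /* error was not corrected */
#define MC_STATUS_EN   (1ULL << 60) /* reporting was enabled in MCi_CTL */
#define MC_STATUS_PCC  (1ULL << 57) /* processor context corrupted */

/* True when the bank holds a valid, uncorrected error. */
static inline bool mc_status_is_uncorrected(uint64_t status)
{
	return (status & MC_STATUS_VAL) && (status & MC_STATUS_UC);
}

/* Architectural MCA error code: bits 15:0. */
static inline uint16_t mc_status_mca_code(uint64_t status)
{
	return (uint16_t)(status & 0xffff);
}
```

For example, a made-up status of 0xB200000000000005 has VAL, UC, EN and PCC set and would be reported as uncorrected with MCA code 5, while 0x8000000000000005 is valid but corrected.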

Thanks for any ideas. Have a nice weekend.

Regards, Jan

Jan Samek Siemens, s.r.o. ADV D EU CZ AE AC 7 jan.samek@siemens.com

Michał Żygowski

23 Aug

6:16 p.m.

Hi Jan,

Unfortunately I haven't been able to resolve the issues I had and I can already see you are bumping into the same ones I have experienced. Also (unfortunately again) I don't have good news for you:

1. You can't make it work (I have been trying to get help from Intel on how to move on with this platform, without success).
2. Why can't you make it work? Most likely because you have an engineering sample (ES) CPU/SoC, which is shipped by default with this platform. And this ES CPU will simply not work, as I have been told. I have been fighting with this platform for weeks, with both coreboot and MinPlatform, without success.

Switching to a production SoC should do the magic according to Intel employees who tested on their side both coreboot and EDK2 MinPlatform. At this point I gave up fighting this unfair battle.

Sorry to be discouraging, I just want to give a sincere opinion of what I have experienced. I would consider buying a production SoC/CPU to save the time and frustration during the bringup (which should be very easy and fast with an RVP).

Best regards,

-- Michał Żygowski Firmware Engineer GPG: 6B5BA214D21FCEB2 https://3mdeb.com | @3mdeb_com

Samek, Jan

24 Aug

10 a.m.

New subject: Fw: Re: TigerLake RVP TCSS init failure

Hello Michał,

No need to apologize for the discouragement; this is valuable information that I wish I had had before I invested so much time into trying to solve the issue (I should've asked earlier, of course).

At least the hope is that the production units should work. Now I guess it's time for us to solve the situation internally and with Intel.

Thank you again.

Regards, Jan Samek Siemens, s.r.o. ADV D EU CZ AE AC 7 jan.samek@siemens.com

Samek, Jan

11:08 a.m.

Hello again Michal, I'd like to additionally ask you about a small detail regarding the issue: What was the stepping that started to work?

I'm currently encountering this behavior on B-0. I did have some really bad memory init errors on A-0, which was considered an engineering sample, and swapping it for B-0 solved them. Nevertheless, this MCE issue still persists on B-0 in my case.
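As a side note on reading the stepping from the logs: the console output earlier in this thread prints the raw CPUID signature (806c0 in the first report, 806c1 in the later one). A minimal sketch of the standard CPUID.1:EAX decoding follows; the struct and function names are hypothetical, and how the numeric stepping maps to names such as A-0 or B-0 is not stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

struct cpu_sig {
	unsigned int family;
	unsigned int model;
	unsigned int stepping;
};

/* Decode CPUID leaf 1 EAX into display family/model/stepping. */
static inline struct cpu_sig decode_cpuid_sig(uint32_t eax)
{
	struct cpu_sig s;
	unsigned int base_family = (eax >> 8) & 0xf;
	unsigned int base_model = (eax >> 4) & 0xf;

	s.stepping = eax & 0xf;

	/* The extended family is added only when the base family is 0xf. */
	s.family = base_family;
	if (base_family == 0xf)
		s.family += (eax >> 20) & 0xff;

	/* The extended model is prepended for families 0x6 and 0xf. */
	s.model = base_model;
	if (base_family == 0x6 || base_family == 0xf)
		s.model |= ((eax >> 16) & 0xf) << 4;

	return s;
}
```

Decoding 0x806c1 this way gives family 0x6, model 0x8c, stepping 1, which agrees with the "family 06, model 8c, stepping 01" line in the log.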

Thanks for info.

Regards, Jan Samek Siemens, s.r.o. ADV D EU CZ AE AC 7 jan.samek@siemens.com

Naresh G. Solanki

11:45 a.m.

I remember working on an issue related to "FSP asserts on IOM ready check", I guess on the ICL RVP. It got fixed for me after making sure the proper IOM binary was added using the FIT tool.

I'm sure your setup is able to boot with the vendor-provided BIOS. So another option is to extract iom.bin from a working BIOS and pack it into your generated coreboot firmware. Then give it a try.

Regards, Naresh Solanki

Michał Żygowski

12:38 p.m.

I couldn't get the IOM working despite preserving the ME region during flash updates, so I have literally been using the same IOM binary as shipped. Using the same IOM binary doesn't guarantee correct operation of the TCSS, and it ends up in the assert. Why? Because for some reason the engineering sample PCH has been forcing the ME into a disabled state. No command could get the ME out of this state, which is probably the reason why the IOM was not working correctly either (the ME didn't load the firmware?), or the FSP was not compatible with pre-production silicon.

Best regards,

-- Michał Żygowski Firmware Engineer GPG: 6B5BA214D21FCEB2 https://3mdeb.com | @3mdeb_com

Naresh G. Solanki

1:43 p.m.

It would be best to verify this by extracting the binaries with the FIT tool, if possible. Ideally, the CRCs of the IOM, MG and TBT firmware should match those in the working UEFI BIOS.
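If the extracted blobs are available as plain files, the comparison can also be done by hand. The sketch below is only an illustration: which checksum the FIT tool itself reports is not stated here, so it assumes plain CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (IEEE 802.3), reflected polynomial 0xEDB88320. */
static inline uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xffffffffu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xedb88320u & (0u - (crc & 1u)));
	}
	return (uint32_t)~crc;
}

/* Two blobs are considered identical when length and CRC both match. */
static inline bool blobs_match(const uint8_t *a, size_t a_len,
			       const uint8_t *b, size_t b_len)
{
	return a_len == b_len && crc32_ieee(a, a_len) == crc32_ieee(b, b_len);
}
```

A matching CRC only shows that the same binary ended up in the image; it says nothing about whether the ME actually loads it at runtime.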

Also, enabling CR4.MCE may help when things go wrong, as it triggers the exception handler.

If you have ITP/DCI/cscript, please step through the code around the point where it hangs. The exact instruction that causes the hang may give a hint. Alternatively, add more printk calls around the failure to locate the exact failing point or the nature of the failure.

If clearing the MCA is skipped, does it still cause a hang?

You mentioned: "engineering sample PCH have been forcing the ME into disabled state". Is this seen with the UEFI BIOS?

Regards, Naresh Solanki

Michał Żygowski

1:50 p.m.

Hi Naresh,

On 24.08.2021 13:43, Naresh G. Solanki wrote:

> You mentioned: "engineering sample PCH have been forcing the ME into disabled state" Is this seen with UEFI bios ?

Yes it is. I have observed it running the shipped Intel UEFI firmware as well.

Best regards,

-- Michał Żygowski Firmware Engineer GPG: 6B5BA214D21FCEB2 https://3mdeb.com | @3mdeb_com

Naresh G. Solanki

1:57 p.m.

Please verify based on the other points I mentioned. If possible, start with the FIT tools.

Samek, Jan

2:20 p.m.

Hello Naresh,

Thanks for the tip with IOM.

I will definitely revisit the blob handling (including IOM) on my side because I think that's not the part I can be proud of in my development setup. The chance of something going wrong here is rather high.

When I remove the mca_configure() call, the failure occurs on the next call (LAPIC setup) as well. When I postpone it by wait loops before mca_configure(), it has no effect - so I don't think it's timing related (for details see my original message).

@Michal Zygowski:

Regarding the stepping, I used A-0 in the past (until February/March) and had memory init issues on the DDR4 variant of the RVP, as you describe, and the reference UEFI could not initialize the memory either.

After upgrading to B-0, this issue was solved. However, shortly after I began porting coreboot to our design, this MCE error prevented any further progress with coreboot on any of the boards (I continued only with the LPDDR4x variant, as this is what our design uses, and abandoned the DDR4 variant).

Also thanks for the public bug tracker links, that's much easier to follow.

Regards, Jan

Samek, Jan

25 Oct

5:04 p.m.

Hello coreboot Community,

After a long time, there's an update to this Tiger Lake issue:

For now, the masks in mca_configure() are used as a workaround to ignore the MCEs:

--- a/src/soc/intel/common/block/cpu/cpulib.c
+++ b/src/soc/intel/common/block/cpu/cpulib.c
@@ -346,7 +346,7 @@ void mca_configure(void)
 	for (i = 0; i < num_banks; i++) {
 		/* Initialize machine checks */
 		wrmsr(IA32_MC_CTL(i),
-			(msr_t) {.lo = 0xffffffff, .hi = 0xffffffff});
+			(msr_t) {.lo = 0, .hi = 0}); /* FIXME: MCEs temp. disabled */
 	}

Werner found that these MCEs are set by FSP-M, possibly because of wrong FSP parameters, SPD data, etc. It was also necessary to disable MCE checking in the FSP-S UPD to get through silicon init.

Nevertheless, after discussion with Intel and Werner, the root cause might be the MTRR setup. Judging from the logs, the values indeed look somewhat strange to me. Sorry, I have no clue yet how to set up the MTRRs correctly or what they should look like.

...
BS: BS_WRITE_TABLES run times (exec / console): 7 / 307 ms
MTRR: Physical address space:
0x0000000000000000 - 0x00000000000a0000 size 0x000a0000 type 6
0x00000000000a0000 - 0x00000000000c0000 size 0x00020000 type 0
0x00000000000c0000 - 0x0000000077000000 size 0x76f40000 type 6
0x0000000077000000 - 0x0000000080000000 size 0x09000000 type 0
0x0000000080000000 - 0x0000000090000000 size 0x10000000 type 1
0x0000000090000000 - 0x0000000100000000 size 0x70000000 type 0
0x0000000100000000 - 0x0000000480400000 size 0x380400000 type 6
MTRR: Fixed MSR 0x250 0x0606060606060606
MTRR: Fixed MSR 0x258 0x0606060606060606
MTRR: Fixed MSR 0x259 0x0000000000000000
MTRR: Fixed MSR 0x268 0x0606060606060606
MTRR: Fixed MSR 0x269 0x0606060606060606
MTRR: Fixed MSR 0x26a 0x0606060606060606
MTRR: Fixed MSR 0x26b 0x0606060606060606
MTRR: Fixed MSR 0x26c 0x0606060606060606
MTRR: Fixed MSR 0x26d 0x0606060606060606
MTRR: Fixed MSR 0x26e 0x0606060606060606
MTRR: Fixed MSR 0x26f 0x0606060606060606
call enable_fixed_mtrr()
CPU physical address size: 39 bits
MTRR: default type WB/UC MTRR counts: 6/7.
MTRR: WB selected as default type.
MTRR: 0 base 0x0000000077000000 mask 0x0000007fff000000 type 0
MTRR: 1 base 0x0000000078000000 mask 0x0000007ff8000000 type 0
MTRR: 2 base 0x0000000080000000 mask 0x0000007ff0000000 type 1
MTRR: 3 base 0x0000000090000000 mask 0x0000007ff0000000 type 0
MTRR: 4 base 0x00000000a0000000 mask 0x0000007fe0000000 type 0
MTRR: 5 base 0x00000000c0000000 mask 0x0000007fc0000000 type 0
MTRR: Fixed MSR 0x250 0x0606060606060606
MTRR: Fixed MSR 0x258 0x0606060606060606
MTRR: Fixed MSR 0x259 0x0000000000000000
MTRR: Fixed MSR 0x268 0x0606060606060606
MTRR: Fixed MSR 0x269 0x0606060606060606
MTRR: Fixed MSR 0x26a 0x0606060606060606
MTRR: Fixed MSR 0x26b 0x0606060606060606
MTRR: Fixed MSR 0x26c 0x0606060606060606
MTRR: Fixed MSR 0x26d 0x0606060606060606
MTRR: Fixed MSR 0x26e 0x0606060606060606
MTRR: Fixed MSR 0x26f 0x0606060606060606
MTRR: Fixed MSR 0x250 0x0606060606060606
MTRR: Fixed MSR 0x250 0x0606060606060606
MTRR: Fixed MSR 0x258 0x0606060606060606
MTRR: Fixed MSR 0x259 0x0000000000000000
MTRR: Fixed MSR 0x268 0x0606060606060606
MTRR: Fixed MSR 0x269 0x0606060606060606
MTRR: Fixed MSR 0x26a 0x0606060606060606
MTRR: Fixed MSR 0x26b 0x0606060606060606
MTRR: Fixed MSR 0x26c 0x0606060606060606
MTRR: Fixed MSR 0x26d 0x0606060606060606
MTRR: Fixed MSR 0x26e 0x0606060606060606
MTRR: Fixed MSR 0x26f 0x0606060606060606
MTRR: Fixed MSR 0x258 0x0606060606060606
call enable_fixed_mtrr()
MTRR: Fixed MSR 0x259 0x0000000000000000
MTRR: Fixed MSR 0x268 0x0606060606060606
MTRR: Fixed MSR 0x269 0x0606060606060606
MTRR: Fixed MSR 0x26a 0x0606060606060606
MTRR: Fixed MSR 0x26b 0x0606060606060606
MTRR: Fixed MSR 0x26c 0x0606060606060606
MTRR: Fixed MSR 0x26d 0x0606060606060606
MTRR: Fixed MSR 0x26e 0x0606060606060606
MTRR: Fixed MSR 0x26f 0x0606060606060606
CPU physical address size: 39 bits
call enable_fixed_mtrr()
MTRR: Fixed MSR 0x250 0x0606060606060606
call enable_fixed_mtrr()
MTRR: Fixed MSR 0x258 0x0606060606060606
MTRR: Fixed MSR 0x259 0x0000000000000000
MTRR: Fixed MSR 0x268 0x0606060606060606
MTRR: Fixed MSR 0x269 0x0606060606060606
MTRR: Fixed MSR 0x26a 0x0606060606060606
MTRR: Fixed MSR 0x26b 0x0606060606060606
MTRR: Fixed MSR 0x26c 0x0606060606060606
MTRR: Fixed MSR 0x26d 0x0606060606060606
MTRR: Fixed MSR 0x26e 0x0606060606060606
MTRR: Fixed MSR 0x26f 0x0606060606060606
CPU physical address size: 39 bits
call enable_fixed_mtrr()

MTRR check
Fixed MTRRs   : Enabled
Variable MTRRs: Enabled

MTRR: Fixed MSR 0x250 0x0606060606060606
POST: 0x93
MTRR: Fixed MSR 0x258 0x0606060606060606
MTRR: Fixed MSR 0x259 0x0000000000000000
MTRR: Fixed MSR 0x268 0x0606060606060606
MTRR: Fixed MSR 0x269 0x0606060606060606
MTRR: Fixed MSR 0x26a 0x0606060606060606
MTRR: Fixed MSR 0x26b 0x0606060606060606
MTRR: Fixed MSR 0x26c 0x0606060606060606
MTRR: Fixed MSR 0x26d 0x0606060606060606
MTRR: Fixed MSR 0x26e 0x0606060606060606
MTRR: Fixed MSR 0x26f 0x0606060606060606
BS: BS_WRITE_TABLES exit times (exec / console): 213 / 151 ms
...

Is it expected for the MTRRs to look like this, or is something completely garbled?
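For reading the "Fixed MSR" lines in the log: each fixed-range MTRR holds eight one-byte memory types, with the lowest address in the least significant byte. The sketch below is not coreboot code. The names fixed_range, fixed_mtrr_range(), fixed_mtrr_type() and dump_fixed_mtrr() are made up for illustration, and the MSR-to-address mapping is the fixed-range MTRR layout as described in the Intel SDM. It only decodes values copied from a log and never touches an MSR.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Address window of one fixed-range MTRR: first byte and size per entry. */
struct fixed_range {
	uint32_t start;
	uint32_t step;
};

/* Map a fixed-range MTRR MSR number to its window. Returns 0 on success,
 * -1 if the MSR number is not a fixed-range MTRR. */
static int fixed_mtrr_range(uint32_t msr, struct fixed_range *out)
{
	if (msr == 0x250) {			/* 8 x 64 KiB from 0x00000 */
		out->start = 0x00000;
		out->step = 0x10000;
		return 0;
	}
	if (msr == 0x258 || msr == 0x259) {	/* 8 x 16 KiB from 0x80000 / 0xa0000 */
		out->start = (msr == 0x258) ? 0x80000 : 0xa0000;
		out->step = 0x4000;
		return 0;
	}
	if (msr >= 0x268 && msr <= 0x26f) {	/* 8 x 4 KiB each, from 0xc0000 */
		out->start = 0xc0000 + (msr - 0x268) * 0x8000;
		out->step = 0x1000;
		return 0;
	}
	return -1;
}

/* Memory type of entry 'index' (0 = lowest address): 0 is UC, 6 is WB. */
static unsigned int fixed_mtrr_type(uint64_t value, unsigned int index)
{
	assert(index < 8);
	return (unsigned int)((value >> (8 * index)) & 0xff);
}

/* Print the eight sub-ranges described by one "Fixed MSR" log line. */
static void dump_fixed_mtrr(uint32_t msr, uint64_t value)
{
	struct fixed_range r;

	if (fixed_mtrr_range(msr, &r) != 0) {
		printf("MSR 0x%" PRIx32 " is not a fixed-range MTRR\n", msr);
		return;
	}
	for (unsigned int i = 0; i < 8; i++) {
		uint32_t lo = r.start + i * r.step;

		printf("0x%05" PRIx32 " - 0x%05" PRIx32 " type %u\n",
		       lo, lo + r.step - 1, fixed_mtrr_type(value, i));
	}
}
```

Under these assumptions, dump_fixed_mtrr(0x259, 0) lists the eight 16 KiB blocks from 0xa0000 to 0xbffff as type 0 (UC), which matches the 0xa0000 - 0xc0000 type 0 entry in the address space listing. All other fixed MSRs in the log decode to type 6 (WB).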

The full log with MTRR debug info and maximum log level is attached.

Sorry for playing a blind guessing game here, the upstreaming of sources is still not solved on our side.

Thanks for any clues.

Regards, Jan

Arthur Heymans

6:02 p.m.

That MTRR setup looks suboptimal for sure, but not fatally flawed. What's located from 0x77000000 to 0x80000000? I suspect it's just DRAM, but maybe allocated for different purposes like TSEG, GFX stolen memory, etc. If you mark it as such during resource allocation, the MTRR solution will be more optimised (see soc/intel/common/block/systemagent/systemagent.c).
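To check how the six variable MTRRs in the log line up with the "Physical address space" listing, each base/mask pair can be turned back into a range. A minimal sketch, assuming a contiguous mask and the 39-bit physical address size reported in the log; var_mtrr, var_mtrr_size(), mtrr_type_name() and dump_var_mtrrs() are hypothetical helper names, not coreboot functions:

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One variable-range MTRR as printed in the log: base, mask, type. */
struct var_mtrr {
	uint64_t base;
	uint64_t mask;
	unsigned int type;
};

/* Values copied from the "MTRR: N base ... mask ... type ..." log lines. */
static const struct var_mtrr log_mtrrs[] = {
	{ 0x0000000077000000ULL, 0x0000007fff000000ULL, 0 },
	{ 0x0000000078000000ULL, 0x0000007ff8000000ULL, 0 },
	{ 0x0000000080000000ULL, 0x0000007ff0000000ULL, 1 },
	{ 0x0000000090000000ULL, 0x0000007ff0000000ULL, 0 },
	{ 0x00000000a0000000ULL, 0x0000007fe0000000ULL, 0 },
	{ 0x00000000c0000000ULL, 0x0000007fc0000000ULL, 0 },
};

/* x86 memory type encodings used by MTRRs. */
static const char *mtrr_type_name(unsigned int type)
{
	switch (type) {
	case 0: return "UC";
	case 1: return "WC";
	case 4: return "WT";
	case 5: return "WP";
	case 6: return "WB";
	default: return "reserved";
	}
}

/* Size of the range selected by a contiguous mask: the mask bits that are
 * clear within the physical address width are the offset bits. */
static uint64_t var_mtrr_size(uint64_t mask, unsigned int phys_bits)
{
	uint64_t addr_mask;

	assert(phys_bits >= 12 && phys_bits <= 52);
	addr_mask = (1ULL << phys_bits) - 1;
	return (~mask & addr_mask) + 1;
}

/* Print each entry as "start - end size type", end being exclusive. */
static void dump_var_mtrrs(const struct var_mtrr *m, size_t count,
			   unsigned int phys_bits)
{
	for (size_t i = 0; i < count; i++) {
		uint64_t size = var_mtrr_size(m[i].mask, phys_bits);

		printf("MTRR %zu: 0x%010" PRIx64 " - 0x%010" PRIx64
		       " size 0x%09" PRIx64 " %s\n",
		       i, m[i].base, m[i].base + size, size,
		       mtrr_type_name(m[i].type));
	}
}
```

Calling dump_var_mtrrs(log_mtrrs, 6, 39) with the values from the log gives 0x77000000 - 0x80000000 as UC (split into a 16 MiB and a 128 MiB entry), 0x80000000 - 0x90000000 as WC, and 0x90000000 - 0x100000000 as UC in three entries. So the variable MTRRs cover exactly the type 0 and type 1 holes from the listing, on top of the WB default type.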

Kind regards

Arthur

On Mon, Oct 25, 2021 at 5:05 PM Samek, Jan jan.samek@siemens.com wrote:

...


Samek, Jan

27 Oct 27 Oct

5:30 p.m.

Hi Arthur,

Thanks for the insight, but sorry, this goes beyond my knowledge, so I cannot answer your question with any certainty. I am currently not able to wrap my head around this whole x86 MTRR and System Agent stuff.

At least I was able to boot (to some degree) a minimal Linux kernel / busybox payload, which might give a better overview of the memory layout. The logs are attached.

Sorry for the late reply.

Regards, Jan

________________________________________
From: Arthur Heymans arthur@aheymans.xyz
Sent: 25 October 2021 18:02
To: Samek, Jan (ADV D EU CZ AE AC 7)
Cc: coreboot@coreboot.org; Zeh, Werner (DI MC MTS SP HW 1); Michal Zygowski; naresh.solanki.2011@gmail.com
Subject: Re: [coreboot] Re: TigerLake RVP TCSS init failure


Michał Żygowski

24 Aug 24 Aug

12:33 p.m.

Hello Jan,

On 24.08.2021 11:08, Samek, Jan wrote:

...

Hello again Michal, I'd like to additionally ask you about a small detail regarding the issue: which stepping started to work?

According to the bug I filed, the problems went away with B0 stepping: https://bugzilla.tianocore.org/show_bug.cgi?id=3219 I have been advised to use Bx stepping (so any other than Ax, which is an engineering sample). Some more details are on the FSP GitHub as well: https://github.com/intel/FSP/issues/63

...

I'm currently encountering this behavior on B-0. I did have some really bad memory init errors on A-0, which was considered an engineering sample, and swapping it for B-0 solved those. Nevertheless, this MCE issue still persists on B-0 in my case.

Thanks for info.

Regards, Jan Samek Siemens, s.r.o. ADV D EU CZ AE AC 7 jan.samek@siemens.com _______________________________________________ coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org

Best regards

-- Michał Żygowski Firmware Engineer GPG: 6B5BA214D21FCEB2 https://3mdeb.com | @3mdeb_com


participants (6)

Arthur Heymans
Guillermo Placencia
Michal Zygowski
Michał Żygowski
Naresh G. Solanki
Samek, Jan