Hi all,
On Librem Mini v2, rebooting, then suspending and resuming fails to resume.
I've tracked this down to a change in the TOLUM returned by FSP, which causes failures to find important cbmem regions during S3 resume. (I've run into problems relating to the TOLUM change before: https://puri.sm/posts/how-we-fixed-reboot-loops-on-the-librem-mini/.) This doesn't seem to happen on all CML boards but has always happened on Mini v2 for whatever reason. I have some ideas to address it, but I'm not sure which is best.
For example:
* Cold boot: cbmem_top() = 0x99fff000
* Reboot: cbmem_top() = 0x9a000000 (4K later, FSP seems to reserve 4K less memory for itself on reboot)
* Resume after reboot: cbmem_top() = 0x99fff000 (will not be able to find cbmem from reboot, not sure if the upper 4 KB have been overwritten by FSP)
coreboot needs to find cbmem (the imd structures) at TOLUM on resume, so failures cascade from there. (Happy to describe in detail but I don't think anything after this point is the root issue.)
It seems pretty unlikely that FSP is going to change, so coreboot probably has to tolerate this TOLUM change :-/ I thought of a few approaches, but I'm not sure which is best. They mostly trade off putting more holes in the memory map versus wasting a bit of memory on buffer space:
* Put the imd structures below the FSP reserved memory with some buffer space?
* Put the imd structures somewhere else entirely, like toward the beginning of the available low memory instead of the end?
* Ask FSP to reserve more than 8 KB for some buffer in case TOLUM changes on resume, so the imd structures are still there?
I could guard any of this with a Kconfig selected only by affected boards if needed. I appreciate any input!
Thanks, Jonathon
Hi Jonathon,
On 03.11.23 22:46, Jonathon Hall wrote:
> On Librem Mini v2, rebooting, then suspending and resuming fails to resume.
> I've tracked this down to a change in the TOLUM returned by FSP, which causes failures to find important cbmem regions during S3 resume. (I've run into problems relating to the TOLUM change before: https://puri.sm/posts/how-we-fixed-reboot-loops-on-the-librem-mini/.) This doesn't seem to happen on all CML boards but has always happened on Mini v2 for whatever reason. I have some ideas to address it, but I'm not sure which is best.
> For example:
> - Cold boot: cbmem_top() = 0x99fff000
> - Reboot: cbmem_top() = 0x9a000000 (4K later, FSP seems to reserve 4K less memory for itself on reboot)
> - Resume after reboot: cbmem_top() = 0x99fff000 (will not be able to find cbmem from reboot, not sure if the upper 4 KB have been overwritten by FSP)
I would love to blame FSP for this, but first we should make sure that it's not coreboot's fault. I assume FSP is free to change the allocation depending on its inputs. So it would be coreboot's job to ensure that these inputs don't change when resuming. Obviously UPDs handed to FSP-M shouldn't change. Have you confirmed that? (Maybe dump them or a checksum.) Otherwise, the hardware state could be different. I don't know any example for FSP-M, but generally FSP checking for the presence of a PCIe device, for instance, is imaginable. Then it could be bad timing. Maybe as a desperate last test, try a 200ms delay before jumping into FSP-M.
If it's not that simple, I think we should bug Intel to provide a complete list of all inputs that affect TOLUM.
> - Put the imd structures below the FSP reserved memory with some buffer space?
This would probably require additional hacks for coreboot to find things in the FSP reserved memory later. I'm not sure how invasive this would be. I can't remember right now what the reasons were to keep the IMD structures on top. But IIRC FSP was changed for this, so I bet there are good reasons. (Ironically, I believe not having to move them when the amount of data FSP-M spews changes was among them.)
> - Put the imd structures somewhere else entirely, like toward the beginning of the available low memory instead of the end?
This could conflict with payloads, and (legacy) bootloaders and OSs.
> - Ask FSP to reserve more than 8 KB for some buffer in case TOLUM changes on resume, so the imd structures are still there?
This was also one of my first thoughts. We may still have to jump through some hoops if the location of the FSP reserved things moves, though.
Nico
Thanks for the great suggestions Nico!
I'll check the UPDs between cold boot / cold boot resume / reboot / reboot resume, try some delays, etc., to see if I can identify anything relevant on the coreboot side that could impact this. It'd be great to find a solution there rather than having to kludge around it.
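Comparing the UPDs across the four boot paths could be sketched roughly as below. This is only an illustration under assumptions: `upd_checksum()` is a hypothetical helper, not an existing coreboot function, and FNV-1a is an arbitrary choice of hash (any checksum that is cheap to print on the console would do). The idea is to hash the raw FSP-M UPD buffer right before calling into FSP-M and compare the logged values between cold boot, reboot, and the two resume paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: 32-bit FNV-1a hash over a raw buffer.
 * In this sketch the buffer would be the FSP-M UPD structure as it is
 * about to be handed to FSP-M; the name and its use are assumptions.
 */
uint32_t upd_checksum(const void *upd, size_t len)
{
	const uint8_t *p = upd;
	uint32_t hash = 2166136261u;	/* FNV-1a 32-bit offset basis */

	assert(upd != NULL || len == 0);

	for (size_t i = 0; i < len; i++) {
		hash ^= p[i];
		hash *= 16777619u;	/* FNV-1a 32-bit prime */
	}
	return hash;
}
```

If the logged value differs between boot paths, the UPD contents changed and a full dump of the structure is the next step. If it is identical everywhere, the UPDs can be ruled out as the input that moves TOLUM.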
If I can't find a solution there, I'll try padding the FSP allocation, and we can see what that implementation looks like. I believe the IMD structs already have magic numbers, so I could probably look for those on 4K-aligned positions in the bootloader reserved area during resume without costing too much (and all of this could be Kconfig-controlled if needed).
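The magic-number search on 4K-aligned positions could look something like the following sketch. It is a simplified model, not coreboot's actual imd code: `EXAMPLE_ROOT_MAGIC` is a placeholder value, `find_cbmem_top()` is a hypothetical name, and the assumption that the magic sits in the four bytes directly below the candidate top is made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder magic for this sketch; not the real imd root pointer magic. */
#define EXAMPLE_ROOT_MAGIC 0x494d4421u

/*
 * Walk candidate "top" offsets downward through a reserved region in
 * steps of `step` bytes (4 KiB in the scenario discussed here) and
 * return the first offset that has the magic stored just below it.
 * Returns -1 if no candidate matches.
 */
ptrdiff_t find_cbmem_top(const uint8_t *region, size_t size, size_t step)
{
	assert(step >= sizeof(uint32_t));

	for (size_t top = size - size % step; top >= step; top -= step) {
		uint32_t magic;

		/* memcpy avoids unaligned or aliasing-unsafe reads. */
		memcpy(&magic, region + top - sizeof(magic), sizeof(magic));
		if (magic == EXAMPLE_ROOT_MAGIC)
			return (ptrdiff_t)top;
	}
	return -1;
}
```

With an 8 KB to 12 KB padded region this is at most a handful of 4-byte reads on the resume path, which is why the cost should be negligible.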
Hi Jonathon,
another thought occurred, memory is coming back a bit slowly. We used to have code in coreboot that tried to predict the TOLUM placement (instead of relying on the HOB information). Looking at that, the only thing that wasn't MiB aligned was a 4KiB space reserved for PTT (Platform Trust Tech, their firmware TPM, AIUI). There's some information about it in commit f5fe3590af9a (soc/intel/skylake: Usable dram top calculation based on HW registers).
This is tied to and controlled by the ME firmware, so checking its state before FSP-M runs might be worth a shot. There's now also a cse_enable_ptt() API in coreboot and related status code, but I'm not sure if that works pre-RAM. Beside hardware differences, the ME firmware and its settings might be what makes the difference for the Librem Mini.
Hope that helps, Nico
Wow, you're 100% right Nico. PTT was enabled in the ME firmware for both Mini v1 and v2; disabling it eliminates this problem. I had found a 4K reserved region in HOB that did not appear during reboot but couldn't figure out what it was.
We enable the HAP bit so we don't get the firmware TPM anyway. PTT probably should have been disabled but was overlooked.
Unfortunately, flashing this change via our existing methods would create problems for existing users: the system won't boot again until power is removed and reapplied. Probably whatever bit of the ME remains to handle power sequencing crashes when the ME region is overwritten.
I have another solution though that also addresses a few other problems - doing a full reset for ACPI reboot instead of a system reset solves this too. That's just adding FULL_RST to the FADT reset value, so not very invasive and I can select it just for these boards. This also addresses some reboot problems observed with the DP-HDMI converter and some specific SATA SSDs since it power cycles them during reboot. The only real trade-off is adding 5 seconds during reboot which is reasonable to solve all these issues.
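As a rough illustration of the FADT change, the sketch below composes the byte that the OS writes to the reset register on an ACPI reboot. The bit positions are an assumption based on the usual layout of the Intel reset control register at I/O port 0xCF9 and should be checked against the PCH datasheet and coreboot's own definitions; the `EX_` names and `fadt_reset_value()` are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed reset control bits (I/O port 0xCF9 on Intel platforms). */
#define EX_SYS_RST  (1u << 1)	/* hard reset rather than soft reset */
#define EX_RST_CPU  (1u << 2)	/* trigger the reset */
#define EX_FULL_RST (1u << 3)	/* also power cycle the platform */

/* Compose the value the OS writes to the FADT reset register. */
uint8_t fadt_reset_value(bool full_reset)
{
	uint8_t value = EX_RST_CPU | EX_SYS_RST;

	if (full_reset)
		value |= EX_FULL_RST;

	/* Bit 0 and the upper nibble are not used in this sketch. */
	assert((value & ~0x0eu) == 0);
	return value;
}
```

Under these assumptions a plain system reset is 0x06 and a full reset is 0x0e, so the board-specific change amounts to one extra bit in the reset value.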
Thanks a ton for all the advice Nico, this is invaluable information for this and future boards as well.