Christian Gmeiner wrote:
Most of the time the system works as expected, but from time to time rebooting the system fails completely.
Only ever when rebooting, or does cold boot also fail sometimes?
(Make a test system to cold boot your system in a loop.)
there are two FPGAs connected via PCIe to the system where one is used to reset the system. The reset is done via SYS_RESET#.
Are the FPGAs also reset by that?
If yes, how long do they need to initialize to where HDL acts correctly on PCIe?
If no, how long do they need to move from resetting the system to where HDL acts correctly on PCIe for the newly reset platform?
Now I run into different kinds of issues:
- PCIe link training fails from time to time
On both links, or only one of them? Can you tell?
- it looks like PLTRST# of the Sunrise Point PCH holds the system in reset for minutes.
I don't know if there's enough PCH documentation to know exactly why it would do that, but I can imagine that it holds reset as long as some conditions are not met. I can also imagine that the FPGAs cause some undefined PCIe behavior in the PCH which happens to get it stuck in reset for a while.
Are there any hints to debug my issues?
As always, try to isolate the problem.
Can you completely remove one or ideally both FPGAs from the equation?
You mention that one of them is used for reset. At least disable the other one, destructively if need be.
//Peter
On Tue, 25 Sep 2018 at 12:27, Peter Stuge peter@stuge.se wrote:
Christian Gmeiner wrote:
Most of the time the system works as expected, but from time to time rebooting the system fails completely.
Only ever when rebooting, or does cold boot also fail sometimes?
Both fail.
(Make a test system to cold boot your system in a loop.)
there are two FPGAs connected via PCIe to the system where one is used to reset the system. The reset is done via SYS_RESET#.
Are the FPGAs also reset by that?
Yes
If yes, how long do they need to initialize to where HDL acts correctly on PCIe?
I would need to measure this, but I would say less than 300 ms.
Now I run into different kinds of issues:
- PCIe link training fails from time to time
On both links, or only one of them? Can you tell?
This is hard to tell. All I see is that the link gets reset quite fast and quite often.
- it looks like PLTRST# of the Sunrise Point PCH holds the system in reset for minutes.
I don't know if there's enough PCH documentation to know exactly why it would do that, but I can imagine that it holds reset as long as some conditions are not met. I can also imagine that the FPGAs cause some undefined PCIe behavior in the PCH which happens to get it stuck in reset for a while.
Are there any hints to debug my issues?
As always, try to isolate the problem.
Can you completely remove one or ideally both FPGAs from the equation?
You mention that one of them is used for reset. At least disable the other one, destructively if need be.
Over the last few weeks I found the root cause of my problem: PCIe spread spectrum clocking.
Our FPGAs need a stable 100 MHz PCIe clock to work. The FSP configuration we used looked like this:
void mainboard_memory_init_params(FSPM_UPD *mupd)
{
	FSP_M_CONFIG *mem_cfg;
	struct spd_block blk = {
		.addr_map = { 0x50 },
	};

	mem_cfg = &mupd->FspmConfig;

	mem_cfg->PegDisableSpreadSpectrumClocking = 1;
	mem_cfg->PchPmPciePllSsc = 0;

	...
}
With this configuration the PCIe reference clock was off by more than 8%, which caused the system to hang during cold and warm boots.
In the next step I removed the assignment of PchPmPciePllSsc, as it is documented as 'No BIOS override'. With this change I got more than 1000 soft and 2000 hard reboots without any problem. Keep in mind that we started with only 10 successful reboots.
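For reference, the change described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual board code: the two struct definitions are stand-ins for the real FSP headers and model only the two fields named in this mail, and the SPD setup and everything hidden behind the "..." in the original function are left out.

```c
#include <stdint.h>

/*
 * Stand-in types (assumption): the real FSP_M_CONFIG and FSPM_UPD come
 * from the FSP headers and carry many more fields. Only the two fields
 * discussed in this mail are modelled here.
 */
typedef struct {
	uint8_t PegDisableSpreadSpectrumClocking;
	uint8_t PchPmPciePllSsc;
} FSP_M_CONFIG;

typedef struct {
	FSP_M_CONFIG FspmConfig;
} FSPM_UPD;

void mainboard_memory_init_params(FSPM_UPD *mupd)
{
	FSP_M_CONFIG *mem_cfg = &mupd->FspmConfig;

	mem_cfg->PegDisableSpreadSpectrumClocking = 1;

	/*
	 * PchPmPciePllSsc is deliberately not assigned any more, so
	 * whatever value the UPD already holds is passed through to FSP
	 * unchanged.
	 */
}
```

The only difference from the failing configuration is that the function no longer writes to PchPmPciePllSsc.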
The big problem is that PegDisableSpreadSpectrumClocking has no effect at all. I measured the frequency, and it is not the 100 MHz I expected. I need a stable 100 MHz because this clock source is used by the FPGA to drive its internal clocks. The end result is that EtherCAT is not able to sync.
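To put a measured frequency into perspective, a small helper like the one below converts it into a deviation from 100 MHz in parts per million. This is only a sketch: the function and macro names are made up for illustration, and the 300 ppm limit is the reference clock accuracy usually quoted for PCIe with spread spectrum disabled, taken here as an assumption and not as the requirement of our FPGA design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Nominal PCIe reference clock frequency in Hz. */
#define REFCLK_NOMINAL_HZ	100000000LL
/* Assumed tolerance in ppm (hypothetical limit, adjust to the design). */
#define REFCLK_TOLERANCE_PPM	300LL

/* Signed deviation of the measured clock from 100 MHz, in ppm.
 * Integer division truncates toward zero. */
int64_t refclk_error_ppm(int64_t measured_hz)
{
	return (measured_hz - REFCLK_NOMINAL_HZ) * 1000000LL / REFCLK_NOMINAL_HZ;
}

/* True if the measured clock lies within the assumed tolerance band. */
bool refclk_within_tolerance(int64_t measured_hz)
{
	int64_t ppm = refclk_error_ppm(measured_hz);

	return ppm >= -REFCLK_TOLERANCE_PPM && ppm <= REFCLK_TOLERANCE_PPM;
}
```

A clock that is 8% high, i.e. 108 MHz, comes out as 80000 ppm, which is far outside such a band.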
I tried to change the ICC config via MEI messaging, but I am not able to change the clock settings, even with a successful return code.