Hey all,
It appears that in certain situations, or on certain hardware, HyperTransport (HT) can come up with the LinkFail and CrcError bits set on some devices, even though the bus isn't *currently* in an error state. This causes 'hypertransport_scan_chain()' to stop traversing the chain. I've made the following patch, which clears the error bits and re-reads the link control register to tell whether the error is transient or not. It also reports the error rather than silently aborting the chain scan, which cost me about 6 hours of hunting to find:
*****BEGIN CUT*****
Index: hypertransport.c
===================================================================
--- hypertransport.c	(revision 2064)
+++ hypertransport.c	(working copy)
@@ -345,12 +345,25 @@
 	/* Wait until the link initialization is complete */
 	do {
 		ctrl = pci_read_config16(prev.dev, prev.pos + prev.ctrl_off);
-		/* Is this the end of the hypertransport chain?
-		 * Has the link failed?
-		 * If so further scanning is pointless.
-		 */
-		if (ctrl & ((1 << 6) | (1 << 4))) {
-			goto end_of_chain;
+
+		if (ctrl & (1 << 6))
+			goto end_of_chain; // End of chain
+
+		if (ctrl & ((1 << 4) | (1 << 8))) {
+			/*
+			 * Either the link has failed, or we have
+			 * a CRC error.
+			 * Sometimes this can happen due to link
+			 * retrain, so let's knock it down and see
+			 * if it's transient
+			 */
+			ctrl |= ((1 << 4) | (1 << 8)); // Link fail + Crc
+			pci_write_config16(prev.dev, prev.pos + prev.ctrl_off, ctrl);
+			ctrl = pci_read_config16(prev.dev, prev.pos + prev.ctrl_off);
+			if (ctrl & ((1 << 4) | (1 << 8))) {
+				printk_alert("Detected error on Hypertransport Link\n");
+				goto end_of_chain;
+			}
 		}
 	} while((ctrl & (1 << 5)) == 0);
****END CUT*****
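
For anyone who wants to poke at the logic without hardware, here is a rough stand-alone sketch of the clear-and-re-read check. Everything in it is hypothetical and not part of the patch: 'struct fake_ht_link', 'fake_read16()', 'fake_write16()' and 'ht_link_error_persists()' are made-up names, and the model assumes the LinkFail and CrcError bits are write-1-to-clear, so please check that against the HT spec for your device. Only the bit positions (4, 5, 6, 8) are taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as used in the patch. */
#define HT_CTRL_LINK_FAIL (1u << 4)
#define HT_CTRL_INIT_DONE (1u << 5)
#define HT_CTRL_END_CHAIN (1u << 6)
#define HT_CTRL_CRC_ERR   (1u << 8)
#define HT_CTRL_ERR_MASK  (HT_CTRL_LINK_FAIL | HT_CTRL_CRC_ERR)

/* Hypothetical stand-in for one link control register. */
struct fake_ht_link {
	uint16_t ctrl;  /* current register contents */
	uint16_t stuck; /* error bits the fake hardware re-asserts immediately */
};

static uint16_t fake_read16(const struct fake_ht_link *link)
{
	return link->ctrl;
}

/*
 * Assumption: error bits are write-1-to-clear, every other bit is a
 * plain read/write bit. A "stuck" error bit comes straight back.
 */
static void fake_write16(struct fake_ht_link *link, uint16_t val)
{
	uint16_t plain = (uint16_t)(val & ~HT_CTRL_ERR_MASK);
	uint16_t errs = (uint16_t)(link->ctrl & HT_CTRL_ERR_MASK & ~val);

	link->ctrl = (uint16_t)(plain | errs | (link->stuck & HT_CTRL_ERR_MASK));
}

/*
 * Same idea as the patch: if an error bit is set, knock it down,
 * re-read, and only report a failure if the bit is still there.
 */
static bool ht_link_error_persists(struct fake_ht_link *link)
{
	uint16_t ctrl = fake_read16(link);

	if (!(ctrl & HT_CTRL_ERR_MASK))
		return false; /* nothing flagged at all */

	fake_write16(link, (uint16_t)(ctrl | HT_CTRL_ERR_MASK));
	ctrl = fake_read16(link);

	return (ctrl & HT_CTRL_ERR_MASK) != 0;
}
```

With 'stuck' left at 0, a register that starts with both error bits set comes back clean after one pass, which is the transient case the patch lets through. With 'stuck' set to the LinkFail bit, the error survives the clear and the scan would still stop, now with a message.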