Hey all,
On certain hardware, or under certain conditions, an HT link can come up
with the LinkFail and CrcError bits set on some devices, even though
the bus isn't *currently* in an error state. This causes
'hypertransport_scan_chain()' to stop traversing the chain. The
following patch clears the error bits and re-reads the register to
tell whether the error is transient. It also reports the error rather
than silently aborting the chain scan; the silent abort cost me about
6 hours of hunting:
*****BEGIN CUT*****
Index: hypertransport.c
===================================================================
--- hypertransport.c (revision 2064)
+++ hypertransport.c (working copy)
@@ -345,12 +345,25 @@
 		/* Wait until the link initialization is complete */
 		do {
 			ctrl = pci_read_config16(prev.dev, prev.pos + prev.ctrl_off);
-			/* Is this the end of the hypertransport chain?
-			 * Has the link failed?
-			 * If so further scanning is pointless.
-			 */
-			if (ctrl & ((1 << 6) | (1 << 4))) {
-				goto end_of_chain;
+
+			if (ctrl & (1 << 6))
+				goto end_of_chain; // End of chain
+
+			if (ctrl & ((1 << 4) | (1 << 8))) {
+				/*
+				 * Either the link has failed, or we have
+				 * a CRC error.
+				 * Sometimes this can happen due to link
+				 * retrain, so let's knock it down and see
+				 * if it's transient
+				 */
+				ctrl |= ((1 << 4) | (1 << 8)); // Link fail + Crc
+				pci_write_config16(prev.dev, prev.pos + prev.ctrl_off, ctrl);
+				ctrl = pci_read_config16(prev.dev, prev.pos + prev.ctrl_off);
+				if (ctrl & ((1 << 4) | (1 << 8))) {
+					printk_alert("Detected error on Hypertransport Link\n");
+					goto end_of_chain;
+				}
 			}
 		} while((ctrl & (1 << 5)) == 0);
****END CUT*****
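
For anyone who wants to see the clear-and-recheck step in isolation, here is a minimal sketch in C against a simulated register. It is not the coreboot code: 'struct fake_link_ctrl', 'fake_read()', 'fake_write()' and 'ht_link_error_persists()' are hypothetical names made up for illustration, and the macro names are mine. The sketch assumes the LinkFail (bit 4) and CrcError (bit 8) bits are sticky and write-1-to-clear, which is why writing the register back with those bits set is what knocks them down.

```c
#include <stdbool.h>
#include <stdint.h>

/* Link Control register bits tested in the patch (macro names are illustrative). */
#define HT_CTRL_LINK_FAIL (1u << 4) /* assumed sticky, write 1 to clear */
#define HT_CTRL_INIT_DONE (1u << 5)
#define HT_CTRL_END_CHAIN (1u << 6)
#define HT_CTRL_CRC_ERR   (1u << 8) /* assumed sticky, write 1 to clear */
#define HT_CTRL_ERR_MASK  (HT_CTRL_LINK_FAIL | HT_CTRL_CRC_ERR)

/*
 * Hypothetical stand-in for the real register:
 *   latched    - bits currently held in the register (may be stale errors)
 *   persistent - error bits the "hardware" re-asserts on every read,
 *                modelling a link that really is broken
 */
struct fake_link_ctrl {
	uint16_t latched;
	uint16_t persistent;
};

static uint16_t fake_read(const struct fake_link_ctrl *reg)
{
	return (uint16_t)(reg->latched | reg->persistent);
}

static void fake_write(struct fake_link_ctrl *reg, uint16_t value)
{
	/* Write-1-to-clear: every error bit written as 1 is cleared. */
	uint16_t to_clear = (uint16_t)(value & HT_CTRL_ERR_MASK);

	reg->latched = (uint16_t)(reg->latched & ~to_clear);
}

/*
 * Returns true if LinkFail or CrcError is still set after one
 * clear-and-reread cycle, i.e. the error is not transient.
 */
static bool ht_link_error_persists(struct fake_link_ctrl *reg)
{
	uint16_t ctrl = fake_read(reg);

	if (!(ctrl & HT_CTRL_ERR_MASK))
		return false; /* nothing to clear */

	fake_write(reg, (uint16_t)(ctrl | HT_CTRL_ERR_MASK)); /* knock the bits down */
	ctrl = fake_read(reg); /* re-read to see whether they come back */

	return (ctrl & HT_CTRL_ERR_MASK) != 0;
}
```

A register that only holds a stale LinkFail from an earlier retrain reads clean after the write, so the scan can continue. One whose error bits come straight back is treated as a real failure, which is the case where the patch prints the alert and jumps to end_of_chain.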