Hi, I've got this ugly mct_d fatal exit error again on one of my H8QME-2+ boards. Even though every single board is absolutely identical (4 Opterons, 16 GB RAM, etc.), several boards boot and work without any problem with coreboot, while others don't even start and die with "mct_d: fatal exit" :(.
Does anyone have an idea what the problem could be?
Thanks, any comment would be appreciated.
Knut Kujat
Does it happen when you create the same configuration using SimNow?
Rudolf
On Friday 26 February 2010 14:41:49 you wrote:
Hi Knut
I've got this ugly mct_d fatal exit error again on one of my H8QME-2+ boards. Even though every single board is absolutely identical (4 Opterons, 16 GB RAM, etc.), several boards boot and work without any problem with coreboot, while others don't even start and die with "mct_d: fatal exit" :(.
Does anyone have an idea what the problem could be?
AFAIK the boxes are using engineering samples, so who knows. Have you tried swapping CPUs? Have you tried swapping the RAM? Does that happen with or without the HTX board?
Regards Christian
Hi,
It happens with and without the board :(. No, I haven't tried swapping CPUs or RAM. But this error appears during memory initialization, right? So it's most likely a RAM issue?
thx, Knut Kujat.
Knut Kujat wrote:
I haven't tried swapping CPUs or RAM. But this error appears during memory initialization, right? So it's most likely a RAM issue?
The memory controller is built into the CPU.
Try swapping components around and see if the problem follows some particular parts.
//Peter
Hello,
Switching memory from a working board to the failing board worked "partially": it now boots and even starts SeaBIOS, but SeaBIOS can't find the hard drive!! It's as if none were installed. I already swapped the hard drive with the one from the working board, with no result. Of course everything works fine with the vendor BIOS.
That's odd!
thx, Knut Kujat.
I "solved" it. There are 3 SATA cables connected to the board, but only 1 actually has a hard drive attached. It seems this cable has to be connected to SATA 1, before all the others. Is this right? Can someone confirm that, please?
Bye and THX, Knut Kujat.
Hello,
I'm still having trouble with "fatal exit", but now I can reproduce the error:
Let's say I have a board running the vendor BIOS and I flash coreboot.rom onto it with flashrom; so far so good. Now I shut the whole system down and turn it on again, and voilà, coreboot boots without problems. I can shut the system down 100 times and boot again with no trouble. Now I unplug the board for more than a minute and plug it back in, and coreboot is unable to find my installed memory and dies with "No Nodes?!" "mct_d: fatal exit". To make it boot with coreboot again, I first have to flash the vendor BIOS and boot it; then I can flash and boot coreboot again. That wouldn't be much trouble with 1 or 2 boards, but with more than 10...
I think there may be some kind of electrical issue, because I have a board that used to "fatal exit" down in the cluster, but up here in the lab it works fine without any "unplug it and then it stops working" issues. Is there any way to solve this problem? Maybe the RAM needs more time to stabilize before initialization?!
Any suggestions?
Thanks, Knut Kujat.
Hi,
This points to something that is powered from the 5VSB rail. It could be a GPIO setting that sets the RAM voltage through some other chip. It could also be a power-sequencing pin wired up as a GPIO, or an I2C bus multiplexer operated by a GPIO pin ;)
I would suggest dumping the Super I/O chip with "isadump" (all logical devices), plus all registers powered from the 5VSB well, if known. Check for changes on the GPIO pins or in the Super I/O global config.
Check whether the failure is caused by missing SPD EEPROMs (failed SMBus reads) or by the RAM itself.
It could also be something in the southbridge itself, but try the Super I/O first.
Then compare the dumps with what you obtain under coreboot (you will need to program that). You can take a dump from Linux with the legacy BIOS, then after booting coreboot, and then after booting with the power unplugged.
Good luck,
Rudolf
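The dump comparison suggested above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not coreboot code: `diff_regs` and `struct reg_diff` are hypothetical names, and each dump is assumed to be a 256-byte array that was captured by some other means (for example, copied from isadump output).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DUMP_SIZE 256 /* assumption: one byte per register index 0x00..0xff */

/* One register whose value differs between the two dumps. */
struct reg_diff {
	uint8_t index;
	uint8_t good; /* value seen on a boot that works (e.g. vendor BIOS) */
	uint8_t bad;  /* value seen on the failing boot */
};

/*
 * Compare two register dumps and record the registers that differ.
 * Writes at most max entries to out and returns the total number of
 * differences found, which may be larger than max.
 */
size_t diff_regs(const uint8_t *good, const uint8_t *bad,
		 struct reg_diff *out, size_t max)
{
	size_t found = 0;

	assert(good != NULL && bad != NULL);
	assert(out != NULL || max == 0);

	for (size_t i = 0; i < DUMP_SIZE; i++) {
		if (good[i] == bad[i])
			continue;
		if (found < max) {
			out[found].index = (uint8_t)i;
			out[found].good = good[i];
			out[found].bad = bad[i];
		}
		found++;
	}
	return found;
}
```

Any register reported here is only a candidate: some registers (status bits, counters) change between boots for unrelated reasons, so the differences still have to be checked against the chip's datasheet.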
Hi,
I added output of the status from status = mctRead_SPD(smbaddr, Index); in mct_d.c, and it only prints -1, while on the working coreboot machine it gives me several numbers until index = 64 for those DIMM slots where RAM is installed. Is this the possible missing-SPD-EEPROM error you pointed out? What would my next steps be if so?
Thanks for your effort, Knut Kujat.
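The check described above (print the status of every SPD read and see whether anything other than -1 comes back) can be written as a small classifier. This is a sketch, not the code in mct_d.c: `spd_read_fn`, `spd_probe` and the `demo_read_*` functions are hypothetical names, and the only assumption taken from the thread is that a negative status means the read failed.

```c
#include <assert.h>
#include <stddef.h>

#define SPD_PROBE_BYTES 64 /* number of SPD bytes probed per slot (assumption) */

/* Hypothetical reader: returns the byte (0..255) or a negative value on error. */
typedef int (*spd_read_fn)(unsigned int smbaddr, unsigned int index);

enum spd_state {
	SPD_OK,      /* every byte could be read */
	SPD_PARTIAL, /* some reads failed: flaky bus or device */
	SPD_ABSENT   /* every read failed: empty slot or unreachable EEPROM */
};

/* Read the first SPD_PROBE_BYTES bytes and classify the slot. */
enum spd_state spd_probe(spd_read_fn rd, unsigned int smbaddr)
{
	unsigned int failed = 0;

	assert(rd != NULL);

	for (unsigned int i = 0; i < SPD_PROBE_BYTES; i++) {
		if (rd(smbaddr, i) < 0)
			failed++;
	}
	if (failed == 0)
		return SPD_OK;
	return failed == SPD_PROBE_BYTES ? SPD_ABSENT : SPD_PARTIAL;
}

/* Stand-in readers so the sketch can be tried without hardware. */
int demo_read_ok(unsigned int smbaddr, unsigned int index)
{
	(void)smbaddr;
	return (int)(index & 0xff);
}

int demo_read_fail(unsigned int smbaddr, unsigned int index)
{
	(void)smbaddr;
	(void)index;
	return -1;
}

int demo_read_flaky(unsigned int smbaddr, unsigned int index)
{
	(void)smbaddr;
	return (index % 2) ? -1 : 0;
}
```

If every slot reports `SPD_ABSENT` even though DIMMs are installed, the modules themselves are an unlikely cause and the path to the EEPROMs becomes the suspect.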
Hi,
Yes, this points to an I2C multiplexer device. You need to find out how to control the multiplexer. It might be some GPIO setup or even some I2C device. Try superiotool in verbose mode to see how the GPIO is set up. You will either need to load the GPIO settings (as reported by superiotool) in coreboot before RAM init, or just dump them and check for differences first.
In Linux, the output of i2cdetect 0 might also help...
Try running sensors-detect; it might detect the bus multiplexers.
Rudolf
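Loading known-good settings before RAM init usually comes down to replaying a table of index/value pairs. The following is a minimal sketch of that idea only: `struct reg_setting`, `apply_settings`, `demo_write` and `demo_regs` are hypothetical names, the write callback stands in for whatever Super I/O or GPIO access routine the board code provides, and the values in `example_settings` are made up for illustration, not real H8QME-2+ settings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One register write to replay. */
struct reg_setting {
	uint8_t index;
	uint8_t value;
};

/* Hypothetical write routine supplied by the caller. */
typedef void (*reg_write_fn)(uint8_t index, uint8_t value);

/* Made-up values; real ones would come from a dump taken under the vendor BIOS. */
const struct reg_setting example_settings[] = {
	{ 0x10, 0xa5 },
	{ 0x11, 0x01 },
};

/* Replay the table in order and return the number of writes issued. */
size_t apply_settings(const struct reg_setting *tbl, size_t count,
		      reg_write_fn wr)
{
	assert(wr != NULL);
	assert(tbl != NULL || count == 0);

	for (size_t i = 0; i < count; i++)
		wr(tbl[i].index, tbl[i].value);
	return count;
}

/* Fake register file so the sketch can be tried without hardware. */
uint8_t demo_regs[256];

void demo_write(uint8_t index, uint8_t value)
{
	demo_regs[index] = value;
}
```

Keeping the settings in a table rather than in a sequence of hand-written calls makes it easier to paste in values from a dump and to compare them against a later dump.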
Just FYI:
On our first system with Arima boards in 2002, everything worked well until we started booting 64-bit kernels. I'm not kidding. We did not find the SMBUS MUX on the boards until we had unreliable coreboot boots of 64-bit kernels. For quite some time the boards worked fine. Ollie found the SMBUS MUX by examining schematics.
So the SMBUS mux can appear in strange ways, at strange times. This sounds like one of those times. SMBUS muxes are more common than you might think and the default power-on state is not always very well determined.
ron
Hello,
Thanks to all of you for your comments. Here is a little update :)
I now know why the boards worked just fine up here in my lab. To check whether a board would work after being unplugged, I always unplugged "only" the power cable but never the monitor attached to the board. I figured out that the monitor provides enough juice to keep something on the board alive, so after plugging the power cable back in, coreboot started fine. Another thing I figured out is that the front LEDs of the board seem to be managed by GPIO as well; is this right? If so, something seems to be wrong with the GPIO setup, because the power-on LED never works with coreboot.
thx, Knut Kujat.
Hi,
I now know that my issue must be related to the SMBus registers: on a machine running the vendor BIOS, i2cdetect and i2cdump show several values for the different I2C devices detected, and I get the same values when coreboot starts successfully. But when coreboot fails with "mct_d: fatal exit", those registers are blank. I know that because I found a nice piece of code that dumps the SMBus registers on the h8dme board :D Thanks to the author!!
I also know that reading these registers out may cause them to get lost! I'm not sure why?!
Now my question is: how do I initialize these registers with the values known from the vendor BIOS? smb_write_byte doesn't seem to work, or maybe I'm using it wrong.
THX, Knut Kujat.
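The observation above (devices that answer under the vendor BIOS are gone on a failing coreboot boot) can be made systematic by recording which addresses answer in each case and listing the ones that disappeared. This is a sketch under assumptions: `i2c_probe_fn`, `scan_bus`, `count_missing` and the `demo_probe_*` functions are hypothetical, the demo address sets are invented, and only the non-reserved 7-bit address range 0x08..0x77 is scanned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define I2C_ADDR_FIRST 0x08 /* 0x00..0x07 are reserved addresses */
#define I2C_ADDR_LAST  0x77 /* 0x78..0x7f are reserved addresses */
#define I2C_ADDR_COUNT 128

/* Hypothetical probe: returns true if a device acknowledges at addr. */
typedef bool (*i2c_probe_fn)(unsigned int addr);

/* Fill present[] (I2C_ADDR_COUNT entries) with the addresses that answer. */
void scan_bus(i2c_probe_fn probe, bool *present)
{
	assert(probe != NULL && present != NULL);

	for (unsigned int a = 0; a < I2C_ADDR_COUNT; a++)
		present[a] = a >= I2C_ADDR_FIRST && a <= I2C_ADDR_LAST && probe(a);
}

/* Count devices seen in the good scan that are absent from the bad one. */
size_t count_missing(const bool *good, const bool *bad)
{
	size_t missing = 0;

	assert(good != NULL && bad != NULL);

	for (unsigned int a = 0; a < I2C_ADDR_COUNT; a++) {
		if (good[a] && !bad[a])
			missing++;
	}
	return missing;
}

/* Invented example buses so the sketch can be tried without hardware. */
bool demo_probe_vendor(unsigned int addr)
{
	return addr == 0x2f || (addr >= 0x50 && addr <= 0x53);
}

bool demo_probe_failing(unsigned int addr)
{
	return addr == 0x2f;
}
```

If the missing addresses all fall in 0x50..0x57, the range conventionally used by SPD EEPROMs, that is consistent with the EEPROMs being unreachable rather than with a single bad module.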
Dear Knut,
On Wednesday, 10.03.2010, at 17:26 +0100, Knut Kujat wrote:
[…]
I know that because I found a nice piece of code dumping SMBus registers on the h8dme board :D Thanks to the author!!
Is it possible to share that piece of code?
[…]
Anyway, nice to hear you are making some progress, and good luck with the next steps!
Thanks,
Paul
Hi,
You can find the function in src/mainboard/supermicro/h8dme/romstage.c, line 101: static void dump_smbus_registers(void).
bye! Knut Kujat
On Wed, Mar 10, 2010 at 05:26:47PM +0100, Knut Kujat wrote:
I know that because I found a nice piece of code dumping SMBus registers on the h8dme board :D Thanks to the author!!
That would have been Marc Jones :)
Thanks, Ward.
There is a multiplexer on the SMBus; this confirms my theory. Please check the GPIOs.
Imagine the multiplexer acting as a kind of rail switch: the transactions on the SMBus never reach the memory modules' SPD EEPROMs. You need to find the pin that controls the multiplexer.
Rudolf
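The rail-switch picture can be modelled in a few lines. This is a toy model only, not a description of the actual board: `struct smbus_mux` and `mux_read_spd` are hypothetical, a single select pin is assumed, and which pin level routes the bus to the DIMMs is an assumption as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one SPD EEPROM sitting behind an SMBus multiplexer. */
struct smbus_mux {
	bool select_pin; /* hypothetical GPIO: true routes the bus to the EEPROM */
	uint8_t spd[64]; /* contents of the EEPROM behind the mux */
};

/*
 * Read one SPD byte through the mux. Returns the byte, or -1 when the
 * switch points elsewhere (the EEPROM never sees the transaction) or
 * when index is out of range.
 */
int mux_read_spd(const struct smbus_mux *mux, unsigned int index)
{
	assert(mux != NULL);

	if (!mux->select_pin || index >= sizeof(mux->spd))
		return -1;
	return mux->spd[index];
}
```

In this model the EEPROM contents never change; only the select pin decides whether a read returns data or -1, which is the same pattern as a board that reads its SPD fine on one boot and returns only -1 on the next.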
Thanks! Because of your hints I was able to figure out that I needed to set up the spd_rom in romstage.c. I also added the GPIO settings as read from the vendor BIOS, and now the power-on LED works :).
thx, Knut Kujat.