Hi I am trying to sort out a RAM issue with coreboot on a motherboard we have designed at my work.
The board uses the AMD LX800 and CS5536 companion chipset.
I have been using coreboot-v3 and based my board port on the Alix 1.C code. We have a SODIMM socket, so I used the initram.c from the amd/db800 code because it does SPD.
The symptoms are that sometimes a boot will die during disable_car() in arch/x86/geodelx/stage1.c:
    /* OK, here is the theory: we should be able to copy
     * the data back over itself, and the wbinvd should then
     * flush to memory. Let's see.
     */
    __asm__ __volatile__("cld; rep movsl" ::"D" (DCACHE_RAM_BASE), "S" (DCACHE_RAM_BASE), "c" (DCACHE_RAM_SIZE/4): "memory");
    __asm__ __volatile__ ("wbinvd\n");
Sometimes it boots fine and appears to be quite stable. If I run software like mprime95 to "Torture Test" the system, it doesn't fail.
However, there is one strange phenomenon that I've noticed. If I remove the RTC battery backup, Linux forces a fsck because the last boot time was in the future. If the fsck fixes a couple of errors and Linux reboots automatically after 5 seconds, it's almost guaranteed that coreboot will get stuck in disable_car() during that reboot. I don't know if this is useful, or just a coincidence. In my build I have also disabled the CMOS Option Table (CONFIG_OPTION_TABLE) in case something there might be causing the problem.
Initially I thought it was a hardware problem with our PCBs, but I have tested with a BIOS chip from an AMD reference design board (EmbeddedBIOS v5.3), which booted Linux without any problems, so I have concluded that my problem must be something I'm doing (or not doing) in coreboot.
I have checked all the SPD values against the datasheet for the Hynix RAM chips (256MiB 333MHz). I tried using the non-SPD method as in the Alix 1.C code and specified all the RAM parameters. I have also tried to follow sdram_enable() and other functions in northbridge/amd/geodelx/raminit.c to check against the AMD LX databook's section 6.1.3 BIOS Initialization Sequence, though I admit I could have easily missed something here.
I would be most grateful for any suggestions to help me work out what's going wrong.
Regards, Nathan
This is a new one. We've never seen it on the geodes we have.
It is possible your geodes are newer and there is something we are doing wrong for the part.
It is also possible that a "reset" from linux on coreboot is not "complete" enough, as it might be on the embedded bios. In my case I would start there.
ron
On Thu, Nov 5, 2009 at 9:33 AM, ron minnich rminnich@gmail.com wrote:
This is a new one. We've never seen it on the geodes we have.
It is possible your geodes are newer and there is something we are doing wrong for the part.
It is also possible that a "reset" from linux on coreboot is not "complete" enough, as it might be on the embedded bios. In my case I would start there.
Good point. The bios has a number of ways to cause a reset that coreboot doesn't have. I don't know how many different ways Linux will attempt a reset. If it always works when you hit the reset button, you will need to look at the software reset path.
When linux does the reset, is the coreboot output the same? Does it do the "Resetting the processor"?
A few things you can try: run a memory test prior to the wbinvd, and dump the MC registers to see if they are being set up differently in the failing case.
Marc
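As a rough illustration of the "memory test prior to the wbinvd" idea, here is a minimal sketch in C. The function name and the pattern constant are assumptions made for this example and are not part of coreboot. It writes an address-derived pattern, reads it back, then repeats with the inverted pattern so every bit is exercised in both states. It is destructive, so it should only be pointed at a range that holds nothing you still need.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a coreboot function).
 * Writes a per-word pattern, verifies it, then repeats with the
 * pattern inverted. Returns the index of the first word that reads
 * back wrong, or -1 if every word matched in both passes. */
static long ram_pattern_test(volatile uint32_t *base, size_t words)
{
    for (int pass = 0; pass < 2; pass++) {
        uint32_t flip = pass ? 0xFFFFFFFFu : 0u;

        /* Fill pass: the value depends on the index, so address
         * line faults tend to show up as mismatches. */
        for (size_t i = 0; i < words; i++)
            base[i] = ((uint32_t)i * 0x9E3779B1u) ^ flip;

        /* Verify pass. */
        for (size_t i = 0; i < words; i++)
            if (base[i] != (((uint32_t)i * 0x9E3779B1u) ^ flip))
                return (long)i;
    }
    return -1;
}
```

Called on a small window of DRAM right after sdram_enable(), a non-negative return value would suggest the controller is not yet set up correctly for that module, before the cache-as-RAM copy ever runs.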
On Thu, Nov 5, 2009 at 9:10 AM, Marc Jones marcj303@gmail.com wrote:
If it always works when you hit the reset button, you will need to look at the software reset path.
I like that test!
ron
Marc Jones wrote:
When linux does the reset, is the coreboot output the same? Does it do the "Resetting the processor"?
Yes, it does "Resetting the processor after PLL configuration for the changes to take effect"
I captured an example of my issue:
http://coreboot.pastebin.com/m7f5ed367
I made some more observations today:
From Linux, a reboot from the command line works fine. It only seems to die when a fsck check fails on boot and forces a reboot. The motherboard doesn't have an RTC backup battery at the moment, so to test I have been setting the clock in Linux, shutting down, removing the power supply for a few seconds, then booting up again. Because the default time is in 1999, Linux runs a fsck, which causes it to restart and die in coreboot.
Once coreboot crashes, a hardware reset doesn't fix it. Coreboot will always stop at the same point. Even removing power from the motherboard doesn't help. However, I did find that swapping the SODIMM for a different RAM module would let it boot. I know it doesn't sound very scientific but it's what appeared to happen.
Is it possible that coreboot or maybe SeaBIOS is using incorrect values from non-volatile ram?
Another observation I made was that by setting the debug_level to BIOS_CRIT, instead of dying at the usual spot in disable_car() and stopping, coreboot would reset continuously (cycling every 1-2 seconds)
Another issue that's partly related is the ability for coreboot to set the GeodeLink speed depending on the detected RAM speed. As a work-around, we are only using 333MHz SODIMMs and have set the bootstrap bits for GLCP_SYS_RSTPLL[7:1] (section 6.14.2.13 of LX databook) to 500MHz CPU, 333MHz GLIU instead of bypass mode. In bypass mode, the GLIU is 266MHz and some of our 333MHz RAM will fail in disable_car(). As a test, I have experimented with pll_reset(MANUALCONF, PLLMSRHI, PLLMSRLO) in initram.c in an attempt to change the GLIU to 333MHz. I probably didn't have the correct bits set, so even though I managed to set GLIU, it failed the last test (DLL) in sdram_enable() and would reset.
Nathan
On Fri, Nov 6, 2009 at 7:57 AM, Nathan Williams nathan@traverse.com.au wrote:
Another issue that's partly related is the ability for coreboot to set the GeodeLink speed depending on the detected RAM speed. As a work-around, we are only using 333MHz SODIMMs and have set the bootstrap bits for GLCP_SYS_RSTPLL[7:1] (section 6.14.2.13 of LX databook) to 500MHz CPU, 333MHz GLIU instead of bypass mode. In bypass mode, the GLIU is 266MHz and some of our 333MHz RAM will fail in disable_car(). As a test, I have experimented with pll_reset(MANUALCONF, PLLMSRHI, PLLMSRLO) in initram.c in an attempt to change the GLIU to 333MHz. I probably didn't have the correct bits set, so even though I managed to set GLIU, it failed the last test (DLL) in sdram_enable() and would reset.
Your second problem might explain the first. You should look closely at the detection problem. It depends on the reset and the state of the rstpll flags. There could be a corner case or something unusual going on. How did you set the boot strap bits with hardware (straps)? You should use pll_reset(ManualConf) settings to change it with hardware.
Marc
Marc Jones wrote:
On Fri, Nov 6, 2009 at 7:57 AM, Nathan Williams nathan@traverse.com.au wrote:
Another observation I made was that by setting the debug_level to BIOS_CRIT, instead of dying at the usual spot in disable_car() and stopping, coreboot would reset continuously (cycling every 1-2 seconds)
Since I needed to have a BIOS that didn't have much debugging enabled for a customer sample, I looked a bit deeper to find the cause of this continuous reset behaviour. Even changing the debug level from BIOS_SPEW to BIOS_DEBUG caused the reset. I tracked it down to a single printk and my attached patch means it works at BIOS_CRIT now, just with a few extra debug lines. Without the printk, the code gets to "missing phase4_read_resources" (just a few lines down from my patch) before restarting.
Your second problem might explain the first. You should look closely at the detection problem. It depends on the reset and the state of the rstpll flags. There could be a corner case or something unusual going on. How did you set the boot strap bits with hardware (straps)? You should use pll_reset(ManualConf) settings to change it with hardware.
Marc
Sorry, I should have explained that we set the bootstrap bits in hardware:
Bit 7: PW1 pad - active high when the PCI clock is 66 MHz, low for 33 MHz.
Bit 6: IRQ13 pad - active high for stall-on-reset debug feature, otherwise low.
Bit 5: PW0 pad - part of CPU/GLIU frequency selects.
Bit 4: SUSPA# pad - part of CPU/GLIU frequency selects.
Bit 3: GNT2# pad - part of CPU/GLIU frequency selects.
Bit 2: GNT1# pad - part of CPU/GLIU frequency selects.
Bit 1: GNT0# pad - part of CPU/GLIU frequency selects.
We have pulled these pins up or down to give "0010110", which corresponds to CPU 500MHz, GLIU 333MHz in table 6-87. This should also mean that on reset, the value of GLCP_SYS_RSTPLL should be 0000049C_0300182Ch (except that SWFLAGS (GLCP_SYS_RSTPLL[31:26]) is only reset to 0 on Power On Reset (POR)). So should I be using pll_reset(ManualConf)? I'll try it later today and see if I can get some debugging output.
Regards, Nathan
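To make the strap arithmetic above easier to check, here is a small C sketch that pulls the fields out of the low 32 bits of GLCP_SYS_RSTPLL. The bit positions are taken from the description in this message (bootstraps in [7:1], SWFLAGS in [31:26]); the struct and function names are made up for the example and are not coreboot identifiers.

```c
#include <stdint.h>

/* Hypothetical decoded view of the low dword of GLCP_SYS_RSTPLL. */
struct rstpll_fields {
    unsigned pci66;          /* bit 7: PW1 pad, 1 = 66 MHz PCI clock */
    unsigned stall_on_reset; /* bit 6: IRQ13 pad, debug stall feature */
    unsigned freq_select;    /* bits 5:1: CPU/GLIU frequency selects */
    unsigned swflags;        /* bits 31:26: only cleared on POR */
};

/* Split the low dword into the fields named in the thread. */
static struct rstpll_fields rstpll_decode(uint32_t lo)
{
    struct rstpll_fields f;

    f.pci66          = (lo >> 7) & 0x1u;
    f.stall_on_reset = (lo >> 6) & 0x1u;
    f.freq_select    = (lo >> 1) & 0x1Fu;
    f.swflags        = (lo >> 26) & 0x3Fu;
    return f;
}
```

Feeding it the expected reset value 0x0300182C gives pci66 = 0, stall_on_reset = 0, freq_select = 0x16 (binary 10110) and swflags = 0, which matches the "0010110" strap pattern. Printing these fields on each boot would show whether SWFLAGS survives the soft reset in the failing case.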
On Tue, Nov 10, 2009 at 1:26 PM, Nathan Williams nathan@traverse.com.au wrote:
Since I needed to have a BIOS that didn't have much debugging enabled for a customer sample, I looked a bit deeper to find the cause of this continuous reset behaviour. Even changing the debug level from BIOS_SPEW to BIOS_DEBUG caused the reset. I tracked it down to a single printk and my attached patch means it works at BIOS_CRIT now, just with a few extra debug lines. Without the printk, the code gets to "missing phase4_read_resources" (just a few lines down from my patch) before restarting.
This sounds like it is probably blowing the stack or the stack hits memory that isn't working correctly.
We have pulled these pins up or down to give "0010110", which corresponds to CPU 500MHz, GLIU 333MHz in table 6-87. This should also mean that on reset, the value of GLCP_SYS_RSTPLL should be 0000049C_0300182Ch (except that SWFLAGS (GLCP_SYS_RSTPLL[31:26]) is only reset to 0 on Power On Reset (POR)). So should I be using pll_reset(ManualConf)? I'll try it later today and see if I can get some debugging output.
If it is set by straps, it should be doing the right thing and you don't need to use the ManualConf. There could still be a corner case, and you should try to trace through the soft reset that is causing the problem. Also, have you diff'd the MC settings between the BIOS and coreboot? I would be interested in discrepancies.
Marc
I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:
http://coreboot.pastebin.com/m39b22c21
The only differences I can see are related to interrupts, which shouldn't matter in relation to my RAM problems.
I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot. The commercial BIOS didn't have any errors, but my coreboot did. So the hardware can't be too bad.
Nathan
On Mon, Nov 23, 2009 at 12:27 AM, Nathan Williams nathan@traverse.com.au wrote:
I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:
http://coreboot.pastebin.com/m39b22c21
The only differences I can see are related to interrupts, which shouldn't matter in relation to my RAM problems.
I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot. The commercial BIOS didn't have any errors, but my coreboot did. So the hardware can't be too bad.
That looks like just the southbridge cs5536 target. The memory differences would be in the processor geodelx target. Can you send those results?
Marc
I did some new MSR dumps.
Diff: ./msrtool -t geodelx -t cs5536 -d amd_ref_bios http://coreboot.pastebin.com/m5e487f87
AMD NAS reference BIOS: ./msrtool -t geodelx -t cs5536 -l -s amd_ref_bios http://coreboot.pastebin.com/madc04ac
My Coreboot: ./msrtool -t geodelx -t cs5536 -l -s nathan_bios http://coreboot.pastebin.com/m7f35d855
The diffs I did today show some differences with GLCP_DELAY_CONTROLS. Last time I added some code to force it to match the commercial BIOS GLCP_DELAY_CONTROLS MSR, but it didn't seem to make any difference.
I also tested all the SODIMMs I have here (about 10) with the commercial BIOS. Each time I did a msrtool diff against a dump I had saved on disk.
Most are 333MHz, but 2 are 400MHz. There weren't any changes to the MSRs.
Could there be an issue with the initialisation sequence that reading MSRs after booting won't show? Also, quite a few MSRs aren't defined in geodelx.c yet. Are there any obvious ones that should be added in?
Regards, Nathan
On Tue, Nov 24, 2009 at 1:09 AM, Nathan Williams nathan@traverse.com.au wrote:
The diffs I did today show some differences with GLCP_DELAY_CONTROLS. Last time I added some code to force it to match the commercial BIOS GLCP_DELAY_CONTROLS MSR, but it didn't seem to make any difference.
Could there be an issue with the initialisation sequence that reading MSRs after booting won't show? Also, quite a few MSRs aren't defined in geodelx.c yet. Are there any obvious ones that should be added in?
--- AMD NAS reference BIOS
+++ Nathan's coreboot v3
#
# GLCP_DELAY_CONTROLS
#
-0x4c00000f 0x83f1_00aa_5696_0404
+0x4c00000f 0x8271_005a_5696_0404
It looks like coreboot and the ref bios detect a different DIMM configuration. This timing setup could be part of the instability (I don't think it explains the reset problem). Look at SetDelayControl(void) and anywhere else that GLCP_DELAY_CONTROLS gets set to see what might be happening. Make sure that MTest is disabled in the ref bios setup. This setting is based on the number of devices (load) on the DIMM.
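When comparing two 64-bit MSR values like the GLCP_DELAY_CONTROLS pair above, XORing them shows at a glance which bits differ. The following is only an illustrative C sketch with made-up helper names; it does not decode the databook field names, which would have to be looked up separately.

```c
#include <stdint.h>

/* Bits set in the result are the bits where the two values differ. */
static uint64_t msr_diff_mask(uint64_t a, uint64_t b)
{
    return a ^ b;
}

/* Count set bits by repeatedly clearing the lowest one. */
static int count_bits(uint64_t v)
{
    int n = 0;

    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}
```

For the two values in the diff, 0x83f100aa56960404 and 0x8271005a56960404, the mask is 0x018000f000000000: six differing bits, all in the upper 32 bits, while the lower dword is identical in both dumps.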
I didn't realize that so few registers were in the msr tool for geodelx. You should add these:
20000018h R/W Refresh and SDRAM Program (MC_CF07_DATA) 10071007_00000040h Page 227
20000019h R/W Timing and Mode Program (MC_CF8F_DATA) 18000008_287337A3h Page 229
2000001Ah R/W Feature Enables (MC_CF1017_DATA) 00000000_11080001h Page 231
2000001Bh RO Performance Counters (MC_CFPERF_CNT1) 00000000_00000000h Page 232
2000001Ch R/W Counter and CAS Control (MC_PERCNT2) 00000000_00FF00FFh Page 233
2000001Dh R/W Clocking and Debug (MC_CFCLK_DBUG) 00000000_00001300h Page 233
4C00000Fh R/W GLCP I/O Delay Controls (GLCP_DELAY_CONTROLS) 00000000_00000000h Page 549
4C000014h R/W GLCP System Reset and PLL Control (GLCP_SYS_RSTPLL) Bootstrap specific Page 554
Marc
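For reference, the registers listed above can be written down as a simple address/name table in C. This is a hedged sketch only: the struct layout and the lookup function are assumptions made for this example and do not reflect how msrtool's geodelx.c actually describes MSRs. The addresses and names are the ones given in the list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry; not msrtool's real data structure. */
struct msr_entry {
    uint32_t addr;
    const char *name;
};

static const struct msr_entry extra_msrs[] = {
    { 0x20000018u, "MC_CF07_DATA" },
    { 0x20000019u, "MC_CF8F_DATA" },
    { 0x2000001Au, "MC_CF1017_DATA" },
    { 0x2000001Bu, "MC_CFPERF_CNT1" },
    { 0x2000001Cu, "MC_PERCNT2" },
    { 0x2000001Du, "MC_CFCLK_DBUG" },
    { 0x4C00000Fu, "GLCP_DELAY_CONTROLS" },
    { 0x4C000014u, "GLCP_SYS_RSTPLL" },
};

/* Return the register name for an address, or NULL if not listed. */
static const char *msr_name(uint32_t addr)
{
    for (size_t i = 0; i < sizeof(extra_msrs) / sizeof(extra_msrs[0]); i++)
        if (extra_msrs[i].addr == addr)
            return extra_msrs[i].name;
    return NULL;
}
```

A table like this is enough to label raw dumps by hand while the proper definitions are being added to the tool.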
I've now added the MSRs and uploaded to pastebin:
AMD NAS: http://coreboot.pastebin.com/m53aed60b
My coreboot: http://coreboot.pastebin.com/md23bc6a
./msrtool -d AMD_NAS: http://coreboot.pastebin.com/m77663de5
Tomorrow I'll try the tests on the NAS hardware, instead of our own motherboards just in case there are some hidden hardware issues.
Regards, Nathan
Nathan Williams wrote:
If you want to unclutter output a little, you can wipe the 5536 MSRs from the file after the first run. msrtool only considers the MSRs that are explicitly listed in the input file when run with -d.
(Another alternative is to list relevant MSRs in the file before the first run and run with -s rather than -l -s. The former reads and outputs values only for listed MSRs, the latter reads all known MSRs but has the benefit that no file needs to be created beforehand.)
//Peter
Thanks for the tips. Very helpful.
Nathan
Nathan Williams wrote:
Marc Jones wrote:
On Tue, Nov 24, 2009 at 1:09 AM, Nathan Williams nathan@traverse.com.au wrote:
Marc Jones wrote:
On Mon, Nov 23, 2009 at 12:27 AM, Nathan Williams nathan@traverse.com.au wrote:
I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:
http://coreboot.pastebin.com/m39b22c21
The only differences I can see are related to interrupts, which shouldn't matter in relation to my RAM problems.
I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot. The commercial BIOS didn't have any errors, but my coreboot did. So the hardware can't be too bad.
That looks like just the southbridge cs5536 target. The memory differences would be in the processor geodelx target. Can you send those results?
Marc
I did some new MSR dumps.
Diff: ./msrtool -t geodelx -t cs5536 -d amd_ref_bios http://coreboot.pastebin.com/m5e487f87
AMD NAS reference BIOS: ./msrtool -t geodelx -t cs5536 -l -s amd_ref_bios http://coreboot.pastebin.com/madc04ac
My Coreboot: ./msrtool -t geodelx -t cs5536 -l -s nathan_bios http://coreboot.pastebin.com/m7f35d855
The diffs I did today show some differences with GLCP_DELAY_CONTROLS. Last time I added some code to force it to match the commercial BIOS GLCP_DELAY_CONTROLS MSR, but it didn't seem to make any difference.
I also tested all the SODIMMS I have here (about 10) with the commercial BIOS. Each time I did a msrtool diff to one I saved on disk.
Most are 333MHz, but 2 are 400MHz. There weren't any changes to the MSRs.
Could there be an issue with the initialisation sequence that reading MSRs after booting won't show? Also, quite a few MSRs aren't defined in geodelx.c yet. Are there any obvious ones that should be added in?
--- AMD NAS reference BIOS
+++ Nathan's coreboot v3
#
# GLCP_DELAY_CONTROLS
#
-0x4c00000f 0x83f1_00aa_5696_0404
+0x4c00000f 0x8271_005a_5696_0404
It looks like coreboot and the ref BIOS detect a different DIMM configuration. This timing setup could be part of the instability (I don't think it explains the reset problem). Look at the code in SetDelayControl(void), and anywhere else that GLCP_DELAY_CONTROLS gets set, to see what might be happening. Make sure that MTest is disabled in the ref BIOS setup. This setting is based on the number of devices (load) on the DIMM.
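To see exactly which bits of GLCP_DELAY_CONTROLS differ between the two dumps, the values can be XORed. The following is only a minimal sketch: the two constants are copied from the diff above, while the helper names `msr_diff_mask` and `print_diff_bits` are made up for illustration and are not part of msrtool or coreboot.

```c
#include <stdint.h>
#include <stdio.h>

/* Values copied from the msrtool diff of MSR 0x4c00000f (GLCP_DELAY_CONTROLS). */
#define DELAY_REF_BIOS 0x83f100aa56960404ULL /* AMD NAS reference BIOS */
#define DELAY_COREBOOT 0x8271005a56960404ULL /* coreboot v3 */

/* Hypothetical helper: a 1 in every bit position where the two values differ. */
static uint64_t msr_diff_mask(uint64_t a, uint64_t b)
{
	return a ^ b;
}

/* Hypothetical helper: print each differing bit, highest bit first. */
static void print_diff_bits(uint64_t a, uint64_t b)
{
	uint64_t mask = msr_diff_mask(a, b);

	for (int bit = 63; bit >= 0; bit--) {
		if (mask & (1ULL << bit))
			printf("bit %d: ref=%d coreboot=%d\n", bit,
			       (int)((a >> bit) & 1), (int)((b >> bit) & 1));
	}
}
```

Calling `print_diff_bits(DELAY_REF_BIOS, DELAY_COREBOOT)` lists bits 56, 55 and 39 down to 36. The low 32 bits are identical, so only fields in the upper half of the register need to be checked against the databook.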
I didn't realize that so few registers were in the msr tool for geodelx. You should add these:

20000018h  R/W  Refresh and SDRAM Program (MC_CF07_DATA)       10071007_00000040h  Page 227
20000019h  R/W  Timing and Mode Program (MC_CF8F_DATA)         18000008_287337A3h  Page 229
2000001Ah  R/W  Feature Enables (MC_CF1017_DATA)               00000000_11080001h  Page 231
2000001Bh  RO   Performance Counters (MC_CFPERF_CNT1)          00000000_00000000h  Page 232
2000001Ch  R/W  Counter and CAS Control (MC_PERCNT2)           00000000_00FF00FFh  Page 233
2000001Dh  R/W  Clocking and Debug (MC_CFCLK_DBUG)             00000000_00001300h  Page 233

4C00000Fh  R/W  GLCP I/O Delay Controls (GLCP_DELAY_CONTROLS)  00000000_00000000h  Page 549
4C000014h  R/W  GLCP System Reset and PLL Control (GLCP_SYS_RSTPLL)  Bootstrap specific  Page 554
Marc
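As a rough illustration of what adding the registers listed above might look like, here is a small lookup table in C. The addresses and names come from that list. The `struct msr_entry` layout and the `find_msr` helper are assumptions made for this sketch; msrtool's real geodelx.c uses its own types and should be consulted before copying anything.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout, NOT msrtool's actual structure. */
struct msr_entry {
	uint32_t addr;
	const char *name;
	const char *desc;
};

/* Addresses and names as given in the list above. */
static const struct msr_entry geodelx_extra[] = {
	{ 0x20000018, "MC_CF07_DATA",        "Refresh and SDRAM Program" },
	{ 0x20000019, "MC_CF8F_DATA",        "Timing and Mode Program" },
	{ 0x2000001a, "MC_CF1017_DATA",      "Feature Enables" },
	{ 0x2000001b, "MC_CFPERF_CNT1",      "Performance Counters" },
	{ 0x2000001c, "MC_PERCNT2",          "Counter and CAS Control" },
	{ 0x2000001d, "MC_CFCLK_DBUG",       "Clocking and Debug" },
	{ 0x4c00000f, "GLCP_DELAY_CONTROLS", "GLCP I/O Delay Controls" },
	{ 0x4c000014, "GLCP_SYS_RSTPLL",     "GLCP System Reset and PLL Control" },
};

/* Hypothetical helper: return the entry for addr, or NULL if it is not listed. */
static const struct msr_entry *find_msr(uint32_t addr)
{
	size_t n = sizeof(geodelx_extra) / sizeof(geodelx_extra[0]);

	for (size_t i = 0; i < n; i++) {
		if (geodelx_extra[i].addr == addr)
			return &geodelx_extra[i];
	}
	return NULL;
}
```

Keeping the table flat and keyed by address mirrors how the dumps are diffed: each line of msrtool output starts with the MSR address, so a lookup by address is all a pretty-printer would need.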
I've now added the MSRs and uploaded to pastebin:
AMD NAS: http://coreboot.pastebin.com/m53aed60b
My coreboot: http://coreboot.pastebin.com/md23bc6a
./msrtool -d AMD_NAS: http://coreboot.pastebin.com/m77663de5
Tomorrow I'll try the tests on the NAS hardware, instead of our own motherboards just in case there are some hidden hardware issues.
Regards, Nathan
On the NAS reference board I got the following diff between coreboot and the commercial BIOS:
http://coreboot.pastebin.com/m1353db1a
As you can see there are a lot of latency differences. Unfortunately, I only realised later that these differences arise because the bootstraps are set to bypass, which means coreboot uses 266 as the speed, whereas the commercial BIOS uses 333. When I repeat the same comparison on our boards, the only difference in the geodelx MSRs is:
# MC_CFCLK_DBUG
-0x2000001d 0x0000000000000000
+0x2000001d 0x0000000000001000
# 12 TRISTATE_DIS TRI-STATE Disable
-0: Tri-stating enabled
+1: Tri-stating disabled
Nathan
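The single MC_CFCLK_DBUG difference in the message above comes down to bit 12 (TRISTATE_DIS). A minimal sketch of decoding that bit from a raw 64-bit MSR value follows. The macro and function names are invented for this example, and the two values tested are simply the "-" and "+" sides of the diff.

```c
#include <stdint.h>

#define MC_CFCLK_DBUG_ADDR 0x2000001dU /* MSR address from the diff */
#define TRISTATE_DIS_SHIFT 12          /* bit position from the diff */

/* Hypothetical helper: 1 if tri-stating is disabled, 0 if it is enabled. */
static int tristate_disabled(uint64_t msr_value)
{
	return (int)((msr_value >> TRISTATE_DIS_SHIFT) & 1);
}
```

With the values from the diff, `tristate_disabled(0x0000000000000000ULL)` gives 0 (tri-stating enabled) and `tristate_disabled(0x0000000000001000ULL)` gives 1 (tri-stating disabled).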
On Fri, Nov 27, 2009 at 2:05 AM, Nathan Williams nathan@traverse.com.au wrote:
Nathan,
I don't think the tri-state disable bit explains the problems you have seen. Since the memory has the same settings, the problem must be somewhere else. You will need to go back to the reboot path to investigate. It seems like something in the reset isn't doing a complete reset, which causes a problem with the cache disable.
Marc
I am suspicious that the reset problem only occurs when I'm using a laptop hard drive off the 44pin IDE connector on our board. I have tried booting with a 3.5" drive and external 12V, but I can't replicate the problem. With the 3.5" drive, a reboot from fsck works fine. Hopefully the next PCB revision should perform better because we've moved the 5V plane further away from the DDR tracks.
I don't know if I mentioned another problem that has similar symptoms. Some RAM causes the same cache disable problem, even if there are no IDE devices connected. This happens from power-up, so it's not a reset issue.
Nathan