(sorry I can't post a proper reply message, I picked that up from the archives)
Nathan Williams <nathan at traverse.com.au> wrote:
I am suspicious that the reset problem only occurs when I'm using a laptop hard drive off the 44-pin IDE connector on our board. I have tried booting with a 3.5" drive and external 12V, but I can't replicate the problem. With the 3.5" drive, a reboot from fsck works fine. Hopefully the next PCB revision will perform better, because we've moved the 5V plane further away from the DDR tracks.
I don't know if I mentioned another problem that has similar symptoms. Some RAM causes the same cache disable problem, even if there are no IDE devices connected. This happens from power-up, so it's not a reset issue.
I'm facing a very similar issue here on an ALIX.2D board which is based on the same chipset. The problem occurs only sometimes, just like you described it, and resetting from Linux gives a higher chance of provoking it. However, I also see it on roughly one out of 20 power-up cycles.
What's really strange is that sometimes not even power cycling will fix it - coreboot then always stops at the same point (from what I've traced, exactly at the same instructions that you pointed out). Powering off and waiting for ~10 minutes usually brings the board back to life.
Connecting or disconnecting external 44-pin IDE drives does not appear to affect the probability, though. One more thing worth noting is that the effect is harder to trigger when booting from an external LPC flash emulator (in contrast to coreboot flashed to the internal LPC).
I urgently need to resolve this and would appreciate any hints about where to add more code for flushing caches and the like. I also suspect the reset vector code of not properly flushing the hardware state, but I must admit I'm somewhat lost in this codebase, and the Geode is not something I'm terribly familiar with either.
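Just to make it more concrete, something along these lines is what I had in mind adding around the CAR teardown - purely a sketch of my own, and whether this is the right place at all is exactly what I don't know:

/* Sketch only: force a full write-back and invalidation of the caches
 * before cache-as-RAM is torn down, so no stale lines survive into the
 * post-CAR environment.  Where (and whether) to call this is exactly
 * what I'm unsure about. */
static inline void flush_caches_before_car_teardown(void)
{
        __asm__ __volatile__("wbinvd" ::: "memory");
}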
Thanks, Daniel
Daniel Mack wrote:
the effect is harder to trigger when booting from an external LPC flash emulator (in contrast to coreboot flashed to the internal LPC).
Then you could experiment with a few different flash chips.
PC Engines makes a nice and neat Flash recovery board, which plugs onto the LPC header, and comes with a PLCC chip in a socket.
Order one or two of these, and order a few different flash chips which are compatible with the board, and start collecting data points.
Each LPC.1A comes with SST49LF040B. On the ALIX.2 and .3 there is an AMIC flash chip. (Different from ALIX.1 which also has SST.)
You can also use Winbond W39V040APZ/080APZ. Note the [0-9]A suffix; it must not be e.g. W39V040FAPZ (note [0-9]FA), which is an FWH chip.
Another compatible chip is PMC Pm49FL004T.
//Peter
On Wed, Dec 02, 2009 at 02:59:01PM +0100, Peter Stuge wrote:
Daniel Mack wrote:
the effect is harder to trigger when booting from an external LPC flash emulator (in contrast to coreboot flashed to the internal LPC).
Then you could experiment with a few different flash chips.
PC Engines makes a nice and neat Flash recovery board, which plugs onto the LPC header, and comes with a PLCC chip in a socket.
I doubt the flash chip itself is the problem. Maybe I haven't been totally clear about what I observed.
When using the Linux tools to flash an image to the internal LPC, the system most likely won't come up immediately. I need that power-off delay of some minutes to reanimate the board. After that, the bug is very hard to trigger, even though it does happen, especially when powering the device off (by unplugging the supply) and on again right after that.
So my theory is that something is left over in some part of the system which makes coreboot fail in disable_car(). And the same (or maybe just a similar) effect is triggered when the LPC flash is written.
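If it helps, this is roughly how I've been trying to narrow it down - POST codes on port 0x80 around the suspect call (the wrapper and the code values are just my own, nothing official):

#include <arch/io.h>

/* Crude tracing: emit distinct POST codes before and after the CAR
 * teardown, so the last value shown on a POST card tells me which
 * step never returns.  The code values are arbitrary. */
static void trace_car_teardown(void)
{
        outb(0xd1, 0x80);       /* just before disabling cache-as-RAM */
        disable_car();          /* this is where the bad boots hang */
        outb(0xd2, 0x80);       /* never reached when the bug triggers */
}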
Does that ring a bell? As I said, I'm pretty lost in debugging this, but I'm fairly sure we're not looking at a hardware issue.
Thanks, Daniel
No hint, anyone?
On Fri, Dec 4, 2009 at 9:47 AM, Daniel Mack daniel@caiaq.de wrote:
No hint, anyone?
Maybe you could zero all the RAM. If you have to power it down for a specific amount of time, that could be the time it takes for the RAM to lose its state. If that works, you could start looking for uninitialized variables or a bad pointer somewhere.
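Something crude like this, hooked in right after RAM init, would do as a first test (the helper and its arguments are just a sketch - call it with base 0 and whatever size your RAM init detected, while you're still running from cache-as-RAM so the stack isn't in DRAM yet):

#include <stdint.h>

/* Sketch: clear DRAM once the controller is up, so nothing that
 * survived a warm reset (old stack contents, stale data, a wild
 * pointer target) can influence the next boot. */
static void zero_ram(uintptr_t base, uint32_t size_bytes)
{
        volatile uint32_t *p = (volatile uint32_t *)base;
        uint32_t i;

        for (i = 0; i < size_bytes / sizeof(*p); i++)
                p[i] = 0;
}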
As long as you're grasping at straws :)
Thanks, Myles
On Fri, Dec 4, 2009 at 8:47 AM, Daniel Mack daniel@caiaq.de wrote:
No hint, anyone?
Just about every time I had this problem on my Geodes it was a problem with DRAM. Just about every time. It's quite weird how well DRAM can work even if it has not been programmed correctly. The correspondence with disable_car() might just be that there's lots of burst cache traffic to RAM when you do this operation and the cache is suddenly connected to DRAM again.
Also, over the years, we have frequently found that DRAM vendors are, well, less-than-honest about their product. One experience was on OLPC. We had three boards, all with nominally the same parts, but from different vendors. Boards A&B worked with faster timing; boards A&C worked with medium timing; and boards B&C only worked with the slowest timing. (I believe in this case it was the RAS-to-CAS delay.)
Yes, indeed, it's not always true that slowing down DRAM makes it work :-)
Rather than "power off for 10 minutes" -- I assume this is "at the wall plug" -- I wonder if you'd see an improvement if you yanked the DC power at the board. Which were you doing -- AC or DC power off?
Thanks
ron
Hi Ron,
thanks for your answer.
On Fri, Dec 04, 2009 at 09:03:14AM -0800, ron minnich wrote:
On Fri, Dec 4, 2009 at 8:47 AM, Daniel Mack daniel@caiaq.de wrote:
No hint, anyone?
Just about every time I had this problem on my Geodes it was a problem with DRAM. Just about every time. It's quite weird how well DRAM can work even if it has not been programmed correctly. The correspondence with disable_car() might just be that there's lots of burst cache traffic to RAM when you do this operation and the cache is suddenly connected to DRAM again.
Help me understand how the DRAM can be programmed correctly. Is it about timing constraints?
Also, over the years, we have frequently found that DRAM vendors are, well, less-than-honest about their product. One experience was on OLPC. We had three boards, all with nominally the same parts, but from different vendors. Boards A&B worked with faster timing; boards A&C worked with medium timing; and boards B&C only worked with the slowest timing. (I believe in this case it was the RAS-to-CAS delay.)
That could well be an explanation for what I'm seeing; however, I wonder why all boards work totally stably once they have booted. Wouldn't wrong DRAM settings result in unpredictable behaviour such as sporadic failures? I don't see anything like that.
Rather than "power off for 10 minutes" -- I assume this is "at the wall plug" -- I wonder if you'd see an improvement if you yanked the DC power at the board. Which were you doing -- AC or DC power off?
I was unplugging the DC jack from the board. There are some blocking capacitors on it, but I doubt they will keep any part of the system alive much longer than a couple of seconds. Even a power-off of something like 10 seconds doesn't reliably solve it, though - only sometimes, and I haven't found a reliable pattern yet. Damn, I really wish I could provide more specific input :-/
Thanks, Daniel
On Fri, Dec 4, 2009 at 9:12 AM, Daniel Mack daniel@caiaq.de wrote:
Help me understand how the DRAM can be programmed correctly. Is it about timing constraints?
It's how you set the timing in the DRAM controller and how well it matches the DRAM, but it's also about the order in which you program things and the timing with which you issue the commands. If you're doing v3 this should all "just work"; it certainly used to for me. But I have not touched this code in 9 months or more.
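If you want another data point, dump the memory controller timing MSRs on a good boot and on a bad boot and diff them. Roughly like this - the MSR index below is only a placeholder, take the real one(s) from the LX databook or the raminit code, and adjust the includes and the printk call to whatever your tree uses:

#include <cpu/x86/msr.h>
#include <console/console.h>

/* Placeholder: look up the actual Geode LX memory controller timing
 * MSR number(s) in the databook or in the raminit code. */
#define MC_TIMING_MSR   0x20000000

static void dump_dram_timing(void)
{
        msr_t m = rdmsr(MC_TIMING_MSR);

        printk(BIOS_DEBUG, "MC timing MSR %#x: hi=%08x lo=%08x\n",
               MC_TIMING_MSR, m.hi, m.lo);
}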
That could well be an explanation for what I'm seeing; however, I wonder why all boards work totally stably once they have booted. Wouldn't wrong DRAM settings result in unpredictable behaviour such as sporadic failures? I don't see anything like that.
I wish I knew.
I was unplugging the DC jack from the board. There are some blocking capacitors on it, but I doubt they will keep any part of the system alive much longer than a couple of seconds. Even a power-off of something like 10 seconds doesn't reliably solve it, though - only sometimes, and I haven't found a reliable pattern yet. Damn, I really wish I could provide more specific input :-/
This points more to what Myles was saying -- you might want to zero all of memory and see if that helps. Are you using crosstool to build? If not, you should.
ron
On Fri, Dec 4, 2009 at 10:12 AM, Daniel Mack daniel@caiaq.de wrote:
I was unplugging the DC jack from the board. There are some blocking capacitors on it, but I doubt they will keep any part of the system alive much longer than a couple of seconds. Even a power-off of something like 10 seconds doesn't reliably solve it, though - only sometimes, and I haven't found a reliable pattern yet. Damn, I really wish I could provide more specific input :-/
I'm a little confused. Is the failure always at disable_car() when you do the flash programming? What does "the system most likely won't come up immediately" mean? This description sounds more like the CS5536 being in a bad state, which may or may not have to do with RAM. I have heard of problems with the 5536 getting locked up if the power sequencing is not exactly right. Does it work if you unplug, remove the CMOS battery, press the power button to drain any residual charge, and then plug it back in?
If it always breaks at disable_car(), it could be a memory or cache state problem that wouldn't be seen with the legacy BIOS because it doesn't do CAR. It could still be hardware/power sequence related since we don't see this on every platform. As far as I know, the AMD reference designs and the Artec mainboards don't exhibit this problem.
Marc