It turns out that Rev 4099 breaks the hp dl145_g3. It boots into Linux which panics and complains "This is not a software error". Sorry I don't have the exact error message any more.
Samuel said he thinks it has something to do with one processor overwriting another's space.
Help?
Thanks, Myles
On Thu, Apr 23, 2009 at 03:10:14PM -0600, Myles Watson wrote:
It turns out that Rev 4099 breaks the hp dl145_g3. It boots into Linux which panics and complains "This is not a software error". Sorry I don't have the exact error message any more.
Is it this one?
[ 0.000000] ACPI: Interpreter disabled.
[ 0.000000] Linux Plug and Play Support v0.97 (c) Adam Belay
[ 0.000000] pnp: PnP ACPI: disabled
[ 0.000000] PCI: Probing PCI hardware
[ 0.000000] PCI: Transparent bridge - 0000:00:06.0
[ 0.000000] PCI: Using IRQ router default [10de/0370] at 0000:00:06.0
[ 0.000000]
[ 0.000000] HARDWARE ERROR
[ 0.000000] CPU 0: Machine Check Exception: 4 Bank 4: fe28a001fd080813
[ 0.000000] TSC 2eefd49369 ADDR f0050 MISC c0090e7e00000000
[ 0.000000] This is not a software problem!
[ 0.000000] Run through mcelog --ascii to decode and contact your hardware vendor
[ 0.000000] Kernel panic - not syncing: Machine check
I'm seeing this too on Supermicro h8dme with head (r4198).
I also tried 4022 (the initial commit for h8dme), and that one is fine.
I think we have a broken tree.
Thanks, Ward.
On Thu, 23 Apr 2009 17:28:28 -0400, Ward Vandewege ward@gnu.org wrote:
I'm seeing this too on Supermicro h8dme with head (r4198).
I also tried 4022 (the initial commit for h8dme), and that one is fine.
I think we have a broken tree.
Ahh, this is what I am getting on the Thomson IP1000 "This is not a software problem!" and "Kernel panic - not syncing: Machine check" as of a build last night. But the rest is different, I don't have it in front of me but it is complaining about things like page_faults and USB....
On Thu, Apr 23, 2009 at 05:35:57PM -0400, Joseph Smith wrote:
Ahh, this is what I am getting on the Thomson IP1000 "This is not a software problem!" and "Kernel panic - not syncing: Machine check" as of a build last night. But the rest is different, I don't have it in front of me but it is complaining about things like page_faults and USB....
Full boot log at
http://ward.vandewege.net/coreboot/h8dme/rev4198-mce.log
if you want to compare.
Thanks, Ward.
On Thu, 23 Apr 2009 17:38:36 -0400, Ward Vandewege ward@gnu.org wrote:
Full boot log at
http://ward.vandewege.net/coreboot/h8dme/rev4198-mce.log
if you want to compare.
Anyone else getting strange Kernel panics with recent builds?
Can someone decode that machine check exception?
Rudolf
Rudolf Marek r.marek@assembler.cz writes:
Can someone decode that machine check exception?
Looks like an MCE error. Which makes sense if this revision introduced changes regarding what memory regions are initialized at bootup.
Arne Georg Gleditsch arne.gleditsch@numascale.com writes:
Looks like an MCE error. Which makes sense if this revision introduced changes regarding what memory regions are initialized at bootup.
Uhm. That was intended to read "Looks like an ECC error." Not enough coffee...
Here is the mcelog output for the MCE posted to the list:
marekr2@queeg:~$ echo CPU 0: Machine Check Exception: 4 Bank 4: fe28a001fd080813 TSC 2eefd49369 ADDR f0050 MISC c0090e7e00000000 | /usr/sbin/mcelog --ascii --k8
mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = fd51
       bit32 = err cpu0
       bit45 = uncorrected ecc error
       bit57 = processor context corrupt
       bit59 = misc error valid
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS fe28a001fd080813 MCGSTATUS 4
And from Ward:
CPU 0 4 northbridge
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = fd51
       bit32 = err cpu0
       bit45 = uncorrected ecc error
       bit57 = processor context corrupt
       bit59 = misc error valid
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS fe28a001fd080813 MCGSTATUS 4
Seems something went wrong with ECC. Maybe because the memory is not cleared anymore?
Can someone test whether the problem goes away if this change is reverted?
http://tracker.coreboot.org/trac/coreboot/changeset/4099/trunk/coreboot-v2/s...
Rudolf
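For anyone who wants to cross-check the mcelog decode above by hand, here is a minimal sketch in C. The bit positions are assumptions taken from the K8 northbridge (bank 4) status register layout as I understand it, and the names `nb_mce` and `decode_nb_status` are made up for illustration; they are not from coreboot or mcelog.

```c
#include <stdint.h>

/* Assumed K8 northbridge (bank 4) MCi_STATUS fields. Verify the bit
 * positions against the BKDG for your CPU before relying on them. */
struct nb_mce {
	int valid;              /* bit 63: register holds a valid error */
	int overflow;           /* bit 62: more than one error was seen */
	int uncorrected;        /* bit 61: error was not corrected */
	int misc_valid;         /* bit 59: MISC register is valid */
	int pcc;                /* bit 57: processor context corrupt */
	int cecc;               /* bit 46: correctable ECC error */
	int uecc;               /* bit 45: uncorrectable ECC error */
	unsigned int ext_error; /* bits 19:16: extended error code */
	unsigned int syndrome;  /* 16-bit chipkill ECC syndrome */
};

/* Return bit n of v as 0 or 1. */
static int status_bit(uint64_t v, unsigned int n)
{
	return (int)((v >> n) & 1);
}

/* Split a raw status value into the fields listed above. */
static struct nb_mce decode_nb_status(uint64_t status)
{
	struct nb_mce m;

	m.valid = status_bit(status, 63);
	m.overflow = status_bit(status, 62);
	m.uncorrected = status_bit(status, 61);
	m.misc_valid = status_bit(status, 59);
	m.pcc = status_bit(status, 57);
	m.cecc = status_bit(status, 46);
	m.uecc = status_bit(status, 45);
	m.ext_error = (unsigned int)((status >> 16) & 0xf);
	/* Syndrome[7:0] sits in bits 54:47, syndrome[15:8] in bits 31:24. */
	m.syndrome = (unsigned int)(((status >> 24) & 0xff) << 8)
		   | (unsigned int)((status >> 47) & 0xff);
	return m;
}
```

Calling `decode_nb_status(0xfe28a001fd080813ULL)` gives syndrome 0xfd51 with the uncorrected ECC, processor context corrupt and overflow bits set, which agrees with the mcelog text above.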
On Fri, Apr 24, 2009 at 11:26:55AM +0200, Rudolf Marek wrote:
Can someone test whether the problem goes away if this change is reverted?
http://tracker.coreboot.org/trac/coreboot/changeset/4099/trunk/coreboot-v2/s...
Confirmed, reverting that hunk fixes the problem for me.
Thanks, Ward.
On Fri, 24 Apr 2009 11:31:58 -0400, Ward Vandewege ward@gnu.org wrote:
Confirmed, reverting that hunk fixes the problem for me.
Ward, why do your emails keep coming up as spam?
On Fri, Apr 24, 2009 at 11:38 AM, Joseph Smith joe@settoplinux.org wrote:
Ward, why do your emails keep coming up as spam?
-- Thanks, Joseph Smith Set-Top-Linux www.settoplinux.org
-- coreboot mailing list: coreboot@coreboot.org http://www.coreboot.org/mailman/listinfo/coreboot
Hello! A better question is: Joe, what are you using to read your e-mail from the list?
----- Gregg C Levine gregg.drwho8@gmail.com "This signature was once found posting rude messages in English in the Moscow subway."
On Fri, Apr 24, 2009 at 12:03 PM, Gregg Levine gregg.drwho8@gmail.com wrote:
On Fri, Apr 24, 2009 at 11:38 AM, Joseph Smith joe@settoplinux.org wrote:
Ward, why do your emails keep coming up as spam?
Hello! A better question is: Joe, what are you using to read your e-mail from the list?
Or why don't you have an exception in your spam filter for [coreboot] mails? Ward's mails never come up as spam in my gmail.
-Corey
Ward Vandewege wrote:
Confirmed, reverting that hunk fixes the problem for me.
Oops.
So does ECC always have to be scrubbed going out of self refresh? That seems bad.
//Peter
On Fri, Apr 24, 2009 at 10:58 AM, Peter Stuge peter@stuge.se wrote:
Ward Vandewege wrote:
Confirmed, reverting that hunk fixes the problem for me.
Oops.
So does ECC always have to be scrubbed going out of self refresh? That seems bad.
Well, were these failures occurring on a resume or on a cold boot?
If we're doing a cold boot and not clearing memory (i.e. not scrubbing ECC tags), that may be trouble. I get the impression from this patch that that is what we are doing.
The safest way to scrub ECC, if you don't know if you are resuming or cold booting, is to read the memory and write it back. Even if the memory is garbage, the ECC tags will end up with a valid value. This is slower however.
We've always done the clear_memory because we did not have resume. I think having resume is going to affect every cpu type.
But do we know if it is a resume? Maybe we should have this kind of thing:
if (is_resume()) /* or whatever it's called */
        clear_memory(_RAMBASE, (CONFIG_LB_MEM_TOPK << 10) - _RAMBASE - DCACHE_RAM_SIZE);
else
        clear_memory(0, (CONFIG_LB_MEM_TOPK << 10) - DCACHE_RAM_SIZE);
hmm, did I get that backwards? Anyway, hope this makes sense.
ron
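To make the two ranges in the suggestion above concrete, here is a minimal sketch in C. The constants are placeholder values chosen only for illustration (the real `_RAMBASE`, `CONFIG_LB_MEM_TOPK` and `DCACHE_RAM_SIZE` come from the board configuration), and `pick_clear_range` is a hypothetical helper, not a coreboot function.

```c
#include <stdint.h>

/* Placeholder values for illustration only; not taken from any board. */
#define RAMBASE          0x00100000u /* stands in for _RAMBASE */
#define LB_MEM_TOPK      2048u       /* stands in for CONFIG_LB_MEM_TOPK (KiB) */
#define DCACHE_RAM_SIZE  0x00008000u

struct clear_range {
	uint32_t base; /* first byte to clear */
	uint32_t size; /* number of bytes to clear */
};

/* On resume, leave low memory alone and clear only from RAMBASE upward.
 * On a cold boot, clear from address 0 so that low memory is written too. */
static struct clear_range pick_clear_range(int resuming)
{
	const uint32_t top = (LB_MEM_TOPK << 10) - DCACHE_RAM_SIZE;
	struct clear_range r;

	if (resuming) {
		r.base = RAMBASE;
		r.size = top - RAMBASE;
	} else {
		r.base = 0;
		r.size = top;
	}
	return r;
}
```

Both branches end at the same address; only the starting point differs, so the resume case skips exactly the low memory below `RAMBASE`.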
On Fri, Apr 24, 2009 at 11:07:04AM -0700, ron minnich wrote:
Well, were these failures occurring on a resume or on a cold boot?
Cold boot. H8dme does not have acpi (yet).
Thanks, Ward.
Cold boot. H8dme does not have acpi (yet).
Hmm, so memory scrubbing does not work and this patch just unmasked that. Unfortunately I don't know how to fix that.
Ron?
Rudolf
On Fri, Apr 24, 2009 at 11:28 AM, Rudolf Marek r.marek@assembler.cz wrote:
Cold boot. H8dme does not have acpi (yet).
Hmm, so memory scrubbing does not work and this patch just unmasked that. Unfortunately I don't know how to fix that.
Am I misreading the patch? From my reading, you turned off memory scrubbing completely for low RAM?
"The memory cleared now is just the coreboot memory not the low memory."
If you revert that patch it fixes the problem, right?
My interpretation: you no longer zero low memory because you don't want to zero low memory on a resume. This is good for resume, and very bad for cold boot.
If I got this wrong, let me know.
ron
If you revert that patch it fixes the problem, right?
Yes
My interpretation: you no longer zero low memory because you don't want to zero low memory on a resume. This is good for resume, and very bad for cold boot.
Yes, it seems you are right. But I believe the ECC setup should take care of that somewhere?
But why does it work for the rest of memory? The rest of memory is never cleared, only up to TOPK. That's why I assumed it was OK to change that and clear memory from _RAMBASE to TOPK rather than from 0 to TOPK.
Rudolf
On Fri, Apr 24, 2009 at 11:56 AM, Rudolf Marek r.marek@assembler.cz wrote:
But why does it work for the rest of memory? The rest of memory is never cleared, only up to TOPK. That's why I assumed it was OK to change that and clear memory from _RAMBASE to TOPK rather than from 0 to TOPK.
Don't forget that the systems we boot also do things to memory that can correctly set the ECC tags.
I would have to see the sizes etc., and I can't look right now. But don't forget that Linux zeros pages as it allocates them. I don't know all the places where memory is initialized in the kernel at this point, but it could be we've been lucky. I don't know.
Anyway, you need to make that code a little smarter and, on cold boot, zero the low memory too.
ron
On Fri, Apr 24, 2009 at 1:01 PM, ron minnich rminnich@gmail.com wrote:
Don't forget that the systems we boot also do things to memory that can correctly set the ECC tags.
I would have to see the sizes etc., and I can't look right now. But don't forget that Linux zeros pages as it allocates them. I don't know all the places where memory is initialized in the kernel at this point, but it could be we've been lucky. I don't know.
Anyway, you need to make that code a little smarter and, on cold boot, zero the low memory too.
Are we going to revert this bit for now, or do we have a way to fix it in the works?
Thanks, Myles
Are we going to revert this bit for now, or do we have a way to fix it in the works?
It depends. I can put an #ifdef for resume around it, but I don't think that is a proper fix. I don't know much about ECC, so I don't know the correct way to fix this. If all ECC tags must be cleared by a memory write, the code is also wrong, because it zeroes memory only up to TOPK and not up to MEMTOP or MEMTOP2.
Rudolf
On Mon, Apr 27, 2009 at 8:25 AM, Rudolf Marek r.marek@assembler.cz wrote:
Are we going to revert this bit for now, or do we have a way to fix it in the works?
It depends. I can put an #ifdef for resume around it, but I don't think that is a proper fix. I don't know much about ECC, so I don't know the correct way to fix this. If all ECC tags must be cleared by a memory write, the code is also wrong, because it zeroes memory only up to TOPK and not up to MEMTOP or MEMTOP2.
It's not that the ECC tags need to be cleared. The issue is that they must, by design, have a value that is correct for the data stored in memory.
To recap: we don't want to zero memory on resume, but we need ECC tags "synced up" on power-on or reset. Setting a valid value in the tags does not require zeroing memory. It requires a write to memory. That is all. When the CPU writes to memory, the tags get a known value instead of the semi-random value they have on power-on.
As I said, I believe it is sufficient to make the ECC tags clean to do this:
volatile unsigned long *lp; /* or whatever declaration makes gcc do the right thing */
unsigned long i;
for (i = 0, lp = start_of_memory; i < memory_size_in_bytes / sizeof(*lp); i++, lp++)
        *lp = *lp;
This will preserve memory contents. It will do the right thing on both resume and power on reset. This will run more slowly than the memset but it will in fact give you clean memory tags. It is useful for recovering system state after a reset. You should not get ECC interrupts on the read because interrupts are disabled.
I also think the code I just showed might not be necessary. It will slow down resume if the DRAM has valid data. Maybe what really needs to be done is more sophisticated code which determines if the dram is valid and need not be initialized.
I think the code needs to be backed out and a new version tested out of the tree. Right now the code in the tree is wrong, as we can see from the fact that it is breaking.
Thanks,
ron
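The read-and-write-back idea above can be tried out on a toy model. The sketch below is only an illustration under loud assumptions: real DIMMs use SECDED or chipkill codes computed by the memory controller, while here an XOR fold of the data bytes stands in for "a tag derived from the data". All names (`ecc_word`, `ecc_write`, `ecc_read`, `scrub`, `count_errors`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of one ECC-protected word: data plus a tag derived from it. */
struct ecc_word {
	uint64_t data;
	uint8_t check;
};

/* Stand-in for the real ECC code: XOR of the eight data bytes. */
static uint8_t check_of(uint64_t data)
{
	uint8_t c = 0;

	for (int i = 0; i < 8; i++)
		c ^= (uint8_t)(data >> (8 * i));
	return c;
}

/* A write stores the data together with a matching tag. */
static void ecc_write(struct ecc_word *w, uint64_t data)
{
	w->data = data;
	w->check = check_of(data);
}

/* A read returns the data and reports whether the stored tag matched. */
static uint64_t ecc_read(const struct ecc_word *w, int *error)
{
	*error = (w->check != check_of(w->data));
	return w->data;
}

/* Read every word and write it back. The contents are preserved, and each
 * tag becomes valid for whatever data happened to be there. */
static void scrub(struct ecc_word *mem, size_t words)
{
	int ignored;

	for (size_t i = 0; i < words; i++)
		ecc_write(&mem[i], ecc_read(&mem[i], &ignored));
}

/* Count the words whose tag disagrees with their data. */
static size_t count_errors(const struct ecc_word *mem, size_t words)
{
	size_t n = 0;
	int err;

	for (size_t i = 0; i < words; i++) {
		(void)ecc_read(&mem[i], &err);
		n += (size_t)err;
	}
	return n;
}
```

Filling an array with arbitrary data and unrelated tags models power-on: `count_errors` is nonzero before `scrub` and zero afterwards, while the data words are unchanged. That is the property that matters on resume, where the contents must survive.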
It turns out that Rev 4099 breaks the hp dl145_g3. It boots into Linux which panics and complains "This is not a software error". Sorry I don't have the exact error message any more.
The only thing that is "active" without HAVE_ACPI_RESUME == 1 is this one:
http://tracker.coreboot.org/trac/coreboot/changeset/4099/trunk/coreboot-v2/s...
Is this what breaks your boot?
On Thu, Apr 23, 2009 at 8:14 PM, Stefan Reinauer stepan@coresystems.de wrote:
The only thing that is "active" without HAVE_ACPI_RESUME == 1 is this one:
http://tracker.coreboot.org/trac/coreboot/changeset/4099/trunk/coreboot-v2/s...
Is this what breaks your boot?
It doesn't break for me. I don't have the HP; Samuel said he'll be back to it on Monday. I was hoping someone else had seen this or had an idea. Maybe Joseph can try it without that change and see if it fixes his problem.
Myles
On Thu, Apr 23, 2009 at 10:24 PM, Myles Watson mylesgw@gmail.com wrote:
It doesn't break for me. I don't have the HP; Samuel said he'll be back to it on Monday. I was hoping someone else had seen this or had an idea. Maybe Joseph can try it without that change and see if it fixes his problem.
Except that Joseph doesn't have an amd board. I wish I knew better what was going wrong.
Myles
On Thu, 23 Apr 2009 22:26:22 -0600, Myles Watson mylesgw@gmail.com wrote:
Except that Joseph doesn't have an amd board. I wish I knew better what was going wrong.
Yeah, I think I may be having a different issue.
Hi all,
There is one change: the post-CAR code no longer clears 0 to TOPK; it clears only _RAMBASE to TOPK. Maybe there is something else wrong.
Also, the MCE may be just garbage. Check whether the registers are clear in coreboot.
Rudolf