Hello everyone,
I'm porting coreboot v4 to a K8/RS780/SB710-based mainboard, using the amd/mahogany and amd/tilapia_fam10 code as references. coreboot now boots the board and FILO loads Linux, but the board crashes with an MCE during the boot process. I don't know much about the details of MCEs, so any suggestions would be appreciated. Thanks very much.
The mainboard architecture:
  CPU: socket F Opteron 2210 EE, get_cpu_rev EAX=0x40f13 (1 CPU, dual core)
  DIMM: DDR2 333M (x1 / x2)
  HT Link0: off
  HT Link1: RS780->SB710
  HT Link2: off
  VGA off
  GFX off
  PCIE off
coreboot code revision: r5692 with local modifications
The MCE/panic message:
HARDWARE ERROR
CPU 0: Machine Check Exception: 4 Bank 0: f658a00000000833
TSC 572507f34 ADDR 6000
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
------------[ cut here ]------------
WARNING: at kernel/smp.c:331 smp_call_function_mask+0x32/0x1ec()
Modules linked in:
Supported: Yes
Pid: 1, comm: swapper Tainted: G M 2.6.27.19-5-default #1

Call Trace:
 [<ffffffff8020d9f9>] show_trace_log_lvl+0x41/0x58
 [<ffffffff80496a74>] dump_stack+0x69/0x6f
 [<ffffffff8023bfba>] warn_on_slowpath+0x51/0x77
 [<ffffffff8025b1c5>] smp_call_function_mask+0x32/0x1ec
 [<ffffffff8025b3a8>] smp_call_function+0x29/0x2e
 [<ffffffff8021a04a>] native_smp_send_stop+0x1a/0x26
 [<ffffffff80496b36>] panic+0xbc/0x169
 [<ffffffff80216366>] mce_log+0x0/0x7e
 [<ffffffff80216740>] do_machine_check+0x31e/0x3cd
 [<ffffffff8020d27f>] machine_check+0x7f/0x90
 [<ffffffff802126c8>] setup_trampoline+0x20/0x30
 [<ffffffff804919a5>] native_cpu_up+0x31e/0xc64
 [<ffffffff80493d17>] _cpu_up+0x9a/0x11c
 [<ffffffff80493df4>] cpu_up+0x5b/0x6f
 [<ffffffff8095b708>] kernel_init+0xe1/0x1eb
 [<ffffffff8020cf49>] child_rip+0xa/0x11

---[ end trace 4eaa2a86a8e2da22 ]---
Output of mcelog --k8 --ascii:

HARDWARE ERROR
CPU 0: Machine Check Exception: 4 Bank 0: f658a00000000833
TSC 572507f34 ADDR 6000
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
HARDWARE ERROR
CPU 0 0 data cache TSC 572507f34
  Data cache ECC error (syndrome b1)
       bit45 = uncorrected ecc error
       bit57 = processor context corrupt
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      data read mem transaction
      memory access, level generic'
STATUS f658a00000000833 MCGSTATUS 4
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
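For anyone reading the STATUS value by hand, here is a minimal sketch of how f658a00000000833 breaks down into the fields mcelog prints. It assumes the usual x86 machine-check layout for bits 57-63, and takes the K8-specific pieces (bit 45/46 ECC flags, syndrome in bits 54:47, bus error code in the low 16 bits) from the mcelog output above. The function and table names are made up for illustration and are not part of mcelog or coreboot.

```python
# Sketch: break down the bank 0 MCi_STATUS value reported in the log.
STATUS = 0xF658A00000000833

# Bits 57-63 are the architectural machine-check flags. Bits 45 and 46 are
# the K8 ECC flags as named in the mcelog --k8 output (assumed layout).
FLAG_BITS = {
    63: "VAL: error valid",
    62: "OVER: error overflow (multiple errors)",
    61: "UC: error uncorrected",
    60: "EN: error reporting enabled",
    59: "MISCV: MISC register valid",
    58: "ADDRV: ADDR register valid",
    57: "PCC: processor context corrupt",
    46: "CECC: corrected ECC error",
    45: "UECC: uncorrected ECC error",
}

# Field tables for the bus error code format 0000 1PPT RRRR IILL.
BUS_PP = ["local node origin", "local node response", "observed", "generic"]
BUS_RRRR = ["generic", "generic read", "generic write", "data read",
            "data write", "instruction fetch", "prefetch", "evict", "snoop"]
BUS_II = ["memory access", "reserved", "I/O access", "generic"]
BUS_LL = ["level 0", "level 1", "level 2", "generic"]


def decode_flags(status):
    """Return the names of all known flag bits set in status, high bit first."""
    return [name for bit, name in sorted(FLAG_BITS.items(), reverse=True)
            if (status >> bit) & 1]


def ecc_syndrome(status):
    """Extract the 8-bit ECC syndrome from bits 54:47 (K8 layout, assumed)."""
    return (status >> 47) & 0xFF


def decode_bus_error(code):
    """Decode a 16-bit error code if it uses the bus error format, else None."""
    if code & 0xF800 != 0x0800:
        return None
    rrrr = (code >> 4) & 0xF
    return {
        "participation": BUS_PP[(code >> 9) & 0x3],
        "timeout": bool((code >> 8) & 0x1),
        "request": BUS_RRRR[rrrr] if rrrr < len(BUS_RRRR) else "reserved",
        "target": BUS_II[(code >> 2) & 0x3],
        "level": BUS_LL[code & 0x3],
    }


if __name__ == "__main__":
    for flag in decode_flags(STATUS):
        print(flag)
    print("syndrome: %02x" % ecc_syndrome(STATUS))
    print(decode_bus_error(STATUS & 0xFFFF))
```

Run against the value from the log, this gives syndrome b1 and a bus error with local node origin, no timeout, data read, memory access, level generic, which matches the mcelog decode.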
Attached is the detailed boot message.
Hi Liu,
On Mon, Sep 13, 2010 at 8:49 PM, Liu Tao liutao1980@gmail.com wrote:
> coreboot now boots the board and FILO loads Linux, but the board crashes with an MCE during the boot process.
I haven't worked with K8 in a while, but it seems like this could be a real CPU problem. Do you have another CPU to test with? The other possibility is that a workaround for one of your CPU's errata is missing. You could review the AMD K8 revision guide for cache and MCA/MCE issues. Please let us know what you find.
Marc
On 9/16/10, Marc Jones marcj303@gmail.com wrote:
> Do you have another CPU to test with? The other possibility is that a workaround for one of your CPU's errata is missing.
Hi Marc,
I just updated coreboot from r5692 to r5814 and the problem is resolved. Before updating the code I had seen three types of MCE errors:
  Data cache ECC error
  L2 cache ECC error
  Northbridge chipkill ECC error

I have not taken a detailed look at the code differences yet, but maybe it is related to the recent discussion about the AMD cache being broken?
-- Regards, Liu Tao
On Thu, Sep 16, 2010 at 12:53 AM, Liu Tao liutao1980@gmail.com wrote:
> I just updated coreboot from r5692 to r5814 and the problem is resolved.
> I have not taken a detailed look at the code differences yet, but maybe it is related to the recent discussion about the AMD cache being broken?
Glad that it is working now. It is probably not the CAR stuff; that was for fam10.
Marc