[coreboot] Hardware diagnostic

Sun Oct 28 00:25:11 CEST 2018

yanvasilij yan wrote:
> I have two Intel Atom E3800 based boards. The first one is the older version,
> which works properly,
..
> The second one is modernized version, we added 89HPES5T5 PCIe I/O
> expansion switch. And WGI210IT based Ethernet ports are connected
> to switch.

So the good news is that your log clearly shows the PCIe switch
working correctly and both NICs behind it reachable by software.

> Further we rebuild a bit a power up circuit of the E3800 SOC.

This is quite possibly the root cause, but I wouldn't exclude any
other possibilities.

> Launching stops when starts payload loading.  The launch log of this board
> see in attached “not_working_board_log.txt”.

The log is very clear about why: (down near the end)

--8<--
SELF segment doesn't target RAM: 0x00800000, 4259840 bytes
-->8--

Looking at the coreboot table a little further up, we see:
--8<--
Writing coreboot table at 0x3add3000
 0. 0000000000000000-0000000000000fff: CONFIGURATION TABLES
 1. 0000000000e00000-0000000000e39fff: RAMSTAGE
 2. 000000003ad9e000-000000003adfffff: CONFIGURATION TABLES
 3. 00000000feb00000-00000000fec00fff: RESERVED
 4. 00000000fed01000-00000000fed01fff: RESERVED
 5. 00000000fed03000-00000000fed03fff: RESERVED
 6. 00000000fed05000-00000000fed05fff: RESERVED
 7. 00000000fed08000-00000000fed08fff: RESERVED
 8. 00000000fed0c000-00000000fed0ffff: RESERVED
 9. 00000000fed1c000-00000000fed1cfff: RESERVED
10. 00000000fef00000-00000000feffffff: RESERVED
-->8--

Compare that with your working board:
--8<--
Writing coreboot table at 0x3add3000
 0. 0000000000000000-0000000000000fff: CONFIGURATION TABLES
 1. 0000000000001000-000000000009ffff: RAM
 2. 00000000000a0000-00000000000fffff: RESERVED
 3. 0000000000100000-0000000000dfffff: RAM
 4. 0000000000e00000-0000000000e39fff: RAMSTAGE
 5. 0000000000e3a000-000000003ad9dfff: RAM
 6. 000000003ad9e000-000000003adfffff: CONFIGURATION TABLES
 7. 000000003ae00000-000000003fffffff: RESERVED
 8. 00000000e0000000-00000000efffffff: RESERVED
 9. 00000000feb00000-00000000fec00fff: RESERVED
10. 00000000fed01000-00000000fed01fff: RESERVED
11. 00000000fed03000-00000000fed03fff: RESERVED
12. 00000000fed05000-00000000fed05fff: RESERVED
13. 00000000fed08000-00000000fed08fff: RESERVED
14. 00000000fed0c000-00000000fed0ffff: RESERVED
15. 00000000fed1c000-00000000fed1cfff: RESERVED
16. 00000000fee00000-00000000fee00fff: RESERVED
17. 00000000fef00000-00000000feffffff: RESERVED
-->8--

The new board ends up with no RAM regions in the coreboot table.

That results in the payload loader not finding RAM where the payload is
to be loaded, so boot stops.
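
To make that check concrete, here is a minimal sketch of the kind of
test that fails. This is not coreboot's actual loader code; the struct,
the arrays and segment_targets_ram() are made-up names for illustration,
and the tables are only excerpts of the two logs above.

```c
#include <stddef.h>
#include <stdint.h>

/* One memory map entry, with an inclusive end address as in the log. */
struct mem_range {
	uint64_t start;
	uint64_t end;	/* inclusive */
	int is_ram;	/* 1 for "RAM", 0 for every other tag */
};

/* Excerpt of the working board's table (entries 0 to 5). */
const struct mem_range working_map[] = {
	{ 0x00000000, 0x00000fff, 0 },	/* CONFIGURATION TABLES */
	{ 0x00001000, 0x0009ffff, 1 },	/* RAM */
	{ 0x000a0000, 0x000fffff, 0 },	/* RESERVED */
	{ 0x00100000, 0x00dfffff, 1 },	/* RAM */
	{ 0x00e00000, 0x00e39fff, 0 },	/* RAMSTAGE */
	{ 0x00e3a000, 0x3ad9dfff, 1 },	/* RAM */
};

/* Excerpt of the new board's table (entries 0 to 2): no RAM at all. */
const struct mem_range broken_map[] = {
	{ 0x00000000, 0x00000fff, 0 },	/* CONFIGURATION TABLES */
	{ 0x00e00000, 0x00e39fff, 0 },	/* RAMSTAGE */
	{ 0x3ad9e000, 0x3adfffff, 0 },	/* CONFIGURATION TABLES */
};

/* Return 1 if [dest, dest + len) lies entirely inside one RAM entry. */
int segment_targets_ram(const struct mem_range *map, size_t n,
			uint64_t dest, uint64_t len)
{
	uint64_t last;
	size_t i;

	if (len == 0)
		return 0;
	last = dest + len - 1;
	for (i = 0; i < n; i++) {
		if (map[i].is_ram && dest >= map[i].start &&
		    last <= map[i].end)
			return 1;
	}
	return 0;
}
```

The segment from the log (0x00800000, 4259840 bytes) ends at 0x00c0ffff.
On the working board that fits inside the RAM entry
0x00100000-0x00dfffff; on the new board no entry is tagged RAM, so the
check can never succeed.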

Why are there no RAM regions? I don't know.

Looking near the beginning of the log about FSP memory init:
--8<--
Memory Down Data Existed : Enabled                                              
- Speed (0: 800, 1: 1066, 2: 1333, 3: 1600): 1                                  
- Type  (0: DDR3, 1: DDR3L) : 1                                                 
- DIMM0        : Enabled                                                        
- DIMM1        : Disabled                                                       
- Width        : x16                                                            
- Density      : 2Gbit                                                          
- BudWidth     : 64bit                                                          
- Rank #       : 1                                                              
- tCL          : 0B                                                             
- tRPtRCD      : 0B                                                             
- tWR          : 0C                                                             
- tWTR         : 06                                                             
- tRRD         : 06                                                             
- tRTP         : 06                                                             
- tFAW         : 14                                                             
Using 1066 MHz DDR3 settings.                                                   
1 GB Minnowboard Max detected.                                                  
romstage_main_continue status: 0  hob_list_ptr: 3ae20000                        
FSP Status: 0x0                                                                 
PM1_STS = 0x1 PM1_CNT = 0x0 GEN_PMCON1 = 0x1001808                              
romstage_main_continue: prev_sleep_state = S0                                   
Baytrail Chip Variant: Bay Trail-I (ISG/embedded)                               
MRC v0.102                                                                      
1 channels of DDR3 @ 1066MHz                                                    
-->8--

It appears OK - but do check that those numbers actually match the
DRAM chips assembled on the board. Are DRAM parts identical between
old and new?
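
One quick sanity check on those numbers is the total capacity. The
sketch below assumes the usual relation (chips per rank = bus width /
device width, capacity = chips x ranks x density); the function name is
made up for illustration and is not part of FSP or coreboot.

```c
#include <stdint.h>

/*
 * Total capacity in megabytes for a memory-down configuration.
 * Assumes every chip on the bus has the same width and density.
 */
uint32_t dram_capacity_mbytes(uint32_t bus_width_bits,
			      uint32_t device_width_bits,
			      uint32_t density_mbit, uint32_t ranks)
{
	uint32_t chips_per_rank;

	if (device_width_bits == 0)
		return 0;
	chips_per_rank = bus_width_bits / device_width_bits;
	/* Density is given in megabits, so divide by 8 for megabytes. */
	return chips_per_rank * ranks * density_mbit / 8;
}
```

With the values from the log (64bit bus, x16 parts, 2Gbit density,
1 rank) this gives 4 chips and 1024 MB, which agrees with the
"1 GB Minnowboard Max detected" line. If the new board carries a
different number or type of DRAM chips, the result will not match.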

Were there *any* hardware changes between SoC and RAM?

That's worth checking, but..

> nico_h in the IRC chat noticed that in non-working board appears a strange
> device with vid/did PCI: 00:00.0 [8086/0000].

The 0000 is a HUGE red flag, screaming to be thoroughly investigated.

This also hints that the power up changes may be the problem.

It's VERY unlikely that Intel has suddenly released a variant of this
particular SoC with PCI DID=0000 when it used to be DID=0f00. In fact
it's really unlikely that 0000 would be used in correct operation at all.
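
For reference, the vendor ID and device ID share the first dword of PCI
configuration space: the vendor ID is the low 16 bits (offset 0x00) and
the device ID the high 16 bits (offset 0x02). A small sketch of the
split, with made-up names that are not coreboot API:

```c
#include <stdint.h>

/* The two ID fields held in the first config space dword. */
struct pci_ids {
	uint16_t vendor;	/* offset 0x00 */
	uint16_t device;	/* offset 0x02 */
};

/* Split a 32-bit read of config offset 0x00 into vendor and device ID. */
struct pci_ids split_pci_id_dword(uint32_t dword)
{
	struct pci_ids ids;

	ids.vendor = (uint16_t)(dword & 0xffff);
	ids.device = (uint16_t)(dword >> 16);
	return ids;
}
```

A healthy SoC would be expected to return 0x0f008086 for that dword,
i.e. [8086/0f00], while the non-working board's [8086/0000] corresponds
to 0x00008086.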

Very likely, on the other hand, is that the SoC isn't being powered on
correctly and ends up in some half-initialized state, with the memory
controller not working. Some part of coreboot seems to notice (hence no
RAM regions in the coreboot table), but clearly that isn't causing a
fatal error, which I think is a bug. Oh well.

If you go through every single powerup hardware change together with
a hardware engineer, starting with the previous circuit and manually
applying one change at a time, maybe you can find one or even more
changes causing that device ID symptom. It depends on how many changes
you have there, but with a good hardware engineer you could perhaps go
through them all in a couple days, which would be really fast results
for a problem like this.

Maybe even simpler, hack this into some early part of the code, maybe
even console_init() works, if pci_early is available there.

while (1) {
  /* The device ID is the 16-bit register at config offset 0x02. */
  u16 did = pci_read_config16(PCI_DEV(0, 0, 0), PCI_DEVICE_ID);
  if (did == 0x0f00)
    printk(BIOS_INFO, "PASS\n");
  else
    printk(BIOS_INFO, "FAIL: want 0f00 is %04x\n", did);
}

Then hardware engineering can do analysis on their own. But make sure
to confirm that your test is reliable, using the hardware you have
(old and new) before you give a flash image to them.

Oh, and test on multiple new boards; a single unit from a new batch
isn't representative. With a new PCB, the process potentially has to be
tuned. I don't know how early in bringup you are.

Good luck and have fun! :)

//Peter