Hello,
It makes me wonder why CAR on the APs uses the same stack. How can this work? I thought the CPUs somehow keep their caches coherent between them. I see that the Fam10h CAR code allocates 1 KB for each AP, but the pre-Fam10h code does not.
How can this work?
The rationale for the question is to have some kind of mutex for serial console printouts and for network-over-console. The secret plan is to print output from different CPUs on different UDP ports ;)
The second reason is that we really need some inter-CPU mutex for PCI access.
Is it correct that, in principle, all CPUs share the cache contents once past the CAR stage?
Thanks, Rudolf
On 6/6/10 5:01 PM, Rudolf Marek wrote:
> Hello,
> It makes me wonder why CAR on the APs uses the same stack.
Maybe it is not coherent? Or maybe it doesn't really work?
> I see that the Fam10h CAR code allocates 1 KB for each AP, but the pre-Fam10h code does not.
> How can this work?
> The rationale for the question is to have some kind of mutex for serial console printouts and for network-over-console. The secret plan is to print output from different CPUs on different UDP ports ;)
> The second reason is that we really need some inter-CPU mutex for PCI access.
Or we need to stop doing PCI accesses and console output in SMP before we have RAM.
I somehow can't imagine that the speed improvement we get from doing concurrent PCI config space accesses to some 100 registers really makes up for the trouble we get ourselves into by trying to wildly spread configuration tasks among CPUs (and doing it differently on basically every mainboard that uses K8 or K10).
Stefan
I think it is mostly because memory init is done by the APs. Is this true for some boards?
Thanks, Rudolf
On 6/6/10 5:42 PM, Rudolf Marek wrote:
> I think it is mostly because memory init is done by the APs. Is this true for some boards?
Afaik it's "ECC clearing" which is implemented several times in the tree, including stage2. It needs no PCI access nor console output, though... and parallelizing the burden of PCI config space writes is not where the speedup lives.
Stefan
> Afaik it's "ECC clearing" which is implemented several times in the tree, including stage2.
Nope, the APs can init the memory controller too. Check CONFIG_MEM_TRAIN_SEQ:
0 = BSP only
1 = train_ram_on_node is called from init_cpus
2 = dunno - looks like it is also done in parallel, but I could not find how it works
Lots of boards set it to 2; I think only one sets it to 1.
Thanks, Rudolf
On 6/6/10 6:29 PM, Rudolf Marek wrote:
> Afaik it's "ECC clearing" which is implemented several times in the tree, including stage2.
> Nope, the APs can init the memory controller too. Check CONFIG_MEM_TRAIN_SEQ:
> 0 = BSP only
> 1 = train_ram_on_node is called from init_cpus
> 2 = dunno - looks like it is also done in parallel, but I could not find how it works
> Lots of boards set it to 2; I think only one sets it to 1.
> I think 2 is for calling it from CAR..
> Still wondering how much time we save by parallelizing this... Did anyone take a measurement?
I've talked to Marc Jones about this several times over the years. He can confirm my memory. There is almost no win to parallelizing any of the memory or PCI bus setup. Yes, it's supported in the code, kind of, for some platforms, and maybe it works on some of them, but it's not worth it and it really complicates things.
What is worth it, and we've measured this, is ECC scrubbing. We should focus on that.
So the boot path: the BSP does all the device tree and DRAM setup, and sets up stacks and boot code for the APs.
The APs are woken up and do what they are told, which in many cases is to set themselves up and do ECC scrubbing.
In other words, Stefan is right (again :-)
ron