Hi Marc,
I'm currently working to unify K8 and Fam10 CAR to use the same code at runtime (as opposed to buildtime #ifdefs). While this may not be a goal for v2, I definitely want to try it for v3.
A few questions/comments about the CAR code:
- Only Fam10 APs are treated specially. APs of older generations seem to be unhandled. Did older generations treat each core as BSP (code seems to suggest that) or were there other special provisions?
- "Errata 193: Disable clean copybacks to L3 cache to allow cached ROM." Erratum 193 seems to be unlisted in public data sheets. If it is the famous L3 problem, we might want to enable the workaround only on affected revisions.
- CAR goes from 0xC8000 to 0xCFFFF. Assuming GlobalVarSize=0 (untrue, but easier to calculate), BSP stack will be from 0xCC000 to 0xCFFFF and AP stacks will be below 0xCBFFF.
  * With the current settings (32k CAR total, 1k per AP, 16K for the BSP) the scheme will fall apart if the highest NodeID shifted by the number of CoreID bits is 16 or higher. The BKDG indicates that the number of CoreID bits is 2, so a NodeID of 4 or higher will break.
  * There is no good place to store the printk() buffer in CAR. On Geode and i586, the printk buffer runs from the lowest address of the CAR area to the middle. Keeping that design will result in the AP stacks colliding with the printk buffer. Limiting the size of the printk buffer dynamically would work unless there are more than 15 cores in the system, where even a printk buffer of zero size would clobber one AP stack. The other alternative is to keep the printk buffer size fixed and let the AP stacks eat into BSP stack space.
- Is there any reason on any K8 or later processor supported by the current CAR code not to use 64k CAR?
- Is 1k enough stack for the APs, given some stack-heavy functions in v3?
- Can the K8 processors work reliably with 0x1e1e1e1e settings in the fixed MTRR or can the Fam10 processors work with 0x06060606?
Regards, Carl-Daniel
On Wed, Aug 6, 2008 at 6:18 AM, Carl-Daniel Hailfinger c-d.hailfinger.devel.2006@gmx.net wrote:
Hi Marc,
I'm currently working to unify K8 and Fam10 CAR to use the same code at runtime (as opposed to buildtime #ifdefs). While this may not be a goal for v2, I definitely want to try it for v3.
just be careful. It is a lot of work for a gain I am not sure I understand. Once you have CAR working for a given CPU, it seems to me we cast it in stone and leave it forever. I'd like to better understand the value of doing this.
thanks
ron
On Wed, Aug 06, 2008 at 07:09:14AM -0700, ron minnich wrote:
unify K8 and Fam10 CAR to use the same code at runtime
I'd like to better understand the value of doing this.
At least the m57sli and serengeti_cheetah boards can run with either k8 or fam10 processors. It'd be nice to have a single coreboot to rule them all.
//Peter
On Wed, Aug 6, 2008 at 7:30 AM, Peter Stuge peter@stuge.se wrote:
At least the m57sli and serengeti_cheetah boards can run with either k8 or fam10 processors. It'd be nice to have a single coreboot to rule them all.
that's an excellent reason, so I'm now on board.
ron
On 06.08.2008 16:43, ron minnich wrote:
On Wed, Aug 6, 2008 at 7:30 AM, Peter Stuge peter@stuge.se wrote:
At least the m57sli and serengeti_cheetah boards can run with either k8 or fam10 processors. It'd be nice to have a single coreboot to rule them all.
that's an excellent reason, so I'm now on board.
Peter eloquently expressed my motivation. I have nothing more to add.
Regards, Carl-Daniel
oh yeah. Be aware that on v3 we return from car to the main code, so the stack has to be preserved. We may want to move the CAR area to a real memory area as we did on V3. But is this possible?
We also need to get disable_car() working.
ron
On 06.08.2008 16:13, ron minnich wrote:
oh yeah. Be aware that on v3 we return from car to the main code, so the stack has to be preserved. We may want to move the CAR area to a real memory area as we did on V3. But is this possible?
In theory, it should work. In practice, at least some SimNow version choked on the patches I had which moved the CAR area. I shall dig them up again and have someone check them on real hardware. Of course it is possible that my patches were faulty. To be honest, I have the feeling I forgot to change some other code which references the old location.
We also need to get disable_car() working.
Yes, definitely.
Hmmm... found the patch. Attached.
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
On 06.08.2008 16:13, ron minnich wrote:
oh yeah. Be aware that on v3 we return from car to the main code, so the stack has to be preserved. We may want to move the CAR area to a real memory area as we did on V3. But is this possible?
In theory, it should work. In practice, at least some SimNow version choked on the patches I had which moved the CAR area. I shall dig them up again and have someone check them on real hardware. Of course it is possible that my patches were faulty. To be honest, I have the feeling I forgot to change some other code which references the old location.
Yes, DCACHE_RAM_BASE is referenced when the sys_info structure is copied from cache to ram, and a couple of other places. So you need to also fix up the mainboard Option.lb
Carl-Daniel Hailfinger wrote:
Hi Marc,
I'm currently working to unify K8 and Fam10 CAR to use the same code at runtime (as opposed to buildtime #ifdefs). While this may not be a goal for v2, I definitely want to try it for v3.
A few questions/comments about the CAR code:
- Only Fam10 APs are treated specially. APs of older generations seem to
be unhandled. Did older generations treat each core as BSP (code seems to suggest that) or were there other special provisions?
I don't know. I haven't used or worked on that code. YH would be the better person to ask. For the fam10 code there are some settings that can only be set from the AP cores.
- "Errata 193: Disable clean copybacks to L3 cache to allow cached ROM."
Erratum 193 seems to be unlisted in public data sheets. If it is the famous L3 problem, we might want to enable the workaround only on affected revisions.
This is an erratum for early silicon, which is why it isn't in the public rev guide. It is a fix for caching instructions while in CAR mode. It can be removed. All Ax support could be removed.
- CAR goes from 0xC8000 to 0xCFFFF. Assuming GlobalVarSize=0 (untrue,
but easier to calculate), BSP stack will be from 0xCC000 to 0xCFFFF and AP stacks will be below 0xCBFFF.
- With the current settings (32k CAR total, 1k per AP, 16K for the BSP)
the scheme will fall apart if the highest NodeID shifted by the number of CoreID bits is 16 or higher. The BKDG indicates that the number of CoreID bits is 2, so a NodeID of 4 or higher will break.
Yes. This was sufficient for the K8 and was not changed when I added fam10. 8 dual core K8 was the most you could have. It could probably be expanded into the rest of the shadow hole (up to FFFFF) if needed. The reason to keep it in the hole is for memory eye finding that will happen from 1MB to TOM.
- There is no good place to store the printk() buffer in CAR. On Geode
and i586, the printk buffer runs from the lowest address of the CAR area to the middle. Keeping that design will result in the AP stacks colliding with the printk buffer. Limiting the size of the printk buffer dynamically would work unless there are more than 15 cores in the system, where even a printk buffer of zero size would clobber one AP stack. The other alternative is to keep the printk buffer size fixed and let the AP stacks eat into BSP stack space.
This was the problem I mentioned when you were doing the printk() buffer. You are not guaranteed the use of the cache.
- Is there any reason on any K8 or later processor supported by the
current CAR code not to use 64k CAR?
To leave room for APs? There may have been some concern about small cache versions being introduced?
- Is 1k enough stack for the APs, given some stack-heavy functions in v3?
I don't know for sure but I would expect it to be ok.
- Can the K8 processors work reliably with 0x1e1e1e1e settings in the
fixed MTRR or can the Fam10 processors work with 0x06060606?
No.
Marc
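The stack-layout arithmetic from the question above can be checked with a small C sketch. It is only an illustration under stated assumptions: the constants are the numbers quoted in this thread (CAR from 0xC8000 to 0xCFFFF, 16k BSP stack, 1k per AP, GlobalVarSize=0), while the slot formula (AP stacks placed downward from the bottom of the BSP stack, indexed by NodeID shifted left by the number of CoreID bits plus CoreID) and all names are hypothetical and not taken from the actual coreboot code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Numbers quoted in the thread; GlobalVarSize is assumed to be 0. */
#define CAR_BASE        0xC8000u   /* lowest CAR address */
#define CAR_SIZE        0x8000u    /* 32k CAR total */
#define BSP_STACK_SIZE  0x4000u    /* 16k for the BSP */
#define AP_STACK_SIZE   0x400u     /* 1k per AP */

/* Number of 1k AP slots left below the BSP stack (16 with these values). */
#define AP_SLOTS ((CAR_SIZE - BSP_STACK_SIZE) / AP_STACK_SIZE)

/* Hypothetical slot index: NodeID shifted by the CoreID bits, plus CoreID. */
static unsigned int ap_slot_index(unsigned int node_id, unsigned int core_id,
                                  unsigned int core_id_bits)
{
	assert(core_id < (1u << core_id_bits));
	return (node_id << core_id_bits) | core_id;
}

/* True if the whole 1k slot still lies inside the CAR area. */
static bool ap_stack_fits(unsigned int slot)
{
	return slot < AP_SLOTS;
}

/* Lowest address of a slot, assuming slots are stacked downward from
 * the bottom of the BSP stack (0xCC000 with the values above). */
static uint32_t ap_stack_bottom(unsigned int slot)
{
	uint32_t bsp_stack_bottom = CAR_BASE + CAR_SIZE - BSP_STACK_SIZE;

	assert(ap_stack_fits(slot));
	return bsp_stack_bottom - (slot + 1) * AP_STACK_SIZE;
}
```

With two CoreID bits, node 3 core 3 maps to slot 15, the last slot that fits; node 4 core 0 maps to slot 16, which would start below 0xC8000. That matches the limit described in the question.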
On 06.08.2008 18:54, Marc Jones wrote:
Carl-Daniel Hailfinger wrote:
- "Errata 193: Disable clean copybacks to L3 cache to allow cached ROM."
Erratum 193 seems to be unlisted in public data sheets. If it is the famous L3 problem, we might want to enable the workaround only on affected revisions.
This is an erratum for early silicon, which is why it isn't in the public rev guide. It is a fix for caching instructions while in CAR mode. It can be removed. All Ax support could be removed.
Does that removal suggestion really mean all Ax support including the newly committed Ax microcode updates and the (older) Ax memory controller stuff?
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
On 06.08.2008 18:54, Marc Jones wrote:
Carl-Daniel Hailfinger wrote:
- "Errata 193: Disable clean copybacks to L3 cache to allow cached ROM."
Erratum 193 seems to be unlisted in public data sheets. If it is the famous L3 problem, we might want to enable the workaround only on affected revisions.
This is an erratum for early silicon, which is why it isn't in the public rev guide. It is a fix for caching instructions while in CAR mode. It can be removed. All Ax support could be removed.
Does that removal suggestion really mean all Ax support including the newly committed Ax microcode updates and the (older) Ax memory controller stuff?
Yes, it is there for completeness. I don't think it needs to be carried forward to v3.
Marc
On 08.08.2008 17:18, Marc Jones wrote:
Carl-Daniel Hailfinger wrote:
On 06.08.2008 18:54, Marc Jones wrote:
Carl-Daniel Hailfinger wrote:
- "Errata 193: Disable clean copybacks to L3 cache to allow cached
ROM." Erratum 193 seems to be unlisted in public data sheets. If it is the famous L3 problem, we might want to enable the workaround only on affected revisions.
This is an erratum for early silicon, which is why it isn't in the public rev guide. It is a fix for caching instructions while in CAR mode. It can be removed. All Ax support could be removed.
Does that removal suggestion really mean all Ax support including the newly committed Ax microcode updates and the (older) Ax memory controller stuff?
Yes, it is there for completeness. I don't think it needs to be carried forward to v3.
Thanks for the info.
Remove Family 10h revision Ax support from v3 CAR code.
Signed-off-by: Carl-Daniel Hailfinger c-d.hailfinger.devel.2006@gmx.net
Index: corebootv3-CAR_no_10h_rev_Ax/arch/x86/amd/stage0.S
===================================================================
--- corebootv3-CAR_no_10h_rev_Ax/arch/x86/amd/stage0.S	(Revision 724)
+++ corebootv3-CAR_no_10h_rev_Ax/arch/x86/amd/stage0.S	(Arbeitskopie)
@@ -212,15 +212,6 @@
 #endif
 
-#ifdef CONFIG_CPU_AMD_K10
-	/* Errata 193: Disable clean copybacks to L3 cache to allow cached ROM.
-	   Re-enable it in after RAM is initialized and before CAR is disabled */
-	movl	$0xc001102a, %ecx
-	rdmsr
-	bts	$15, %eax
-	wrmsr
-#endif
-
 	/* Set MtrrFixDramModEn for clear fixed mtrr */
 enable_fixed_mtrr_dram_modify:
 	movl	$SYSCFG_MSR, %ecx
On Sat, Aug 09, 2008 at 02:25:02PM +0200, Carl-Daniel Hailfinger wrote:
Remove Family 10h revision Ax support from v3 CAR code.
Signed-off-by: Carl-Daniel Hailfinger c-d.hailfinger.devel.2006@gmx.net
Acked-by: Peter Stuge peter@stuge.se
Index: corebootv3-CAR_no_10h_rev_Ax/arch/x86/amd/stage0.S
--- corebootv3-CAR_no_10h_rev_Ax/arch/x86/amd/stage0.S	(Revision 724)
+++ corebootv3-CAR_no_10h_rev_Ax/arch/x86/amd/stage0.S	(Arbeitskopie)
@@ -212,15 +212,6 @@
 #endif
 
-#ifdef CONFIG_CPU_AMD_K10
-	/* Errata 193: Disable clean copybacks to L3 cache to allow cached ROM.
-	   Re-enable it in after RAM is initialized and before CAR is disabled */
-	movl	$0xc001102a, %ecx
-	rdmsr
-	bts	$15, %eax
-	wrmsr
-#endif
-
 	/* Set MtrrFixDramModEn for clear fixed mtrr */
 enable_fixed_mtrr_dram_modify:
 	movl	$SYSCFG_MSR, %ecx
On 09.08.2008 14:53, Peter Stuge wrote:
On Sat, Aug 09, 2008 at 02:25:02PM +0200, Carl-Daniel Hailfinger wrote:
Remove Family 10h revision Ax support from v3 CAR code.
Signed-off-by: Carl-Daniel Hailfinger c-d.hailfinger.devel.2006@gmx.net
Acked-by: Peter Stuge peter@stuge.se
Thanks, r725.
Regards, Carl-Daniel
Hi Yinghai,
can you please look at the problems below?
On 06.08.2008 18:54, Marc Jones wrote:
Carl-Daniel Hailfinger wrote:
I'm currently working to unify K8 and Fam10 CAR to use the same code at runtime (as opposed to buildtime #ifdefs). While this may not be a goal for v2, I definitely want to try it for v3.
A few questions/comments about the CAR code:
- Only Fam10 APs are treated specially. APs of older generations seem to
be unhandled. Did older generations treat each core as BSP (code seems to suggest that) or were there other special provisions?
I don't know. I haven't used or worked on that code. YH would be the better person to ask. For the fam10 code there are some settings that can only be set from the AP cores.
Older BKDGs indicate that we should treat all APs specially. That would be a missing feature in the old code.
- CAR goes from 0xC8000 to 0xCFFFF. Assuming GlobalVarSize=0 (untrue,
but easier to calculate), BSP stack will be from 0xCC000 to 0xCFFFF and AP stacks will be below 0xCBFFF.
- With the current settings (32k CAR total, 1k per AP, 16K for the BSP)
the scheme will fall apart if the highest NodeID shifted by the number of CoreID bits is 16 or higher. The BKDG indicates that the number of CoreID bits is 2, so a NodeID of 4 or higher will break.
Yes. This was sufficient for the K8 and was not changed when I added fam10. 8 dual core K8 was the most you could have. It could probably be expanded into the rest of the shadow hole (up to FFFFF) if needed. The reason to keep it in the hole is for memory eye finding that will happen from 1MB to TOM.
We may need to revisit this for Family 10h. What's the maximum number of cores in one Family 10h system?
- There is no good place to store the printk() buffer in CAR. On Geode
and i586, the printk buffer runs from the lowest address of the CAR area to the middle. Keeping that design will result in the AP stacks colliding with the printk buffer. Limiting the size of the printk buffer dynamically would work unless there are more than 15 cores in the system, where even a printk buffer of zero size would clobber one AP stack. The other alternative is to keep the printk buffer size fixed and let the AP stacks eat into BSP stack space.
This was the problem I mentioned when you were doing the printk() buffer. You are not guaranteed the use of the cache.
I think I can fix that. If APs are only started after the BSP has initialized DRAM, there is no problem because the printk() buffer is relocated directly after DRAM init.
- Is there any reason on any K8 or later processor supported by the
current CAR code not to use 64k CAR?
To leave room for APs? There may have been some concern about small cache versions being introduced?
The various BKDGs state that L1 cache tag indexes of 00h - FFh are reserved for memory training and recommend to use exactly 48k CAR. I intend to follow that advice. By the way, the AP CAR areas in our code are inside the BSP CAR area.
- Is 1k enough stack for the APs, given some stack-heavy functions in
v3?
I don't know for sure but I would expect it to be ok.
- Can the K8 processors work reliably with 0x1e1e1e1e settings in the
fixed MTRR or can the Fam10 processors work with 0x06060606?
No.
OK, I will work on a generic code sequence for this problem. Does Family 11h need 0x1e or 0x06?
Regards, Carl-Daniel
On Sat, Aug 09, 2008 at 03:08:11PM +0200, Carl-Daniel Hailfinger wrote:
I think I can fix that. If APs are only started after the BSP has initialized DRAM, there is no problem because the printk() buffer is relocated directly after DRAM init.
Don't APs also need to initialize DRAM? Or can the BSP do this "remotely" ?
Is there one printk buffer per core or are all printk() calls always handled by the BSP and its associated buffer?
//Peter
On 09.08.2008 15:23, Peter Stuge wrote:
On Sat, Aug 09, 2008 at 03:08:11PM +0200, Carl-Daniel Hailfinger wrote:
I think I can fix that. If APs are only started after the BSP has initialized DRAM, there is no problem because the printk() buffer is relocated directly after DRAM init.
Don't APs also need to initialize DRAM? Or can the BSP do this "remotely" ?
Is there one printk buffer per core or are all printk() calls always handled by the BSP and its associated buffer?
My current design calls for a common printk buffer, maybe with access guarded by a lock. Basically, the same question applies to serial output. I'd prefer to have readable and reliable serial output, even if it means we have to use locking. After all, printk output is what we use for debugging, so it should be intelligible.
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
On 09.08.2008 15:23, Peter Stuge wrote:
On Sat, Aug 09, 2008 at 03:08:11PM +0200, Carl-Daniel Hailfinger wrote:
I think I can fix that. If APs are only started after the BSP has initialized DRAM, there is no problem because the printk() buffer is relocated directly after DRAM init.
Don't APs also need to initialize DRAM? Or can the BSP do this "remotely" ?
Is there one printk buffer per core or are all printk() calls always handled by the BSP and its associated buffer?
My current design calls for a common printk buffer, maybe with access guarded by a lock. Basically, the same question applies to serial output. I'd prefer to have readable and reliable serial output, even if it means we have to use locking. After all, printk output is what we use for debugging, so it should be intelligible.
A single buffer is the way to go. Locking is tricky pre-mem so it isn't done in v2. I guess your global variable work could be used for that.
Marc
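A single printk buffer guarded by a lock, as discussed above, might look roughly like the following C sketch. It uses a C11 atomic_flag as a spinlock purely for illustration; whether such locking behaves as needed before memory is initialized is exactly the open question raised in this thread. The buffer size and all names are hypothetical and not taken from coreboot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define PRINTK_BUF_SIZE 256u   /* hypothetical size */

struct printk_buf {
	atomic_flag lock;              /* spinlock shared by all cores */
	size_t len;                    /* bytes currently stored */
	char data[PRINTK_BUF_SIZE];
};

static struct printk_buf shared_buf = { .lock = ATOMIC_FLAG_INIT };

/* Append up to n bytes; returns how many bytes were actually stored. */
static size_t printk_buf_append(struct printk_buf *b, const char *s, size_t n)
{
	size_t i;

	assert(b != NULL && s != NULL);

	/* Spin until the lock is acquired so messages never interleave. */
	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;

	/* Copy until the message ends or the buffer is full. */
	for (i = 0; i < n && b->len < PRINTK_BUF_SIZE; i++)
		b->data[b->len++] = s[i];

	atomic_flag_clear_explicit(&b->lock, memory_order_release);
	return i;
}
```

The sketch simply drops bytes once the buffer is full. A real design would also have to decide where the lock lives so that every core can reach it, which is where the global variable work mentioned above would come in.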
On Sat, Aug 9, 2008 at 6:23 AM, Peter Stuge peter@stuge.se wrote:
Don't APs also need to initialize DRAM? Or can the BSP do this "remotely" ?
APs should do it.
I'm getting a supermicro mainboard with 4 sockets and 128 GB memory, and I want that done in parallel :-)
ron
Don't APs also need to initialize DRAM? Or can the BSP do this "remotely" ?
APs should do it.
I'm getting a supermicro mainboard with 4 sockets and 128 GB memory, and I want that done in parallel :-)
In principle, the BSP can initialise multiple memory controllers in parallel as well. Using multiple processors to do it is quite hard, esp. since you cannot do "normal" SMP stuff because memory isn't initialised yet; and you get all the "normal" SMP headaches as well obviously.
If the memory init needs to access resources that it cannot access remotely (like, MSRs on x86), you have no choice but to use APs, of course. This seems to be the case on FAM10h (but not on K8).
Segher
On 10.08.2008 23:36, Segher Boessenkool wrote:
Don't APs also need to initialize DRAM? Or can the BSP do this "remotely" ?
APs should do it.
I'm getting a supermicro mainboard with 4 sockets and 128 GB memory, and I want that done in parallel :-)
In principle, the BSP can initialise multiple memory controllers in parallel as well. Using multiple processors to do it is quite hard, esp. since you cannot do "normal" SMP stuff because memory isn't initialised yet; and you get all the "normal" SMP headaches as well obviously.
If the memory init needs to access resources that it cannot access remotely (like, MSRs on x86), you have no choice but to use APs, of course. This seems to be the case on FAM10h (but not on K8).
The AMD Family 0Fh BKDG says: "The BSP must perform the following tasks: [...] Memory controller initialization on all processor nodes" The AMD Family 10h BKDG is silent on this.
Regards, Carl-Daniel
Segher Boessenkool wrote:
Don't APs also need to initialize DRAM? Or can the BSP do this "remotely" ?
APs should do it.
I'm getting a supermicro mainboard with 4 sockets and 128 GB memory, and I want that done in parallel :-)
In principle, the BSP can initialise multiple memory controllers in parallel as well. Using multiple processors to do it is quite hard, esp. since you cannot do "normal" SMP stuff because memory isn't initialised yet; and you get all the "normal" SMP headaches as well obviously.
If the memory init needs to access resources that it cannot access remotely (like, MSRs on x86), you have no choice but to use APs, of course. This seems to be the case on FAM10h (but not on K8).
Segher
Couldn't have said it better myself. :)
We do some message passing with the APIC registers to indicate when the APs have finished their init. The init mostly consists of fid/vid setup and getting the APIC IDs set.
Marc
-- coreboot mailing list coreboot@coreboot.org http://www.coreboot.org/mailman/listinfo/coreboot
Carl-Daniel Hailfinger wrote:
Hi Yinghai,
can you please look at the problems below?
On 06.08.2008 18:54, Marc Jones wrote:
Carl-Daniel Hailfinger wrote:
I'm currently working to unify K8 and Fam10 CAR to use the same code at runtime (as opposed to buildtime #ifdefs). While this may not be a goal for v2, I definitely want to try it for v3.
A few questions/comments about the CAR code:
- Only Fam10 APs are treated specially. APs of older generations seem to
be unhandled. Did older generations treat each core as BSP (code seems to suggest that) or were there other special provisions?
I don't know. I haven't used or worked on that code. YH would be the better person to ask. For the fam10 code there are some settings that can only be set from the AP cores.
Older BKDGs indicate that we should treat all APs specially. That would be a missing feature in the old code.
- CAR goes from 0xC8000 to 0xCFFFF. Assuming GlobalVarSize=0 (untrue,
but easier to calculate), BSP stack will be from 0xCC000 to 0xCFFFF and AP stacks will be below 0xCBFFF.
- With the current settings (32k CAR total, 1k per AP, 16K for the BSP)
the scheme will fall apart if the highest NodeID shifted by the number of CoreID bits is 16 or higher. The BKDG indicates that the number of CoreID bits is 2, so a NodeID of 4 or higher will break.
Yes. This was sufficient for the K8 and was not changed when I added fam10. 8 dual core K8 was the most you could have. It could probably be expanded into the rest of the shadow hole (up to FFFFF) if needed. The reason to keep it in the hole is for memory eye finding that will happen from 1MB to TOM.
We may need to revisit this for Family 10h. What's the maximum number of cores in one Family 10h system?
8 quadcore cpus would be a large server setup.
- There is no good place to store the printk() buffer in CAR. On Geode
and i586, the printk buffer runs from the lowest address of the CAR area to the middle. Keeping that design will result in the AP stacks colliding with the printk buffer. Limiting the size of the printk buffer dynamically would work unless there are more than 15 cores in the system, where even a printk buffer of zero size would clobber one AP stack. The other alternative is to keep the printk buffer size fixed and let the AP stacks eat into BSP stack space.
This was the problem I mentioned when you were doing the printk() buffer. You are not guaranteed the use of the cache.
I think I can fix that. If APs are only started after the BSP has initialized DRAM, there is no problem because the printk() buffer is relocated directly after DRAM init.
APIC, fid/vid and other msr init happen before memory. See: wait_all_core0_started() and start_other_cores() for details
- Is there any reason on any K8 or later processor supported by the
current CAR code not to use 64k CAR?
To leave room for APs? There may have been some concern about small cache versions being introduced?
The various BKDGs state that L1 cache tag indexes of 00h - FFh are reserved for memory training and recommend to use exactly 48k CAR. I intend to follow that advice. By the way, the AP CAR areas in our code are inside the BSP CAR area.
I think that is ok. IIRC the APs use the BSP cache for their stacks. It also allows them to share the sysinfo struct.
- Is 1k enough stack for the APs, given some stack-heavy functions in
v3?
I don't know for sure but I would expect it to be ok.
- Can the K8 processors work reliably with 0x1e1e1e1e settings in the
fixed MTRR or can the Fam10 processors work with 0x06060606?
No.
OK, I will work on a generic code sequence for this problem. Does Family 11h need 0x1e or 0x06?
I think that Fam11 will have 0x1e cache type.
Marc
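A runtime choice between the two fixed-MTRR patterns, instead of build-time #ifdefs, could be sketched in C as follows. This is an illustration only: the family decoding follows the standard CPUID function 1 layout, the mapping of family 0Fh to 0x06060606 and family 10h to 0x1e1e1e1e comes from this thread, and treating family 11h like family 10h rests on the tentative answer above. The function names and the example CPUID signatures in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-MTRR patterns named in the thread. */
#define FIXED_MTRR_K8     0x06060606u
#define FIXED_MTRR_FAM10  0x1e1e1e1eu

/* Decode the family from CPUID function 1 EAX: the base family is
 * bits 11:8; the extended family (bits 27:20) is added only when
 * the base family is 0xF. */
static unsigned int cpu_family(uint32_t cpuid_eax)
{
	unsigned int base = (cpuid_eax >> 8) & 0xF;
	unsigned int ext = (cpuid_eax >> 20) & 0xFF;

	return base == 0xF ? base + ext : base;
}

/* Pick the fixed-MTRR pattern at runtime.
 * Assumption: family 0x0F (K8) takes 0x06, families 0x10 and later
 * take 0x1e. */
static uint32_t fixed_mtrr_pattern(uint32_t cpuid_eax)
{
	unsigned int family = cpu_family(cpuid_eax);

	assert(family >= 0x0F);   /* the sketch only covers K8 and later */
	return family == 0x0F ? FIXED_MTRR_K8 : FIXED_MTRR_FAM10;
}
```

For example, a signature of 0x00000F4A decodes to family 0Fh and selects 0x06060606, while 0x00100F22 decodes to family 10h and selects 0x1e1e1e1e. In stage0 itself the same decision would have to be made in assembly, before any stack exists.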