Andi,
Please check the patch regarding apicid lifting.
For some reason, we need to lift the AP apicids but keep the BSP apicid at 0.
It also handles E0-and-later single-core CPUs whose APIC IDs are reordered.
YH
diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
--- a/arch/x86_64/kernel/setup.c
+++ b/arch/x86_64/kernel/setup.c
@@ -786,13 +786,24 @@ static void __init amd_detect_cmp(struct
 #ifdef CONFIG_SMP
 	int cpu = smp_processor_id();
 	unsigned bits;
+	int cores_vir;
 #ifdef CONFIG_NUMA
 	int node = 0;
-	unsigned apicid = phys_proc_id[cpu];
+	unsigned initial_apicid = phys_proc_id[cpu];
+	unsigned apicid = hard_smp_processor_id();
 #endif
+
+	cores_vir = c->x86_max_cores;
+	if(cores_vir == 1) {
+		unsigned level = cpuid_eax(1);
+		/* double check if it is E0 later, only E0 later can reorder apicid for single core */
+		if ((level & 0xf0f00) >= 0x20f00) {
+			cores_vir = 2;
+		}
+	}
 
 	bits = 0;
-	while ((1 << bits) < c->x86_max_cores)
+	while ((1 << bits) < cores_vir)
 		bits++;
 
 	/* Low order bits define the core id (index of core in socket) */
@@ -802,32 +813,23 @@ static void __init amd_detect_cmp(struct
 
 #ifdef CONFIG_NUMA
 	node = phys_proc_id[cpu];
-	if (apicid_to_node[apicid] != NUMA_NO_NODE)
-		node = apicid_to_node[apicid];
+
+	if (apicid_to_node[apicid] == NUMA_NO_NODE)
+		apicid_to_node[apicid] = node;
+
+	if (!node_online(node)) {
-		/* Two possibilities here:
+		/* One possibilities here:
 		   - The CPU is missing memory and no node was created.
-		   In that case try picking one from a nearby CPU
-		   - The APIC IDs differ from the HyperTransport node IDs
-		   which the K8 northbridge parsing fills in.
-		   Assume they are all increased by a constant offset,
-		   but in the same order as the HT nodeids.
-		   If that doesn't result in a usable node fall back to the
-		   path for the previous case. */
-		int ht_nodeid = apicid - (phys_proc_id[0] << bits);
-		if (ht_nodeid >= 0 &&
-		    apicid_to_node[ht_nodeid] != NUMA_NO_NODE)
-			node = apicid_to_node[ht_nodeid];
-		/* Pick a nearby node */
-		if (!node_online(node))
-			node = nearby_node(apicid);
+		   In that case try picking one from a nearby CPU */
+		node = nearby_node(apicid);
 	}
+
 	numa_set_node(cpu, node);
+#endif
 
 	printk(KERN_INFO "CPU %d(%d) -> Node %d -> Core %d\n",
 			cpu, c->x86_max_cores, node, cpu_core_id[cpu]);
 #endif
-#endif
 }
 
 static int __init init_amd(struct cpuinfo_x86 *c)
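The revision test in the first hunk can be read in isolation. The sketch below is only an illustration, not kernel code: the helper names are hypothetical, and the raw CPUID(1).EAX value is passed in as an argument rather than read from the CPU. The mask 0xf0f00 keeps the base family (EAX bits 11:8) and the extended model (EAX bits 19:16); treating 0x20f00 as the "E0 or later" threshold is taken from the patch comment, not verified independently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Keeps base family (EAX[11:8]) and extended model (EAX[19:16]). */
#define K8_REV_MASK  0xf0f00u
/* Family 0xf, extended model 2: the threshold the patch uses for "E0 or later". */
#define K8_REV_E_MIN 0x20f00u

/* Hypothetical helper: takes the raw CPUID(1).EAX value as an argument. */
static bool k8_is_rev_e_or_later(uint32_t cpuid1_eax)
{
	return (cpuid1_eax & K8_REV_MASK) >= K8_REV_E_MIN;
}

/* Mirrors the patch: a single-core part at rev E or later is treated as
 * if it had 2 cores when computing the core-id bit width. */
static unsigned effective_cores(unsigned max_cores, uint32_t cpuid1_eax)
{
	if (max_cores == 1 && k8_is_rev_e_or_later(cpuid1_eax))
		return 2;
	return max_cores;
}
```

With an EAX value whose extended model is 0, a single-core CPU keeps a core count of 1; with extended model 2 it is bumped to 2, which is what cores_vir captures in the patch.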
On Mon, Nov 21, 2005 at 01:49:09PM -0800, yhlu wrote:
> Andi,
> Please check the patch regarding apicid lifting.
> For some reason, we need to lift the AP apicids but keep the BSP apicid at 0.
> It also handles E0-and-later single-core CPUs whose APIC IDs are reordered.
Can you please explain clearly:
- What are you changing.
- What was the problem with the old behaviour
- Why that particular change
- Why can't that APIC number setup be done by the BIOS itself
Thanks.
Please note there is a high barrier of entry for any kind of BIOS workarounds - in particular for LinuxBIOS I'm not very motivated because you guys can just fix the BIOS.
-Andi
> Can you please explain clearly:
> - What are you changing.
1. Use cores_vir instead of x86_max_cores. For an E0-or-later single-core CPU, cores_vir can be 2 while x86_max_cores is still 1, so the node calculation comes out right.
2. Don't assume that the lifted APIC IDs are contiguous. We can get the exact node id and core id from the initial apicid.
> - What was the problem with the old behaviour
1. For an E0 single-core CPU on node 2, the initial apicid is 4, and the old code gets node=4 instead of 2.
2. If the lifted apicids are not contiguous, it assigns a strange node id to the cpu.
> - Why that particular change
1. We can get the exact node id and core id from the initial apicid on E0 and later.
> - Why can't that APIC number setup be done by the BIOS itself
1. The patch makes the code more generic, and it doesn't assume the lifted apicids are contiguous.
Thanks
YH
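The node=4 versus node=2 case above follows directly from the shift width. The sketch below reuses the bit-counting loop from the patch in a standalone form; the function names are made up for illustration, and it assumes the initial APIC ID is laid out as node id in the high bits and core id in the low bits.

```c
/* Number of low APIC ID bits that encode the core index,
 * computed the same way as the loop in amd_detect_cmp(). */
static unsigned core_bits(unsigned cores)
{
	unsigned bits = 0;

	while ((1u << bits) < cores)
		bits++;
	return bits;
}

/* Hypothetical helpers: split an initial APIC ID into node and core. */
static unsigned node_from_initial_apicid(unsigned initial_apicid, unsigned cores)
{
	return initial_apicid >> core_bits(cores);
}

static unsigned core_from_initial_apicid(unsigned initial_apicid, unsigned cores)
{
	return initial_apicid & ((1u << core_bits(cores)) - 1);
}
```

For initial apicid 4, a core count of 1 gives a shift of 0 and therefore node 4, while a core count of 2 gives a shift of 1 and node 2, which is the difference between x86_max_cores and cores_vir described in the message.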
On Mon, Nov 21, 2005 at 02:17:35PM -0800, yhlu wrote:
>> Can you please explain clearly:
>> - What are you changing.
> 1. Use cores_vir instead of x86_max_cores. For an E0-or-later single-core CPU, cores_vir can be 2 while x86_max_cores is still 1, so the node calculation comes out right.
max_cores should be 2 here.
>> - What was the problem with the old behaviour
> 1. For an E0 single-core CPU on node 2, the initial apicid is 4, and the old code gets node=4 instead of 2.
> 2. If the lifted apicids are not contiguous, it assigns a strange node id to the cpu.
Is there a good reason in the BIOS to not make it contiguous?
>> - Why that particular change
> 1. We can get the exact node id and core id from the initial apicid on E0 and later.
>> - Why can't that APIC number setup be done by the BIOS itself
> 1. The patch makes the code more generic, and it doesn't assume the lifted apicids are contiguous.
It's only the last resort fallback anyways. I would prefer to not make it more complicated. The recommended way is for you to supply an SRAT table; then you're independent of any such fallback heuristics and just tell the kernel the right mapping.
-Andi
On Mon, Nov 21, 2005 at 02:31:44PM -0800, yhlu wrote:
>> max_cores should be 2 here.
> No. For an E0 single-core CPU, x86_max_cores will be 1, so the initial apicid cannot be shifted to get the node id.
>> Is there a good reason in the BIOS to not make it contiguous?
> With the amd8111, if I lift the BSP apic id, jiffies stops advancing.
It works for other BIOSes, so something must be wrong in your setup. Better root cause that.
-Andi
OK, but the patch makes your code more generic, and it also supports E0 single core...
YH
-- LinuxBIOS mailing list LinuxBIOS@openbios.org http://www.openbios.org/mailman/listinfo/linuxbios
Andi Kleen wrote:
> Please note there is a high barrier of entry for any kind of BIOS workarounds - in particular for LinuxBIOS I'm not very motivated because you guys can just fix the BIOS.
Hi Andi, just wanted to let you know, that I do agree that this is a good policy in general. In terms of LinuxBIOS, now that we're starting to approach 2M nodes out in the field, fixing it is getting a wee bit harder. Again, I'm not disagreeing with the point above, just mentioning that "just fix the BIOS" is not as easy as it was when we had all the LinuxBIOS nodes in the world -- all 13 of them -- in my lab :-)
This APIC lifting thing has been a real mess, and IIRC what really pushed it originally was the island aruma, with its 32 PCI busses. It's amazing how PC architectures always seem to involve over-running bit-fields -- 4 bits, 6 bits, 8 bits, 10 bits, whatever.
Getting it all to work has involved lots of backtracking, as we found that fixing this problem HERE broke that legacy system THERE -- where legacy seems to mean "more than 3 weeks old". The mail traffic on the linuxbios list on this issue has been interesting, and in some cases, more than I can keep up with. Part of the issue is that we all have mutually exclusive hardware, and we keep running into hardware limitations that don't seem to be known to even the guys who make the chips. So we think we have the permanent fix, and somebody pops up to report we just broke their mainboard -- and they're the only ones with that mainboard, so testing is hard.
At the same time, we seem to be treading in territory where the fuctory BIOSes have not yet been. We're in the weird position, at times, of finding things out before the proprietary BIOSes get there.
Sometimes the ease of updating the BIOS can cause troubles you don't expect. Fuctory BIOSes seem to count on infrequent updates, forked code bases, and so on, so that you have to update each mainboard source base individually -- they have the disadvantage of a forked code base, but the one advantage is that a mod to fix one platform won't ever break another.
At some point I had understood that linux was going to be able to function without resorting to SRAT tables -- has that changed? Is this patch really intrusive enough that it is not acceptable? The issue is that we get LinuxBIOS right on a platform, and then some new rev of the CPU comes along, and LinuxBIOS gets updated in a way that is not obviously going to cause trouble for the older stuff -- but then it does, for some other reason. I am hoping this apic lifting will settle down in the next while, but it's been hard.
thanks
ron
Something about SRAT in LinuxBIOS: I have added dynamic SRAT support to LinuxBIOS, but full ACPI support still needs a DSDT. Currently we only have a DSDT for the AMD chipset in LB, and we do not yet have access to the DSDT ASL for the Nvidia chipset.
YH
On Wed, Nov 23, 2005 at 09:19:59AM -0800, yhlu wrote:
> Something about SRAT in LinuxBIOS: I have added dynamic SRAT support to LinuxBIOS, but full ACPI support still needs a DSDT. Currently we only have a DSDT for the AMD chipset in LB, and we do not yet have access to the DSDT ASL for the Nvidia chipset.
You probably don't need most of it. Just a basic SRAT table (no AML methods) and enough to keep the ACPI interpreter from aborting early.
Or alternatively just fix the bug that caused you to go with discontig APICs in the first place.
-Andi
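A "basic SRAT table" in this sense is mostly a list of fixed-size affinity entries that map an APIC ID to a proximity domain (NUMA node). The sketch below shows one such entry and the usual ACPI checksum rule. It is an assumption-laden illustration, not LinuxBIOS code: the struct layout follows the 16-byte Processor Local APIC/SAPIC Affinity Structure as described in the ACPI specification, the names are hypothetical, and the table header and memory affinity entries are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of one SRAT processor affinity entry (16 bytes). */
struct srat_cpu_affinity {
	uint8_t  type;            /* 0 = processor local APIC affinity */
	uint8_t  length;          /* size of this entry, 16 */
	uint8_t  proximity_lo;    /* proximity domain bits 7:0 (the node) */
	uint8_t  apic_id;         /* local APIC ID of the CPU */
	uint32_t flags;           /* bit 0 set = entry enabled */
	uint8_t  local_sapic_eid;
	uint8_t  proximity_hi[3]; /* proximity domain bits 31:8 */
	uint32_t reserved;
};

_Static_assert(sizeof(struct srat_cpu_affinity) == 16, "entry must be 16 bytes");

/* Hypothetical helper: fill one enabled entry for a CPU on a given node. */
static void srat_fill_cpu(struct srat_cpu_affinity *e, uint8_t apic_id, uint8_t node)
{
	memset(e, 0, sizeof(*e));
	e->type = 0;
	e->length = sizeof(*e);
	e->proximity_lo = node;
	e->apic_id = apic_id;
	e->flags = 1;
}

/* All bytes of an ACPI table must sum to zero modulo 256; this returns the
 * value to store in the checksum field, assuming that field is currently 0. */
static uint8_t acpi_checksum(const void *table, size_t len)
{
	const uint8_t *p = table;
	uint8_t sum = 0;

	while (len--)
		sum += *p++;
	return (uint8_t)(0 - sum);
}
```

Because each entry names its APIC ID explicitly, the mapping does not depend on the IDs being contiguous, which is why SRAT sidesteps the fallback heuristics discussed in this thread.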
Is there any way to make the kernel use ACPI but still take the PCI IRQ routing from the mptable?
YH
* yhlu yinghailu@gmail.com [051123 18:40]:
> Is there any way to make the kernel use ACPI but still take the PCI IRQ routing from the mptable?
Yes, don't provide any of MADT, DSDT, FADT.
Stefan
Only RSDT+SRAT? I will try it.
YH
It doesn't work. In that case ACPI has to be disabled in the kernel (acpi=off).
YH
On Wed, Nov 23, 2005 at 10:35:19AM -0800, yhlu wrote:
> It doesn't work. In that case ACPI has to be disabled in the kernel (acpi=off).
Shouldn't be very hard to fix in the kernel though (to use only SRAT and nothing else of ACPI). I can look into it if nobody beats me to it.
-Andi
* Andi Kleen ak@suse.de [051123 18:36]:
> You probably don't need most of it. Just a basic SRAT table (no AML methods) and enough to keep the ACPI interpreter from aborting early.
> Or alternatively just fix the bug that caused you to go with discontig APICs in the first place.
Andi,
I really appreciate your persistence, but what we tried to express is that there is hardware that simply forces you to have discontiguous APIC IDs, so you either disable parts of the hardware or are forced to do nasty things.
Wrt the ACPI tables, a good rule of thumb is that if you start to have some of them you have to have them all. For example, if you have a logical subset of them and try to cover the rest with PIRQ or MPTABLE, you will fail because Linux moans about incorrect tables without even looking at them. And no, there is no reason for not reading an HPET table when there's no MADT available. And no, I'm not going to send a fix since I'm really not motivated to dig into that code a minute more than absolutely necessary.
I agree that there's a reason the Linux ACPI code is the way it is, but since it is a reaction to zillions of buggy BIOSes, having clean firmware that doesn't work with it "fixed" to behave like the others out there is not always the best solution.
Stefan
yhlu wrote:
> Something about SRAT in LinuxBIOS: I have added dynamic SRAT support to LinuxBIOS, but full ACPI support still needs a DSDT. Currently we only have a DSDT for the AMD chipset in LB, and we do not yet have access to the DSDT ASL for the Nvidia chipset.
yeah, this is the great thing about ACPI, it has put us into a whole new era of copyrighted stuff. ACPI tables describe hardware, and are copyright bios vendors. The question of which ACPI bits we can use in linuxbios is unresolved. AMD has committed to open-source ACPI tables, but ... what about companies like nvidia? unknown. And, to add to the fun, the mainboard vendors don't own their own ACPI tables -- the BIOS vendors do. So the mainboard vendor has their hardware design encoded into ACPI tables, which are copyright the bios vendor, not the mainboard vendor.
ACPI is a looming problem for all the open-source bios efforts out there.
I don't much like ACPI. It's just another mechanism for hiding information and limiting its distribution.
ron
On Wed, Nov 23, 2005 at 01:23:41PM -0700, Ronald G Minnich wrote:
> yeah, this is the great thing about ACPI, it has put us into a whole new era of copyrighted stuff. ACPI tables describe hardware, and are copyright bios vendors. The question of which ACPI bits we can use in linuxbios is unresolved. AMD has committed to open-source ACPI tables, but ... what about companies like nvidia? unknown. And, to add to the fun, the mainboard vendors don't own their own ACPI tables -- the BIOS vendors do. So the mainboard vendor has their hardware design encoded into ACPI tables, which are copyright the bios vendor, not the mainboard vendor.
I don't think it's as bad as you describe. Once you have a free reference DSL it shouldn't be very difficult to vary it for specific platforms. I guess that is what the proprietary BIOS writers are doing too.
Some systems have very complex ACPI tables, but for others they can be quite simple and a lot of the complexity can be just ignored.
I suppose you could even write a generic translator from mptables to ACPI tables (although I suspect more and more setups cannot be described in the old tables)
BTW there are other reasons now to support ACPI, like the MCFG tables that are needed for extended config space accesses (necessary e.g. for PCI Express error handling) or the HPET table for the HPET timer.
-Andi
> At the same time, we seem to be treading in territory where the fuctory BIOSes have not yet been. We're in the weird position, at times, of finding things out before the proprietary BIOSes get there.
You're saying that Arima machine only runs with LinuxBIOS so far?
> Sometimes the ease of updating the BIOS can cause troubles you don't expect. Fuctory BIOSes seem to count on infrequent updates, forked code bases, and so on, so that you have to update each mainboard source base individually -- they have the disadvantage of a forked code base, but the one advantage is that a mod to fix one platform won't ever break another.
> At some point I had understood that linux was going to be able to function without resorting to SRAT tables -- has that changed? Is this
SRAT is definitely the wave of the future. I'm going to keep the old code working as long as it's relatively cleanly possible. But this needs to make some assumptions, and in particular your discontiguous APIC IDs broke that. I don't plan to try to handle every weird case in this case - if your setup is really weird use SRAT.
Regarding the BSP 0 - you probably just have a stupid bug somewhere that breaks the timer interrupt. I don't know of any hardware issue that would prevent the timer interrupt going to an APIC ID != 0. There are some troubles with timer interrupts, but they are different issues. IMHO the discontig APIC IDs were just a workaround because you didn't want to fix the interrupt routing properly, with the burden on the kernel.
You have two options now:
- Trace down why interrupt 0 doesn't work with BSP LAPIC != 0 and fix that.
- Provide SRAT.
First is probably better, although the second also wouldn't hurt.
> patch really intrusive enough that it is not acceptable? The issue is
Yes it complicates the logic enough and makes it more fragile, which would be a long term maintenance issue. The only way to keep the fallback code alive at all is to keep it simple and clean.
-Andi
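The assumption that discontiguous APIC IDs break can be shown with a deliberately simplified model of the fallback that the patch removes. This is a sketch with made-up function names, assuming single-core CPUs and assuming the kernel takes the boot CPU's APIC ID as the constant offset; the real code also shifts by the core-id bits and consults apicid_to_node[].

```c
/* Simplified model of the old fallback: assume every APIC ID was raised by
 * the same constant, use the boot CPU's APIC ID as that constant, and
 * subtract it to recover the HyperTransport node id. */
static int guess_ht_nodeid(unsigned apicid, unsigned bsp_apicid)
{
	return (int)apicid - (int)bsp_apicid;
}

/* The guess is only usable if it lands inside the range of known nodes. */
static int guess_is_usable(int ht_nodeid, unsigned nr_nodes)
{
	return ht_nodeid >= 0 && (unsigned)ht_nodeid < nr_nodes;
}
```

If all four CPUs are lifted to 16..19, APIC ID 18 with a BSP at 16 yields node 2. If the BSP stays at 0 while the APs are lifted, the same APIC ID 18 yields 18, which is outside a four-node system, so the heuristic has nothing sensible to return.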
The fallback code is not needed, because on AMD Opteron you can figure out the node id and core id exactly from the initial apic id at that point.
YH
On Wed, Nov 23, 2005 at 09:43:12AM -0800, yhlu wrote:
> The fallback code is not needed, because on AMD Opteron you can figure out the node id and core id exactly from the initial apic id at that point.
AFAIK there is no foolproof way to figure out the HT node id from the initial APIC ID. One is in the Northbridge, the other is in CPUID, but there is no directly visible connection.
If you know one please share it
-Andi
NB_CFG bit 54 can be set on E0 and later steppings.
When it is set, the initial apic id will be:
node0/core0 : 0
node0/core1 : 1
node1/core0 : 2
node1/core1 : 3
So you can shift the initial apic id to get the node id, but you need the right cores_vir: for a single-core CPU you have to shift as if there were 2 cores, not the core count of 1 that the CPU reports.
For LinuxBIOS we go further and use NB_CFG bit 54 directly instead of checking the cpu version, because it can only be set on E0 and later.
Please check the code in LinuxBIOS that we use to get the node id:
YH
static inline unsigned int read_nb_cfg_54(void)
{
	msr_t msr;
	msr = rdmsr(NB_CFG_MSR);
	return ((msr.hi >> (54 - 32)) & 1);
}

struct node_core_id {
	unsigned nodeid;
	unsigned coreid;
};

static inline unsigned get_initial_apicid(void)
{
	return ((cpuid_ebx(1) >> 24) & 0xf);
}

static inline struct node_core_id get_node_core_id(unsigned nb_cfg_54)
{
	struct node_core_id id;
	// get the apicid via cpuid(1) ebx[27:24]
	if (nb_cfg_54) {
		// when NB_CFG[54] is set, nodeid = ebx[27:25], coreid = ebx[24]
		id.coreid = (cpuid_ebx(1) >> 24) & 0xf;
		id.nodeid = (id.coreid >> 1);
		id.coreid &= 1;
	} else {
		// when NB_CFG[54] is clear, nodeid = ebx[26:24], coreid = ebx[27]
		id.nodeid = (cpuid_ebx(1) >> 24) & 0xf;
		id.coreid = (id.nodeid >> 3);
		id.nodeid &= 7;
	}
	return id;
}

static inline unsigned get_core_num(void)
{
	return (cpuid_ecx(0x80000008) & 0xff);
}

static inline struct node_core_id get_node_core_id_x(void)
{
	return get_node_core_id(read_nb_cfg_54()); // for pre_e0() nb_cfg_54 always be 0
}
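To try the two bit layouts without the hardware, the decode step can be separated from the register reads. The variant below is a hypothetical rewrite for illustration only: decode_node_core_id is not a LinuxBIOS function, and it takes the raw CPUID(1).EBX value and the NB_CFG[54] state as plain arguments.

```c
#include <stdint.h>

struct node_core_id {
	unsigned nodeid;
	unsigned coreid;
};

/* Hypothetical hardware-free variant of get_node_core_id(). */
static struct node_core_id decode_node_core_id(uint32_t cpuid1_ebx, int nb_cfg_54)
{
	struct node_core_id id;
	unsigned apicid = (cpuid1_ebx >> 24) & 0xf; /* initial APIC ID, ebx[27:24] */

	if (nb_cfg_54) {
		/* NB_CFG[54] set: node in the upper three bits, core in bit 0 */
		id.nodeid = apicid >> 1;
		id.coreid = apicid & 1;
	} else {
		/* NB_CFG[54] clear: node in the lower three bits, core in bit 3 */
		id.nodeid = apicid & 7;
		id.coreid = apicid >> 3;
	}
	return id;
}
```

With NB_CFG[54] set, an initial APIC ID of 3 decodes to node 1, core 1, matching the node1/core1 row in the list above; with the bit clear, an initial APIC ID of 9 also decodes to node 1, core 1.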
On Wed, Nov 23, 2005 at 10:01:51AM -0800, yhlu wrote:
> NB_CFG bit 54 can be set on E0 and later steppings.
That MSR is not even in my docs. Sounds very stepping specific. Probably nothing that a kernel should access. The k8topology code is already far too machine specific and making it more so would be a mistake.
-Andi
Andi Kleen wrote:
> That MSR is not even in my docs. Sounds very stepping specific.
Or your docs predate that MSR?
ron
* Andi Kleen ak@suse.de [051123 18:35]:
>> At the same time, we seem to be treading in territory where the fuctory BIOSes have not yet been. We're in the weird position, at times, of finding things out before the proprietary BIOSes get there.
> You're saying that Arima machine only runs with LinuxBIOS so far?
Island/Aruma. The only production level firmware is LinuxBIOS, indeed.
Others have failed.
Stefan
Andi Kleen wrote:
>> At the same time, we seem to be treading in territory where the fuctory BIOSes have not yet been. We're in the weird position, at times, of finding things out before the proprietary BIOSes get there.
> You're saying that Arima machine only runs with LinuxBIOS so far?
The Cray XD-1, opteron-based machine, is linuxbios only. Another machine, DRC, can only use linuxbios. Island/Aruma was mentioned. These are the three I know of, there may well be others.
DRC used linuxbios because fuctory bios could not configure HT correctly in their application.
(well, I know of one other, but can't tell you about it).
ron