Patrick Rudolph has uploaded this change for review. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
cpu/x86/mp_init: Wait longer with serial enabled
In case of slow serial wait longer for MP init to pickup jobs.
This fixes a crash observed on KBL with 8core CPU as the task is stored on the BSP stack, which is reused for something else if the APs are to slow.
Tested on SuperMicro X11SSH-TF.
Change-Id: Id47df02a9238e66f2b628b9d6805858724bf30a9 Signed-off-by: Patrick Rudolph patrick.rudolph@9elements.com --- M src/cpu/x86/mp_init.c 1 file changed, 13 insertions(+), 0 deletions(-)
git pull ssh://review.coreboot.org:29418/coreboot refs/changes/70/34570/1
diff --git a/src/cpu/x86/mp_init.c b/src/cpu/x86/mp_init.c index 9528149..8ff214e 100644 --- a/src/cpu/x86/mp_init.c +++ b/src/cpu/x86/mp_init.c @@ -892,6 +892,19 @@ } mfence();
+ if (CONFIG(CONSOLE_SERIAL)) { + /* The serial console is slow and has a spinlock. + * It can take a while for all APs to pickup the work, + * especially with modern multicore CPUs. + * + * Wait longer than requested, otherwise we will return early + * and the stack containing the AP work will be reused for + * something else, causing the APs to crash. + * + * 100msec seems to be sufficent on 8core platform. + */ + expire_us = 100 * USECS_PER_MSEC; + } /* Wait for all the APs to signal back that call has been accepted. */ if (expire_us > 0) stopwatch_init_usecs_expire(&sw, expire_us);
Subrata Banik has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1: Code-Review+1
Furquan Shaikh has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(1 comment)
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@906 PS1, Line 906: expire_us = 100 * USECS_PER_MSEC; What if run_ap_work() is called with > 100 msec? e.g. mp_park_aps. This would end up reducing the expiry time requested by caller.
Subrata Banik has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1: -Code-Review
yes, better to increase at calling function itself with higher timeout ?
Richard Spiegel has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(1 comment)
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@895 PS1, Line 895: CONSOLE_SERIAL Several other platforms use CONSOLE_SERIAL, most don't have the problem you describe. This would than cause an unnecessary delay for other platforms, which is bad. Create a config SLOW_SERIAL_INIT, depending on CONSOLE_SERIAL, and use it here instead of CONSOLE_SERIAL. Also, add a description of the problem on the help line of this new config.
Richard Spiegel has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(1 comment)
Patch Set 1: -Code-Review
yes, better to increase at calling function itself with
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@906 PS1, Line 906: expire_us = 100 * USECS_PER_MSEC;
What if run_ap_work() is called with > 100 msec? e.g. mp_park_aps. […]
You could also check current value of expire_us and only change it if it was less then 100...
Patrick Rudolph has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(2 comments)
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@895 PS1, Line 895: CONSOLE_SERIAL
Several other platforms use CONSOLE_SERIAL, most don't have the problem you describe. […]
Who would select SLOW_SERIAL_INIT? Seems to make no sense.
Please explain why it would introduce a delay? If it runs into that timeout, something is broken on your platform, isn't it?
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@906 PS1, Line 906: expire_us = 100 * USECS_PER_MSEC;
What if run_ap_work() is called with > 100 msec? e.g. mp_park_aps. […]
Right, it should set a minimal limit.
Aaron Durbin has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(1 comment)
https://review.coreboot.org/c/coreboot/+/34570/1//COMMIT_MSG Commit Message:
https://review.coreboot.org/c/coreboot/+/34570/1//COMMIT_MSG@13 PS1, Line 13: to too
Richard Spiegel has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(1 comment)
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@895 PS1, Line 895: CONSOLE_SERIAL
Who would select SLOW_SERIAL_INIT? Seems to make no sense. […]
Only your platform would need to select it. I don't want the delay increased for any other platform that also uses CONSOLE_SERIAL and don't need the extra delay. Maybe name the config differently if you don't like my suggestion, but do something that only affects your platform (and could be used in future by another platform that also presents that problem).
Patrick Rudolph has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(1 comment)
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@895 PS1, Line 895: CONSOLE_SERIAL
Only your platform would need to select it. […]
There's no delay and thus does not increase any delay. It increases the timeout for all APs to pick up the job.
Paul Menzel has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(4 comments)
https://review.coreboot.org/c/coreboot/+/34570/1//COMMIT_MSG Commit Message:
https://review.coreboot.org/c/coreboot/+/34570/1//COMMIT_MSG@9 PS1, Line 9: pickup The verb is spelled with a space: pick up.
https://review.coreboot.org/c/coreboot/+/34570/1//COMMIT_MSG@15 PS1, Line 15: SuperMicro Supermicro
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@896 PS1, Line 896: /* The serial console is slow and has a spinlock. Please use an allowed comment style.
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@897 PS1, Line 897: pickup pick up
Patrick Rudolph has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(1 comment)
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@896 PS1, Line 896: /* The serial console is slow and has a spinlock.
Please use an allowed comment style.
It's allowed.
Furquan Shaikh has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(1 comment)
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@895 PS1, Line 895: CONSOLE_SERIAL
There's no delay and thus does not increase any delay. […]
The increase in expiry time does not result in boot time increase unless the APs really need more time to pick the job. If the job is not picked within the timeout, there could be other silent failures that might occur in the system. Few months back I had run into a similar issue where parking of APs timed out and no one had paid attention to BIOS logs. Eventually, it started showing issues when running suspend stress tests.
Richard Spiegel has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(2 comments)
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@895 PS1, Line 895: CONSOLE_SERIAL
The increase in expiry time does not result in boot time increase unless the APs really need more ti […]
Ack
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@896 PS1, Line 896: /* The serial console is slow and has a spinlock.
It's allowed.
Paul probably wants: /* * The serial console is slow and has a spinlock.
Paul Menzel has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
(1 comment)
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@896 PS1, Line 896: /* The serial console is slow and has a spinlock.
Paul probably wants: […]
That or the concise style.
Patrick Rudolph has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1: Code-Review-1
While a bigger timeout fixes the issue I'm seeing and it's related to slow serial, the fix is incorrect. The main issue is that `mp_run_on_all_cpus` doesn't wait for CPUs to finish execution.
That means consecutive calls to mp_run_on_all_cpus expect all CPUs to pick-up the work, but they are still busy with the old job (due to high core count and console spinlock).
I guess that's the reason mp_park_aps has this big timeout.
An idea how to fix it: Wait it mp_run_on_aps for all APs to signal idle state before trying to launch a new job.
Subrata Banik has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 1:
Patch Set 1: Code-Review-1
While a bigger timeout fixes the issue I'm seeing and it's related to slow serial, the fix is incorrect. The main issue is that `mp_run_on_all_cpus` doesn't wait for CPUs to finish execution.
That means consecutive calls to mp_run_on_all_cpus expect all CPUs to pick-up the work, but they are still busy with the old job (due to high core count and console spinlock).
I guess that's the reason mp_park_aps has this big timeout.
An idea how to fix it: Wait it mp_run_on_aps for all APs to signal idle state before trying to launch a new job.
we need to think do we really need to have all serial print in multi-core environment for example, i'm running same function foo() [where i'm having bunch of serial debug print] in all cores (BSP + AP) and expecting to see all those log in "n" numbers of time. What is the benefit in that approach. if we could only limit serial log for BSP then we can avoid such race case where due to slower serial uart ultimately my AP function giving error, there is no issue with my cores (APs) only due to slower uart, AP has to compromise.
if we write something like this for all functions suppose to run in multi threaded envionment as below
function foo() { if(bsp) printk(...) }
thoughts ?
Hello Kyösti Mälkki, Naresh Solanki, Aaron Durbin, Alexandru Gagniuc, Subrata Banik, Julius Werner, Lee Leahy, Marshall Dawson, Richard Spiegel, build bot (Jenkins), Furquan Shaikh,
I'd like you to reexamine a change. Please visit
https://review.coreboot.org/c/coreboot/+/34570
to look at the new patch set (#2).
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
cpu/x86/mp_init: Wait longer with serial enabled
In case of slow serial wait longer for MP init to pick up jobs.
This just hides a bug in mp_init, that doesn't wait for APs to finish the previous job.
TODO: Add a spinlock and wait for APs to finish their previous work instead.
However it fixes a crash observed on KBL with 8core CPU as the task is stored on the BSP stack, which is reused for something else if the APs are too slow.
Tested on Supermicro X11SSH-TF.
Change-Id: Id47df02a9238e66f2b628b9d6805858724bf30a9 Signed-off-by: Patrick Rudolph patrick.rudolph@9elements.com --- M src/cpu/x86/mp_init.c 1 file changed, 15 insertions(+), 0 deletions(-)
git pull ssh://review.coreboot.org:29418/coreboot refs/changes/70/34570/2
Patrick Rudolph has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 2:
(5 comments)
https://review.coreboot.org/c/coreboot/+/34570/1//COMMIT_MSG Commit Message:
https://review.coreboot.org/c/coreboot/+/34570/1//COMMIT_MSG@13 PS1, Line 13: to
too
Done
https://review.coreboot.org/c/coreboot/+/34570/1//COMMIT_MSG@15 PS1, Line 15: SuperMicro
Supermicro
Done
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@896 PS1, Line 896: /* The serial console is slow and has a spinlock.
That or the concise style.
According to https://doc.coreboot.org/coding_style.html C89 or C99 are mandatory. There's no mandatory coding style for comments, only preferred ones.
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@897 PS1, Line 897: pickup
pick up
Done
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@906 PS1, Line 906: expire_us = 100 * USECS_PER_MSEC;
Right, it should set a minimal limit.
Done
Patrick Rudolph has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 2:
Patch Set 1:
Patch Set 1: Code-Review-1
While a bigger timeout fixes the issue I'm seeing and it's related to slow serial, the fix is incorrect. The main issue is that `mp_run_on_all_cpus` doesn't wait for CPUs to finish execution.
That means consecutive calls to mp_run_on_all_cpus expect all CPUs to pick-up the work, but they are still busy with the old job (due to high core count and console spinlock).
I guess that's the reason mp_park_aps has this big timeout.
An idea how to fix it: Wait it mp_run_on_aps for all APs to signal idle state before trying to launch a new job.
we need to think do we really need to have all serial print in multi-core environment for example, i'm running same function foo() [where i'm having bunch of serial debug print] in all cores (BSP + AP) and expecting to see all those log in "n" numbers of time. What is the benefit in that approach. if we could only limit serial log for BSP then we can avoid such race case where due to slower serial uart ultimately my AP function giving error, there is no issue with my cores (APs) only due to slower uart, AP has to compromise.
if we write something like this for all functions suppose to run in multi threaded envionment as below
function foo() { if(bsp) printk(...) }
thoughts ?
That works, but: * Requires lots of changes in common code * Only fixes current code, but doesn't prevent future developers from doing the same error again * Might break again as soon as more code (delay) as added on secondary APs * Might break again as soon as core count increases
Patrick Rudolph has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 2:
Patch Set 2:
Patch Set 1:
Patch Set 1: Code-Review-1
While a bigger timeout fixes the issue I'm seeing and it's related to slow serial, the fix is incorrect. The main issue is that `mp_run_on_all_cpus` doesn't wait for CPUs to finish execution.
That means consecutive calls to mp_run_on_all_cpus expect all CPUs to pick-up the work, but they are still busy with the old job (due to high core count and console spinlock).
I guess that's the reason mp_park_aps has this big timeout.
An idea how to fix it: Wait it mp_run_on_aps for all APs to signal idle state before trying to launch a new job.
we need to think do we really need to have all serial print in multi-core environment for example, i'm running same function foo() [where i'm having bunch of serial debug print] in all cores (BSP + AP) and expecting to see all those log in "n" numbers of time. What is the benefit in that approach. if we could only limit serial log for BSP then we can avoid such race case where due to slower serial uart ultimately my AP function giving error, there is no issue with my cores (APs) only due to slower uart, AP has to compromise.
if we write something like this for all functions suppose to run in multi threaded envionment as below
function foo() { if(bsp) printk(...) }
thoughts ?
That works, but:
- Requires lots of changes in common code
- Only fixes current code, but doesn't prevent future developers from doing the same error again
- Might break again as soon as more code (delay) as added on secondary APs
- Might break again as soon as core count increases
Alternative solution: https://review.coreboot.org/c/coreboot/+/34587
Paul Menzel has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 2:
(1 comment)
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c File src/cpu/x86/mp_init.c:
https://review.coreboot.org/c/coreboot/+/34570/1/src/cpu/x86/mp_init.c@896 PS1, Line 896: /* The serial console is slow and has a spinlock.
According to https://doc.coreboot.org/coding_style.html C89 or C99 are mandatory. […]
Well, preferred in the sense that you should follow it if there is no strong reason to deviate. I do not see a reason here, why you should deviate.
Christian Walter has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 2: Code-Review+1
Richard Spiegel has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 2:
I don't like a workaround that does not address the true problem. I do understand that sometimes it's needed to quick solve a issue when there's time constrains such as a board bring up, but to often it stays there for several years without the real issue being fixed... and when it finally gets fixed (by someone else), they are not aware of the workaround placed years before and now it's there without being needed.
Under normal circumstances I might have approved, but now that it came up it's a workaround, I will not.
Furquan Shaikh has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 2:
Patch Set 1: Code-Review-1
While a bigger timeout fixes the issue I'm seeing and it's related to slow serial, the fix is incorrect. The main issue is that `mp_run_on_all_cpus` doesn't wait for CPUs to finish execution.
That means consecutive calls to mp_run_on_all_cpus expect all CPUs to pick-up the work, but they are still busy with the old job (due to high core count and console spinlock).
I guess that's the reason mp_park_aps has this big timeout.
An idea how to fix it: Wait it mp_run_on_aps for all APs to signal idle state before trying to launch a new job.
How long would this wait be? If a timed wait is added for APs to signal idle, wouldn't it mean that the timeout is just moved from "AP accepting work" to "AP finishing old work"?
Patrick Rudolph has posted comments on this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Patch Set 2:
Patch Set 2:
Patch Set 1: Code-Review-1
While a bigger timeout fixes the issue I'm seeing and it's related to slow serial, the fix is incorrect. The main issue is that `mp_run_on_all_cpus` doesn't wait for CPUs to finish execution.
That means consecutive calls to mp_run_on_all_cpus expect all CPUs to pick-up the work, but they are still busy with the old job (due to high core count and console spinlock).
I guess that's the reason mp_park_aps has this big timeout.
An idea how to fix it: Wait it mp_run_on_aps for all APs to signal idle state before trying to launch a new job.
How long would this wait be? If a timed wait is added for APs to signal idle, wouldn't it mean that the timeout is just moved from "AP accepting work" to "AP finishing old work"?
Yes, true the timeout would just be moved within this function. There's no benefit to current code.
Patrick Rudolph has abandoned this change. ( https://review.coreboot.org/c/coreboot/+/34570 )
Change subject: cpu/x86/mp_init: Wait longer with serial enabled ......................................................................
Abandoned
replaced by CB:35312