Hi,
as we move into the multicore world, messages get harder and harder to read when they are interspersed byte-wise. I propose to add optional locking to printk.
However, this locking has to work when some CPUs are in CAR and others are in RAM. And that's the big problem.
AFAICS we can't simply use atomic normal memory acceses since the CPUs in CAR won't see that. Atomic register modifications are not possible across processors AFAICS. The only idea I have is abusing some MMIO chipset/CPU space which is usable as scratch space. In theory, that should work. But do CPUs or chipsets have such MMIO areas?
Of course, I'm aware that locking (if possible at all) will be highly CPU/chipset dependent, but even if we can make locking work on only a subset of platforms, it is still a lot better than nothing.
Ideas? Utter rejection? Tell me.
Regards, Carl-Daniel
as we move into the multicore world, messages get harder and harder to read when they are interspersed byte-wise. I propose to add optional locking to printk.
This should only matter for the time during RAM, right? Is it a larger issue?
However, this locking has to work when some CPUs are in CAR and others are in RAM. And that's the big problem.
Yuck.
AFAICS we can't simply use atomic normal memory acceses since the CPUs in CAR won't see that. Atomic register modifications are not possible across processors AFAICS. The only idea I have is abusing some MMIO chipset/CPU space which is usable as scratch space. In theory, that should work. But do CPUs or chipsets have such MMIO areas?
They do. It's hard to know if someone is already (ab)using them, though.
Of course, I'm aware that locking (if possible at all) will be highly CPU/chipset dependent, but even if we can make locking work on only a subset of platforms, it is still a lot better than nothing.
Are you thinking of putting it in the ops of the cpu device? I guess that only works after CAR.
Another option would be to only let the BSP print messages by default. That would clean up most people's logs most of the time.
Thanks, Myles
On Fri, Jun 19, 2009 at 9:13 AM, Myles Watsonmylesgw@gmail.com wrote:
Another option would be to only let the BSP print messages by default. That would clean up most people's logs most of the time.
yes, that is why I did that struct-based stack in my v3 SMP startup. The struct formed the base of the AP stack. We could put a simple print buffer in there and require that the BSP print out AP boot messages. This would be a bit better than trying to resolve this locking issue. I never got this done, I only got the "AP post code" working. But overall I think my SMP startup prototype was much cleaner than what is in v2 today.
I don't think we want to put locks in printk. If an AP gets part way up, takes the lock, and fails, everyone is going to stick on that lock. Not fun.
We have made a decision that the BSP is always assumed to work. Any strategy should be build around this assumption, and the further assumption that we ought to contain the AP so that it can not prevent the BSP from doing its job.
ron
yes, that is why I did that struct-based stack in my v3 SMP startup. The struct formed the base of the AP stack. We could put a simple print buffer in there and require that the BSP print out AP boot messages.
So each AP has some part of RAM to copy the buffer to?
I guess that doesn't help with RAM initialization issues, or am I missing something? Maybe that's a small price to pay. Maybe we should allow "Everyone prints" as an option for RAM init debugging.
Thanks, Myles
On Fri, Jun 19, 2009 at 9:50 AM, Myles Watsonmylesgw@gmail.com wrote:
yes, that is why I did that struct-based stack in my v3 SMP startup. The struct formed the base of the AP stack. We could put a simple print buffer in there and require that the BSP print out AP boot messages.
So each AP has some part of RAM to copy the buffer to?
the way SMP works, the BSP sets up its ram. At that point, the APs can use the BSP ram. That's why APs have a stack in the first place.
APs have a working stack when they are setting up their own RAM.
ron
ron minnich wrote:
On Fri, Jun 19, 2009 at 9:50 AM, Myles Watsonmylesgw@gmail.com wrote:
yes, that is why I did that struct-based stack in my v3 SMP startup. The struct formed the base of the AP stack. We could put a simple print buffer in there and require that the BSP print out AP boot messages.
So each AP has some part of RAM to copy the buffer to?
the way SMP works, the BSP sets up its ram. At that point, the APs can use the BSP ram. That's why APs have a stack in the first place.
APs have a working stack when they are setting up their own RAM.
The way this works on amd64 is that the AP comes up, goes to cache as ram, finds it is an AP and goes to sleep again. Then it wakes up again in stage2 when the BSP sends an IPI. At this point (at least remote) RAM is available. They never set up their own ram (in terms of Jedec init, or setting up a ram controller), but only have to clear it, in case of ECC memory.
Stefan
the way SMP works, the BSP sets up its ram. At that point, the APs can use the BSP ram. That's why APs have a stack in the first place.
APs have a working stack when they are setting up their own RAM.
The way this works on amd64 is that the AP comes up, goes to cache as ram, finds it is an AP and goes to sleep again. Then it wakes up again in stage2 when the BSP sends an IPI. At this point (at least remote) RAM is available. They never set up their own ram (in terms of Jedec init, or setting up a ram controller), but only have to clear it, in case of ECC memory.
OK. I'd thought that they each initialized their own RAM, then went to sleep. Thanks.
Myles
So each AP has some part of RAM to copy the buffer to?
the way SMP works, the BSP sets up its ram. At that point, the APs can use the BSP ram.
For Opterons where each node has its own RAM this is a little bit more complex, right?
That's why APs have a stack in the first place.
APs have a working stack when they are setting up their own RAM.
I was wondering how you make the APs not conflict in the part of RAM they copy their buffers to. I was also wondering how it would affect interleaving, etc. That kind of thing seems difficult to debug, and is the reason I'd want to see the APs messages.
Thanks, Myles
On Fri, Jun 19, 2009 at 10:55 AM, Myles Watsonmylesgw@gmail.com wrote:
So each AP has some part of RAM to copy the buffer to?
the way SMP works, the BSP sets up its ram. At that point, the APs can use the BSP ram.
For Opterons where each node has its own RAM this is a little bit more complex, right?
but each AP has access to the BSP ram, or at least that is how I read the code.
I was wondering how you make the APs not conflict in the part of RAM they copy their buffers to. I was also wondering how it would affect interleaving, etc. That kind of thing seems difficult to debug, and is the reason I'd want to see the APs messages.
you give each AP a seperate stack. All that code is in there already.
ron
Myles Watson wrote:
So each AP has some part of RAM to copy the buffer to?
the way SMP works, the BSP sets up its ram. At that point, the APs can use the BSP ram.
For Opterons where each node has its own RAM this is a little bit more complex, right?
No, not really. This is already the complex case. Many systems only have one memory controller. But on all coreboot systems, even those with multiple memory controllers, the controllers are all set up by the BSP. Parallelizing here makes only very little sense.
The reason we're parallelizing is you have to clear memory if you have ECC on some (older?) systems where the memory controller is incapable of doing that automatically. So when we have to clear 32G at a rate of 3 or 6GB/s we want to put that load on several CPUs, so it's 1-2s instead of 5-10. But the ram is all there at this point, and can be transparently accessed by all CPUs.
Stefan
So each AP has some part of RAM to copy the buffer to?
the way SMP works, the BSP sets up its ram. At that point, the APs can use the BSP ram.
For Opterons where each node has its own RAM this is a little bit more complex, right?
No, not really. This is already the complex case. Many systems only have one memory controller. But on all coreboot systems, even those with multiple memory controllers, the controllers are all set up by the BSP. Parallelizing here makes only very little sense.
The reason we're parallelizing is you have to clear memory if you have ECC on some (older?) systems where the memory controller is incapable of doing that automatically. So when we have to clear 32G at a rate of 3 or 6GB/s we want to put that load on several CPUs, so it's 1-2s instead of 5-10. But the ram is all there at this point, and can be transparently accessed by all CPUs.
I think what we really need is a tool called email2docs. :)
Thanks for the answers.
Thanks, Myles
On 19.06.2009 20:16, Myles Watson wrote:
I think what we really need is a tool called email2docs. :)
I tried to gather the snippets which will be useful for others after our discussion at http://www.coreboot.org/Early_SMP_startup . Feel free to rename/change/etc, I hope it will be useful for others. It is certainly useful for me.
Thanks for the answers.
Yes, thanks from me as well. The explanations really helped me understand early startup a lot better.
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
On 19.06.2009 20:16, Myles Watson wrote:
I think what we really need is a tool called email2docs. :)
I tried to gather the snippets which will be useful for others after our discussion at http://www.coreboot.org/Early_SMP_startup . Feel free to rename/change/etc, I hope it will be useful for others. It is certainly useful for me.
Awesome! Thanks a lot!
On 19.06.2009 18:35, ron minnich wrote:
On Fri, Jun 19, 2009 at 9:13 AM, Myles Watsonmylesgw@gmail.com wrote:
Another option would be to only let the BSP print messages by default. That would clean up most people's logs most of the time.
yes, that is why I did that struct-based stack in my v3 SMP startup. The struct formed the base of the AP stack. We could put a simple print buffer in there and require that the BSP print out AP boot messages. This would be a bit better than trying to resolve this locking issue.
Neat idea. Should we make printk on the BSP check AP buffers after each printk and print them as well? What happens if the APs print stuff faster than the BSP can poll? By the way, managing that buffer is a locking problem as well, but we might get away with it. The bigger problem with AP buffers is that they may be invisible to the BSP if the APs have independent cache.
I never got this done, I only got the "AP post code" working. But overall I think my SMP startup prototype was much cleaner than what is in v2 today.
I don't understand v2 startup anyway (well, except for your new code).
I don't think we want to put locks in printk. If an AP gets part way up, takes the lock, and fails, everyone is going to stick on that lock. Not fun.
If we can get locking to work, auto-busting a lock is almost trivial. Try to get the lock for n iterations, after that print anyway and complain "lock held too long". Give me a trylock() function and I'll give you auto-busting code.
We have made a decision that the BSP is always assumed to work. Any strategy should be build around this assumption, and the further assumption that we ought to contain the AP so that it can not prevent the BSP from doing its job.
Absolutely agreed. Auto-busting the lock would solve this nicely and even detect hangs.
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
I never got this done, I only got the "AP post code" working. But overall I think my SMP startup prototype was much cleaner than what is in v2 today.
I don't understand v2 startup anyway (well, except for your new code).
Then please go ahead and learn to read it... It's important that we understand what's going on before changing it.
The new code just duplicates what was there, but it adds a couple of new breakages. It doesn't even add any conceptional differences.
I don't think we want to put locks in printk. If an AP gets part way up, takes the lock, and fails, everyone is going to stick on that lock. Not fun.
If we can get locking to work, auto-busting a lock is almost trivial. Try to get the lock for n iterations, after that print anyway and complain "lock held too long". Give me a trylock() function and I'll give you auto-busting code.
Please, please, don't get fancy. There's no real problem, we've just been doing too much cut and paste in the past without testing the new code. This made us end up with different versions of printk, some with locking, some without. And, no, porting the code from v3 over is not an option at this point. It does too much different stuff. Let's rather start dropping unneeded implementations in v2 until things look sane again and then we can decide what implementation we want.
We have made a decision that the BSP is always assumed to work. Any strategy should be build around this assumption, and the further assumption that we ought to contain the AP so that it can not prevent the BSP from doing its job.
Absolutely agreed. Auto-busting the lock would solve this nicely and even detect hangs.
I'm not sure I can interpret this into the above sentence. What would lock auto-busting gain us? We'd be debugging printk, but nothing more?
Stefan
On Fri, Jun 19, 2009 at 10:41 AM, Stefan Reinauerstepan@coresystems.de wrote:
Then please go ahead and learn to read it... It's important that we understand what's going on before changing it.
I think I understand it but it is very much C from assembly. It can be better.
The new code just duplicates what was there, but it adds a couple of new breakages. It doesn't even add any conceptional differences.
well, I partly agree. The point was to stay as much the same, but also allow the BSP to better monitor (and control) what was going on. I would still claim the structure of the code is a big improvement. The v2 SMP startup is not an easy read.
I think auto-busting a lock is a bad thing to put in. It's either a lock or it's not. If you need to auto-bust a lock it means your design is working on bad assumptions or is broken.
ron
On 19.06.2009 19:41, Stefan Reinauer wrote:
Carl-Daniel Hailfinger wrote:
I never got this done, I only got the "AP post code" working. But overall I think my SMP startup prototype was much cleaner than what is in v2 today.
I don't understand v2 startup anyway (well, except for your new code).
Then please go ahead and learn to read it... It's important that we understand what's going on before changing it.
I tried multiple times and gave up after it became obvious that v3 startup did similar stuff, but was a lot more understandable. I suck at understanding linker scripts. Lots of linker symbols and scripts and other funky trickery combined with code includes made the code unnecessarily hard to read. I know you understand early startup very well and I admire that.
Anyway, given two code versions A and B which seem to be functionally identical, but only one of them is easily understood, I'll always opt for replacing the complicated version with the easy version unless there are technical reasons not to do so (broken functionality etc.).
The new code just duplicates what was there, but it adds a couple of new breakages. It doesn't even add any conceptional differences.
Even if the differences are only conceptual, the coding style (and the IMHO over-usage of linker magic) is so radically different that I dread the current v2 code (well, not the brand new stuff from Ron). Once I know what's broken with the new code, I actually have a chance to understand and fix it.
Firmware is magic. Early CPU startup is even more magic. (I mean this in the best possible way.) We are magicians, but some of us are more advanced than others. I'm still learning. Having code which is understood by more than one magician is good. Don't get me wrong, I'm not advocating dumbing down the code, but if we have functionally equivalent code variants, the more readable/debuggable variant should be used IMHO.
Please, please, don't get fancy. There's no real problem, we've just been doing too much cut and paste in the past without testing the new code. This made us end up with different versions of printk, some with locking, some without.
The v3 printk has some neat features over what I can see in v2, at least from what I remember offhand: - detection of NULL dereference (helped find a bug) - detection of near-NULL dereference (same) - detection of non-ASCII characters in case of corruption (same) - reuse of vtxprintf for sprintf and printk - printk buffer - multiple output devices: USBDebug, Serial, Buffer
Since my intention is to keep v3 alive and to improve v2, I want to make very sure that we find out what the best variant of printk is (this may well be a fusion of v2 and v3 functionality), then have both versions use identical code. If there are technical reasons not to do so, I'd like to hear about them.
And, no, porting the code from v3 over is not an option at this point. It does too much different stuff. Let's rather start dropping unneeded implementations in v2 until things look sane again and then we can decide what implementation we want.
I had assumed the v3 printk was universally better than all of the v2 variants. Unifying all v2 printk would indeed be a huge step forward, but as long as I don't know which features we want, I risk eliminating the wrong variants. One thing v3 seems to handle better is multiple output modes (USB Debug, Serial, Buffer), where v2 seems to have sprouted a printk variant for each.
What would lock auto-busting gain us? We'd be debugging printk, but nothing more?
Sorry, it was a bad idea to bring up, mostly addressed at worries that printk may hang. Though if that happens, we have much more pressing problems than locking. Please forget I brought up lock-busting. Thanks.
Regards, Carl-Daniel
ron minnich wrote:
On Fri, Jun 19, 2009 at 9:13 AM, Myles Watsonmylesgw@gmail.com wrote:
Another option would be to only let the BSP print messages by default. That would clean up most people's logs most of the time.
yes, that is why I did that struct-based stack in my v3 SMP startup. The struct formed the base of the AP stack.
This sounds dangerous.. a stack is never "struct"ured.
We could put a simple print buffer in there and require that the BSP print out AP boot messages.
This is complexity from hell. Why not a decent locking mechanism. Plus, reducing the incredible amounts of useless output coreboot produces even at DEBUG level.
This would be a bit better than trying to resolve this locking issue.
Why? I disagree. Because we still need locking and inter cpu communication in order to print a string.
Do we really want to re-implement IPC and shared memory in coreboot? I hear we are becoming a kernel.
I never got this done, I only got the "AP post code"
v2 has / used to have working locking code since it was first ported to opteron. It may be that it broke while adding 5 more printks but it is there somewhere.
We have made a decision that the BSP is always assumed to work. Any strategy should be build around this assumption, and the further assumption that we ought to contain the AP so that it can not prevent the BSP from doing its job.
Making the BSP poll for the APs (which is what we would do if we need to check the APs shared memory) basically renders the BSP unusable to do stuff while waiting for the APs.
With simple locking, everything can run in parallel, and only serial output needs to get synced. Which is what we actually want.
Stefan
On 19.06.2009 19:33, Stefan Reinauer wrote:
This is complexity from hell. Why not a decent locking mechanism.
I see no reason not to use locking if it actually works. That's the reason I asked in the first place.
Plus, reducing the incredible amounts of useless output coreboot produces even at DEBUG level.
Worthwhile goal, but orthogonal to this discussion.
Do we really want to re-implement IPC and shared memory in coreboot? I hear we are becoming a kernel.
Not if it can be avoided. Shared memory may be good for big global variables which we don't want to duplicate, but it should not be used as a band-aid.
ron minnich wrote:
I never got this done, I only got the "AP post code"
v2 has / used to have working locking code since it was first ported to opteron. It may be that it broke while adding 5 more printks but it is there somewhere.
Any hints on where to look are appreciated. I'd like to use locking inside printk ASAP.
Making the BSP poll for the APs (which is what we would do if we need to check the APs shared memory) basically renders the BSP unusable to do stuff while waiting for the APs.
Agreed.
With simple locking, everything can run in parallel, and only serial output needs to get synced. Which is what we actually want.
As long as we don't perform byte-wise serial output locking (which would defeat the whole purpose of the lock which is having readable messages), I'm all for it. Show me the locking primitives and I'll add them to printk.
Regards, Carl-Daniel
src/smp/spinlock.h
ron
On 20.06.2009 00:20, ron minnich wrote:
src/smp/spinlock.h
Thanks, reading it now.
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
v2 has / used to have working locking code since it was first ported to opteron. It may be that it broke while adding 5 more printks but it is there somewhere.
Any hints on where to look are appreciated. I'd like to use locking inside printk ASAP.
Where does the locking algorithm fail? My guess is, it's in romcc or CAR code. Where we are out of luck with all approaches of synchronization so far.
do_printk is defined in src/console/printk.c as:
int do_printk(int msg_level, const char *fmt, ...) { va_list args; int i;
if (msg_level >= console_loglevel) { return 0; }
spin_lock(&console_lock);
va_start(args, fmt); i = vtxprintf(console_tx_byte, fmt, args); va_end(args);
console_tx_flush();
spin_unlock(&console_lock);
return i; }
(we should just rename this to printk and use it the v3 way instead of the printk_<loglevel> layers on top, but that's orthogonal, too)
Stefan
On 20.06.2009 00:47, Stefan Reinauer wrote:
Carl-Daniel Hailfinger wrote:
v2 has / used to have working locking code since it was first ported to opteron. It may be that it broke while adding 5 more printks but it is there somewhere.
Any hints on where to look are appreciated. I'd like to use locking inside printk ASAP.
Where does the locking algorithm fail? My guess is, it's in romcc or CAR code. Where we are out of luck with all approaches of synchronization so far.
Well, if the stack for the BSP and each AP are at different locations and there are no cache overlaps, using the BSP stack for lock variables should work even in the partial-CAR case. At least that's what I hope.
do_printk is defined in src/console/printk.c as:
int do_printk(int msg_level, const char *fmt, ...) { [...] spin_lock(&console_lock);
va_start(args, fmt); i = vtxprintf(console_tx_byte, fmt, args); va_end(args); console_tx_flush(); spin_unlock(&console_lock);
[...] }
do_printk is defined in src/arch/i386/lib/printk_init.c as almost identical function, but without console_tx_flush and without locking. If only one CPU uses the lockless variant, we lose. A quick test abuild with #error inserted in the lockless function shows we indeed use it for every freaking x86 target. That explains the supermicro/h8dme intertwined printk messages Ward is seeing.
Besides that, do we know where static spinlock_t console_lock is placed? I can't figure out from the linker where it is.
(we should just rename this to printk and use it the v3 way instead of the printk_<loglevel> layers on top, but that's orthogonal, too)
Yes please!
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
do_printk is defined in src/arch/i386/lib/printk_init.c as almost identical function, but without console_tx_flush and without locking. If only one CPU uses the lockless variant, we lose.
This is the version that was intended to be used when CONFIG_USE_INIT is set.
Does the _flush actually do much? I had the impression that we flush after each character anyways, but I'm no terminal expert.
Other than that, we could unify those versions by just defining an empty (for now) version of the spinlock functions in raminit stage. Then think about where we can place our locking for those platforms that need it this early...?
A quick test abuild with #error inserted in the lockless function shows we indeed use it for every freaking x86 target. That explains the supermicro/h8dme intertwined printk messages Ward is seeing.
They're from ram init stage afaict ...
Besides that, do we know where static spinlock_t console_lock is placed?
In RAM
grep console_lock *.map coreboot_ram.map:00122058 d console_lock smm.map:000a13fc t console_lock
(raminit stage symbols are in coreboot.map)
Stefan
On 20.06.2009 02:17, Stefan Reinauer wrote:
Carl-Daniel Hailfinger wrote:
do_printk is defined in src/arch/i386/lib/printk_init.c as almost identical function, but without console_tx_flush and without locking. If only one CPU uses the lockless variant, we lose.
This is the version that was intended to be used when CONFIG_USE_INIT is set.
Ah.
Does the _flush actually do much? I had the impression that we flush after each character anyways, but I'm no terminal expert.
_flush calls tx_flush for each driver, but all console drivers in the tree lack such a tx_flush function, so it's essentially an expensive no-op because all console driver structs are walked each time. Kill?
Other than that, we could unify those versions by just defining an empty (for now) version of the spinlock functions in raminit stage. Then think about where we can place our locking for those platforms that need it this early...?
Sounds like a plan.
A quick test abuild with #error inserted in the lockless function shows we indeed use it for every freaking x86 target. That explains the supermicro/h8dme intertwined printk messages Ward is seeing.
They're from ram init stage afaict ...
If we need them instead of the generic variants, we should know a reason for such usage.
Besides that, do we know where static spinlock_t console_lock is placed?
In RAM
So we'd get uninitialied data for any pre-RAM spinlock access? The v3 global variable mechanism should solve this nicely. At least it was designed for that.
grep console_lock *.map coreboot_ram.map:00122058 d console_lock smm.map:000a13fc t console_lock
(raminit stage symbols are in coreboot.map)
Thanks for the hint, now I know where to look.
Regards, Carl-Daniel
On Fri, Jun 19, 2009 at 5:46 PM, Carl-Daniel Hailfingerc-d.hailfinger.devel.2006@gmx.net wrote:
_flush calls tx_flush for each driver, but all console drivers in the tree lack such a tx_flush function, so it's essentially an expensive no-op because all console driver structs are walked each time. Kill?
no. It does no harm. And we may need it someday. The structure of it makes sense.
Other than that, we could unify those versions by just defining an empty (for now) version of the spinlock functions in raminit stage. Then think about where we can place our locking for those platforms that need it this early...?
Sounds like a plan.
it's more than spinlock.
you must also fix the use of the console_drivers struct. There are several things you need to get right to make this work.
I am not sure I see the point. You're going to make lots of changes and in the end, using cpp magic, have a unified function which acts differently depending on how it's compiled. eek.
It makes the most sense to me to have a RAM printk and a ROM printk which are for different purposes. They already share the most important common piece, which is vtxprintf.
But if you insist on unifying them you need to read the two parallel functions carefully and I guess do the #ifdef dance.
So we'd get uninitialied data for any pre-RAM spinlock access? The v3 global variable mechanism should solve this nicely. At least it was designed for that.
Again, please look at this code and make sure you know all the differences before you tear the wall open :-)
ron
On 20.06.2009 02:56, ron minnich wrote:
On Fri, Jun 19, 2009 at 5:46 PM, Carl-Daniel Hailfingerc-d.hailfinger.devel.2006@gmx.net wrote:
Other than that, we could unify those versions by just defining an empty (for now) version of the spinlock functions in raminit stage. Then think about where we can place our locking for those platforms that need it this early...?
Sounds like a plan.
it's more than spinlock.
you must also fix the use of the console_drivers struct. There are several things you need to get right to make this work.
v3 uses a hack which does not depend on the device tree at all. Basically, if you add a console driver to the output list in Kconfig, it will be called always. I need to dig up my patches to postpone/discard output before driver init.
coreboot-v3/lib/console.c:console_tx_byte() is the code I'm talking about. Maybe not the greatest code in the world, but without any RAM dependencies.
I am not sure I see the point. You're going to make lots of changes and in the end, using cpp magic, have a unified function which acts differently depending on how it's compiled. eek.
That's the nature of #if/#ifdef. I'm not advocating advanced cpp trickery because down that road lies madness and frustration.
It makes the most sense to me to have a RAM printk and a ROM printk which are for different purposes. They already share the most important common piece, which is vtxprintf.
But if you insist on unifying them you need to read the two parallel functions carefully and I guess do the #ifdef dance.
I advocate the compile once, use everywhere stance of v3.
So we'd get uninitialied data for any pre-RAM spinlock access? The v3 global variable mechanism should solve this nicely. At least it was designed for that.
Again, please look at this code and make sure you know all the differences before you tear the wall open :-)
Will do so tomorrow.
Regards, Carl-Daniel
On Fri, Jun 19, 2009 at 6:15 PM, Carl-Daniel Hailfingerc-d.hailfinger.devel.2006@gmx.net wrote:
v3 uses a hack which does not depend on the device tree at all. Basically, if you add a console driver to the output list in Kconfig, it will be called always. I need to dig up my patches to postpone/discard output before driver init.
v2 does not depend on the device tree at all. It depends on linker sets, specifically the linker set for console_drivers.
V3 -- I am not sure I understand you. printk in v3 calls console_tx_byte, which does this:
void console_tx_byte(unsigned char byte, void *arg) { #ifdef CONFIG_CONSOLE_BUFFER buffer_tx_byte(byte, arg); #endif #ifdef CONFIG_CONSOLE_SERIAL if (byte == '\n') { uart8250_tx_byte(TTYSx_BASE, '\r'); #ifdef CONFIG_CONSOLE_PREFIX uart8250_tx_byte(TTYSx_BASE, '\n'); uart8250_tx_byte(TTYSx_BASE, '('); uart8250_tx_byte(TTYSx_BASE, 'C'); uart8250_tx_byte(TTYSx_BASE, 'B'); uart8250_tx_byte(TTYSx_BASE, ')'); uart8250_tx_byte(TTYSx_BASE, ' '); return; #endif } uart8250_tx_byte(TTYSx_BASE, byte); #endif }
The only choices I know if in v3 are serail console and buffer console. How do we add others? Form what I can tell, by adding code. Is there something in there I am not aware of?
coreboot-v3/lib/console.c:console_tx_byte() is the code I'm talking about. Maybe not the greatest code in the world, but without any RAM dependencies.
Because it does not allow the addition of new consoles via linker sets as we do in v2. And it doesn't do locks. It's not smp-safe.
I am not opposed to getting rid of the v2 linker set approach for consoles. I think it would simplify life quite a bit. But let's take this a little slowly. I'd rather we get (e.g.) all of cbfs working before we take on more surgery.
There are four console drivers available in v2: src/console/vga_console.c src/console/logbuf_console.c src/console/uart8250_console.c src/console/btext_console.c
I wonder how many people even know it works this way? Or what __console means :-)
I advocate the compile once, use everywhere stance of v3.
But we need to face some facts here. v3 does not have the range of capability of v2. There is much that v2 does that v3 does not. it's not always a fair comparison for v3 to say "we are so much better" given that v2 can say "we support way more hardware" :-)
BTW, I mis-spoke: all the various printks do end upcalling do_printk. But it's a DIFFERENT printk for ROM and RAM code. ROM code calls the arch-dependent printk. RAM code calls the arch-independent printk in src/console/printk.c
The src/console/printk.c absolutely depends on linker sets, presence of RAM, spin_locks, and so on, as it calls src/console/console.c::console_tx_byte, which tests the initialized variable and uses console_drivers.
Again, I have no problem with changing this, at a later time, when we can think it through a bit more. But the RAM code arguably does what we want now -- SMP-safe printing. So let's get back to Ward's problem :-)
ron
On 20.06.2009 2:56 Uhr, ron minnich wrote:
it's more than spinlock.
you must also fix the use of the console_drivers struct. There are several things you need to get right to make this work.
I am not sure I see the point. You're going to make lots of changes and in the end, using cpp magic, have a unified function which acts differently depending on how it's compiled. eek.
The point is to have a single version of a printk like function. Two printk like functions would work nicely, too. But at the moment there's what, 6 implementations times 10 log levels? There's a lot of layers that we don't really need.
It makes the most sense to me to have a RAM printk and a ROM printk which are for different purposes. They already share the most important common piece, which is vtxprintf.
Yes. And a non-CAR printk, which is print, which works in romcc ;)
But if you insist on unifying them you need to read the two parallel functions carefully and I guess do the #ifdef dance.
that was never the discussion. at least from my side. The idea is to drop the macros that look like
#define printk_debug(foo...) do_printk(BIOS_DEBUG, foo)
and replace them by
#define printk(x...) do_printk(x)
and start using printk(BIOS_DEBUG, "%x\n", var); in stage 2 code. That also works nicely for the CAR case.
Just the ROMCC case needs to stay old, but that's fine. It's only there for old and broken architectures that can't do CAR, so it is allowed to look ugly.
On Sat, Jun 20, 2009 at 1:26 AM, Stefan Reinauerstepan@coresystems.de wrote:
#define printk_debug(foo...) do_printk(BIOS_DEBUG, foo)
and replace them by
#define printk(x...) do_printk(x)
That is fine. That's a simple and straightforward change. How about we start with that. But let's not do THAT change until we fix ward's problem, and this has nothing to do with Ward's problem.
thanks
ron
On Sat, Jun 20, 2009 at 10:13:04AM -0700, ron minnich wrote:
That is fine. That's a simple and straightforward change. How about we start with that. But let's not do THAT change until we fix ward's problem, and this has nothing to do with Ward's problem.
Sorry for opening this can of worms ;)
So, Stepan thinks that perhaps the stack is too small for the lzma decompression. I'm going to test next week with a 32KB stack (right now its at the default 8KB). This might solve booting, but I'll still have APs and BSP talking all at once, which I'm also seeing on K10.
Thanks, Ward.
On 20.06.2009 20:08, Ward Vandewege wrote:
On Sat, Jun 20, 2009 at 10:13:04AM -0700, ron minnich wrote:
That is fine. That's a simple and straightforward change. How about we start with that. But let's not do THAT change until we fix ward's problem, and this has nothing to do with Ward's problem.
Sorry for opening this can of worms ;)
So, Stepan thinks that perhaps the stack is too small for the lzma decompression. I'm going to test next week with a 32KB stack (right now its at the default 8KB).
Wait a second. We have two different possible problems to consider. 1. Every core which uses LZMA decompression needs at least 24k stack. 2. AP stacks are a lot smaller than the BSP stack by design, so even if you have a 24k stack for the BSP, it is unlikely you have sufficient stack size for the APs.
Can you test with 32k BSP stack size, but uncompressed AP code and default AP stack size? Unless CBFS code has some internal design/code limitations (and I didn't see any regarding the stack), the lzma stack requirements should not show up on APs for uncompressed AP code.
This might solve booting, but I'll still have APs and BSP talking all at once, which I'm also seeing on K10.
That's the locking problem which we can fix separately.
Regards, Carl-Daniel
On 20.06.2009 20:08, Ward Vandewege wrote:
[...] but I'll still have APs and BSP talking all at once, which I'm also seeing on K10.
So mangled printk problem is visible on K8 and Fam10?
I just read through the M57SLI early code path again and it seems we really have a race condition which causes stack corruption on SMP, possibly only visible if there is more than one AP. Or caches of each core are independent during early startup and the "every AP can use RAM" line of thought needs to be revised.
v3 is not a that different in this regard, but I see an easier way out there.
Anyway, I'll wait for someone else to show where I'm wrong before I proceed. Each of the situations above wants entirely different fixes for printk locking (or even booting).
Regards, Carl-Daniel
On Sun, Jun 21, 2009 at 02:39:31AM +0200, Carl-Daniel Hailfinger wrote:
On 20.06.2009 20:08, Ward Vandewege wrote:
[...] but I'll still have APs and BSP talking all at once, which I'm also seeing on K10.
So mangled printk problem is visible on K8 and Fam10?
Yes. Here's a K8 boot that works reliably on h8dme (r4314) but has some garbled output. Check right after it says "Clearing initial memory region: Done".
http://ward.vandewege.net/coreboot/h8dme/minicom-20090618q.cap
And here's a K10 boot that does not boot because there's something going on with the latest microcode for this cpu (but that's another issue I think). It has the garbled output right after "Start node 01 done." and eventually soft resets:
http://ward.vandewege.net/coreboot/h8dmr/fam10/h8dmr-am.cap
I just read through the M57SLI early code path again and it seems we really have a race condition which causes stack corruption on SMP, possibly only visible if there is more than one AP. Or caches of each core are independent during early startup and the "every AP can use RAM" line of thought needs to be revised.
Interesting. The K8 machine above is a supermicro h8dme with 2x dual core, and the K10 machine is a supermicro h8dmr with 2x quad core. The m57sli only has one cpu socket of course but I use it regularly with dual core cpus, and I have never noticed garbled output. I could be overlooking it of course, it's relatively subtle in the K8 example.
Thanks, Ward.
On Sat, Jun 20, 2009 at 02:08:08PM -0400, Ward Vandewege wrote:
On Sat, Jun 20, 2009 at 10:13:04AM -0700, ron minnich wrote:
That is fine. That's a simple and straightforward change. How about we start with that. But let's not do THAT change until we fix ward's problem, and this has nothing to do with Ward's problem.
Sorry for opening this can of worms ;)
So, Stepan thinks that perhaps the stack is too small for the lzma decompression. I'm going to test next week with a 32KB stack (right now its at the default 8KB). This might solve booting, but I'll still have APs and BSP talking all at once, which I'm also seeing on K10.
I tried with a 32KB stack and a 64KB stack, no change, the machine still resets itself:
http://ward.vandewege.net/coreboot/h8dme/minicom-20090624e.cap
So, something else is going on.
Thanks, Ward.
at the default 8KB). This might solve booting, but I'll still have APs
and
BSP talking all at once, which I'm also seeing on K10.
I tried with a 32KB stack and a 64KB stack, no change, the machine still resets itself:
http://ward.vandewege.net/coreboot/h8dme/minicom-20090624e.cap
Maybe this is a silly suggestion, but how hard would it be to write a script that unscrambled the output?
If I only knew how to find it on Google (I tried briefly), I'm sure something similar has already been written.
It seems like it is a simpler problem because we know the maximum number of processors and the possible format strings. I wish I had the time to play with it because I think it would be fun to write.
It seems like the priorities are: 1. Stop the APs from executing when they shouldn't. 2. Fix locking if it's still a problem.
I think the garbled output provides a good hint (although it is annoying) at what's wrong.
Thanks, Myles
On 24.06.2009 18:29, Myles Watson wrote:
at the default 8KB). This might solve booting, but I'll still have APs
and
BSP talking all at once, which I'm also seeing on K10.
I tried with a 32KB stack and a 64KB stack, no change, the machine still resets itself:
http://ward.vandewege.net/coreboot/h8dme/minicom-20090624e.cap
Maybe this is a silly suggestion, but how hard would it be to write a script that unscrambled the output?
I did it by hand. Not pleasant, but see below. [...] Copying data from cache to RAM -- switching to use RAM as stack... Done testx = 5a5a5a5a Disabling cache as ram now Clearing initial memory region: Done Jumping to image. Jumping to image. Check CBFS header at fffe0fd0 Check CBFS header at fffe0fd0 magic is 4f524243 Found CBFS header at fffe0fd0 magic is 4f524243 Found CBFS header at fffe0fd0 Check normal/payload CBFS: follow chain: Check normal/payload ffCfB0F0S0:0 0f o+l l2o8w +c h1a0i0n7:6 f+f fa0l0i0g0n0 -+> 2f8f f+1 0100a0076 +Ch eaclki gnno r-m>a lf/fcfo1r0e0bao0ot _Crhaemck
It's obvious that the CBFS code is called on each core. Note that even if CBFS can handle this, concurrent lzma decompression to the same area is guaranteed to kill the machine: lzma uses the destination memory range as scratch space during decompression.
Regards, Carl-Daniel
I did it by hand. Not pleasant, but see below.
Too bad we both did it.
[...] Copying data from cache to RAM -- switching to use RAM as stack... Done testx = 5a5a5a5a Disabling cache as ram now Clearing initial memory region: Done
I think it's interesting that this is the first duplicated message:
Jumping to image. Jumping to image.
Have you looked at what happens there that might be releasing an AP?
Also, didn't you say it was dual node, dual core? I wonder why there are only two copies of the messages. Do you think they're both on the same node, or both the first processor per node?
Thanks, Myles
On Wed, Jun 24, 2009 at 11:24:10AM -0600, Myles Watson wrote:
I did it by hand. Not pleasant, but see below.
Too bad we both did it.
Yeah, sorry guys :/ I had written a little script to print every other character but it didn't come out as clean as you both did it.
[...] Copying data from cache to RAM -- switching to use RAM as stack... Done testx = 5a5a5a5a Disabling cache as ram now Clearing initial memory region: Done
I think it's interesting that this is the first duplicated message:
Jumping to image. Jumping to image.
Have you looked at what happens there that might be releasing an AP?
Also, didn't you say it was dual node, dual core? I wonder why there are only two copies of the messages. Do you think they're both on the same node, or both the first processor per node?
Correct, this is a dual node, dual core system, so 4 cores in total, two per cpu.
Thanks, Ward.
So, I got it to boot by setting CONFIG_AP_CODE_IN_CAR to off. That way the APs don't stumble over one another, and the machine boots just fine.
Here's the boot log:
http://ward.vandewege.net/coreboot/h8dme/minicom-20090624g.cap
You'll see a couple garbled lines, but I think that's just each of the APs saying they are ready.
So, conclusion: commit 4315 broke K8 when CONFIG_AP_CODE_IN_CAR is enabled.
Patrick, do you know why that would be?
Also, what is the benefit, if any, of having CONFIG_AP_CODE_IN_CAR enabled?
Thanks, Ward.
So, I got it to boot by setting CONFIG_AP_CODE_IN_CAR to off. That way the APs don't stumble over one another, and the machine boots just fine.
Here's the boot log:
http://ward.vandewege.net/coreboot/h8dme/minicom-20090624g.cap
You'll see a couple garbled lines, but I think that's just each of the APs saying they are ready.
I'm glad it works, but it still looks broken to me. Those lines look like ... TrainDQS RdWrPos1: ... in there.
Myles
On Wed, Jun 24, 2009 at 12:51:29PM -0600, Myles Watson wrote:
So, I got it to boot by setting CONFIG_AP_CODE_IN_CAR to off. That way the APs don't stumble over one another, and the machine boots just fine.
Here's the boot log:
http://ward.vandewege.net/coreboot/h8dme/minicom-20090624g.cap
You'll see a couple garbled lines, but I think that's just each of the APs saying they are ready.
I'm glad it works, but it still looks broken to me. Those lines look like ... TrainDQS RdWrPos1: ... in there.
I think you are right. So, we really need to get some sort of printk locking in there so that we can diagnose what's going on with less guesswork.
There's this comment in src/mainboard/tyan/s2912_fam10/cache_as_ram_auto.c:
/* wait for all the APs core0 started by finalize_node_setup. */ /* FIXME: A bunch of cores are going to start output to serial at once. * It would be nice to fixup prink spinlocks for ROM XIP mode. * I think it could be done by putting the spinlock flag in the cache * of the BSP located right after sysinfo. */ wait_all_core0_started();
I got a little lost in the whole locking discussion; is the above a way to ungarble the output in this particular case?
Thanks, Ward.
On Wed, Jun 24, 2009 at 12:51:29PM -0600, Myles Watson wrote:
So, I got it to boot by setting CONFIG_AP_CODE_IN_CAR to off. That way the APs don't stumble over one another, and the machine boots just fine.
Here's the boot log:
http://ward.vandewege.net/coreboot/h8dme/minicom-20090624g.cap
You'll see a couple garbled lines, but I think that's just each of the APs saying they are ready.
I'm glad it works, but it still looks broken to me. Those lines look like ... TrainDQS RdWrPos1: ... in there.
Ugh, I spoke too soon. Booting with CONFIG_AP_CODE_IN_CAR off reduced me to 1 core. Sigh.
Patrick, I'm really interested in figuring out why you are seeing none of these problems with the K8 boards you tested 4315 on.
Thanks, Ward.
On Wed, Jun 24, 2009 at 03:05:26PM -0400, Ward Vandewege wrote:
On Wed, Jun 24, 2009 at 12:51:29PM -0600, Myles Watson wrote:
So, I got it to boot by setting CONFIG_AP_CODE_IN_CAR to off. That way the APs don't stumble over one another, and the machine boots just fine.
Here's the boot log:
http://ward.vandewege.net/coreboot/h8dme/minicom-20090624g.cap
You'll see a couple garbled lines, but I think that's just each of the APs saying they are ready.
I'm glad it works, but it still looks broken to me. Those lines look like ... TrainDQS RdWrPos1: ... in there.
Ugh, I spoke too soon. Booting with CONFIG_AP_CODE_IN_CAR off reduced me to 1 core. Sigh.
Disregard that - caused by something else. All cores are visible.
Thanks, Ward.
On Wed, Jun 24, 2009 at 1:05 PM, Ward Vandewege ward@gnu.org wrote:
Patrick, I'm really interested in figuring out why you are seeing none of these problems with the K8 boards you tested 4315 on.
Has anyone else seen these garbled messages? Your 4314 logs had them too. This looks like an unrelated problem that got exposed by the change to me.
Thanks, Myles
Ward,
At the point where it starts to get garbled, can you insert an endless loop for a processor that isn't the BSP? Or insert a loop that prints the processor's number in a loop. Just a sanity check.
Thanks, Myles
On Wed, Jun 24, 2009 at 10:46 AM, Myles Watson mylesgw@gmail.com wrote:
Ward,
At the point where it starts to get garbled, can you insert an endless loop for a processor that isn't the BSP? Or insert a loop that prints the processor's number in a loop. Just a sanity check.
Thanks, Myles
Found CBFS 4f524243
Check normal/payload CBFS: follow chain: fff00000 + 28 + 10076 + align -> fff100a6 Check normal/coreboot_ram CBFS: follow chain: fff100a0 + 38 + 4a01c + align -> fff5a100 Check normal/coreboot_apc CBFS: follow chain: fff5a100 + 38 + 2be8 + align -> fff5cd20 Check fallback/payload CBFS: follow chain: fff5cd20 + 38 + 10076 + align -> fff6cdd0 Check fallback/corebot_ram Stage: load @ 1048576/696320 bytes, enter @ 100000 4a01c + : after memset 0on -stack variables at 001fedf8 and 001fee04 cbfs_decompress: algo: 0c8000 : uncompressed
I'm done doing that by hand for a while :)
So one processor loads coreboot_ram and jumps there, then another does memset on the same area so that it can load it?
Thanks, Myles
On 20.06.2009 2:46 Uhr, Carl-Daniel Hailfinger wrote:
A quick test abuild with #error inserted in the lockless function shows we indeed use it for every freaking x86 target. That explains the supermicro/h8dme intertwined printk messages Ward is seeing.
They're from ram init stage afaict ...
If we need them instead of the generic variants, we should know a reason for such usage.
Yes, the reasons are
- no console "drivers" in the stage2/coreboot_ram style at this level - no spinlocking because we have no ram that is available to and shared by all cpus
Besides that, do we know where static spinlock_t console_lock is placed?
In RAM
So we'd get uninitialied data for any pre-RAM spinlock access? The v3 global variable mechanism should solve this nicely. At least it was designed for that.
But was it designed for safe IPC (Inter Processor Communication ;-) ?
Stefan
It has become clear to me in this discussion that we are, each of us, talking about different things.
Let's step back from the edge of the cliff and start over, please. I just got really, really confused.
ron
On 20.06.2009 19:14, ron minnich wrote:
It has become clear to me in this discussion that we are, each of us, talking about different things.
Yes, it seems so.
Let's step back from the edge of the cliff and start over, please. I just got really, really confused.
Agreed. I'll try to write down precisely what I propose to fix Ward's problem.
Regards, Carl-Daniel
On Fri, Jun 19, 2009 at 5:17 PM, Stefan Reinauerstepan@coresystems.de wrote:
Carl-Daniel Hailfinger wrote:
do_printk is defined in src/arch/i386/lib/printk_init.c as almost identical function, but without console_tx_flush and without locking. If only one CPU uses the lockless variant, we lose.
This is the version that was intended to be used when CONFIG_USE_INIT is set.
reminder. do_printk is (or should be) used ONLY in the CAR code. It does not use variables that are only available in the RAM code, such as the console_drivers structure . Don't assume you can just stop using it. It is that way for a reason, as I found out very recently with the qemu CAR changes.
Note that printk calls console_tx_byte, which does this: static void __console_tx_byte(unsigned char byte) { struct console_driver *driver; for(driver = console_drivers; driver < econsole_drivers; driver++) { driver->tx_byte(byte); } }
do_printk calls its very own console_tx_byte, which is this:
void console_tx_byte(unsigned char byte) { if (byte == '\n') uart8250_tx_byte(TTYS0_BASE, '\r'); uart8250_tx_byte(TTYS0_BASE, byte); }
So it's a lot more than just not calling console_tx_flush. These are not insignificant differences.
I am just trying to wave a caution flag here.
You must take some care before you decide you can replace do_printk with printk. They're not even compiled into the same stage (or should not be).
If you want locking in do_printk, it will have to work in CAR. I actually think this is an area you need to be very careful in changing.
ron
On 20.06.2009 02:50, ron minnich wrote:
reminder. do_printk is (or should be) used ONLY in the CAR code.
AFAICS all printk_* calls are #defined to do_printk.
It does not use variables that are only available in the RAM code, such as the console_drivers structure . Don't assume you can just stop using it. It is that way for a reason, as I found out very recently with the qemu CAR changes.
Note that printk calls console_tx_byte, which does this: static void __console_tx_byte(unsigned char byte) { struct console_driver *driver; for(driver = console_drivers; driver < econsole_drivers; driver++) { driver->tx_byte(byte); } }
do_printk calls its very own console_tx_byte, which is this:
void console_tx_byte(unsigned char byte) { if (byte == '\n') uart8250_tx_byte(TTYS0_BASE, '\r'); uart8250_tx_byte(TTYS0_BASE, byte); }
So it's a lot more than just not calling console_tx_flush. These are not insignificant differences.
I am just trying to wave a caution flag here.
And I have trouble understanding you. Either we are looking at different source trees or it's too late at night for me to understand. Is there any variant of printk_* which does not use do_printk()? Or did you mean the two different variants of do_printk()?
You must take some care before you decide you can replace do_printk with printk. They're not even compiled into the same stage (or should not be).
If you want locking in do_printk, it will have to work in CAR. I actually think this is an area you need to be very careful in changing.
I firmly believe we can rig something up. It might involve a lot of v3 code, though.
Regards, Carl-Daniel
On 20.06.2009 2:50 Uhr, ron minnich wrote:
On Fri, Jun 19, 2009 at 5:17 PM, Stefan Reinauerstepan@coresystems.de wrote:
Carl-Daniel Hailfinger wrote:
do_printk is defined in src/arch/i386/lib/printk_init.c as almost identical function, but without console_tx_flush and without locking. If only one CPU uses the lockless variant, we lose.
This is the version that was intended to be used when CONFIG_USE_INIT is set.
reminder. do_printk is (or should be) used ONLY in the CAR code. It does not use variables that are only available in the RAM code, such as the console_drivers structure .
All versions of printk_debug() use a do_printk() implementation.
Then there's print_debug, being the weaker version of printk_debug. But print_debug which can not do variables should not be used in CAR code. Instead you should set the CONFIG_USE_PRINTK_IN_CAR (?) option and use printk_debug instead.
printk_debug is a wrapper around do_printk() which calls do_printk with BIOS_DEBUG as first parameter, that's it.
There is a do_printk for CAR in src/arch/i386/lib/printk_init.c however and one for stage2 in src/console/printk.c
Don't assume you can just stop using it. It is that way for a reason, as I found out very recently with the qemu CAR changes.
Note that printk calls console_tx_byte, which does this: static void __console_tx_byte(unsigned char byte) { struct console_driver *driver; for(driver = console_drivers; driver < econsole_drivers; driver++) { driver->tx_byte(byte); } }
do_printk calls its very own console_tx_byte, which is this:
void console_tx_byte(unsigned char byte) { if (byte == '\n') uart8250_tx_byte(TTYS0_BASE, '\r'); uart8250_tx_byte(TTYS0_BASE, byte); }
So it's a lot more than just not calling console_tx_flush. These are not insignificant differences.
I am just trying to wave a caution flag here.
Absolutely agreed, and we shouldn't change anything on the struct console_driver level. At all
You must take some care before you decide you can replace do_printk with printk.
Nope... I started doing some replacements in our internal tree a few weeks/months back and it works like a charme.
They're not even compiled into the same stage (or should not be).
There does not even exist a printk in coreboot v2, right now. So they are only compiled into different projects ;)
Stefan
On Fri, Jun 19, 2009 at 4:46 PM, Carl-Daniel Hailfingerc-d.hailfinger.devel.2006@gmx.net wrote:
do_printk is defined in src/arch/i386/lib/printk_init.c as almost identical function, but without console_tx_flush and without locking. If only one CPU uses the lockless variant, we lose.
it's very useful. It can be used in the rom/CAR code before consoles really work.
This kind of thing is common. Just about all OSes have a "printk before printk" function.
So be careful. I use this one already in qemu.
(we should just rename this to printk and use it the v3 way instead of the printk_<loglevel> layers on top, but that's orthogonal, too)
Yes please!
one thing at a time. That is on a list, but let's focus on this thread.
thanks
ron
On 20.06.2009 02:39, ron minnich wrote:
On Fri, Jun 19, 2009 at 4:46 PM, Carl-Daniel Hailfinger wrote
do_printk is defined in src/arch/i386/lib/printk_init.c as almost identical function, but without console_tx_flush and without locking. If only one CPU uses the lockless variant, we lose.
it's very useful. It can be used in the rom/CAR code before consoles really work.
Hm. AFAICS both versions have the same console requirements and just differ in locking (if you ignore the _flush no-op).
This kind of thing is common. Just about all OSes have a "printk before printk" function.
True, but our console requirements for both versions are pretty minimal and I think we showed in v3 that a one-size-fits-all printk can work fine.
So be careful. I use this one already in qemu.
Since the spinlock seems to live outside the CAR area, locking will break for any processor not having working RAM access. The solution is v3-style global variables as outlined in my other mail.
(we should just rename this to printk and use it the v3 way instead of the printk_<loglevel> layers on top, but that's orthogonal, too)
Yes please!
one thing at a time. That is on a list, but let's focus on this thread.
Agreed. Besides that, such a conversion would allow me to use Coccinelle again, so I'll probably try to do that myself once I have time.
Regards, Carl-Daniel
On Fri, Jun 19, 2009 at 5:57 PM, Carl-Daniel Hailfingerc-d.hailfinger.devel.2006@gmx.net wrote:
On 20.06.2009 02:39, ron minnich wrote:
On Fri, Jun 19, 2009 at 4:46 PM, Carl-Daniel Hailfinger wrote
do_printk is defined in src/arch/i386/lib/printk_init.c as almost identical function, but without console_tx_flush and without locking. If only one CPU uses the lockless variant, we lose.
it's very useful. It can be used in the rom/CAR code before consoles really work.
Hm. AFAICS both versions have the same console requirements and just differ in locking (if you ignore the _flush no-op).
not at all. They are quite different in purpose as my earlier email points out. printk prints to all console_drivers and is intended for use when RAM is up and working. do_printk is meant for pre-RAM usage.
I actually did not write any of this stuff, but I got acquainted with it in the last two weeks.
True, but our console requirements for both versions are pretty minimal and I think we showed in v3 that a one-size-fits-all printk can work fine.
I do not recall that v3 printk was heavily tested with smp. In fact it's not smp safe from my reading of the code.
v3 printk looks most like do_printk in v2. Part of the reason is that we had decided, long ago, that the v3 console would only be serial. So we gave up the flexibility of v2 but gained simplicity.
Actually, I can verify that v3 printk works badly on multi-core, since I saw the intermixed output during my abortive attempt at getting core 2 going on v3. v3 does NOT solve the problem.
I'll say it again: getting this right is going to require careful thought and ripping up of the first couple ideas. It only looks simple. CAR makes things a bit messy, and the wide variant in CAR behavior among different vendor's CPUs makes it really messy. One size will NOT fit all.
ron
On 20.06.2009 03:04, ron minnich wrote:
True, but our console requirements for both versions are pretty minimal and I think we showed in v3 that a one-size-fits-all printk can work fine.
I do not recall that v3 printk was heavily tested with smp. In fact it's not smp safe from my reading of the code.
Actually, I can verify that v3 printk works badly on multi-core, since I saw the intermixed output during my abortive attempt at getting core 2 going on v3. v3 does NOT solve the problem.
True, because it does not use locking. v3 should be easy to convert to locking which actually works.
Regards, Carl-Daniel
On Fri, Jun 19, 2009 at 6:18 PM, Carl-Daniel Hailfingerc-d.hailfinger.devel.2006@gmx.net wrote:
True, because it does not use locking. v3 should be easy to convert to locking which actually works.
we're back where we started. The question is, is there a locking construct that will work on CAR in common for every CPU. I've come to believe that's a lot harder than I thought it was.
You also pointed this problem out and were looking at using other means -- cpu registers, etc -- to achieve the goal of locking.
The locking for the RAM part is a solved problem, and it works in v2, so I see no need to revisit that question.
The question is the locking for the ROM code, given all the different types of CAR we deal with. Do you have a solution to that problem? Let's start there first.
ron
On 20.06.2009 2:57 Uhr, Carl-Daniel Hailfinger wrote:
On 20.06.2009 02:39, ron minnich wrote:
On Fri, Jun 19, 2009 at 4:46 PM, Carl-Daniel Hailfinger wrote
do_printk is defined in src/arch/i386/lib/printk_init.c as almost identical function, but without console_tx_flush and without locking. If only one CPU uses the lockless variant, we lose.
it's very useful. It can be used in the rom/CAR code before consoles really work.
Hm. AFAICS both versions have the same console requirements and just differ in locking (if you ignore the _flush no-op).
This kind of thing is common. Just about all OSes have a "printk before printk" function.
True, but our console requirements for both versions are pretty minimal and I think we showed in v3 that a one-size-fits-all printk can work fine.
It didn't, that's the point.. No locking, and it destroyed output of i.e. a GDB interface by appending (CB) to each line. Plus, the ifdef mess is really not (much?) nicer to look at than the linker sets.
One thing I think v3 does right is to not come with a VGA console. I still think that every line of coreboot output is always a debugging message. If your machine is so b0rked you can't get to the payload (but survive far enough to enable vga, which wis realistic in 0% of the cases, except you're a developer working on the table creation code) you shouldn't be bothered with text messages on vga but instead connect a serial console.
Stefan
On 20.06.2009 10:31, Stefan Reinauer wrote:
On 20.06.2009 2:57 Uhr, Carl-Daniel Hailfinger wrote:
True, but our console requirements for both versions are pretty minimal and I think we showed in v3 that a one-size-fits-all printk can work fine.
It didn't, that's the point.. No locking,
True. I worked a lot on printk and the biggest reasons I didn't add locking were: 1. I have no SMP hardware (and I don't know if we support SMP Qemu) 2. I simply didn't think of it and I can't remember anyone demanded locking for printk.
and it destroyed output of i.e. a GDB interface by appending (CB) to each line. Plus, the ifdef mess is really not (much?) nicer to look at than the linker sets.
Fortunately, (CB) is default off. Back then, I argued against this particular feature, but it was desired by some v3 developers (don't exactly remember who). If there is no such demand anymore, we can drop this piece of code.
Regards, Carl-Daniel
ron minnich wrote:
yes, that is why I did that struct-based stack in my v3 SMP startup.
So, which problem does this solve, the pre-ram locking or the post-ram locking?
- the pre-ram locking can't be done with a stack, because the cache between CPUs is not always necessarily in the same state. -the post-ram code does not need it, works quite nicely already.
This approach scares me... it doesn't solve the one case but makes the other case much more complex.
Please explain... I must be getting something wrong.
On Fri, Jun 19, 2009 at 10:49 AM, Stefan Reinauerstepan@coresystems.de wrote:
ron minnich wrote:
yes, that is why I did that struct-based stack in my v3 SMP startup.
So, which problem does this solve, the pre-ram locking or the post-ram locking?
it solves the problem of the fact that many people find the v2 smp startup code unreadable.
- the pre-ram locking can't be done with a stack, because the cache
between CPUs is not always necessarily in the same state.
well, that may be true on intel stuff. The AMD startup (at least as I understand it) depends on the BSP memory being functional enough to provide the APs with a stack.
-the post-ram code does not need it, works quite nicely already.
actually, this is only partially true. It is still possible for a malfunctioning AP to lock the BSP out. It's just not something we've seen much of.
This approach scares me... it doesn't solve the one case but makes the other case much more complex.
Please explain... I must be getting something wrong.
The struct-based stack is a direct copy of the v2 startup. Rather than using lots of fiddly offsets onto a memory area, it provides a struct which contains variables and a stack. The variables are shared between the AP and the BSP. It is a more C-like way to do it. Once I did it I realized that the on-stack variables could be used as a communications path from the AP to the BSP, most important one being a POST variable that could be set by the AP and monitored by the BSP. This is a bit better than what we have in v2, where we get one bit back from the AP which tells us "done" or "not done". No real progress indication is available. At some point, the BSP times out the AP, but there is no error code. Plus, the way in which the shared variable is set up in v2 is not very straightforward.
Anyway, I'm OK if we want to stick with what we have, but if we start hearing more about busting locks, then let's reexamine our assumptions.
ron
On 19.06.2009 18:13, Myles Watson wrote:
as we move into the multicore world, messages get harder and harder to read when they are interspersed byte-wise. I propose to add optional locking to printk.
This should only matter for the time during RAM, right? Is it a larger issue?
Once we have printk in CAR for almost all CPUs, it matters even for CAR stage.
However, this locking has to work when some CPUs are in CAR and others are in RAM. And that's the big problem.
Yuck.
At least that's what I believe can happen on K8 and later platforms.
AFAICS we can't simply use atomic normal memory acceses since the CPUs in CAR won't see that. Atomic register modifications are not possible across processors AFAICS. The only idea I have is abusing some MMIO chipset/CPU space which is usable as scratch space. In theory, that should work. But do CPUs or chipsets have such MMIO areas?
They do. It's hard to know if someone is already (ab)using them, though.
Well, a quick (or not so quick) coreboot code search should help find out if the areas in question are already used. Do you know of an example fur such a MMIO area for the K8 with 690/SB600? That would help me implement a proof of concept.
Of course, I'm aware that locking (if possible at all) will be highly CPU/chipset dependent, but even if we can make locking work on only a subset of platforms, it is still a lot better than nothing.
Are you thinking of putting it in the ops of the cpu device? I guess that only works after CAR.
Actually, I'd use it as a function outside the device tree inside CPU and/or chipset code. The device tree is available too late to be useful here.
Another option would be to only let the BSP print messages by default. That would clean up most people's logs most of the time.
Except for the problem Ward is seeing right now. For that, we really want the AP messages. And preferably they should be readable (not intertwined).
Regards, Carl-Daniel
This should only matter for the time during RAM, right? Is it a larger issue?
Once we have printk in CAR for almost all CPUs, it matters even for CAR stage.
I meant the RAM init time when you're in CAR. After CAR the APs should be stopped until CPU init time, which is relatively short.
They do. It's hard to know if someone is already (ab)using them,
though.
Well, a quick (or not so quick) coreboot code search should help find out if the areas in question are already used. Do you know of an example fur such a MMIO area for the K8 with 690/SB600? That would help me implement a proof of concept.
From the BKD for Opterons:
3.5.13 Scratch Register Scratch Register Function 2: Offset 9Ch
Bits Mnemonic Function R/W Reset 31-0 Data Scratch Data R/W 0h
Field Descriptions Scratch Data (Data)-Bits 31-0.
So you get one per socket, first one at PCI_DEV(0,0x18,2) 0x9C
Another option would be to only let the BSP print messages by default.
That
would clean up most people's logs most of the time.
Except for the problem Ward is seeing right now. For that, we really want the AP messages. And preferably they should be readable (not intertwined).
You're right. He needs to know that his APs are still running even though they should be stopped. I guess the intertwined messages tell him that.
Thanks, Myles
On 19.06.2009 19:08, Myles Watson wrote:
This should only matter for the time during RAM, right? Is it a larger issue?
Once we have printk in CAR for almost all CPUs, it matters even for CAR stage.
I meant the RAM init time when you're in CAR. After CAR the APs should be stopped until CPU init time, which is relatively short.
Ah. That part was never clear to me. Thanks for the info.
Do you know of an example fur such a MMIO area for the K8 with 690/SB600? That would help me implement a proof of concept.
From the BKD for Opterons:
3.5.13 Scratch Register Scratch Register Function 2: Offset 9Ch
Bits Mnemonic Function R/W Reset 31-0 Data Scratch Data R/W 0h
Field Descriptions Scratch Data (Data)-Bits 31-0.
So you get one per socket, first one at PCI_DEV(0,0x18,2) 0x9C
AFAICS this is not MMIO, so it's unusable for locking. I did look at the Family 0Fh BKDG. Should I have looked elsewhere?
Regards, Carl-Daniel
Field Descriptions Scratch Data (Data)-Bits 31-0.
So you get one per socket, first one at PCI_DEV(0,0x18,2) 0x9C
AFAICS this is not MMIO, so it's unusable for locking.
? I thought you were looking for config space. Why does it need to be MMIO?
I did look at the Family 0Fh BKDG. Should I have looked elsewhere?
I'm not an expert. I'd just seen this scratch register recently.
Thanks, Myles
On 20.06.2009 00:02, Myles Watson wrote:
Field Descriptions Scratch Data (Data)-Bits 31-0.
So you get one per socket, first one at PCI_DEV(0,0x18,2) 0x9C
AFAICS this is not MMIO, so it's unusable for locking.
? I thought you were looking for config space. Why does it need to be MMIO?
I'm unaware of any method to perform atomic value swaps or atomic increments in config space. Unless I'm mistaken, such atomic operations are required for locking.
I did look at the Family 0Fh BKDG. Should I have looked elsewhere?
I'm not an expert. I'd just seen this scratch register recently.
Thanks for digging it up. It seems that some CPU revisions change the definition to be partially a scratch register.
Regards, Carl-Daniel
On Fri, Jun 19, 2009 at 9:02 AM, Carl-Daniel Hailfingerc-d.hailfinger.devel.2006@gmx.net wrote:
However, this locking has to work when some CPUs are in CAR and others are in RAM. And that's the big problem.
Let's go back to this original statement. I think it's not correct. We had decided, on opteron, that bsp does all memory setup -- i.e. sets up all memory controllers on all sockets. IIRC the opteron code starts APs up when all of memory is functional. I believe our Intel startup does this too: BSP sets up memory, and when APs start, memory is working and usable for stack etc.
If what I am saying is true, that all APs run AFTER the BSP has set up memory, then there is no issue: all locking can be based in memory, since NO AP needs to run in CAR mode -- memory is working. IIRC this is how K8 actually works.
So can we confirm my idea, that no AP need ever run in CAR? If they do need to run in CAR, why?
thanks
ron
On 20.06.2009 21:38, ron minnich wrote:
On Fri, Jun 19, 2009 at 9:02 AM, Carl-Daniel Hailfinger c-d.hailfinger.devel.2006@gmx.net wrote:
However, this locking has to work when some CPUs are in CAR and others are in RAM. And that's the big problem.
Let's go back to this original statement. I think it's not correct. We had decided, on opteron, that bsp does all memory setup -- i.e. sets up all memory controllers on all sockets. IIRC the opteron code starts APs up when all of memory is functional. I believe our Intel startup does this too: BSP sets up memory, and when APs start, memory is working and usable for stack etc.
Unless I'm misreading the K8/Fam10 CAR code in v2, they are a bit peculiar. Please correct me if I'm wrong. 1. K8 code does not care at all if it runs on BSP or AP. Each core which enters v2/src/cpu/amd/car/cache_as_ram.inc locks the cache, clobbers the same area (all of CAR), then proceeds to call cache_as_ram_main() on each core which entered. This means if any AP ever enters v2/src/cpu/amd/car/cache_as_ram.inc, we either have each of them overwriting the stack of all other APs if they can access RAM, or they are indeed in CAR mode when the BSP is already in RAM mode (bad). 2. Fam10 code cares if it runs on BSP or AP. It takes care to give every core its own stack area (different locations), does not clobber the CAR area at all, then proceeds to call cache_as_ram_main() on each core which entered.
So if there is some code which makes sure no AP ever enters v2/src/cpu/amd/car/cache_as_ram.inc, we can kill off all AP handling code in there. However, if APs can enter v2/src/cpu/amd/car/cache_as_ram.inc, we either have some real corruption race conditions on K8 or your statement above (regarding usable memory) contradicts the code.
If what I am saying is true, that all APs run AFTER the BSP has set up memory, then there is no issue: all locking can be based in memory, since NO AP needs to run in CAR mode -- memory is working. IIRC this is how K8 actually works.
Then the only issue for locking is to find out if we are in CAR (ignore locking) or in RAM (use locking).
So can we confirm my idea, that no AP need ever run in CAR? If they do need to run in CAR, why?
Please see above. I'm now thoroughly confused, but if the situation is better than I hoped, I'm the first to celebrate.
It is really great that we talk about this, now I'm starting to understand multicore startup. Thanks!
Regards, Carl-Daniel
On Sat, Jun 20, 2009 at 4:49 PM, Carl-Daniel Hailfingerc-d.hailfinger.devel.2006@gmx.net wrote:
Unless I'm misreading the K8/Fam10 CAR code in v2, they are a bit peculiar. Please correct me if I'm wrong.
- K8 code does not care at all if it runs on BSP or AP. Each core which
enters v2/src/cpu/amd/car/cache_as_ram.inc locks the cache, clobbers the
is your assumption that they all enter that code valid? You need to see what address the startup IPI from BSP to AP contains. IIRC it contains an address in DRAM. I did study that code but it was a few months ago. Take a look at how the BSP wakes up the AP and where the startup IPI address is.
In fact each AP gets its own stack. in the code I wrote for v3, potentially, each AP could get its own code block for startup.
Anyway you can double check me.
I think you guys are on to something,but we need to study this code a bit more.
thanks
ron