On Tue, 2008-04-08 at 01:13 +0200, Peter Stuge wrote:
True, but it's not very easy to regularly poke the timer without an interrupt - or the code will be full of try_poke() every other line.
What about the mostly existing code that pokes port 0x80? Make it a macro that also pokes the WDT?
If there are "long" sections for a single POST code (like clearing 4GB ECC ram), having a bit toggling would be less boring to watch anyhow.
I was reading the datasheet for the Intel 845 chipset, and there seem to be a lot of registers outside the reset domain of the processor. Adding a 3rd poke of the POST code to a scratch register like that would mean that the last POST code before the WDT resets the CPU would be available to the next incarnation (perhaps fallback).
Hmm, very interesting.
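To make the scratch-register idea concrete, here is a minimal sketch. It assumes nothing about the real Intel 845 register layout: the port 0x80 latch and the scratch register are plain variables, the function names are invented for illustration, and 0x00 is reserved (an assumption of this sketch) to mean "no POST code recorded".

```c
#include <assert.h>
#include <stdint.h>

/* Simulated hardware state, so the sketch runs in user space. */
static uint8_t sim_port80;   /* POST display latch, lost on reset */
static uint8_t sim_scratch;  /* stand-in for a register outside the CPU reset domain */

/* Hypothetical helper: write the POST code to port 0x80 and mirror it
 * into the scratch register. 0x00 is reserved for "nothing recorded". */
static void post_code_with_scratch(uint8_t code)
{
    assert(code != 0x00);
    sim_port80 = code;
    sim_scratch = code;
}

/* Simulated watchdog reset: ordinary state is cleared, while the
 * scratch register is deliberately left alone. */
static void sim_watchdog_reset(void)
{
    sim_port80 = 0x00;
}

/* What the next incarnation (perhaps fallback) could read back. */
static uint8_t last_post_code_before_reset(void)
{
    return sim_scratch;
}
```

After a simulated hang and reset, last_post_code_before_reset() still returns the code that was posted last, which is the whole point of choosing a register outside the reset domain.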
Hi Joe,
your quoting seems to be messed up somehow.
On 16.04.2008 17:21, joe@settoplinux.org wrote:
On Tue, 2008-04-08 at 01:13 +0200, Peter Stuge wrote:
True, but it's not very easy to regularly poke the timer without an interrupt - or the code will be full of try_poke() every other line.
What about the mostly existing code that pokes port 0x80? Make it a macro that also pokes the WDT?
If there are "long" sections for a single POST code (like clearing 4GB ECC ram), having a bit toggling would be less boring to watch anyhow.
I was reading the datasheet for the Intel 845 chipset, and there seem to be a lot of registers outside the reset domain of the processor. Adding a 3rd poke of the POST code to a scratch register like that would mean that the last POST code before the WDT resets the CPU would be available to the next incarnation (perhaps fallback).
Hmm, very interesting.
Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.
Regards, Carl-Daniel
On Wed, 2008-04-16 at 17:33 +0200, Carl-Daniel Hailfinger wrote:
Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.
Perhaps a config option? The same function/inline/macro could be put in the code, but de-toothed as far as toggling, for those with a fancier POST card.
Toggling aside, the original idea was to point out that there is one thing which already pokes "keepalive" info, and perhaps it's just a matter of directing that poke to include the TCO timer.
What is the longest amount of time between any two POST code updates? The only thing I can think of is ECC ram clearing.
Hi Jeremy,
I appreciate that you and others are looking into putting the watchdog to good use, but most use cases break horribly when you would need the watchdog timer most. See below.
On 16.04.2008 17:55, Jeremy Jackson wrote:
On Wed, 2008-04-16 at 17:33 +0200, Carl-Daniel Hailfinger wrote:
Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.
Perhaps a config option? The same function/inline/macro could be put in the code, but de-toothed as far as toggling, for those with a fancier POST card.
We have some places where we output POST codes in a tight loop in a die() function (or whatever it is called). Coupling watchdog timer reset with outputting POST codes makes sure that the machine will NEVER reboot on serious errors and we might as well disable the watchdog altogether.
Toggling aside, the original idea was to point out that there is one thing which already pokes "keepalive" info, and perhaps it's just a matter of directing that poke to include the TCO timer.
What is the longest amount of time between any two POST code updates? The only thing I can think of is ECC ram clearing.
One hour, probably longer. A payload is not guaranteed to poke the TCO timer, so as long as the payload is running, the TCO timer could fire in theory.
To be honest, I care mostly about v3, so if there is no strong opposition to this in v2, I won't authoritatively veto it although I think even risking an interrupt before we can handle it is bad design.
However, once that code comes anywhere near v3, expect it to explode badly, especially when the machine is already unable to boot and loads recovery code over serial. Do not expect us to sprinkle watchdog pokes all over the codebase in v3.
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
One hour, probably longer. A payload is not guaranteed to poke the TCO timer, so as long as the payload is running, the TCO timer could fire in theory.
Yes, in the end we have to disable the watchdog before jumping to the payload, or we will only be able to support payloads we get our hands on.
To be honest, I care mostly about v3, so if there is no strong opposition to this in v2, I won't authoritatively veto it although I think even risking an interrupt before we can handle it is bad design.
The watchdog is quite unrelated to interrupts in the common sense.
However, once that code comes anywhere near v3, expect it to explode badly, especially when the machine is already unable to boot and loads recovery code over serial. Do not expect us to sprinkle watchdog pokes all over the codebase in v3.
It might be worth thinking about this, though. It's not uglier than sprinkling the code with post_ outputs or print_debug and it would force developers to think carefully about their loops and exit conditions.
Stefan
On Wed, 2008-04-16 at 18:12 +0200, Carl-Daniel Hailfinger wrote:
On Wed, 2008-04-16 at 17:33 +0200, Carl-Daniel Hailfinger wrote:
Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.
We have some places where we output POST codes in a tight loop in a die() function (or whatever it is called). Coupling watchdog timer reset with outputting POST codes makes sure that the machine will NEVER reboot on serious errors and we might as well disable the watchdog altogether.
How does this outputting in a loop interact with the above-mentioned POST cards that store a history?
What is the purpose of these outputs in die() loops?
On 16.04.2008 18:36, Jeremy Jackson wrote:
On Wed, 2008-04-16 at 18:12 +0200, Carl-Daniel Hailfinger wrote:
On Wed, 2008-04-16 at 17:33 +0200, Carl-Daniel Hailfinger wrote:
Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.
We have some places where we output POST codes in a tight loop in a die() function (or whatever it is called). Coupling watchdog timer reset with outputting POST codes makes sure that the machine will NEVER reboot on serious errors and we might as well disable the watchdog altogether.
How does this outputting in a loop interact with the above-mentioned POST cards that store a history?
What is the purpose of these outputs in die() loops?
Ron had the problem that accessing a CPU with a FS2 was difficult if no I/O operations were done. So we now die with continuous output.
Regards, Carl-Daniel
On Wed, 2008-04-16 at 18:44 +0200, Carl-Daniel Hailfinger wrote:
Ron had the problem that accessing a CPU with a FS2 was difficult if no I/O operations were done. So we now die with continuous output.
Sounds like it could be reworked, maybe a separate macro for OUTPUT_SOMETHING_IO-LIKE vs OUTPUT_POST_AND_MAYBE_TCO
If not, sounds like the FS2 and the fancy post card are mutually exclusive. Would a config option resolve that?
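A rough sketch of that split could look like the following. Everything here is hypothetical: the config switch CONFIG_POST_POKES_TCO, the function names, and the simulated port and timer are made up for illustration and are not existing coreboot code. Hardware access is replaced by counters so the sketch runs in user space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical config switch: set to 0 to keep POST output and the
 * watchdog completely separate. */
#ifndef CONFIG_POST_POKES_TCO
#define CONFIG_POST_POKES_TCO 1
#endif

/* Simulated hardware. */
static uint8_t last_port80;   /* last value written to port 0x80 */
static unsigned io_writes;    /* number of port writes so far */
static unsigned tco_reloads;  /* number of watchdog reloads so far */

static void sim_outb(uint8_t value) { last_port80 = value; io_writes++; }
static void sim_tco_reload(void)    { tco_reloads++; }

/* "Something I/O-like": bus activity only, never touches the watchdog. */
static void output_io_only(uint8_t value)
{
    sim_outb(value);
}

/* A real progress marker: POST code, plus a watchdog reload if configured. */
static void output_post_and_maybe_tco(uint8_t code)
{
    sim_outb(code);
#if CONFIG_POST_POKES_TCO
    sim_tco_reload();
#endif
}

/* Bounded stand-in for a die() loop: it keeps producing I/O but must
 * not keep the watchdog alive, so the timer can still fire. */
static void simulated_die(unsigned iterations)
{
    unsigned reloads_before = tco_reloads;
    for (unsigned i = 0; i < iterations; i++)
        output_io_only(0xff);
    assert(tco_reloads == reloads_before);
}
```

With this split, a die() loop keeps producing I/O without ever reloading the timer, so the watchdog can still reset the machine. It does not stop the repeated port 0x80 writes from flushing the history on a POST card that remembers the last n codes, so that conflict would still need the config option or a different port for the I/O-only output.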
On Wed, 2008-04-16 at 18:12 +0200, Carl-Daniel Hailfinger wrote:
What is the longest amount of time between any two POST code updates? The only thing I can think of is ECC ram clearing.
One hour, probably longer. A payload is not guaranteed to poke the TCO timer, so as long as the payload is running, the TCO timer could fire in theory.
This is another feature to add to the list of payload flags: PAYLOAD_FLAG_HANDLES_TCO.
It assumes the payload is specific to the board's TCO timer.
Or just disable the timer for the *payload*... it is still useful for normal/fallback handling.
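As a sketch of how such a flag might be consumed: only the name PAYLOAD_FLAG_HANDLES_TCO comes from this thread. The bit position, the struct, and the helper functions below are assumptions for illustration, and the timer is simulated by a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag name from the discussion; the bit position is an assumption. */
#define PAYLOAD_FLAG_HANDLES_TCO (1u << 0)

/* Hypothetical payload descriptor. */
struct payload {
    const char *name;
    uint32_t flags;
};

/* Simulated TCO timer state: armed early during boot. */
static bool tco_enabled = true;

static void sim_tco_disable(void) { tco_enabled = false; }

/* Called just before jumping to the payload: leave the timer running
 * only if the payload has promised to service it. */
static void prepare_tco_for_payload(const struct payload *p)
{
    assert(p != NULL);
    if (!(p->flags & PAYLOAD_FLAG_HANDLES_TCO))
        sim_tco_disable();
}
```

A payload without the flag gets the timer disabled before the jump, which matches the "just disable it for the payload" fallback while keeping the watchdog useful during the firmware's own normal/fallback handling.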
On Wed, 2008-04-16 at 18:12 +0200, Carl-Daniel Hailfinger wrote:
However, once that code comes anywhere near v3, expect it to explode badly, especially when the machine is already unable to boot and loads recovery code over serial. Do not expect us to sprinkle watchdog pokes
I guess there are two ways to look at the watchdog timer:
1) it's a mystery, due to lack of board schematics, board vendor support, chip schematics, chip vendor support. This leads to headaches "why is this happening... it keeps resetting when we turned watchdog off.. or is there *another* disable register we didn't get told about" and we're battling N number of other mystery registers already.
2) armed with board schematics and chip vendor support, and perfectly designed reset domains, we imagine software for our satellite that can be upgraded from the ground with no fear of "bricking" it, that is self-correcting, etc.
so #1 is more "present reality", while #2 is something to start laying the groundwork for.
Carl-Daniel Hailfinger wrote:
Hi Joe,
your quoting seems to be messed up somehow.
On 16.04.2008 17:21, joe@settoplinux.org wrote:
On Tue, 2008-04-08 at 01:13 +0200, Peter Stuge wrote:
True, but it's not very easy to regularly poke the timer without an interrupt - or the code will be full of try_poke() every other line.
What about the mostly existing code that pokes port 0x80? Make it a macro that also pokes the WDT?
If there are "long" sections for a single POST code (like clearing 4GB ECC ram), having a bit toggling would be less boring to watch anyhow.
Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.
Full ack. In addition, showing post codes and triggering a watchdog are two fundamentally different things. I think we should not let the one implicitly do the other.
Stefan
On Wed, 2008-04-16 at 18:10 +0200, Stefan Reinauer wrote:
Full ack. In addition, showing post codes and triggering a watchdog are two fundamentally different things. I think we should not let the one implicitly do the other.
Ok after engaging brain, this appears to be wisdom. Isn't the most trivial error an infinite loop? Definitely *don't* want to poke the TCO timer in a loop unless it's very carefully done.
Stuff Learned The Hard Way: experience I have had with fallback/normal and timers suggests this:
bios should set the watchdog timer, very early, to fire in, say, 5 minutes
the final payload (linux) should clear the timer, but not until it hits a runlevel such that it is network reachable (and hence manageable), and then, the remote network manager system initiates the watchdog reset. The decision should not be made locally.
So, you have to look at the system -- at what point is the system reachable from remote?
- not when the bios tries to boot a payload (could be a bug in bios that breaks payload)
- not when payload tries to boot os (ditto)
- not when os is booting (ditto)
- not when os is in /etc/rc or equivalent
but only when os is "up", and presumed working. In fact, the best way to reset the watchdog? From remote: ssh node watchdog-reset
Why is this? because the bios, payload, and kernel, and kernel runtime, have to be taken as a whole. The system is not really considered viable until it's totally booted. The kernel might be on flash, and you might have just tweaked it, and broken it -- which I have done, on 1024 machines, and thanked my lucky stars for fallback!
if you can't do "reset the watchdog" from a remote node, over the network, the node is not up in any useful sense. Let it crash into fallback.
Obviously this logic does not apply to standalone nodes :-) (well, not completely: for standalone, don't clear the timer until you hit runlevel 3 or equivalent)
ron
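The policy above can be condensed into a small decision function. This is only a sketch of the rule as described: the stage names, the function, and the 5-minute constant (taken from the "say, 5 minutes" suggestion) are illustrative, and STAGE_OS_UP stands in for "runlevel 3 or equivalent".

```c
#include <assert.h>
#include <stdbool.h>

/* Boot stages in the order they are reached; names are illustrative. */
enum boot_stage {
    STAGE_BIOS,
    STAGE_PAYLOAD,
    STAGE_OS_BOOTING,
    STAGE_OS_INIT_SCRIPTS,  /* /etc/rc or equivalent */
    STAGE_OS_UP             /* runlevel 3 or equivalent, presumed working */
};

/* Timeout the bios would program very early (the "say, 5 minutes"). */
#define WATCHDOG_TIMEOUT_SECONDS (5 * 60)

/* May the watchdog be cleared now?
 * - never before the OS is fully up;
 * - on a networked node, only when a remote manager asked for it;
 * - on a standalone node, once the OS is up. */
static bool may_clear_watchdog(enum boot_stage stage, bool networked,
                               bool remote_request)
{
    assert(stage <= STAGE_OS_UP);
    if (stage != STAGE_OS_UP)
        return false;
    return networked ? remote_request : true;
}
```

If the function never returns true within the timeout, the node is not up in any useful sense and is left to crash into fallback.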
On 18.04.2008 17:46, ron minnich wrote:
Stuff Learned The Hard Way: experience I have had with fallback/normal and timers suggests this:
bios should set the watchdog timer, very early, to fire in, say, 5 minutes
the final payload (linux) should clear the timer, but not until it hits a runlevel such that it is network reachable (and hence manageable), and then, the remote network manager system initiates the watchdog reset. The decision should not be made locally.
So, you have to look at the system -- at what point is the system reachable from remote?
- not when the bios tries to boot a payload (could be a bug in bios that breaks payload)
- not when payload tries to boot os (ditto)
- not when os is booting (ditto)
- not when os is in /etc/rc or equivalent
but only when os is "up", and presumed working. In fact, the best way to reset the watchdog? From remote: ssh node watchdog-reset
Why is this? because the bios, payload, and kernel, and kernel runtime, have to be taken as a whole. The system is not really considered viable until it's totally booted. The kernel might be on flash, and you might have just tweaked it, and broken it -- which I have done, on 1024 machines, and thanked my lucky stars for fallback!
if you can't do "reset the watchdog" from a remote node, over the network, the node is not up in any useful sense. Let it crash into fallback.
Good writeup geared at the multiple servers scenario. Putting it in the wiki under a page called "Watchdog" would be great.
Obviously this logic does not apply to standalone nodes :-) (well, not completely: for standalone, don't clear the timer until you hit runlevel 3 or equivalent)
I beg to differ. What happens if the user wants to run an OS which will not reset the timer? Think Windows, think installation/live CD.
We already have USE_WATCHDOG_ON_BOOT in v2 and we could use that config var in v3.
Regards, Carl-Daniel
On Sat, 2008-04-19 at 04:16 +0200, Carl-Daniel Hailfinger wrote:
Obviously this logic does not apply to standalone nodes :-) (well, not completely: for standalone, don't clear the timer until you hit runlevel 3 or equivalent)
I beg to differ. What happens if the user wants to run an OS which will not reset the timer? Think Windows, think installation/live CD.
We already have USE_WATCHDOG_ON_BOOT in v2 and we could use that config var in v3.
What about making that per-payload?