On 18.04.2008 17:46, ron minnich wrote:
Stuff Learned The Hard Way: experience I have had with fallback/normal and timers suggests this:
bios should set the watchdog timer, very early, to fire in, say, 5 minutes
the final payload (linux) should clear the timer, but not until it reaches a runlevel at which it is network reachable (and hence manageable); at that point the remote network management system initiates the watchdog reset. The decision should not be made locally.
So, you have to look at the system -- at what point is the system reachable from remote?
- not when the bios tries to boot a payload (could be a bug in bios that breaks payload)
- not when payload tries to boot os (ditto)
- not when os is booting (ditto)
- not when os is in /etc/rc or equivalent
but only when os is "up", and presumed working. In fact, the best way to reset the watchdog? From remote: ssh node watchdog-reset
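A minimal sketch of what a node-side watchdog-reset helper might look like. It assumes the node runs Linux, that the timer armed by the bios is the one exposed through the standard /dev/watchdog device, that "reset" here means disarming it, and that the driver honours "magic close" (writing the character V right before closing stops the timer, unless the kernel was built with nowayout). The function name and default path are illustrative, not taken from an existing tool:

```python
import os


def watchdog_reset(device="/dev/watchdog"):
    """Disarm the watchdog using the Linux 'magic close' convention."""
    # Opening the device activates (or takes over) the timer.
    fd = os.open(device, os.O_WRONLY)
    try:
        # Writing 'V' just before close asks the driver to stop the timer
        # rather than leave it running after the file is closed.
        os.write(fd, b"V")
    finally:
        os.close(fd)
```

Installed on the node as a small script that calls watchdog_reset(), this is what the remote "ssh node watchdog-reset" would end up running. On a driver without magic close support the timer keeps running, so the node would still fall back.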
Why is this? Because the bios, payload, kernel, and kernel runtime have to be taken as a whole. The system is not really considered viable until it's totally booted. The kernel might be on flash, and you might have just tweaked it and broken it -- which I have done, on 1024 machines, and thanked my lucky stars for fallback!
If you can't do "reset the watchdog" from a remote node, over the network, the node is not up in any useful sense. Let it crash into fallback.
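The manager side of this policy could be sketched roughly as follows. It assumes key-based ssh access to every node and a watchdog-reset command installed there; the function name, the 30 second timeout, and the injectable run parameter are all illustrative choices:

```python
import subprocess


def reset_reachable_nodes(nodes, timeout=30, run=subprocess.run):
    """Ask each node to clear its watchdog; return the nodes that failed."""
    failed = []
    for node in nodes:
        try:
            # BatchMode=yes makes ssh fail instead of prompting for a password.
            result = run(
                ["ssh", "-o", "BatchMode=yes", node, "watchdog-reset"],
                timeout=timeout,
            )
            ok = result.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            ok = False
        if not ok:
            # Deliberately do nothing else: the timer stays armed and the
            # node drops into fallback on its own when the watchdog fires.
            failed.append(node)
    return failed
```

The point of the design is that a failure needs no handling at all: a node that cannot be reached, or cannot run the command, is simply left alone until its timer expires.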
Good writeup, geared toward the multiple-server scenario. Putting it in the wiki under a page called "Watchdog" would be great.
Obviously this logic does not apply to standalone nodes :-) (well, not completely: for standalone, don't clear the timer until you hit runlevel 3 or equivalent)
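For the standalone case, the local check might be sketched like this. It assumes a SysV-style init whose runlevel command prints the previous and current runlevel (for example "N 3"); the function name and the reading of "or equivalent" as "runlevels 3 through 5" are assumptions:

```python
def may_clear_watchdog(runlevel_output, minimum=3):
    """Return True once the current runlevel is a multi-user level >= minimum."""
    fields = runlevel_output.split()
    # `runlevel` prints "<previous> <current>", e.g. "N 3"; anything else
    # (including "unknown") is treated as "not up yet".
    if len(fields) != 2 or not fields[1].isdigit():
        return False
    current = int(fields[1])
    # 0 is halt and 6 is reboot on SysV-style systems, so cap at 5.
    return minimum <= current <= 5
```

A late init script could pass the output of the runlevel command to this function and only clear the timer when it returns True.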
I beg to differ. What happens if the user wants to run an OS which will not reset the timer? Think Windows, think installation/live CD.
We already have USE_WATCHDOG_ON_BOOT in v2 and we could use that config var in v3.
Regards, Carl-Daniel