Re: [coreboot] [PATCH] Halt TCO timer on Intel 3100 chipset

joe@settoplinux.org

16 Apr 2008

10:21 a.m.

On Tue, 2008-04-08 at 01:13 +0200, Peter Stuge wrote:

...

True, but it's not very easy to regularly poke the timer without an interrupt - or the code will be full of try_poke() every other line.

What about the code that pokes port 0x80, which mostly exists already? Make it a macro that also pokes the WDT?

If there are "long" sections for a single POST code, (like clearing 4GB ECC ram), having a bit toggling would be less boring to watch anyhow.

I was reading the datasheet for the Intel 845 chipset, and there seem to be a lot of registers outside the reset domain of the processor. Adding a third poke of the POST code to a scratch register like that would mean that the last POST code before the WDT resets the CPU would be available to the next incarnation (perhaps fallback).

Hmm, very interesting.
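
A minimal sketch of the scratch-register idea, assuming a chipset register that survives a watchdog reset. The hardware is simulated with plain variables so the example runs anywhere; `sim_port80`, `sim_scratch_reg`, `cpu_side_progress` and all function names are hypothetical and do not come from the coreboot tree or from any datasheet.

```c
#include <stdint.h>

/* Simulated hardware. On a real board these would be an I/O write to
 * port 0x80 and a write to a chipset scratch register that lies outside
 * the CPU reset domain (both are assumptions here). */
static uint8_t sim_port80;        /* stands in for I/O port 0x80 */
static uint8_t sim_scratch_reg;   /* stands in for a reset-surviving register */
static uint8_t cpu_side_progress; /* state inside the reset domain */

/* Emit a POST code and keep a copy where the next boot can find it. */
static void post_code_with_trace(uint8_t code)
{
	sim_port80 = code;        /* normal POST output */
	cpu_side_progress = code; /* ordinary state, lost on reset */
	sim_scratch_reg = code;   /* extra poke that survives the reset */
}

/* Model a watchdog reset: CPU-side state is lost, the scratch register is not. */
static void sim_watchdog_cpu_reset(void)
{
	cpu_side_progress = 0;
}

/* Called early in the next boot (for example by fallback) to see how far
 * the previous attempt got. */
static uint8_t last_post_code_before_reset(void)
{
	return sim_scratch_reg;
}
```

After a simulated reset, `last_post_code_before_reset()` still returns the last code written, while `cpu_side_progress` is gone. Whether a given chipset really offers such a register is something only its datasheet can confirm.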

-- Thanks, Joseph Smith Set-Top-Linux www.settoplinux.org

Carl-Daniel Hailfinger

16 Apr

10:33 a.m.

New subject: [PATCH] Halt TCO timer on Intel 3100 chipset

Hi Joe,

your quoting seems to be messed up somehow.

On 16.04.2008 17:21, joe@settoplinux.org wrote:

...

On Tue, 2008-04-08 at 01:13 +0200, Peter Stuge wrote:

...
True, but it's not very easy to regularly poke the timer without an interrupt - or the code will be full of try_poke() every other line.

What about the code that pokes port 0x80, which mostly exists already? Make it a macro that also pokes the WDT?

If there are "long" sections for a single POST code, (like clearing 4GB ECC ram), having a bit toggling would be less boring to watch anyhow.

I was reading the datasheet for the Intel 845 chipset, and there seem to be a lot of registers outside the reset domain of the processor. Adding a third poke of the POST code to a scratch register like that would mean that the last POST code before the WDT resets the CPU would be available to the next incarnation (perhaps fallback).

Hmm, very interesting.

Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.

Regards, Carl-Daniel

Jeremy Jackson

10:55 a.m.

On Wed, 2008-04-16 at 17:33 +0200, Carl-Daniel Hailfinger wrote:

...

Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.

Perhaps a config option? The same function/inline/macro could be put in the code, but de-toothed as far as toggling, for those with a fancier POST card.

Toggling aside, the original idea was, to point out there is one thing which pokes "keepalive" info already, and perhaps it's just a matter of directing that poke to include the TCO timer.

What is the longest amount of time between any two POST code updates? The only thing I can think of is ECC ram clearing.

-- Jeremy Jackson Coplanar Networks (519)489-4903 http://www.coplanar.net jerj@coplanar.net

Carl-Daniel Hailfinger

11:12 a.m.

Hi Jeremy,

I appreciate that you and others are looking into putting the watchdog to good use, but most use cases break horribly when you would need the watchdog timer most. See below.

On 16.04.2008 17:55, Jeremy Jackson wrote:

...

On Wed, 2008-04-16 at 17:33 +0200, Carl-Daniel Hailfinger wrote:

...
Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.

Perhaps a config option? The same function/inline/macro could be put in the code, but de-toothed as far as toggling, for those with a fancier POST card.

We have some places where we output POST codes in a tight loop in a die() function (or whatever it is called). Coupling watchdog timer reset with outputting POST codes makes sure that the machine will NEVER reboot on serious errors and we might as well disable the watchdog altogether.
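
The failure mode described here can be shown with a small simulation. This is only a sketch under stated assumptions: the watchdog is a software countdown, the timeout of 100 ticks is arbitrary, the die() loop is bounded so the program terminates, and every name (`wdt_poke`, `post_code`, `simulate_die_loop`) is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define WDT_TIMEOUT_TICKS 100u /* arbitrary value for the simulation */

static unsigned wdt_remaining = WDT_TIMEOUT_TICKS;
static bool wdt_fired;

/* Reload the simulated watchdog. */
static void wdt_poke(void)
{
	wdt_remaining = WDT_TIMEOUT_TICKS;
}

/* One unit of simulated time passes; the watchdog fires at zero. */
static void wdt_tick(void)
{
	if (wdt_fired)
		return;
	if (wdt_remaining > 0)
		wdt_remaining--;
	if (wdt_remaining == 0)
		wdt_fired = true;
}

/* POST output, optionally coupled to the watchdog poke. */
static void post_code(uint8_t code, bool also_poke_wdt)
{
	(void)code; /* a real build would write the code to port 0x80 */
	if (also_poke_wdt)
		wdt_poke();
}

/* die()-style loop, bounded so the simulation ends. Returns true if the
 * watchdog got the chance to reset the machine. */
static bool simulate_die_loop(bool post_pokes_wdt, unsigned iterations)
{
	wdt_remaining = WDT_TIMEOUT_TICKS;
	wdt_fired = false;
	for (unsigned i = 0; i < iterations && !wdt_fired; i++) {
		post_code(0xff, post_pokes_wdt);
		wdt_tick();
	}
	return wdt_fired;
}
```

With the coupling enabled, `simulate_die_loop(true, n)` returns false for any `n`, because every POST output reloads the counter before it can reach zero. With the coupling disabled, the watchdog fires once the loop has run for the full timeout.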

...

Toggling aside, the original idea was, to point out there is one thing which pokes "keepalive" info already, and perhaps it's just a matter of directing that poke to include the TCO timer.

What is the longest amount of time between any two POST code updates? The only thing I can think of is ECC ram clearing.

One hour, probably longer. A payload is not guaranteed to poke the TCO timer, so as long as the payload is running, the TCO timer could fire in theory.

To be honest, I care mostly about v3, so if there is no strong opposition to this in v2, I won't authoritatively veto it although I think even risking an interrupt before we can handle it is bad design.

However, once that code comes anywhere near v3, expect it to explode badly, especially when the machine is already unable to boot and loads recovery code over serial. Do not expect us to sprinkle watchdog pokes all over the codebase in v3.

Regards, Carl-Daniel

Stefan Reinauer

11:19 a.m.

Carl-Daniel Hailfinger wrote:

...

One hour, probably longer. A payload is not guaranteed to poke the TCO timer, so as long as the payload is running, the TCO timer could fire in theory.

Yes, in the end we have to disable the watchdog before jumping to the payload, or we will only be able to support payloads we get our hands on.

...

To be honest, I care mostly about v3, so if there is no strong opposition to this in v2, I won't authoritatively veto it although I think even risking an interrupt before we can handle it is bad design.

The watchdog is quite unrelated to interrupts in the common sense.

...

However, once that code comes anywhere near v3, expect it to explode badly, especially when the machine is already unable to boot and loads recovery code over serial. Do not expect us to sprinkle watchdog pokes all over the codebase in v3.

It might be worth thinking about this, though. It's not uglier than sprinkling the code with post_ outputs or print_debug and it would force developers to think carefully about their loops and exit conditions.

Stefan

Jeremy Jackson

11:36 a.m.

On Wed, 2008-04-16 at 18:12 +0200, Carl-Daniel Hailfinger wrote:

...

...
On Wed, 2008-04-16 at 17:33 +0200, Carl-Daniel Hailfinger wrote:

...
Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.

We have some places where we output POST codes in a tight loop in a die() function (or whatever it is called). Coupling watchdog timer reset with outputting POST codes makes sure that the machine will NEVER reboot on serious errors and we might as well disable the watchdog altogether.

How does this outputting in a loop interact with the above mentioned POST cards that store a history?

What is the purpose of these outputs in die() loops?

-- Jeremy Jackson Coplanar Networks (519)489-4903 http://www.coplanar.net jerj@coplanar.net

Carl-Daniel Hailfinger

11:44 a.m.

On 16.04.2008 18:36, Jeremy Jackson wrote:

...

On Wed, 2008-04-16 at 18:12 +0200, Carl-Daniel Hailfinger wrote:

...
...
On Wed, 2008-04-16 at 17:33 +0200, Carl-Daniel Hailfinger wrote:

...
Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.

We have some places where we output POST codes in a tight loop in a die() function (or whatever it is called). Coupling watchdog timer reset with outputting POST codes makes sure that the machine will NEVER reboot on serious errors and we might as well disable the watchdog altogether.

How does this outputting in a loop interact with the above mentioned POST cards that store a history?

What is the purpose of these outputs in die() loops?

Ron had the problem that accessing a CPU with a FS2 was difficult if no I/O operations were done. So we now die with continuous output.

Regards, Carl-Daniel

Jeremy Jackson

11:55 a.m.

On Wed, 2008-04-16 at 18:44 +0200, Carl-Daniel Hailfinger wrote:

...

Ron had the problem that accessing a CPU with a FS2 was difficult if no I/O operations were done. So we now die with continuous output.

Sounds like it could be reworked, maybe a separate macro for OUTPUT_SOMETHING_IO-LIKE vs OUTPUT_POST_AND_MAYBE_TCO

If not, sounds like the FS2 and the fancy post card are mutually exclusive. Would a config option resolve that?
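
One way the split could look, as a rough sketch only. It assumes that any I/O cycle would keep the hardware debugger happy, which the thread does not confirm. `CONFIG_POST_TOGGLE`, `output_post`, `output_io_activity` and `die_sim` are hypothetical names, and the ports are simulated with counters so the example runs on any host.

```c
#include <stdint.h>

/* Hypothetical build-time switch, off by default so that POST cards with
 * a history display are not flooded. */
#ifndef CONFIG_POST_TOGGLE
#define CONFIG_POST_TOGGLE 0
#endif

static unsigned port80_writes;   /* simulated writes to port 0x80 */
static unsigned dummy_io_writes; /* simulated writes to some other port */
static uint8_t port80_value;

/* Progress reporting: the only path that touches port 0x80. */
static void output_post(uint8_t code)
{
	port80_value = code;
	port80_writes++;
}

/* Keeps I/O cycles flowing without touching the POST history,
 * unless toggling was explicitly configured in. */
static void output_io_activity(void)
{
	dummy_io_writes++;
#if CONFIG_POST_TOGGLE
	output_post(port80_value ^ 0x80); /* flip the top bit as a heartbeat */
#endif
}

/* die()-style loop, bounded here so the simulation terminates. */
static void die_sim(uint8_t code, unsigned spins)
{
	output_post(code);
	for (unsigned i = 0; i < spins; i++)
		output_io_activity();
}
```

With the default setting, a call such as `die_sim(0xee, 1000)` produces one write to port 0x80 and 1000 writes elsewhere, so a history-keeping POST card still shows the path that led to the failure.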

-- Jeremy Jackson Coplanar Networks (519)489-4903 http://www.coplanar.net jerj@coplanar.net

Jeremy Jackson

11:40 a.m.

On Wed, 2008-04-16 at 18:12 +0200, Carl-Daniel Hailfinger wrote:

...

...
What is the longest amount of time between any two POST code updates? The only thing I can think of is ECC ram clearing.

One hour, probably longer. A payload is not guaranteed to poke the TCO timer, so as long as the payload is running, the TCO timer could fire in theory.

This is another feature to add to the list of payload flags. PAYLOAD_FLAG_HANDLES_TCO

assumes payload is specific to the board's TCO timer,

or just disable it for the *payload*... still useful for normal/fallback handling.
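
A sketch of how such a flag might be checked just before the jump to the payload. Only the name `PAYLOAD_FLAG_HANDLES_TCO` comes from this message; its bit position, `struct payload_info`, `sim_tco_halt` and `prepare_watchdog_for_payload` are assumptions made for illustration, and the timer is simulated with a boolean.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag name from the thread; the bit position is an assumption. */
#define PAYLOAD_FLAG_HANDLES_TCO (1u << 0)

/* Hypothetical payload descriptor. */
struct payload_info {
	const char *name;
	uint32_t flags;
};

static bool sim_tco_running = true; /* simulated TCO timer state */

/* Stand-in for the chipset-specific code that halts the TCO timer. */
static void sim_tco_halt(void)
{
	sim_tco_running = false;
}

/* Decide what to do with the watchdog right before the payload starts.
 * Returns true if the timer is left running for the payload to service. */
static bool prepare_watchdog_for_payload(const struct payload_info *p)
{
	if (p->flags & PAYLOAD_FLAG_HANDLES_TCO)
		return true; /* payload promises to poke the timer itself */

	sim_tco_halt();  /* unknown payload: avoid a surprise reset */
	return false;
}
```

The default is deliberately the safe one: a payload that says nothing about the TCO timer gets it halted, and only a payload that sets the flag inherits a running watchdog.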

-- Jeremy Jackson Coplanar Networks (519)489-4903 http://www.coplanar.net jerj@coplanar.net

Jeremy Jackson

11:52 a.m.

On Wed, 2008-04-16 at 18:12 +0200, Carl-Daniel Hailfinger wrote:

...

However, once that code comes anywhere near v3, expect it to explode badly, especially when the machine is already unable to boot and loads recovery code over serial. Do not expect us to sprinkle watchdog pokes

I guess there are two ways to look at the watchdog timer:

1) it's a mystery, due to lack of board schematics, board vendor support, chip schematics, chip vendor support. This leads to headaches "why is this happening... it keeps resetting when we turned watchdog off.. or is there *another* disable register we didn't get told about" and we're battling N number of other mystery registers already.

2) armed with board schematics and chip vendor support, and perfectly designed reset domains, we imagine a software for our satellite that can be upgraded from the ground, with no fear of "bricking" it, that is self correcting etc.

so #1 is more "present reality", while #2 is something to start laying the groundwork for.

-- Jeremy Jackson Coplanar Networks (519)489-4903 http://www.coplanar.net jerj@coplanar.net

Stefan Reinauer

11:10 a.m.

Carl-Daniel Hailfinger wrote:

...

Hi Joe,

your quoting seems to be messed up somehow.

On 16.04.2008 17:21, joe@settoplinux.org wrote:

...
On Tue, 2008-04-08 at 01:13 +0200, Peter Stuge wrote:

...
True, but it's not very easy to regularly poke the timer without an interrupt - or the code will be full of try_poke() every other line.

What about the code that pokes port 0x80, which mostly exists already? Make it a macro that also pokes the WDT?

If there are "long" sections for a single POST code, (like clearing 4GB ECC ram), having a bit toggling would be less boring to watch anyhow.

...

Please avoid any POST code toggling. There are POST boards out there which display the last n POST codes and make it a lot easier to follow the code path. Once you add POST code toggling, that history is flushed away really fast.

Full ack. In addition, showing post codes and triggering a watchdog are two fundamentally different things. I think we should not let the one implicitly do the other.

Stefan

Jeremy Jackson

12:04 p.m.

On Wed, 2008-04-16 at 18:10 +0200, Stefan Reinauer wrote:

...

Full ack. In addition, showing post codes and triggering a watchdog are two fundamentally different things. I think we should not let the one implicitly do the other.

Ok after engaging brain, this appears to be wisdom. Isn't the most trivial error an infinite loop? Definitely *don't* want to poke the TCO timer in a loop unless it's very carefully done.

-- Jeremy Jackson Coplanar Networks (519)489-4903 http://www.coplanar.net jerj@coplanar.net

ron minnich

18 Apr

10:46 a.m.

Stuff Learned The Hard Way: experience I have had with fallback/normal and timers suggests this:

bios should set the watchdog timer, very early, to fire in, say, 5 minutes

the final payload (linux) should clear the timer, but not until it hits a runlevel such that it is network reachable (and hence manageable), and then, the remote network manager system initiates the watchdog reset. The decision should not be made locally.

So, you have to look at the system -- at what point is the system reachable from remote?

- not when the bios tries to boot a payload (could be a bug in bios that breaks payload)
- not when payload tries to boot os (ditto)
- not when os is booting (ditto)
- not when os is in /etc/rc or equivalent

but only when os is "up", and presumed working. In fact, the best way to reset the watchdog? From remote: ssh node watchdog-reset

Why is this? because the bios, payload, and kernel, and kernel runtime, have to be taken as a whole. The system is not really considered viable until it's totally booted. The kernel might be on flash, and you might have just tweaked it, and broken it -- which I have done, on 1024 machines, and thanked my lucky stars for fallback!

if you can't do "reset the watchdog" from a remote node, over the network, the node is not up in any useful sense. Let it crash into fallback.

Obviously this logic does not apply to standalone nodes :-) (well, not completely: for standalone, don't clear the timer until you hit runlevel 3 or equivalent)
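
The policy above can be written down as a small decision function. This is a sketch only: the stage names, `struct node` and the function names are hypothetical, the 5 minute value is the "say, 5 minutes" from the message, and the remote request is modelled as a boolean rather than a real network event.

```c
#include <stdbool.h>

/* Boot stages following the list above; the enum itself is hypothetical. */
enum boot_stage {
	STAGE_BIOS,
	STAGE_PAYLOAD,
	STAGE_OS_BOOTING,
	STAGE_OS_INIT_SCRIPTS, /* /etc/rc or equivalent */
	STAGE_OS_UP            /* runlevel 3 or equivalent */
};

#define WATCHDOG_TIMEOUT_SECONDS (5 * 60) /* "say, 5 minutes" */

struct node {
	enum boot_stage stage;
	bool networked;        /* false for a standalone node */
	bool remote_requested; /* e.g. "ssh node watchdog-reset" arrived */
	bool watchdog_armed;
	int seconds_left;
};

/* Very early in the BIOS: arm the watchdog before anything can go wrong. */
static void bios_arm_watchdog(struct node *n)
{
	n->stage = STAGE_BIOS;
	n->watchdog_armed = true;
	n->seconds_left = WATCHDOG_TIMEOUT_SECONDS;
}

/* The decision is never made before the OS is up, and on a networked
 * node it is never made locally. */
static bool may_clear_watchdog(const struct node *n)
{
	if (n->stage != STAGE_OS_UP)
		return false;
	if (n->networked)
		return n->remote_requested;
	return true; /* standalone: being up is enough */
}

/* Clear the watchdog only if the policy allows it. */
static bool try_clear_watchdog(struct node *n)
{
	if (!may_clear_watchdog(n))
		return false;
	n->watchdog_armed = false;
	return true;
}
```

A networked node that has reached `STAGE_OS_UP` but has not yet heard from the remote manager keeps its watchdog armed, so a node that is up but unreachable still crashes into fallback.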

ron

Carl-Daniel Hailfinger

9:16 p.m.

On 18.04.2008 17:46, ron minnich wrote:

...

Stuff Learned The Hard Way: experience I have had with fallback/normal and timers suggests this:

bios should set the watchdog timer, very early, to fire in, say, 5 minutes

the final payload (linux) should clear the timer, but not until it hits a runlevel such that it is network reachable (and hence manageable), and then, the remote network manager system initiates the watchdog reset. The decision should not be made locally.

So, you have to look at the system -- at what point is the system reachable from remote?

- not when the bios tries to boot a payload (could be a bug in bios that breaks payload)
- not when payload tries to boot os (ditto)
- not when os is booting (ditto)
- not when os is in /etc/rc or equivalent

but only when os is "up", and presumed working. In fact, the best way to reset the watchdog? From remote: ssh node watchdog-reset

Why is this? because the bios, payload, and kernel, and kernel runtime, have to be taken as a whole. The system is not really considered viable until it's totally booted. The kernel might be on flash, and you might have just tweaked it, and broken it -- which I have done, on 1024 machines, and thanked my lucky stars for fallback!

if you can't do "reset the watchdog" from a remote node, over the network, the node is not up in any useful sense. Let it crash into fallback.

Good writeup geared at the multiple servers scenario. Putting it in the wiki under a page called "Watchdog" would be great.

...

Obviously this logic does not apply to standalone nodes :-) (well, not completely: for standalone, don't clear the timer until you hit runlevel 3 or equivalent)

I beg to differ. What happens if the user wants to run an OS which will not reset the timer? Think Windows, think installation/live CD.

We already have USE_WATCHDOG_ON_BOOT in v2 and we could use that config var in v3.

Regards, Carl-Daniel

Jeremy Jackson

10:27 p.m.

On Sat, 2008-04-19 at 04:16 +0200, Carl-Daniel Hailfinger wrote:

...

...
Obviously this logic does not apply to standalone nodes :-) (well, not completely: for standalone, don't clear the timer until you hit runlevel 3 or equivalent)

I beg to differ. What happens if the user wants to run an OS which will not reset the timer? Think Windows, think installation/live CD.

We already have USE_WATCHDOG_ON_BOOT in v2 and we could use that config var in v3.

What about making that per-payload?

-- Jeremy Jackson Coplanar Networks (519)489-4903 http://www.coplanar.net jerj@coplanar.net

coreboot@coreboot.org

participants (5)

Carl-Daniel Hailfinger
Jeremy Jackson
joe@settoplinux.org
ron minnich
Stefan Reinauer