PICList Thread
'[PIC] Internal watchdog is overrated'
2010\05\09@101920 by Olin Lathrop

I just found the post below in my Drafts folder in Outlook Express.  I don't
normally use drafts, so didn't notice this post was stuck there since early
September.  Since I have no recollection what this thread was originally
about, I changed the subject and topic tag.  The points about watchdogs are
still relevant though, and I think often not considered.

------------------------------------------------------------------------

Bob Axtell wrote:
> I, too, use the PIC WDT always, and disable the interrupts
> infrequently. The only place that clears the WDT (CLRWDT) is in the
> timer interrupt, which occurs in less than 15mS intervals.

I think watchdogs, particularly internal ones, are way overrated.  All they
are going to do is reset the part when a particular kind of software bug
occurs.  There are several problems with this:

1 - They detect only a rare kind of software bug that made it into
production.  The system wedging is the kind of blatant symptom that is quite
unlikely to remain in final code.

2 - A hard reset may not be the best way to recover from a software bug.  In
certain systems this may be worse than wedging.

3 - There is the very real chance that new bugs are introduced because the
watchdog isn't kicked often enough.  This kind of timing related bug is more
likely to make it into production than the obvious wedging.  I think in most
cases the cost of 3 outweighs the benefits of 1.

4 - They give a false sense of security.  "I'm using the watchdog so
everything is safe."

5 - All too often people don't think about using the watchdog effectively
when they do use it.  The worst thing is to kick the dog in a periodic
interrupt routine.  Think about it.  The interrupt routine is probably
small, with few branches, and well tested.  The chances of it wedging are
very small.  If something goes wrong, it will probably be in the foreground
code.  However, with the interrupt routine kicking the dog, the foreground
code can do all manner of bad things and the system will keep right on
running.  What you need is a mechanism that guarantees both parts of the
code are running.  Have the interrupt routine set a flag to kick the dog and
have this be one event the main event loop checks.  Now both have to be
running for the dog not to bite.
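
A minimal sketch of that last arrangement in C, written so it can be run on a host machine. The `clrwdt()` function is a hypothetical stand-in for the CLRWDT instruction, and all other names are invented for illustration. The interrupt only requests a kick; the foreground loop performs it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the CLRWDT instruction. It counts kicks so
   the behaviour can be checked without real hardware. */
static uint32_t wdt_kicks = 0;
static void clrwdt(void) { wdt_kicks++; }

/* Set by the periodic interrupt, consumed by the main event loop. */
static volatile bool kick_request = false;

/* Periodic timer interrupt: never touches the watchdog itself. */
void timer_isr(void)
{
    kick_request = true;
}

/* One pass of the foreground event loop. The dog is kicked only when the
   interrupt has run since the last kick AND the loop got this far. */
void main_loop_pass(void)
{
    if (kick_request) {
        kick_request = false;
        clrwdt();
    }
    /* ... handle the other events here ... */
}

uint32_t wdt_kick_count(void) { return wdt_kicks; }
```

If the foreground wedges, the flag stays set and nothing clears the watchdog; if the interrupt stops, the flag is never set again. Either failure lets the dog bite.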

If you've got a complicated system with a lot of asynchronous events where
things could get clogged up or you're using code from elsewhere you don't
really trust (TCP/IP stack for example), then using a watchdog can be
reasonable.  But if it's that important, you can't really trust the internal
watchdog because it's too tightly coupled to the same firmware it's trying
to guard against failures in.  In one case like that I used a 10F200 as a
watchdog.  What's nice about that is you can program it for particular
characteristics.  In that case, for example, it let the main processor do
whatever it wanted for the first 3 seconds, since it had a sort of bootup
procedure before it ran its normal code.  After the initial period, it
required a heartbeat line to be toggled every 500mS +- 75mS.  The heartbeat
was toggled by one of the tasks in the cooperative tasking system of the
main processor, but its timing was derived from state left around by the
periodic interrupt.  In a simple round robin task scheduler, you know all
tasks are running if one of them runs periodically.  You can't detect a task
stuck in a loop calling TASK_YIELD while waiting for something that will
never happen, but if a task completely wedged, all other tasks would stop too.
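
The timing rule described above (3 seconds of freedom, then a toggle every 500 ms within 75 ms either way) could be sketched in C roughly as follows. This is not the actual 10F200 firmware; the structure, the function names, and the choice to start timing the first window at the end of the grace period are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define STARTUP_GRACE_MS 3000u  /* main processor is left alone this long */
#define HB_NOMINAL_MS     500u  /* expected time between heartbeat toggles */
#define HB_TOLERANCE_MS    75u  /* allowed deviation either way */

typedef struct {
    uint32_t now_ms;        /* time since the main processor was released */
    uint32_t last_edge_ms;  /* time of the last heartbeat edge */
    bool     armed;         /* false during the startup grace period */
} ext_wdog;

void ext_wdog_init(ext_wdog *w)
{
    w->now_ms = 0;
    w->last_edge_ms = 0;
    w->armed = false;
}

/* Advance time by one millisecond. Returns true if the main processor
   should be reset because a heartbeat edge is overdue. */
bool ext_wdog_tick_1ms(ext_wdog *w)
{
    w->now_ms++;
    if (!w->armed) {
        if (w->now_ms >= STARTUP_GRACE_MS) {
            w->armed = true;
            w->last_edge_ms = w->now_ms;  /* assumed: windows start here */
        }
        return false;
    }
    return (w->now_ms - w->last_edge_ms) > (HB_NOMINAL_MS + HB_TOLERANCE_MS);
}

/* Report a heartbeat toggle. Returns true if the edge arrived too early. */
bool ext_wdog_edge(ext_wdog *w)
{
    if (!w->armed) {
        return false;  /* anything goes during startup */
    }
    uint32_t dt = w->now_ms - w->last_edge_ms;
    w->last_edge_ms = w->now_ms;
    return dt < (HB_NOMINAL_MS - HB_TOLERANCE_MS);
}
```

Rejecting early edges as well as late ones is what separates this from a plain timeout: code stuck in a tight loop that happens to toggle the line fails the check too.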


********************************************************************
Embed Inc, Littleton Massachusetts, http://www.embedinc.com/products
(978) 742-9014.  Gold level PIC consultants since 2000.

2010\05\09@163239 by Ruben Jönsson

I don't think of the wdt as a way of handling software bugs but rather a way to
get the microcontroller back into a known state after an external event (emi,
esd, surge...) that has set it into an abnormal state which includes a
malfunctioning program or peripheral function that can't be restored without a
reset. Just like the BOR handles voltage dropouts below a critical value.

And then, as you say, it is important to clear the wdt only when you know that
the program is acting as intended, which can be quite difficult to know.
Clearing it in a timer interrupt is not a good idea.

/Ruben



2010\05\09@200327 by Isaac Marino Bavaresco

I had an event where the WDT didn't help. After a surge, something
latched-up inside the PIC and not even the WDT could fix. The chip
heated a little, but after power down and some time to cool, it worked
OK. Of course we replaced the chip anyway.


Regards,

Isaac



2010\05\09@215741 by Xiaofan Chen

On Mon, May 10, 2010 at 8:03 AM, Isaac Marino Bavaresco wrote:
> I had an event where the WDT didn't help. After a surge, something
> latched-up inside the PIC and not even the WDT could fix. The chip
> heated a little, but after power down and some time to cool, it worked
> OK. Of course we replaced the chip anyway.
>

For Output modules, we always use an external watchdog if
the control is done by an MCU. The external watchdog also
needs to clear the output in case the MCU hangs.

--
Xiaofan http://mcuee.blogspot.com

2010\05\10@083628 by Olin Lathrop

Isaac Marino Bavaresco wrote:
> I had an event where the WDT didn't help. After a surge, something
> latched-up inside the PIC and not even the WDT could fix. The chip
> heated a little, but after power down and some time to cool, it worked
> OK. Of course we replaced the chip anyway.

Sounds like it could have been a glitch on Vdd so that momentarily some pins
were more than a diode drop above Vdd, which caused the chip to become an
SCR.



2010\05\10@085716 by Spehro Pefhany

At 08:36 AM 10/05/2010, you wrote:
>Isaac Marino Bavaresco wrote:
> > I had an event where the WDT didn't help. After a surge, something
> > latched-up inside the PIC and not even the WDT could fix. The chip
> > heated a little, but after power down and some time to cool, it worked
> > OK. Of course we replaced the chip anyway.
>
>Sounds like it could have been a glitch on Vdd so that momentarily some pins
>were more than a diode drop above Vdd, which caused the chip to become an
>SCR.

Right, in which case nothing short of cycling the power would clear
such a fault. In some cases making provision for cycling the power,
as well as properly implementing a WDT, is the only way to get
acceptable reliability.

Best regards,

Spehro Pefhany --"it's the network..."            "The Journey is the reward"
.....speffKILLspamspam@spam@interlog.com             Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog  Info for designers:  http://www.speff.com



2010\05\10@112937 by M. Adam Davis

One other big problem with the watchdog that makes me cringe is when
people use it to solve problems they don't understand.  It becomes a
crutch, and they'll say, "Well, we haven't figured out the issue, but
the watchdog catches it in time, and according to testing we still
meet the project requirements, so..."

But it's a tool, and like a hammer it can be used appropriately or
inappropriately.

-Adam

2010\05\10@125720 by Dwayne Reid
At 08:19 AM 5/9/2010, Olin Lathrop wrote:

>I think watchdogs, particularly internal ones, are way overrated.  All they
>are going to do is reset the part when a particular kind of software bug
>occurs.

I think that watchdog timers are a great tool when used
appropriately.  I normally don't think of them as catching software
faults, but rather as one means of getting the processor back on
track if corrupted by some external event.

I have to preface the following comments with a couple of important
observations: I'm not a great programmer.  I write code that works
well but it often takes me way longer to get to the finished product
than it should.

I'm more of a hardware guy who got into writing firmware.  I
generally look at things more from a discrete hardware perspective
than a 'real' software person would.

That said: here goes.

I tend to structure my programs in one of two discrete ways:  simple
programs are along the lines of a cooperative multi-tasking loop,
more complex timing-type programs are more of a linear progression
from start to finish.

The simple multi-tasking loop is just that: a loop that repeats at a
(usually) 1ms rate.  It starts at the top and does everything that it
needs to do.  When it reaches the end of the loop, it calls the
background task.  The background task kicks the watchdog, then looks
after all background stuff - RTCC, a/d, timers, SPI communications,
etc.  Upon return from the background task, the main loop jumps back
to its beginning and the whole thing repeats.

I mention that the main loop can function as a simple cooperative
multi-tasking system.  It does that by implementing each task as a
state machine.  Each state machine yields its time slot when it has
nothing further to do.

The idea is that each separate task needs only a few (to a few dozen)
cycles - there is lots of room and lots of time for lots of state machines.

Interrupts mesh nicely with this concept: each ISR does what it needs
to do, then sets a flag to tell the foreground task that it needs to
deal with something.  Again, individual ISRs tend to be quick and short.
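
For illustration, the loop structure described above might look something like this in C. It is only a sketch under assumptions: the two tasks, the receive interrupt, and every name here are made up, and the watchdog kick is modelled as a counter so the code can run on a host.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t wdt_kicks = 0;          /* stand-in for CLRWDT */
static volatile bool rx_flag = false;   /* set by a hypothetical receive ISR */
static uint32_t bytes_handled = 0;

typedef enum { LED_OFF, LED_ON } led_state;
static led_state led = LED_OFF;
static uint16_t led_ticks = 0;

/* Task 1: blink state machine. Does a few cycles of work and returns. */
static void led_task(void)
{
    if (++led_ticks < 500) return;      /* nothing to do yet: yield */
    led_ticks = 0;
    led = (led == LED_OFF) ? LED_ON : LED_OFF;
}

/* Task 2: deals with work that an ISR has flagged. */
static void comms_task(void)
{
    if (!rx_flag) return;               /* nothing flagged: yield */
    rx_flag = false;
    bytes_handled++;
}

/* Background task: kicks the watchdog, then does the housekeeping. */
static void background_task(void)
{
    wdt_kicks++;                        /* CLRWDT on real hardware */
    /* ... RTCC, A/D, timers, SPI ... */
}

/* The ISR does the minimum and leaves the rest to the foreground. */
void rx_isr(void) { rx_flag = true; }

/* One pass of the main loop (nominally every 1 ms). */
void main_loop_pass(void)
{
    led_task();
    comms_task();
    background_task();
}
```

Because the kick happens only at the end of a full pass, a task that never returns starves the watchdog.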


The longer timing-type programs aren't as well suited to a simple,
relatively short loop.  Instead, they function more like traditional
programs, with a start and an end.  A typical example is what I use
in our gas-fired catalytic-heater industrial process oven
controllers: a start-up cycle of many steps followed by an operate cycle.

In those programs, each phase of a cycle is designed to do or wait
for one particular thing.  In the oven controllers, it's waiting for
either temperature or time or both.  While it's waiting, it repeatedly
calls the background task.

Again, the background task looks after all of the house-keeping: kick
the watchdog, deal with RTCC, timers, slow SPI communications (one
clock edge per tick), a/d, etc.  In my oven controllers, the
background task also contains the entire remote communications task
as well as calling one or two other self-contained tasks.

In this case, even though the main loop is held (stopped) at discrete
points within each phase of the various cycles, it gives the illusion
of multi-tasking because the background task is being called repeatedly.
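
A rough C sketch of this second style, assuming a simulated temperature and a background task that is called about once per millisecond. None of the names or numbers come from the real oven controller; they are placeholders to show the shape of the code.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t wdt_kicks = 0;     /* stand-in for CLRWDT */
static uint32_t tick_ms = 0;       /* advanced by the background task here */
static int16_t  oven_temp_c = 20;  /* simulated temperature reading */

/* Housekeeping, called repeatedly from wherever the foreground waits. */
static void background_task(void)
{
    wdt_kicks++;                   /* CLRWDT on real hardware */
    tick_ms++;                     /* assumed: one call per millisecond */
    if (tick_ms % 10 == 0 && oven_temp_c < 300) {
        oven_temp_c++;             /* simulation only: oven slowly heats */
    }
}

/* Wait for a temperature or a timeout, whichever comes first.
   Returns true if the temperature was reached. */
static bool wait_for_temp(int16_t target_c, uint32_t timeout_ms)
{
    uint32_t start = tick_ms;
    while (oven_temp_c < target_c) {
        if (tick_ms - start >= timeout_ms) return false;
        background_task();
    }
    return true;
}

/* Start-up cycle: a linear sequence of phases, each held at one point. */
bool startup_cycle(void)
{
    if (!wait_for_temp(50, 1000)) return false;   /* preheat phase */
    if (!wait_for_temp(100, 1000)) return false;  /* warm-up phase */
    return true;
}
```

Every wait has a timeout here, which is an added assumption: it keeps a phase from kicking the watchdog forever while waiting for something that never arrives.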

Interrupts also fit into this concept nicely.  In this case, though,
it's usually the background task that is looking for and dealing with
flags set by ISRs.


Any projects that I build that have safety issues (such as the
gas-fired oven controllers mentioned above) have more than one
watchdog.  The external watchdog timer I've been using for many years
now is the Xicor X5043 system supervisor - it contains a watchdog
timer, power-supply supervisor and 512 bytes of eeprom.  I still use
that eeprom to store configuration information even though modern
PICs have built-in eeprom storage - mostly out of inertia, I
guess.  But the Xicor chips have been bullet-proof - I *never* have
issues with corrupted eeprom contents.

The other thing that the Xicor chip does is drive an external timer
that is used to disable certain outputs if needed.  Basically, this
timer is driven by the reset line and is a pulse-stretcher with a
period about twice or three times the watchdog reset repetition
period.  It keeps hazardous outputs disabled if the processor can't
keep the external watchdog happy as well as when the power supply is
below safe operating levels.


Anyway, the point I'm trying to make is that I do NOT count on the
watchdog(s) to catch software errors.  That's not its job!  Instead,
the watchdog is used both to recover if the system is corrupted by
some external event as well as disable hazardous outputs if the
system is not working correctly.

The other point I should make is that I do NOT use interrupts to call
my background task.  That would defeat most of the safety aspects of
the watchdog - it's quite possible that the foreground task is off in
lala land but the timer interrupt is still working
properly.  Instead, the background task must be called by the
foreground.  Only then is the watchdog timer reset.

I'm really interested in hearing other people comment on this subject
- it's a great learning opportunity.  Olin - many thanks for making your post!

dwayne

--
Dwayne Reid   <dwaynerspamKILLspamplanet.eon.net>
Trinity Electronics Systems Ltd    Edmonton, AB, CANADA
(780) 489-3199 voice          (780) 487-6397 fax
http://www.trinity-electronics.com
Custom Electronics Design and Manufacturing

2010\05\10@131908 by Mike Hagen

I use the watchdog a lot on projects to wake from sleep, take a
measurement or status and go back to sleep.
May not be real accurate timing (WDT really has a big tolerance on its
timing), but I like using it this way.


2010\05\10@132609 by Olin Lathrop

Mike Hagen wrote:
> I use the watchdog a lot on projects to wake from sleep, take a
> measurement or status and go back to sleep.
> May not be real accurate timing (WDT really has a big tolerance on
> its timing), but I like using it this way.

That's not really a watchdog then, but a wakeup timer.  I have used PIC
watchdog timers for that purpose too.  In one case I got around the large
tolerance in the wakeup period by occasionally leaving the main oscillator
running during a WDT period to calibrate it.
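
The arithmetic behind such a calibration is simple enough to sketch in C. This assumes a timer clocked from the main oscillator is read at the start and end of one WDT period; how that is done is device-specific and not shown, and both function names are hypothetical.

```c
#include <stdint.h>

/* Measured WDT period in microseconds, from the number of timer counts
   seen during one WDT period and the timer's count rate in Hz. */
uint32_t wdt_period_us(uint32_t timer_counts, uint32_t timer_hz)
{
    if (timer_hz == 0) return 0;  /* guard against a bad configuration */
    return (uint32_t)(((uint64_t)timer_counts * 1000000u) / timer_hz);
}

/* Number of WDT wakeups that best approximates the wanted interval,
   rounded to the nearest whole wakeup. */
uint32_t wakeups_for_interval(uint32_t interval_ms, uint32_t period_us)
{
    if (period_us == 0) return 0; /* no valid measurement yet */
    uint64_t interval_us = (uint64_t)interval_ms * 1000u;
    return (uint32_t)((interval_us + period_us / 2u) / period_us);
}
```

For example, 18000 counts of a 1 MHz timer give a measured period of 18000 us, so a 60 second interval works out to 3333 wakeups. Repeating the measurement occasionally tracks drift with temperature and supply voltage.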



2010\05\10@175158 by sergio masci



On Mon, 10 May 2010, Dwayne Reid wrote:

>
> I'm really interested in hearing other people comment on this subject
> - it's a great learning opportunity.

Regarding robustness, if you use a real state machine, you are better able
to resume operation if an external event causes the MCU to lock up and the
watchdog is able to restart it. Think of it as a watchdog event that can
occur during any state. You would use this event to trigger a state
change to a special state (which could be one of many such states) much
like any other event used by your state machine.

At the start of many of your states you initialise things like variables
and output ports. In other states you update variables and output ports.
You would tend to use the watchdog event to take you back to a state (in
the current chain) that does the initialising or setting hardware to a
well defined condition.

Regards
Sergio Masci

2010\05\10@180457 by Dwayne Reid

At 07:49 PM 5/10/2010, sergio masci wrote:

>Regarding robustness, if you use a real state machine, you are better able
>to resume operation if an external event causes the MCU to lock up and the
>watchdog is able to restart it. Think of it as a watchdog event that can
>occur during any state. You would use this event to trigger a state
>change to a special state (which could be one of many such states) much
>like any other event used by your state machine.
>
>At the start of many of your states you initialise things like variables
>and output ports. In other states you update variables and output ports.
>You would tend to use the watchdog event to take you back to a state (in
>the current chain) that does the initialising or setting hardware to a
>well defined condition.

I understand what you are saying but I must respectfully disagree with you.

The problem is that you don't know WHAT has been corrupted.  It may
be a single register, it might be many registers.  You just don't know.

I've always had the luxury of being able to restart the whole machine
right from the beginning (cold-boot) and that is the approach I've
always taken.

dwayne


2010\05\11@120239 by sergio masci



On Mon, 10 May 2010, Dwayne Reid wrote:

{Quote hidden}

Yes I understand this. A very big problem with many state machine
implementations is the way programmers "hide" some state information in
variables and use it to communicate between states. Then you have a
situation where simply jumping to a state doesn't mean the state machine
is actually in that state (because some other information important to
that state is invalid).

Consider a state machine that uses NO RAM variables at all - except for a
handful of variables the executive needs to manage the state machine.
Such a state machine could be used to implement quite complex control
functions yet it would have a trivial number of RAM locations that need to
be protected from corruption. Ok you can't actually put a force field
around these RAM locations but you can do other things to try to ensure
the values they hold are valid. You might make backup copies elsewhere in
RAM, you might add error correcting code, you might only update them in
safe ways. Ok you won't catch and fix every possible type of error that
could occur but you won't be any worse off than simply detecting a fault
and doing a cold boot.
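
One simple form of the "backup copy" idea is to store each executive variable next to its bitwise complement. The sketch below is an assumed illustration in C with hypothetical names. It only detects corruption; correcting it would need a real error correcting code or a third copy.

```c
#include <stdbool.h>
#include <stdint.h>

/* A variable stored together with its bitwise complement as a check copy. */
typedef struct {
    uint8_t value;
    uint8_t inverse;   /* always ~value while the pair is intact */
} guarded_u8;

/* The only sanctioned way to write the variable: both copies together. */
void guarded_set(guarded_u8 *g, uint8_t v)
{
    g->value = v;
    g->inverse = (uint8_t)~v;
}

/* Returns true and writes *out if the pair is consistent; returns false
   and leaves *out untouched if either copy has been corrupted. */
bool guarded_get(const guarded_u8 *g, uint8_t *out)
{
    if ((uint8_t)(g->value ^ g->inverse) != 0xFFu) return false;
    *out = g->value;
    return true;
}
```

A check like this catches random bit flips and stray writes to one copy, but not a bug that calls `guarded_set()` with the wrong value.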

So if you progress from this to a state machine where specific RAM
locations are only important to corresponding specific states then the
specific state is responsible for initialising the RAM when it is entered.
By definition you don't care about the rest of the RAM because you can
only be in one state at a time. This other RAM can get corrupted but it
doesn't matter. If your MCU now gets hammered by some external event and
your state machine executive determines that your state machine is still
valid it will enter the state again and by definition initialise its RAM
again. So you've protected the state machine executive RAM and said the
rest doesn't matter because we can fix it if it gets corrupted.

Ok, progressing again, let's say you have a state machine that has a group
of states that use a small well defined self contained group of RAM
locations to communicate between them. One of these states WILL cause
these RAM locations to be initialised. Again, you don't care about the rest
of the RAM, it can get corrupted but it doesn't matter. If your MCU now
gets hammered by some external event (while in one of the states within
this group) and your state machine executive determines that your state
machine is still valid it will enter the initialising state again and by
definition initialise its RAM again. This is like all of the states within
the GROUP responding to a watchdog event by jumping to the initialising
state of that GROUP.
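
As a sketch of the group idea, a lookup table can map every state to the initialising state of its group, and the watchdog event becomes a jump through that table. The states below are invented for illustration, as is the choice to fall back to the first state when the saved state number is out of range.

```c
#include <stdint.h>

/* Hypothetical states, arranged in two groups. */
typedef enum {
    ST_FILL_INIT, ST_FILL_OPEN_VALVE, ST_FILL_WAIT_LEVEL,   /* group "fill" */
    ST_HEAT_INIT, ST_HEAT_RAMP, ST_HEAT_HOLD,               /* group "heat" */
    ST_COUNT
} state_id;

/* For each state, the initialising state of the group it belongs to. */
static const state_id group_init[ST_COUNT] = {
    [ST_FILL_INIT]       = ST_FILL_INIT,
    [ST_FILL_OPEN_VALVE] = ST_FILL_INIT,
    [ST_FILL_WAIT_LEVEL] = ST_FILL_INIT,
    [ST_HEAT_INIT]       = ST_HEAT_INIT,
    [ST_HEAT_RAMP]       = ST_HEAT_INIT,
    [ST_HEAT_HOLD]       = ST_HEAT_INIT,
};

/* State to resume in after a watchdog reset. A saved state number that is
   out of range cannot be trusted, so it falls back to the first state. */
state_id state_after_watchdog(uint8_t saved_state)
{
    if (saved_state >= ST_COUNT) return ST_FILL_INIT;
    return group_init[saved_state];
}
```

The table lives in program memory, so the only RAM that has to survive the upset is the saved state number itself.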

I/O ports could be treated the same way as a RAM location shared by a
group of states, the only fly in the ointment being that here a group of
states might be tied to a single pin on the port while another group of
states might be tied to another pin on the SAME port. We can get around
this problem a few ways but by far the most elegant would be to use
multiple state machines running concurrently each looking after their own
I/O pins and having their own set of initialisation states.

Seriously, the biggest problem with state machines is that programmers
tend to think of states and events as a way of getting around a
programming problem rather than seeing states and events as a replacement
for lines of code. It is very common to find huge programs that are
described as "state machines" with (literally) only a few defined states
and events.

>
> I've always had the luxury of being able to restart the whole machine
> right from the beginning (cold-boot) and that is the approach I've
> always taken.

And having done this have you ever squirrelled away some special flag
somewhere that enabled you to kind-of-restart (from the beginning
possibly) one of several tasks or functions?

I'm not saying state machines provide a bullet proof way of coping with an
MCU lockup and watchdog reset, what I am saying is that at best they
provide a way of handling the fault gracefully and possibly recovering
completely and at worst they are no worse than a cold boot.

Regards
Sergio Masci

2010\05\11@124725 by Michael Rigby-Jones



On 11 May 2010 21:00, sergio masci wrote:

> Consider a state machine that uses NO RAM variables at all - except for a
> handful of variables the executive needs to manage the state machine.
> Such a state machine could be used to implement quite complex control
> functions yet it would have a trivial number of RAM locations that need to
> be protected from corruption. Ok you can't actually put a force field
> around these RAM locations but you can do other things to try to ensure
> the values they hold are valid. You might make backup copies elsewhere in
> RAM, you might add error correcting code, you might only update them in
> safe ways. Ok you won't catch and fix every possible type of error that
> could occur but you won't be any worse off than simply detecting a fault
> and doing a cold boot.

Apart from the size and execution speed overhead of testing these
variables are valid every time you want to use them?

>
> So if you progress from this to a state machine where specific RAM
> locations are only important to corresponding specific states then the
> specific state is responsible for initialising the RAM when it is entered.
> By definition you don't care about the rest of the RAM because you can
> only be in one state at a time. This other RAM can get corrupted but it
> doesn't matter. If your MCU now gets hammered by some external event and
> your state machine executive determines that your state machine is still
> valid it will enter the state again and by definition initialise its RAM
> again. So you've protected the state machine executive RAM and said the
> rest doesn't matter because we can fix it if it gets corrupted.

Assuming your state machine "executive" is even still running.  This is
a dangerous assumption, a single corrupted RAM address could vector the
PC to a completely random part of the micro's memory.

>
> Ok, progressing again, let's say you have a state machine that has a group
> of states that use a small well defined self contained group of RAM
> locations to communicate between them. One of these states WILL cause
> these RAM locations to be initialised. Again, you don't care about the rest
> of the RAM, it can get corrupted but it doesn't matter. If your MCU now
> gets hammered by some external event (while in one of the states within
> this group) and your state machine executive determines that your state
> machine is still valid it will enter the initialising state again and by
> definition initialise its RAM again. This is like all of the states within
> the GROUP responding to a watchdog event by jumping to the initialising
> state of that GROUP.
>
> I/O ports could be treated the same way as a RAM location shared by a
> group of states, the only fly in the ointment being that here a group of
> states might be tied to a single pin on the port while another group of
> states might be tied to another pin on the SAME port. We can get around
> this problem a few ways but by far the most elegant would be to use
> multiple state machines running concurrently each looking after their own
> I/O pins and having their own set of initialisation states.

What happens with external peripherals, that you may not be able to
recover the state of, e.g. SPI DACs?  If one RAM location is corrupted,
then you have to assume everything else could be, because you can't
possibly protect every variable with a checksum etc.  If you have to
re-initialise everything, then the difference to a soft reset is moot.

Mike


2010\05\11@184629 by sergio masci



On Tue, 11 May 2010, Michael Rigby-Jones wrote:

{Quote hidden}

I wouldn't test them every time I wanted to use them, instead what I would
probably do is have two sets of variables such that I have a stable set
and a buffer set. The stable set is the one you would use when you
enter the start of the state machine executive, the buffer set would be
the one you write updated copies of the stable set to. Then when you have
finished the updates you would switch the stable and buffer sets in one
operation.
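
A sketch of the stable/buffer arrangement in C, assuming just two executive variables and hypothetical function names. Updates are made to the buffer set, and flipping one index byte turns the buffer set into the stable set.

```c
#include <stdint.h>

/* The executive's variables (assumed contents, for illustration). */
typedef struct {
    uint8_t  state;
    uint16_t timer;
} exec_vars;

static exec_vars sets[2];
static volatile uint8_t stable_index = 0;  /* which set is the stable one */

/* Read-only view of the set that is currently trusted. */
const exec_vars *stable_set(void)
{
    return &sets[stable_index];
}

/* Start an update: copy the stable set into the buffer set and hand the
   buffer set back to the caller for modification. */
exec_vars *begin_update(void)
{
    uint8_t buf = (uint8_t)(stable_index ^ 1u);
    sets[buf] = sets[stable_index];
    return &sets[buf];
}

/* Finish an update: flipping the index swaps the roles of the two sets. */
void commit_update(void)
{
    stable_index = (uint8_t)(stable_index ^ 1u);
}
```

If an upset hits in the middle of an update, the stable set has not been touched, so a restart sees the last complete set of values.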

The testing of the state machine executive variables would only be
performed after a watchdog reset and before the state machine executive is
restarted.

Yes there would be a small overhead using alternate sets of variables but
this would not be much more than people are currently prepared to accept
when doing things like using dynamic stack based local variables (standard
C type local variables).

{Quote hidden}

Ok, so you have an external fault that causes the MCU to start executing
code at some random location. As a consequence the code starts stomping
all over RAM in an unpredictable way. After a while the watchdog kicks in
and resets the MCU, and the startup code, after doing some checks,
determines that the state machine is either still viable and can be
restarted or is dead and the whole system needs to be cold booted. The
concept that the state machine
is still running and sees a "watchdog reset as an event" is just a
simplified way of looking at what is actually happening. To all intents
and purposes the MCU may have crashed and be executing data held in FLASH
- never to come back to the state machine executive. BUT because the state
machine executive can be restarted after such a fault by simply jumping
into its main loop (provided its variables are not corrupted) then you
really can view it as a "watchdog reset as an event".
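
The decision made after reset could be sketched like this in C. How the reset cause is read is device-specific and not shown; the types, the complement check, and the state count are all assumptions for the sake of the example.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { RESET_POWER_ON, RESET_WATCHDOG } reset_cause;
typedef enum { BOOT_COLD, BOOT_WARM } boot_mode;

/* Executive variables that must survive for a warm restart. */
typedef struct {
    uint8_t state;        /* current state number */
    uint8_t state_check;  /* ~state while intact */
} exec_vars;

#define NUM_STATES 6u     /* assumed number of states */

/* The saved state is trusted only if it is in range and its check copy
   still matches. */
static bool exec_vars_valid(const exec_vars *v)
{
    return v->state < NUM_STATES &&
           (uint8_t)(v->state ^ v->state_check) == 0xFFu;
}

/* Warm restart only after a watchdog reset with intact variables;
   everything else gets a cold boot. */
boot_mode choose_boot_mode(reset_cause cause, const exec_vars *v)
{
    if (cause == RESET_WATCHDOG && exec_vars_valid(v)) return BOOT_WARM;
    return BOOT_COLD;
}
```

On a warm boot the startup code would skip the usual RAM clearing and jump into the executive's main loop, which then treats the reset as one more event. The worst case is the cold boot that would have happened anyway.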

>
{Quote hidden}

No, there are many situations where recovering from an error is both
possible and desirable - you just need to design your program accordingly.
What happens if you are writing to an external peripheral and an error
occurs? Do you just give up and do a soft reset or do you have an error
procedure in place to cope with it? What happens if the external fault
actually caused your DAC to die without affecting the PIC? How is a soft
reset going to help here?

The point is that if you have a typical program running on a PIC and a
watchdog reset kicks in, then the PIC tends to behave as though the power
has just come up and it has been reset. While if you use a real state
machine approach it is FAR easier to recover from the fault that caused
the watchdog to kick in.

BTW, think about this: there are lots of programs out there running with
the odd corrupted RAM location (caused by bugs) and they still manage to
work... for the most part :-)

Regards
Sergio Masci
