PICList Thread
'[PIC]: Increasing Watchdog Reliability'
2000\10\24@183437 by Bob Ammerman

I am wrapping up a design in which I use a bunch of 'virtual peripherals'
(ala Scenix) multiplexed on a timer 0 interrupt.

Of course, I have the watchdog enabled, and I have sprinkled CLRWDT's around
my task level code.

I got to thinking however, that if for some reason I stopped getting and
handling the interrupts, that my mainline code would continue hitting the
dog indefinitely, which wouldn't be a good thing since the system wouldn't
be functioning.

So....

I added a piece of code in the interrupt handler that sets a bit and now I
only do the CLRWDT in the mainline code if the bit is set (and of course I
clear the bit again).

I feel this will reduce the probability of an undetected error.

Bob Ammerman
RAm Systems
(contract development of high performance, high function, low-level
software)
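A minimal sketch of the flag-gated kick described in this post, modelled in Python so it can be run on a desktop rather than on a PIC. The class, method and parameter names are hypothetical, and the watchdog is just a counter standing in for the hardware WDT; on a real part the kick would be a bit test around CLRWDT.

```python
class GatedWatchdog:
    """Desktop model: the watchdog is cleared only if the ISR has run."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks  # assumed WDT period, arbitrary ticks
        self.wdt_count = 0                  # stands in for the hardware WDT counter
        self.isr_alive = False              # the bit set by the interrupt handler
        self.was_reset = False

    def timer_isr(self):
        # Interrupt handler: (service virtual peripherals, then) mark itself alive.
        self.isr_alive = True

    def mainline_kick(self):
        # Mainline: clear the watchdog only if the ISR ran since the last kick.
        if self.isr_alive:
            self.isr_alive = False
            self.wdt_count = 0              # stands in for CLRWDT

    def wdt_tick(self):
        # One tick of the free-running watchdog oscillator.
        self.wdt_count += 1
        if self.wdt_count >= self.timeout_ticks:
            self.was_reset = True           # the dog bites


def run(ticks, interrupts_working, timeout_ticks=5):
    """Simulate `ticks` passes of the main loop; return True if the WDT fired."""
    wd = GatedWatchdog(timeout_ticks)
    for _ in range(ticks):
        if interrupts_working:
            wd.timer_isr()
        wd.mainline_kick()
        wd.wdt_tick()
    return wd.was_reset
```

With interrupts running the model never resets; with interrupts stopped the mainline keeps calling `mainline_kick()` but the watchdog still fires, which is the behaviour the post is after.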

--
http://www.piclist.com hint: PICList Posts must start with ONE topic:
"[PIC]:","[SX]:","[AVR]:" =uP ONLY! "[EE]:","[OT]:" =Other "[BUY]:","[AD]:" =Ads




2000\10\24@191156 by David VanHorn

>
>I added a piece of code in the interrupt handler that sets a bit and now I
>only do the CLRWDT in the mainline code if the bit is set (and of course I
>clear the bit again).


Interesting.

What I do is have a 1 ms "tick" interrupt that decrements a counter byte.
At the end of the mainline code, where I am about to jump to the beginning
of the idle loop again, I have a call to "timed_smack", which resets the dog
IF the timer byte is zero, and of course sets the timer back to the
appropriate value. This means the ISR has to be functional, and the
mainline code has to be functional, or the dog bites.
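A rough Python model of the "timed_smack" idea, for illustration only. `TICK_RELOAD` and the class name are made up, and stopping the counter at zero (rather than letting the byte wrap) is an assumption the post does not spell out.

```python
TICK_RELOAD = 10  # hypothetical: number of 1 ms ticks between watchdog kicks


class TimedSmack:
    def __init__(self):
        self.tick_counter = TICK_RELOAD
        self.kicks = 0                      # counts simulated CLRWDT executions

    def tick_isr(self):
        # 1 ms tick interrupt: decrement the counter byte, holding at zero
        # (assumption: the real ISR does not let the byte wrap around).
        if self.tick_counter > 0:
            self.tick_counter -= 1

    def timed_smack(self):
        # Called once per pass of the idle loop.
        if self.tick_counter == 0:
            self.kicks += 1                 # stands in for CLRWDT
            self.tick_counter = TICK_RELOAD
```

The dog only gets reset when both halves run: the ISR has to count the byte down and the mainline has to reach `timed_smack()`. If either one stops, `kicks` stops advancing and a real watchdog would time out.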

2000\10\24@223022 by Bob Ammerman

Exactly the same concept.


2000\10\25@062619 by mike

On Tue, 24 Oct 2000 18:18:35 -0400, you wrote:

>I am wrapping up a design in which I use a bunch of 'virtual peripherals'
>(ala Scenix) multiplexed on a timer 0 interrupt.
>
>Of course, I have the watchdog enabled, and I have sprinkled CLRWDT's around
>my task level code.
>
>I got to thinking however, that if for some reason I stopped getting and
>handling the interrupts, that my mainline code would continue hitting the
>dog indefinitely, which wouldn't be a good thing since the system wouldn't
>be functioning.
>
>So....
>
>I added a piece of code in the interrupt handler that sets a bit and now I
>only do the CLRWDT in the mainline code if the bit is set (and of course I
>clear the bit again).
>
>I feel this will reduce the probability of an undetected error.

Another possibility - the only thing in practice that will stop the
timer ticking is the OPTION reg getting corrupted, so periodically
refreshing it in the foreground (FG) code would effectively do a similar job.

--
http://www.piclist.com hint: The list server can filter out subtopics
(like ads or off topics) for you. See http://www.piclist.com/#topics




2000\10\25@074420 by Andrew Kunz
Bob,

We typically run a 5 ms timer (because 7 ms is the new minimum timeout) to kick the
dog.  It uses input from both TMR0 (ISR) and application code to run down the
5 ms counter.  If we miss too much, it blows.

Andy

2000\10\25@123752 by Bruce Cannon

I guess you're talking about something like an ISR which looks for flags
from each routine that should have been called in the past Xms, and if
they're all there clears the WDT and the flags?

Bruce Cannon
Style Management Systems
http://siliconcrucible.com
(510) 787-6870
1228 Ceres ST Crockett CA 94525

Remember: electronics is changing your world...for good!
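The check-in scheme asked about in this post could be sketched as below. This is a Python model with invented task names and bit assignments, not code from anyone's project; the periodic check stands in for the ISR and the `kicks` counter stands in for CLRWDT.

```python
# Hypothetical task bits; a real design would have one per routine to monitor.
TASK_A = 0x01
TASK_B = 0x02
TASK_C = 0x04
ALL_TASKS = TASK_A | TASK_B | TASK_C


class CheckInWatchdog:
    def __init__(self):
        self.flags = 0
        self.kicks = 0                      # counts simulated CLRWDT executions

    def check_in(self, task_bit):
        # Each monitored routine sets its own bit when it runs.
        self.flags |= task_bit

    def periodic_check(self):
        # Run every X ms: kick only if every routine has checked in,
        # then clear the flags for the next period.
        if (self.flags & ALL_TASKS) == ALL_TASKS:
            self.flags = 0
            self.kicks += 1                 # stands in for CLRWDT
            return True
        return False
```

One missing bit is enough to withhold the kick, so a single stalled routine eventually lets the watchdog time out.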

2000\10\25@124413 by Bruce Cannon

Mike wrote:

> Another possibility - the only thing in practice that will stop the
> timer ticking is the OPTION reg getting corrupted, so periodically
> refreshing it in the FG code would effectively  do a similar job.

But if the option_reg is corrupted, isn't there an equal chance that the
mirror you'd use to refresh it is corrupted?  Or any other reg in your
system, for that matter?


Bruce Cannon
Style Management Systems
http://siliconcrucible.com
(510) 787-6870
1228 Ceres ST Crockett CA 94525

Remember: electronics is changing your world...for good!

2000\10\25@130116 by Dan Michaels

At 09:41 AM 10/25/2000 -0700, you wrote:
>Mike wrote:
>
>> Another possibility - the only thing in practice that will stop the
>> timer ticking is the OPTION reg getting corrupted, so periodically
>> refreshing it in the FG code would effectively  do a similar job.
>
>But if the option_reg is corrupted, isn't there an equal chance that the
>mirror you'd use to refresh it is corrupted?  Or any other reg in your
>system, for that matter?
>


I've been wondering about something as this thread has transpired.

There are chips like the MAX706 supervisory chip that monitor a cpu
pin, and hold the system out of WD_reset as long as the cpu keeps
sending out pulses on the pin. This is also a possible
function I could add to SonofPushMePullYou - the 8-pin wart.

The obvious question is, if you do use a 2nd chip to try to
maintain system integrity, do you not actually increase the
"probability" for failure as there are now more circuits to fail,
and more complexity in the system?

The MAX706, for example, is no doubt CMOS, and also prone to
latchup problems/etc, and might be just as vulnerable as the cpu.

- danM

2000\10\25@132051 by Andrew Kunz

>The MAX706, for example, is no doubt CMOS, and also prone to
>latchup problems/etc, and might be just as vulnerable as the cpu.

To some degree, yes.  The difference is that the 706 is designed to detect those
situations, not to try to run a program.  And if the 706 is getting latched up,
you can bet your boots the micro is too.

Andy

2000\10\25@140011 by Bob Ammerman

How is this philosophy:

Some PIC programs go to great lengths to deal with 'should never happen'
events, like registers getting zapped by a cosmic ray or something. They do
this by using shadow registers, periodically reinitializing peripherals,
etc.

I submit that one would be better served to use the same amount of effort to
include code which attempts to detect any of these conditions (eg: compare
the port register to the shadow, rather than force the shadow into the
port).

When an inconsistency is detected, let the watchdog bite and 'reset' the
chip.

This makes sense to me because if an extreme upset occurs that you could fix
with one of these tricks, chances are that same upset whacked something you
didn't think to fix or could not fix. Hopefully the watchdog reset and
subsequent reinitialization will put things right.

Whaddya think?

Bob Ammerman
RAm Systems
(contract development of high performance, high function, low-level
software)
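A small Python model of the "detect, then let the watchdog bite" approach from this post. The names are hypothetical and the port is just an attribute here. On a real PIC, reading a port returns pin levels rather than the output latch, so the comparison would have to allow for input pins.

```python
class PortGuard:
    """Compare a port against its shadow instead of forcing the shadow out."""

    def __init__(self, initial):
        self.port = initial      # stands in for the hardware port register
        self.shadow = initial    # software copy of what was last written
        self.healthy = True

    def write(self, value):
        # Normal output path: update the shadow and the port together.
        self.shadow = value
        self.port = value

    def may_kick_watchdog(self):
        # On any mismatch, latch the fault and never allow another kick,
        # so the watchdog times out and the chip reinitializes from reset.
        if self.port != self.shadow:
            self.healthy = False
        return self.healthy
```

The fault is latched on purpose: once an inconsistency has been seen, the sketch keeps refusing the kick even if the port later matches again, on the reasoning that whatever upset the port may have upset something else too.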


2000\10\25@153027 by Don Hyde

I'm always taken aback by "what if a cosmic ray hits" bits of coding,
because I suspect that the vast majority of them actually decrease the
reliability of the system they're in.  They are typically the very tricky
kinds of code that are most likely to harbor unsuspected bugs, and due to
their very nature are most likely to disrupt the functioning of an otherwise
healthy system.  They are often also highly resistant to testing.  Remember:
"If you didn't test it, then it probably doesn't work."

A watchdog timer is a pretty pathetic reliability measure, and is not
without its own dangers.  Carefully used, it is a cheap way to increase the
overall reliability of a system, but it also introduces another opportunity
to design in a mistake that will take down the system.  Even when it works,
it may silently mask a fault that should have been repaired before it got to
the point that it stops the system altogether.  Do you tell anybody when
your watchdog goes off?

If you need real high reliability like with spacecraft and airliners, then
you need to go the whole route with redundant systems and all the extra
money and trouble it takes to make them work.  There is a reason why these
systems are so expensive, and it's not just government waste.

In a system as small as most PICs are, you will probably do more for the
reliability of your system by spending a couple more hours testing your
software under unusual but realistic conditions than you will by spending
the same time trying to guess what will happen to a few registers when a
cosmic ray hits.

If you find and fix one tiny bug this way, you will probably do more for
your system than you would if you really could make it immune to cosmic
rays.

2000\10\25@160943 by Dan Michaels

At 01:20 PM 10/25/2000 -0400, you wrote:
>>The MAX706, for example, is no doubt CMOS, and also prone to
>>latchup problems/etc, and might be just as vulnerable as the cpu.
>
>To some degree, yes.  The difference is that the 706 is designed to detect those
>situations, not to try to run a program.  And if the 706 is getting latched up,
>you can bet your boots the micro is too.
>

Exactly my point.

2000\10\25@162422 by David VanHorn

>
>A watchdog timer is a pretty pathetic reliability measure, and is not
>without its own dangers.  Carefully used, it is a cheap way to increase the
>overall reliability of a system, but it also introduces another opportunity
>to design in a mistake that will take down the system.  Even when it works,
>it may silently mask a fault that should have been repaired before it got to
>the point that it stops the system altogether.  Do you tell anybody when
>your watchdog goes off?

True enough.
Where I'm using it, all I care is that the system not get into a "locked
up" state, requiring the user to power-cycle the device.  Data that gets
dropped on the floor won't hurt me, so it's used within its means I think.

I do wonder how the "big guys" do it, since an out-of-control micro can
wreak absolute havoc in as little as one instruction, and no watchdog has
even a hope in hell of catching that.

2000\10\25@163055 by David VanHorn

At 01:58 PM 10/25/00 -0400, Bob Ammerman wrote:
>How is this philosophy:
>
>Some PIC programs go to great lengths to deal with 'should never happen'
>events, like registers getting zapped by a cosmic ray or something. They do
>this by using shadow registers, periodically reinitializing peripherals,
>etc.

How can you implement this in a "real" system?
Re-initting peripherals is something I've heard of, in varying degrees from
"fugeddaboudit" to "it's absolutely necessary".  Still, you then have to
define a point when it's safe to do so.  All this assumes that you can
really determine that it needs re-initting.. What if your shadow took the
hit, and not the peripheral.. Now you need at least a pair of shadows, and
handlers for  the case where neither shadow agrees with the periph, or the
other shadow. :-P

In order to work at all, this sort of approach has to be really designed
into the app. Resetting because of a port-shadow conflict could cause more
problems than the port being (debatably) initted wrong.. Now you also have
to deal with how to come out of reset in a way that doesn't cause problems.

It's almost a "robotics" problem, in that the system has to come up,
observe its "world" and figure out what to do next.

I suppose the NASA and medical guys must face this all the time, makes me
glad I'm not there :) But I would like to hear how they approach things
like this, in case there are any small bits that I can use.

2000\10\25@171024 by Olin Lathrop

> The obvious question is, if you do use a 2nd chip to try to
> maintain system integrity, do you not actually increase the
> "probability" for failure as there are now more circuits to fail,
> and more complexity in the system?

One of my current projects uses a 16F876 as the main controller and a
12C508A to watch a heartbeat from the main controller and also to look for
other inconsistent state that can only be caused by a hardware failure.  The
'876 can also partially monitor the '508A.  Each PIC can independently shut
down the operation of some high power parts.

Yes, the additional parts increase the probability of a failure.  But keep
in mind what watchdogs are for.  The purpose is to prevent grossly
undesirable operation in the event of a failure, not to prevent failures.
The additional parts CAN greatly decrease the chance of unsafe operation
(people getting zapped, unit catching fire, etc) even though they increase
the chance of failure.


*****************************************************************
Olin Lathrop, embedded systems consultant in Devens Massachusetts
(978) 772-3129, olinspamspam_OUTcognivis.com, http://www.cognivis.com
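As an illustration of the heartbeat arrangement described in this post, here is a Python model of what the monitoring chip might do. The timing values, the latched shutdown and all names are assumptions for the sketch; the post does not describe the actual firmware.

```python
class HeartbeatMonitor:
    """Model of a second chip watching a heartbeat pin from the main controller."""

    def __init__(self, max_gap_ms):
        self.max_gap_ms = max_gap_ms     # assumed longest allowed gap between beats
        self.ms_since_beat = 0
        self.power_enabled = True        # enable line for the high power parts

    def heartbeat_edge(self):
        # Called when an edge is seen on the heartbeat pin.
        self.ms_since_beat = 0

    def tick_1ms(self):
        # Called from the monitor's own 1 ms timebase.
        self.ms_since_beat += 1
        if self.ms_since_beat > self.max_gap_ms:
            self.power_enabled = False   # shut down and stay shut down
```

The shutdown is latched in this sketch: a late heartbeat does not re-enable the power stage. That follows the point made in the post, that the purpose is to prevent unsafe operation after a failure rather than to prevent the failure.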

2000\10\25@180602 by Olin Lathrop

> A watchdog timer is a pretty pathetic reliability measure, and is not
> without its own dangers.  Carefully used, it is a cheap way to increase the
> overall reliability of a system, but it also introduces another opportunity
> to design in a mistake that will take down the system.  Even when it works,
> it may silently mask a fault that should have been repaired before it got to
> the point that it stops the system altogether.  Do you tell anybody when
> your watchdog goes off?

I agree with you.  I thought I was the only heretic not to use the internal
watchdog, so I kept my mouth shut (didn't think that was possible, eh
James?).  In the last two dozen or so embedded systems I can only think of
two that used watchdogs, and both were external.  In other words, they were
watching the hard results, not what the software thinks it was doing.

The internal watchdog can only catch a software bug at best.  If a software
bug is encountered, is it really better to reset the system and try to
pretend it didn't happen, or to fail in an obvious way so that the bug can
be found and fixed?  Certainly the latter at the very least until it gets
into the end user's hands.  Even then, which would you prefer your newly
purchased gizmo to do: occasionally give you a bad reading without telling
you or occasionally crap out completely requiring a power down reset?  In
the first case you get bad data without knowing it, and in the second case
you get the thing fixed or replaced.

In mission-critical cases, an internal watchdog looks for the wrong symptom
in the wrong place anyway.  The only times I have used the internal watchdog
in a PIC was to wake the processor from sleep every once in a while.  I have
never used the watchdog in a "watchdog" role.


*****************************************************************
Olin Lathrop, embedded systems consultant in Devens Massachusetts
(978) 772-3129, @spam@olinKILLspamspamcognivis.com, http://www.cognivis.com

2000\10\25@201325 by Dan Michaels

Olin wrote:
>> The obvious question is, if you do use a 2nd chip to try to
>> maintain system integrity, do you not actually increase the
>> "probability" for failure as there are now more circuits to fail,
>> and more complexity in the system?
>
>One of my current projects uses a 16F876 as the main controller and a
>12C508A to watch a heartbeat from the main controller and also to look for
>other inconsistent state that can only be caused by a hardware failure.  The
>'876 can also partially monitor the '508A.  Each PIC can independently shut
>down the operation of some high power parts.
>


Hmmm, shades of SonofPushMePullYou. Can I "claim" royalties here,
or did you invent this independently?

- danM

2000\10\25@204031 by Bob Ammerman

Actually, adding components to a system can greatly _increase_ reliability.

As an example: I have helped develop control systems for power plants based
on VAX and ALPHA computers.

These systems require 99.99...% uptime.

We arrange two computers as a pair, either of which can do the work of the
system. A small microprocessor based computer (called a watchdog :-))
monitors the health of the two main systems and automatically fails over to
the backup machine if the current live machine gets sick.

Since the reliability of the watchdog is much higher than the main computers
we get great reliability.

The same principle can, of course, be applied to smaller PIC-based systems.

Bob Ammerman
RAm Systems
(contract development of high performance, high function, low-level
software)

2000\10\25@213759 by Dan Michaels

Bob Ammerman wrote:
>Actually, adding components to a system can greatly _increase_ reliability.
>
>As an example: I have helped develop control systems for power plants based
>on VAX and ALPHA computers.
>
>These systems require 99.99...% uptime.
>
>We arrange two computers as a pair, either of which can do the work of the
>system. A small microprocessor based computer (called a watchdog :-))
>monitors the health of the two main systems and automatically fails over to
>the backup machine if the current live machine gets sick.
>
>Since the reliability of the watchdog is much higher than the main computers
>we get great reliability.
>
>The same principle can, of course, be applied to smaller PIC-based systems.
>


It may be in the semantics, but there is a difference between
"adding components" which I was talking about, and adding
"redundancy" which is what you are doing here. Basically, Olin
is correct - I believe - simply adding components will detract
somewhat from the overall reliability. Having 3 computers connected
strategically to back up and monitor each other will improve
reliability.

2000\10\26@044321 by Michael Rigby-Jones

Actually, what you did there was to increase *Availability*.  Reliability
from the point of view of MTBF will be reduced by the simple fact of having
twice the number of bits to go wrong.

The last place I worked for made monitoring equipment for electric locos to
ensure the chopped DC drive didn't introduce harmonics into the track
circuits that could possibly make a blocked section look clear.  There
were several configurations: one customer had two units in a rack wired in
parallel for increased availability, another had two units wired in series
for increased reliability (increased reliability in this case meaning you
could count on the two units in series to detect an errant signal more reliably
than just one unit).  This configuration, not surprisingly, had a lower MTBF
than the parallel arrangement.

Mike

--
http://www.piclist.com hint: The PICList is archived three different
ways.  See http://www.piclist.com/#archives for details.




2000\10\26@044735 by Alan B. Pearce

>This makes sense to me because if an extreme upset occurs that you could fix
>with one of these tricks, chances are that same upset whacked something you
>didn't think to fix or could not fix. Hopefully the watchdog reset and
>subsequent reinitialization will put things right.

>Whaddya think?

You run a windows PC and reboot it at least once every day ??? ;))

2000\10\26@051521 by Alan B. Pearce

>We arrange two computers as a pair, either of which can do the work of the
>system. A small microprocessor based computer (called a watchdog :-))
>monitors the health of the two main systems and automatically fails over to
>the backup machine if the current live machine gets sick.

I was once involved with some data grading equipment that was used in a freezing works (in America known as a slaughter house?). These were attached to a pair of Burroughs B80 computers that did the data capture. The Burroughs used a multidrop serial line to talk to our equipment, and the two machines used the same line to talk to each other.

I understand the two machines ran identical programs, which had been cleverly written along the following lines:

machine boot - program starts and listens to serial line
Time out listening - assume I am master.
Else message on serial line - I am slave.

main loop:
decide next device to poll
master calls slave saying "I am going to poll device X"
slave replies confirming address to be polled
master polls device X
device X replies with data - received and saved by both machines
goto main loop


If the master called the slave and got no response it would proceed on its own to poll the device and get the data.

If the slave did not hear from the master for a while, it assumed the role of master and proceeded with the data capture.
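The role selection and takeover rules above can be modelled roughly as follows. This is a Python sketch with invented names; the silence limit is a made-up parameter, since the post only says the slave waited "for a while", and the serial protocol itself is not modelled.

```python
def boot_role(heard_traffic_before_timeout):
    """Role chosen at boot: slave if the line was active, otherwise master."""
    return "slave" if heard_traffic_before_timeout else "master"


class Node:
    def __init__(self, heard_traffic_at_boot, silence_limit):
        self.role = boot_role(heard_traffic_at_boot)
        self.silence_limit = silence_limit   # hypothetical: silent periods tolerated
        self.silent_periods = 0

    def on_master_announce(self):
        # Slave heard "I am going to poll device X" from the master.
        self.silent_periods = 0

    def on_silent_period(self):
        # A poll period went by with nothing heard from the master.
        if self.role == "slave":
            self.silent_periods += 1
            if self.silent_periods >= self.silence_limit:
                self.role = "master"         # take over the data capture
```

Both machines run the same code and differ only in what they heard at boot, which matches the "identical programs" description in the post.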

This system worked extremely well despite being in an industrial environment with large speed controlled electric motors and all the other normal spike inducing paraphernalia of an industrial environment. I did observe situations where one processor went to never-never land in its program, but am not aware that they ever went to a situation where both machines crashed.

2000\10\26@081417 by Bob Ammerman

> > We arrange two computers as a pair, either of which can do the work of the
> > system. A small microprocessor based computer (called a watchdog :-))
> > monitors the health of the two main systems and automatically fails over
> > to the backup machine if the current live machine gets sick.
> >
> > Since the reliability of the watchdog is much higher than the main
> > computers we get great reliability.
> >
> > The same principle can, of course, be applied to smaller PIC-based
> > systems.
> >
> Actually, what you did there was to increase *Availability*.  Reliability
> from the point of view of MTBF will be reduced by the simple fact of having
> twice the number of bits to go wrong.

You are of course completely correct. However, if the machines are capable
of then rebooting themselves and getting back on line via the watchdog
(assuming the problem isn't a hardware one), _then_ we are increasing
Reliability (a self-corrected failure isn't a failure in MTBF/MTTR terms if
the system recovers on its own).

Bob Ammerman
RAm Systems
(contract development of high performance, high function, low-level
software)

2000\10\26@082328 by mike

On Wed, 25 Oct 2000 09:41:16 -0700, you wrote:

>Mike wrote:
>
>> Another possibility - the only thing in practice that will stop the
>> timer ticking is the OPTION reg getting corrupted, so periodically
>> refreshing it in the FG code would effectively  do a similar job.
>
>But if the option_reg is corrupted, isn't there an equal chance that the
>mirror you'd use to refresh it is corrupted?  Or any other reg in your
>system, for that matter?

Not if you can refresh it with a constant.

2000\10\26@091201 by Andrew Kunz

>I suppose the nasa and medical guys must face this all the time, makes me
>glad  I'm not there :) But I would like to hear how they approach things
>like this, in case there are any small bits that I can use.

Somebody posted a link here about the NASA Shuttle development team.  Very
interesting.

They basically take the "let's have all our CPUs vote on it" approach, which is
basically the same as the multi-VAX solution already presented.

Andy

2000\10\26@102555 by Andy Jancura

>Somebody posted a link here about the NASA Shuttle development team.  Very
>interesting.

Andy or someone, would you please repost the link? I'm very
interested in it. Thank you.

By the way, watchdogs are only one part of a well-running system. A good
watchdog routine can only check how the application program runs; it can't
remove the cause of the malfunction itself.


Andrej

2000\10\26@104515 by Don Hyde

My introduction to reliability came when I worked as a coop student at NASA
MSFC way back during Apollo days.  One thing I learned is that there is such
a thing as a PhD in reliability engineering, and it's not exactly simple.

MTBF is highly sensitive to the definition of F=failure, which is very
mission-dependent.  The usual definition is "failure to accomplish the
goal", not "no individual part failed."  In that case a lot of individual
parts can be busted, but as long as the system accomplishes whatever it was
designed to do, it worked.  One of the Apollo missions was launched through
a threatening cloud.  As it turned out, the cloud was more threatening than
the meteorologists thought, and the rocket was hit by a bolt of lightning,
which took out just about every active electronic system on the booster
rocket.  You never heard much about it because every one of those systems
switched to its redundant backup and worked just fine.

There are a few things you can do to increase reliability.  One is to
improve the quality of the parts -- often by rigorous inspection and
testing.  The difference between commercial and industrial parts is mostly
testing, and it shows up mostly in reliability.  Those parts you paid extra
for to work at -40 probably have better signal margins at room temperature,
and some of the ones with cracked welds and such got washed out.

Probably the best is to simplify the design.  The parts that aren't there
aren't going to break.  This works for software as well as hardware.  Any
feature that isn't really important will still be adding opportunities for a
failure that may cascade into something that does matter.  That's why
Windoze is as reliable as it is.  Less-important features often contribute
disproportionately to unreliability, because it doesn't feel as important to
test them rigorously.

The big gun is redundancy.  Like any big gun, it's expensive and can wind up
hurting you if you aren't really careful or don't know what you're doing
when you wheel it out.  It can run from a limited-capability backup that
provides a limp-home capability up to stuff like the 4- and 5-way redundancy
in some fly-by-wire systems like the space shuttle and some modern
jetliners.

Here's a place where a watchdog can shine.  If you can afford 3 systems, you
can build a voter circuit that will detect the "odd man out", and give the
right answer with one failure.  Cool.  But if you only have two systems,
it's easy to detect when they disagree, but you need something else to cast
the deciding vote.
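A toy version of the three-way voter, in Python, to make the "odd man out" idea concrete. The function name and return convention are invented for the sketch; a real voter would usually be a hardware circuit or run in each of the redundant units.

```python
from collections import Counter


def vote(a, b, c):
    """Return (majority_value, index_of_odd_man_out or None).

    Raises ValueError when no two of the three outputs agree, which is the
    case a simple majority voter cannot resolve.
    """
    outputs = (a, b, c)
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise ValueError("no two systems agree")
    # Index of the dissenting system, or None if all three agree.
    odd = next((i for i, v in enumerate(outputs) if v != value), None)
    return value, odd
```

With only two systems the same comparison can say that they disagree but not which one is wrong, which is where the post suggests a watchdog can cast the deciding vote.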

Mathematically, if you have two or more things that all need to work, then the
system probability of success is the product of the individual components'
probabilities of success.  Since these probabilities are always less than one,
the more of them you multiply together, the smaller the result.  More parts
means more failures.

If either one of the systems can do the job, then you multiply the
probabilities of failure (also less than one), so the more redundant systems
you have, the smaller the system's probability of failure.  More backups
means more success.  Of course, the devil is in the details, and a lot of
those are in the detectors and switches.
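The two rules can be put into a few lines of Python. This is only the arithmetic from the paragraphs above, under the added assumption that the parts fail independently of one another; the function names are made up, and the detectors and switches are treated as perfect.

```python
from math import prod


def series_success(p_each):
    """All parts must work: multiply the probabilities of success."""
    return prod(p_each)


def redundant_success(p_each):
    """Any one part is enough: multiply the probabilities of failure,
    then subtract the result from one."""
    return 1.0 - prod(1.0 - p for p in p_each)
```

For two parts that each succeed 90% of the time, the series arrangement comes out at about 0.81 and the redundant arrangement at about 0.99.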

2000\10\26@105719 by Alan B. Pearce

>Of course, the devil is in the details, and a lot of
>those are in the detectors and switches.

Tell me about it. I'm currently working on a power supply for a satellite instrument that is way behind schedule because of what we were not told about other subsystems in the instrument, which we are now having to fix after getting the flight unit built!

2000\10\26@145206 by Dan Michaels

AndyK wrote:
>>I suppose the nasa and medical guys must face this all the time, makes me
>>glad  I'm not there :) But I would like to hear how they approach things
>>like this, in case there are any small bits that I can use.
>
>Somebody posted a link here about the NASA Shuttle development team.  Very
>interesting.
>
>They basically take the "let's all our CPUs vote on it" approach, which is
>basically the same as the multi-VAX solution already presented.
>


This doesn't sound all that bad until you actually think about
what is going on here --> hmmm, 3 cpus say no, and 4 cpus say yes.
Hmmm, close enough --> launch!

Oh, my. Don't think I like this system of majority rules very
much. Got to have "at least" 2/3:  5 say yes, and 2 say no.
Hmmm. Funny feelings running up the spine about this. Reboot.

2000\10\26@171639 by Bruce Cannon

> >But if the option_reg is corrupted, isn't there an equal chance that the
> >mirror you'd use to refresh it is corrupted?  Or any other reg in your
> >system, for that matter?
> Not if you can refresh it with a constant

But if you refresh it with a constant you're resetting the system which is
the same thing the watchdog does.

Bruce Cannon
Style Management Systems
http://siliconcrucible.com
(510) 787-6870
1228 Ceres ST Crockett CA 94525

Remember: electronics is changing your world...for good!

2000\10\26@190036 by mike

On Thu, 26 Oct 2000 14:14:55 -0700, you wrote:

>> >But if the option_reg is corrupted, isn't there an equal chance that the
>> >mirror you'd use to refresh it is corrupted?  Or any other reg in your
>> >system, for that matter?
>> Not if you can refresh it with a constant
>
>But if you refresh it with a constant you're resetting the system which is
>the same thing the watchdog does.

No - doing

    movlw xx
    option

won't cause a reset. (Actually that's not always entirely true, but if xx is
always the same it is. Changing WDT prescaler settings can cause WDT resets in
some circumstances.)
