PICList Thread
'[EE]: Software and Human errors'
2007\04\14@150842 by Marcel Duchamp

How Mars Global Surveyor was lost:

http://planetary.org/news/2007/0413_Human_and_Spacecraft_Errors_Together.html

In light of the recent postings on how to create bug-free software, I
thought this might be of interest.

2007\04\14@154511 by wouter van ooijen

> How Mars Global Surveyor was lost:
>
> planetary.org/news/2007/0413_Human_and_Spacecraft_Error
> s_Together.html
>
> in light of the recent postings on how to create bug-free software, I
> thought this might be of interest.

The bold text that starts the article is IMHO misleading: "resulted from
mistakes made by both the human operators and the spacecraft's onboard
fault protection software." The rest of the text does not mention any
error in the on-board software, only erroneous parameters uploaded from
earth. Pure human operator errors.

The loss of the first Ariane 5 is often attributed to "a software error",
which is equally misleading. In a sense it was an operator error too,
although the operator in this case was the project management.

Wouter van Ooijen

-- -------------------------------------------
Van Ooijen Technische Informatica: http://www.voti.nl
consultancy, development, PICmicro products
docent Hogeschool van Utrecht: http://www.voti.nl/hvu



2007\04\14@174046 by Gerhard Fiedler

wouter van ooijen wrote:

>> How Mars Global Surveyor was lost:
>>
>> planetary.org/news/2007/0413_Human_and_Spacecraft_Error
>> s_Together.html
>>
>> in light of the recent postings on how to create bug-free software, I
>> thought this might be of interest.
>
> The bold text that starts the article is IMHO misleading: "resulted from
> mistakes made by both the human operators and the spacecraft's onboard
> fault protection software." The rest of the text does not mention any
> error in the on-board software, only erroneous parameters uploaded from
> earth. Pure human operator errors.

Well, in the end all errors are human, aren't they? If a device fails, it's
the designer's error... :)

But I got the feeling that the real problem was with the parameter upload
software. They say about this process:

"When the identical and correct HGA parameters were uploaded to the
spacecraft, the operations team incorrectly specified the location for the
new parameter in the computer's memory.  Because the wrong memory location
was specified, the new parameter was written over the end of one and the
beginning of a second parameter being stored in onboard memory, corrupting
both parameters."

I think that the software used to upload a parameter to such an expensive
device should have a bit more safety built in, so that it is /really/
difficult to write to a wrong location. An operator should not have to
enter any addresses manually at all. This is considered bad design on
devices that cost $200...  

Gerhard

2007\04\15@040600 by wouter van ooijen

> I think that the software used to upload a parameter to such
> an expensive device should have a bit more safety built in,
> so that it is /really/ difficult to write to a wrong
> location. An operator should not have to enter any addresses
> manually at all. This is considered bad design on devices
> that cost $200...  

In the end the operator must always specify something (an address, or a
name, or whatever) and he can do that wrong. With this kind of equipment
you want to have a full-control option, and you can't have full control
without full opportunity for errors...

Wouter van Ooijen




2007\04\15@092447 by Gerhard Fiedler

wouter van ooijen wrote:

>> I think that the software used to upload a parameter to such an
>> expensive device should have a bit more safety built in, so that it is
>> /really/ difficult to write to a wrong location. An operator should not
>> have to enter any addresses manually at all. This is considered bad
>> design on devices that cost $200...  
>
> In the end the operator must always specify something (an address, or a
> name, or whatever) and he can do that wrong. With this kind of equipment
> you want to have a full-control option, and you can't have full control
> without full opportunity for errors...

I strongly disagree. (Come on, Wouter, I really think you know better :)

You have a restricted hardware and firmware in the device. You have an
assembler or compiler that created that firmware. You used labels or
variables for the memory locations, with descriptive names and comments
that have an even better description. The linker created a memory map with
the addresses of these labels or variables. Something like this.

Now you say the best one can do is to take a printout of the map file and
enter the raw memory addresses manually when you want to send something to
the device? I'm almost certain that you would do better.

In the simplest form, you create a map between human-readable parameter
names and their address ranges and write a standard GUI application where
the operator fills in clearly labeled fields. Or you have a command line
application that checks the address input against this map, prompts the
user with the human-readable name (and rejects address ranges that don't
match any of the defined ranges) and allows the user to proceed or cancel.
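
A minimal sketch of the command-line variant described above, in Python.
The parameter names, addresses and sizes are invented for illustration and
do not reflect any real spacecraft memory layout or ground software. The
check accepts an address range only if it matches one defined parameter
exactly, and hands back the human-readable name so the operator can confirm
or cancel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    name: str    # human-readable parameter name
    start: int   # first byte address
    size: int    # length in bytes


# Hypothetical parameter map; every entry is an assumption for illustration.
PARAM_MAP = [
    Param("hga_soft_stop_min", 0x6A320, 8),
    Param("hga_soft_stop_max", 0x6A328, 8),
    Param("sun_sensor_gain", 0x6A330, 8),
]


def check_upload(start: int, size: int) -> str:
    """Return the name of the parameter that exactly matches the range.

    Raises ValueError when the range matches no defined parameter, for
    example when it straddles two neighbouring parameters.
    """
    for p in PARAM_MAP:
        if p.start == start and p.size == size:
            return p.name
    last = start + size - 1
    raise ValueError(
        f"range {start:#x}-{last:#x} does not match any defined parameter"
    )
```

With this map, check_upload(0x6A328, 8) names "hga_soft_stop_max", while a
range that starts four bytes too late, such as check_upload(0x6A32C, 8), is
rejected before anything is sent.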

In a more elaborate form, this address map used by the software can be
automatically derived from the original firmware sources and the linker
map. You can derive the variable name, and its description in the
associated comment, and its address from the associated linker map. The
software used to upload parameters can be as elaborate as needed (range and
consistency checking, link the parameter values to a simulator and simulate
them before uploading, etc.); the hardware restrictions of the target
device are completely irrelevant here. When entering a parameter, the
operator does not enter an address range, she selects from a list of
sufficiently descriptive parameter names. It's not even possible to enter
an address range that overwrites other parameters. It's still possible to
select a wrong parameter, of course, but IMO this is much more unlikely
than it is to make a typo in the address range.
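
As a rough sketch of the more elaborate form, the name-to-address table can
be parsed from a map file instead of being typed in. The map-file layout
assumed here (hex address, hex size, symbol name per line) is a simplified,
made-up format; real linker maps differ per toolchain. The function names
and the sample symbols are likewise hypothetical.

```python
import re

# Assumed, simplified map-file line: "<hex address> <hex size> <symbol>"
MAP_LINE = re.compile(r"^\s*(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(\w+)\s*$")

# Hypothetical excerpt; the symbols and addresses are invented.
SAMPLE_MAP = """
# parameter section
0x0006a320 0x8 hga_soft_stop_min
0x0006a328 0x8 hga_soft_stop_max
"""


def parse_linker_map(text):
    """Build {symbol: (address, size)} from the assumed map format."""
    table = {}
    for line in text.splitlines():
        m = MAP_LINE.match(line)
        if not m:
            continue  # skip comments, blank lines and anything unrecognised
        addr, size, name = int(m.group(1), 16), int(m.group(2), 16), m.group(3)
        if name in table:
            raise ValueError(f"duplicate symbol {name!r} in map")
        table[name] = (addr, size)
    return table


def build_write(table, name, payload):
    """Turn a parameter name plus value bytes into an (address, bytes) write.

    The operator never types an address; the payload must fill the
    parameter exactly, so it cannot spill into a neighbour.
    """
    if name not in table:
        raise KeyError(f"unknown parameter {name!r}")
    addr, size = table[name]
    if len(payload) != size:
        raise ValueError(f"{name} is {size} bytes, got {len(payload)}")
    return addr, bytes(payload)
```

Selecting the wrong name is still possible, as noted above, but a write
built this way always lands on a parameter boundary.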

I can't imagine that we need to discuss whether it's more error prone to
enter an address range of "6A32F-6A337" or to select a field "sun ray
vertical friction" when addressing a parameter value. Of course you can say
that the underlying memory map may be wrong. It may be, of course. But it
may be generated from the source code and linker map, and after this it may
be checked by a dozen independent reviewers. With that, it's reasonably
certain that it is correct (in any case more so than the address range
that's entered manually by the operator and not checked by anybody else).
From that point on, operator errors caused by entering the wrong address
range are much less likely to occur.

(This all doesn't preclude that you have an override option that allows an
operator to in fact enter a raw target address, for unforeseen emergencies.
But this is not needed in normal operation, and should only be used with
the utmost care and double or triple checking the entered values.)

This sort of low-level control you are talking about is not normally
needed. I use it sometimes, as I'm sure we all do on occasion, especially
during debugging, because all the effort to create the comfortable access
software is not warranted. But that's during debugging, and I'm on a tight
budget usually. I would never (well, I think... "never say never" :)
deliver a device to a user (and may it be a space device operator) that
requires entering raw addresses for changing parameters. Such a space
device should have the budget for a decent software for parameter upload --
in the end, as can be seen, it's cheaper to have it than not :)

Gerhard

2007\04\15@101755 by wouter van ooijen

> Now you say the best one can do is to take a printout of the
> map file and enter the raw memory addresses manually when you
> want to send something to the device? I'm almost certain that
> you would do better.

No, I said the operator has to specify *something*. A name is an
instance of something. If the default ground operator interface used hex
addresses instead of names I would consider that an error, but probably
a design error or more probably a specification error, not a software
error. And note: that's ground software, not on-board software.

> I would never (well, I think... "never say never" :)
> deliver a device to a user (and may it be a space device
> operator) that requires entering raw addresses for changing
> parameters.

For the space stuff I worked on, such an interface was mandatory (a
name-based interface was built on top of it, but the address-based
interface was required to be available). So if you would not deliver it,
your software would not fly (or rather: you would not be on the project
at all).

> Such a space device should have the budget for a
> decent software for parameter upload -- in the end, as can be
> seen, it's cheaper to have it than not :)

That seems plausible, but did you really try to do the math? Compare the
extra cost (to make it conform to whatever you see fit) for *all*
spacecraft, including the ones that never get launched, and then compare
it to the cost you save (you still won't catch all errors!). I am not
saying you are wrong, but I would not be convinced either way before
someone put the calculation in front of me.

Wouter van Ooijen





2007\04\15@213104 by Gerhard Fiedler

wouter van ooijen wrote:

>> Now you say the best one can do is to take a printout of the map file
>> and enter the raw memory addresses manually when you want to send
>> something to the device? I'm almost certain that you would do better.
>
> No, I said the operator has to specify *something*. A name is an
> instance of something.

I'm not sure you read the article I was referring to. Here's an excerpt:

"When the identical and correct HGA parameters were uploaded to the
spacecraft, the operations team incorrectly specified the location for the
new parameter in the computer's memory.  Because the wrong memory location
was specified, the new parameter was written over the end of one and the
beginning of a second parameter being stored in onboard memory, corrupting
both parameters."

An upload software that doesn't have safeguards against this is a major
problem IMO.

> If the default ground operator interface used hex addresses instead of
> names I would consider that an error, but probably a design error or
> more probably a specification error, not a software error.

So? It's still a major problem -- as this event showed.

> And note: that's ground software, not on-board software.

This is exactly my point.


>> I would never (well, I think... "never say never" :) deliver a device to
>> a user (and may it be a space device operator) that requires entering
>> raw addresses for changing parameters.
>
> For the space stuff I worked on, such an interface was mandatory (a
> name-based interface was built on top of it, but the address-based
> interface was required to be available). So if you would not deliver it,
> your software would not fly (or rather: you would not be on the project
> at all).

Wouter... please read my stuff :)  If there was a name-based interface on
top of the raw address interface, that's exactly what I was suggesting (the
operators of your interface were not /required/ to use the address-based
interface) -- and what these guys obviously did not have.



I have to admit that I never sent a device into space. But I sent devices
to factory floors, on the streets, and elsewhere -- and none of them
required a user to change configuration parameters using raw address
tables. From my experience, this would be a support nightmare. (And at
least if consumers were involved, could probably easily cause legal
problems, too.) So, yes, so far both my clients and I have come to the
conclusion that a human-readable interface that does not allow the user to
write to arbitrary addresses is worth the expense. (This
expense is not that big, either. It's rather straightforward to build this
on top of the raw address interface.)

Gerhard

2007\04\16@014859 by wouter van ooijen

> An upload software that doesn't have safeguards against this
> is a major problem IMO.

As I said, it looks like they had an address-based interface. Maybe they
had a name-based interface on top of that too, but from the article it
seems they did not use that. I am reluctant to trust the article in such
detail; I have seen similar articles get such details subtly wrong.
Subtly can in cases like this mean totally.

> So? It's still a major problem -- as this event showed.

IIRC I argued that I did not see evidence for an (in flight?) software
error.

> Wouter... please read my stuff :)  If there was a name-based
> interface on top of the raw address interface, that's exactly
> what I was suggesting (the operators of your interface were
> not /required/ to use the address-based
> interface) -- and what these guys obviously did not have.

I would not draw so many conclusions from an article like this.

> -- and none of them required a user to change configuration
> parameters using raw address tables. From my experience, this
> would be a support nightmare.

Was any of these devices you made accessible by remote access only (in
the sense that it would cost $$$ to replace the device, and $$$$ to get
a human near it)? In such circumstances you want *total* remote access.

> (This expense is not that big, either.
> It's rather straightforward to build this on top of the raw
> address interface.)

In space stuff everything is terribly expensive, even the parts and
software that will not fly.

As an example: guess what a "normal" resistor for in-space use will
cost? (I will have to look it up too)

Wouter van Ooijen




2007\04\16@055454 by Alan B. Pearce

>Wouter... please read my stuff :)  If there was a name-based
>interface on top of the raw address interface, that's exactly
>what I was suggesting (the operators of your interface were not
>/required/ to use the address-based interface) -- and what
>these guys obviously did not have.

Unfortunately what Wouter describes is what I see happening as well. You
submit a patch that you want sent to "your" instrument, and the operators
send it at a designated slot in their operations. No matter what you try and
do in ground testing a patch, it is still possible to have problems with the
uploaded one.

I haven't read the article, but it sounds like they should have done some
form of readback to verify correct loading of the parameters, but that is
not always possible either.

2007\04\16@063452 by wouter van ooijen

> I haven't read the article, but it sounds like they should have done
> some form of readback to verify correct loading of the parameters, but
> that is not always possible either.

Where possible a write-to-temporary-storage, readback,
write-to-working-storage cycle is often used. But IIRC from the article
in this case the communication window was shorter than the round-trip
delay ...
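
A small simulation of such a write-to-temporary-storage, readback,
write-to-working-storage cycle, in Python. The Target class and its method
names are invented stand-ins for a remote device; this is a sketch of the
general pattern, not of any actual flight or ground software.

```python
class Target:
    """Hypothetical remote device with working memory and a staging slot."""

    def __init__(self, size):
        self.working = bytearray(size)
        self.staging = None  # (address, data) waiting to be committed

    def stage(self, addr, data):
        # Step 1: store the request without touching working memory.
        self.staging = (addr, bytes(data))

    def read_staging(self):
        # Step 2: report back what was received.
        return self.staging

    def commit(self):
        # Step 3: copy the staged data into working memory.
        addr, data = self.staging
        self.working[addr:addr + len(data)] = data
        self.staging = None

    def discard(self):
        self.staging = None


def staged_upload(target, addr, data):
    """Commit only if the readback matches what was sent."""
    target.stage(addr, data)
    if target.read_staging() != (addr, bytes(data)):
        target.discard()
        return False
    target.commit()
    return True
```

The readback step needs a full round trip before the commit, which is
exactly what a communication window shorter than the round-trip delay rules
out.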

Wouter van Ooijen




2007\04\16@094136 by Gerhard Fiedler

wouter van ooijen wrote:

> I am reluctant to trust the article in such detail; I have seen similar
> articles get such details subtly wrong.

Ok, you have a point with this. But then, we don't know anything... I was
just assuming the article was correct, and arguing from that position. If
you know it wasn't, let's talk about this. If you only suspect it might not
be, then there's not really much to say anyway...


> IIRC I argued that I did not see evidence for an (in flight?) software
> error.

That might be so, but nobody claimed that there was one.

I argued that it seemed -- from the article -- that there were not enough
safeguards (in ground software and procedures) against an almost
predictable error.


>> -- and none of them required a user to change configuration parameters
>> using raw address tables. From my experience, this would be a support
>> nightmare.
>
> Was any of these devices you made accessible by remote access only (in
> the sense that it would cost $$$ to replace the device, and $$$$ to get
> a human near it)? In such circumstances you want *total* remote access.

I said that the very first time, and in every msg since: I never argued
against total and totally free access. I just argued against using such an
access without the appropriate safeguards -- which, according to the
article, seems to have happened:

"'The particular procedures that were in place didn't have as much rigor in
them as you could have expected,' Perkins said. 'For example, they didn't
require people to go back to this very basic and not-always-easy-
to-interpret set of tables that show everything that's in the memory on the
spacecraft to verify what was going on.  There wasn't any obvious thing
between June and November that would have led someone to discover [the
faulty parameters].  Among our findings is that... there were no systems in
place that would have discovered it and brought it to the attention of the
operations team.'"


> In space stuff everything is terribly expensive, even the parts and
> software that will not fly.

Right. But the loss of the craft (total cost $210M) shows that there was a
problem. /This/ was terribly expensive. Apparently a few $10k in better
procedures/ground software would have been well spent. (I also know that
afterwards everybody is smart -- I'm not attempting to judge the involved
people here.)

The only thing I argued was that this was -- IMO, and according to the
article -- not an operator error, but a problem in the procedures (which
probably include several pieces of ground software) that made it possible
that a single operator /can/ make a mistake with raw memory addresses in
the first place.

> As an example: guess what a "normal" resistor for in-space use will
> cost? (I will have to look it up too)

No clue. I'd guess over $1, after you factor everything in. But the more
expensive, the more it is an argument in favor of what I'm trying to say:
put better safeguards in ground procedures and software, so that the
probability of such silly things is smaller still. If a single operator
mistake can wreak such havoc, it's generally not the operator's fault, it's
the whole procedure that's missing something.

Gerhard

2007\04\16@103731 by wouter van ooijen

> > I am reluctant to trust the article in such detail; I have seen
> > similar articles get such details subtly wrong.
>
> Ok, you have a point with this. But then, we don't know
> anything... I was just assuming the article was correct, and
> arguing from that position. If you know it wasn't, let's talk
> about this. If you only suspect it might not be, then there's
> not really much to say anyway...

IIRC the article stated that they 'specified the wrong address'. From
this you extrapolated that they used an address-value interface, and
that they should have used a name-based interface. For me that's two
steps that are not 100% justified from such an article.

> I argued that it seemed -- from the article -- that there
> were not enough safeguards (in ground software and
> procedures) against an almost predictable error.

OK, so at least we agree on 'no (proven) flight software error' :)

> Right. But the loss of the craft (total cost $210M) shows
> that there was a problem. /This/ was terribly expensive.
> Apparently a few $10k in better procedures/ground software
> would have been well spent.

But that is not fair: if you create such a directive, that amount of
money must be spent on *every* spacecraft, not just this one.

And IIRC the spacecraft was lost well after its intended lifetime, so in
some bookkeeping sense the loss was $ 0. That is probably not right
either, but certainly not the total cost of the craft!

> No clue. I'd guess over $1, after you factor everything in.

IIRC it was more like $100. But maybe someone who is currently into such
systems can give a ballpark figure.

> But the more expensive, the more it is an argument in favor
> of what I'm trying to say: put better safeguards in ground
> procedures and software, so that the probability of such
> silly things is smaller still.

But that cost explosion factor (a resistor in your TV set probably costs
~ $0.0005 .. $0.05, depending on which costs you include) also applies
to software, so when you guesstimate $10k the cost for space-related
software (even ground software) will be an order of magnitude higher. You
must take that into account when you compare it to the loss.
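
The comparison being asked for can be written down as a break-even
calculation. The Python sketch below is only a way to frame it: the function
name is made up, and the fleet sizes in the usage note are placeholders, not
data. The question it answers is how much the safeguard has to lower the
probability of losing a flown craft before it pays for itself, given that
the safeguard is paid for on every craft built, including those that never
fly.

```python
def break_even_risk_reduction(cost_per_craft, craft_built,
                              craft_flown, loss_per_failure):
    """Minimum reduction in per-craft loss probability that pays off.

    Total safeguard cost = cost_per_craft * craft_built
    Total value at risk  = loss_per_failure * craft_flown
    The safeguard breaks even when
        risk_reduction * total value at risk == total safeguard cost.
    """
    total_cost = cost_per_craft * craft_built
    total_exposure = loss_per_failure * craft_flown
    return total_cost / total_exposure
```

Taking the figures mentioned in this thread as inputs ($10k scaled up by an
order of magnitude to $100k per craft, and $210M as the loss) and assuming,
purely for illustration, 10 craft built of which 8 fly,
break_even_risk_reduction(100_000, 10, 8, 210_000_000) comes out at roughly
0.06 percent. The result scales inversely with the loss figure, so booking
the loss of a craft past its intended lifetime at far less than $210M raises
the break-even point accordingly.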

> If a single operator mistake
> can wreak such havoc, it's generally not the operator's
> fault, it's the whole procedure that's missing something.

That's a good first-order assumption.

Wouter van Ooijen




2007\04\16@122909 by Robert Rolf

wouter van ooijen wrote:
>> I haven't read the article, but it sounds like they should have done
>> some form of readback to verify correct loading of the parameters, but
>> that is not always possible either.
>
>
> Where possible a write-to-temporary-storage, readback,
> write-to-working-storage cycle is often used. But IIRC from the article
> in this case the communication window was shorter than the round-trip
> delay ...

But this wouldn't have helped since the readback would have given the
correct data back even though the supplied address was WRONG!
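
A tiny Python illustration of this point, using an invented two-parameter
memory layout: reading back the bytes that were just written confirms the
data, but says nothing about whether the address was the intended one.

```python
# Hypothetical layout: two adjacent 8-byte parameters at offsets 0 and 8.
memory = bytearray(b"AAAAAAAA" + b"BBBBBBBB")


def write_and_verify(mem, addr, data):
    """Write data at addr, then read the same range back and compare."""
    mem[addr:addr + len(data)] = data
    return bytes(mem[addr:addr + len(data)]) == data


new_value = b"NNNNNNNN"
# The operator meant offset 8 (the second parameter) but specified 4.
readback_ok = write_and_verify(memory, 4, new_value)
```

Here readback_ok is True, yet memory now holds b"AAAANNNNNNNNBBBB": the end
of the first parameter and the beginning of the second are both corrupted.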

R

2007\04\16@190638 by Gerhard Fiedler

wouter van ooijen wrote:

> IIRC the article stated that they 'specified the wrong address'. From
> this you extrapolated that they used an address-value interface, and
> that they should have used a name-based interface. For me that's two
> steps that are not 100% justified from such an article.

I generally assume that even such an article makes a difference between
"address" and "name" :)  Besides, they explicitly explain that the address
was wrongly specified in a way that the value overwrote neighboring
parameter locations. This pretty much requires either an "address-value
interface" and a wrong address spec, or a pretty serious bug in the
name-based interface (in which case it would be downright wrong to say that
it was an operator error).

Claiming that it was an operator error and saying that the error caused
overwriting neighboring parameter addresses IMO pretty much implies what you
call an "address-value interface".

> OK, so at least we agree on 'no (proven) flight software error' :)

Sure. Flight software was never in the (my) picture :)


>> But the more expensive, the more it is an argument in favor of what I'm
>> trying to say: put better safeguards in ground procedures and software,
>> so that the probability of such silly things is smaller still.
>
> But that cost explosion factor (a resistor in your TV set probably costs
> ~ $0.0005 .. $0.05, depending on which costs you include) also applies
> to software, so when you guesstimate $10k the cost for space-related
> software (even ground software) will be an order of magnitude higher. You
> must take that into account when you compare it to the loss.

This is where I think I disagree.

If there's something that can improve an insecure procedure, I'd say
something is often better than nothing. The way it seemed to have been,
there were no safeguards at all. Even a simple safeguard (not preventing
anything, not writing to the actual data, but warning "Hey, there's
something fishy -- you're sure about this?") would have been better than
nothing. And software that doesn't act on its own, doesn't write to data,
probably doesn't have to have that cost explosion factor applied to it.


>> If a single operator mistake can wreak such havoc, it's generally not
>> the operator's fault, it's the whole procedure that's missing
>> something.
>
> That's a good first-order assumption.

Good that we agree on this... :)  This was my original point, probably not
formulated adequately at first.

Gerhard

2007\04\17@102912 by Howard Winter

Wouter,

On Sat, 14 Apr 2007 21:46:53 +0200, wouter van ooijen wrote:

{Quote hidden}

Well *all* errors are human in origin - the machine didn't write its own
software!  :-)  But they did say that the onboard software made a couple
of mistakes, one in trying to orient the spacecraft in a way that faced a
battery to the Sun (implying that the battery orientation wasn't a
parameter it took notice of, or even had), and one in misinterpreting high
battery temperature as due to overcharging.  Possibly also misinterpreting
the "stuck" motor as a hardware fault, rather than a parameter error.
Maybe there should have been a set of fallback "absolute limit" parameters
that are hardwired - appropriate for physical parameters such as the
end-point of a jack-screw or whatever - which the onboard software checks
against the live parameters it's using.
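
One way such a check could look, sketched in Python for readability (real
onboard code would be in whatever language the flight software uses). The
parameter name, the limits and the safe default are all invented; the only
idea taken from the text above is that a fixed table of absolute limits is
compared against the live, uploadable value before it is used.

```python
# Hypothetical hardwired table: name -> (lowest, highest, safe default).
# In this sketch it stands for values fixed at build time, not uploadable.
ABSOLUTE_LIMITS = {
    "hga_soft_stop_deg": (-90.0, 90.0, 0.0),
}


def checked_value(name, live_value):
    """Return (value to use, whether the live parameter was accepted).

    A live value outside the hardwired limits is not used; the safe
    default is returned instead so the caller can raise a fault flag.
    NaN fails both comparisons and is therefore rejected as well.
    """
    lowest, highest, default = ABSOLUTE_LIMITS[name]
    if lowest <= live_value <= highest:
        return live_value, True
    return default, False
```

Falling back to a default is one possible reaction; refusing to move at all
or entering a safe mode would be equally valid choices for this sketch.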

Cheers,


Howard Winter
St.Albans, England


2007\04\17@103526 by Howard Winter

Wouter,

On Sun, 15 Apr 2007 10:05:42 +0200, wouter van ooijen wrote:

{Quote hidden}

The fact that the wrongly-addressed parameter overwrote half of one field
and half of another implies that an address was used, since it didn't
align with any genuine parameters.  This should have been checked
*somewhere* - if the onboard software didn't check it, the ground software
sending it should have done so.  Even if it only flagged it as a warning:
"You're writing over parameter boundaries".  And while full control is
necessary, making it the standard interface is a bad thing.  I can adjust
the ignition timing on my car, but not while I'm driving it (this used to
be the case in the early days of internal combustion engines) and I
shouldn't have to.  The standard interface should be restricted to
predictable operations, so the possible range of mistakes is limited.  The
raw full-control interface (typing addresses etc) should only be used in
emergencies, and checking of these operations should be given much more
emphasis.
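
A sketch of what such a warning on the raw interface might look like, in
Python. The layout table and function names are assumptions made for the
example. The raw write stays available, but a write that covers only part
of a known parameter is held back until it is explicitly confirmed.

```python
# Hypothetical layout: name -> (start address, size in bytes)
LAYOUT = {
    "param_a": (0x100, 8),
    "param_b": (0x108, 8),
    "param_c": (0x110, 4),
}


def boundary_warnings(start, size):
    """List every known parameter that the write would only partly cover."""
    end = start + size
    warnings = []
    for name, (p_start, p_size) in LAYOUT.items():
        p_end = p_start + p_size
        overlap = min(end, p_end) - max(start, p_start)
        if 0 < overlap < p_size:
            warnings.append(
                f"write covers {overlap} of {p_size} bytes of {name}"
            )
    return warnings


def raw_write(memory, start, data, confirmed=False):
    """Raw address write; partial overwrites need confirmed=True.

    Returns the list of warnings. If there are warnings and the write
    was not confirmed, nothing is written.
    """
    warnings = boundary_warnings(start, len(data))
    if warnings and not confirmed:
        return warnings
    memory[start:start + len(data)] = data
    return warnings
```

An 8-byte write at 0x104 produces two warnings (half of param_a, half of
param_b), while the same write at 0x108 produces none. Writes into unmapped
gaps are not flagged by this sketch.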

As for whether the spacecraft was worth saving by spending more on reducing proneness to error, I think this is a
"when are you going to stop beating your wife" question - there is no correct answer!  :-)

Cheers,


Howard Winter
St.Albans, England

