PICList Thread
'Do an evil ghost live in my PIC18FxxJxx ? - using CCS compiler'
2007\08\27@173410 by Morgan Olsson

Hi

We are having severe trouble with a PIC18 programmed in C.

Has anybody experienced anything odd, like functions
returning the wrong value, if statements evaluating erroneously, or suspected
unprovoked jumps... *occasionally*, while everything works OK during most
executions of the same parts of the code, then after a random time - bang!??

This is driving us nuts: me (hardware designer and assembler guy) and my
colleague, who does all the programming in C (I myself am not very C literate).

To start from the beginning:
We have designed two circuit boards. One is a measurement and user I/O slave
based on a PIC16F883 with a couple of precision A/Ds and other I/O, and the main
control board uses a PIC18F66J15.

Both are programmed in C using the CCS PCWH 4.x compiler.

The PIC16 board works perfectly, so it seems we are not too stupid ;)

The PIC18, on the other hand, mostly works, but once in a few million
instructions or so it does something bad.
We cannot explain the cause, and we have wasted weeks on this
now and are already past the intended deadline...
Simply put, when we loaded the first program version we were happy to see it
work at all: bidirectional comms timesharing power on one line, also
serial communication to a VFC motor driver, and some logic in between.
But... occasionally it just goes wrong.

For example, we caught this behaviour using the ICD2 and extra code to sample
variables:
We shut off interrupts, then execute a part of the program that calls a
function. The function always evaluates correctly, but *occasionally* the
returned value is *not the evaluated one* - in theory the function
cannot even return the value we got at the receiving end!
AND THIS IS WITH INTERRUPTS SHUT OFF!!!!
Also, we could not find an error in the generated assembly code.
At that point, we thought it must be EMI or bad hardware.

But... we eliminated all hardware and EMI problems:


o  We tried moving the PCB away from the VFC, putting it in a metal box,
shielding the processor with copper foil, supplying it from batteries instead of
the switchmode converter, and fitting toroids on the cables.  We also tried
cooling it far below freezing point and heating it with a hair dryer, and varied
the CPU voltage (core and I/O) to max and min.  All this had ZERO impact on the
error rate.

o  We also scoped VDDCORE (which is also VDD) and it is perfectly fine.
We added extra capacitors of other types and even a damping R-C just in case.
Still no difference.

o  We changed to another individual PIC - same behaviour.

o  We changed the design to use the similar PIC18F6722 (higher voltage) on
another circuit board: same problems (plus more; this chip has a bigger
errata...)

o  We also asked Microchip support whether any execution bugs are known in this
chip; the answer is no.

So we rule out hardware problems.

My colleague has found and corrected some errors of his own in the source code,
but the basic problem I have described is still not found.

But we also cannot understand how a problem that expresses itself this
randomly can be related to our own code.
- Or to the compiler, for that matter.
So if we rule out the PIC chip, surrounding hardware, compiler and source
code, what is left?
Nothing?
Still this stupid behaviour!!  AAAAAH.


We have analysed parts of what the compiler has generated.
Some parts are smart, some very clumsy, but not really wrong.
We changed between a few 4.x compiler versions and also ported the code to the
3.x compiler, as a lot of users still prefer that and consider 4.x to still be
in beta.  We still have about the same problem.

The problem seems to wander as we insert debug code.
We have even seen simple if statements go wrong!!  Occasionally.
The error mostly happens when the system operation mode is changing, when
there are a lot of variables changing - but sometimes it just sits there and
changes operation mode by itself...

It seems like there is some ghost throwing dice and rewriting some
register randomly, and/or causing the program to jump and/or return to the wrong
address after a call.

Even with interrupts shut off we have observed "something" hitting us.
Also, the spookiest thing is that in one code setup the PortB interrupt always
fired after a timer interrupt, although we could not find anything in hardware
or code to explain why it would ever do that.

We are pondering switching to the C18 compiler.  Maybe the problem is not the
CCS compiler, but the rewrite might help us find a source code problem, plus
C18 supports REAL-ICE for better debugging.
But it is time consuming to port from CCS C and we are already at the deadline,
so a direct fix would be much better.


The errata we have found are
ww1.microchip.com/downloads/en/DeviceDoc/80246b.pdf
ww1.microchip.com/downloads/en/DeviceDoc/80315a.pdf
It was not easy to find both. Maybe we missed more errata?

Our main thread on the CCS forum on this:
www.ccsinfo.com/forum/viewtopic.php?t=31672
"zilog" is my programming colleague on this project.

We got a lot of help there, but nothing that found the problem.
Remembering the wisdom here on PICLIST, I now turn to you  ;)

--
Morgan Olsson

2007\08\27@180639 by Harold Hallikainen

Regarding functions returning values different from what was computed in the
function, make sure the code that calls the function sees the correct
prototype for it. If, for example, the header file is not
included, the compiler assumes the function returns an int, which is often
not the case!

Hope that is SOME help.

Harold
Been there, done that...



2007\08\27@182233 by Morgan Olsson

The thing is that the code works perfectly most of the time, for thousands
of iterations.
Then suddenly *the same part of the code* does something erratic.

As if some value was overwritten by an interrupt or something.
But that also happens with interrupts globally disabled...

/Morgan



--
Morgan Olsson

2007\08\27@190106 by Harold Hallikainen

These CAN be fun! Since function parameters and larger return values (I
think 8 bit values are returned in W) are passed on the software stack,
a return value whose type is misinterpreted MAY still come back
correct if other code happened to leave that stack memory in the right
state, and wrong when it didn't. I spent a LOT of time single stepping
through code and watching C18 return a wrong value now and then. My
problem turned out to be a missing #include of the header for the function
in the file where I called it. This may not be your problem, though...
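
A minimal sketch of the effect Harold describes, written for a desktop C compiler rather than C18 or CCS. The names read_counter and truncated_as_int16 are hypothetical, and the second function only simulates what a caller might see if it assumed a 16-bit int return; how a real compiler mishandles a missing prototype is compiler specific.

```c
#include <stdint.h>

/* sensor.h (hypothetical): the prototype every caller must see. */
uint32_t read_counter(void);

/* sensor.c (hypothetical): the definition, returning a 32-bit value. */
uint32_t read_counter(void)
{
    return 0x12345678UL;   /* wider than a 16-bit int */
}

/* Simulation only: a caller that believed the function returned a
   16-bit int would keep just the low 16 bits of the real result.
   On a real target the remaining bytes could be whatever happened
   to be left in the return area. */
uint16_t truncated_as_int16(uint32_t full)
{
    return (uint16_t)(full & 0xFFFFu);
}
```

With the prototype visible, read_counter() yields 0x12345678; the simulated mismatched caller sees only 0x5678, which looks fine whenever the upper bytes happen to be zero.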

Good luck!

Harold



2007\08\28@041538 by Ruben Jönsson

Hi Morgan and welcome back,

First, I am not that familiar with the 18F (yet) but I have been through
similar cases with other micros.

1. Try to reduce the code more and more until the problem goes away and then
add things back to see where the problem is.

2. Mostly, when I have had this kind of problem, it has been a growing stack
that has overwritten static structures and variables in memory. This can be
kind of fun since the problem appears to be somewhere quite different from
where it really is.

3. Uninitialized variables that mostly have the right value to start with but
sometimes, depending on prior use of memory, not. (This may be something that
differs between debug and release since debug might initialize allocated memory
but release won't or not to the same value.)

4. Can the code be simulated? Does the simulator show the same symptoms?

5. Does it show up in both debug and release? (does this even exist on 18F
compilers?)

6. If all else fails, a realtime trace would show you exactly what happens. A
real emulator could perhaps be borrowed or rented for a short time. I know, it
costs a lot and takes time to learn how to handle, but it is invaluable in
this kind of situation.

7. Could it be hardware related? Does your code perhaps manipulate hardware
in a way that could be dangerous to your micro, i.e. a momentary state that
would draw a lot of current, causing a short surge on the power supply? No
shielding would help here. Are all pins on the micro operating within absolute
maximum ratings? No current into protection diodes?

8. Overflow in intermediate variables?

9. Static buildup on moving parts, with tiny discharges causing hiccups? (I
think I have this in a propeller clock circuit.)

10. I also agree with Harold that you should make sure that all functions have
the expected signature everywhere - return type, parameter types, parameter
passing methods, big/little endian use...
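
Point 2 above (a stack or buffer growing into other variables) can be checked for with a guard zone. The following is only a sketch under assumptions: guard_zone, GUARD_SIZE and GUARD_PATTERN are hypothetical names, and on a real target the zone would have to be placed by the linker or by an absolute address directly next to the memory being protected, which this portable example does not do.

```c
#include <stddef.h>
#include <stdint.h>

#define GUARD_SIZE    8u
#define GUARD_PATTERN 0xA5u

/* Hypothetical guard zone; placement next to the protected memory
   is target specific and not shown here. */
static volatile uint8_t guard_zone[GUARD_SIZE];

/* Write the known pattern into every guard byte (call once at boot). */
void guard_init(void)
{
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        guard_zone[i] = GUARD_PATTERN;
    }
}

/* Returns 1 while every guard byte still holds the pattern, else 0. */
int guard_intact(void)
{
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (guard_zone[i] != GUARD_PATTERN) {
            return 0;
        }
    }
    return 1;
}

/* Stand-in for a stray write, used only to demonstrate detection. */
void guard_simulate_overwrite(size_t index, uint8_t value)
{
    if (index < GUARD_SIZE) {
        guard_zone[index] = value;
    }
}
```

Calling guard_intact() from the main loop and halting (or logging) the first time it returns 0 narrows down when the corruption happens, even if it does not say who did it.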

Good luck

/Ruben



2007\08\28@133651 by Morgan Olsson


On 2007-08-28 09:11:25, Ruben Jönsson wrote:

> Hi Morgan and welcome back,

Hi Ruben :)

Thank you for the ideas; I will forward them to my programmer guy.
Small notes interspersed below.
/Morgan

> First, I am not that familiar with the 18F (yet) but I have been through
> similar cases with other micros.
>
> 1. Try to reduce the code more and more until the problem goes away and then
> add things back to see where the problem is.

We are trying.  The problem seems to wander...

> 2. Mostly when I have had these kind of problems, it has been the growing
> stack that has overwritten static structures and variables in memory. This
> can be kind of fun since the problem appears to be somewhere quite different
> than it really is.

My colleague (who is the programmer on this) has checked as best he can...

> 3. Uninitialized variables that mostly have the right value to start with but
> sometimes, depending on prior use of memory, not. (This may be something that
> differs between debug and release since debug might initialize allocated
> memory but release won't or not to the same value.)

I'll tell my programmer.

> 4. Can the code be simulated? Does the simulator show the same symptoms?
>
> 5. Does it show up in both debug and release? (does this even exist on 18F
> compilers?)

I am not sure what you mean by debug and release?

We have an SPI-driven UART that we push debug info out through (the real
on-chip UARTs are used in the application).
We also use pins and a 4-bit resistive D/A to track what happens,
plus the ICD2.

> 6. If all else fails, a realtime trace would show you exactly what happens. A
> real emulator could perhaps be borrowed or rented for shorter times. I know,
> it costs a lot and takes time to learn how to handle, but it is invaluable in
> these kind of situations.

We need 40MHz to keep up with communication and stuff.
IIRC real emulators only go to 25MHz.
I have now bought a REAL-ICE, and my programmer is porting to C18 in order
to utilize it.
AFAIK it cannot do a full trace but reports some kind of trace...

> 7. Could it be hardware related? Does your code perhaps manipulate hardware
> that could be dangerous to your micro, ie a momentary state that would draw a
> lot of current causing a short surge on the power supply? No shielding would
> help here.

All are safe.

> Are all pins on the micro operating within absolute maximum ratings?
> No current into protection diodes?

Good points.  Checked.

> 8. Overflow in intermediate variables?

Even the generated assembly looks good for a part we checked extra well.

> 9. Static buildup on moving parts, with tiny discharges causing hiccups?

No way.


2007\08\28@145329 by David VanHorn

Have you checked that your crystal is within the speed specifications
for the part, and actually running at the right amplitude?

Oscillator margin test?

2007\08\28@145626 by David VanHorn

Other thoughts:  I am a little paranoid about uninitted variables, so
I write $00 to all memory at boot, except for a couple locations that
I leave crash state variables in, which must survive a reset.

That way, even if I do have an uninitted variable, it starts off with
a known value.
I suppose if I wanted to take the other track, I could fill all of ram
with random noise, that way it would shake out any problems that
paving to $00 would otherwise hide.
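
The boot-time paving described above could look roughly like this. It is a sketch, not a port for any particular PIC: the array ram stands in for the device's data memory, RAM_SIZE and KEEP_START are made-up values, and the noise variant uses a simple 16-bit LFSR as one possible source of pseudo-random fill.

```c
#include <stddef.h>
#include <stdint.h>

#define RAM_SIZE   64u  /* stand-in size, not the real device's RAM */
#define KEEP_START 60u  /* hypothetical: bytes 60..63 hold crash state */

uint8_t ram[RAM_SIZE];  /* stand-in for the device's data memory */

/* Pave everything below KEEP_START with one value, leaving the
   crash-state bytes untouched so they survive a reset. */
void ram_pave(uint8_t fill)
{
    for (size_t i = 0; i < KEEP_START; i++) {
        ram[i] = fill;
    }
}

/* Alternative: fill with pseudo-random noise from a 16-bit Galois
   LFSR, to shake out code that silently relies on zeroed memory. */
void ram_pave_noise(uint16_t seed)
{
    uint16_t lfsr = (seed != 0u) ? seed : 0xACE1u;  /* state must be nonzero */

    for (size_t i = 0; i < KEEP_START; i++) {
        uint16_t lsb = (uint16_t)(lfsr & 1u);
        lfsr >>= 1;
        if (lsb) {
            lfsr ^= 0xB400u;
        }
        ram[i] = (uint8_t)(lfsr & 0xFFu);
    }
}
```

Running the same firmware once with ram_pave(0x00) and once with ram_pave_noise() and comparing behaviour is a cheap way to find out whether any variable depends on its power-up contents.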

2007\08\29@025650 by Morgan Olsson

It does not seem to be uninitialised variables; everything works for a lot
of iterations, then some variable that has already been handled OK
suddenly gets a completely wrong value.

/Morgan


--
Morgan Olsson

2007\08\29@031719 by Morgan Olsson

On 2007-08-28 20:53:28, David VanHorn wrote:

> Have you checked that your crystal is within the speed specifications
> for the part, and actually running at the right amplitude?
>
> Oscillator margin test?

Done all.
Experimented with series resistors, different xtal type, borrowed a fast  
oscilloscope to see.
Large margins.

Also tried going down to an 8MHz xtal (32MHz PLL).
Below that we cannot keep up communication with the rest of the system
(without a lot of rewriting of the program).

Also tried varying voltage and temperature to recommended max and min.

Nothing we have tried in hardware changed the problem.

And we tried a LOT of measures, some are listed here
http://www.ccsinfo.com/forum/viewtopic.php?p=84891#84891

--
Morgan Olsson

2007\08\29@032654 by Richard Prosser

Morgan,
You've probably already checked and I'm no expert on the 18 series
(never used one) but don't they have prioritised interrupts? Could you
be getting a double interrupt & the temporary storage is getting
corrupted in the handler routine? The compiler should look after this
but it may require a #pragma or setup option.

RP


2007\08\29@043841 by Morgan Olsson

On 2007-08-29 09:26:52, Richard Prosser wrote:

> Morgan,
> You've probably already checked and I'm no expert on the 18 series
> (never used one) but don't they have prioritised interrupts?

Yes, and we use that facility.

> Could you be getting a double interrupt

No, I could not find interrupts being re-enabled, and it seems high priority cannot disturb low priority.

> & the temporary storage is getting
> corrupted in the handler routine?

I have checked the generated save/restore sequences for both levels and they seem OK: they save to different areas, and the high priority interrupt - and only that - uses "RETFIE, FAST".

/Morgan


--
Morgan Olsson

2007\08\29@050940 by Dario Greggio

Morgan Olsson wrote:
> I have checked the generated save/restore sequences for both levels
> and they seem OK; saving to different areas, and high priority
> interrupt - and only that - use "RETFIE, FAST".

Hi Morgan, a silly one, but have you checked that no "interrupt errata"
exists for this part, as it does for some 18F parts?

--
Ciao, Dario

2007\08\29@064529 by Michael Rigby-Jones



>-----Original Message-----
>From: Dario Greggio
>Sent: 29 August 2007 10:10
>To: Microcontroller discussion list - Public.
>Subject: Re: Do an evil ghost live in my PIC18FxxJxx ? - using CCS compiler
>
>Morgan Olsson wrote:
>> I have checked the generated save/restore sequences for both levels
>> and they seem OK; saving to different areas, and high priority
>> interrupt - and only that - use "RETFIE, FAST".
>
>Hi Morgan, a silly one, but have you checked that no "interrupt errata"
>does exist on this part, as in some 18F parts?

Even if no errata exists for the fast interrupt, I'd personally try the standard workarounds anyway.  Going by the vast quantity of errata for 18F devices in general, it's entirely possible you are seeing a bug that Microchip hasn't discovered/documented yet.

Regards

Mike

=======================================================================
This e-mail is intended for the person it is addressed to only. The
information contained in it may be confidential and/or protected by
law. If you are not the intended recipient of this message, you must
not make any use of this information, or copy or show it to any
person. Please contact us immediately to tell us that you have
received this e-mail, and return the original to us. Any use,
forwarding, printing or copying of this message is strictly prohibited.
No part of this message can be considered a request for goods or
services.
=======================================================================

2007\08\29@082636 by Morgan Olsson

On 2007-08-29 11:09:36, Dario Greggio wrote:

> Morgan Olsson wrote:
>> I have checked the generated save/restore sequences for both levels
>> and they seem OK; saving to different areas, and high priority
>> interrupt - and only that - use "RETFIE, FAST".
>
> Hi Morgan, a silly one, but have you checked that no "interrupt errata"
> does exist on this part, as in some 18F parts?

I have been thinking that too.
I found two errata, as listed at the bottom of my first post.
Nothing there that affects us.
It was not easy to find the errata, which makes me wonder if there are more of them hiding somewhere...?


--
Morgan Olsson

2007\08\29@084432 by Morgan Olsson

On 2007-08-29 12:40:34, Michael Rigby-Jones wrote:

{Original Message removed}

2007\08\29@085856 by Michael Rigby-Jones



>-----Original Message-----
>From: Morgan Olsson
>Sent: 29 August 2007 13:39
>To: Microcontroller discussion list - Public.
>Subject: Re: Do an evil ghost live in my PIC18FxxJxx ? - using CCS compiler
>
>The compiler generates the save and restore sequences...
>How do we (on C18) tell it not to use the fast return stack
>for any interrupt or subroutine?

I do not use C18, so I can't give a definitive answer, but the following thread on the Microchip forum gives a neat  workaround:
<http://forum.microchip.com/tm.aspx?m=266134&mpage=1&key=disable%2cglobal%2cinterrupts&#266134>


>
>How do we tell it, when disabling interrupts while accessing
>variables, to handle the interrupt disable like on the old PIC16 (I
>guess that is what you meant): disable interrupts and, if they did
>not get disabled, loop to do it again?

I think only the early 16F parts suffered from the interrupt disabling bug; I've not seen any errata for the 18F regarding this.

Regards

Mike


2007\08\29@104427 by Dario Greggio

Morgan Olsson wrote:
> The compiler generates the save and restore sequences... How do we (on
> C18) tell it not to use the fast return stack for any interrupt nor
> subroutine?

Everything is in that thread that Michael suggested.

> How do we tell it to when disabling interrupts while accessing
> variables, it handle interrupt disable like on old PIC16 (i guess
> that is what you meant); disable interrupt and if it did not get
> disabled loop to do it again.

No, this should not be needed any longer on the 18F.
As for "atomicity", you have to take care of it yourself (if I understand
your point correctly).
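
On an 8-bit core a 16-bit variable is read one byte at a time, so an interrupt that updates it between the two byte reads can hand the main code a torn value. One common way to take care of that without disabling interrupts is to read twice and retry until both reads agree. This is a generic sketch; tick_count, isr_tick and read_ticks_consistent are hypothetical names, not anything from the project discussed here.

```c
#include <stdint.h>

/* Hypothetical counter shared between an ISR and the main loop. */
static volatile uint16_t tick_count;

/* What the interrupt handler would do on every timer tick. */
void isr_tick(void)
{
    tick_count++;
}

/* Read the shared counter from main-line code. If an interrupt
   changes it between the two reads, the values differ and the
   loop simply tries again. */
uint16_t read_ticks_consistent(void)
{
    uint16_t first;
    uint16_t second;

    do {
        first = tick_count;
        second = tick_count;
    } while (first != second);

    return first;
}
```

The alternative is to disable interrupts around the access and restore the previous enable state afterwards; the double read is handy when interrupt latency matters more than a few extra cycles in the main loop.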


--
Ciao, Dario

2007\08\29@160756 by Barry Gershenfeld

First of all, I'd have to answer your original question..."Yes, there's a
ghost".  We've all had these.

I would have liked to mention that CCS has a forum, but I see you've
visited it already.

I would like to suggest the same thing that Harold did, about "implicit
forward referencing", but I have to say that although I've been bitten by
this several times, the effect is rather immediate; I didn't have to wait
to start seeing funny results.  I did want to mention that in the scenario
where part of a returned value is actually "random memory values", that if
you depend on that part of the data to happen to be zero, a lot of times it
will be, just by chance.  About the compiler,  I use the CCS compiler and
it is very willing to do this forward reference thing.  I've otherwise had
pretty good luck with the compiler but I have not gone to the 4.x
versions.  You may want to try a 3.x version if it supports your chip.

It looks like you have done due diligence in the hardware area, so we
should look at the firmware carefully.  I do remember some errata on 18F
parts where they'd execute the wrong code under certain circumstances, at
speeds over 25MHz, and in some cases it was speeds over 4 MHz.  I would
like to say that's all fixed now, but I don't recall seeing a statement to
that effect.  But if it's not in your errata then it's probably fixed.

I had a problem not unlike yours several years ago, not with a PIC but in
an H8 system.  There was a task monitor that would periodically call
tasks.  Each task had a count-down timer and a flag that would determine
when it would run.  The problem was that the flag would get corrupted, so
that when the timer got to zero the task would never run.  I wrote all
kinds of code first to discover this condition, and then monitor it.  I
could watch the problem happen.  I even wrote code to detect it and restart
the task.  I never found the cause, though.  There was an interrupt clock
and I always suspected it was the interrupts.   By the way, one technique I
used was to snapshot some variables during the interrupt, and then when the
scheduler was idle, I would print out the values.  This allowed me to
monitor things that are normally difficult to see.  Another technique I
could use was to just shut off the interrupts and print the
information.  In this case the device in question was able to function even
with everything stalled now and then.  In this case I discovered that the
bug would occur more frequently when I was stalling it.  This is one of the
points I want to make--see if you can get it to screw up more.  It makes it
easier to find the bug.
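
The snapshot technique mentioned above might be sketched like this. All names (snapshot_t, isr_take_snapshot, idle_report) and the three sampled fields are invented for illustration; the idea is only that the interrupt copies a few values into a holding area and the idle loop formats them later, when time is not critical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical set of values worth capturing. */
typedef struct {
    uint8_t  mode;
    uint16_t timer;
    uint8_t  flags;
} snapshot_t;

static volatile snapshot_t snap;
static volatile uint8_t snap_ready;   /* 1 = snapshot waiting to be printed */
static char report_buf[48];

/* Called from the interrupt: copy the values, nothing else.
   A pending snapshot is kept until the idle loop has consumed it. */
void isr_take_snapshot(uint8_t mode, uint16_t timer, uint8_t flags)
{
    if (!snap_ready) {
        snap.mode = mode;
        snap.timer = timer;
        snap.flags = flags;
        snap_ready = 1;
    }
}

/* Called when the scheduler is idle: returns a formatted line if a
   snapshot is pending, or NULL if there is nothing to report. */
const char *idle_report(void)
{
    if (!snap_ready) {
        return NULL;
    }
    snprintf(report_buf, sizeof report_buf, "mode=%u timer=%u flags=0x%02X",
             (unsigned)snap.mode, (unsigned)snap.timer, (unsigned)snap.flags);
    snap_ready = 0;
    return report_buf;
}
```

The returned string would then go out over whatever debug channel is available, for example the SPI-driven UART mentioned earlier in the thread.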

The other advice I also agree with.  Remove as much of the code as you can
without disturbing the problem you are trying to find.   This narrows the
field to examine.  Try to get it to fault more often.  And throw at it any
monitoring tricks you can think of.

Oh, and it always turns out to be something stupid.  I can't reveal how I
know this.

Best of luck,
Barry

2007\08\29@205552 by Russell McMahon

A few semi-random thoughts - probably all been covered already or
irrelevant, but just may catalyse a useful thought:


- Assuming persistence of non persistent local variables?

- Non initialisation of variables (mainly only for machine language -
a good compiler won't let this happen)

- Trashing interrupt / subroutine return stacks by any mix of adding or
removing data improperly, mixing subroutine/interrupt returns, overflow
of stack into other variable space, use of stack space by other
variables, stack enters badly behaved memory space (eg some processors
corrupt the last byte at top or bottom of memory due to wrap around
bugs), ... (Compiler should stop many of these).

- Local / global confusions.

- Non re-entrant re-entrant code. Reentrant operations occurring due
to running out of IRQ time and recalling IRQ routines you are already
in.

- Running out of IRQ time (effect varies depending on how it is
handled).

- Occasional variable overflow with truncation when it occurs.

- Occasional variable overflow with corruption of  adjacent memory
when it occurs.

- Flash memory marginally programmed - occasional flippy bit(s)

- *** Body diodes conducting, even very very very slightly *** - rapid
random anything but straight line program operation can occur often /
sometimes / very occasionally / almost never.



I've seen or heard of and/or caused most of these in one form or
another over the years.
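
As an illustration of the "occasional variable overflow with truncation" item in the list above, here is a small sketch with two hypothetical averaging functions. The first keeps its intermediate sum in 8 bits and is only wrong when the inputs happen to add up to more than 255, which is exactly the kind of bug that works for thousands of iterations and then fails.

```c
#include <stdint.h>

/* Naive average: the sum is truncated to 8 bits before dividing. */
uint8_t average_truncating(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);   /* wraps modulo 256 */
    return (uint8_t)(sum / 2u);
}

/* Same calculation with the intermediate widened to 16 bits. */
uint8_t average_widened(uint8_t a, uint8_t b)
{
    uint16_t sum = (uint16_t)((uint16_t)a + (uint16_t)b);
    return (uint8_t)(sum / 2u);
}
```

For inputs 100 and 100 both functions return 100; for 200 and 100 the truncating version returns 22 while the widened one returns 150. Compilers for small micros may use a narrower default int than desktop compilers, so it is worth checking the compiler manual for the width of intermediates.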



           Russell


2007\08\31@050842 by Morgan Olsson

Thank you Russell McMahon and Barry Gershenfeld
All are good points and ideas.

We have already tried moving to 3.x compiler.
Strangely still similar problem.

/Morgan
