Searching \ for 'Re-inventing the SQRT wheel' in subject line. ()
Make payments with PayPal - it's fast, free and secure! Help us get a faster server
FAQ page: www.piclist.com/techref/index.htm?key=inventing+sqrt+wheel
Search entire site for: 'Re-inventing the SQRT wheel'.

Truncated match.
PICList Thread
'Re-inventing the SQRT wheel'
1996\06\05@165226 by Scott Dattalo

face
flavicon
face
A week or so ago Andy W. made the claim:

> I have in front of me a 16-bit square-root routine that takes only
> 126 cycles to execute on a PIC16C5x.

and it requires:

>    RAM:  4 registers, including the two that hold the 16-bit input.
>    ROM:  24 words.

and finally

> Anyway, he hasn't given me permission to publish it, so I can't post
^-- The Author
> the routine here.  However, the knowledge that such a solution exists
> should be enough to prod someone into trying to duplicate it.


Well, I've been prodded. However, I've come up a little short. My solution
requires:

  RAM:  4 registers, including the two that hold the 16-bit input. (same)
  ROM:  28 words.  (4 words too big)

Perhaps someone may see where 4 words can be squeezed out. But, realize that
the slightest change has many ramifications. There are many inconspicuous
dependencies (i.e. tricks).

Note, I have tested this routine over many but not all of the 64k possible
values. I haven't found any errors....


s1      equ     0x20
s2      equ     0x21
N_hi    equ     0x22
N_lo    equ     0x23


;---------------------------------------------------------------
;sqrt
;
; The purpose of this routine is take the square root of a 16 bit
;unsigned integer.
sqrt
       MOVLW   0x40
       MOVWF   s1
       CLRF    s2

L2      ADDWF   s2,W
       SUBWF   N_hi,W          ;W = N_hi-s1-s2

       SKPNC                   ;if N_hi > (s1+s2)
         goto  L3              ;  then replace N_hi with N_hi-s1-s2

   ;C==0.

       BTFSC   s1,6            ;If this is the first time through the loop
         goto  L1              ;then N is < s1+s2

   ;N may still be greater than s1+s2, so we need to check MS bit of N
       BTFSS   N_lo,0          ;If MS bit is zero, then N < s1+s2
         goto  L1

L3
   ;N is greater than s1+s2, so N_hi with N_hi-s1-s2
       MOVWF   N_hi

       RLF     s1,W
       RLF     s1,W
       ADDWF   s2,F            ;s2 = s2 | (s1<<1)
L1

   ;Next, roll N left one bit position. I.e. N = (N<<1) | (N>>15). This puts
the
   ;MS bit into the LS bit position.
       RLF     N_hi,W          ;C = MS bit of N
       RLF     N_lo,F
       RLF     N_hi,F

       CLRC
       RRF     s1,W
       RRF     s1,F            ;s1 >>= 1

       BTFSC   s1,6
         return

       BTFSS   s1,7
         goto  L2

L4
       BTFSS   N_lo,7
         MOVLW 1
       goto    L2



With some hesitation, I'll post the theory too...

The way this routine works is based on the following square root algorithm:
(note, all of the psuedo C code that follows has not been tested. In other
words if you suspect an error then you might be right.)

unsigned char sqrt(unsigned int N)
{
 unsigned int x,j;

 for(j= 1<<7; j<>0; j>>=1)
 {
   x = x + j;
   if( x*x > N)
     x = x - j;
 }

 return(x);
}

In other words, x is built up one bit at a time, starting with the most
significant
bit. Then it is squared and compared to N. This algorithm works quite well for
processors that have a multiply instruction. Unfortunately, the PIC doesn't fall
into that category. However, it is posssible to efficiently multiply a number
by itself (i.e. square it).

For example suppose you had an 8 bit number:

y = a*2^7 + b*2^6 + c*2^5 + d*2^4 + e*2^3 + f*2^2 + g*2^1 + h*2^0

If you square it and collect terms, you get:


y^2 = (a^2+ab)*2^14 + ac*2^13 + (b^2+ad+bc)*2^12 + (ae+bd)*2^11 +
(c^2+af+be+cd)*2^10 +
     (ag+bf+ce)*2^9 + (d^2+ah+bg+cf+de)*2^8 + (bh+cg+df)*2^7 +
(e^2+ch+dg+ef)*2^5 +
     (dh+eg)*2^5 + (f^2+eh+fg)*2^4 + fh*2^3 + (g^2+gh)*2^2 + h^2

There are several things to note in this expression:
1) The bits a-h can only be zero or one. So, a^2 == a.
2) If we are trying to build y iteratively by starting with a, then b, etc. then
all
of the least significant bits are assumed to be zero. Thus most of the terms do
not need
to be computed at each iteration.
3) The bit that is being squared is always multiplied by an even power of two.


The following un-optimized algorithm implements this squared polynomial in the
square
root algorithm:

s1 = 1<<14
s2 = 0
s = 0
do
{
 s = s + s1 + s2
 if (s>N)
   s = s - s1 + s2
 else
   s2 = s2 | s1

 s1 = s1 >> 2
 s2 = s2 >> 1
} while(s1)

return(s2>>8)

And a more optimized version:

s1 = 1<<14
s2 = 0
do
{
 if((s1 + s2) <= N)
 {
   N = N - s2 -s1
   s2 = s2 | (s1<<1)
 }
 s1 = s1 >> 1
 N  = N  << 1
} while(s1 > 2^6)

return(s2>>8)

If you follow the details of this last algorithm, you will notice that all of
the
arithmetic is being performed on bits 8-15 i.e. the most significant bits. The
one
exception is the very last iteration where s1 = 2^7. Another thing you will
notice
is that under certain situations, N will be larger than 2^16. But, it will never
be
larger than 2^17 - 1. This extra overflow bit can be handled with a relatively
simple
trick. Instead of shifting N one position to the left, roll N one position. The
difference is that the most significant bit of N will get moved to the least
significant
position after the roll operation. (A simple shift zeroes the least significant
bit.)


Anyway, maybe this will prod someone else too.


Scott

1996\06\06@003226 by fastfwd

face
flavicon
face
Scott Dattalo <spam_OUTPICLISTTakeThisOuTspamMITVMA.MIT.EDU> wrote:

{Quote hidden}

   Wow, Scott... You've been busy.  Actually, I miscounted both the
   execution time and the ROM required.  The routine actually takes
   27 words (so your 28-word solution isn't so bad), but it runs
   FASTER than I originally said... 113 cycles worst-case, 95
   cycles best-case, and 103 cycles average over the entire range
   [0-65535].

> The way this routine works is based on the following square root
> algorithm:
>
> [pseudo-code deleted]
>
> In other words, x is built up one bit at a time, starting with the
> most significant bit. Then it is squared and compared to N.

   The routine I was referring to uses the same algorithm.

   -Andy

Andrew Warren - .....fastfwdKILLspamspam@spam@ix.netcom.com
Fast Forward Engineering, Vista, California
http://www.geocities.com/SiliconValley/2499

1996\06\06@075221 by Martin Nilsson

picon face
> From:    Scott Dattalo <sdattalospamKILLspamUNIX.SRI.COM>
>
> A week or so ago Andy W. made the claim:
>
> > I have in front of me a 16-bit square-root routine that takes only
> > 126 cycles to execute on a PIC16C5x.

[snip]

> Well, I've been prodded. However, I've come up a little short. My solution
> requires:
>
>    RAM:  4 registers, including the two that hold the 16-bit input. (same)
>    ROM:  28 words.  (4 words too big)

Nice work, Scott! How many cycles is your routine, worst case?

> Perhaps someone may see where 4 words can be squeezed out. But, realize that
> the slightest change has many ramifications. There are many inconspicuous
> dependencies (i.e. tricks).

I believe you could save one instruction as follows (std peephole):

>         BTFSC   s1,6            ;If this is the first time through the loop
>           goto  L1              ;then N is < s1+s2
>
>     ;N may still be greater than s1+s2, so we need to check MS bit of N
>         BTFSS   N_lo,0          ;If MS bit is zero, then N < s1+s2
>           goto  L1

  =>

        BTFSS   s1,6            ;If this is the first time through the loop
          BTFSS   N_lo,0        ;If MS bit is zero, then N < s1+s2
            goto  L1

-- Martin

Martin Nilsson                           http://www.sics.se/~mn/
Swedish Institute of Computer Science    E-mail: .....mnKILLspamspam.....sics.se
Box 1263, S-164 28 Kista                 Fax: +46-8-751-7230
Sweden                                   Tel: +46-8-752-1574

1996\06\06@113416 by Eric Brewer

flavicon
face
A couple of thoughts some to mind:

At 12:49 PM 6/5/96, Scott Dattalo wrote:
{Quote hidden}

Adding in the following documentation (for a little clarity now that
we have this wicked code to add to our libaries!):
;       N_hi N_lo     - Comprise 16 bit number to be rooted
;       S1                      - Temporary
;       S2                      - 8 bit integer sqaure root of N

{Quote hidden}

*>        BTFSC   s1,6            ;If this is the first time through the loop
*>          goto  L1              ;then N is < s1+s2

The lines marked with "*>" above can be changed to:
       BTFSS   s1,6            ;If this is the first time through the loop
                                         ;then N is < s1+s2
We remove the goto and make use of the goto (now) 2 instructions
below. To do so, we flip the bit test.

{Quote hidden}

Moved the following from below:
This saves 6 cycles when we end. Since we don't care what N or S1
contain, no need to shift them!
       BTFSC   s1,6
         return

{Quote hidden}

Move to above. No longer needed.
;;        BTFSC   s1,6
;;          return
{Quote hidden}

[...............shnip..............]
{Quote hidden}

reworking the above code to be:

s1 = 1<<14
s2 = 0
while (1)
{
       if((s1 + s2) <= N)
       {
               N = N - s2 -s1
               s2 = s2 | (s1<<1)
       }

       if (s1 <= 2^7)  // Since low bits of S1 are 0, this is if (s1 == 2^7)
               break;

       // Only do the following if we need to
       // (Save a couple cycles at the end!)
       s1 = s1 >> 1
       N  = N  << 1
}

return(s2>>8)

[...............shnip..............]

>
>Anyway, maybe this will prod someone else too.
>
>
>Scott

The net effect:

* We made the code 1 instruction smaller.
* The timing (worst, average, best) is at least 7 cycles shorter.
       - Always 6 cycles for the end condition
       - Save at least 1 cycle through the loop.
       - Each time we have to loop, we save 1 cycle

cheers,
eric

PS. If I get a chance, I'll do the timing analisys (if no one else does!)

1996\06\09@165233 by Scott Dattalo

face
flavicon
face
Thanks to Eric Brewer and Martin Nilsson for finding a way to shave one more
byte from the sqrt routine. This brings the total word count down to 27, which
is the same as the other one Andy Warren says exists.

Another thanks to Eric for finding a way to save 6 execution cycles at the end
of the routine. Here's the latest version with a few more comments:


s1      equ     0x20
s2      equ     0x21
N_hi    equ     0x22
N_lo    equ     0x23

;---------------------------------------------------------------
;sqrt
;
; The purpose of this routine is take the square root of a 16 bit
;unsigned integer.
;Inputs:  N_hi - High byte of the 16 bit number to be square rooted
;         N_lo - Low byte     "                  "             "
;Outputs: s1   - Temporary variable
;         s2   - 8 bit result returned in here
;
sqrt
       MOVLW   0x40
       MOVWF   s1
       CLRF    s2

L2      ADDWF   s2,W
       SUBWF   N_hi,W          ;W = N_hi-s1-s2

       SKPNC                   ;if N_hi > (s1+s2)
         goto  L3              ;  then replace N_hi with N_hi-s1-s2

       BTFSS   s1,6            ;If this is the first time through the loop
         BTFSS   N_lo,0        ;or If MS bit is zero, then N < s1+s2
           goto  L1

L3
   ;N is greater than s1+s2, so replace N_hi with N_hi-s1-s2
       MOVWF   N_hi

       RLF     s1,W
       RLF     s1,W
       ADDWF   s2,F            ;s2 = s2 | (s1<<1)
L1
       BTFSC   s1,7            ;If s1 has rolled all the way around, we're
done.
         return

   ;Next, roll N left one bit position. I.e. N = (N<<1) | (N>>15). This puts
the
   ;MS bit into the LS bit position.
       RLF     N_hi,W          ;C = MS bit of N
       RLF     N_lo,F
       RLF     N_hi,F

       CLRC
       RRF     s1,W            ;Roll s1 right one bit: s1 = ((s1>>1) | (s1<<7))
& 0xff
       RRF     s1,F            ;

       BTFSS   s1,7            ;If this is the last time through the loop, we
need to
         goto  L2              ;make a tiny exception.

   ;We only get here if this is the last time through the loop. This special
exception
   ;needs to handle a 10 bit addition. Right now, the value of s1 that got
shifted into
   ;W is zero. It should be 1 >> 1, ie a fraction of 1/2. Or, another way to
look at it
   ;is the value of s1 to be subtracted from N is 0x0080. Now, if N_lo is less
than 0x80,
   ;then the subtraction will cause a borrow that must be propagated to N_hi,
i.e. N_hi
   ;must be decremented. Rather than subtracting 1 from N_hi now, we'll instead
add 1 to
   ;s2 and do the subtraction above.

       BTFSS   N_lo,7
         MOVLW 1

       goto    L2



Eric Brewer wrote:

>
> PS. If I get a chance, I'll do the timing analisys (if no one else does!)

I haven't done a full blown timing analysis. However, by adding up cycles I
estimate the worst case condition to be about 175 cycles. This is about 50%
slower than Andy's ghost programmer. But, Andy has assured me that the
original solution is no hoax. (If you think about, it wouldn't be a bad idea
to claim some hoax just to see if someone can meet your claim!)

So this _new_ wheel may not be perfectly round, but it is free.


Scott

1996\06\09@215942 by fastfwd

face
flavicon
face
Scott Dattalo <EraseMEPICLISTspam_OUTspamTakeThisOuTMITVMA.MIT.EDU> wrote:

> I haven't done a full blown timing analysis. However, by adding up
> cycles I estimate the worst case condition to be about 175 cycles.
> This is about 50% slower than Andy's ghost programmer. But, Andy has
> assured me that the original solution is no hoax. (If you think
> about, it wouldn't be a bad idea to claim some hoax just to see if
> someone can meet your claim!)

True; as I said when I first posted an overview of the routine's
performance, the belief that such results were possible might be
enough to prod someone into writing a similarly-good routine.

In this case, however, the solution IS real; in fact, Scott now has a
copy of the routine.  I'm trying to get in touch with the author; if
I'm successful and he agrees, I'll post the routine to the list.

It really IS remarkable...

-Andy

Andrew Warren - fastfwdspamspam_OUTix.netcom.com
Fast Forward Engineering, Vista, California
http://www.geocities.com/SiliconValley/2499

More... (looser matching)
- Last day of these posts
- In 1996 , 1997 only
- Today
- New search...