Truncated match.
PICList
Thread
'Reinventing the SQRT wheel'
1996\06\05@165226
by
Scott Dattalo

A week or so ago Andy W. made the claim:
> I have in front of me a 16bit squareroot routine that takes only
> 126 cycles to execute on a PIC16C5x.
and it requires:
> RAM: 4 registers, including the two that hold the 16bit input.
> ROM: 24 words.
and finally
> Anyway, he hasn't given me permission to publish it, so I can't post
^ The Author
> the routine here. However, the knowledge that such a solution exists
> should be enough to prod someone into trying to duplicate it.
Well, I've been prodded. However, I've come up a little short. My solution
requires:
RAM: 4 registers, including the two that hold the 16bit input. (same)
ROM: 28 words. (4 words too big)
Perhaps someone may see where 4 words can be squeezed out. But, realize that
the slightest change has many ramifications. There are many inconspicuous
dependencies (i.e. tricks).
Note, I have tested this routine over many but not all of the 64k possible
values. I haven't found any errors....
s1 equ 0x20
s2 equ 0x21
N_hi equ 0x22
N_lo equ 0x23
;
;sqrt
;
; The purpose of this routine is take the square root of a 16 bit
;unsigned integer.
sqrt
MOVLW 0x40
MOVWF s1
CLRF s2
L2 ADDWF s2,W
SUBWF N_hi,W ;W = N_his1s2
SKPNC ;if N_hi > (s1+s2)
goto L3 ; then replace N_hi with N_his1s2
;C==0.
BTFSC s1,6 ;If this is the first time through the loop
goto L1 ;then N is < s1+s2
;N may still be greater than s1+s2, so we need to check MS bit of N
BTFSS N_lo,0 ;If MS bit is zero, then N < s1+s2
goto L1
L3
;N is greater than s1+s2, so N_hi with N_his1s2
MOVWF N_hi
RLF s1,W
RLF s1,W
ADDWF s2,F ;s2 = s2  (s1<<1)
L1
;Next, roll N left one bit position. I.e. N = (N<<1)  (N>>15). This puts
the
;MS bit into the LS bit position.
RLF N_hi,W ;C = MS bit of N
RLF N_lo,F
RLF N_hi,F
CLRC
RRF s1,W
RRF s1,F ;s1 >>= 1
BTFSC s1,6
return
BTFSS s1,7
goto L2
L4
BTFSS N_lo,7
MOVLW 1
goto L2
With some hesitation, I'll post the theory too...
The way this routine works is based on the following square root algorithm:
(note, all of the psuedo C code that follows has not been tested. In other
words if you suspect an error then you might be right.)
unsigned char sqrt(unsigned int N)
{
unsigned int x,j;
for(j= 1<<7; j<>0; j>>=1)
{
x = x + j;
if( x*x > N)
x = x  j;
}
return(x);
}
In other words, x is built up one bit at a time, starting with the most
significant
bit. Then it is squared and compared to N. This algorithm works quite well for
processors that have a multiply instruction. Unfortunately, the PIC doesn't fall
into that category. However, it is posssible to efficiently multiply a number
by itself (i.e. square it).
For example suppose you had an 8 bit number:
y = a*2^7 + b*2^6 + c*2^5 + d*2^4 + e*2^3 + f*2^2 + g*2^1 + h*2^0
If you square it and collect terms, you get:
y^2 = (a^2+ab)*2^14 + ac*2^13 + (b^2+ad+bc)*2^12 + (ae+bd)*2^11 +
(c^2+af+be+cd)*2^10 +
(ag+bf+ce)*2^9 + (d^2+ah+bg+cf+de)*2^8 + (bh+cg+df)*2^7 +
(e^2+ch+dg+ef)*2^5 +
(dh+eg)*2^5 + (f^2+eh+fg)*2^4 + fh*2^3 + (g^2+gh)*2^2 + h^2
There are several things to note in this expression:
1) The bits ah can only be zero or one. So, a^2 == a.
2) If we are trying to build y iteratively by starting with a, then b, etc. then
all
of the least significant bits are assumed to be zero. Thus most of the terms do
not need
to be computed at each iteration.
3) The bit that is being squared is always multiplied by an even power of two.
The following unoptimized algorithm implements this squared polynomial in the
square
root algorithm:
s1 = 1<<14
s2 = 0
s = 0
do
{
s = s + s1 + s2
if (s>N)
s = s  s1 + s2
else
s2 = s2  s1
s1 = s1 >> 2
s2 = s2 >> 1
} while(s1)
return(s2>>8)
And a more optimized version:
s1 = 1<<14
s2 = 0
do
{
if((s1 + s2) <= N)
{
N = N  s2 s1
s2 = s2  (s1<<1)
}
s1 = s1 >> 1
N = N << 1
} while(s1 > 2^6)
return(s2>>8)
If you follow the details of this last algorithm, you will notice that all of
the
arithmetic is being performed on bits 815 i.e. the most significant bits. The
one
exception is the very last iteration where s1 = 2^7. Another thing you will
notice
is that under certain situations, N will be larger than 2^16. But, it will never
be
larger than 2^17  1. This extra overflow bit can be handled with a relatively
simple
trick. Instead of shifting N one position to the left, roll N one position. The
difference is that the most significant bit of N will get moved to the least
significant
position after the roll operation. (A simple shift zeroes the least significant
bit.)
Anyway, maybe this will prod someone else too.
Scott
1996\06\06@003226
by
fastfwd

Scott Dattalo <spam_OUTPICLISTTakeThisOuTMITVMA.MIT.EDU> wrote:
{Quote hidden}> A week or so ago Andy W. made the claim:
>
> > I have in front of me a 16bit squareroot routine that takes only
> > 126 cycles to execute on a PIC16C5x.
>
> and it requires:
>
> > RAM: 4 registers, including the two that hold the 16bit
> > input. ROM: 24 words.
>
> and finally
>
> > Anyway, [the author] hasn't given me permission to publish it, so
> > I can't post the routine here. However, the knowledge that such a
> > solution exists should be enough to prod someone into trying to
> > duplicate it.
>
> Well, I've been prodded. However, I've come up a little short. My
> solution requires:
>
> RAM: 4 registers, including the two that hold the 16bit input.
> ROM: 28 words. (4 words too big)
Wow, Scott... You've been busy. Actually, I miscounted both the
execution time and the ROM required. The routine actually takes
27 words (so your 28word solution isn't so bad), but it runs
FASTER than I originally said... 113 cycles worstcase, 95
cycles bestcase, and 103 cycles average over the entire range
[065535].
> The way this routine works is based on the following square root
> algorithm:
>
> [pseudocode deleted]
>
> In other words, x is built up one bit at a time, starting with the
> most significant bit. Then it is squared and compared to N.
The routine I was referring to uses the same algorithm.
Andy
Andrew Warren  .....fastfwdKILLspam@spam@ix.netcom.com
Fast Forward Engineering, Vista, California
http://www.geocities.com/SiliconValley/2499
1996\06\06@075221
by
Martin Nilsson

> From: Scott Dattalo <sdattaloKILLspamUNIX.SRI.COM>
>
> A week or so ago Andy W. made the claim:
>
> > I have in front of me a 16bit squareroot routine that takes only
> > 126 cycles to execute on a PIC16C5x.
[snip]
> Well, I've been prodded. However, I've come up a little short. My solution
> requires:
>
> RAM: 4 registers, including the two that hold the 16bit input. (same)
> ROM: 28 words. (4 words too big)
Nice work, Scott! How many cycles is your routine, worst case?
> Perhaps someone may see where 4 words can be squeezed out. But, realize that
> the slightest change has many ramifications. There are many inconspicuous
> dependencies (i.e. tricks).
I believe you could save one instruction as follows (std peephole):
> BTFSC s1,6 ;If this is the first time through the loop
> goto L1 ;then N is < s1+s2
>
> ;N may still be greater than s1+s2, so we need to check MS bit of N
> BTFSS N_lo,0 ;If MS bit is zero, then N < s1+s2
> goto L1
=>
BTFSS s1,6 ;If this is the first time through the loop
BTFSS N_lo,0 ;If MS bit is zero, then N < s1+s2
goto L1
 Martin
Martin Nilsson http://www.sics.se/~mn/
Swedish Institute of Computer Science Email: .....mnKILLspam.....sics.se
Box 1263, S164 28 Kista Fax: +4687517230
Sweden Tel: +4687521574
1996\06\06@113416
by
Eric Brewer

A couple of thoughts some to mind:
At 12:49 PM 6/5/96, Scott Dattalo wrote:
{Quote hidden}>A week or so ago Andy W. made the claim:
>
>> I have in front of me a 16bit squareroot routine that takes only
>> 126 cycles to execute on a PIC16C5x.
>
>and it requires:
>
>> RAM: 4 registers, including the two that hold the 16bit input.
>> ROM: 24 words.
>
>and finally
>
>> Anyway, he hasn't given me permission to publish it, so I can't post
> ^ The Author
>> the routine here. However, the knowledge that such a solution exists
>> should be enough to prod someone into trying to duplicate it.
>
>
>Well, I've been prodded. However, I've come up a little short. My solution
>requires:
>
> RAM: 4 registers, including the two that hold the 16bit input. (same)
> ROM: 28 words. (4 words too big)
>
>Perhaps someone may see where 4 words can be squeezed out. But, realize that
>the slightest change has many ramifications. There are many inconspicuous
>dependencies (i.e. tricks).
>
>Note, I have tested this routine over many but not all of the 64k possible
>values. I haven't found any errors....
>
>
>s1 equ 0x20
>s2 equ 0x21
>N_hi equ 0x22
>N_lo equ 0x23
>
>
>;
>;sqrt
>;
>; The purpose of this routine is take the square root of a 16 bit
>;unsigned integer.
Adding in the following documentation (for a little clarity now that
we have this wicked code to add to our libaries!):
; N_hi N_lo  Comprise 16 bit number to be rooted
; S1  Temporary
; S2  8 bit integer sqaure root of N
{Quote hidden}>sqrt
> MOVLW 0x40
> MOVWF s1
> CLRF s2
>
>L2 ADDWF s2,W
> SUBWF N_hi,W ;W = N_his1s2
>
> SKPNC ;if N_hi > (s1+s2)
> goto L3 ; then replace N_hi with N_his1s2
>
> ;C==0.
>
*> BTFSC s1,6 ;If this is the first time through the loop
*> goto L1 ;then N is < s1+s2
The lines marked with "*>" above can be changed to:
BTFSS s1,6 ;If this is the first time through the loop
;then N is < s1+s2
We remove the goto and make use of the goto (now) 2 instructions
below. To do so, we flip the bit test.
{Quote hidden}>
> ;N may still be greater than s1+s2, so we need to check MS bit of N
> BTFSS N_lo,0 ;If MS bit is zero, then N < s1+s2
> goto L1
>
>L3
> ;N is greater than s1+s2, so N_hi with N_his1s2
> MOVWF N_hi
>
> RLF s1,W
> RLF s1,W
> ADDWF s2,F ;s2 = s2  (s1<<1)
>L1
Moved the following from below:
This saves 6 cycles when we end. Since we don't care what N or S1
contain, no need to shift them!
BTFSC s1,6
return
{Quote hidden}>
> ;Next, roll N left one bit position. I.e. N = (N<<1)  (N>>15). This puts
> the
> ;MS bit into the LS bit position.
> RLF N_hi,W ;C = MS bit of N
> RLF N_lo,F
> RLF N_hi,F
>
> CLRC
> RRF s1,W
> RRF s1,F ;s1 >>= 1
>
Move to above. No longer needed.
;; BTFSC s1,6
;; return
{Quote hidden}>
> BTFSS s1,7
> goto L2
>
>L4
> BTFSS N_lo,7
> MOVLW 1
> goto L2
>
>
>
[...............shnip..............]
{Quote hidden}>
>s1 = 1<<14
>s2 = 0
>do
>{
> if((s1 + s2) <= N)
> {
> N = N  s2 s1
> s2 = s2  (s1<<1)
> }
> s1 = s1 >> 1
> N = N << 1
>} while(s1 > 2^6)
>
>return(s2>>8)
>
reworking the above code to be:
s1 = 1<<14
s2 = 0
while (1)
{
if((s1 + s2) <= N)
{
N = N  s2 s1
s2 = s2  (s1<<1)
}
if (s1 <= 2^7) // Since low bits of S1 are 0, this is if (s1 == 2^7)
break;
// Only do the following if we need to
// (Save a couple cycles at the end!)
s1 = s1 >> 1
N = N << 1
}
return(s2>>8)
[...............shnip..............]
>
>Anyway, maybe this will prod someone else too.
>
>
>Scott
The net effect:
* We made the code 1 instruction smaller.
* The timing (worst, average, best) is at least 7 cycles shorter.
 Always 6 cycles for the end condition
 Save at least 1 cycle through the loop.
 Each time we have to loop, we save 1 cycle
cheers,
eric
PS. If I get a chance, I'll do the timing analisys (if no one else does!)
1996\06\09@165233
by
Scott Dattalo

Thanks to Eric Brewer and Martin Nilsson for finding a way to shave one more
byte from the sqrt routine. This brings the total word count down to 27, which
is the same as the other one Andy Warren says exists.
Another thanks to Eric for finding a way to save 6 execution cycles at the end
of the routine. Here's the latest version with a few more comments:
s1 equ 0x20
s2 equ 0x21
N_hi equ 0x22
N_lo equ 0x23
;
;sqrt
;
; The purpose of this routine is take the square root of a 16 bit
;unsigned integer.
;Inputs: N_hi  High byte of the 16 bit number to be square rooted
; N_lo  Low byte " " "
;Outputs: s1  Temporary variable
; s2  8 bit result returned in here
;
sqrt
MOVLW 0x40
MOVWF s1
CLRF s2
L2 ADDWF s2,W
SUBWF N_hi,W ;W = N_his1s2
SKPNC ;if N_hi > (s1+s2)
goto L3 ; then replace N_hi with N_his1s2
BTFSS s1,6 ;If this is the first time through the loop
BTFSS N_lo,0 ;or If MS bit is zero, then N < s1+s2
goto L1
L3
;N is greater than s1+s2, so replace N_hi with N_his1s2
MOVWF N_hi
RLF s1,W
RLF s1,W
ADDWF s2,F ;s2 = s2  (s1<<1)
L1
BTFSC s1,7 ;If s1 has rolled all the way around, we're
done.
return
;Next, roll N left one bit position. I.e. N = (N<<1)  (N>>15). This puts
the
;MS bit into the LS bit position.
RLF N_hi,W ;C = MS bit of N
RLF N_lo,F
RLF N_hi,F
CLRC
RRF s1,W ;Roll s1 right one bit: s1 = ((s1>>1)  (s1<<7))
& 0xff
RRF s1,F ;
BTFSS s1,7 ;If this is the last time through the loop, we
need to
goto L2 ;make a tiny exception.
;We only get here if this is the last time through the loop. This special
exception
;needs to handle a 10 bit addition. Right now, the value of s1 that got
shifted into
;W is zero. It should be 1 >> 1, ie a fraction of 1/2. Or, another way to
look at it
;is the value of s1 to be subtracted from N is 0x0080. Now, if N_lo is less
than 0x80,
;then the subtraction will cause a borrow that must be propagated to N_hi,
i.e. N_hi
;must be decremented. Rather than subtracting 1 from N_hi now, we'll instead
add 1 to
;s2 and do the subtraction above.
BTFSS N_lo,7
MOVLW 1
goto L2
Eric Brewer wrote:
>
> PS. If I get a chance, I'll do the timing analisys (if no one else does!)
I haven't done a full blown timing analysis. However, by adding up cycles I
estimate the worst case condition to be about 175 cycles. This is about 50%
slower than Andy's ghost programmer. But, Andy has assured me that the
original solution is no hoax. (If you think about, it wouldn't be a bad idea
to claim some hoax just to see if someone can meet your claim!)
So this _new_ wheel may not be perfectly round, but it is free.
Scott
1996\06\09@215942
by
fastfwd

Scott Dattalo <EraseMEPICLISTspam_OUTTakeThisOuTMITVMA.MIT.EDU> wrote:
> I haven't done a full blown timing analysis. However, by adding up
> cycles I estimate the worst case condition to be about 175 cycles.
> This is about 50% slower than Andy's ghost programmer. But, Andy has
> assured me that the original solution is no hoax. (If you think
> about, it wouldn't be a bad idea to claim some hoax just to see if
> someone can meet your claim!)
True; as I said when I first posted an overview of the routine's
performance, the belief that such results were possible might be
enough to prod someone into writing a similarlygood routine.
In this case, however, the solution IS real; in fact, Scott now has a
copy of the routine. I'm trying to get in touch with the author; if
I'm successful and he agrees, I'll post the routine to the list.
It really IS remarkable...
Andy
Andrew Warren  fastfwdspam_OUTix.netcom.com
Fast Forward Engineering, Vista, California
http://www.geocities.com/SiliconValley/2499
More... (looser matching)
 Last day of these posts
 In 1996
, 1997 only
 Today
 New search...