Searching \ for 'multiply routine optimized for speed' in subject line. ()
Make payments with PayPal - it's fast, free and secure! Help us get a faster server
FAQ page: www.piclist.com/techref/method/math.htm?key=multiply
Search entire site for: 'multiply routine optimized for speed'.

Truncated match.
PICList Thread
'multiply routine optimized for speed'
1998\01\31@213659 by Amey Deosthali

flavicon
face
Hi all,

The 8x8 multiply routine given in application note AN526 by Microchip is
optimized for speed. Still it takes a stagerring 44 instructions in the
worst case. Is this the best routine  for multiplication? Is there any
other routine available, requiring less number of instructions?

Amey
-------------------------------------------------------------------------------
                       AMEY A. DEOSTHALI
-------------------------------------------------------------------------------
Lab Address:                              Home Address:
ESPL, ENS526E                             3110 Red River, Apt # 208
Dept. of ECE                              Austin, TX 78705
University of Texas at Austin,            Tel:(512)-499-8957
Austin, TX 78712 - 1084.
Tel:(512)-232-2769
               Email:spam_OUTameyTakeThisOuTspamvision.ece.utexas.edu
               Web  :http://anchovy.ece.utexas.edu/~amey
-------------------------------------------------------------------------------

1998\01\31@222130 by Andrew Warren

face
flavicon
face
Amey Deosthali <.....ameyKILLspamspam@spam@vision.ece.utexas.edu> wrote:

> The 8x8 multiply routine given in application note AN526 by
> Microchip is optimized for speed. Still it takes a stagerring 44
> instructions in the worst case. Is this the best routine  for
> multiplication?

   That depends on your definition of "best", Amey.

> Is there any other routine available, requiring less number of
> instructions?

   Of course.  Try something like this:

       ; Enter with multiplier in W-Reg, multiplicand in "PRODLO".
       ; Exits with product in PRODHI:PRODLO.

       MPY8X8:

           CLRF    PRODHI

           CLRF    COUNT
           BSF     COUNT,3

           RRF     PRODLO

       LOOP:

           SKPNC
           ADDWF   PRODHI

           RRF     PRODHI
           RRF     PRODLO

           DECFSZ  COUNT
           GOTO    LOOP

   -Andy

   P.S.  As usual with code that I write online, this routine hasn't
         even been assembled, let alone tested.  Still, it's pretty
         simple; I'd be surprised if it didn't work.

=== Andrew Warren - fastfwdspamKILLspamix.netcom.com
=== Fast Forward Engineering - Vista, California
=== http://www.geocities.com/SiliconValley/2499

1998\01\31@231201 by John Payson

picon face
> > The 8x8 multiply routine given in application note AN526 by
> > Microchip is optimized for speed. Still it takes a stagerring 44
> > instructions in the worst case. Is this the best routine  for
> > multiplication?
>
>     That depends on your definition of "best", Amey.
>
> > Is there any other routine available, requiring less number of
> > instructions?
>
>     Of course.  Try something like this:
>
>         ; Enter with multiplier in W-Reg, multiplicand in "PRODLO".
>         ; Exits with product in PRODHI:PRODLO.

No offense, but I counted 7 cycles per iteration in a loop that gets
executed 8 times (actually, only 6 cycles for the last iteration).
Consequently, as written, the code was slower than 44 cycles.

On the other hand, if you unroll that loop the code will run a lot more
quickly.  One thing I personally like to do in my multiply routines, btw,
is to write them so that the value sitting in the MSB isn't cleared before
the operation, but is instead added to the LSB of the result.  This makes
computation of longer products convenient.

1998\01\31@233526 by Andrew Warren

face
flavicon
face
John Payson <.....PICLISTKILLspamspam.....MITVMA.MIT.EDU> wrote:

> No offense, but I counted 7 cycles per iteration in a loop that gets
> executed 8 times (actually, only 6 cycles for the last iteration).
> Consequently, as written, the code was slower than 44 cycles.

   No offense taken, John... The object, I thought, was to write a
   relatively-quick multiply routine that took less than 44
   INSTRUCTIONS.  For maximum speed, inline code will OF COURSE run
   faster than looped code.

   -Andy

=== Andrew Warren - EraseMEfastfwdspam_OUTspamTakeThisOuTix.netcom.com
=== Fast Forward Engineering - Vista, California
=== http://www.geocities.com/SiliconValley/2499


'multiply routine optimized for speed'
1998\02\01@015426 by Amey Deosthali
flavicon
face
> No offense, but I counted 7 cycles per iteration in a loop that gets
> executed 8 times (actually, only 6 cycles for the last iteration).
> Consequently, as written, the code was slower than 44 cycles.
>
> On the other hand, if you unroll that loop the code will run a lot more
> quickly.  One thing I personally like to do in my multiply routines, btw,
> is to write them so that the value sitting in the MSB isn't cleared before
> the operation, but is instead added to the LSB of the result.  This makes
> computation of longer products convenient.

Thanks Andy for providing the routine. But, yes the routine does take more
instructions than 44 instructions. The routine with 44 instructions does
exactly what John suggested. It unfolds the loops, that's all. Microchip
have also provided a routine with loops (having code efficiency rather
than time). I think Andy's code is pretty much similar to that.

I am implementing a filter. So what I will do is try to adjust the
coefficents to be as close to 2's power as possible. Then multiplications
can be implemented as shifts. I have two multiplications in the filter. So
even non-looped code will take 88 instructions (at worst)!!!

Amey

1998\02\01@025057 by John Payson

picon face
> John Payson <PICLISTspamspam_OUTMITVMA.MIT.EDU> wrote:
>
> > No offense, but I counted 7 cycles per iteration in a loop that gets
> > executed 8 times (actually, only 6 cycles for the last iteration).
> > Consequently, as written, the code was slower than 44 cycles.
>
>     No offense taken, John... The object, I thought, was to write a
>     relatively-quick multiply routine that took less than 44
>     INSTRUCTIONS.  For maximum speed, inline code will OF COURSE run
>     faster than looped code.

[bonks head] Instructions, cycles, code words, all the same thing... [mumble
mumble]  Yeah, I should have read the original question better.  I guess
I saw the "worst case" wording and thought that referred to execution time.
Perhaps in some applications, 44 instruction words may be a big deal, but
for a multiply routine that need only be coded once even if it's called
dozens of times I think the speed would be more important than saving a
few words (in many cases, I admit, neither will be important).

1998\02\01@144613 by Andrew Warren

face
flavicon
face
Amey Deosthali <@spam@ameyKILLspamspamvision.ece.utexas.edu> wrote:

> Microchip have also provided a routine with loops (having code
> efficiency rather than time). I think Andy's code is pretty much
> similar to that.

   Yes, my looped code was similar to Microchip's.  However, mine
   runs faster (its execution time is about midway between
   Microchip's looped and unrolled routines) and is shorter by a
   couple of bytes.

   I posted it in order to show that you could write a short
   routine that performed almost as well as Microchip's
   44-instruction unlooped version; I guess I didn't realize that
   you were looking for a routine that was both shorter AND FASTER
   than Microchip's unrolled code.

> I am implementing a filter. So what I will do is try to adjust the
> coefficents to be as close to 2's power as possible. Then
> multiplications can be implemented as shifts. I have two
> multiplications in the filter. So even non-looped code will take 88
> instructions (at worst)!!!

   If your coefficients are constants, you can write inline code
   that's MUCH faster (and shorter) than a generic two-variable
   multiply, even if the constant isn't a power of two... Start
   with Microchip's unrolled code, then delete all the BTFSCs.
   Next, remove all the ADDWFs except those which correspond to "1"
   bits in the constant multiplier.

   To multiply by 164, for instance, you'd do this:

       CLRF    H_BYTE
       CLRF    L_BYTE

       MOVF    MULCND,W
       CLRC

       RRF     H_BYTE
       RRF     L_BYTE

       RRF     H_BYTE
       RRF     L_BYTE

       ADDWF   H_BYTE
       RRF     H_BYTE
       RRF     L_BYTE

       RRF     H_BYTE
       RRF     L_BYTE

       RRF     H_BYTE
       RRF     L_BYTE

       ADDWF   H_BYTE
       RRF     H_BYTE
       RRF     L_BYTE

       RRF     H_BYTE
       RRF     L_BYTE

       ADDWF   H_BYTE
       RRF     H_BYTE
       RRF     L_BYTE

   You'll need a separate routine for each constant, of course.

   -Andy

   P.S.  By the way, I just looked at Microchip's appnotes and I
         don't SEE any 44-instruction 8x8 multiplies... Their
         unrolled version seems to take only 36 instructions.  Am I
         missing something?

=== Andrew Warren - KILLspamfastfwdKILLspamspamix.netcom.com
=== Fast Forward Engineering - Vista, California
=== http://www.geocities.com/SiliconValley/2499

1998\02\01@151309 by John Payson

picon face
>     If your coefficients are constants, you can write inline code
>     that's MUCH faster (and shorter) than a generic two-variable
>     multiply, even if the constant isn't a power of two... Start
>     with Microchip's unrolled code, then delete all the BTFSCs.
>     Next, remove all the ADDWFs except those which correspond to "1"
>     bits in the constant multiplier.

You can often do even better than this by using "Booth's" algorithm; it's
not worth the effort if the coefficient isn't constant, but if it is
constant all the extra decision-making can be done by the programmer
before the program is compiled/run.

Although 164 doesn't benefit from Booth's algorithm, other constants do.
The trick is to rewrite strings of 3 or more "1"'s as a higher "1" and a
"minus one".  For example, to multiply by 159 ($9F) rather than regarding
that number as 128+16+8+4+2+1, it's much faster to regard it as 128+32-1;
then instead of six "add"s you end up with two "add"s and a subtract.  I'm
not quite sure what carry ends up doing; you need to be a little careful
about that.  Nonetheless, it is for some cases a very useful optimization
to keep in mind.

1998\02\01@153425 by Andrew Warren

face
flavicon
face
John Payson <RemoveMEPICLISTTakeThisOuTspamMITVMA.MIT.EDU> wrote:

> You can often do even better than this by using "Booth's" algorithm
> ....
> The trick is to rewrite strings of 3 or more "1"'s as a higher "1"
> and a "minus one".  For example, to multiply by 159 ($9F) rather
> than regarding that number as 128+16+8+4+2+1, it's much faster to
> regard it as 128+32-1; then instead of six "add"s you end up with
> two "add"s and a subtract.

   I'm not so sure that it's faster, John.  The "standard" method,
   even with 6 adds, takes 26 cycles (assuming that, on entry, the
   multiplier and multiplicand are in registers and the two-byte
   product holds an unknown value)... Care to write a "Booth's
   algorithm" version that demonstrates the advantage?

   -Andy

=== Andrew Warren - spamBeGonefastfwdspamBeGonespamix.netcom.com
=== Fast Forward Engineering - Vista, California
=== http://www.geocities.com/SiliconValley/2499

More... (looser matching)
- Last day of these posts
- In 1998 , 1999 only
- Today
- New search...