Truncated match.
PICList
Thread
'multiply routine optimized for speed'
1998\01\31@213659
by
Amey Deosthali
Hi all,
The 8x8 multiply routine given in application note AN526 by Microchip is
optimized for speed. Still it takes a stagerring 44 instructions in the
worst case. Is this the best routine for multiplication? Is there any
other routine available, requiring less number of instructions?
Amey

AMEY A. DEOSTHALI

Lab Address: Home Address:
ESPL, ENS526E 3110 Red River, Apt # 208
Dept. of ECE Austin, TX 78705
University of Texas at Austin, Tel:(512)4998957
Austin, TX 78712  1084.
Tel:(512)2322769
Email:spam_OUTameyTakeThisOuTvision.ece.utexas.edu
Web :http://anchovy.ece.utexas.edu/~amey

1998\01\31@222130
by
Andrew Warren

Amey Deosthali <.....ameyKILLspam@spam@vision.ece.utexas.edu> wrote:
> The 8x8 multiply routine given in application note AN526 by
> Microchip is optimized for speed. Still it takes a stagerring 44
> instructions in the worst case. Is this the best routine for
> multiplication?
That depends on your definition of "best", Amey.
> Is there any other routine available, requiring less number of
> instructions?
Of course. Try something like this:
; Enter with multiplier in WReg, multiplicand in "PRODLO".
; Exits with product in PRODHI:PRODLO.
MPY8X8:
CLRF PRODHI
CLRF COUNT
BSF COUNT,3
RRF PRODLO
LOOP:
SKPNC
ADDWF PRODHI
RRF PRODHI
RRF PRODLO
DECFSZ COUNT
GOTO LOOP
Andy
P.S. As usual with code that I write online, this routine hasn't
even been assembled, let alone tested. Still, it's pretty
simple; I'd be surprised if it didn't work.
=== Andrew Warren  fastfwdKILLspamix.netcom.com
=== Fast Forward Engineering  Vista, California
=== http://www.geocities.com/SiliconValley/2499
1998\01\31@231201
by
John Payson

> > The 8x8 multiply routine given in application note AN526 by
> > Microchip is optimized for speed. Still it takes a stagerring 44
> > instructions in the worst case. Is this the best routine for
> > multiplication?
>
> That depends on your definition of "best", Amey.
>
> > Is there any other routine available, requiring less number of
> > instructions?
>
> Of course. Try something like this:
>
> ; Enter with multiplier in WReg, multiplicand in "PRODLO".
> ; Exits with product in PRODHI:PRODLO.
No offense, but I counted 7 cycles per iteration in a loop that gets
executed 8 times (actually, only 6 cycles for the last iteration).
Consequently, as written, the code was slower than 44 cycles.
On the other hand, if you unroll that loop the code will run a lot more
quickly. One thing I personally like to do in my multiply routines, btw,
is to write them so that the value sitting in the MSB isn't cleared before
the operation, but is instead added to the LSB of the result. This makes
computation of longer products convenient.
1998\01\31@233526
by
Andrew Warren
John Payson <.....PICLISTKILLspam.....MITVMA.MIT.EDU> wrote:
> No offense, but I counted 7 cycles per iteration in a loop that gets
> executed 8 times (actually, only 6 cycles for the last iteration).
> Consequently, as written, the code was slower than 44 cycles.
No offense taken, John... The object, I thought, was to write a
relativelyquick multiply routine that took less than 44
INSTRUCTIONS. For maximum speed, inline code will OF COURSE run
faster than looped code.
Andy
=== Andrew Warren  EraseMEfastfwdspam_OUTTakeThisOuTix.netcom.com
=== Fast Forward Engineering  Vista, California
=== http://www.geocities.com/SiliconValley/2499
'multiply routine optimized for speed'
1998\02\01@015426
by
Amey Deosthali

> No offense, but I counted 7 cycles per iteration in a loop that gets
> executed 8 times (actually, only 6 cycles for the last iteration).
> Consequently, as written, the code was slower than 44 cycles.
>
> On the other hand, if you unroll that loop the code will run a lot more
> quickly. One thing I personally like to do in my multiply routines, btw,
> is to write them so that the value sitting in the MSB isn't cleared before
> the operation, but is instead added to the LSB of the result. This makes
> computation of longer products convenient.
Thanks Andy for providing the routine. But, yes the routine does take more
instructions than 44 instructions. The routine with 44 instructions does
exactly what John suggested. It unfolds the loops, that's all. Microchip
have also provided a routine with loops (having code efficiency rather
than time). I think Andy's code is pretty much similar to that.
I am implementing a filter. So what I will do is try to adjust the
coefficents to be as close to 2's power as possible. Then multiplications
can be implemented as shifts. I have two multiplications in the filter. So
even nonlooped code will take 88 instructions (at worst)!!!
Amey
1998\02\01@025057
by
John Payson

> John Payson <PICLISTspam_OUTMITVMA.MIT.EDU> wrote:
>
> > No offense, but I counted 7 cycles per iteration in a loop that gets
> > executed 8 times (actually, only 6 cycles for the last iteration).
> > Consequently, as written, the code was slower than 44 cycles.
>
> No offense taken, John... The object, I thought, was to write a
> relativelyquick multiply routine that took less than 44
> INSTRUCTIONS. For maximum speed, inline code will OF COURSE run
> faster than looped code.
[bonks head] Instructions, cycles, code words, all the same thing... [mumble
mumble] Yeah, I should have read the original question better. I guess
I saw the "worst case" wording and thought that referred to execution time.
Perhaps in some applications, 44 instruction words may be a big deal, but
for a multiply routine that need only be coded once even if it's called
dozens of times I think the speed would be more important than saving a
few words (in many cases, I admit, neither will be important).
1998\02\01@144613
by
Andrew Warren

Amey Deosthali <@spam@ameyKILLspamvision.ece.utexas.edu> wrote:
> Microchip have also provided a routine with loops (having code
> efficiency rather than time). I think Andy's code is pretty much
> similar to that.
Yes, my looped code was similar to Microchip's. However, mine
runs faster (its execution time is about midway between
Microchip's looped and unrolled routines) and is shorter by a
couple of bytes.
I posted it in order to show that you could write a short
routine that performed almost as well as Microchip's
44instruction unlooped version; I guess I didn't realize that
you were looking for a routine that was both shorter AND FASTER
than Microchip's unrolled code.
> I am implementing a filter. So what I will do is try to adjust the
> coefficents to be as close to 2's power as possible. Then
> multiplications can be implemented as shifts. I have two
> multiplications in the filter. So even nonlooped code will take 88
> instructions (at worst)!!!
If your coefficients are constants, you can write inline code
that's MUCH faster (and shorter) than a generic twovariable
multiply, even if the constant isn't a power of two... Start
with Microchip's unrolled code, then delete all the BTFSCs.
Next, remove all the ADDWFs except those which correspond to "1"
bits in the constant multiplier.
To multiply by 164, for instance, you'd do this:
CLRF H_BYTE
CLRF L_BYTE
MOVF MULCND,W
CLRC
RRF H_BYTE
RRF L_BYTE
RRF H_BYTE
RRF L_BYTE
ADDWF H_BYTE
RRF H_BYTE
RRF L_BYTE
RRF H_BYTE
RRF L_BYTE
RRF H_BYTE
RRF L_BYTE
ADDWF H_BYTE
RRF H_BYTE
RRF L_BYTE
RRF H_BYTE
RRF L_BYTE
ADDWF H_BYTE
RRF H_BYTE
RRF L_BYTE
You'll need a separate routine for each constant, of course.
Andy
P.S. By the way, I just looked at Microchip's appnotes and I
don't SEE any 44instruction 8x8 multiplies... Their
unrolled version seems to take only 36 instructions. Am I
missing something?
=== Andrew Warren  KILLspamfastfwdKILLspamix.netcom.com
=== Fast Forward Engineering  Vista, California
=== http://www.geocities.com/SiliconValley/2499
1998\02\01@151309
by
John Payson

> If your coefficients are constants, you can write inline code
> that's MUCH faster (and shorter) than a generic twovariable
> multiply, even if the constant isn't a power of two... Start
> with Microchip's unrolled code, then delete all the BTFSCs.
> Next, remove all the ADDWFs except those which correspond to "1"
> bits in the constant multiplier.
You can often do even better than this by using "Booth's" algorithm; it's
not worth the effort if the coefficient isn't constant, but if it is
constant all the extra decisionmaking can be done by the programmer
before the program is compiled/run.
Although 164 doesn't benefit from Booth's algorithm, other constants do.
The trick is to rewrite strings of 3 or more "1"'s as a higher "1" and a
"minus one". For example, to multiply by 159 ($9F) rather than regarding
that number as 128+16+8+4+2+1, it's much faster to regard it as 128+321;
then instead of six "add"s you end up with two "add"s and a subtract. I'm
not quite sure what carry ends up doing; you need to be a little careful
about that. Nonetheless, it is for some cases a very useful optimization
to keep in mind.
1998\02\01@153425
by
Andrew Warren
John Payson <RemoveMEPICLISTTakeThisOuTMITVMA.MIT.EDU> wrote:
> You can often do even better than this by using "Booth's" algorithm
> ....
> The trick is to rewrite strings of 3 or more "1"'s as a higher "1"
> and a "minus one". For example, to multiply by 159 ($9F) rather
> than regarding that number as 128+16+8+4+2+1, it's much faster to
> regard it as 128+321; then instead of six "add"s you end up with
> two "add"s and a subtract.
I'm not so sure that it's faster, John. The "standard" method,
even with 6 adds, takes 26 cycles (assuming that, on entry, the
multiplier and multiplicand are in registers and the twobyte
product holds an unknown value)... Care to write a "Booth's
algorithm" version that demonstrates the advantage?
Andy
=== Andrew Warren  spamBeGonefastfwdspamBeGoneix.netcom.com
=== Fast Forward Engineering  Vista, California
=== http://www.geocities.com/SiliconValley/2499
More... (looser matching)
 Last day of these posts
 In 1998
, 1999 only
 Today
 New search...