; This Source Code Form is subject to the terms of the Mozilla Public
; License, v. 2.0. If a copy of the MPL was not distributed with this
; file, You can obtain one at http://mozilla.org/MPL/2.0/.
; This performs a multiple-precision integer version of "daxpy",
; using the selected addressing direction. "Little-wordian" means that
; the least significant word of a number is stored at the lowest address.
; "Big-wordian" means that the most significant word is at the lowest
; address. Either way, the incoming address of the vector is that
; of the least significant word. That means that, for little-wordian
; addressing, we move the address upward as we propagate carries
; from the least significant word to the most significant. For
; big-wordian we move the address downward.
; We use the following registers:
;
; r2 return PC, of course
; r26 = arg1 = length
; r25 = arg2 = address of scalar
; r24 = arg3 = multiplicand vector
; r23 = arg4 = result vector
;
; fr9 = scalar loaded once only from r25
; The cycle counts shown in the bodies below are simply the result of a
; scheduling by hand. The actual PCX-U hardware does it differently.
; The intention is that the overall speed is the same.
; The pipeline startup and shutdown code is constructed in the usual way,
; by taking the loop bodies and removing unnecessary instructions.
; We have left the comments describing cycle numbers in the code.
; These are intended for reference when comparing with the main loop,
; and have no particular relationship to actual cycle numbers.
LDO SIXTEEN(%r23),%r23
ADD %r3,%r22,%r1
$JOIN1 ADD,DC %r0,%r0,%r21
CMPIB,*= 0,%r21,$L0 ; if no overflow, exit
STD %r1,UN_SIXTEEN(%r23)
; Final carry propagation
$FINAL1 LDO EIGHT(%r23),%r23
LDD UN_SIXTEEN(%r23),%r21
ADDI 1,%r21,%r21
CMPIB,*= 0,%r21,$FINAL1 ; Keep looping if there is a carry.
STD %r21,UN_SIXTEEN(%r23)
B $L0
NOP
; Here is the code that handles the difficult cases N=1, N=2, and N=3.
; We do the usual trick -- branch out of the startup code at appropriate
; points, and branch into the shutdown code.
; We came out of the unrolled loop with wrong parity. Do one more
; single cycle. This is quite tricky, because of the way the
; carry chains and SHRPD chains have been chopped up.
$FDIAG2
LDO EIGHT(%r24),%r24
LDD UN_EIGHT(%r24),%r26
ADDI 1,%r26,%r26
CMPIB,*= 0,%r26,$FDIAG2 ; Keep looping if there is a carry.
STD %r26,UN_EIGHT(%r24)
B $Z0
NOP
; Here is the code that handles the difficult case N=1.
; We do the usual trick -- branch out of the startup code at appropriate
; points, and branch into the shutdown code.
$DIAG_N_IS_ONE
LDD -88(%sp),%r22
LDD -72(%sp),%r31
B $JOINDIAG
LDD -96(%sp),%r20
; We came out of the unrolled loop with wrong parity. Do one more
; single cycle. This is the "alternate body". It will, of course,
; give us opposite registers from the other case, so we need
; completely different shutdown code.
.SPACE $TEXT$
.SUBSPA $CODE$
#ifdef LITTLE_WORDIAN
#ifdef __GNUC__
; GNU-as (as of 2.19) does not support LONG_RETURN
.EXPORT maxpy_little,ENTRY,PRIV_LEV=3,ARGW0=GR,ARGW1=GR,ARGW2=GR,ARGW3=GR
.EXPORT add_diag_little,ENTRY,PRIV_LEV=3,ARGW0=GR,ARGW1=GR,ARGW2=GR
#else
.EXPORT maxpy_little,ENTRY,PRIV_LEV=3,ARGW0=GR,ARGW1=GR,ARGW2=GR,ARGW3=GR,LONG_RETURN
.EXPORT add_diag_little,ENTRY,PRIV_LEV=3,ARGW0=GR,ARGW1=GR,ARGW2=GR,LONG_RETURN
#endif
#else
.EXPORT maxpy_big,ENTRY,PRIV_LEV=3,ARGW0=GR,ARGW1=GR,ARGW2=GR,ARGW3=GR,LONG_RETURN
.EXPORT add_diag_big,ENTRY,PRIV_LEV=3,ARGW0=GR,ARGW1=GR,ARGW2=GR,LONG_RETURN
#endif
.END
; How to use "maxpy_PA20_little" and "maxpy_PA20_big"
;
; The routine "maxpy_PA20_little" or "maxpy_PA20_big" (exported
; from this file as "maxpy_little" or "maxpy_big")
; performs a 64-bit x any-size multiply, and adds the
; result to an area of memory. That is, it performs
; something like
;
;             A B C D
;         *         Z
;           _________
;           P Q R S T
;
; and then adds the "PQRST" vector into an area of memory,
; handling all carries.
;
; Digression on nomenclature and endian-ness:
;
; Each of the capital letters in the above represents a 64-bit
; quantity. That is, you could think of the discussion as
; being in terms of radix-2^64 (about 18 quintillion) arithmetic.
; The data type being manipulated is "unsigned long long int". This
; requires the 64-bit extension of the HP-UX C compiler,
; available at release 10. You need these compiler flags to
; enable these extensions:
;
; -Aa +e +DA2.0 +DS2.0
;
; (The first specifies ANSI C, the second enables the
; extensions, which are beyond ANSI C, and the third and
; fourth tell the compiler to use whatever features of the
; PA2.0 architecture it wishes, in order to make the code more
; efficient. Since the presence of the assembly code will
; make the program unable to run on anything less than PA2.0,
; you might as well gain the performance enhancements in the C
; code as well.)
;
; Questions of "endian-ness" often come up, usually in the
; context of byte ordering in a word. These routines have a
; similar issue, that could be called "wordian-ness".
; Independent of byte ordering (PA is always big-endian), one
; can make two choices when representing extremely large
; numbers as arrays of 64-bit doublewords in memory.
;
; "Little-wordian" layout means that the least significant
; word of a number is stored at the lowest address.
;
;        MSW                 LSW
;         |                   |
;         V                   V
;
;         A    B    C    D    E
;
;         ^              ^    ^
;         |              |    |____ address 0
;         |              |
;         |              |_______address 8
;         |
;     address 32
;
; "Big-wordian" means that the most significant word is at the
; lowest address.
;
;        MSW                 LSW
;         |                   |
;         V                   V
;
;         A    B    C    D    E
;
;         ^              ^    ^
;         |              |    |____ address 32
;         |              |
;         |              |_______address 24
;         |
;     address 0
;
; When you compile the file, you must specify one or the other, with
; a switch "-DLITTLE_WORDIAN" or "-DBIG_WORDIAN".
;
; Incidentally, you assemble this file as part of your
; project with the same C compiler as the rest of the program.
; My "makefile" for a superprecision arithmetic package has
; the following stuff:
;
; # definitions:
; CC = cc -Aa +e -z +DA2.0 +DS2.0 +w1
; CFLAGS = +O3
; LDFLAGS = -L /usr/lib -Wl,-aarchive
;
; # general build rule for ".s" files:
; .s.o:
;         $(CC) $(CFLAGS) -c $< -DBIG_WORDIAN
;
; # Now any bind step that calls for pa20.o will assemble pa20.s
;
; End of digression, back to arithmetic:
;
; The way we multiply two huge numbers is, of course, to multiply
; the "ABCD" vector by each of the "WXYZ" doublewords, adding
; the result vectors with increasing offsets, the way we learned
; in school, back before we all used calculators:
;
;             A B C D
;           * W X Y Z
;           _________
;           P Q R S T
;         E F G H I
;       M N O P Q
;   + R S T U V
;     _______________
;     F I N A L S U M
;
; So we call maxpy_PA20_big (in my case; my package is
; big-wordian) repeatedly, giving the W, X, Y, and Z arguments
; in turn as the "scalar", and giving the "ABCD" vector each
; time. We direct it to add its result into an area of memory
; that we have cleared at the start. We skew the exact
; location into that area with each call.
;
; The prototype for the function is
;
; extern void maxpy_PA20_big(
;     int length,                        /* Number of doublewords in the multiplicand vector. */
;     const long long int *scalaraddr,   /* Address to fetch the scalar. */
;     const long long int *multiplicand, /* The multiplicand vector. */
;     long long int *result);            /* Where to accumulate the result. */
;
; (You should place a copy of this prototype in an include file
; or in your C file.)
;
; Now, IN ALL CASES, the given address for the multiplicand or
; the result is that of the LEAST SIGNIFICANT DOUBLEWORD.
; That word is, of course, the word at which the routine
; starts processing. "maxpy_PA20_little" then increases the
; addresses as it computes. "maxpy_PA20_big" decreases them.
;
; In our example above, "length" would be 4 in each case.
; "multiplicand" would be the "ABCD" vector, specifically
; the address of the element "D". "scalaraddr" would be the
; address of "W", "X", "Y", or "Z" on the four calls that we
; would make. (The order doesn't matter, of course.)
; "result" would be the appropriate address in the result
; area. When multiplying by "Z", that would be the least
; significant word. When multiplying by "Y", it would be the
; next higher word (8 bytes higher if little-wordian; 8 bytes
; lower if big-wordian), and so on. The result area must be
; as large as the sum of the sizes of the multiplicand
; and multiplier vectors, and must be initialized to zero
; before we start.
;
; Whenever the routine adds its partial product into the result
; vector, it follows carry chains as far as they need to go.
;
; Here is the super-precision multiply routine that I use for
; my package. The package is big-wordian. I have taken out
; handling of exponents (it's a floating point package):
;
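; static void mul_PA20(
;     int size,
;     const long long int *arg1,
;     const long long int *arg2,
;     long long int *result)
; {
;     int i;
;
;     /* Clear the double-length result area. */
;     for (i=0 ; i<2*size ; i++) result[i] = 0ULL;
;
;     /* One call per multiplier doubleword. Big-wordian: the least
;        significant doubleword of arg1 is arg1[size-1], and the partial
;        product for arg2[i] starts at result[size+i]. */
;     for (i=0 ; i<size ; i++) {
;         maxpy_PA20_big(size, &arg2[i], &arg1[size-1], &result[size+i]);
;     }
; }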
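;
; The behavior described above can be modeled in portable C. What
; follows is a minimal sketch, not the PA-RISC implementation: the
; names "mul64", "maxpy_model", and "mul_model" and the "step"
; parameter are hypothetical and do not appear in this file. "step"
; is +1 for little-wordian and -1 for big-wordian; as with the real
; routines, the vector addresses passed to "maxpy_model" are those of
; the least significant doubleword. A C99 compiler providing
; <stdint.h> is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64 x 64 -> 128-bit unsigned multiply, built from 32-bit halves. */
void mul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
    uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    /* Middle column: three values below 2^32, so no 64-bit overflow. */
    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);

    *lo = (p00 & 0xFFFFFFFFu) | (mid << 32);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* Hypothetical model of maxpy: result += (*scalaraddr) * multiplicand.
   Both vector pointers address the least significant doubleword;
   step (+1 or -1) is the direction toward more significant words. */
void maxpy_model(int length, const uint64_t *scalaraddr,
                 const uint64_t *multiplicand, uint64_t *result, int step)
{
    uint64_t scalar = *scalaraddr;  /* loaded once only */
    uint64_t carry = 0;
    ptrdiff_t pos = 0;
    int i;

    assert(step == 1 || step == -1);

    for (i = 0; i < length; i++) {
        uint64_t hi, lo, sum;

        mul64(multiplicand[pos], scalar, &hi, &lo);
        lo += carry;
        hi += (lo < carry);         /* carry out of lo + carry */
        sum = result[pos] + lo;
        hi += (sum < lo);           /* carry out of the accumulate */
        result[pos] = sum;
        carry = hi;
        pos += step;
    }

    /* Add the top word, then follow the carry chain as far as needed. */
    while (carry != 0) {
        uint64_t sum = result[pos] + carry;

        carry = (sum < carry);
        result[pos] = sum;
        pos += step;
    }
}

/* Schoolbook multiply in the shape of mul_PA20, for either layout.
   result must have room for 2*size doublewords. */
void mul_model(int size, const uint64_t *arg1, const uint64_t *arg2,
               uint64_t *result, int step)
{
    int i;

    for (i = 0; i < 2 * size; i++) result[i] = 0;

    for (i = 0; i < size; i++) {
        if (step > 0)   /* little-wordian: least significant word first */
            maxpy_model(size, &arg2[i], &arg1[0], &result[i], 1);
        else            /* big-wordian: least significant word last */
            maxpy_model(size, &arg2[i], &arg1[size - 1], &result[size + i], -1);
    }
}
```

; With step = +1, squaring the two-doubleword value 2^128-1 (both
; doublewords all ones) should leave 1, 0, 2^64-2, 2^64-1 in
; result[0] through result[3]; with step = -1 the same doublewords
; appear in the reverse order. The model uses array indexing rather
; than moving the pointers themselves, so that it never forms an
; address before the start of a vector in the big-wordian case.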