dnl SPARC v9 64-bit mpn_addmul_2 -- Multiply an n-limb number by a 2-limb
dnl number and add the result to an n-limb vector.
dnl Copyright 2002, 2003 Free Software Foundation, Inc.
dnl This file is part of the GNU MP Library.
dnl
dnl The GNU MP Library is free software; you can redistribute it and/or modify
dnl it under the terms of either:
dnl
dnl * the GNU Lesser General Public License as published by the Free
dnl Software Foundation; either version 3 of the License, or (at your
dnl option) any later version.
dnl
dnl or
dnl
dnl * the GNU General Public License as published by the Free Software
dnl Foundation; either version 2 of the License, or (at your option) any
dnl later version.
dnl
dnl or both in parallel, as here.
dnl
dnl The GNU MP Library is distributed in the hope that it will be useful, but
dnl WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
dnl or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
dnl for more details.
dnl
dnl You should have received copies of the GNU General Public License and the
dnl GNU Lesser General Public License along with the GNU MP Library. If not,
dnl see https://www.gnu.org/licenses/.
include(`../config.m4')
C cycles/limb
C UltraSPARC 1&2: 9
C UltraSPARC 3: 10
C Algorithm: We use 16 floating-point multiplies per limb product, with the
C 2-limb v operand split into eight 16-bit pieces, and the n-limb u operand
C split into 32-bit pieces. We sum four 48-bit partial products using
C floating-point add, then convert the resulting four 50-bit quantities and
C transfer them to the integer unit.
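C The exactness argument behind this scheme can be checked with a small C
C sketch. It is an illustration only, not code from this file: the helper
C names are hypothetical, and it handles a single 64-bit v limb (four 16-bit
C pieces) against one 32-bit piece of u. A 32-bit by 16-bit product is below
C 2^48 and a sum of four of them is below 2^50, so both fit exactly in the
C 53-bit significand of an IEEE double.

```c
#include <stdint.h>

/* Hypothetical helper: split one 64-bit limb into four 16-bit pieces,
   least significant first, and store each piece as a double. */
static void split16(uint64_t v, double piece[4])
{
    for (int i = 0; i < 4; i++)
        piece[i] = (double)((v >> (16 * i)) & 0xffff);
}

/* Multiply a 32-bit piece of u by each 16-bit piece of v in floating
   point and compare with the integer product.  Returns 1 when all four
   floating-point products are exact, 0 otherwise. */
static int fp_products_exact(uint32_t u32, uint64_t v)
{
    double piece[4];
    split16(v, piece);
    for (int i = 0; i < 4; i++) {
        double p = (double)u32 * piece[i];              /* < 2^48 */
        uint64_t want = (uint64_t)u32 * ((v >> (16 * i)) & 0xffff);
        if ((uint64_t)p != want)
            return 0;
    }
    return 1;
}

/* Largest possible sum of four 48-bit partial products, accumulated
   with floating-point adds as the algorithm description suggests. */
static uint64_t max_sum_of_four(void)
{
    double m = (double)0xffffffffu * (double)0xffffu;   /* largest product */
    return (uint64_t)(m + m + m + m);                   /* < 2^50 */
}
```

C Calling fp_products_exact with all-ones operands exercises the worst case,
C and max_sum_of_four stays below 2^50, which matches the "four 50-bit
C quantities" mentioned above.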
C Possible optimizations:
C 1. Align the stack area where we transfer the four 50-bit product-sums
C to a 32-byte boundary. That would minimize cache collisions.
C (UltraSPARC-1/2 use a direct-mapped cache.) (Perhaps even better would
C be to align the area to map to the area immediately before up?)
C 2. Perform two of the fp->int conversions with integer instructions. We
C can get almost ten free IEU slots, if we clean up bookkeeping and the
C silly carry-limb code.
C 3. For an mpn_addmul_1 based on this, we need to fix the silly carry-limb
C code.
C OSP (Overlapping software pipeline) version of mpn_mul_basecase:
C Operand swap will require 8 LDDA and 8 FXTOD, which will mean 8 cycles.
C FI = 20
C L = 9 x un * vn
C WDFI = 10 x vn / 2
C WD = 4
C Instruction classification (as per UltraSPARC functional units).
C Assuming silly carry code is fixed. Includes bookkeeping.
C
C                 mpn_addmul_X    mpn_mul_X
C                  1       2       1       2
C                 ==========      ==========
C FM               8      16       8      16
C FA              10      18      10      18
C MEM             12      12      10      10
C ISHIFT           6       6       6       6
C IADDLOG         11      11      10      10
C BRANCH           1       1       1       1
C
C TOTAL IEU       17      17      16      16
C TOTAL           48      64      45      61
C
C IEU cycles       8.5     8.5     8       8
C MEM cycles      12      12      10      10
C ISSUE cycles    12      16      11.25   15.25
C FPU cycles      10      18      10      18
C cycles/loop     12      18      12      18
C cycles/limb     12       9      12       9
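C The derived rows of the table can be reproduced with a short C sketch.
C This is one reading of the table, not something stated in this file: the
C divisors (two IEU operations per cycle, one memory operation per cycle,
C four instructions issued per cycle, FM and FA running in parallel) are
C inferred from the numbers above, and the struct and function names are
C hypothetical.

```c
/* Per-loop instruction counts, one field per row of the table. */
struct op_counts {
    int fm, fa, mem, ishift, iaddlog, branch;
};

/* Lower bound on cycles per loop iteration: the largest of the four
   resource limits.  The result is not rounded, so 11.25 stays 11.25
   where the table shows 12. */
static double cycles_per_loop(struct op_counts c)
{
    double ieu   = (c.ishift + c.iaddlog) / 2.0;       /* assumed 2 IEU ops/cycle */
    double mem   = (double)c.mem;                      /* assumed 1 mem op/cycle */
    double issue = (c.fm + c.fa + c.mem + c.ishift
                    + c.iaddlog + c.branch) / 4.0;     /* assumed 4-way issue */
    double fpu   = (double)(c.fm > c.fa ? c.fm : c.fa);

    double bound = ieu;
    if (mem > bound)   bound = mem;
    if (issue > bound) bound = issue;
    if (fpu > bound)   bound = fpu;
    return bound;
}
```

C With the mpn_addmul_2 column (FM 16, FA 18, MEM 12, ISHIFT 6, IADDLOG 11,
C BRANCH 1) the bound is the FPU limit of 18 cycles per loop; dividing by the
C two v limbs handled per pass gives the 9 cycles/limb in the last row.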
C INPUT PARAMETERS
C rp[n + 1] i0
C up[n] i1
C n i2
C vp[2] i3
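C For readers unfamiliar with the operation, the intended result can be
C sketched in portable C. This is a hedged reference model, not the code in
C this file: it uses 32-bit limbs so that plain 64-bit arithmetic suffices,
C the names ref_addmul_1 and ref_addmul_2 are hypothetical, and it assumes
C that {rp,n} is read, n+1 limbs are written at rp, and the most significant
C limb of the result is returned.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, for illustration only */

/* {rp,n} += {up,n} * v; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (B-1)^2 + 2(B-1) = B^2 - 1, so this never overflows 64 bits. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = t >> 32;
    }
    return (limb_t)carry;
}

/* {rp,n} += {up,n} * {vp,2}; writes rp[0..n] and returns the top limb.
   rp[n] is written without being read first. */
static limb_t ref_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                           const limb_t vp[2])
{
    rp[n] = ref_addmul_1(rp, up, n, vp[0]);
    return ref_addmul_1(rp + 1, up, n, vp[1]);
}
```

C For example, with n = 1, up = {3}, vp = {5, 7} and rp[0] = 10, the model
C leaves rp = {25, 21} and returns 0, since 3 * (5 + 7B) + 10 = 25 + 21B.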
C Initialization. (1) Split v operand into eight 16-bit chunks and store them
C as IEEE double in fp registers. (2) Clear upper 32 bits of fp register pairs
C f2 and f4. (3) Store masks in registers aliased to `xffff' and `xffffffff'.
C This code could be better scheduled.