Copyright 2000-2002 Free Software Foundation, Inc.
This file is part of the GNU MP Library.
The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:
* the GNU Lesser General Public License as published by the Free
Software Foundation; either version 3 of the License, or (at your
option) any later version.
or
* the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any
later version.
or both in parallel, as here.
The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library. If not,
see https://www.gnu.org/licenses/.
The code in this directory works for Cray vector systems such as C90,
J90, T90 (both the CFP variant and the IEEE variant) and SV1. (For
the T3E and T3D systems, see the `alpha' subdirectory at the same
level as the directory containing this file.)
The cfp subdirectory is for systems utilizing the traditional Cray
floating-point format, and the ieee subdirectory is for the newer
systems that use the IEEE floating-point format.
There are several issues that reduce speed on Cray systems. For
systems with cfp floating point, the main obstacle is the forming of
128-bit products. For IEEE systems, adding, and in particular
computing carry, is the main issue. There are no vectorizing
unsigned-less-than instructions, and the sequence that implements that
operation is very long.
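As an illustration of why carry computation is costly without an unsigned compare, here is a minimal sketch of one standard way to obtain the carry out of a 64-bit addition using only logical operations and a shift. The name carry_out is hypothetical, and whether the actual Cray sequence resembles this is an assumption; the point is only that the compare has to be replaced by several dependent operations.

```c
#include <stdint.h>

/* Carry out of a + b, computed without an unsigned comparison.
   Bit 63 of ((a & b) | ((a | b) & ~s)) is the majority of the two
   top operand bits and the carry into the top bit position.  */
uint64_t carry_out(uint64_t a, uint64_t b)
{
  uint64_t s = a + b;                      /* wraps modulo 2^64 */
  return ((a & b) | ((a | b) & ~s)) >> 63;
}
```

This gives the same result as the expression (a + b) < a, at the price of five logical operations and a shift per limb.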
Shifting is the only operation that is simple to make fast. All Cray
systems have bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
should be really useful.
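A minimal sketch of why shifting is easy to make fast: every result limb depends only on two neighbouring source limbs, so the loop has no carry chain. The names limb_t and lshift_sketch are hypothetical, and the mapping of the loop body onto the vector shift instructions mentioned above is an assumption, not something this file states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for a 64-bit limb */

/* Shift {up, n} left by cnt bits (1 <= cnt <= 63), store the result at
   {rp, n} and return the bits shifted out of the top limb.  */
limb_t lshift_sketch(limb_t *rp, const limb_t *up, size_t n, unsigned cnt)
{
  assert(n >= 1 && cnt >= 1 && cnt < 64);
  limb_t out = up[n - 1] >> (64 - cnt);
  /* Iterations are independent of each other; running from the top
     down also keeps the loop correct when rp == up.  */
  for (size_t i = n - 1; i > 0; i--)
    rp[i] = (up[i] << cnt) | (up[i - 1] >> (64 - cnt));
  rp[0] = up[0] << cnt;
  return out;
}
```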
For best speed for cfp systems, we need a mul_basecase, since that
reduces the need for carry propagation to a minimum. Depending on the
size (vn) of the smaller of the two operands (V), we should split U and V
in different chunk sizes:
U split in 2 32-bit parts
V split according to the table:
parts                    4       5       6       7       8
bits/part                16      13      11      10      8
max allowed vn           1       8       32      64      256
number of multiplies     8       10      12      14      16
peak cycles/limb         4       5       6       7       8

U split in 3 22-bit parts
V split according to the table:
parts                    3       4       5
bits/part                22      16      13
max allowed vn           16      1024    8192
number of multiplies     9       12      15
peak cycles/limb         4.5     6       7.5

U split in 4 16-bit parts
V split according to the table:
parts                    4
bits/part                16
max allowed vn           65536
number of multiplies     16
peak cycles/limb         8
(A T90 CPU can accumulate two products per cycle.)
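The table rows can be reproduced with a few lines of arithmetic, sketched below. The function names are hypothetical. The multiply count is the product of the two part counts, and the peak figure follows from the two products per cycle noted above. The 48-bit limit used for "max allowed vn" is an assumption inferred from the numbers themselves (every column satisfies vn = 2^(48 - ubits - vbits)); this file does not state where the limit comes from.

```c
#include <stdint.h>

/* Partial products needed per pair of limbs.  */
unsigned multiplies(unsigned uparts, unsigned vparts)
{
  return uparts * vparts;
}

/* Peak cycles per limb when two products are accumulated per cycle.  */
double peak_cycles(unsigned uparts, unsigned vparts)
{
  return multiplies(uparts, vparts) / 2.0;
}

/* ASSUMPTION: sums of products must fit in 48 bits, which leaves
   48 - ubits - vbits bits of headroom for accumulating vn products.  */
uint64_t max_vn(unsigned ubits, unsigned vbits)
{
  return UINT64_C(1) << (48 - ubits - vbits);
}
```

For example, U in 3 22-bit parts and V in 4 16-bit parts gives 12 multiplies, 6 peak cycles/limb, and a maximum vn of 2^10 = 1024, matching the second table.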
IDEA:
* Rewrite mpn_add_n:
    short cy[n + 1];
    #pragma _CRI ivdep
    for (i = 0; i < n; i++)
      { s = up[i] + vp[i];
        rp[i] = s;
        cy[i + 1] = s < up[i]; }
    more_carries = 0;
    #pragma _CRI ivdep
    for (i = 1; i < n; i++)
      { s = rp[i] + cy[i];
        rp[i] = s;
        more_carries += s < cy[i]; }
    cys = 0;
    if (more_carries)
      {
        cys = rp[1] < cy[1];
        for (i = 2; i < n; i++)
          { c = rp[i] < cy[i];   /* carry already produced by the loop above */
            rp[i] += cys;
            cys = c | (rp[i] < cys); }
      }
    return cys + cy[n];
* Write mpn_add3_n for adding three operands. First add operands 1
and 2, and generate cy[]. Then add operand 3 to the partial result,
and accumulate carry into cy[]. Finally propagate carry just like
in the new mpn_add_n.
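The two items above can be sketched in portable C as follows. This is an illustration under stated assumptions, not the library's code: limb_t, propagate, add_n_sketch and add3_n_sketch are hypothetical names, the carry vector is heap-allocated for simplicity, and the vectorization pragmas are omitted. The scalar fix-up loop rechecks rp[i] < cy[i] for every limb so that carries produced while adding cy[] are not lost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for a 64-bit limb */

/* Add the saved carries cy[1..n-1] into rp[], then ripple any carries
   that this addition itself produced.  Returns the carry out of the
   top limb, not counting cy[n].  */
static limb_t propagate(limb_t *rp, const unsigned char *cy, size_t n)
{
  limb_t more = 0, cys = 0;
  for (size_t i = 1; i < n; i++)        /* independent iterations */
    {
      rp[i] += cy[i];
      more += rp[i] < cy[i];
    }
  if (more)                             /* rare scalar fix-up */
    for (size_t i = 1; i < n; i++)
      {
        limb_t c = rp[i] < cy[i];       /* carry from the loop above */
        rp[i] += cys;
        cys = c | (rp[i] < cys);
      }
  return cys;
}

limb_t add_n_sketch(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
  unsigned char *cy = calloc(n + 1, 1);
  assert(cy != NULL);
  for (size_t i = 0; i < n; i++)
    {
      limb_t u = up[i], s = u + vp[i];
      rp[i] = s;
      cy[i + 1] = s < u;
    }
  limb_t c = propagate(rp, cy, n) + cy[n];
  free(cy);
  return c;
}

/* Three-operand add: cy[] accumulates up to two carries per limb.  */
limb_t add3_n_sketch(limb_t *rp, const limb_t *up, const limb_t *vp,
                     const limb_t *wp, size_t n)
{
  unsigned char *cy = calloc(n + 1, 1);
  assert(cy != NULL);
  for (size_t i = 0; i < n; i++)        /* operands 1 and 2 */
    {
      limb_t u = up[i], s = u + vp[i];
      rp[i] = s;
      cy[i + 1] = s < u;
    }
  for (size_t i = 0; i < n; i++)        /* operand 3 */
    {
      limb_t w = wp[i], s = rp[i] + w;
      rp[i] = s;
      cy[i + 1] += s < w;
    }
  limb_t c = propagate(rp, cy, n) + cy[n];
  free(cy);
  return c;
}
```

The return value of add3_n_sketch can be 0, 1 or 2, since three n-limb operands can overflow by up to two units of the next limb.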
IDEA:
Store fewer bits, perhaps 62, per limb. That brings mpn_add_n time
down to 2.5 cycles/limb and mpn_addmul_1 times to 4 cycles/limb. By
storing even fewer bits per limb, perhaps 56, it would be possible to
write a mul_basecase that would run at effectively 1 cycle/limb.
(Use VM here to better handle the rhomb-shaped multiply area, perhaps
rounding operand sizes up to the next power of 2.)
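A minimal sketch of what storing 62 bits per limb buys for addition: the sum of two limbs plus a carry always fits in a 64-bit word, so the carry can be read off with a shift instead of an unsigned comparison. The names LIMB_BITS, LIMB_MASK and add_n_62 are hypothetical, and this scalar loop only illustrates the representation; how the quoted cycle counts would be reached on the vector hardware is not stated here.

```c
#include <stddef.h>
#include <stdint.h>

#define LIMB_BITS 62
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* Add {up, n} and {vp, n}, where every limb holds only LIMB_BITS
   significant bits.  Returns the carry out of the top limb.  */
uint64_t add_n_62(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                  size_t n)
{
  uint64_t c = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint64_t s = up[i] + vp[i] + c;   /* below 2^63, cannot wrap */
      rp[i] = s & LIMB_MASK;
      c = s >> LIMB_BITS;               /* carry is simply the high bits */
    }
  return c;
}
```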