9512.net

甜梦文库

甜梦文库

当前位置：首页 >> >> # Faster floating-point square root for integer processors

Faster ?oating-point square root for integer processors

Claude-Pierre Jeannerod? , Herv? Knochel? , Christophe Monat? , Member, IEEE, and Guillaume Revy? e

Laboratoire LIP (CNRS, ENSL, INRIA, UCBL) ? Ecole Normale Sup? rieure de Lyon, 46 all? e d’Italie, 69364 Lyon cedex 07, France e e Email: {Claude-Pierre.Jeannerod, Guillaume.Revy}@ens-lyon.fr ? STMicroelectronics, 12 rue Jules Horowitz, 38019 Grenoble cedex 1, France Email: {Herve.Knochel,Christophe.Monat}@st.com

Abstract— This paper presents some work in progress on fast and accurate ?oating-point arithmetic software for ST200-based embedded systems. We show how to use some key architectural features to design codes that achieve correct rounding-to-nearest without sacri?cing for ef?ciency. This is illustrated with the square root function, whose implementation given here is faster by over 35% than the previously best one for such systems.

?

I. I NTRODUCTION The STMicroelectronics ST200 is an embedded media processor derived from the Lx technology platform [7], designed to implement advanced audio and video codecs in consumer devices such as set-top boxes for HD-IPTV (High De?nition Internet Protocol Television), cell phones, wireless terminals, and PDAs. The integration of several instances of the ST200 in large Systems On Chip (SOC) provides an unprecedented level of media processing ability, enabling for instance single chip H.264 solutions. The processing power brought by these media processors allows to replace dedicated hardware by software, thus giving more ?exibility to the design while sustaining a high performance level. For instance, complex audio decoders or graphics pipeline fragments are now fully implemented in software. One of the interesting characteristics of these applications is that they are highly demanding on ?oating-point square root computations, mostly for vector normalization purposes. However, this kind of situation is not so frequent in actual code: application developers have traditionally avoided square root functions (as they did for the division operator) because of its high real or perceived computational cost. Instead, they have devised “square root free” algorithms or designed ad-hoc square root implementations. We think that this burden could be removed from the application developer by providing a square root implementation that is both very fast and, as prescribed by the IEEE-754 standard [1], correctly rounded. Motivated by the above applications, we shall focus in this paper on correct rounding-to-nearest and normalized singleprecision ?oating-point numbers. The input argument x to square root is a normalized single-precision ?oating-point number [19, §4] when it has the form x = m · 2e , (1)

with e an integer between ?126 and 127 and m a rational number having binary expansion ±1.f1 f2 . . . f23 . For x > 0, √ the real number x thus has the form √ x = · 2d , √ √ where d = e/2 , = t m and t = 1 or 2 depending on the parity of e. Although d lies in the range of e, the binary expansion (2) 1. 1 2 . . . 23 24 . . . of is in general nonterminating, unlike that of m, and the √ exact value x may eventually not ?t in format (1). Instead, √ we output a correctly-rounded value of x; under round-tonearest rounding mode, it is the unique normalized single√ precision number r that is closest to x. In fact, obtaining r from x essentially reduces to computing the exact values of the bits 1 , . . . , 24 in (2) above (see for example [6, §8.6.3] and Subsection III-A). Several ef?cient software implementations of correctlyrounded square roots have already been described for processors with or without ?oating-point units. On HP/Intel’s Itanium the algorithms use Newton-Raphson’s iteration ([4],[14, §9.1.1],[9], [3, p.238]). The same method had been used earlier for IBM’s RS/6000 [16]. On ST231, the ?rst implementation has been given in [20, §11]; there, the initial approximation is re?ned by a single iteration. On IBM’s Power3, NewtonRaphson’s method has been replaced in [21] by another algorithm, better suited to the PowerPC architecture; that faster √ algorithm essentially re?nes an initial approximation of x by multiplying it with a well-chosen polynomial approximation √ of 1 ? y for some y 0. All those implementations rely on the same approach: compute an initial approximation to square root or square root reciprocal, and re?ne it by one or two iterations, typically Newton-Raphson’s. The initial approximation is obtained either by table lookup (RS/6000, Power3) or by calling an instruction that approximates the square root reciprocal to about 8 bits (frsqrta instruction on IA-64), or by evaluating small degree polynomial approximants (two degree-3 polynomials for ST231, see [20, p.113]). In this paper, we propose for the ST231 some new implementations: they are based exclusively on the evaluation of

polynomial approximants and thus avoid Newton-like iterative re?nements. Although the polynomials we use have degree 5 or 6, they can be evaluated very fast and accurately thanks to some key architectural features of the ST231. As a result, the square root latency of [20] has been reduced by over 35%. Notice that, for IA-64, a similar approach is used in [11] and [9], but for transcendental functions. Also, [12] studies ?oating-point software support for Intel’s XScale processor which, like the ST231, has no ?oating-point unit. The paper is organized as follows. Section II describes some features of the ST231 architecture and compiler that have been crucial for speeding up square root. Section III outlines our square root implementation and reports experimental results. Finally Section IV gives concluding remarks and future plans. II. OVERVIEW OF THE ST231 PROCESSOR AND COMPILER A. Architectural features The ST231 is a 32-bit, four-issue member of the ST200 VLIW core family. VLIW (Very Long Instruction Word) [8] processors use an architectural technique where instruction level parallelism (ILP) is explicitly exposed to the compiler. RISC-like operations (syllables) are grouped into bundles (wide words). The operations in a bundle are issued simultaneously and complete simultaneously. While the delay between issue and completion is the same for all operations, some results are available for bypassing to subsequent operations prior to completion. A hardware implementation of a VLIW is signi?cantly simpler than a corresponding multiple issue superscalar CPU. This is due principally to the simpli?cation of the operation grouping and scheduling hardware. This complexity is moved to the ILP extractor (compiler) and the instruction scheduling system (compiler and assembler) in the software toolchain. The ST200 family of cores uses a scalable technology that allows variation in instruction issue width, the number and capabilities of functional units and register ?les, and the instruction set. The ST200 family includes the following features: 1) parallel execution units, including multiple integer ALUs and 32x32 bit multipliers, 2) a large register ?le of 64 general purpose registers and 8 condition registers, 3) predicated execution through select operations, 4) ef?cient branch architecture with multiple condition registers, 5) encoding of immediate operands up to 32 bits. All these features are key to the square root function implementation, that uses ef?ciently all the resources exposed by the machine. B. Compiler The ST231 VLIW compiler is based on the Open64 technology (www.open64.net) retargeted for the ST200 processor family by STMicroelectronics since 2001. The compiler has been improved to support development for high performance embedded targets. Most important to the square root code

ef?ciency are the if-conversion optimization [2] and the Linear Assembly Optimizer (LAO) [5]. The if-conversion optimization generates mostly straight-line code by emitting ef?cient sequences of select instructions instead of costly control ?ow. The LAO generates a schedule for the instructions that is very close to the optimal. III. S QUARE ROOT IMPLEMENTATION As usual for IEEE-754 arithmetic [19, §4], a number x as in (1) is stored in a 32-bit register R = [R31 . . . R0 ] as follows: R31 contains the sign bit, R30 , . . . , R23 contain the 8 bits of the biased exponent e + 127, and the last 23 bits of R contain the fraction bits f1 , . . . , f23 . The correctly-rounded value r of √ x will be returned using the same three-?eld representation. This representation further allows to store special values like ±0, ±∞ and NaN (Not a Number). A. General algorithm outline Using the above machine representation of single-precision ?oating-point numbers, we get the following steps for computing a square root with correct rounding: 1) 2) 3) 4) 5) 6) Unpack x into three ?elds (sign, exponent, fraction) Handle special values of x (zeros, in?nities, NaNs) Compute the 8-bit exponent ?eld of r Compute the bits 1 , . . . , 24 in (2) Round to get the 23-bit fraction ?eld of r Pack the three ?elds of r

For special values, the following behavior is mandated by the √ √ standard: x = x for x ∈ {±0, +∞, NaN} and x = NaN for either x negative and nonzero, or x = ?∞. Filtering such special values can clearly be done in parallel with the rest of the computation, and step 2) has been implemented ef?ciently as in [20, §11.2.1]. Step 3) can also be done independently. This is essentially due to some known properties of the square root function. √ First, x = · 2d cannot lie in the middle of two consecutive numbers in format (1) (see for example [14, p.139]). Therefore knowing the bits 1 , . . . , 24 is enough for rounding correctly to nearest: r = (1. 1 . . . 23 + 24 · 2?23 ) · 2d . Second, when ?23 is always 24 = 1, it can be checked that 1. 1 . . . 23 + 2 < 2 and thus d needs not be incremented. Hence step 3) can be performed at any time once the biased exponent E = e + 127 of x is available; since d = e/2 , the biased exponent D = d + 127 of r is obtained very fast by shifting and truncating. We are thus left with steps 4) and 5), which are the most time-consuming. Several algorithmic choices are possible as well shown for example in [17]. Subsection III-B below reviews the methods that we have implemented; their latencies will be given in Subsection III-C. B. Implemented methods We have implemented seven square root algorithms, each of which falling into one of the three categories below.

1) Restoring and nonrestoring algorithms: These two are the most basic algorithms based on digit recurrence (see [6] for a detailed and comprehensive study). Both consist of 24 iterations: in the restoring algorithm, the ith iteration outputs exactly i , possibly after a restoration step involving one more addition; the nonrestoring algorithm avoids restoration by using {?1, 1} instead of {0, 1} as a digit set. Although a ?nal correction step is needed for that algorithm to return 1 , . . . , 24 , it is in general faster than the previous one. These two methods output 1 , . . . , 24 and thus rounding to nearest (step 5) in Subsection III-A) is straightforward. However, these methods are highly sequential (iteration i + 1 cannot start before iteration i is completed) and thus poorly suited to the parallelism available on ST231. Hence we use them mostly for comparison with the next, faster methods. 2) Newton-Raphson and Goldschmidt iterations: Those methods are based on iterative approximation [6, §7]: they √ start with an approximation v0 of 1/ m over [1, 2) and re?ne it by successive iterations into a very good approximation v √ of m, from which 1 , . . . , 24 can be deduced. For as in (2), it can be shown that a suf?cient condition on v for being able to round correctly is that |v ? | ≤ 2

?25

accurately enough thanks to the 32x32 bit multipliers, and polynomials are evaluated concurrently thanks to the parallel execution units. However, further speed-ups can be obtained on ST231 by a third approach, which we outline now. 3) Evaluation of polynomial approximants: This method corresponds to approximation by a real function [17, √ §5]. Unlike the previous method, it approximates directly m by one or several truncated minimax polynomials, and no re?nement is needed. We propose two implementations of this approach: ? Poly-5: three subintervals associated with three degree-5 polynomials giving values v as in (3). ? Poly-6: two subintervals associated with two degree-6 polynomials giving values v as in (3). The 32x32 bit multipliers of the ST231 allow to evaluate those polynomials very accurately, so that (3) holds. In order to take full advantage of the parallelism available on that architecture, we further use Estrin-like algorithms rather than the classical, but highly sequential Horner’s scheme ([13, p.488], [11]): for example polynomials of degree 5 are evaluated as (a5 · x + a4 ) · z + (a3 · x + a2 ) · y + (a1 · x + a0 ) , where y = x · x and z = y · y. As we shall see in the next subsection, on ST231 this simpler approach turns out to be also the fastest one. To achieve further speed-up, we have also improved the rounding procedure of [20, §11]: instead of ?rst deducing 1. 1 . . . 24 from v and then rounding according to the value of 24 , we compute the 23-bit fraction of r directly from the 26 most signi?cant bits of v. C. Experimental results The implementation of each of the above square root algorithms has been veri?ed by exhaustive tests (bit-exactness comparison with the values returned by the sqrtf function of the gcc libm). To measure timings (in numbers of clock cycles), non-special generic values have been used. As stated in Section II, the ST231 is a four-issue machine, but to evaluate the impact of the instruction parallelism, performance has been measured for three issue width values (2, 3, 4), which is a feature of the ST200 toolset (compiler and simulator). Timings are detailed in Table I, whereas Table II shows the average IPB (instructions per bundle) and IPC (instructions per cycle) numbers computed for the same issue width values. We see that best performance is obtained with the polynomial

Restoring Nonrestoring Newton-2 Goldschmidt-2 Goldschmidt-1 Poly-5 Poly-6 2 170 233 53 50 45 53 49 3 153 159 49 46 42 45 38 4 148 133 45 42 36 33 30

.

(3)

As in [20], initial approximations are obtained by evaluating a low-degree truncated minimax polynomial [18, p.36] that √ approximates the function 1/ m over a small interval. Also, to exploit parallelism while keeping degrees small, the interval [1, 2) is split into two or three subintervals, each of them corresponding to a possibly different approximation polynomial. Newton-Raphson’s or Goldschmidt’s iteration is known to roughly double the number of bits of accuracy of the current √ approximation. Starting from an approximation of 1/ m √ rather than m ensures that the iterations contain multiplications (for which the ST231 is well suited) √ no division. but Newton-Raphson’s iteration converges to 1/ m and a ?nal multiplication by m gives the desired value v; Goldschmidt’s version can be seen as performing that multiplication in the √ ?rst iteration and then converging directly to m. If the re?nement consists of a single iteration then both methods are essentially the same. We have implemented the following three variants: ? Newton-2: three subintervals associated with three polynomials √ degree 1 giving initial values v0 such that of |v0 ? 1/ m| ≤ and ∈ {2?10 , 2?8 } depending on the subinterval. Then two Newton-Raphson iterations are enough to get v as in (3). ? Goldschmidt-2: similar to Newton-2 except that the two iterations are now Goldschmidt iterations. ? Goldschmidt-1 [20, §11.2.3]: two subintervals associated with two degree-3 √ polynomials giving initial values v0 such that |v0 ? 1/ m| ≤ and ∈ {2?14.5 , 2?13.5 } depending on the subinterval. Then one iteration suf?ces to get v as in (3). Those methods are more suited to the ST231 architecture than the (non)restoring algorithm: multiplications are done

TABLE I T IMINGS ( CLOCK CYCLES ) FOR ISSUE WIDTH VALUES 2, 3, 4.

2 Newton-2 Goldschmidt-2 Goldschmidt-1 Poly-5 Poly-6 IPB 1.46 1.39 1.51 1.44 1.43 IPC 1.30 1.33 1.45 1.44 1.43 IPB 1.59 1.63 1.66 1.75 1.89

3 IPC 1.37 1.52 1.58 1.67 1.85 IPB 1.92 2.15 2.03 2.50 2.45

4 IPC 1.61 1.73 1.91 2.34 2.45

ACKNOWLEDGMENT The authors would like to thank the ”P? le de Comp? titivit? o e e Mondial Minalogic” (www.minalogic.org) for its support. R EFERENCES

[1] American National Standards Institute and Institute of Electrical and Electronic Engineers. IEEE standard for binary ?oating-point arithmetic. ANSI/IEEE Standard, Std 754-1985, New York, 1985. [2] Christian Bruel. If-conversion SSA framework for partially predicated VLIW architectures. In Digest of the 4th Workshop on Optimizations for DSP and Embedded Systems (Manhattan, New York, NY), March 2006. [3] Marius Cornea, John Harrison, and Ping Tak Peter Tang. Scienti?c Computing on Itanium-Based Systems. Intel Press, 2002. [4] Marius Cornea-Hasegan and Bob Norin. IA-64 ?oating-point operations and the IEEE standard for binary ?oating-point arithmetic. Intel Technology Journal, 1999-Q4:1–16, 1999. [5] Beno?t Dupont de Dinechin. From machine scheduling to VLIW ? instruction scheduling. ST Journal of Research, 1(2), September 2004. http://cri.ensmp.fr/classement/2003. [6] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann, 2003. [7] Paolo Faraboschi, Geoffrey Brown, Joseph A. Fisher, Giuseppe Desoli, and Fred Homewood. Lx: a technology platform for customizable VLIW embedded processing. In ISCA’00: Proceedings of the 27th annual international symposium on Computer architecture, pages 203– 213. ACM Press, 2000. [8] Joseph A. Fisher, Paolo Faraboschi, and Cliff Young. Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann, 2005. [9] Bruce Greer, John Harrison, Greg Henry, Wei Li, and Peter Tang. Scienti?c computing on the Itanium processor. In Proceedings of the 2001 Conference on Supercomputing, 2001. [10] John Harrison. Formal veri?cation of square root algorithms. Formal Methods in Systems Design, 22:143–153, 2003. [11] John Harrison, Ted Kubaska, Shane Story, and Peter Tang. The computation of transcendental functions on the IA-64 architecture. Intel Technology Journal, 1999-Q4:1–7, 1999. [12] Cristina Iordache and Ping Tak Peter Tang. An overview of ?oatingpoint support and math library on the Intel XScaleTM architecture. In IEEE Symposium on Computer Arithmetic, pages 122–128, 2003. [13] Donald E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming. Addison-Wesley, third edition, 1997. [14] P. Markstein. IA-64 and Elementary Functions : Speed and Precision. Hewlett-Packard Professional Books. Prentice Hall, 2000. [15] Peter Markstein. Software division and square root using Goldschmidt’s algorithms. In 6th Conference on Real Numbers and Computers, pages 146–157, 2004. [16] Peter W. Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM Journal of Research and Development, 34(1):111–119, 1990. [17] P. Montuschi and P. M. Mezzalama. Survey of square rooting algorithms. Computers and Digital Techniques, IEE Proceedings-, 137(1):31–40, 1990. [18] Jean-Michel Muller. Elementary functions: algorithms and implementation. Birkh¨ user, second edition, 2006. a [19] Michael L. Overton. Numerical computing with IEEE ?oating point arithmetic. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2001. [20] Saurabh-Kumar Raina. FLIP: a ?oating-point library for integer ? processors. PhD thesis, Ecole Normale Sup? rieure de Lyon, 2006. e [21] Martin S. Schmookler, Ramesh C. Agarwal, and Fred G. Gustavson. Series approximation methods for divide and square root in the Power3TM processor. In ARITH ’99: Proceedings of the 14th IEEE Symposium on Computer Arithmetic, pages 116–123. IEEE Computer Society, 1999. [22] Joshua J. Yi, Ajay Joshi, Resit Sendag, Lieven Eeckhout, and David J. Lilja. Analyzing the processor bottlenecks in SPEC CPU2000. In Proceedings of the 2006 SPEC Benchmark Workshop, page on CD, Austin, TX, 1 2006.

TABLE II AVERAGE IPB ( INSTRUCTIONS PER BUNDLE ) AND IPC ( INSTRUCTIONS PER CYCLE ) NUMBERS FOR ISSUE WIDTH VALUES 2, 3, 4.

evaluation methods of III-B-3), achieving a timing of 30 cycles for the implementation of Poly-6, and ef?ciently exploiting the instruction parallelism of the ST231. Compared to the Newton-Raphson/Goldschmidt-based method [20], currently implemented in the latest of?cial ST231 toolset (and whose latency is of 48 cycles), a speed-up of 37.5% is observed; compared to the implementation of the naive nonrestoring algorithm, the speed-up is by a factor of almost 5. All special values are handled in at most 22 cycles and are thus never slower than non-special values. Notice also that Goldschmidt-2 is always faster than Newton-2, thus con?rming the observation already made in [15] that Goldschmidt’s method can be useful not only in hardware but also in software. As highlighted in Tables I and II, the polynomial evaluation methods are also more sensitive to the issue width value, because of their high degree of parallelism; for example, the Poly-6 method achieves an IPC value of 2.45. IV. C ONCLUSIONS As we have observed, a software implementation of a correctly rounded-to-nearest single-precision square root operator using polynomial evaluation algorithms, and exploiting at best the instruction parallelism, turns out to be very ef?cient on a VLIW architecture such as the ST231, achieving a cost of no more than 30 cycles. Furthermore, as it was formerly noticed in the IA-64 context [10], letting the compiler generate open code for the square root can even be more ef?cient when multiple calls to this function are done (either by the programmer or because of compiler transformations such as loop unrolling or software pipelining). In that case, all the constants needed to evaluate the polynomials are factorized by the compiler, thus increasing the computation issue rate. In this paper we have focused on square root only: although simple, this function already illustrates well how to adapt to a target like the ST231 in order to achieve both speed and accuracy; also, in processor performance bottlenecks, square roots come ?rst among basic ?oating-point operations [22, Table 2]. But we are currently extending our approach to other functions like square root reciprocal (that is even more critical to graphics pipelines), reciprocal, and division. Finally, automating the generation of fast and accurate C code for computing algebraic functions beyond the four above will provide a way to explore various code generation tradeoffs, and to evaluate our techniques on other architectures.

赞助商链接

- FOR FLOATING POINT ADDITION
- Integer Floating-point Recon gurable ALU
- Non-restoring integer square root A case study in design by principled optimization
- The case for a redundant format in floating point arithmetic
- AN INTEGER BASED SQUARE ROOT ALGORITHM
- ALTSQRT (Integer Square Root)
- Implementation of single precision floating point square root on FPGAs
- 1 A Floating-point to Fixed-point C Converter for Fixed-point Digital Signal Processors
- AN INTEGER BASED SQUARE ROOT ALGORITHM
- Floating Point Square Root (ALTFP_SQRT) Megafunction User Guide
- Decimal Floating-Point Square Root Using Newton-Raphson Iteration
- Correctness proofs outline for Newton-Raphson based floating-point divide and square root a
- A Block Floating Point Implementation for an N-Point FFT on
- An efficient square-root algorithm for BLAST
- Computing a square root for the number field sieve

更多相关文章：
更多相关标签：