9512.net
甜梦文库
当前位置:首页 >> >>

Telescopic Units A New Paradigm for Performance Optimization of VLSI Design


220

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 17, NO. 3, MARCH 1998

Telescopic Units: A New Paradigm for Performance Optimization of VLSI Designs
Luca Benini, Enrico Macii, Member, IEEE, Massimo Poncino, Member, IEEE, and Giovanni De Micheli, Fellow, IEEE

Abstract—This paper introduces a novel optimization paradigm for increasing the throughput of digital systems. The basic idea consists of transforming ?xed-latency units into variable-latency ones that run with a faster clock cycle. The transformation is fully automatic and can be used in conjunction with traditional design techniques to improve the overall performance of speedcritical units. In addition, we introduce procedures for reducing the area overhead of the modi?ed units, and we formulate an algorithm for automatically restructuring the controllers of the data paths in which variable-latency units have been introduced. Results, obtained on a large set of benchmark circuits, show an average throughput improvement exceeding 27%, at the price of a modest area increase (less than 8% on average). Index Terms— Circuit optimization, design automation, highspeed integrated circuits, logic design, synchronization.

I. INTRODUCTION HE ever increasing clock frequency of high-performance systems pushes IC designers and synthesis tools to perform substantial efforts in minimizing the delay of combinational logic blocks that constrain the cycle time. Gate-level timing optimization is often a computationally intensive task and sometimes it leads to a signi?cant area and powerconsumption overhead. In addition, it may not be the most convenient choice if some ?exibility is allowed in changing the design architecture. In the majority of circuit and system designs, throughput provides a more meaningful measure of performance than clock frequency. Throughput is abstractly de?ned as the amount of computation performed in a time unit. Obviously, decreasing the clock cycle time is one way to improve the throughput in a digital system. However, architectural optimizations such as parallelism exploitation and pipelining are much more effective in increasing throughput than bare clock speedup [1]. A well-known throughput-enhancement technique is based on using variable-latency units [1], [2]. For example, highperformance hardware components for division or for the computation of transcendental functions are well suited for
Manuscript received August 28, 1997. This work was supported in part by NSF under Contract MIP-9313701 and by ARPA under Grant DABT63-95C-0049. This paper was recommended by Associate Editor F. Brglez. L. Benini and G. De Micheli are with Stanford University, Computer Systems Laboratory, Stanford, CA 94305 USA. E. Macii and M. Poncino are with Politecnico di Torino, Dipartimento di Automatica e Informatica, Torino, Italy 10129. Publisher Item Identi?er S 0278-0070(98)04630-2.

T

variable-latency implementation. Such units complete execution in a variable number of clock cycles, depending on the input data they receive. The variable-latency implementation is a natural solution for ?oating-point arithmetic computations because the algorithms involved are iterative in nature and the number of iterations is data-dependent. The basic principle that motivates the implementation of a variable-latency resource is that of “speeding up the common case” [1]. A ?xed-latency unit completes execution with the latency of its longest possible computation. On the contrary, a variable-latency unit adapts its latency to the length of the computation it is performing. Average throughput is improved if the probability of a long-latency computation is much smaller than that of a short-latency one. Unfortunately, the overhead that occurs when instantiating a variable-latency unit is twofold. First, a completion signal must be provided to inform the environment of the termination of a computation. Second, the control logic in the environment must be able to synchronize with a variable-latency completion. Clearly, the overhead should be kept as small as possible. High probability of short-latency computation and low overhead are the two con?icting requirements for the success of a variable-latency resource in satisfying the design goals. Hence, hand-crafted design of variable-latency units is a dif?cult task and computer-aided design tools may be of great help. In this work, we address the automatic synthesis of highthroughput, variable-latency units and the estimation of the expected performance improvements. We focus our attention on components of synchronous circuits, which originally implement arbitrary combinational functions in a single clock cycle (i.e., with ?xed latency of one unit). We transform such units into variable-cycle implementations, which we call telescopic units. Whereas such units have data-dependent latency, their clock rate can be sped-up to match the “common case,” i.e., the critical-path delay of most computations that can still be achieved in one clock cycle. Longer computations will be split over two (or more) cycles. The overall performance improvement of this transformation stems from achieving a faster clock rate for the synchronous circuit of interest. Seen as a black box, a telescopic unit produces two outputs: the functional output (i.e., the result of the computation) and a handshaking hold signal which is activated when the functional unit requires more than one clock cycle to complete the computation. The advantage of the telescopic unit is an increase in average throughput. The overhead consists of the

0278–0070/98$10.00 ? 1998 IEEE

BENINI et al.: NEW PARADIGM FOR PERFORMANCE OPTIMIZATION OF VLSI DESIGNS

221

circuitry for the generation of the hold signal. Additional circuitry is required in the external control logic in order to observe the hold signal and synchronize the telescopic unit with its environment. Although telescopic units are, in principle, similar to selftimed units [3], they operate in a fully synchronous environment. Hence, they take an integer number of clock cycles to complete their executions. The fully synchronous operation allows us to ignore the issues related to hazards, which make the design of large scale self-timed circuits complex and expensive. We outline algorithms and heuristics for automatically synthesizing telescopic units which rely on symbolic techniques for exact timing analysis [4]. Experimental results con?rm the viability of our approach, and they clearly indicate the applicability of the technique for throughput optimization, as well as for area optimization under throughput constraints. Recently, Hassoun and Ebeling have presented an approach similar to ours, called architectural retiming [5]. Their idea is to increase the number of registers on latency-constrained paths, thus decreasing the cycle time without increasing the latency. This is made possible by adding a negative register to each newly-added register, in such a way that regular/negative register pairs are implemented as wires. The implementation of the negative registers is key for the applicability of the method. Since the output of a negative register is equal to the input value at the next clock cycle, the implementation requires a sort of prediction. This prediction is veri?ed one clock cycle after its calculation: if it is correct, the system can proceed with the next prediction; otherwise, the mispredicted value must be ?ushed and the circuit must be restored to the previous state. This implies a one-cycle latency penalty. Another similar approach that ?nds application in the design of asynchronous data-path units is called speculative completion [6]. This method merges the advantages of other well-known techniques for the detection of the computation completion for asynchronous units. The idea is that of associating multiple, “speculative” delay models with a unit, in such a way that the completion of an operation is detected in parallel with the unit itself. The multiple models account for different (e.g., worst-case verses best-case) speeds of early completion, and each speculative delay has its own abort detection logic that signals whether the corresponding delay model has to be aborted. Both architectural retiming and speculative completion are hand-crafted techniques that have not been automated. On the contrary, telescopic units are automatically synthesized. The remainder of this manuscript is organized as follows. Section II provides the notation and some background information concerning delays in combinational circuits and summarizes how exact timing analysis can be performed ef?ciently using symbolic techniques based on algebraic decision diagrams (ADD’s). Section III introduces the telescopic unit architecture, and Section IV discusses in detail algorithms and heuristics for the automatic synthesis of telescopic units. Section V addresses the problem of designing controllers for systems containing telescopic units. Section VI reports the results of a large set of experiments we have carried out on

standard benchmark examples. Finally, Section VII is devoted to conclusions. II. BACKGROUND A. Circuits and Delays A combinational circuit is a feedback-free network of combinational logic gates, called gates for brevity. If the output of is connected to an input of a gate , then is a a gate and gate is a fanout of gate A controlling fanin of value at a gate input is the value that determines the value at the output of the gate independent of the other inputs, while a noncontrolling value at a gate input is the value whose presence is not suf?cient to determine the value at the output of the gate. For example, zero is the controlling value for a NAND gate. rise Each connection is associated with two delays, fall delay. The delay function of connection delay and from gate to gate is called It equals if takes value 1 when input vector is applied to the primary If all fanin inputs of the circuit. Otherwise, and , we connections of have the same values of , where de?ne the delay function of as is any fanin connection of If is the global function of (function in terms of the primary inputs) and connects gate to gate , then

Given a gate , the arrival time is the time at which the output of settles to its ?nal value if input vector is applied at time zero. A path in a combinational circuit is a sequence of gates and , where connection connections connects the output of gate to the input of The length of a path is gate The topological delay of a de?ned as combinational circuit is the length of its longest path. An event or at a gate. Given a sequence is a transition occurring at gates of events, along a path, such that occurs as a result of event , the event is said to propagate along the path. Under a speci?ed is said to be delay model, a path occurring at gate can propagate sensitizable if an event A false path is a nonsensitizable path. The critical along path of a combinational circuit is the longest sensitizable path of under a speci?ed delay model: its length is the delay the combinational circuit and it is a lower bound on the cycle For the sake of simplicity, we neglect settime , i.e., up and hold times and propagation delays through registers. These factors can be easily incorporated into our analysis and synthesis technique. B. Pseudo-Boolean Functions and ADD’s In the remainder of this manuscript, we assume the reader is familiar with the fundamental concepts of Boolean functions. In addition, we take for granted the knowledge of symbolic

222

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 17, NO. 3, MARCH 1998

techniques for the representation and the manipulation of such functions through binary decision diagrams (BDD’s) [7]. Therefore, in the following, we only brie?y recall the basic notions related to pseudo-Boolean functions and to the data structures, the algebraic decision diagrams (ADD’s) [8], which are commonly used for their representation. is a A -input pseudo-Boolean function mapping from a -dimensional Boolean space to a ?nite set Function can be ef?ciently stored and manipulated through an ADD, an extension of the BDD which allows values from an arbitrary ?nite domain to be associated with the terminal nodes (i.e., the leaves) of the diagram. Among the existing operators for ef?cient ADD manipulation, THRESHOLD is of particular importance for our purposes. It takes two arguments: a generic ADD, and val a threshold value. It sets to 0 all the leaves of whose value is smaller than val and to 1 all the leaves of whose value is is thus greater than or equal to val. The resulting ADD, restricted to have only 0 or 1 as terminal values; therefore, it is a BDD. C. ADD-Based Timing Analysis The problem of calculating the timing response of a combinational circuit can be formulated as follows [4]. Given the circuit, ?nd the set of input vectors for which the length of the critical path, under a speci?ed mode of operation and a gate delay model, is maximum. The length of the critical path gives the overall circuit delay. Consider a gate of the network and a primary input vector , where is the set of all the care input vectors of the is evaluated circuit. The arrival time at its output line in terms of the arrival times of its inputs and the delays of its Let be the connection to pin fanin connections of gate If all fanins of have noncontrolling values

(a)

(b)

(c) Fig. 1. (a) A combinational circuit, (b) its output arrival times, and (c) the corresponding ADD.

If at least one fanin of has a controlling value for input , where is the set of all possible care input vectors controlling

gate to be the same for all fanins and input vectors and to for each gate. By applying the be given as single values three equations above on a gate-by-gate basis, and proceeding from the inputs to the outputs of the circuit, we can determine for each input the arrival time of the output node of gate vector. The table of Fig. 1(b) provides such information. It is now possible to ef?ciently store the content of the table as an ADD. Fig. 1(c) depicts the ?nal data structure. D. Throughput and Latency

Finally, if

Differently from what happens with traditional delay analyzers, the use of the ADD-based timing analysis tool has made it possible to compute and store the length of the critical path for each input vector. The availability of the complete timing information regarding the combinational circuit is essential for the realization of the synthesis algorithms described in Section IV. Example 1: Consider the combinational circuit of Fig. 1(a), and assume, for simplicity, the connection delays for a single

of a unit is de?ned as the amount of The throughput computation (i.e., the number of times a new output value is produced) carried out per time unit. For instance, the throughput of a combinational logic circuit with delay of The latency of a digital 15 time units is system is de?ned as the number of clock cycles required for a computation to complete. A ?xed-latency unit with latency clocked with period has constant throughput For variable-latency units, we consider the average latency over a period of time The average throughput In the following sections, we use is simply and the shorthand notation and as opposed to to denote average latency and throughput, respectively.

BENINI et al.: NEW PARADIGM FOR PERFORMANCE OPTIMIZATION OF VLSI DESIGNS

223

(a)

Clearly, Inequality 3 is valid only for since 1 . In order to we have made the assumption that automatically synthesize telescopic units, two problems must be solved. First, we need to compute and synthesize the hold function, a combinational logic function that detects all input patterns that propagate to the outputs with delay larger Second, we must modify the controller of the datathan path where the telescopic unit is instantiated. The modi?ed controller synchronizes the environment with the telescopic The unit by delaying subsequent computations when following two sections deal with these problems in detail. IV. SYNTHESIS ALGORITHMS AND HEURISTICS The computation of the arrival time ADD for a combinational unit allows us to determine all input vectors for which the propagation through the unit will be slower than This information is exploited to synthesize the logic which generates the proper values of the hold output. , the Given the arrival time ADD of output which assumes the value 1 for all BDD for the function is greater the input vectors for which the arrival time of is computed as than the desired cycle time (4) Since we are interested in the set of input conditions for which at least one output of the unit has an arrival time , we have that the hold function can be greater than easily determined as (5) is the total number of outputs. where Clearly, the key issue for making telescopic units usable is implemented by the in practice concerns the way hold circuit. There are three main constraints that the ?nal must satisfy, and thus require particuimplementation of lar consideration during synthesis. They are listed below in decreasing order of importance. must be strictly smaller ? The arrival time of output for any possible input pattern. Otherwise, the than telescopic unit cannot be guaranteed to work correctly. to assume the value 1 must be small ? The probability of enough to guarantee an average throughput improvement, (Inequality 3). that is, ? The area and power of the hold circuit implementing must be kept under control.
1 The extension to L max > 2, not discussed in detail for the sake of clarity, is conceptually straightforward, but more complex to implement. This 1 2 k is because several hold signals fh ; fh ; ;f h are required to make the unit j work correctly. Function fh takes on the value one for all the input patterns that require (j + 1) cycles to complete the execution. The expression for P 3 can then be modi?ed to account for values of T 3 < T =2 as follows:

(b) Fig. 2. (a) A combinational unit and (b) a telescopic unit.

III. TELESCOPIC UNIT ARCHITECTURE Consider the problem of increasing the throughput of a combinational unit, such as the one shown in Fig. 2(a). This can be done by shortening the cycle time of the unit from its One possible way of providing original value to functional correctness is to extend the unit to provide an which is asserted for all input additional output signal time units to propagate to patterns requiring more than the outputs of the block [see Fig. 2(b)]. We call telescopic unit the modi?ed unit, since it may cycles to complete its execution, depending require on the speci?c patterns appearing at its primary inputs. We In this case, consider here the situation in which time units for patterns such the computation completes in and in time units for patterns such that that The average throughput of the original unit is (1) For the telescopic unit, the lower the probability of the to take on the value 1, the larger the overall hold signal throughput improvement. In fact, its average throughput is given by (2) is the probability of the hold signal to be where one. Thus, the use of the telescopic unit is advantageous only and , i.e., when for some values of More speci?cally, we can write the following condition for throughput improvement (3)

111

P

3=

(k + 1)T + 1

k Prob(fh )

1 0 Prob( h +
f T

3+

Prob(fh
kT f

3

k01 )

+
f

3

2 h+

111

k h)

1 1 1 Prob(3 h ) 2
f T

1

j j where Prob(fh ) represents the probability that hold function fh = 1:

224

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 17, NO. 3, MARCH 1998

It should be observed that the ON-set of , as de?ned by (5), contains all and only those input patterns that propagate Hence, any impleto the output with delay larger than mentation of the hold function must cover the ON-set of , but it may also include other input conditions. By enlarging the hold conditions, a faster and smaller hold circuit may be obtained. Functional correctness is preserved, but the average throughput of the telescopic unit (5) may be degraded, because the circuit will hold for some input patterns with propagation Obviously, if some input don’t care delay smaller than conditions are known, they can be pro?tably exploited to without affecting the throughput of the unit. enlarge We have exploited this observation to formulate two heuristic algorithms, described in the next two sections, whose target , such that is to determine an enlarged hold function only marginally degrades, but the implementation of meets the timing constraint and has a limited area. They The ?rst method both start from the BDD representation of generates the hold logic following an iterative paradigm. First, is mapped onto a multiplexor network. Then, the BDD of such network is optimized through traditional logic synthesis techniques. Finally, a check is made to ?nd out if the timing is met. If this is not the case, the constraint is enlarged to obtain by properly removing ON-set of some BDD nodes, and the process is repeated. The second heuristics produces a sum-of-products (SOP) description of directly from the BDD of the initial The ?rst heuristic is fast and works well for many examples, while the second is more computationally intensive and should be used when high-quality results are desired (i.e., maximum throughput improvement and minimum area overhead). Both methods are described in detail in the following sections. A. BDD-Based Heuristics The starting point of this method is the BDD of by (5). We search for a new hold function implementation satis?es the timing constraint as de?ned whose (6) The procedure starts from the conservative assumption that the hold logic will be generated by simply mapping the BDD onto a network of multiplexors [9], [10]. This straightof forward implementation can be obtained from the BDD of in time [9], where is the number of nodes in the BDD on which variable reordering has been applied with the purpose of reducing its size. Obviously, the network obtained by direct BDD mapping is highly unoptimized. Therefore, its performance can be sensibly improved by standard logic optimization. Under the assumption of a multiplexor-based implementation of the hold circuit, the longest path in the BDD gives us an estimate of the critical path for the hold network. Clearly, this is only a ?rst order estimate, since it neglects two factors: the output load on a multiplexor and the load on the control inputs of the multiplexors. If the BDD is very “wide” in the lower levels (i.e., there are many nodes marked with variables which are at the bottom of the global order), the

speed of the multiplexor-based network could be limited by the excessive load on the control inputs of the multiplexors. Similarly, if a node in the BDD is shared by many subtrees, the corresponding multiplexor has a large fan-out and its speed decreases. However, buffering can mitigate the problem and reduce the delay penalty in both situations. Our approach is to focus ?rst on the number of levels of logic in the multiplexorbased network. The algorithm for the constrained generation of consists is traversed and levelized: of two steps. First, the BDD of each node is marked with its level, that is, the length of the longest path between the node and the root of the BDD. Second, the constraint on the maximum number of levels is be the delay of a multiplexor enforced. Let and an input drive , where with a fan-out load of and are two constants representing the expected average load on a multiplexor and the expected driving strength on its inputs. The maximum number of levels of logic allowed in the multiplexor network is given by (7) is a scaling constant that factors the expected effect where of logic synthesis and optimization on the multiplexor network produces conservative results). Starting from the nodes marked with higher level, the BDD are is traversed, and all nodes for which replaced by the constant 1. This operation yields a function Notice that node elimination implies the reduction In of the number of paths in the BDD longer than particular, elimination of a single node may cause a length reduction for an exponential number of paths. We call supersetting the operator that eliminates a given node from a BDD, since it is the dual of the subsetting transformation proposed by Ravi and Somenzi in [11] in the context of reachability analysis of large ?nite-state machines. We have implemented the supersetting operator using a recursive procedure, described in [12], which is structured as most of the basic BDD operators. Supersetting techniques alternative to ours have been proposed in the recent literature. The interested reader may refer to [13] and [10] for more details on this subject. One issue that still needs to be addressed concerns timing constraint violations due to nodes with excessive fan-out or excessively loaded input signals. Supersetting can be exploited again to eliminate such violations. Simple heuristics for marking nodes that would generate heavily loaded multiplexors, or for reducing the load on the inputs have been devised and are not described here because they do not add much to the understanding of the algorithms. In addition, these heuristics have a very marginal impact on the quality of the results, since the modi?cation of the BDD’s for reducing their depth almost always yields implementations that satisfy the timing constraint Fig. 3 outlines the pseudocode of the BDD-based algorithm for the synthesis and optimization of Procedure Bdd2Logic takes, as inputs, the original func, the desired cycle time , and the area bound tion

BENINI et al.: NEW PARADIGM FOR PERFORMANCE OPTIMIZATION OF VLSI DESIGNS

225

Fig. 4. General multilevel implementation of a SOP.

Fig. 3. The Bdd2Logic algorithm.

for the hold circuit, and it returns the hold circuit implementing an which satis?es the required constraints. The size (i.e., the number of nodes) of the BDD for is ?rst reduced through variable reordering, and the corresponding BDD is synthesized as a multiplexer-based logic network and subsequently optimized and mapped. The arrival time ADD for the hold logic is computed. If both the timing and of , the area constraints are met by the implementation such implementation is returned. Otherwise, a modi?cation of is required, following the supersetting paradigm presented earlier in this section. , the BDD for is levelized and After computing supersetting is ?rst used to reduce the depth of the BDD and eventually to eliminate nodes possibly responsible for timing violations due to excessive load. At this point, the whole sequence of operations starts over. Notice that procedure Bdd2Logic is guaranteed to terminate because supersetting eliminates at least one node in the BDD at each iteration. In all the cases we examined, one iteration was suf?cient to ?nd an implementation of that satis?ed the timing constraint In the worst case, the procedure terminates in a number of iterations which is proportional to the size of the BDD of B. SOP-Based Heuristics The main advantage of the BDD-based heuristics for the generation of the hold circuit lies in its well-controlled complexity. If the timing analysis tool can compute the hold require linear function , all steps for the generation of time in the size of its BDD. The major limitation of the BDD-based approach is that the hold circuit is a multiplexor network obtained from the BDD. Since speed is the primary requirement for the hold circuit, we may need to apply logic optimization to obtain a fast implementation. Most speed-up algorithms perform some form of ?attening of the initial speci?cation in order to resynthesize a faster network. Flattening transforms the initial speci?cation into a twolevel sum-of-products form. Unfortunately, when ?attening a multiplexer network, the number of products in the ?attened implementation may be exponentially larger than the number of multiplexers in the initial speci?cation.

This implies an inherent dif?culty in obtaining a fast implementation of the hold circuit, since our BDD-based algorithm produces a small multiplexer network that might be actually very hard to speed-up by logic optimization. In this section, we propose a heuristics for the synthesis of the hold cir, it ?rst produces a cuit that, starting from the BDD of and then resorts to logic sum-of-products description of optimization to obtain a fast and compact gate-level netlist. In other words, we directly generate a ?attened version of from the BDD, without generating the multiplexor network. Since the hold function is subject to the constraint that the delay of its ?nal implementation must be shorter than , the SOP generation procedure is delay-driven. Assume inputs has been speci?ed as that a Boolean function with a SOP, using just the AND, OR, and NOT operators, and product terms. The largest cube (i.e., that it contains speci?ed literals, where product term) has If we neglect for simplicity the effect of the input loads (i.e., we assume in?nite input driving strengths), it is easy to realize that the function can be implemented by a multilevel network having the structure shown in Fig. 4. The delay of such network is given by (8) The ?rst term in the expression accounts for the delay through the balanced tree of two-input OR operators needed cubes. The second term to implement the logic sum of accounts for the delay through the balanced tree of twoinput AND operators needed to implement the cube with the maximum number of speci?ed literals. The last term accounts for the delay of a NOT operator (needed to complement the input variables, if they appear in negative phase in the cubes). represent the delay through the AND, Constants OR, and NOT gates with unit load. We cannot guarantee logarithmic delay for , since it is well known that there exist functions which can be represented only with an exponential number of cubes in SOP form. Fortunately, the hold circuit can implement any function Consequently, we do not generate a cover for the We enumerate hold function, but rather for its complement and we include them in a partial cover. When the cubes of enumeration is stopped (by a stopping criterion discussed later) If we the procedure returns a partial cover of implement the partial cover and we complement its value, we

226

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 17, NO. 3, MARCH 1998

C. Practical Issues An implicit assumption made throughout the paper is that the presence of the hold circuit does not perturb the timing behavior of the original logic of the unit. Unfortunately, in principle this is not true. Although the combinational logic implementing the is never shared with the logic of the original circuit, they are both driven by the same inputs. When we add the hold circuit in the telescopic unit, the load on the ?ip-?ops at the inputs of the stage increases, and the propagation delay increases as well. As a consequence, the timing in the original circuit may change. In particular, paths may become too with propagation delay originally below slow and violate the timing requirements. If this happens, the telescopic unit may malfunction, because the hold circuit is not guaranteed to be active for the paths that have become too slow due to its presence. To tackle this problem we have devised two strategies. The ?rst and more conservative approach speci?es an additional load on the ?ip-?op outputs (i.e., the inputs of the combinational logic) when performing the initial timing analysis of the stage. This can be done by connecting an additional gate (i.e., a buffer) to each ?ip-?op output. In the ?nal implementation of the telescopic unit, the additional gates are the input drivers for the hold circuit. The addition of the drivers allows us to decouple the timing analysis of the combinational logic from the synthesis of the hold circuit. The strategy is conservative since it assumes that an additional driver is needed to drive every input of the hold circuit. A more aggressive alternative assumes that the presence of the hold logic does not sensibly change the timing in the original logic. More speci?cally, it assumes that none of the paths in the combinational block which are not covered by becomes slower than Function is thus synthesized and wired to the original stage in the usual way. Timing analysis is then performed: if some violation is detected (i.e., paths longer in the original logic are activated by input vectors in than , the hold circuit is resynthesized using the OFF-set of The decrease a new, arti?cially reduced equals the maximum violation that occurs in the original circuit when the hold circuit is inserted. The process is iterated until the addition of the hold circuit no longer originates a violation. Although the second alternative may seem more risky and computationally expensive, we have empirically observed that often the insertion of the hold circuit does not create any violation, and the blind addition of buffers may thus be an overkill. In our experiments, we have chosen the second approach with good success. V. CONTROLLER DESIGN In all practical cases, computational units are embedded in a larger system and must be interfaced to the environment in a consistent and correct fashion. In this section, we show how to incrementally modify the controller of a data-path when the latter is transformed into a telescopic unit. For the sake of illustration, consider the following design scenario. The behavioral description of a system is provided by

Fig. 5. The Bdd2Cover algorithm.

obtain an implementation of , thereby achieving our original objective. The pseudocode for the cover generation algorithm is shown in Fig. 5. The procedure receives, as input parameters, the BDD representation of , the bound on the total number of cubes , and the bound on the maximum number of speci?ed The choice of the literals in any considered cube and is driven by the required timing values is split into constraints. More speci?cally, the constraint and two contributions: The coef?cients and represent the fraction of the total time to be spent in computing the logic products and the fraction of the time to be spent in computing the logic sum, because the number of respectively. Usually, cubes is much larger than the number of literals in a cube. , we compute Similarly, From is used to compute The algorithm consists of an outer loop for cube generation. Whenever a newly generated cube is examined, it is inserted in the selected cube list if and only if the number of speci?ed literals (i.e., literals appearing in the cube in either directed or complemented form) is smaller than or equal to Otherwise, the cube is simply dropped. The list is kept in decreasing order: large cubes are inserted at the top of the list (a large cube has a small number of speci?ed literals). If the , the number of cubes in the list becomes larger than cube at the bottom of the list is discarded. Upon completion largest of the loop, the list is returned. It contains the produced during cube enumeration. cubes of It should be observed that procedure Bdd2Cover requires the explicit enumeration of all cubes obtained from the BDD representation. Consequently, the main shortcoming of this simple algorithm is its worst case exponential number of iterations. A straightforward solution is to limit the number of iterations of the foreach loop to a user-de?ned upper bound. In this case, we cannot guarantee that the list will contain the largest cubes in the cover, but only the largest cubes generated during the selected number of iterations. is Given the list of cubes, the ?nal implementation of obtained through logic optimization. The hold circuit is then realized by simply inserting an inverter at the output. Since we start from a SOP speci?cation, the optimization procedures are more effective in ?nding fast implementations with small area, than in the case of multiplexer-based network, as con?rmed by the experimental results.

BENINI et al.: NEW PARADIGM FOR PERFORMANCE OPTIMIZATION OF VLSI DESIGNS

227

the designer. Behavioral synthesis is performed and an initial implementation, consisting of a controller and a data-path, is obtained. The designer (or a high-level design exploration tool) examines the data-path and decides which unit is to be transformed into a telescopic unit. For instance, the slowest unit in the data-path (which dominates the cycle time) can be chosen. Thus, the system can potentially run with a cycle The introduction of a telescopic unit implies time Such signal is an input to the existence of a new signal the controller that must be modi?ed (i.e., resynthesized) to take into account the variable latency of the telescopic unit. Controller resynthesis must satisfy two key requirements: 1) it must guarantee functional correctness; 2) the complete system (i.e., controller and data-path) must run at the new cycle time We brie?y describe a controller transformation procedure for data-paths containing telescopic units. A complete treatment of this topic can be found in [14]. Numerous controller generation algorithms for behavioral synthesis have been presented in the past [15]. Most synthesis techniques generate controller representations in terms of state tables (or equivalent formalisms) of a ?nite-state machine (FSM) model of the control unit. Other techniques [16] generate a network of simple state machines, each controlling a task or a set of concurrent tasks and where transitions are triggered by task completion signals. Since the complement of the hold signal denotes the completion of a computation, such schemes are easily adapted to support telescopic units. Unfortunately, the implementations may be inef?cient in terms of area utilization and may become impractical to control data-paths with large sets of tasks. We consider here the modi?cations needed to be applied to a state table representation of a control unit in order to control telescopic units. Thus, this procedure is compatible with most current behavioral synthesis methodologies. Before describing the procedure in detail, we de?ne two types of controller outputs (also called control signals), namely, the load signals and the steering signals. We call load signals the FSM outputs that control the loading of new values into the data-path registers. A load signal is active when it allows the overwriting of the old data in the register. When the load signal is inactive, the register holds its value. Without loss of generality, we assume that the active value is We call steering signals the FSM outputs that control the multiplexors in the data path. Such multiplexors implement the steering logic: at the input of a unit they are used to select the operands, while at the output, they select where to store the result of a computation. In the state table of the controller, a state is associated with each control step, and the edge leaving the state is labeled with the values of the load and the steering signals. is transformed into a Assume that the ?xed-latency unit telescopic one with hold function If unit is active in state , the controller’s state table is modi?ed applying the following rules. to state ? The unconditional transition from state is now conditioned by the event The output ?elds

(a)

(b) Fig. 6. (a) Fragments of ?xed-latency controller and (b) transformed (variable-latency) controller.

in the transition are left unchanged. In other words, if the telescopic unit requires just one cycle to complete, the system can move to the next control step. is added.2 The FSM transitions from ? A new state to state when The load signals in state the transition are all inactive, while the steering signals to have the same value as in the transition from This transition is taken when the telescopic unit takes one additional cycle to complete. If this is the case, the registers at the inputs and outputs of the unit must not be reloaded because the computation has not terminated. Clearly, the steering signals must not be changed because the input operands of the unit must be held constant. to state ? An unconditional transition from state is added. The outputs for the transition are exactly the to The same as the ones in the transition from additional transition is taken when the computation of the telescopic unit takes two cycles. Notice that the value of does not need to be sampled because, by construction, the unit completes its execution in either one or two cycles, but not more. Example 2: Consider the controller fragment shown in becomes a Fig. 6(a). Assume that a unit scheduled in state telescopic one. The transformation of the controller for state is shown in Fig. 6(b). Output signals ld x1 and ld r1 are load signals, while outputs m1 1, m2 2, m3 1, and m4 1 are steering signals. Observe that one state has been added
2 We are disregarding conditionals for the sake of simplicity. In the presence of conditionals, a state S can have multiple out-going edges. In this case, a new state SHk should be added for each out-going edge k:

228

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 17, NO. 3, MARCH 1998

extension of SIS [18] using CUDD [10] as the underlying BDD/ADD package. Experiments have been run on a DEC AXP 1000/400 with 256 MB of memory. We present two sets of data. The ?rst one concerns the use of telescopic units as a pure throughput optimization technique. The second set shows the applicability of telescopic units for area optimization under throughput constraints.
Fig. 7. Timing diagram of the interaction between fh and the control logic.

Moreover, transitions in the transformed controller depend on , the hold function of the telescopic unit. In the worst case, i.e., when the telescopic unit is active in all control steps, the number of states in the control FSM increases by a factor of two. More in general, if multiple telescopic units are instantiated, the increase in the number of states is exponential in the number of telescopic units. Thus, designs with many telescopic units should adopt a distributed controlgeneration style [16], where the controller is implemented as a network of small interacting FSM’s. When a single telescopic unit is instantiated in the data-path, the complexity increase in the controller is well controlled. The number of states remains linear in the number of control steps, and the number of input signals increases linearly with the number of telescopic units. Hence, the increase in area of the controller is not a serious concern (the total area is still dominated by the data path). Unfortunately, this is not the case for timing. Fig. 7 shows the timing diagram for the interaction between is the time required telescopic unit and controller. Delay by the controller for setting stable values on the input multiplexors of the telescopic unit. Such delay must be taken into is the time account even for ?xed-latency units. Delay to settle. Delay is the time required by required by the load signals of the controller to reach the stable value has settled. The path with delay after is exercised when, for example, the telescopic unit is fed with a pattern that requires two cycles immediately after a pattern requiring a single cycle. Checking for correct timing Obviously, requires to verify that this condition implies tighter timing constraints for the hold circuit: Another important timing-related issue is the presence of glitches on the steering signals when the controller’s FSM state to one of the new states. transitions from a Although the ?nal value of the steering signals is unchanged, a glitch during the transition may cause spurious transitions on the telescopic unit’s inputs while the unit is still completing its computation. Propagation of such spurious transitions may cause an increase in the time needed for the unit’s outputs to settle. Hence, the gate-level implementation of the steering signals should be glitch-free for all transitions from states to states. Glitch-free synthesis techniques [17] can be used to satisfy this requirement. VI. EXPERIMENTAL RESULTS We have implemented procedures Bdd2Logic and Bdd2Cover, as well as the surrounding software as an

A. Throughput Optimization We have considered all large 100 gates) benchmarks combinational multilevel suite [19] (53 in the examples). The circuits have been ?rst optimized for speed using a modi?ed version of the script.delay SIS script, in which the full simplify -l, and sometimes the red removal commands have been removed to allow the optimization to complete on the large examples. Then, they have been mapped for speed with either the map -n1 AFG or the map -m1 command onto a cell library containing inverters, buffers, and two-input NAND and NOR gates. The unit gate delay model has been adopted for the ADD-based timing analysis. 1) BDD-Based Synthesis Procedure: We have run the BDD-based synthesis procedure on the delay-optimized circuits trying to obtain maximum-throughput telescopic units. To accomplish this task we have speci?ed several decreasing , and we have synthesized the hold circuit until values for we have found a value for which a further cycle time reduction caused a decrease in throughput (due to the high probability of the hold function). For 43 examples, the use of telescopic units has produced a substantial throughput improvement. On the other hand, in ?ve cases (circuits C499, i3, i4, i6, and i7), the throughput did not increase. The reason for the failure is due to the delay distribution in the circuits. For example, the critical path delay time units. If we specify and we extract of i6 is , we obtain Finally, in ?ve cases (circuits C1355, C2670, C3540, C6288, and i10), the ADD-based timing analysis did not complete, due to the size of either the BDD’s of the output functions of the unit or the arrival time ADD’s to be constructed. Thus, our tool could not proceed to the generation of Table I reports the data for the 43 examples on which throughput optimization has succeeded. Benchmarks are sorted by increasing size. Columns Circuit, In, Out, Gt, , and give the name, the number of inputs, outputs, and gates, the true delay, and the throughput of the original circuit, respectively. shows the probability of , column Column gives the total number of gates of the telescopic unit, column reports the cycle time at which the telescopic unit is clocked to achieve the increased throughput of column , and tells the arrival time of the hold signal. Columns column and give the percentage of throughput improvement and area overhead (in terms of gates) of the telescopic unit. Finally, column Time reports the central processing unit (CPU) time, in seconds, required to perform the ADD-based timing analysis, as well as the synthesis and the optimization of for a given

BENINI et al.: NEW PARADIGM FOR PERFORMANCE OPTIMIZATION OF VLSI DESIGNS

229

TABLE I THROUGHPUT OPTIMIZATION USING THE BDD-BASED SYNTHESIS PROCEDURE

In all cases, we have obtained a noticeable average throughput increase (27.5% on average) with a limited area overhead (7.7% on average). It is important to observe that the speed optimization for the initial circuits has been pushed all the way to the limit; therefore, the throughput increase achieved on each example can be totally awarded to the use of telescopic units. In some cases the optimization could have been even more aggressive than what we implemented by choosing 2) SOP-Based Synthesis Procedure: In order to check the effectiveness of the SOP-based heuristics, we have chosen those circuits (out of the 43 considered before) where either the throughput improvement was smaller than 10%, or the area penalty was larger than 10%. A total of 16 examples has thus been selected from Table I. The SOP-based procedure has

been run on the reference versions of such examples for heavyduty optimization of Table II compares, for each circuit, the results obtained with the BDD-based and the SOP-based synthesis procedures. Our primary interest was the evaluation of the impact of the SOP-based heuristics on the area of the telescopic units. However, in order to make the comparison of the two heuristics as fair as possible, we have not allowed any throughput degradation with respect to the units obtained through the BDD-based procedure. In addition, for each example, we have to the value used for the BDD-based synthesis. This kept is for better identifying the effects of the synthesis heuristics on the implementation of the hold circuit. The results of the comparison are in favor of the SOP-based approach by an amount which goes beyond our expectations.

230

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 17, NO. 3, MARCH 1998

TABLE II THROUGHPUT OPTIMIZATION USING THE SOP-BASED SYNTHESIS PROCEDURE

In fact, not only the average area overhead has decreased from 13.4 to 10.8%, but a further average throughput increase from 17.0 to 18.8% has been achieved as a by-product. In a few cases, the worst-case delay of the hold circuit has also decreased. As expected, the computation time of the SOP-based heuristics is higher than that of the BDD-based one. Even though in most of the cases the difference in running time is negligible, there are examples where the SOP-based synthesis has required a few minutes to complete. B. Area Optimization For this set of experiments, the initial circuits have been optimized for area by iteratively applying the script.rugged SIS script (without the full simplify -m nocomp) followed by the red removal command (whenever this has been possible), and mapped for area using the map -m0 command onto the usual cell library. Then, a 20% throughput optimization has been targeted using two different approaches: ?rst, by transforming the circuits into telescopic units using the BDD-based heuristics; second, by optimizing the circuits for delay using SIS. Table III reports the experimental data. (Examples are sorted as in Tables I and II). In particular,

columns Original–Gt, Original–T, and Original–P report the number of gates, the cycle time, and the average throughput of the original (minimum area) circuits. Columns Telescopic Unit–Gt and Telescopic Unit– give the number of gates and the percentage of gates overhead required by the telescopic units to produce a 20% throughput increase. Columns Delay show similar data Optimized–Gt and Delay Optimized– for the delay optimized circuits. Finally, the right-most column and losses of the telescopic units indicates area wins over the delay optimized examples. Due to the dif?culty of exactly controlling the throughput increase, a 2.5% slack has been allowed. The symbol — in a column indicates that the desired throughput improvement could not be obtained. This situation has occurred in one case only for the telescopic units (circuit C432) and in 19 cases for the delay optimized circuits. It should be observed that the telescopic units have out-performed the delay optimized circuits, in terms of gate count, in the majority of the cases (36 examples out of 43). Only on benchmark C432 both optimizations failed. On average, the area overhead due to the use of telescopic units has been around 16.5%, while for the delay optimized circuits it has been approximately 30.5%. Such average is obviously computed only for the examples on which both optimizations have succeeded.

BENINI et al.: NEW PARADIGM FOR PERFORMANCE OPTIMIZATION OF VLSI DESIGNS

231

TABLE III AREA OPTIMIZATION UNDER THROUGHPUT CONSTRAINTS

VII. CONCLUSIONS We have presented a technique for the automatic generation of variable-latency, high-performance units that allow us to increase the performance of digital systems beyond the limits achievable by traditional logic optimization. Thanks to symbolic timing analysis, we identify the conditions for which a computation takes longer than the desired cycle time. We then generate a signal which communicates to the environment when the correct result is available. We also automatically modify the controller to properly handle the variable-latency units that are introduced in the system. Experimental results

demonstrate that the technique represents a valuable alternative to existing throughput optimization approaches. ACKNOWLEDGMENT The authors wish to thank I. Bahar for helping them with the ADD-based timing analysis code, and the anonymous DAC-97 reviewers for their useful comments on paper [12]. REFERENCES
[1] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Francisco, CA: Morgan Kaufmann, 1996.

232

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 17, NO. 3, MARCH 1998

[2] S. F. Oberman and M. J. Flynn, “Design issues in division and other ?oating-point operations,” IEEE Trans. Comput., vol. 46, pp. 154–161, Feb. 1997. [3] M. A. Kishinevsky, Concurrent Hardware: The Theory and Practice of Self-Timed Design. New York: Wiley, 1994. [4] R. I. Bahar, H. Cho, G. D. Hachtel, E. Macii, and F. Somenzi, “Timing analysis of combinational circuits using ADD’s,” presented at the EDTC-94: IEEE Eur. Design Test Conf. , pp. 625–629, Paris, France, Feb. 1994. [5] S. Hassoun and C. Ebeling, “Architectural retiming: Pipelining latencyconstrained circuits,” presented at the DAC-33: ACM/IEEE Design Automation Conf., pp. 708–713, Las Vegas, NV, June 1996. [6] S. Nowick, “Design of a low-latency asynchronous adder using speculative completion,” IEE Proc. Comput. Digital Techniques, vol. 143, pp. 301–307, Sept. 1996. [7] R. Bryant, “Graph-based algorithms for Boolean function manipulation,” IEEE Trans. Comput., vol. C-35, pp. 79–85, Aug. 1986. [8] R. I. Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii, A. Pardo, and F. Somenzi, “Algebraic decision diagrams and their applications,” Formal Methods Syst. Design, vol. 10, pp. 171–206, 1997. [9] L. Burgun, N. Dictus, A. Greiner, E. Prado Lopes, and C. Sarwary, “Multilevel optimization of very high complexity circuits,” presented at the EuroDAC-94: IEEE Eur. Design Automation Conf., pp. 14–19, Grenoble, France, Sept. 1994. [10] F. Somenzi, “CUDD: University of Colorado decision diagram package, release 2.1.2,” Dept. ECE, Univ. Colorado, Boulder, Tech. Rep., Apr. 1997. [11] K. Ravi and F. Somenzi, “High-density reachability analysis,” presented at the ICCAD-95: IEEE/ACM Int. Conf. Computer-Aided Design, pp. 154–158, San Jose, CA, Nov. 1995. [12] L. Benini, E. Macii, and M. Poncino, “Telescopic units: Increasing the average throughput of pipelined designs by adaptive latency control,” presented at the DAC-34: ACM/IEEE Design Automation Conf., pp. 22–27, Anaheim, CA, June 1997. [13] T. Shiple, “Formal analysis of synchronous circuits,” Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Univ. California, Berkeley, 1996. [14] L. Benini, E. Macii, and M. Poncino, “Ef?cient controller design for telescopic units,” presented at the ISIS-97: IEEE Int. Conf. Innovative Syst. in Silicon, pp. 290–299, Austin, TX, Oct. 1997. [15] G. De Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, 1994. [16] D. Ku and G. De Micheli, High-Level Synthesis of ASIC’s Under Timing and Synchronization Constraints. Norwell, MA: Kluwer, 1992. [17] L. Lavagno and A. Sangiovanni-Vincentelli, Algorithms for Synthesis and Testing of Asynchronous Circuits. Norwell, MA: Kluwer, 1993. [18] E. M. Sentovich, K. J. Singh, C. W. Moon, H. Savoij, R. K. Brayton, and A. Sangiovanni-Vincentelli, “Sequential circuits design using synthesis and optimization,” presented at the ICCD-92: IEEE Int. Conf. Comput. Design, pp. 328–333, Cambridge, MA, Oct. 1992. [19] S. Yang, “Logic synthesis and optimization benchmarks user guide version 3.0,” MCNC: Microelectronics Center of North Carolina, Research Triangle Park, NC, Tech. Rep., Jan. 1991.

Enrico Macii (M’92) received the Dr.Eng. degree in electrical engineering from Politecnico di Torino, Italy, in 1990, the Dr.Sc. degree in computer science from Universit` di Torino in 1991, and the Ph.D. a degree in computer engineering from Politechnico di Torino in 1995. From May 1991 through August 1991, he was Visiting Faculty at the University of California at Los Angeles, and from September 1991 through September 1994, he was Adjunct Faculty at the University of Colorado at Boulder. Currently he is an Assistant Professor of Electrical and Computer Engineering at Politecnico de Torino. His research interests include synthesis, veri?cation, simulation, and testing of digital circuits and systems.

Massimo Poncino (M’97) received the Dr.Eng. degree in electrical engineering from Politecnico di Torino, Italy, in 1989, and the Ph.D. degree in computer engineering from Politecnico di Torino in 1993. From March 1993 through May 1994, he was Visiting Faculty at the University of Colorado at Boulder. Currently he is an Assistant Professor of Electrical and Computer Engineering at Politecnico di Torino. His research interests include synthesis, veri?cation, simulation, and testing of digital circuits and systems.

Luca Benini received the B.S. degree (summa cum laude) in electrical engineering from the University of Bologna, Italy, in 1991, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, CA, in 1994 and 1997, respectively. Since 1997, he has been a Research Associate in the Department of Electronics and Computer Science at the University of Bologna and a PostDoctoral Fellow in the Department of Electrical Engineering at Stanford University. He also holds a position as a Visiting Scientist at the HewlettPackard Laboratories, Palo Alto, CA. His research interests are in all aspects of computer-aided design of digital circuits, with special emphasis on lowpower applications. Dr. Benini has been a Member of the Technical Program Committee for the 1998 Design and Test in Europe Conference and International Symposium on Low Power Design.

Giovanni De Micheli (F’94) received the Nuclear Engineer degree from Politecnico di Milano, Italy, in 1979, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the University of California at Berkeley in 1980 and 1983. He is Professor of Electrical Engineering, and by courtesy, of Computer Science at Stanford University, CA. Previously, he held positions at the IBM T. J. Watson Research Center, Yorktown Heights, NY, at the Department of Electronics of the Politecnico di Milano, Italy, and at Harris Semiconductor, Melbourne, FL. His research interests include several aspects of the computer-aided design of integrated circuits and systems, with particular emphasis on automated synthesis, optimization, and validation. He is the author of Synthesis and Optimization of Digital Circuits (New York: McGraw-Hill, 1994); coauthor of Dynamic Power Management: Circuit Techniques and CAD Tools (Norwell, MA: Kluwer, 1998) and of High-level Synthesis of ASIC’s under Timing and Synchronization Constraints (Norwell, MA: Kluwer, 1992); and co-editor of Hardware/Software Co-design (Norwell, MA: Kluwer, 1995) and of Design Systems for VLSI Circuits: Logic Synthesis and Silicon Compilation (The Netherlands: Martinus Niijoff Publishers, 1986). He was also codirector of the NATO Advanced Study Institutes on Hardware/Software Co-design, held in Tremeezo, Italy, in 1995, and on Logic Synthesis and Silicon Compilation, held in L’Aquila, Italy, in 1986. Dr. De Micheli was granted a Professional Young Investigator Award in 1988. He received the 1987 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS Best Paper Award and two Best Paper Awards at the Design Automation Conference in 1983 and 1993. He is the Editor-in-Chief of IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS. He was the Program Chair (for Design Tools) of the Design Automation Conference in 1996 and 1997, and he is currently serving in the DAC Executive Committee. He was Program and General Chair of International Conference on Computer Design (ICCD) in 1988 and 1989, respectively.


赞助商链接

更多相关文章:
更多相关标签:

All rights reserved Powered by 甜梦文库 9512.net

copyright ©right 2010-2021。
甜梦文库内容来自网络,如有侵犯请联系客服。zhit325@126.com|网站地图