

REGISTER MINIMIZATION IN COST-OPTIMAL SYNTHESIS OF DSP ARCHITECTURES
Kazuhito Ito Dept. of Elec. and Elect. Syst. Eng. Saitama University Urawa, Saitama 338, Japan Keshab K. Parhi Dept. of Electrical Engineering University of Minnesota Minneapolis, MN 55455

Abstract— In this paper we propose a generalized technique to count the number of registers that supports overlapped scheduling and general digit-serial data formats. This technique is integrated into an integer linear programming model which minimizes the cost of registers as well as the cost of processors and data format converters to synthesize a cost-optimal architecture for a given digital signal processing algorithm. It is shown that including the cost of registers in the synthesis task, as proposed in this paper, leads to up to 12.8% savings in the total cost of the synthesized architecture when compared with synthesis performed without including the register cost in the total cost.

1 Introduction
The objective of high-level synthesis for digital signal processing is to design a hardware architecture for a given digital signal processing algorithm and specified speed constraints [1]. Scheduling is an important design process in high-level synthesis, where time assignment and processor allocation are performed. The time assignment step assigns a starting time to each computational task of the processing algorithm. The processor allocation step assigns each computational task to a processor in the synthesized multiprocessor system. Integer linear programming (ILP) solutions have been used to solve the scheduling problem [2]–[9]. Modeling the scheduling problem as an ILP problem provides flexibility to include new design considerations in scheduling. Most past research efforts in high-level digital signal processing (DSP) synthesis have considered uniform implementation styles such as bit-serial or bit-parallel [10, 11]. In a more useful synthesis environment, it is important to consider a heterogeneous synthesis environment where non-critical operations can be implemented using slower bit-serial processors, critical operations can be implemented using faster bit-parallel processors, and moderately critical operations can be implemented using digit-serial processors [12]. In a heterogeneous synthesis environment, processors of different formats cannot communicate with each other directly, and data format converters must be connected between processors of different formats. Hence, [13] proposed high-level DSP synthesis which allows heterogeneous implementation styles and includes data format converter costs in the total cost. However, [13] did not include the cost of registers in the total cost.
In section 2, techniques to count the number of registers during time-constrained scheduling by ILP are proposed. These techniques support precise calculation of the number of registers in overlapped scheduling [2, 14, 15] and in the case of general digit-serial data formats [12]. In section 3, an ILP model for automatic processor type selection and data format converter insertion [13] is extended to include register minimization to design the


architecture with the lowest combined cost of processors, converters, and registers. In section 4 we present experimental results.

2 Counting the Number of Registers During Scheduling
In [5], a technique was proposed to count the number of registers during resource-constrained scheduling. Since the technique was developed for non-overlapped scheduling, it cannot be directly applied to time-constrained overlapped scheduling. In this section, we generalize the technique for overlapped scheduling. Furthermore, it is extended so that registers of general digit-serial data can be counted.

In this section, $x_{a,j}$ is assumed to be a binary variable, and $x_{a,j} = 1$ indicates that the computation of node $a$ starts at time step $j$. Since the computation of each node has only one starting time, $\sum_{j=LB_a}^{UB_a} x_{a,j} = 1$ holds for every node $a$, where $LB_a$ and $UB_a$ are the lower and upper bounds of the time at which the computation could start. The closed interval $[LB_a, UB_a]$, i.e., the set $\{LB_a, LB_a+1, \ldots, UB_a\}$, is called the scheduling range of node $a$. The computational latency is the difference in time steps from an input of a data to an output of the result associated with that input data. Let $C_a$ denote the computational latency of a processor executing the computation of node $a$. We assume that each processor has registers only for storing intermediate results of pipelined computations. If a processor is not pipelined, then it does not have any register. If the computation of node $a$ starts at time step $j$, its result is stored in a register outside the processor at the end of time step $j + C_a - 1$, and the data can be utilized by other computations at and after time step $j + C_a$. Although this assumption imposes a longer logic-level critical path on the last pipeline stage, it can lead to synthesized systems which use fewer registers and therefore less chip area. With slight modification, our technique could handle processors with registers for the last pipeline stage. In this paper, registers inside processors are not taken into account.
2.1 The technique to count the number of registers
In this section, the technique to count the number of registers proposed in [5] is briefly reviewed. The life-time of a data is defined as the duration from the time step the data is produced to the time step the data is last used. If the life-time of a data contains a time step $j$, the data is said to be live at $j$. The data live at time step $j$ must be stored in a register at $j$. Therefore, the required number of registers at a particular time step is equal to the number of data live at that time step. Let $b$ denote the node which last uses the data output from node $a$. Note that the output data of node $a$ is live at time step $j$ if the computation of node $a$ begins at time step $j - C_a$ or earlier and the computation of node $b$ begins at time step $j$ or later. Whether the data produced by the execution of node $a$ is live at time step $j$ is checked by

$$\sum_{j_a=LB_a}^{j-C_a} x_{a,j_a} \;-\; \sum_{j_a=j-C_a+1}^{UB_a} x_{a,j_a} \;-\; \sum_{j_b=LB_b}^{j-1} x_{b,j_b} \;+\; \sum_{j_b=j}^{UB_b} x_{b,j_b} \;=\; \begin{cases} 2 & \text{if the data is live at } j,\\ 0 & \text{if the data is not live at } j. \end{cases} \tag{1}$$

The left-hand side of (1) never becomes negative as long as the precedence constraint between node $a$ and node $b$ is maintained. By summing the left-hand sides of (1) for all the nodes in the processing algorithm, we get twice the number of data live at time step $j$. Consequently, the required minimum number of registers in the schedule is obtained as the maximum of the number of live data over all the time steps.
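For a fixed schedule, the case analysis behind (1) can be exercised with a short script; the start times, latency, and scheduling ranges below are illustrative:

```python
# Evaluate the left-hand side of (1) for a fixed schedule in which
# x_{a,j'} = 1 only at j' = ja and x_{b,j'} = 1 only at j' = jb.
# All numbers here are illustrative.

def lhs_eq1(j, ja, jb, Ca, LBa, UBa, LBb, UBb):
    total = 0
    for t in range(LBa, UBa + 1):     # node a terms: +1 up to j-Ca, else -1
        if t == ja:
            total += 1 if t <= j - Ca else -1
    for t in range(LBb, UBb + 1):     # node b terms: -1 before j, else +1
        if t == jb:
            total += -1 if t < j else 1
    return total

# The data is produced at ja + Ca and last used at jb, so it is live
# at j exactly when ja + Ca <= j <= jb.
ja, jb, Ca = 0, 5, 2
for j in range(0, 7):
    expected = 2 if ja + Ca <= j <= jb else 0
    assert lhs_eq1(j, ja, jb, Ca, 0, 3, 3, 6) == expected
```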

2.2 The number of registers in overlapped schedule
While non-overlapped scheduling of an iterative processing algorithm derives a schedule where all the computations in the current iteration are executed within one iteration period, the computations of the current iteration are distributed over several iteration periods in overlapped schedules [2, 14, 15]. In this case, the life-time of a data may be longer than the iteration period. In an overlapped schedule with iteration period $T_r$ units of time (u.t.), computations executed at time steps $J$, $J + T_r$, $J + 2T_r$, ... are executed concurrently. For $J = 0, 1, \ldots, T_r - 1$, the set of time steps $J + kT_r$, where $k$ is any integer, is called time class $J$ with respect to the iteration period $T_r$. If the life-time of a data is longer than the iteration period, then it overlaps with itself for some time classes, as shown in Fig. 1. We must use as many registers as the number of overlaps to store such data. Hence, to count the number of registers precisely, the life-time of a data must be examined not only for whether it contains a particular time class but also for how many times it contains that time class.

Fig. 1. Register usage in an overlapped schedule. (a) The life-time is longer than the iteration period Tr. Then, more than one register is used at the time class J, as shown in (b).

Fig. 2. Dividing time steps into groups and assignment of coefficients. (a) For non-overlapped scheduling. (b) For overlapped scheduling.

In the technique to count the number of live data described in [5], time steps are divided at time step $j$ into two domains. For node $a$, the variables $x_{a,j_a}$ are accumulated into (1) with a coefficient $+1$ for $j_a \le j - C_a$ and $-1$ for $j_a > j - C_a$. For the immediate successor node $b$, the variables $x_{b,j_b}$ are accumulated into (1) with a coefficient $-1$ for $j_b < j$ and $+1$ for $j_b \ge j$. This is illustrated in Fig. 2(a). Time steps can also be divided into more than two domains if necessary. We divide time steps at time class $J$ as illustrated in Fig. 2(b). Then, we associate the coefficient $1 - 2k$ with the variables $x_{a,j_a}$ where $J + (k-1)T_r - C_a < j_a \le J + kT_r - C_a$, and the coefficient $-1 + 2k$ with the variables $x_{b,j_b}$ where $J + (k-1)T_r \le j_b < J + kT_r$, and accumulate the variables to derive the following equation

$$\sum_{k=\underline{k}_a}^{\overline{k}_a}\left\{\;\sum_{j_a=J+(k-1)T_r-C_a+1}^{J+kT_r-C_a} (-2k+1)\,x_{a,j_a} \;+\; \sum_{j_b=J+(k-1)T_r}^{J+kT_r-1} (2k-1)\,x_{b,j_b}\right\} = 2M_R^a(J), \tag{2}$$

where $M_R^a(J)$ is the number of live data of node $a$ at time class $J$, and the lower and upper bounds of $k$, $\underline{k}_a$ and $\overline{k}_a$, respectively, are chosen so that we include every time step in the scheduling ranges of node $a$ and its immediate successor node $b$. These bounds are calculated as

$$\underline{k}_a = \min\left\{\left\lceil\frac{LB_a+C_a-J}{T_r}\right\rceil,\;\left\lceil\frac{LB_b-T_r+2-J}{T_r}\right\rceil\right\}, \tag{3}$$

$$\overline{k}_a = \max\left\{\left\lceil\frac{UB_a+C_a+T_r-1-J}{T_r}\right\rceil,\;\left\lceil\frac{UB_b+1-J}{T_r}\right\rceil\right\}. \tag{4}$$

Moreover, in overlapped scheduling, the number of delays on each edge is considered in the precedence constraints. If the number of delays on the edge $e = (a, b)$ is $W_e$, then the data output from node $a$ is used by node $b$ after $W_e$ iterations. In other words, the life-time of the data contains a particular time class another $W_e$ times. Therefore, equation (2) is modified by taking the delays into account as follows:
$$\sum_{k=\underline{k}_a}^{\overline{k}_a}\left\{\;\sum_{j_a=J+(k-1)T_r-C_a+1}^{J+kT_r-C_a} (-2k+1)\,x_{a,j_a} \;+\; \sum_{j_b=J+(k-1)T_r}^{J+kT_r-1} \bigl(2(k+W_e)-1\bigr)\,x_{b,j_b}\right\} = 2M_R^a(J). \tag{5}$$

By summing the left-hand sides of (5) for all the nodes in the processing algorithm, we get twice the number of live data at time class $J$. The minimum number of registers is obtained as the maximum of the number of live data over all time classes.
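The per-class count that (5) encodes can be cross-checked against a direct enumeration of the life-time $[j_a + C_a,\; j_b + W_e T_r]$. The schedule below is illustrative; the $k$ bounds are computed as in (3)–(4) with the scheduling ranges collapsed to the fixed start times:

```python
import math

# Sketch: compare the coefficient form of (5) with a direct count of
# time-class coverage for one edge (a, b).  Start times are examples.

def direct_count(ja, jb, Ca, We, Tr, J):
    """Number of time steps t == J (mod Tr) inside [ja+Ca, jb+We*Tr]."""
    return sum(1 for t in range(ja + Ca, jb + We * Tr + 1) if t % Tr == J)

def lhs_eq5(ja, jb, Ca, We, Tr, J, k_lo, k_hi):
    total = 0
    for k in range(k_lo, k_hi + 1):
        # a-side window, coefficient (1 - 2k)
        if J + (k - 1) * Tr - Ca + 1 <= ja <= J + k * Tr - Ca:
            total += 1 - 2 * k
        # b-side window, coefficient (2(k + We) - 1)
        if J + (k - 1) * Tr <= jb <= J + k * Tr - 1:
            total += 2 * (k + We) - 1
    return total

ja, jb, Ca, We, Tr = 0, 5, 2, 1, 3
for J in range(Tr):
    # k bounds in the spirit of (3)-(4), with LB = UB = fixed start time
    k_lo = min(math.ceil((ja + Ca - J) / Tr), math.ceil((jb - Tr + 2 - J) / Tr))
    k_hi = max(math.ceil((ja + Ca + Tr - 1 - J) / Tr), math.ceil((jb + 1 - J) / Tr))
    assert lhs_eq5(ja, jb, Ca, We, Tr, J, k_lo, k_hi) == \
        2 * direct_count(ja, jb, Ca, We, Tr, J)
```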

2.3 Registers for digit-serial data architecture
Digit-serial architecture is used where the inexpensive bit-serial architecture is too slow and the expensive bit-parallel architecture is faster than necessary [12]. The number of bits processed per cycle is referred to as the digit-size. If the word-length is $w$ bits and the digit-size is $d$ bits, then one word of data consists of $w/d$ digits. We consider the case where $w$ is a multiple of $d$. Let $n$ denote the number of digits in a word, i.e., $n = w/d$. If the first digit of a data is input to node $a$ at time step $j_0$, then the second digit is input at $j_0 + 1$, and the last digit is input at $j_0 + n - 1$. The computational latency $C_a$ of node $a$ is the time difference from the input of the $i$-th digit to the output of the $i$-th digit. A digit-serial register is a set of $d$ 1-bit registers. One digit-serial register stores one digit at a time. Generally, in the case of overlapped scheduling, the number of digit-serial registers, i.e., the number of live digits of the data output by node $a$ at time class $J$ ($J = 0, 1, \ldots, T_r - 1$), is calculated as

$$\begin{aligned}
\sum_{k=\underline{k}_a}^{\overline{k}_a}\Biggl\{&\sum_{p=0}^{n-\bar{l}_n T_r-1}\bigl(-2nk+2p+1+2(p+1)\bar{l}_n\bigr)\,x_{a,\,J+kT_r-C_a-p}\\
+&\sum_{p=n-\bar{l}_n T_r}^{T_r-1}\bigl(-2nk+2(n-\bar{l}_n T_r)-1+2(p+1)l_n\bigr)\,x_{a,\,J+kT_r-C_a-p}\\
+&\sum_{p=1}^{n-l_n T_r}\bigl(2n(k+W_e)-2p+1-2p\,l_n\bigr)\,x_{b,\,J+kT_r-p}\\
+&\sum_{p=0}^{(l_n+1)T_r-n-1}\bigl(2n(k+W_e)+1+2p\,l_n\bigr)\,x_{b,\,J+kT_r+p}\Biggr\} = 2M_R^a(J),
\end{aligned}\tag{6}$$

where

$$W_e = \text{the number of delays on the edge } e=(a,b),\qquad l_n = \left\lfloor\frac{n}{T_r}\right\rfloor,\qquad \bar{l}_n = \left\lfloor\frac{n-1}{T_r}\right\rfloor.$$

Fig. 3. Examples of time assignment and life-time of digits (n = 5).
$$\underline{k}_a = \min\left\{\left\lceil\frac{LB_a+C_a-J}{T_r}\right\rceil,\;\left\lceil\frac{LB_b+n-T_r+1-J}{T_r}\right\rceil\right\},$$

$$\overline{k}_a = \max\left\{\left\lceil\frac{UB_a+C_a+T_r-1-J}{T_r}\right\rceil,\;\left\lceil\frac{UB_b+n-J}{T_r}\right\rceil\right\}.$$

Example: Fig. 3 shows the results of the time assignment of nodes $a$ and $b$ for the iteration period $T_r = 3$. We assume that $n = 5$, $C_a = 2$, and $W_e = 0$. In the time assignment shown in Fig. 3, node $a$ and node $b$ are assigned time steps 0 and 5, respectively. Seven digit-serial registers must be used at time class 0, since time step 3 is contained in two digit life-times, time step 6 in four, and time step 9 in one. The first term of the left-hand side of (6) becomes $-3$ when $k = 1$ and $p = 1$, and the third term of the left-hand side of (6) becomes $17$ when $k = 2$ and $p = 1$. Therefore, in this case, the left-hand side of (6) is $14$, which is equal to twice the number of required registers.
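The worked example can be checked mechanically. In the sketch below, the coefficient and summation-range forms are assumptions consistent with the example's stated values ($-3$, $17$, and a total of $14$); the direct count enumerates each digit's life-time:

```python
# Sanity check of (6) on the worked example (Tr = 3, n = 5, Ca = 2,
# We = 0, node a at step 0, node b at step 5).

def live_digit_count(ja, jb, Ca, n, We, Tr, J):
    """Direct count: digit i is live over [ja+Ca+i, jb+We*Tr+i]."""
    return sum(1 for i in range(n)
                 for t in range(ja + Ca + i, jb + We * Tr + i + 1)
                 if t % Tr == J)

def lhs_eq6(ja, jb, Ca, n, We, Tr, J, k_lo, k_hi):
    l, lb = n // Tr, (n - 1) // Tr            # l_n and \bar{l}_n
    total = 0
    for k in range(k_lo, k_hi + 1):
        for p in range(0, n - lb * Tr):       # first sum
            if J + k * Tr - Ca - p == ja:
                total += -2 * n * k + 2 * p + 1 + 2 * (p + 1) * lb
        for p in range(n - lb * Tr, Tr):      # second sum
            if J + k * Tr - Ca - p == ja:
                total += -2 * n * k + 2 * (n - lb * Tr) - 1 + 2 * (p + 1) * l
        for p in range(1, n - l * Tr + 1):    # third sum
            if J + k * Tr - p == jb:
                total += 2 * n * (k + We) - 2 * p + 1 - 2 * p * l
        for p in range(0, (l + 1) * Tr - n):  # fourth sum
            if J + k * Tr + p == jb:
                total += 2 * n * (k + We) + 1 + 2 * p * l
    return total

assert live_digit_count(0, 5, 2, 5, 0, 3, 0) == 7   # seven live digits
assert lhs_eq6(0, 5, 2, 5, 0, 3, 0, 1, 4) == 2 * 7  # LHS of (6) = 14
```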


3 Register Minimization in Architectures with Multiple Data Formats
Generally, using slower but less expensive processors for the computations which do not require fast execution may result in a system with lower cost. If both bit-parallel and bit-serial processors are used in a system and these processors must communicate with each other, then we must use a data format converter which converts the data format from bit-parallel to bit-serial and vice versa. In [13], an ILP model was proposed to synthesize an architecture with minimized total cost of processors and data format converters by optimally selecting processor types and inserting the necessary data format converters. In this section, the ILP model is extended to calculate the cost of registers precisely and to minimize the total cost of processors, converters, and registers. The following terminology is used [13]: $PROC$ is the library of available processors. $C_t$ is the computational latency of a processor of type $t$. $F_i$ denotes the set of processor types capable of executing node $i$. $FORM$ is the set of input and output formats of all the processors. $I(t)$ and $O(t)$ are respectively the input and output data formats of processor $t$. The binary variable $x_{i,j,t} = 1$ means that node $i$ starts at time step $j$ on a processor of type $t$. $CONV$ is the library of available converters. $v_{qf} \in CONV$ denotes a data format converter which converts data from format $q$ to format $f$. The binary variable $y_{i,j,v} = 1$ means that a data format converter of type $v$ is used and the conversion for the output data of node $i$ starts at time step $j$. $R_a$ and $R_a^v$ denote the scheduling ranges of the computation of node $a$ and of the data format conversion of type $v$ for the data of node $a$, respectively.


3.1 Counting the number of registers during the time assignment
To calculate the cost of registers, we must know the exact number of registers, $M_f$, of each data format $f$. For an edge $(a, b)$, registers of data format $f$ are used if either node $a$ or node $b$ is assigned to a processor of data format $f$, and no register of data format $f$ is used otherwise. Which case occurs depends on the processor type selection and cannot be known prior to solving the ILP model. Moreover, in the case where node $a$ has more than one immediate successor node, we cannot know which immediate successor node last uses the format $f$ version of the data output from node $a$ until the time assignment and processor type selection have been determined. Therefore, new binary variables $g_{a,j,f}$ are introduced, and $g_{a,j,f} = 1$ means that the format $f$ version of the output data of node $a$ is last used at time step $j$. The data is last used either by an immediate successor node assigned to a processor of data format $f$ or by a data format converter $v_{fh}$ inserted to convert the data of node $a$ from format $f$ to some format $h$. If such a format $f$ version of the data is not used, then all the variables $g_{a,j,f}$ are 0. To compute the value of $g_{a,j,f}$, we use the following inequalities

$$\sum_{j\in R_a^f} j\cdot g_{a,j,f} \;\ge\; \sum_{\substack{t_b\in F_b\\ I(t_b)=f}}\;\sum_{j\in R_b} (j+W_e T_r)\,x_{b,j,t_b}, \qquad \forall a\in N,\; e=(a,b)\in E, \tag{7}$$

$$\sum_{j\in R_a^f} j\cdot g_{a,j,f} \;\ge\; \sum_{\substack{v_{f_1 h}\\ f_1=f}}\;\sum_{j\in R_a^{v_{f_1 h}}} j\cdot y_{a,j,v_{f_1 h}}, \qquad \forall a\in N, \tag{8}$$

$$\sum_{j\in R_a^f} g_{a,j,f} \;\le\; 1, \qquad \forall a\in N, \tag{9}$$

where $N$ is the set of nodes and $R_a^f$ is the union of the scheduling range $R_a$ and the scheduling ranges $R_a^{v_{fh}}$. Then the required number of format $f$ registers for the output data of node $a$ at time class $J$, $M_R^a(f, J)$, is calculated as

$$\begin{aligned}
\sum_{k=\underline{k}_a}^{\overline{k}_a}\Biggl\{&\sum_{p=0}^{n_f-\bar{l}_{n_f}T_r-1}\bigl(P_1(k,p,n_f)-S_a^f\bigr)\Bigl(\sum_{\substack{t_a\in F_a\\ O(t_a)=f}} x_{a,\,J+kT_r-C_{t_a}-p,\,t_a}+\sum_{\substack{v_{qf_1}\\ f_1=f}} y_{a,\,J+kT_r-C_{v_{qf_1}}-p,\,v_{qf_1}}\Bigr)\\
+&\sum_{p=n_f-\bar{l}_{n_f}T_r}^{T_r-1}\bigl(P_2(k,p,n_f)-S_a^f\bigr)\Bigl(\sum_{\substack{t_a\in F_a\\ O(t_a)=f}} x_{a,\,J+kT_r-C_{t_a}-p,\,t_a}+\sum_{\substack{v_{qf_1}\\ f_1=f}} y_{a,\,J+kT_r-C_{v_{qf_1}}-p,\,v_{qf_1}}\Bigr)\\
+&\sum_{p=1}^{n_f-l_{n_f}T_r}\bigl(P_3^e(k,p,n_f)+S_a^f\bigr)\,g_{a,\,J+kT_r-p,\,f}
+\sum_{p=0}^{(l_{n_f}+1)T_r-n_f-1}\bigl(P_4^e(k,p,n_f)+S_a^f\bigr)\,g_{a,\,J+kT_r+p,\,f}\Biggr\}\\
&= 2M_R^a(f,J),
\end{aligned}\tag{10}$$

where $n_f$ is the number of digits of data format $f$, and the parameters $P_1(k,p,n)$, $P_2(k,p,n)$, $P_3^e(k,p,n)$, and $P_4^e(k,p,n)$ are defined as follows:

$$P_1(k,p,n) = -2nk + 2p + 1 + 2(p+1)\bar{l}_n, \tag{11}$$

$$P_2(k,p,n) = -2nk + 2(n-\bar{l}_n T_r) - 1 + 2(p+1)l_n, \tag{12}$$

$$P_3^e(k,p,n) = 2n(k+W_e) - 2p + 1 - 2p\,l_n, \tag{13}$$

$$P_4^e(k,p,n) = 2n(k+W_e) + 1 + 2p\,l_n. \tag{14}$$

Fig. 4. A data flow graph of a biquad filter.

The upper and lower bounds, $\underline{k}_a$ and $\overline{k}_a$, are calculated so that we include every time step in the scheduling ranges of node $a$, all the immediate successor nodes of node $a$, and the data format converters used for node $a$. The integer constants $S_a^f$ in (10) cancel each other; thus, the value of $S_a^f$ can be chosen freely. We have chosen such a value that every coefficient $P_3^e(k,p,n_f) + S_a^f$ or $P_4^e(k,p,n_f) + S_a^f$ is positive, so that the variables $g_{a,j,f}$ do not unnecessarily become 1. On the other hand, the coefficients $P_1(k,p,n_f) - S_a^f$ and $P_2(k,p,n_f) - S_a^f$ may be negative. In that case, binary variables $y_{a,j,v_{qf}}$ may unnecessarily become 1 to falsely reduce the value of the left-hand side of (10) and give an incorrect number of registers. To prevent this, we must use the constraints

$$\sum_{v_{qr}}\sum_{j\in R_a^{v_{qr}}} y_{a,j,v_{qr}} \;\le\; \sum_{v_{rf}}\sum_{j\in R_a^{v_{rf}}} y_{a,j,v_{rf}} \;+\; \sum_{b:(a,b)\in E}\;\sum_{\substack{t_b\in F_b\\ I(t_b)=r}}\;\sum_{j_b\in R_b} x_{b,j_b,t_b}, \qquad \forall a\in N, \tag{15}$$

$$\sum_{\substack{t_b\in F_b\\ I(t_b)=r}}\;\sum_{j_b\in R_b} x_{b,j_b,t_b} \;\le\; \sum_{\substack{t_a\in F_a\\ O(t_a)=r}}\;\sum_{j_a\in R_a} x_{a,j_a,t_a} \;+\; \sum_{v_{qr}}\sum_{j\in R_a^{v_{qr}}} y_{a,j,v_{qr}}, \qquad \forall (a,b)\in E, \tag{16}$$

so that converters are not used unnecessarily. Note that the constraints (15) and (16) do not eliminate any assignment possibility. The ILP model to synthesize the architecture with the lowest cost of processors, converters, and registers minimizes the cost (17), subject to the constraints (15), (16), and (18) as well as the constraints given in [13]. Here, $M_t$, $M_v$, and $M_f$ are respectively the number of processors of type $t$, the number of converters of type $v$, and the number of format $f$ registers; $m_t$, $m_v$, and $m_f$ are respectively the cost of a processor of type $t$, the cost of a converter of type $v$, and the cost of a format $f$ register.

$$\text{Minimize } COST = \sum_{t\in PROC} m_t M_t + \sum_{v\in CONV} m_v M_v + \sum_{f\in FORM} m_f M_f, \tag{17}$$

$$\sum_{a\in N} 2M_R^a(f,J) \;\le\; 2M_f, \qquad \forall f\in FORM,\; J=0,1,\ldots,T_r-1. \tag{18}$$
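Outside the ILP, the quantity that the $g$ variables of (7)–(9) encode is easy to state: once assignments are fixed, the last use of the format $f$ version of a node's output is the latest start time among its format $f$ consumers. The helper below is a hypothetical illustration of that semantics, not part of the model:

```python
# Hypothetical post-hoc computation of the last-use step encoded by
# g_{a,j,f}: the latest start time among format-f consumers (successor
# processors with input format f, or converters reading format f).

def last_use_step(fmt, successor_uses, converter_uses):
    """successor_uses / converter_uses: lists of (start_step, format)
    pairs for the data of one node; returns None if format fmt is
    never consumed (i.e., all g_{a,j,fmt} = 0)."""
    steps = [j for j, f in successor_uses + converter_uses if f == fmt]
    return max(steps) if steps else None

# An hp-format result consumed only by an hp->bp converter at step 3:
assert last_use_step("hp", [(4, "bp")], [(3, "hp")]) == 3
assert last_use_step("ds", [(4, "bp")], [(3, "hp")]) is None
```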

4 Experimental results
In this experiment, the validity of the ILP model that minimizes the cost of registers as well as the cost of processors and converters is confirmed. All the ILP models are solved by the ILP solver GAMS/OSL [16] running on a SparcStation 20. We use the DFG of a biquad filter illustrated in Fig. 4. Table 1 shows a library of available processor types. In Table 1, the computational latency $C$, the pipeline period $L$ (the minimum time between successive computations), the input and output data formats $I$ and $O$, and the cost $m$ are shown for each processor type. Adder A1 and multiplier M3 input and output data of format bp, and adder A2 and multiplier M4 input and output data of format hp. These formats bp and hp denote the bit-parallel and the half-word parallel data formats, respectively. The half-word parallel data format is the digit-serial format where the digit-size is half the word-length.

Table 1  Processor specifications

  type   C   L   I    O    m
  A1     1   1   bp   bp   10
  A2     1   2   hp   hp    5
  M3     2   2   bp   bp   50
  M4     3   3   hp   hp   25

Table 2  Data format converter specifications

  type       conversion   C   L   m
  v_{bp,hp}   bp -> hp    0   1   1
  v_{hp,bp}   hp -> bp    1   1   1

For the DFG of the biquad filter shown in Fig. 4, nodes 1, 2, 3, and 4 can be assigned to either adder A1 or A2 in Table 1. Similarly, nodes 5, 6, 7, and 8 can be assigned to either multiplier M3 or M4 in Table 1. Furthermore, to support data format conversion, a library of data format converters as shown in Table 2 is prepared for the given library of processors. Each of the data format converters is classified according to its conversion type, its conversion latency $C$, its pipeline period $L$, and its cost $m$. We choose the costs of a register as $m_{bp} = 2$ and $m_{hp} = 1$.

Fig. 5. Time assignment result obtained by solving the ILP model. (a) The assignment of nodes to processor types. (b) Time chart of the time assignment and life-times of data.

Fig. 5 shows a time assignment result for the iteration period $T_r = 3$ obtained by solving the complete ILP model with register minimization. As shown in Fig. 5(a), nodes 1, 2, and 4 are assigned to the A1 adder, node 3 to the A2 adder, nodes 5, 7, and 8 to M3 multipliers, and node 6 to the M4 multiplier. Boxes are then inserted to represent data format converters. In Fig. 5(b), a box represents either the computation of a node or a data format conversion. An arrow represents the life-time of a data in the case of format bp, or of a digit in the case of format hp. For example, the computation of node 6 starts at time step 0 and its result is output at time step 3, since the computational latency of the M4 multiplier is 3. The first digit of the result is stored in a register of format hp at time step 3 and input to a data format converter of type $v_{hp,bp}$ which is executed at time step 3 (represented by a half-shaded box with '6' inside). The second digit is output at time step 4 and input to the data format converter.
The converted data (bp format) is immediately output from the converter and then input to node 1 at time step 4. In this case, 1 A1 adder, 1 A2 adder, 2 M3 multipliers, 1 M4 multiplier, 1 $v_{bp,hp}$ converter, 1 $v_{hp,bp}$ converter, 4 bp registers, and 1 hp register are used in this architecture, for a total cost of 151 units. The CPU time for the solution is 5.5 seconds.
For more practical results, we have synthesized architectures for some benchmark data-flow graphs. In this case, we assume the library of processors and the library of converters shown in Tables 3 and 4. Table 5 shows the specification of the register of each data format; in this table, $n$ is the number of digits of one word and $m$ is the cost of one register of each data format. Table 6 shows the specified iteration period $T_r$, the synthesized architecture, the number of registers, the total cost, and the CPU time in seconds to solve the ILP model for each data-flow

Table 3  Library of processor types (word-length = 16)

  type   processor                       C   L   I    O     m
  A_bp   bit-parallel adder              1   1   bp   bp    53
  A_hp   half-word parallel adder        1   2   hp   hp    19
  A_ds   4-bit digit-serial adder        1   4   ds   ds     6
  M_bp   bit-parallel multiplier         5   1   bp   bp   331
  M_hp   half-word parallel multiplier   6   2   hp   hp   173
  M_ds   4-bit digit-serial multiplier   9   5   ds   ds    86

Table 4  Converter types

  type        C   L   m
  v_{bp,hp}   0   1   3
  v_{bp,ds}   0   3   4
  v_{hp,bp}   1   1   3
  v_{hp,ds}   0   2   3
  v_{ds,bp}   3   3   4
  v_{ds,hp}   2   2   3

Table 5  Registers

  f    n   m
  bp   1   8
  hp   2   4
  ds   4   2

Table 6  Time assignment of benchmarks

                                                        with register minimization      w/o reg. min.
  Tr   Lowest cost architecture                         Registers           Cost   CPU    Cost   CPU
  4th Order Lattice Filter
  14   3Abp, 2Mbp                                       5Rbp                 861   0.87    869   1.58
  15   2Abp, Mbp                                        5Rbp                 477   2.23    485   3.15
  16   Abp, Ahp, Mbp, vbp,hp, vhp,bp                    3Rbp, 4Rhp           449   13.5    461   18.0
  17   Abp, Mbp                                         6Rbp                 432   14.1    460   21.2
  18   Ahp, Ads, Mhp, vhp,ds, vds,hp                    7Rhp, 4Rds           259   17.7    283   11.9
  5th Order Elliptic Wave Filter
  25   3Abp, Mbp                                        9Rbp                 562   1.68    562   3.16
  26   2Abp, Mbp                                        9Rbp                 509   16.0    509   26.2
  27   Abp, 2Ahp, Mbp, vbp,hp, vhp,bp                   6Rbp, 7Rhp           504   447     520   658
  4th Order Jaumann Filter
  16   2Abp, Mbp                                        6Rbp                 485   5.07    493   14.9
  17   Abp, Mbp                                         6Rbp                 432   9.92    448   14.3
  18   Abp, Mbp                                         6Rbp                 432   21.1    448   39.9
  19   2Ahp, Mhp                                        13Rhp                263   40.9    267   24.7
  20   2Ahp, Mhp                                        13Rhp                263   152     271   48.6
  4-stage Pipelined Lattice Filter
   3   2Abp, 8Ads, 5Mbp, 7vbp,ds                        11Rbp, 8Rds         1941   650    2155   23.6
   4   Ahp, 9Ads, 4Mbp, 2vbp,hp, 6vbp,ds, vhp,bp,       8Rbp, 2Rhp, 8Rds    1528   1611   1732   58.2
       vds,bp
   5   9Ads, 3Mbp, 9vbp,ds, vds,bp                      6Rbp, 24Rds         1187   122    1361   40.6
  16 Point FIR Filter
   1   60Ads, 8Mbp, 24vbp,ds, 24vds,bp                  8Rbp, 56Rds         3376   0.99   3376   3.53
   2   30Ads, 4Mbp, 12vbp,ds, 12vds,bp                  4Rbp, 30Rds         1692   1.78   1756   5.65
   3   20Ads, 3Mbp, 8vbp,ds, 8vds,bp                    3Rbp, 22Rds         1245   14.9   1337   7.85

graph. The maximum numbers of variables and constraints among these ILP models are 367 and 586, respectively. The rightmost two columns of Table 6 give the cost and CPU time (in seconds) of the synthesis results where only the cost of processors and data format converters is minimized [13] and the number of registers is calculated afterwards by analyzing the schedule. In all cases, the proposed ILP model with register minimization synthesizes a less or equally expensive architecture compared to the architecture synthesized by the ILP model without register minimization. For the 4-stage pipelined lattice filter with an iteration period of 5 u.t., the register minimization approach reduces the cost of the synthesized architecture from 1361 to 1187, which corresponds to a saving of 12.8%.
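The cost accounting of (17) for the biquad architecture of Fig. 5 can be re-derived from Tables 1 and 2 and the chosen register costs:

```python
# Recompute COST in (17) for the synthesized biquad architecture:
# processors and converters priced per Tables 1-2, registers at
# m_bp = 2 and m_hp = 1 as chosen in the text.

cost = {"A1": 10, "A2": 5, "M3": 50, "M4": 25,
        "v_bp_hp": 1, "v_hp_bp": 1, "R_bp": 2, "R_hp": 1}
used = {"A1": 1, "A2": 1, "M3": 2, "M4": 1,
        "v_bp_hp": 1, "v_hp_bp": 1, "R_bp": 4, "R_hp": 1}

total = sum(cost[r] * n for r, n in used.items())
assert total == 151  # matches the reported architecture cost
```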

5 Conclusion
In this paper we generalized the technique to count the number of registers during ILP-based scheduling to support overlapped scheduling and registers of digit-serial data formats. This technique was combined with automatic processor type selection and data format converter insertion in the ILP model for time assignment in time-constrained scheduling, so that the hardware architecture is synthesized with the lowest combined cost of processors, converters, and registers. Several practical benchmarks were applied to our ILP model, and the lowest-cost architectures were synthesized.

Acknowledgement
This research was supported by the Advanced Research Projects Agency and monitored by the Solid State Electronics Directorate of Wright-Patterson AFB under contract number F33615-93-C-1309.

References
[1] M. C. McFarland, A. C. Parker, and R. Camposano, "The High-Level Synthesis of Digital Systems," Proc. of the IEEE, vol. 78, pp. 301–318, Feb. 1990.
[2] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu, "A Formal Approach to the Scheduling Problem in High Level Synthesis," IEEE Trans. Computer-Aided Design, vol. CAD-10, pp. 464–475, Apr. 1991.
[3] C. H. Gebotys and M. I. Elmasry, "Optimal Synthesis of High-Performance Architectures," IEEE Journal of Solid-State Circuits, vol. 27, pp. 389–397, Mar. 1992.
[4] C. H. Gebotys, "Synthesis of Throughput-Optimized Multichip Architectures," in Proc. IEEE Custom Integrated Circuits Conf., San Diego, pp. 5.2.1–5.2.4, May 1993.
[5] C. H. Gebotys and M. I. Elmasry, "Global Optimization Approach for Architecture Synthesis," IEEE Trans. Computer-Aided Design, vol. CAD-12, pp. 1266–1278, Sept. 1993.
[6] C.-T. Hwang and Y.-C. Hsu, "Zone Scheduling," IEEE Trans. Computer-Aided Design, vol. CAD-12, pp. 926–934, July 1993.
[7] L. E. Lucke and K. K. Parhi, "Generalized ILP Scheduling and Allocation for High-Level DSP Synthesis," in Proc. IEEE Custom Integrated Circuits Conf., San Diego, pp. 5.4.1–5.4.4, May 1993.
[8] A. Bachmann, M. Schöbinger, and L. Thiele, "Synthesis Methods for Domain Specific Multiprocessor Systems including Memory Design," in VLSI Signal Processing, VI, pp. 417–425, 1993.
[9] S. Chaudhuri and R. A. Walker, "ILP-Based Scheduling with Time and Resource Constraints in High Level Synthesis," in Proc. IEEE 7th Int. Conf. VLSI Design, pp. 17–20, Jan. 1994.
[10] R. I. Hartley and J. R. Jasica, "Behavioral to Structural Translation in a Bit-Serial Silicon Compiler," IEEE Trans. Computer-Aided Design, vol. CAD-7, pp. 877–886, Aug. 1988.
[11] H. De Man, F. Catthoor, et al., "Architecture Driven Synthesis Techniques for VLSI Implementation of DSP Algorithms," Proceedings of the IEEE, vol. 78, pp. 319–335, Feb. 1990.
[12] K. K. Parhi, "A Systematic Approach for Design of Digit-Serial Processing Architectures," IEEE Trans. Circuits Syst., vol. CAS-38, pp. 358–375, Apr. 1991.
[13] K. Ito, L. E. Lucke, and K. K. Parhi, "Module Selection and Data Format Conversion for Cost-Optimal DSP Synthesis," in Proc. ACM/IEEE Int. Conf. on Computer-Aided Design, San Jose, pp. 322–329, Nov. 1994.
[14] P. G. Paulin and J. P. Knight, "Force-Directed Scheduling for the Behavioral Synthesis of ASIC's," IEEE Trans. Computer-Aided Design, vol. CAD-8, pp. 661–679, June 1989.
[15] C.-Y. Wang and K. K. Parhi, "Loop List Scheduler for DSP Algorithms Under Resource Constraints," in Proc. IEEE Int. Symp. Circuits and Systems, Chicago, pp. 1662–1665, May 1993.
[16] A. Brooke, D. Kendrick, and A. Meeraus, GAMS: A User's Guide, Release 2.25. South San Francisco, CA: The Scientific Press, 1992.

