12.12 Optimization of the Viterbi Decoder
Returning to the Viterbi decoder example (from Section 12.4), we first set the environment for the design using the following worst-case conditions: a die temperature of 25 °C (fastest logic) to 120 °C (slowest logic); a power supply voltage of VDD = 5.5 V (fastest logic) to VDD = 4.5 V (slowest logic); and worst process (slowest logic) to best process (fastest logic). Assume that this ASIC should run at a clock frequency of at least 33 MHz (clock period of 30 ns). An initial synthesis run gives a critical path delay at nominal conditions (the default setting) of about 25 ns and nearly 35 ns under worst-case conditions using a high-density 0.6 μm standard-cell target library.
Estimates (using simulation and calculation) show that data arrives at the input pins 5 ns (worst-case) after the rising edge of the clock. The reset signal arrives 10 ns (worst-case) after the rising edge of the clock. The outputs of the Viterbi decoder must be stable at least 4 ns before the rising edge of the clock. This allows these signals to be driven to another ASIC in time to be clocked. These timing constraints are severe: the 5 ns input delay and the 4 ns output requirement together reduce the usable clock period by 9 ns, from 30 ns to 21 ns. However, these figures are typical for board-level delays.
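The budget arithmetic above can be checked with a short calculation. This is a minimal sketch and not part of the original design flow; the function name available_period and its parameter names are hypothetical, and the numbers are the ones quoted in the text.

```python
def available_period(clock_period_ns, input_delay_ns, output_setup_ns):
    """Time left for on-chip logic after board-level delays are subtracted."""
    return clock_period_ns - input_delay_ns - output_setup_ns

# Figures from the text: 30 ns clock period, data arrives 5 ns after the
# rising clock edge, outputs must be stable 4 ns before the next rising edge.
budget_ns = available_period(30, 5, 4)
print(budget_ns)  # 21
```

With only 21 ns available for the data paths, the initial worst-case critical path delay of nearly 35 ns is well outside the budget, which motivates the restructuring that follows.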
The initial synthesis runs reveal the critical path is through the following six modules:
subset_decode -> compute_metric -> compare_select -> reduce -> metric -> output_decision
The logic synthesizer can do little or no optimization across these module boundaries. The next step, then, is to rearrange the design hierarchy for synthesis.
Flattening (merging or ungrouping) the six modules into a new cell, called critical, allows the synthesizer to reduce the critical path delay by optimizing one large module.
At present the last module in the critical path is output_decision. This combinational logic adds 2–3 ns to the output delay requirement of 4 ns (this means the outputs of the module metric must be stable 6–7 ns before the rising clock edge). Registering the output reduces this overhead and removes the module output_decision from the critical path. The disadvantage is an increase in latency by one clock cycle, but the latency is already 12 clock cycles in this design. Total latency is the number of cycles multiplied by the clock period, so performance still improves as long as registering the output reduces the critical path delay to less than 12/13 (about 92 percent) of its present value.
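To make the trade-off concrete, the sketch below compares total latency (cycles multiplied by clock period) before and after adding the output register. The function names and the example clock periods of 30 ns, 27 ns, and 29 ns are assumptions chosen for illustration; only the 12-cycle latency and the one-cycle increase come from the text.

```python
def total_latency_ns(cycles, clock_period_ns):
    """Total latency is the number of pipeline cycles times the clock period."""
    return cycles * clock_period_ns

def registering_helps(old_period_ns, new_period_ns, old_cycles=12):
    """True if adding one cycle of latency still lowers the total latency."""
    before = total_latency_ns(old_cycles, old_period_ns)
    after = total_latency_ns(old_cycles + 1, new_period_ns)
    return after < before

# Break-even: the new period must be below 12/13 of the old one.
print(registering_helps(30.0, 27.0))  # True:  13 * 27 = 351 < 12 * 30 = 360
print(registering_helps(30.0, 29.0))  # False: 13 * 29 = 377 > 360
```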
To register the output, alter the code (on pages 575–576) as follows:
module viterbi_ASIC
...
wire [2:0] Out, Out_r; // Change: add Out_r.
...
asPadOut #(3,"30,31,32") u30 (padOut, Out_r); // Change: Out_r.
Outreg o_1 (Out, Out_r, Clk, Res); // Change: add output register.
...
endmodule

module Outreg (Out, Out_r, Clk, Res); // Change: add this module.
input [2:0] Out; // Unregistered decoder output.
input Clk, Res; // Clock and reset; names must match the port list (Res, not Rst).
output [2:0] Out_r; // Registered output that drives the output pads.
dff #(3) reg1(Out, Out_r, Clk, Res); // 3-bit output register.
endmodule
These changes move the performance closer to the target. Prelayout estimates indicate the die perimeter required for the I/O pads will allow more than enough area to hold the core logic. Since there is unused area in the core, it makes sense to switch to a high-performance standard-cell library with a slightly larger cell height (96λ versus 72λ). This cell library is less dense, but faster.
Typically, at this point, the design is improved by altering the HDL, the hierarchy, and the synthesis controls in an iterative manner until the desired performance is achieved. However, remember there is still no information from the layout. The best that can be done is to estimate the contribution of the interconnect using wire-load models. As soon as possible the netlist should be passed to the floorplanner (or the place-and-route software in the absence of a floorplanner) to generate better estimates of interconnect delays.
TABLE 12.13 Critical-path timing report for the Viterbi decoder.

| Instance name | inPin --> outPin | incr | arrival | trs | rampDel | cap(pF) | cell |
|---|---|---|---|---|---|---|---|
| v_1.u100 | | | | | | | |
| u1.subout5.Q_ff_b0 | CP --> QN | 1.65 | 1.65 | F | .20 | .10 | dfctnb |
| B1_i67 | A1 --> ZN | .63 | 2.27 | R | .14 | .08 | ao01d1 |
| B1_i66 | B --> ZN | .84 | 3.12 | F | .15 | .08 | ao04d1 |
| B1_i64 | B2 --> ZN | .91 | 4.03 | F | .35 | .17 | fn03d1 |
| B1_i68 | I --> ZN | .39 | 4.43 | R | .23 | .12 | in01d1 |
| B1_i316 | S --> Z | .91 | 5.33 | F | .34 | .17 | mx21d1 |
| u3.add_rip1.u4 | B0 --> CO | 2.20 | 7.54 | F | .24 | .14 | ad02d1 |
| ... 28 other cell instances omitted ... | | | | | | | |
| u5.sub_rip1.u6 | B0 --> CO | 2.25 | 23.17 | F | .23 | .13 | ad02d1 |
| u5.sub_rip1.u8 | CI --> CO | .53 | 23.70 | F | .21 | .09 | ad01d1 |
| B1_i301 | A1 --> Z | .69 | 24.39 | R | .19 | .07 | xo02d1 |
| u2.metric3.Q_ff_b4 | setup: D --> CP | .17 | 24.56 | R | .00 | .00 | dfctnb |
| | slack: MET .44 | | | | | | |
Table 12.13 is a timing report for the Viterbi decoder. It shows that the critical path starts at a sequential logic cell (a D flip-flop in the present example) and ends at a sequential logic cell (another D flip-flop), with 37 combinational logic cells in between. The first delay is the clock-to-Q delay of the first flip-flop. The last delay is the setup time of the last flip-flop. The critical path delay is 24.56 ns, which gives a slack of 0.44 ns from the constraint of 25 ns (reduced from 30 ns to give an extra margin). We have met the timing constraint (otherwise we say it is violated).
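The slack figure follows directly from the constraint and the critical path delay. The sketch below is illustrative only: the function names are hypothetical, and the "VIOLATED" label is an assumption (the report in Table 12.13 shows only the "MET" case).

```python
def slack_ns(constraint_ns, path_delay_ns):
    """Slack is the constraint minus the path delay, rounded to 0.01 ns."""
    return round(constraint_ns - path_delay_ns, 2)

def status(slack):
    """Non-negative slack means the timing constraint is met."""
    return "MET" if slack >= 0 else "VIOLATED"

# Figures from the text: 25 ns constraint, 24.56 ns critical path delay.
s = slack_ns(25.0, 24.56)
print(s, status(s))  # 0.44 MET
```

By the same arithmetic, the initial worst-case delay of nearly 35 ns against the 30 ns clock period would give a slack of about -5 ns, a violated constraint.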
In Table 12.13 all instances in the critical path are inside instance v_1.u100. Instance name u100 is the new cell (cell name critical) formed by merging six blocks in module viterbi (instance name v_1).
The second column in Table 12.13 shows the timing arc of the cell involved on the critical path. For example, CP --> QN represents the path from the clock pin, CP, to the flip-flop output pin, QN, of a D flip-flop (cell name dfctnb). The pin names and their functions come from the library data book. Each company adopts a different naming convention (in this case CP represents a positive clock edge, for example). The conventions are not always explicitly shown in the data books but are normally easy to discover by looking at examples. As another example, B0 --> CO represents the path from the B input to the carry output of a 2-bit full adder (cell name ad02d1).
The third column (incr) represents the incremental delay contribution of the logic cell to the critical path.
The fourth column (arrival) shows the arrival time of the signal at the output pin of the logic cell. This is the cumulative delay to that point on the critical path.
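The relationship between the incr and arrival columns can be checked by accumulating the incremental delays. The sketch below uses the first seven rows of Table 12.13; the 0.02 ns tolerance is an assumption, chosen because the printed values differ from a running sum by up to 0.01 ns, presumably because the tool rounds each printed value separately.

```python
from itertools import accumulate

# incr values for the first seven cells in Table 12.13 (ns).
incr = [1.65, 0.63, 0.84, 0.91, 0.39, 0.91, 2.20]
# arrival values printed in the report for the same cells (ns).
reported = [1.65, 2.27, 3.12, 4.03, 4.43, 5.33, 7.54]

# Running sum of the incremental delays gives the arrival time at each cell.
arrival = list(accumulate(incr))

for computed, shown in zip(arrival, reported):
    # Allow for rounding in the printed report (assumed tolerance).
    assert abs(computed - shown) < 0.02

print([round(a, 2) for a in arrival])
```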
The fifth column (trs) describes whether the transition at the output node is rising (R) or falling (F). The timing analyzer examines each possible combination of rising and falling delays to find the critical path.
The sixth column (rampDel) is a measure of the input slope (ramp delay, or slew rate). In submicron ASIC design this is an important contribution to delay.
The seventh column (cap(pF)) is the capacitance, in picofarads, at the output node of the logic cell. This determines the logic cell delay and also the signal slew rate at the node.
The last column (cell) is the cell name (from the cell-library data book). In this library suffix 'd1' represents normal drive strength with 'd0', 'd2', and 'd5' being the other available strengths.