Frequency-Independent Warning Detection Sequential for Dynamic Voltage and Frequency Scaling in ASICs

Author(s)
Das, Bishnu Prasad; Onodera, Hidetoshi

Citation
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2014), 22(12): 2535-2548

Issue Date
2014-12

URL
http://hdl.handle.net/2433/197829

© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. This is not the published version. Please cite only the published version.

Type
Journal Article

Textversion
author
Kyoto University
Abstract—In this paper, a metastability-immune warning flip-flop (FF) is proposed, which consists of an edge detector, a warning window generator and a warning detector along with a traditional FF. The delayed data is monitored during the warning window to flag a warning signal before the data enters the erroneous zone. In this scheme, the warning window is independent of the input clock frequency and hence is suitable for frequency scaling applications. A 16-bit Kogge-Stone adder that uses the warning FF for dynamic voltage and frequency scaling (DVFS) is implemented in 65 nm technology. The warning-FF-based DVFS allows elimination of safety margins and operates up to the point of the first warning of the adder without any erroneous results. Experiments were conducted with different supply voltages, phase-shifted clocks and process conditions. By producing the warning signal with pre-defined timing slack, the circuit helps determine when to stop further reduction of the supply voltage in DVFS applications. The test chip results demonstrate that the proposed circuit can track critical path delays of 2.4 ns to 7.5 ns at warning voltages of 1.15 V to 0.72 V, respectively. The measured results from 10 different chips show the effectiveness of the proposed concept across process variation.

Index Terms—Resilient circuits, error detection sequential (EDS), warning prediction sequential, dynamic voltage scaling (DVS).

I. INTRODUCTION

As process technology scales into the nanometer regime, there is great demand to achieve maximum performance within a stringent power envelope. Variation is a major bottleneck in achieving this goal. Variations can be static, like process variation, or dynamic, like environmental variations. Static process variations can be handled by chip binning or post-silicon supply/body voltage tuning. Dynamic variations, however, are transient phenomena that depend on time and environment. Examples of dynamic variations are supply voltage variation, temperature variation, and aging effects such as negative bias temperature instability (NBTI). The traditional approach uses the worst-case supply voltage and frequency to avoid such effects [1]–[3]. Adaptive techniques have been developed to eliminate design margin by dynamically modifying the supply voltage, the body bias and the clock frequency [4], [5].

A traditional dynamic voltage scaling (DVS) approach employs look-up tables which store pre-characterized voltage and frequency points [1]–[3]. In this approach, supply voltage of the circuit is determined based on the frequency of the critical path monitoring circuit. However, the voltage-frequency points are characterized by considering the worst case process, voltage and temperature variations. Hence, the designer loses some margins in this kind of pessimistic choice of voltage and temperature points.

The Razor I flip-flop shows a direction for reducing the worst-case margin [5]. It uses an error-detecting FF on the critical path of the design to reduce the supply voltage until the first failure point for a given frequency is found. This allows a reduction in design margins, leading to significant energy savings. However, the technique requires additional circuitry such as a shadow latch and a meta-stable detector for error detection. The crucial limitation of this technique is that it checks for the error at the output of a FF, which requires a meta-stable detector to resolve the output of the FF. The canary FF uses delayed data and a shadow FF along with the traditional FF to detect the timing error [6], [16]. Since it compares the data at the output of a FF, it also requires a meta-stable detector.

Razor II is another flavor of Razor I where data transition is checked at the input of a FF [7], [8]. Hence, it does not require a meta-stable detector. However, Razor I and Razor II are used in a processor framework where the corrective action is performed using re-execution of instructions.

Authors in [9], [10] proposed an error detection sequential using transition detector with time borrowing (TDTB) and double sampling with time borrowing (DSTB). However, all these techniques are useful in a processor architecture where instruction replay mechanism is readily available.

A yield enhancement technique using the defect predictive FF (DPFF) is proposed in [11]. The DPFF produces a warning signal based on the timing error, which is used to replace the faulty block with a working block. However, the warning signal is generated by comparing the outputs of two FFs. An issue arises when the output of one of the FFs falls into the meta-stable zone. Hence, it requires extra hardware such as a meta-stable detector to resolve the data at the output of the FF.

A tunable replica circuit (TRC) is proposed in [12], which can be used to tune the supply voltage after fabrication to match the critical path on the die. An error-detecting sequential is used in the TRC to report a timing error if the delay of the circuit exceeds the clock period of the TRC. However, this technique requires post-silicon calibration, which increases testing cost. Recently, error detection in a FF using a transition detector was proposed in [13]. This technique uses an on-die adaptive frequency controller to adjust the frequency based on workload and error rate. The major limitation of [12], [13] is that they require an error recovery mechanism, which is available in a processor. In an application specific integrated circuit (ASIC), however, the error would lead to malfunctioning of the whole system.

Majority voting and glitch filtering are mostly used in single event upset (SEU) and single event transient (SET) hardened FF designs. In majority voting, delayed clocks are used in three parallel FFs and the outputs of the FFs are compared using a voting unit [14]. The area overhead is large, as three FFs are used. In glitch filtering, the glitch generated by an SET is filtered by delaying the data path [15]. The major difference between these techniques and our work is that majority voting and glitch filtering are used for error correction, whereas the proposed FF is used for warning detection. Dynamic voltage and frequency scaling (DVFS) requires a control signal that indicates when to stop further reduction of the supply voltage or clock frequency. Hence, majority voting and glitch filtering are not suitable for DVFS applications.

Wear-induced failure prediction for various parts of a microprocessor is presented in [17] using online delay monitoring and statistical analysis of the delay data. The basic concept of a delaying mechanism was first proposed by Kehl [18] for automatic tuning of hardware by monitoring the performance of the system at different time intervals. Incoming data is sampled at three different sampling times. All three samples are compared to the incoming data and the resultant signal is used to adjust the clock frequency. In [18], the incoming data is sampled at three different sampling times, whereas in our proposed technique the delayed data is sampled once by the edge detector. The area requirement of [18] is therefore higher than that of the proposed technique.

Technology scaling allows packing a billion devices into a small area. However, the impact of process variation is rapidly increasing in scaled technology nodes. To counter process variation, statistical static timing analysis (SSTA) has been developed, which incorporates intra-die and inter-die variations to provide statistical guarantees on the timing budget of the circuit. Voltage and temperature scalable SSTA [19] has been developed to analyze the statistical behavior of circuits at different supply voltage and temperature conditions, which helps reduce the timing margins prior to circuit fabrication. Post-silicon tuning [20] is emerging as a technique to reduce timing margins after the fabrication of the chip.

The authors of [21], [22] proposed two aging sensors: (1) a stability checker design and (2) a double sampling design. The proposed technique and the stability checker design take a similar approach, as both use an edge detection circuit. However, the stability checker design in [21], [22] is frequency dependent, whereas the proposed technique is not. To make a fair comparison, we have compared the edge detection approach of [21], [22] with the proposed technique in this paper. The double sampling design in [21], [22] and the canary FF in [6] take a similar approach, as both use two FFs. In this approach, the warning window is not frequency dependent.

An aging sensor for the combinational part of the critical path is proposed in [23]. The circuit compares two code words in order to predict the aging effect. However, a comparator is needed to monitor the two code words, which in turn increases the area of the sensor circuit. A timing slack monitoring circuit with a window generator and a sensor cell is presented in [24]. A large number of buffers is required in its window generator circuit, which leads to difficulty in balanced clock tree synthesis and to large clock power consumption. The major contribution of this paper is the use of delayed data in an edge detector, which makes the warning FF frequency independent without any effect on the clock network. To the authors' knowledge, this concept has not been published so far. Previously published warning FFs use either the concept of double sampling [21] and the canary FF [6], or the stability checker with edge detection [21], which is not frequency independent.

The technique proposed in this paper is a metastability-immune warning FF for use in an ASIC framework [25]. The proposed circuit uses delayed data in the transition/edge detector, which flags a "warning" instead of an "error". Since it reports a warning, the appropriate data is still captured correctly by the FF in the same clock cycle. The contribution of this paper includes an on-chip demonstration of the proposed warning FF for DVFS applications in an ASIC framework. The measured results from a test chip show the effectiveness of the proposed concept across supply voltage, clock frequency (or phase-shifted clock) and process conditions.

The paper is organized as follows. Section II describes the concept of warning detection by defining various timing parameters. The proposed warning flip-flop is described in Section III. The test structure and testing strategies are explained in Section IV. The experimental results are presented in Section V. The usage of warning FF in real application is described in Section VI and final conclusion is presented in Section VII.

II. CONCEPT OF WARNING DETECTION

Setup time: It is the time before the clock edge during which the data should be available such that the data can be sampled properly by the FF. The worst case setup time is the worst value of the setup time after performing statistically meaningful simulations (e.g. Monte Carlo) considering variation to the process parameters such as the gate length and the threshold voltage and environmental parameters such as the supply voltage and the temperature. The worst case setup time is denoted as \( t_{\text{setup}}^w \).

Hold time: It is the time after the clock edge during which the data should be available such that the data can be sampled properly by the FF. The worst case hold time is found out following the method explained in the definition of worst case setup time. The worst case hold time is denoted as \( t_{\text{hold}}^w \).

In our warning detection scheme, the warning is detected by monitoring the delayed data transition. For the delayed data, the warning window $t_{\text{warning}}$ lies after the rising edge of the clock, as shown in Fig. 1(a). The minimum delay between the delayed data and the direct data is equal to the sum of $t_{\text{setup}}$ and $t_{\text{warning}}$, as shown in Fig. 1(a), so that the warning window appears after the rising edge of the clock. The same delay between direct data and delayed data is maintained in both Fig. 1(a) and Fig. 1(b); it is drawn only in Fig. 1(a) to keep the figure clear. When the data arrives early, both the data and the delayed data are outside the warning window, as shown in Fig. 1(a). When the data arrives late, the data is outside the warning window and the delayed data is inside the warning window, as shown in Fig. 1(b). Since it is the delayed data that is inside the warning window, the data is still safely sampled by the flip-flop at the rising edge of the clock, as shown in Fig. 1(b). The corrective action should be taken in the next clock period so that the delayed data transition no longer happens in the warning window. In this work, the delayed data transition is monitored in the warning window to flag the warning signal.
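The early and late cases of Fig. 1 can be sketched as a small classifier. This is a minimal illustration and not the authors' circuit: it assumes the data-path delay buffer equals the minimum bound $t_{\text{setup}} + t_{\text{warning}}$, expresses all times in integer picoseconds relative to the rising clock edge (negative means before the edge), and uses hypothetical values of 50 ps and 100 ps. The function name `classify_arrival` is ours.

```python
def classify_arrival(t_data_ps, t_setup_ps, t_warning_ps):
    """Classify a direct-data transition relative to the rising clock edge (t = 0).

    Assumption: the delayed data lags the direct data by exactly
    t_setup + t_warning, the minimum bound stated in the text.
    """
    t_buffer_ps = t_setup_ps + t_warning_ps
    t_delayed_ps = t_data_ps + t_buffer_ps

    if t_data_ps > -t_setup_ps:
        # Direct data itself is inside the setup window: real timing violation.
        return "violation"
    if 0 <= t_delayed_ps <= t_warning_ps:
        # Delayed data falls in the warning window; direct data is still safe.
        return "warning"
    return "safe"


if __name__ == "__main__":
    T_SETUP, T_WARNING = 50, 100  # hypothetical values in ps
    for t in (-300, -100, -20):
        print(t, classify_arrival(t, T_SETUP, T_WARNING))
```

With these assumed numbers, a transition 300 ps before the edge is safe, one 100 ps before the edge raises a warning while the FF still samples correctly, and one 20 ps before the edge is an actual setup violation.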

III. PROPOSED WARNING FLIP-FLOP

The circuit consists of an edge detector, a warning window generator and a warning detector sub-circuit along with a traditional FF, as shown in Fig. 2. A timing error can be monitored by two methods: (i) monitoring the input data transition during the warning window and (ii) comparing the output of a FF with that of another FF. The latter method requires a meta-stable detector to resolve the time-critical data. In this work, the data transition at the input of the FF is monitored to prevent the timing error. In the proposed approach, the output of the warning detector would be in a state of partial pull-down if there is a small overlap between the data edge and the warning window. This partial pull-down would occur only when the input data edge is on the borderline of generating a warning signal. It would affect only the adaptive response of the controller, as the warning signal is not evaluated properly; the main FF would still sample the correct value. Hence, the design is immune to the impact of a partial pull-down in the warning detector. To avoid this effect, the warning window width should include the maximum propagation delay of the warning detector, considering worst-case process variations and the fan-out of the warning signal. Hence, this approach does not require a meta-stable detector; such an approach is termed a metastability-immune method in Razor II [7]–[10]. Instead of feeding the direct data to the edge detector, as is done in an error detection sequential, the delayed data is fed to the edge detector to detect the warning in this work. The FF works with the direct data, whereas the delayed data is used in the edge detector.

Figure 3 shows the conceptual timing diagram of a warning flip-flop, which does not include the internal delay of each block. The delayed data transition is monitored during the warning window. The edge of the delayed data is generated using the edge detector. The warning window is generated from the clock and the delayed inverted clock. The edge of the delayed data is generated at both the rising and falling transitions of the delayed data, whereas the warning window is created only at the rising edge of the clock, as in Fig. 3. In case of late data arrival, the delayed data enters the warning window first and flags a warning signal. Since the warning signal is flagged based on the transition of the delayed data, the direct data is sampled safely by the FF before entering the erroneous window. The warning signal is flagged at the rising edge of the third clock signal, as in Fig. 3. Since the warning signal is generated from the transition of the delayed data, no erroneous data transition occurs at the output of the FF. The corrective action can take many clock cycles, depending on the response time of the controller. The corrective action includes adjusting the supply voltage, the clock frequency and the body bias.

It is easy to handle slowly changing variations such as temperature variation and transistor aging, as the change in the critical path delay is gradual. However, in the case of a high-frequency voltage droop, the change in the critical path delay is fast. The proposed warning FF has one limitation that is common to any warning FF: it cannot function correctly when the time available for detecting and responding to the variation is not sufficient to avoid an actual timing violation. The proposed FF cannot predict timing violations due to a high-frequency voltage droop; instead, these types of variations require a guard band similar to that of a conventional design.
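The difference between a gradual drift and a fast droop can be illustrated with a short sketch. It is only a behavioral illustration under stated assumptions: arrival times are integer picoseconds relative to the rising clock edge, the data-path delay buffer equals $t_{\text{setup}} + t_{\text{warning}}$, and the arrival sequences and the 50 ps / 100 ps values are hypothetical. The function name `first_event` is ours.

```python
def first_event(arrivals_ps, t_setup_ps, t_warning_ps):
    """Return (step, event) for the first non-safe cycle in a sequence.

    arrivals_ps: direct-data arrival times relative to the clock edge,
    one per control step (e.g. after each supply-voltage reduction).
    """
    for step, t_data in enumerate(arrivals_ps):
        if t_data > -t_setup_ps:
            return step, "violation"   # data already inside the setup window
        if t_data >= -(t_setup_ps + t_warning_ps):
            return step, "warning"     # delayed data inside the warning window
    return None, "safe"


if __name__ == "__main__":
    T_SETUP, T_WARNING = 50, 100  # hypothetical values in ps
    gradual = [-400, -300, -200, -120, -40]  # slow drift (aging, temperature)
    abrupt = [-400, -200, 0]                 # fast droop skips the window
    print(first_event(gradual, T_SETUP, T_WARNING))
    print(first_event(abrupt, T_SETUP, T_WARNING))
```

In the gradual sequence the warning appears before any violation, so a controller has a chance to react. In the abrupt sequence the arrival time jumps over the whole warning window between two steps, which is the situation that still needs a conventional guard band.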

The maximum critical path delay of the proposed technique is larger than that of the conventional design because of the buffers added in the path of the edge detector. In our proposed technique, configurable delay buffers are used in the path of the edge detector, which can adjust the critical path delay according to the process corner to ensure the proper functionality of the warning FF.

In the proposed warning FF, extra circuits such as an edge detector, a warning window generator and a warning detector are added to a conventional FF specifically to monitor setup time violations of the FF. The proposed FF is not designed to monitor the hold time constraint. We suggest that hold time violations be fixed at design time for all operating conditions. One way of fixing the hold constraint is to simulate the design at the best (fast) process corner. Hence, like [21], the proposed FF has no additional hold time penalty compared to a conventional FF. In other words, the hold time constraint need not be included in the width of the warning window.

A. Edge Detector

Conventional edge detectors use either a static CMOS logic style [13] or a dynamic logic style [26]. The proposed edge detector, however, is a pass-transistor-based design, as shown in Fig. 4; its conceptual timing diagram is shown in Fig. 5. The proposed edge detector consists of two inverters $I_1$ and $I_2$, a conditional inverter $I_3$ and a transmission gate $T$, as shown in Fig. 4. The output of the conditional inverter $I_3$ and the output of the transmission gate $T$ are connected together to form the output of the edge detector. When Delayed data = 1, the conditional inverter $I_3$ behaves as a normal inverter and its output acts as the output of the edge detector; in this case, the transmission gate $T$ is not operational. When Delayed data = 0, the transmission gate $T$ is operational and its output acts as the output of the edge detector; in this case, the conditional inverter $I_3$ is not operational. The control signals (i.e., the outputs of $I_1$ and $I_2$) for the transmission gate $T$ and the inverter $I_3$ are the same, which allows only one of $I_3$ and $T$ to be active at a time. The input signal (i.e., the delayed data) for $T$ and $I_3$ is also the same. However, the delays of $I_3$ and $T$ are not the same. Hence, there is a small amount of race due to the delay difference between $I_3$ and $T$. We have performed extensive Monte Carlo simulation as well as corner simulations (typical, slow, fast, fast-slow and slow-fast) of the proposed edge detector at a low supply voltage of 0.5 V to check for contention or race issues. The results show that the proposed edge detector works properly down to 0.5 V. Hence, this edge detector can be used in low supply voltage applications. We have also performed substrate noise simulation [33] of the edge detector, which shows that it is immune to substrate noise even at a low supply voltage of 0.5 V.
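At the behavioral level, an edge detector of this kind produces a short pulse on every transition of its input. The sketch below is a logic-level abstraction only and does not model the pass-transistor circuit of Fig. 4: it assumes an active-high pulse, a sampled 0/1 waveform, and a pulse width set by a delay of `width` samples (standing in for the buffer chain). The function name `edge_pulses` is ours.

```python
def edge_pulses(samples, width):
    """Return a 0/1 waveform that is 1 for `width` samples after each
    rising or falling transition of `samples`.

    Modeled as the XOR of the signal with a copy of itself delayed by
    `width` samples; the signal is assumed constant before sample 0.
    """
    out = []
    for i, s in enumerate(samples):
        prev = samples[i - width] if i >= width else samples[0]
        out.append(s ^ prev)
    return out


if __name__ == "__main__":
    delayed_data = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
    print(edge_pulses(delayed_data, 2))
```

A larger `width` gives a wider pulse, which corresponds to adding buffers in front of the detector.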

The delay of an inverter is very small compared to other gates in any technology node; hence, more buffers are needed than just the one buffer and two inverters shown in Fig. 4. The buffer $B_1$ and inverter $I_3$ before the node E1 in Fig. 4 determine the width of the edge, as shown in the conceptual timing diagram in Fig. 5. Since these buffers are also needed for the edge detector designs in [13], [24], [26], these extra buffers are excluded for each design in the comparison in Table I. Inserting more buffers creates a wider edge, which is needed for proper functioning of the warning
In our approach, the buffers are inserted before inverter $I_1$ in Fig. 4 to create a wider edge. We have compared the different types of edge detectors in terms of area, speed, power dissipation, maximum response time and minimum supply voltage of operation; the results are presented in Table I. To make a fair comparison, two extra delay buffers are added to each type of edge detector. The power dissipation of the proposed edge detector is $9\,\mu\text{W}$, which is the lowest among all the edge detectors. The dynamic implementation of the previous edge detectors increases their power dissipation. The maximum response time of the proposed edge detector is $12\,\text{ps}$, which is also the lowest among all the edge detectors. This is because the previous implementations use stacked transistors, which increase the response time. The minimum supply voltage of operation of the different edge detectors is presented in the sixth column of Table I. It shows that the proposed edge detector can operate at a minimum supply voltage of 0.5 V, which is the lowest among the implementations. The stacked transistor implementation of the other edge detectors increases their minimum operating voltage. Table I also shows that the proposed edge detector uses the fewest transistors among all of the previously proposed ones. The simulations in Table I are performed at the typical corner, a supply voltage of 1.2 V and a temperature of 25$^\circ$C. These results have been obtained by simulating the schematics of the circuits with the diffusion capacitance of each transistor. We have also simulated the RC-extracted layout of the proposed edge detector and found a maximum response time of $17\,\text{ps}$ and a power dissipation of $14\,\mu\text{W}$, which is in close agreement with the schematic simulation. We expect a similar variation in the previously proposed edge detectors. Since the previously proposed edge detectors are custom circuits, we have compared all the approaches by performing schematic simulation.

B. Warning Window Generator

The warning window is also known as the guard band interval in [21], the time window control (TWC) in [23] and the detection window in [24]. The creation of the warning window is explained in the following two cases.

1) **Case I: Before the rising clock edge:** The guard band interval in [21] and the TWC in [23] are created before the rising clock edge of a positive-edge-triggered FF. In this approach, the warning window for a rising edge is generated from the previous rising edge; that is, the previous rising edge acts as the reference signal for generating the warning window for the present rising edge. In this case, the warning window width depends on the clock frequency and is designed for a fixed clock frequency. A large number of buffers is required to create the warning window before the rising clock edge, which increases area and power dissipation at low clock frequencies. The edge detector in this approach, however, operates on the input data directly. The detection window in [24] is generated from the leaf clock by inserting buffer cells in the path, which leads to large dynamic power consumption. In [24], the clock for the flip-flop also has delay cells so that the detection window is created before the rising clock edge of the flip-flop, which makes balanced clock tree synthesis difficult.

The authors of [21] proposed two built-in aging sensors: (1) a stability checker design and (2) a double sampling design. The proposed technique and the stability checker design take a similar approach, as both use an edge detection circuit. However, the stability checker design in [21] is frequency dependent, whereas the proposed technique is not. To make a fair comparison, we have compared the edge detection approach of [21] with the proposed technique in this paper. The clocking power of the stability checker approach of [21] is large, as more buffers are needed to create a small warning window, whose position also depends on the system clock frequency.
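The frequency dependence of Case I can be illustrated with a first-order buffer count. This is a rough sketch under our own assumptions, not a result from the paper: a window that opens before the present edge but is timed from the previous edge needs a delay of roughly one clock period minus the window width, whereas a window that opens after the present edge needs a delay of roughly the window width only. The 30 ps buffer delay and the helper names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def buffers_case_i(period_ps, window_ps, buffer_ps):
    # Window before the present edge, referenced to the previous edge:
    # the delay chain must span about (period - window).
    return ceil_div(period_ps - window_ps, buffer_ps)


def buffers_case_ii(window_ps, buffer_ps):
    # Window after the present edge: the delay chain spans only the window.
    return ceil_div(window_ps, buffer_ps)


if __name__ == "__main__":
    BUFFER_PS = 30    # hypothetical delay of one buffer
    WINDOW_PS = 113   # one of the window widths listed in Table III
    for period_ps in (2400, 7500):
        print(period_ps,
              buffers_case_i(period_ps, WINDOW_PS, BUFFER_PS),
              buffers_case_ii(WINDOW_PS, BUFFER_PS))
```

Under these assumptions the Case I count grows with the clock period, while the count for a window referenced to the present edge does not change with frequency.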

The double sampling design in [21] and the canary FF in [6] take a similar approach, as both use two FFs. We have simulated the layout-extracted netlists of the double sampling approach in [21] and of the proposed warning FF. The SPICE simulation results show that the clock-only power of the double sampling approach in [21] and of the proposed FF is $13\,\mu\text{W}$ and $26\,\mu\text{W}$, respectively. It shows that proposed FF has $1 \times$ time
The warning window width $t_{\text{warning}}$ depends on the following factors:

1) The worst case setup time of the FF ($t_{\text{setup}}^w$), found by simulating the FF at the worst process corner.

2) The maximum delay change ($t_{\text{VR}}$) due to a one-step change in the voltage regulator, an instantaneous power supply drop, or fast-moving transients.

3) The maximum delay of the warning detector ($t_{\text{WD}}$).

Now $t_{\text{warning}}$ is determined as follows:

$$t_{\text{warning}} \geq t_{\text{setup}}^w + t_{\text{VR}} + t_{\text{WD}} \quad (1)$$

The value of $t_{\text{warning}}$ is determined using the above equation. A higher value of $t_{\text{warning}}$ is good for safe operation of the warning FF, but it reduces the savings in delay margin under dynamic variations. A lower value of $t_{\text{warning}}$ would lead to malfunctioning of the warning FF. So, there is a trade-off between the saving in delay margin and the functionality of the warning FF. A higher value of $t_{\text{warning}}$ also increases the area and power dissipation of the warning FF, as it is achieved by inserting buffers in the warning window generator sub-circuit shown in Fig. 6. Table III shows how much the area and power increase for different values of the warning window width. As the width of the warning window is increased by 27%, the area of the warning window generator increases by 17% and its power dissipation increases by 20%. Table III thus shows that a wider warning window requires more buffers, which increases the area and power dissipation of the circuit.
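The arithmetic behind the window sizing and the quoted percentages can be checked with a few lines. The sketch treats the sum of the three factors as the smallest safe window width, as the surrounding discussion suggests, and uses the width, area and power values listed in Table III. The component delays passed to `min_warning_window` are hypothetical, and the function names are ours.

```python
def min_warning_window(t_setup_w_ps, t_vr_ps, t_wd_ps):
    """Smallest window width implied by the three factors (all in ps)."""
    return t_setup_w_ps + t_vr_ps + t_wd_ps


def percent_increase(old, new):
    return 100.0 * (new - old) / old


# (window width in ps, area in um^2, power in uW), copied from Table III
TABLE_III = [(83, 6.84, 14.0), (113, 8.28, 17.5), (143, 9.72, 21.0)]


def step_increases(rows):
    """Rounded percent increase of (width, area, power) between adjacent rows."""
    result = []
    for (w0, a0, p0), (w1, a1, p1) in zip(rows, rows[1:]):
        result.append((round(percent_increase(w0, w1)),
                       round(percent_increase(a0, a1)),
                       round(percent_increase(p0, p1))))
    return result


if __name__ == "__main__":
    print(min_warning_window(50, 40, 20))  # hypothetical component delays
    print(step_increases(TABLE_III))
```

The step from 113 ps to 143 ps reproduces the 27% / 17% / 20% figures quoted in the text; the step from 83 ps to 113 ps gives larger relative increases because the baseline is smaller.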

C. Width of Warning Window

The warning FF can report a warning only if the edge of the delayed data moves gradually towards the clock edge. The supply voltage and/or clock frequency should therefore be decreased gradually so that the delayed data does not miss the warning window. If the direct data transition falls directly into the erroneous window, the warning signal cannot be detected. The generalization of the warning window width is presented here.

### TABLE I

<table>
<thead>
<tr>
<th>Edge detector</th>
<th>Number of transistors</th>
<th>Number of extra buffers needed</th>
<th>Power dissipation ($\mu$W)</th>
<th>Max Response time (ps)</th>
<th>Min Voltage Operation (V)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bull et al. [13]</td>
<td>16</td>
<td>2</td>
<td>18.1</td>
<td>74</td>
<td>0.85</td>
</tr>
<tr>
<td>Hirose et al. [26]</td>
<td>9</td>
<td>2</td>
<td>16.6</td>
<td>49</td>
<td>0.7</td>
</tr>
<tr>
<td>Rebaud et al. [24]</td>
<td>12</td>
<td>2</td>
<td>17.5</td>
<td>52</td>
<td>0.65</td>
</tr>
<tr>
<td>Proposed</td>
<td>8</td>
<td>2</td>
<td>9.07</td>
<td>12</td>
<td>0.5</td>
</tr>
</tbody>
</table>

### TABLE II

<table>
<thead>
<tr>
<th>Clock frequency</th>
<th>Case I</th>
<th>Case II</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reference signal</td>
<td>Previous clock edge</td>
<td>Present clock edge</td>
</tr>
<tr>
<td>Area/Power (in low clock frequency)</td>
<td>More</td>
<td>Less</td>
</tr>
<tr>
<td>Edge detector operates on</td>
<td>Direct data signal</td>
<td>Delayed data signal</td>
</tr>
<tr>
<td>Proposed in</td>
<td>[21] and [23]</td>
<td>This paper</td>
</tr>
</tbody>
</table>

### TABLE III

<table>
<thead>
<tr>
<th>Warning window width (ps)</th>
<th>Area ($\mu$m$^2$)</th>
<th>Power ($\mu$W)</th>
</tr>
</thead>
<tbody>
<tr>
<td>83</td>
<td>6.84</td>
<td>14</td>
</tr>
<tr>
<td>113</td>
<td>8.28</td>
<td>17.5</td>
</tr>
<tr>
<td>143</td>
<td>9.72</td>
<td>21</td>
</tr>
</tbody>
</table>

The summary of both the warning window generation schemes is presented in Table II where it is clearly shown that case II is superior to case I in both area and power comparison.
The value of $t_{\text{buffer}}$ is determined by the circuit designer based on experience. In our case, a reconfigurable delay buffer is used, which can be adjusted from outside the chip. The delay buffers are used to delay the input data to the edge detector. The reconfigurable delay buffer makes it possible to adjust the delay margin of the critical path for the warning FF. The delayed data must be observed because the warning FF generates a warning signal and not an error signal.

D. Warning Detector

The warning detector is used to monitor the edge of the delayed data during the warning window interval. If the transition of the delayed data happens in the warning window, it flags the warning signal. The transistor-level circuit of the warning detector is shown in Fig. 7. The circuit performs a logical AND of the two signals to generate the warning signal. The dynamic high-impedance node of the warning detector is susceptible to charge leakage and external coupling. Hence, it is good practice to add a feedback keeper to protect the high-impedance node of the warning detector against external coupling and charge leakage.
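The interaction of the warning window and the data edge can be sketched at the logic level. This is a behavioral abstraction on sampled 0/1 waveforms, not a model of the dynamic circuit in Fig. 7 or its keeper: it assumes the window is the AND of the clock and a delayed, inverted copy of the clock (as described for Fig. 3), that the edge pulse is active high, and that delays are whole numbers of samples. All function names are ours.

```python
def delayed(samples, k):
    """Delay a sampled waveform by k samples, holding the first value."""
    return [samples[0]] * k + samples[:len(samples) - k]


def warning_window(clk, k):
    """Window = clock AND (delayed, inverted clock): high for k samples
    after each rising clock edge, whatever the clock period is."""
    return [c & (1 - d) for c, d in zip(clk, delayed(clk, k))]


def warning_detector(edge, window):
    """Logical AND of the data-edge pulse and the warning window."""
    return [e & w for e, w in zip(edge, window)]


if __name__ == "__main__":
    clk = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
    window = warning_window(clk, 2)
    late_edge = [0] * 11 + [1, 1] + [0] * 3    # pulse overlaps the window
    early_edge = [0] * 5 + [1, 1] + [0] * 9    # pulse misses the window
    print(window)
    print(any(warning_detector(late_edge, window)))
    print(any(warning_detector(early_edge, window)))
```

Because the window width in this model is set only by the delay `k`, changing the clock period moves the windows apart without changing their width, which is the frequency-independence property the design relies on.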

E. Pulse Widening Circuit (Optional)

This circuit is purely optional in the proposed warning FF design. The width of the warning signal is of the order of the width of the warning window, which is typically very small. A warning signal of such a small width cannot be brought outside the chip. If the warning signal is to be observed outside the chip, this circuit is needed. However, if the warning signal is used by a controller inside the chip, this circuit is not required. The pulse widening circuit is simply a normal FF with reset, as shown in Fig. 8, with its data input connected to "logic 1", its clock input connected to the narrow warning signal and its reset input connected to the system clock. The output of the circuit is a wide warning signal. The conceptual timing diagram of the pulse widening circuit is shown in Fig. 9. It is important to note that in some master-slave based FFs, a minimum clock pulse width requirement needs to be met for proper functionality of the circuit. In some cases, a pulse-triggered FF might be a better alternative. The area of the overall warning FF is increased by 33% by adding the pulse widening circuit. The power dissipation of the warning FF with and without the pulse widening circuit is $46\,\mu\text{W}$ and $40\,\mu\text{W}$, respectively. Hence, the pulse widening circuit increases the power dissipation of the warning FF by 15%. The area and power impact of the pulse widening circuit can be reduced by sharing it among many warning FFs.
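The behavior of the pulse widening FF can be sketched with an event-based model. It is an abstraction under our own assumptions: the output goes high at the start of each narrow warning pulse (data input tied to logic 1) and returns low at the next time the reset is asserted. The polarity and exact timing of the reset relative to the system clock are not modeled; reset instants are simply given as a list of times. The function name `widen` and the example times are hypothetical.

```python
def widen(warning_starts, reset_times, t_end):
    """Return (start, end) intervals during which the wide warning is high.

    warning_starts: times at which a narrow warning pulse clocks the FF.
    reset_times:    times at which the reset input is asserted.
    t_end:          end of the observation interval.
    """
    resets = sorted(reset_times)
    intervals = []
    for start in sorted(warning_starts):
        if intervals and start < intervals[-1][1]:
            continue  # output is already high; another pulse changes nothing
        end = next((r for r in resets if r > start), t_end)
        intervals.append((start, end))
    return intervals


if __name__ == "__main__":
    # Hypothetical times in ps: one warning shortly after a clock edge.
    print(widen([1020], [500, 1500, 2500], 3000))
```

In this example a pulse that might last only on the order of the warning window is stretched until the next reset, which makes it long enough to be captured by slower logic or observed off-chip.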

IV. TEST STRUCTURE AND TESTING STRATEGIES

The warning system consists of the main blocks of the test chip and an FPGA, as shown in Fig. 10. The main blocks of the test chip are a 16-bit Kogge-Stone adder, a set of normal FFs (33 bits), a set of warning FFs (17 bits) and three shift registers. The PLL inside the FPGA generates all the clocks for the test chip.

Here, we present a strategy for testing the warning FF with a limited number of input pins. Because of the limited pin count, one shift register is used to store the inputs $A$ and $B$ of the adder. The carry input of the adder, however, is assigned a dedicated input pin, because testing the warning FF requires the data to toggle. The adder sums the inputs $A$, $B$ and $C_{in}$ to produce the sum and $C_{out}$, so toggling $C_{in}$ is reflected at the output of the adder. $Clk_0$ is used as the clock of the input shift register. The two output pins Adder sout and Warning sout serially shift out the adder output and the warning signal respectively. These two pins are the observable points of the test chip.

![Fig. 7. Circuit for generating the warning signal from the warning window and the data edge](image1)

![Fig. 8. Circuit for generating wide warning signal from narrow warning signal](image2)

![Fig. 9. Conceptual timing diagram of Pulse widening circuit](image3)

Another limitation of our chip is that a high-frequency signal cannot be fed to it. One way to deal with this problem is to create a long critical path. However, designing a circuit with a long critical path requires more area. For example, if the input signal to the chip is 1 MHz, then the delay of the critical path should be at least 1 µs, which requires a large area. In our design, the critical path delay is 2.4 ns, which requires less area. This critical path delay was measured on a typical-corner test chip at a supply voltage of 1.15 V and a room temperature of 27°C. In our case, two clocks, Clk1 and Clk2, are used for sampling the input and output data of the adder. These two clocks have the same frequency with a phase shift of 2.4 ns or more. The purpose of the phase-shifted clock is to emulate a high-frequency clock inside the chip using the low-frequency clock available outside the chip. The phase between two clocks is the fraction of a clock cycle by which one signal is shifted relative to the other, as illustrated in Fig. 11. Clk3 is used to clock the two output shift registers, i.e., one for the adder output and the other for the warning signal output. To sample the warning signals, Clk3 should be either the negative edge of Clk2 or a phase-shifted Clk2. In our measurement setup, a phase-shifted Clk2 acts as Clk3. The three clocks Clk1, Clk2 and Clk3 have the same frequency and differ in phase, as shown in Fig. 11. These clocks are generated off-chip using the PLL inside the FPGA. The two extra clocks are part of our testing strategy only; they are not needed in a real design.
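The relation between the launch-to-capture phase shift and the clock frequency it emulates can be illustrated with a short calculation. This is only a sketch: the function name `emulated_frequency_mhz` is hypothetical, and it assumes that the emulated clock period equals the phase shift between Clk1 and Clk2.

```python
def emulated_frequency_mhz(phase_shift_ns: float) -> float:
    """Frequency (MHz) of a clock whose period equals the given phase shift.

    Assumption: the time available to the critical path is exactly the
    Clk1-to-Clk2 phase shift, so it plays the role of the clock period.
    """
    if phase_shift_ns <= 0:
        raise ValueError("phase shift must be positive")
    # 1 / ns is GHz; multiply by 1e3 to express the result in MHz.
    return 1e3 / phase_shift_ns


if __name__ == "__main__":
    # 2.4 ns shift (the measured critical path delay quoted in the text).
    print(round(emulated_frequency_mhz(2.4), 1))
    # 1 us critical path, as needed for a directly applied 1 MHz clock.
    print(round(emulated_frequency_mhz(1000.0), 1))
```

Under this assumption, a 2.4 ns phase shift corresponds to a clock of roughly 417 MHz, far above what the input pins of the test chip can carry directly.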

V. EXPERIMENTAL RESULTS

The experimental setup consists of the test board along with an FPGA board. Figure 12 shows the experimental setup along with the chip micrograph. A test chip was fabricated in a 65 nm industrial process technology to demonstrate the functionality of the proposed warning FF for low-power applications. The key design parameters of the test chip are summarized in Table IV. The experiments were carried out to verify the proposed FF under different supply voltage, frequency (or phase shift) and process conditions. The shift register inside the chip was programmed to activate one of the critical paths of the adder.

Fig. 10. Block diagram of the warning system, consisting of the main blocks of the test chip and the FPGA. The test chip contains a 16-bit Kogge-Stone adder, 33-bit normal FFs, 17-bit warning FFs and 3 shift registers; the FPGA's PLL generates the required clocks for the test chip.

Fig. 11. The three clocks Clk1, Clk2 and Clk3. The phase shift between Clk1 and Clk2 is \(\phi_1\) and the phase shift between Clk1 and Clk3 is \(\phi_2\).

TABLE IV

<table>
<thead>
<tr>
<th>Key Design Parameters</th>
<th>Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>Process</td>
<td>65nm Bulk CMOS</td>
</tr>
<tr>
<td>Supply voltage</td>
<td>1.2V</td>
</tr>
<tr>
<td>Vth flavor</td>
<td>Nominal Vth</td>
</tr>
<tr>
<td>Design Area (Width X Height)</td>
<td>576 µm x 115 µm</td>
</tr>
</tbody>
</table>

Fig. 12. The experimental setup consists of the FPGA board and test board. The micrograph of the test chip of size 2.1mm X 2.1mm contains the warning system of size 115.2µm x 576µm.

The two 16-bit inputs $A$ and $B$ of the adder are set to $(0000)_{16}$ and $(FFFF)_{16}$ respectively. A toggling waveform is applied to the $C_{in}$ input so that data transitions occur at the output of the adder. The adder output takes one of two values depending on $C_{in}$: if $C_{in}$ is 1, the output of the adder is a group of zeros for 17 bits, and if $C_{in}$ is 0, the output is a group of ones for 17 bits. This output pattern repeats as $C_{in}$ toggles, as shown in Fig. 14(a). 17-bit warning FFs are inserted at the output of the adder. The 17-bit adder output and the 17-bit warning signal are shifted out using two parallel-in serial-out shift registers. The same critical path is activated for all the measurements below.
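The effect of this test pattern can be checked with plain unsigned arithmetic. The sketch below uses a hypothetical helper `add16` that models only the arithmetic result, not the Kogge-Stone structure or its timing. Under plain binary addition the 16 sum bits switch between all ones and all zeros as $C_{in}$ toggles, and the carry-out switches as well (with the opposite polarity), so every monitored output bit sees a transition and the carry has to propagate across the full width.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1  # 0xFFFF


def add16(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, carry_out) of a 16-bit addition with carry-in."""
    total = (a & MASK) + (b & MASK) + (cin & 1)
    return total & MASK, total >> WIDTH


if __name__ == "__main__":
    # Test pattern from the text: A = 0x0000, B = 0xFFFF, Cin toggling.
    for cin in (0, 1, 0, 1):
        s, cout = add16(0x0000, 0xFFFF, cin)
        print(f"Cin={cin}: sum=0x{s:04X}, Cout={cout}")
```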

### A. Impact of supply voltage variation on Warning FF

The supply voltage is an important parameter in studying the feasibility of the proposed approach. The proposed warning FF was simulated across supply voltage and temperature variation. Fig. 13 shows the setup time and the warning window width as a function of supply voltage at different temperatures. Both the setup time and the warning window width increase as the supply voltage decreases, because the gate overdrive decreases with the supply voltage; this holds at the different temperatures. This is an important requirement for the warning FF design. It also confirms that the warning window generator works under supply voltage and temperature variation.

To verify the functionality of the warning system, the test chip was measured at various supply voltages. The 17-bit warning FFs at the output of the adder monitor timing violations at the adder output. As mentioned earlier, the 17-bit warning signal is shifted out using a parallel-in serial-out shift register. The measured input clock, shifted adder output and warning signal of the warning system are shown in Fig. 14. When the supply voltage was above 1.09 V, no error and no warning were observed, as shown in Fig. 14(a). At a supply voltage of 1.08 V, one bit of the adder becomes critical and flags the first warning signal, as shown in Fig. 14(b). If the supply voltage is reduced further, multiple critical paths are activated. Hence, multi-bit warning signals are flagged at a supply voltage of 0.96 V, as shown in Fig. 14(c). If the supply voltage is reduced still further, some of the critical paths fail. The first error bit appears at a supply voltage of 0.94 V, as shown in Fig. 14(d). In this experiment, the phase shift was kept constant to isolate the impact of supply voltage on the warning FF. The operation of the warning system across supply voltage can be divided into three zones:

1) No warning and no error zone (Supply voltage above 1.09 V)
2) Warning zone (Supply voltage lies between 0.95 V and 1.08 V)
3) Error zone (Supply voltage starts at 0.94 V)
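The three zones can be summarised as a small classification helper. This is a sketch under stated assumptions: the function `classify_zone` and its zone labels are hypothetical, the two thresholds are the measured first warning voltage (1.08 V) and first error voltage (0.94 V) of this particular chip at the fixed phase shift of the experiment, and supplies between 1.08 V and 1.09 V, which the reported measurements do not cover, are treated as warning-free.

```python
def classify_zone(vdd: float,
                  first_warning_v: float = 1.08,
                  first_error_v: float = 0.94) -> str:
    """Map a supply voltage (V) to one of the three operating zones.

    Thresholds default to the measured values quoted in the text; they
    depend on the chip, the critical path and the phase shift.
    """
    if first_error_v >= first_warning_v:
        raise ValueError("first error voltage must lie below first warning voltage")
    if vdd <= first_error_v:
        return "error"
    if vdd <= first_warning_v:
        return "warning"
    return "no warning, no error"


if __name__ == "__main__":
    for v in (1.10, 1.08, 0.96, 0.94):
        print(f"{v:.2f} V -> {classify_zone(v)}")
```

A DVFS controller built on the warning FF would aim to stay at the upper edge of the warning zone, since the data is still sampled correctly there.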

### B. Impact of phase shift or clock frequency on Warning FF

As mentioned earlier, due to the limitation of the input pins, a high-frequency clock cannot be fed to the chip. Two clocks, Clk1 and Clk2, having the same frequency and a controlled phase shift were applied to the chip to study the impact of phase shift or clock frequency on the warning FF. The Clk1 and Clk2 waveforms generated by the FPGA PLL are shown in Fig. 15. Phase-shifted clocks with shifts of the order of nanoseconds are conveniently generated using the PLL inside the FPGA with the Megafunction tools; the interested reader may refer to [29]. Various phase-shifted clock waveforms were applied to the test chip to study the impact of phase shift on the warning system. The clock, adder output and warning signal for different phase shifts are shown in Fig. 16. When the phase shift between the two clocks was 5.4° (or 6 ns), no error and no warning signal were observed, as shown in Fig. 16(a). However, a warning signal was flagged when the phase shift between the two clocks was 3.5° (or 3.88 ns), as shown in Fig. 16(b). In both cases, the data is sampled correctly up to the point where the warning signal is observed. In this experiment, the supply voltage was kept constant (at 0.86 V) to isolate the impact of phase shift on the warning system.
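The conversion between a phase shift in degrees and the corresponding time shift is sketched below. The external clock period is not stated in the text; the value of 400 ns used here is an assumption inferred from the quoted pair 5.4° and 6 ns, and the function name `phase_to_delay_ns` is hypothetical.

```python
def phase_to_delay_ns(phase_deg: float, period_ns: float) -> float:
    """Time shift (ns) corresponding to a phase shift given in degrees."""
    return phase_deg / 360.0 * period_ns


# Assumed external clock period (ns), inferred from 5.4 degrees == 6 ns.
PERIOD_NS = 400.0

if __name__ == "__main__":
    for phase in (5.4, 3.5):
        delay = phase_to_delay_ns(phase, PERIOD_NS)
        print(f"{phase} deg -> {delay:.2f} ns")
```

With this assumed period, 3.5° maps to about 3.89 ns, which agrees with the quoted 3.88 ns to within rounding.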
Fig. 14. Impact of supply voltage on warning system showing the waveform of clock, warning signal and 16-bit Kogge-stone adder output. (a) Measured no error and no warning signal at supply voltage above 1.09 V, (b) Measured first warning signal at supply voltage 1.08 V, (c) Measured multi-bits warning signals and no error at supply voltage 0.96 V, and (d) Measured first one bit error in the output of the adder at supply voltage 0.94 V.

Fig. 16. Impact of phase shift on warning system showing the waveform of clock, warning signal and 16-bit Kogge-stone adder output. (a) Measured no warning and no error at supply voltage 0.86 V and at phase shift between \( \text{Clk}_1 \) and \( \text{Clk}_2 \) 5.4° (or 6 ns), and (b) Measured warning signal at supply voltage 0.86 V and at phase shift between \( \text{Clk}_1 \) and \( \text{Clk}_2 \) 3.5° (or 3.88 ns).

Depending on the process corner, operating point, workload and switching activity of the ASIC, many paths may become critical. By performing statistical static timing analysis on the ASIC, a few critical paths are identified and the warning FF is used on those paths. The first warning signal voltage is defined as the voltage at which one of the paths becomes critical and flags a warning signal while the system is running at a fixed clock frequency or phase-shifted clock. Figure 17 shows the impact of phase shift on the first warning signal voltage. As the phase shift between the two clocks is increased, the delay margin increases. The increased delay margin can be traded for a lower supply voltage, as shown in Fig. 17. Hence, low power can be achieved by running the system at low supply voltage and low frequency. In this case, the warning signal indicates when to stop decreasing the supply voltage before the system malfunctions. The phase shift between the two clocks \( \text{Clk}_1 \) and \( \text{Clk}_2 \) fundamentally tracks the critical path of the design. Figure 17 shows the
C. Number of buffers in delayed data

In the proposed warning FF, the transition of the delayed data is monitored to flag a warning signal, so the number of delay buffers in the data path must be quantified. The number of buffers in the delayed data path represents the delay margin of the critical path. The proposed design uses configurable delay buffers to create the delayed data for monitoring. The configurable delay buffers can be controlled based on the process condition; hence, the supply voltage, frequency and body bias can be changed adaptively based on the process corner. The proposed technique has full flexibility to modify the critical path delay based on the process corner. This type of adaptive modification of the critical path delay is not possible in a conventional design.

Figure 18 shows the measured first warning voltage and first error voltage versus the number of buffers in the delayed data path. The first error voltage is the supply voltage at which a one-bit error appears for a given critical path at a fixed clock frequency. Since the error does not depend on the delayed data, the error voltage is constant for a critical path at a given clock frequency. Here, the mean first error voltage is 0.95 V, as shown in Fig. 18. The first warning voltage depends on the number of buffers in the data path. With more buffers in the data path, the delayed data is monitored earlier in time; hence the first warning voltage is higher and the power saving is smaller. In this case, the delay margin is large, as the gap between the first error voltage and the first warning voltage is large. With fewer buffers in the data path, the delayed data is monitored later in time, and hence the first warning voltage is lower. In this case, the delay margin is small, as the gap between the first error voltage and the first warning voltage is small, and the power saving is larger. Hence, a trade-off exists between delay margin and power saving, and the designer needs to choose the number of buffers based on the available delay margin for a design. The warning voltage is found to vary linearly with the number of buffers, as shown in Fig. 18. The measured first warning voltage and first error voltage from 10 different chips are shown as error bars in Fig. 18. The spread is due to die-to-die manufacturing variation of the critical path across chips.

D. Power dissipation

This section compares the power dissipation of the proposed design and the conventional design, considering only the dynamic variations (i.e., supply and temperature) at a given process corner. In the conventional design, we assume that a process monitor is available to monitor the process condition of the design. However, the power dissipation of the process monitor is excluded from both the conventional and the proposed design in this comparison.

In the conventional approach, the Kogge-Stone adder with 17-bit normal FFs at the output is simulated, whereas in the proposed scheme, the Kogge-Stone adder with 16-bit normal FFs and a 1-bit warning FF at the output is simulated. In the proposed approach, one warning FF is inserted in the critical path of the design. The conventional design requires supply margins as it does not have any monitoring mechanism. The proposed warning FF based design, however, does not require extra margin for dynamic supply voltage and temperature variations. Hence, the proposed design can be operated at a lower supply voltage until the warning signal appears, which reduces its power dissipation. We have assumed 10% dynamic supply voltage variation for the conventional design, considering the worst-case temperature at a given process corner. In this analysis, the clock frequency of the conventional design is varied across process corners at a fixed supply voltage of 1.05 V. The clock frequency of the design is determined considering 10% supply voltage variation at the worst-case temperature. The maximum value of the warning voltage at any corner is 0.94 V, which corresponds to a voltage variation of around 10%, as shown in Table V.

TABLE V
COMPARISON OF PROPOSED DESIGN WITH THE CONVENTIONAL DESIGN AT DIFFERENT PROCESS AND TEMPERATURE CONDITIONS

| Process | Temperature (°C) | Min Period (ns) | Conventional Design: Supply (V) | Conventional Design: Power Diss. (mW) | Proposed Design: Supply (V) | Proposed Design: Power Diss. (mW) | % Saving in Power Diss., $\frac{P_c - P_p}{P_p} \times 100\%$ |
|---------|------------------|-----------------|---------------------------------|---------------------------------------|-----------------------------|-----------------------------------|------|
| Best    | 25               | 0.8             | 1.05                            | 0.364                                 | 0.87                        | 0.288                             | 26.4 |
| Best    | 125              | 0.8             | 1.05                            | 0.393                                 | 0.94                        | 0.360                             | 9.2  |
| Typical | 25               | 1.3             | 1.05                            | 0.218                                 | 0.94                        | 0.204                             | 6.8  |
| Worst   | 25               | 3.2             | 1.05                            | 0.088                                 | 0.94                        | 0.082                             | 7.3  |
| Worst   | 125              | 3.2             | 1.05                            | 0.089                                 | 0.86                        | 0.070                             | 27.1 |

The minimum clock period of the conventional design is determined considering supply voltage variation at the worst-case temperature for a given process corner. Both the conventional and the proposed designs are simulated in SPICE at the minimum clock period. Table V compares the power dissipation of the conventional design and the proposed design at different process and temperature conditions. It shows that a power saving of up to 27% can be achieved with the proposed design at the worst process corner with a temperature of 125°C. In the typical corner, the warning voltage does not change with temperature variation. This is because of the inverted temperature dependence of the gate delay at low supply voltage, which is explained in detail in [32]. At low supply voltage, the gate delay can decrease with increasing temperature, owing to the opposing effects of mobility and threshold voltage on gate delay. Due to this effect, the critical path delay of the design does not change much with temperature for the typical corner at low supply voltage. Hence, the warning voltage does not change with temperature for the typical corner at low supply voltage.
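The savings column of Table V can be cross-checked from the tabulated power values. The sketch below assumes that the saving is normalised to the power of the proposed design, i.e. $(P_c - P_p)/P_p \times 100\%$; this normalisation is an inference, chosen because it reproduces every tabulated percentage to within 0.1 percentage point. The names `TABLE_V` and `saving_percent` are hypothetical.

```python
# (process, temperature in C, P_conventional in mW, P_proposed in mW,
#  tabulated saving in %) copied from Table V.
TABLE_V = [
    ("Best", 25, 0.364, 0.288, 26.4),
    ("Best", 125, 0.393, 0.360, 9.2),
    ("Typical", 25, 0.218, 0.204, 6.8),
    ("Worst", 25, 0.088, 0.082, 7.3),
    ("Worst", 125, 0.089, 0.070, 27.1),
]


def saving_percent(p_conv_mw: float, p_prop_mw: float) -> float:
    """Power saving in percent, normalised to the proposed design's power."""
    return (p_conv_mw - p_prop_mw) / p_prop_mw * 100.0


if __name__ == "__main__":
    for process, temp, p_c, p_p, tabulated in TABLE_V:
        computed = saving_percent(p_c, p_p)
        print(f"{process:8s} {temp:4d} C: computed {computed:5.2f} %, "
              f"tabulated {tabulated:4.1f} %")
```

Normalising by the conventional design's power instead would give smaller percentages, since $P_c > P_p$ in every row.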

E. Impact of Area

The layout of the warning FF, consisting of a traditional flip-flop, a delay buffer, a warning window generator, an edge detector and a warning detector, is shown in Fig. 19. The area of the proposed warning FF is around three times that of a normal FF, i.e., an overhead of about two times. The warning window generator can be shared among all warning FFs in the design, which reduces the area penalty of the proposed FF. The power dissipation of the proposed warning FF and of a conventional FF is 40 µW and 11 µW respectively. Hence, the warning FF dissipates about 3.6× the power of a conventional FF, i.e., about 2.6× more. However, the proposed scheme can achieve a power saving of up to 27% compared to the conventional approach, as explained in Section V-D.

F. Comparison with the existing techniques

The similarities and differences of the existing warning detection schemes are presented in this section. Warning detection can be implemented by two different methods: double sampling and edge detection. The major disadvantage of warning detection using double sampling is the metastability of the FF, which can lead to malfunctioning of the system. The major disadvantage of warning detection using edge detection is partial evaluation of the warning detection, which can lead to a misleading adaptive response. The proposed warning FF is based on the edge detection scheme. The major problem of the existing edge detector based warning FFs is the effective generation of the warning window for monitoring the data edge. The warning window generation scheme in the stability checker [21] depends on the input clock frequency, as discussed in detail in Section III-B. The authors in [24] proposed a specialized clocking circuit for warning window generation, which may lead to an imbalanced clock network. The warning window generation scheme in the proposed approach is simple and independent of the input clock frequency. It does not affect the clock network. A summary of the comparison of the existing techniques is presented in Table VI.

VI. DESIGN FLOW OF WARNING FF

In this section, the design flow of the warning FF is presented. Depending on the process corner, operating point, workload and switching activity of the ASIC, many paths may become critical. Monitoring all the critical paths of the design is not feasible, as it would increase the area and power dissipation of the design. The authors in [24] suggest using a statistical static timing analysis flow to identify a few important critical paths in the design. A set of equivalent replica critical paths of these important paths needs to be created as in [30] and guarded with warning FFs. The replica paths should be activated all the time, which ensures that the warning FF continuously monitors the critical path as required. Finding the best replica critical path is outside the scope of the present paper; the interested reader may refer to [30].

The replica critical paths are created and guarded by warning FFs. The warning signals of multiple replica critical paths are combined using a multiple-input OR gate. The output of the OR gate is fed to the pulse widening circuit in Fig. 8 to generate the wide warning signal. The wide warning signal is used as a trigger signal for the adaptive controller. We recommend that the user not reduce the supply voltage below the first warning voltage. Some margin between the first warning voltage and the first error voltage should be maintained to preserve data integrity at the lowest possible supply voltage. The adaptive controller is used to switch the supply voltage, clock frequency and body bias based on the warning signal. The present test chip does not contain any adaptive controller. However, an adaptive frequency controller has been implemented externally inside the FPGA, which can switch to a different phase-shifted clock based on the number of warning signals. When a warning signal is detected, the data is still sampled properly by the FF. The controller observes the warning signal over many clock cycles (for example, 10,000 cycles), as in [31], and counts the number of warning occurrences. If the count is low, the warning is attributed to a very low-probability effect such as SER. If the count is high, the controller increases the supply voltage by 10 mV after the 10,000 cycles to avoid further warnings. Some representative controllers are presented in [27], [28], [31] for the interested reader.
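The counting policy described above can be sketched behaviourally as follows. This is not the controller implemented in the FPGA; it is a minimal illustration under stated assumptions. The 10,000-cycle window and the 10 mV step come from the text, while the class name `WarningController`, the `tick` interface and the threshold `count_threshold` separating a "low" from a "high" count are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WarningController:
    """Behavioural sketch of a warning-count based supply controller."""

    vdd_mv: int                      # current supply voltage in mV
    window_cycles: int = 10_000      # observation window (from the text)
    step_mv: int = 10                # supply increment (from the text)
    count_threshold: int = 5         # hypothetical "high count" threshold
    _cycles: int = field(default=0, init=False, repr=False)
    _count: int = field(default=0, init=False, repr=False)

    def tick(self, path_warnings) -> int:
        """Advance one clock cycle and return the supply voltage in mV.

        path_warnings holds one warning bit per replica critical path;
        any() models the multiple-input OR gate.
        """
        if any(path_warnings):
            self._count += 1
        self._cycles += 1
        if self._cycles == self.window_cycles:
            # A low count is treated as a rare event and ignored.
            if self._count >= self.count_threshold:
                self.vdd_mv += self.step_mv
            self._cycles = 0
            self._count = 0
        return self.vdd_mv


if __name__ == "__main__":
    ctrl = WarningController(vdd_mv=1080)
    for _ in range(10_000):
        vdd = ctrl.tick([0, 1, 0])   # one path warns on every cycle
    print(vdd)
```

Counting over a window, rather than reacting to a single warning, keeps an isolated event from triggering a supply change.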

We have compared the design with and without the frequency controller inside the FPGA. The design without the frequency controller can operate down to 1.1 V with a critical path delay of 2.03 ns. The design with the frequency controller can operate down to 0.8 V with a critical path delay of 5.09 ns. The controller automatically switches to a different phase-shifted clock and operates down to a supply voltage of 0.8 V. In future, an attempt will be made to design the adaptive controller inside the chip.

### VII. Conclusion

This paper proposes a new metastability-immune warning detection sequential that combines delayed data, an edge detection circuit and a traditional FF. It also presents how to use the warning FF for dynamic voltage and frequency scaling in ASICs, which, unlike processors, typically lack an error recovery mechanism. A test chip was fabricated in a 65 nm technology node to show the usage of the FF in dynamic voltage and frequency scaling applications in ASICs. The results across supply voltage, phase-shifted clocks and number of buffers in the delayed data path show the effectiveness of our approach. The measured results demonstrate that the proposed circuit can track a critical path delay of 2.4 ns to 7.5 ns at a warning voltage of 1.15 V to 0.72 V respectively. Future work includes the design of adaptive controllers with the warning FF in real designs.

### ACKNOWLEDGEMENT

The VLSI chip in this study has been fabricated in the chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with STARC, e-shuttle, Inc., and Fujitsu Ltd.

### REFERENCES


Bishnu Prasad Das received the M.Sc. degree in electronics from Sambalpur University, Orissa, India, in 1999, the M.Tech. degree in computer application from ISM, Dhanbad, India, in 2002 and the Ph.D. degree in electronics from the Indian Institute of Science, Bangalore, India, in 2009. He has been a Post Doctoral Researcher at the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA, since June 2012. He was a Post Doctoral Researcher at Kyoto University, Kyoto, Japan, from Oct 2009 to May 2012. He worked at Texas Instruments, Bangalore, India, during his Ph.D. work under the Texas Instruments University Programme. He was a topper and gold medalist in M.Sc. His research interests include error tolerant circuit design, on-chip test structures for variability measurement, automatic standard cell library generation in scaled technology nodes, modeling under process, voltage and temperature variation, and design for manufacturability.

Hidetoshi Onodera (M’87 - SM’12) received the B.E., M.E., and Dr. Eng. degrees in Electronic Engineering, all from Kyoto University, Kyoto, Japan. He joined the Department of Electronics, Kyoto University, in 1983, and is currently a Professor in the Department of Communications and Computer Engineering, Graduate School of Informatics, Kyoto University. His research interests include design technologies for Digital, Analog, and RF LSIs, with particular emphasis on low-power design, design for manufacturability, and design for dependability.

Dr. Onodera served as the Program Chair and General Chair of ICCAD and ASP-DAC. He was the Chairman of the IPSJ SIG-SLDM (System LSI Design Methodology), the IEICE Technical Group on VLSI Design Technologies, the IEEE SSCS Kansai Chapter, and the IEEE CASS Kansai Chapter. He is currently the Chairman of IEEE Kansai Section. He served as the Editor-in-Chief of IEICE Transactions on Electronics and IPSJ Transactions on System LSI Design methodology.