PAPER

# **Effective Domain Partitioning for Multi-Clock Domain IP Core Wrapper Design under Power Constraints**

**Thomas Edison YU**†a)**,** *Nonmember***, Tomokazu YONEDA**†**,** *Member***, Danella ZHAO**††**,** *Nonmember***,** *and* **Hideo FUJIWARA**†**,** *Fellow*

**SUMMARY** The rapid advancement of VLSI technology has made it possible for chip designers and manufacturers to embed the components of a whole system onto a single chip, called System-on-Chip or SoC. SoCs make use of pre-designed modules, called IP-cores, which provide faster design time and quicker time-to-market. Furthermore, SoCs that operate at multiple clock domains and very low power requirements are being utilized in the latest communications, networking and signal processing devices. As a result, the testing of SoCs and multi-clock domain embedded cores under power constraints has been rapidly gaining importance. In this research, a novel method for designing power-aware test wrappers for embedded cores with multiple clock domains is presented. By effectively partitioning the various clock domains, we are able to increase the solution space of possible test schedules for the core. Since previous methods were limited to concurrently testing all the clock domains, we effectively remove this limitation by making use of bandwidth conversion, multiple shift frequencies and properly gating the clock signals to control the shift activity of various core logic elements. The combination of the above techniques gains us greater flexibility when determining an optimal test schedule under very tight power constraints. Furthermore, since it is computationally intensive to search the entire expanded solution space for the possible test schedules, we propose a heuristic 3-D bin packing algorithm to determine the optimal wrapper architecture and test schedule while minimizing the test time under power and bandwidth constraints.

*key words: multi-clock domain, wrapper design, SoC, embedded core test, test scheduling*

# **1. Introduction**

The recent popularity of advanced technologies such as broadband internet, 3-G cellular phones and high-speed workstations is due to many factors, one of which is the rapid advancement in the design and production of VLSI chips. More importantly, it has now become possible to put entire systems onto a single chip which is commonly known as System-on-Chip (SoC). Currently, SoCs are widely used in devices intended for telecommunications, networking and digital signal processing. Moreover, they are increasingly being utilized in mobile on-the-field devices, which increase the demand for highly-reliable, defect-free chips that have very low power requirements.

To ensure that SoCs and other VLSI chips operate as intended, testing must be conducted per chip. As production capabilities improve, clock rates rise exponentially and transistor density increases dramatically, the cost of testing newer VLSI chips also becomes higher. More specifically, the increased complexity of the circuitry in SoCs also means an increase in the amount of test data, which usually results in longer test application time. Furthermore, test access becomes a problem since the cores cannot be directly accessed from the I/O pins of the chip. Moreover, modern IP cores operate at various frequencies internally, which has advantages such as reduced power and silicon area. These *multi-clock domain cores* present clock skew and at-speed testing problems. Clock skew problems arise from unsynchronized clock sources, such as two signals of the same frequency but different clock trees that arrive out of sync at their destinations, thereby causing data corruption. Furthermore, power consumption and heat during test have become a big issue because of high switching activity during scan-shift operations as well as the possibility of more components being active than during normal operation.

†The authors are with the Computer Design and Test Lab, Nara Institute of Science and Technology, Ikoma-shi, 630–0101 Japan.

††The author is with The Center For Advanced Computer Studies, University of Louisiana at Lafayette, USA.

a) E-mail: tomasu-y@is.naist.jp

DOI: 10.1093/ietisy/e91–d.3.807

The most common Design-for-Testability (DFT) used for SoCs with multiple cores is the design of a test data delivery method, more commonly known as Test Access Mechanism (TAM), and the use of *wrappers*, which enables independent core testing. The wrapper isolates a core during test and provides an interface to apply and collect test data from it. More recently, the IEEE 1500 standard for embedded core test has been approved to provide guidelines for core wrapper design and interfacing to TAMs. Furthermore, several approaches to optimize wrapper designs for single frequency embedded core testing [1], [2] as well as wrapper and TAM co-optimization algorithms [3], [4], [11], [12] have already been suggested. Still, these approaches don't directly address the problem of testing multi-clock domain cores.

To address this problem, we propose an IEEE 1500 compliant power-aware multi-clock domain core wrapper that partitions the IP core into smaller sub-groups and utilizes gated-clocks to control the start times of scan-shift operations and enable a more flexible and efficient use of the external bandwidth under a power constraint compared to previous methods. A heuristic 3-D rectangular bin packing algorithm is also introduced, which forms the basis of the proposed wrapper design method.

# **2. Related Work**

Most at-speed multi-clock domain core testing techniques that have been proposed are based on BIST [5]–[7] and utilize techniques such as programmable capture windows [5] and direct control of separate launch and capture clocks [6] to solve the clock-skew problems while still allowing at-speed testing. The first non-BIST based multi-clock domain core wrapper design for IP cores was proposed in [8], where the core was divided into its clock domains, which were called virtual cores. Single frequency wrapper design was performed on each virtual core to assign a virtual wrapper to each of them. Virtual test bus lines from each virtual core are connected to the external TAM via demultiplexing and multiplexing interfaces to synchronize the flow of the test data. The method employs a single separate shift clock for all virtual cores, and it is multiplexed with the capture clock signals. In [9], the authors of [8] improved upon their design by allowing each virtual core to have a distinct shift frequency. In both [8] and [9], all virtual cores are concurrently active, so under a tight power constraint the shift frequencies may have to be lowered, which leads to large increases in test time. In [10], the use of gated clocks to control the start and end times of the shift activity of each virtual core was proposed. The authors used a 3-D bin packing algorithm which grouped virtual cores into *shelves*: all cores belonging to the same shelf become active at the same time, and the shelves become active sequentially. This gives more flexibility during scheduling but can result in significant idle time, because available bandwidth and power cannot be used until the scan operation of the current shelf group is done.

Manuscript received November 15, 2006.

Manuscript revised August 7, 2007.

This paper proposes a wrapper design method which uses a novel 3-D bin packing algorithm as its basis. The design allows for two things which improve upon previous work: (i) each clock domain can be further partitioned into sub domains and (ii) a virtual core can be activated as soon as bandwidth and power are available.

The rest of the paper is organized as follows. The overview of the proposed wrapper architecture and its scan-control block is given in Sect. 3. Section 4 gives the problem formulation and Sect. 5 discusses the proposed 3-D bin packing algorithm. Section 6 discusses the experimental results and compares them with the results of previous work, and Sect. 7 concludes this paper.

# **3. Multi-Clock Domain Core Wrapper (MDCW)**

In this paper, it is assumed that the clock domains of the IP core can be further divided into sub-domains. Furthermore, we assume that the following information is provided by the core designer:

- $P_{max}$: Maximum allowed peak or average power dissipation (during scan)
- $N_C$: Number of clock domains
- $N_{si}$: Number of sub-domains for each clock domain $D_i$ ($1 \le i \le N_C$)

For each sub-domain $S_{ij}$ ($1 \le j \le N_{si}$) of clock domain $D_i$:

- $pi_{ij}$: Number of primary input pins
- $po_{ij}$: Number of primary output pins
- $bi_{ij}$: Number of bidirectional I/O
- $sc_{ij}$: Number of internal scan chains and their lengths $l_{ijk}$ ($1 \le k \le sc_{ij}$)
- $p_{ij}$: Power dissipation at ATE frequency $f_{ATE}$

We now extend the definition of the virtual core from [8]. A group of one or more sub-domains from $D_i$ is a virtual core $G_{ip}$ ($1 \le p \le N_{gi}$), where $N_{gi}$ is the total number of possible combinations of the sub-domains of $D_i$. Furthermore, to differentiate it from the virtual core defined in [8]–[10], we refer to the virtual core in this paper as *P-vc*, short for *Partitioned virtual core*, which denotes the fact that it can represent one complete clock domain or just a part of it.

*Wrapper Architecture* The basic architecture of the proposed multi-clock domain core wrapper is shown in Fig. 1. The core is divided into smaller *P-vc*'s, each having its own virtual wrapper. These *P-vc*'s are connected to the external TAM via *virtual test bus* lines and a pair of *de-multiplexing* and *multiplexing interface circuits* (DMIC-MIC) that perform bandwidth matching and test data flow-control between the external TAM and the internal virtual test bus lines.

*Scan Control Block* The scan control circuitry for the proposed wrapper is shown in Fig. 2. It is assumed that there are $m$ external clock sources available, either from the Automatic Test Equipment (ATE) or on-chip PLL circuits. Furthermore, an external signal, *TestStart*, is triggered when the wrapper enters test mode. Normally, the test application process is divided into two phases, the scan phase and the capture phase. For this work, test data is scanned into the scan chains of each *P-vc* during the scan phase at a frequency that can be different from their functional frequencies. These scan clock signals are generated by the *Clock Divider* in Fig. 2. We add a *gate and MUX control circuit* to control $N_{tot}$ (the total number of *P-vc*'s) gated clock signals during test as well as to switch to the functional clocks during normal operation. In the timing diagram of Fig. 3, test data is scanned into $P\text{-}vc_1$ until time $t_1$, when it becomes inactive and $P\text{-}vc_2$ starts the scan operation at a different frequency until time $t_2$. At $t_2$, the process begins the capture phase.



**Fig. 1** Proposed multi-clock domain wrapper.

Grouping the wrapper scan chains according to their clock domains prevents the effects of clock skew during this phase. To avoid clock skew at the capture phase, the test enters this phase only after the last bit of test data is scanned in. In the capture phase, all the scan chains are driven by their functional clocks to simulate normal operation. For multi-clock domain cores, a capture window similar to that proposed in [8], [9], as shown in Fig. 3, is used during this phase, and the *P-vc*'s are activated in such a way that inter-domain and intra-domain signals are properly captured for analysis. A comparison of schedules obtained in [9], [10] and by our method is shown in Fig. 4. The use of gated clocks enables a more flexible and efficient test schedule compared to the concurrent shifting method in [9]. Unlike the shelf method in [10], our scheme allows partitioned testing with more flexibility and less wasted bandwidth during scheduling. Since we are using a capture window as proposed in [8], [9], this work focuses only on minimizing the shift time during the scan phase.

**Fig. 2** Proposed scan control block.

# **4. Problem Formulation**

The wrapper design problem  $P_{MDCW}$  is defined as follows:

Problem $P_{MDCW}$: Given the test parameters for a multi-clock domain core $C$ as described in Sect. 3 and the information below:

- $W_{ext}$: TAM width allotted to the core
- $f_{ATE}$: ATE frequency
- $F = \{f_1, f_2, \ldots, f_m \mid f_j = 2 \times f_{j+1},\ j \in \{1, \ldots, m-1\}\}$: The set of allowed shift frequencies

determine the multi-clock domain wrapper design for $C$ including:

- $N_{vi}$: Number of *P-vc*'s $G_{ip}$ ($1 \le p \le N_{vi}$) under domain $D_i$

For each *P-vc* $G_{ip}$ of clock domain $D_i$:

- $f_{ip} \in F$: Shift frequency
- $w_{ip}$: Number of virtual bus lines
- Its wrapper scan chain design and $l_{ipmax}$, which is the length of its longest wrapper scan chain
- $t_{sip}$: Scan-in start time

such that the following constraints are satisfied and the test application time is minimized:

1. The bandwidth used by all the active *P-vc*'s at any time *t* cannot exceed the total bandwidth coming from the ATE:

$$
W_{ext} \times f_{ATE} \ge \max_{0 \le t \le T} \left( \sum_{i=1}^{N_C} \sum_{p=1}^{N_{vi}} a_{ipt} f_{ip} w_{ip} \right) \tag{1}
$$

where

$$
a_{ipt} = \begin{cases} 0, & t < t_{sip} \ \text{or} \ t > t_{sip} + l_{ipmax} \times \frac{f_{ATE}}{f_{ip}} \\ 1, & t_{sip} \le t \le t_{sip} + l_{ipmax} \times \frac{f_{ATE}}{f_{ip}} \end{cases}
$$

2. The total power dissipation of all active *P-vc*'s at any time *t* cannot exceed the maximum allowed power dissipation *Pmax*:

$$
P_{max} \ge \max_{0 \le t \le T} \left( \sum_{i=1}^{N_C} \sum_{p=1}^{N_{vi}} a_{ipt} p_{ip} \times \frac{f_{ip}}{f_{ATE}} \right) \tag{2}
$$

Since all the *P-vc*'s become active and inactive independently, the total scan-shift time for one test pattern can be computed from the *P-vc* with the latest end time as shown below:

$$
T = \max_{1 \le i \le N_C} \left( \max_{1 \le p \le N_{vi}} \left( t_{sip} + l_{ipmax} \times \frac{f_{ATE}}{f_{ip}} \right) \right) \tag{3}
$$
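The constraints in Eqs. (1) and (2) and the objective in Eq. (3) can be checked mechanically for a candidate schedule. The following is a minimal sketch, not the authors' implementation; the class `ScheduledPvc`, its field names and the two functions are hypothetical names introduced for illustration. Because a *P-vc* contributes to the sums only between its start and end times, the total usage can only rise at a start time, so the sketch evaluates both sums at every start time.

```python
from dataclasses import dataclass


@dataclass
class ScheduledPvc:
    """One scheduled P-vc (field names are illustrative, not from the paper)."""
    t_start: float   # t_sip, scan-in start time in ATE clock cycles
    l_max: int       # l_ipmax, length of the longest wrapper scan chain
    f_shift: float   # f_ip, shift frequency (same unit as f_ate)
    width: int       # w_ip, number of virtual bus lines
    p_at_ate: float  # p_ip, power dissipation when shifting at f_ATE

    def t_end(self, f_ate):
        # End of the active interval used in the definition of a_ipt.
        return self.t_start + self.l_max * f_ate / self.f_shift


def scan_shift_time(pvcs, f_ate):
    """Eq. (3): the latest end time over all P-vc's."""
    return max(p.t_end(f_ate) for p in pvcs)


def is_feasible(pvcs, w_ext, f_ate, p_max):
    """Check Eqs. (1) and (2) at every start time."""
    for t in (p.t_start for p in pvcs):
        active = [p for p in pvcs if p.t_start <= t <= p.t_end(f_ate)]
        bandwidth = sum(p.f_shift * p.width for p in active)
        power = sum(p.p_at_ate * p.f_shift / f_ate for p in active)
        if bandwidth > w_ext * f_ate or power > p_max:
            return False
    return True


# Hypothetical schedule: the numbers below are made up for illustration.
a = ScheduledPvc(t_start=0, l_max=100, f_shift=100, width=8, p_at_ate=2000)
b = ScheduledPvc(t_start=0, l_max=50, f_shift=50, width=8, p_at_ate=1500)
print(scan_shift_time([a, b], f_ate=100))
print(is_feasible([a, b], w_ext=16, f_ate=100, p_max=3000))
```

In this made-up schedule, halving the shift frequency of `b` halves its power contribution, which is what lets both *P-vc*'s shift at the same time under the power limit.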

Each *P-vc* can have a distinct shift frequency and, to simplify the clock generation circuitry, the ratio between any two usable frequencies is a power of two. Our initial experiments showed that completely dividing cores into sub-domains does not always yield shorter test times, so we developed a 3-D bin packing algorithm to determine how and when the domains should be partitioned to minimize the scan-shift time $T$ while also optimizing the final number of *P-vc*'s.

# **5. Proposed Test Scheduling Algorithm**

Rectangular 2-D bin packing, as in [4], has been extensively used to solve the test scheduling problem for embedded cores. For scheduling under a power constraint, it can be extended into a restricted 3-D bin packing problem where the length, width and height represent the pin count, peak power and total test time, respectively, of an SoC [12]. Specifically, a core (or in our case, a *P-vc*) is represented as a cube whose length, width and height represent the allotted TAM width, power dissipation and test time, and cubes that overlap in the time dimension cannot overlap in either of the other two dimensions. In this paper, instead of pin count or TAM width, the length represents the external bandwidth $BW_{ext} = W_{ext} \times f_{ATE}$, and each *P-vc* $G_{ip}$ of an IP core $C$ can be represented by one cube among a set of permissible cubes. The cubes are packed into the 3-D bin until all the sub-domains $S_{ij}$ of each domain $D_i$ have been included in the packed *P-vc*'s while minimizing the total height. Since it has been shown in [12] that the restricted 3-D bin packing problem is *NP-hard*, we propose the following heuristic algorithm to solve the problem.

**Table 1** hTCADT01 clock and sub-domain information.

| sd# | $f_{func}$ (MHz) | $N_{in}$ | $N_{out}$ | $N_{bi}$ | $P$ | $N_{sc}$ | $Lsc_{ij}$ |
|-----|------------------|----------|-----------|----------|------|----------|--------------------|
| 1.1 | 200 | 50 | 8 | 18 | 668 | 4 | 168, 168, 166, 166 |
| 1.2 | 200 | 25 | 8 | 18 | 652 | 4 | 163, 163, 163, 163 |
| 1.3 | 200 | 25 | 8 | 18 | 648 | 4 | 162, 162, 162, 162 |
| 1.4 | 200 | 9 | 8 | 18 | 604 | 4 | 151, 151, 151, 151 |
| 2.1 | 133 | 144 | 67 | 72 | 450 | 3 | 150, 150, 150 |
| 3.1 | 120 | 39 | 4 | 24 | 465 | 5 | 93, 93, 93, 93, 93 |
| 3.2 | 120 | 30 | 3 | 24 | 279 | 3 | 93, 93, 93 |
| 3.3 | 120 | 20 | 1 | 24 | 186 | 2 | 93, 93 |
| 4.1 | 75 | 61 | 10 | 36 | 657 | 3 | 219, 219, 219 |
| 4.2 | 75 | 50 | 21 | 36 | 657 | 3 | 219, 219, 219 |
| 5.1 | 50 | 87 | 200 | 36 | 1563 | 3 | 521, 521, 521 |
| 5.2 | 50 | 30 | 24 | 36 | 1042 | 2 | 521, 521 |
| 6.1 | 33 | 96 | 10 | 24 | 278 | 5 | 82, 81, 81, 17, 17 |
| 6.2 | 33 | 30 | 18 | 24 | 198 | 4 | 82, 81, 18, 17 |
| 6.3 | 33 | 20 | 40 | 24 | 100 | 2 | 82, 18 |
| 7.1 | 25 | 15 | 30 | 72 | 40 | 4 | 10, 10, 10, 10 |

## 5.1 Initialization: Cube Creation and Ordering

To illustrate the various steps in the algorithm, the benchmark multi-clock domain IP core *hTCADT01*, first introduced in [9] and shown in Table 1, is used throughout the following sections. This IP core has seven clock domains. In Table 1, *sd*# denotes the sub-domain number, while $P$ is the power dissipation of the sub-domain when shifting at 100 MHz; to simplify the setup, $P$ is made equal to the sum of all the scan chain lengths $Lsc_{ij}$ belonging to that sub-domain. Further details are given in Sect. 6.

*P-vc*'s can be created to represent any combination of sub-domains belonging to the same clock domain. If a domain $D_i$ has $N_{si}$ sub-domains, then the total number of possible *P-vc*'s from $D_i$ is the sum of all the possible combinations of its sub-domains $S_{ij}$:

$$
N_{gi} = \sum_{j=1}^{N_{si}} {}_{N_{si}}C_{j} \tag{4}
$$
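Since Eq. (4) sums the binomial coefficients over every non-empty group size, it evaluates to $2^{N_{si}} - 1$. The short sketch below enumerates the candidate *P-vc*'s of one clock domain to make this concrete; the function name `enumerate_pvcs` and the string labels are illustrative choices, not part of the paper.

```python
from itertools import combinations


def enumerate_pvcs(sub_domains):
    """Return every non-empty combination of the sub-domains of one domain."""
    return [
        combo
        for size in range(1, len(sub_domains) + 1)
        for combo in combinations(sub_domains, size)
    ]


# Domain 1 of Table 1 has four sub-domains, so N_g1 = 2**4 - 1 = 15.
pvcs = enumerate_pvcs(["1.1", "1.2", "1.3", "1.4"])
print(len(pvcs))  # 15
print(pvcs[:5])
```

The count grows exponentially with the number of sub-domains, which is one reason the scheduling step relies on a heuristic rather than on trying every grouping.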

Single frequency wrapper design such as in [3] is performed on all *P-vc*'s. In [3], given the allotted virtual test bus width, the I/O boundary scan cells and internal scan chains are connected into wrapper scan chains in such a way that the length of the longest wrapper scan chain, $l_{ipmax}$, is minimized. $l_{ipmax}$ determines the test time for the core (Eq. (7)) and it can vary depending on the allotted test bus width. For example, the $l_{ipmax}$ for domain 7 in Table 1 at $w_{ip} = 15$ and $w_{ip} = 3$ is 10 and 48, respectively. At $f_{ip} = 100$ MHz, this gives test times of $0.10\,\mu$s and $0.48\,\mu$s, respectively. We denote the maximum number of virtual test bus lines that can be assigned to a *P-vc* as $Vtb_{max}$. Each *P-vc* $G_{ip}$ ($1 \le p \le N_{gi}$) can have a maximum of $Vtb_{max}$ possible wrapper designs, and the same number of cubes is created. From the list of cubes of $G_{ip}$, the cube with the shortest test time regardless of power or bandwidth constraints is selected as its ideal cube. It was proven in [10] that halving the shift frequency of the *P-vc* while doubling the virtual test bus width results in either an increase in scan-shift time or no change at all. We take advantage of the latter case to maintain the test time while reducing the power dissipation of the core. Each *P-vc* $G_{ip}$ is expressed as a triplet $C_{ip} = \{BW_{ip}, p_{ip}, t_{ip}\}$, where $BW_{ip} = w_{ip} \times f_{ip}$ is the bandwidth of $G_{ip}$, $w_{ip}$ is the virtual test bus width assigned to it, and $f_{ip} \in F$ represents its shift frequency. The power dissipation $p_{ip}$ at $f_{ip}$ can be expressed as:

$$
p_{ip} = \frac{p_{ipmax} \times f_{ip}}{f_1} \tag{5}
$$

where $f_1$ is the maximum allowed shift frequency. In this paper, we set $f_1 = f_{ATE}$, and $p_{ipmax}$ is the power dissipation at $f_1$ as expressed below:

$$
p_{ipmax} = \sum_{j=1}^{N_{si}} b_{ij} p_{ij} \tag{6}
$$

where $b_{ij} = 1$ when sub-domain $S_{ij}$ belongs to $G_{ip}$ and $b_{ij} = 0$ if not, and $p_{ij}$ is the power dissipation of $S_{ij}$ at $f_{ATE}$. The minimum test time $t_{ip}$ can be computed as follows:

$$
t_{ip} = \frac{l_{ipmax}}{f_{ip}}\tag{7}
$$
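To make the relation between the test bus width, $l_{ipmax}$ and Eqs. (5) and (7) concrete, the sketch below estimates $l_{ipmax}$ for one *P-vc*. It is only an approximation under stated assumptions: a longest-first greedy assignment stands in for the wrapper design heuristic of [3], bidirectional cells are assumed to count on both the scan-in and the scan-out side, and all function names are hypothetical. For domain 7 of Table 1 it reproduces the values quoted above (10 at $w_{ip} = 15$ and 48 at $w_{ip} = 3$).

```python
import heapq


def longest_wrapper_chain(scan_chains, n_in, n_out, n_bi, width):
    """Estimate l_ipmax for one P-vc given `width` virtual test bus lines.

    Assumption: a longest-first greedy assignment stands in for the
    wrapper design heuristic of [3]; it is not the published method.
    """
    # Assign internal scan chains, longest first, to the shortest wrapper chain.
    chains = [0] * width
    heapq.heapify(chains)
    for length in sorted(scan_chains, reverse=True):
        heapq.heappush(chains, heapq.heappop(chains) + length)

    def fill(base, cells):
        # Add boundary cells one at a time to the currently shortest chain.
        heap = list(base)
        heapq.heapify(heap)
        for _ in range(cells):
            heapq.heappush(heap, heapq.heappop(heap) + 1)
        return max(heap)

    scan_in = fill(chains, n_in + n_bi)    # input and bidirectional cells
    scan_out = fill(chains, n_out + n_bi)  # output and bidirectional cells
    return max(scan_in, scan_out)


def shift_power(p_at_f1, f_shift, f1):
    """Eq. (5): power scales linearly with the shift frequency."""
    return p_at_f1 * f_shift / f1


def shift_time_us(l_max, f_shift_mhz):
    """Eq. (7): shift time in microseconds for a frequency given in MHz."""
    return l_max / f_shift_mhz


# Sub-domain 7.1 of Table 1: four chains of length 10, 15 in, 30 out, 72 bidir.
for w in (15, 3):
    l_max = longest_wrapper_chain([10, 10, 10, 10], 15, 30, 72, w)
    print(w, l_max, shift_time_us(l_max, 100))
```

The greedy fill is enough to show why $l_{ipmax}$ does not shrink in exact proportion to the width: internal scan chains cannot be split, so a wide bus may leave some wrapper chains shorter than others.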

The ideal cubes of *P-vc*'s representing whole domains that satisfy the bandwidth and power restrictions are put into a master cube list $L_M$. The remaining ideal cubes that are not in $L_M$ but satisfy the bandwidth and power constraints are then added to it. Finally, the ideal cubes of *P-vc*'s representing only one sub-domain are added to $L_M$ to ensure that all sub-domains are tested in the final schedule. Then, an area attribute $A_{ip} = BW_{ip} \times t_{ip}$ is computed for each cube of the *P-vc*'s representing single sub-domains. Since a bigger area attribute makes a cube harder to pack, sub-domain groups that contain big sub-domains are prioritized: $L_M$ is sorted from the cube of the *P-vc* whose member sub-domain has the biggest area attribute to the one with the smallest. If two cubes have same-sized sub-domains, the overall area attributes of the cubes themselves are compared during sorting. After the list preparations, bin packing starts from Step 1.

## 5.2 Step 1: Packing Domain Cubes

Before packing, the algorithm takes note of the current time in the schedule, denoted by a variable *curr_time*. Since the algorithm only divides domain *P-vc*'s when necessary, in this step it only looks at cubes representing whole domains in $L_M$ until it finds a cube that has $p_{ip} \le p_{avail}$ and $BW_{ip} \le BW_{avail}$, where $p_{avail}$ is the available power and $BW_{avail}$ is the available bandwidth. If a cube is found and packed into the bin, the algorithm checks which sub-domains belong to it and updates $L_M$ by removing all cubes that have at least one member sub-domain in common with the packed *P-vc*. It continues the above process until (a) $BW_{avail} = 0$ and/or $p_{avail} = 0$ or (b) a proper cube cannot be found. Under condition (a), the algorithm looks among the currently scheduled *P-vc*'s for the one with the earliest test end time, sets *curr_time* equal to it and repeats Step 1. Under condition (b), the algorithm proceeds to Step 2. For the benchmark core *hTCADT01* [9] shown in Table 1 with external bandwidth $BW_{ext} = 1600$ and $P_{max} = 3000$, Step 1 packs the *P-vc* of domain 5 (the *P-vc* that combines sub-domains 5.1 and 5.2 from Table 1) with power $p = 2605$, as shown in Fig. 5.

**Fig. 5** Packing results from steps 1-3.
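The selection and pruning in Step 1 can be sketched as a first-fit scan over the master list. This is a simplified illustration rather than the authors' code: the `Cube` class, its fields and `pack_step1` are hypothetical names, and the bandwidth values in the example are made up (only the power values 2605, 1563 and 2572 follow from Table 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cube:
    """One candidate cube; field names are illustrative."""
    sub_domains: frozenset  # sub-domains covered by the P-vc
    bandwidth: float        # BW_ip = w_ip * f_ip
    power: float            # p_ip
    whole_domain: bool      # True if the P-vc covers a complete clock domain


def pack_step1(master_list, p_avail, bw_avail):
    """Pick the first whole-domain cube that fits, then prune the list."""
    for cube in master_list:
        fits = cube.power <= p_avail and cube.bandwidth <= bw_avail
        if cube.whole_domain and fits:
            # Drop every cube that shares a sub-domain with the packed one.
            remaining = [c for c in master_list
                         if not (c.sub_domains & cube.sub_domains)]
            return cube, remaining
    return None, master_list  # condition (b): proceed to Step 2


# Bandwidths below are hypothetical; powers are sums of P from Table 1.
d5 = Cube(frozenset({"5.1", "5.2"}), 600, 2605, True)
s51 = Cube(frozenset({"5.1"}), 300, 1563, False)
d1 = Cube(frozenset({"1.1", "1.2", "1.3", "1.4"}), 800, 2572, True)

packed, rest = pack_step1([d5, s51, d1], p_avail=3000, bw_avail=1600)
print(packed.power, len(rest))
```

After domain 5 is packed, only 395 units of power remain, so a second call with the reduced budget finds no whole-domain cube and the search would move on to Step 2.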

## 5.3 Step 2: Packing Sub-Domain Group Cubes

Not finding a cube in Step 1 means that partitioning a domain is necessary. In Step 2 the algorithm only looks at the cubes not tried in Step 1, which represent sub-domain groups, until it finds a cube that satisfies $p_{ip} \le p_{avail}$ and $BW_{ip} \le BW_{avail}$. If a cube is found and packed into the bin, the algorithm again checks which sub-domains belong to it and updates $L_M$. Step 2 is repeated until the available power or bandwidth is used up or a proper cube cannot be found. If $BW_{avail}$ and/or $p_{avail}$ become zero, *curr_time* is again updated and the algorithm goes back to Step 1. But if a proper cube cannot be found, the algorithm proceeds to Step 3. In Fig. 5, Step 2 is illustrated when the *P-vc* of sub-domain group $S_{61} \cup S_{63}$ of domain 6 is packed into the bin.

## 5.4 Step 3: Filling Idle Space by Decreasing Virtual Test Bus Lines

In Step 3, the algorithm searches for the packed cube using the biggest bandwidth and denotes its test end time as $T_{endmax}$. $L_M$ is then traversed for a cube whose power does not exceed $p_{avail}$. The algorithm then determines the new scan-shift time $t_{ipnew}$ for the cube being tried given a new bandwidth $BW_{ipnew} = BW_{avail}$. Because of the limited selection of usable shift frequencies, choosing the next lowest $f_k$ would automatically lead to a doubling of the scan-shift time. So for Step 3, the assigned virtual test bus width is decreased and $t_{ipnew}$ is computed. The shift frequency is then halved and the virtual test bus width is doubled to minimize power as long as $t_{ipnew}$ remains constant. If $t_{ipnew} \le T_{endmax} \times (1 + dmax)$, then the cube is packed into the bin and $L_M$ is updated as before. *dmax* is a heuristic value (in %) which expresses how much $t_{ipnew}$ can go over $T_{endmax}$. The algorithm repeats Step 3 until there is no available bandwidth and/or power or no suitable cubes can be found. If $BW_{avail}$ and/or $p_{avail}$ become zero, *curr_time* is updated and the algorithm goes back to Step 1. If no suitable cubes were found, the algorithm proceeds to Step 4. For example, the ideal cube of domain 7 has $f_{ip} = 100$ MHz, $w_{ip} = 15$, $p_{ip} = 40$ and $t_{ip} = 0.10\,\mu$s. In Fig. 5, the idle bandwidth was 300 MHz and, for *P-vc* 7, $w_{ip}$ was first reduced to 3 at $f_{ip} = 100$ MHz. Consequently, $t_{ip}$ increased to $0.48\,\mu$s. While keeping $t_{ip}$ constant, we are able to increase $w_{ip}$ to 12 while decreasing the shift frequency $f_{ip}$ to 25 MHz, and the power $p_{ip}$ was lowered to 10 before packing the cube into the bin.

**Fig. 6** Packing result when domain 2 is packed in step 4.
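The numbers in the domain 7 example of Step 3 can be checked against Eqs. (5) and (7). The arithmetic below uses only values quoted in the text, except that $l_{ipmax} = 12$ at $w_{ip} = 12$ is inferred from $t_{ip}$ staying at $0.48\,\mu$s rather than stated explicitly:

$$
\begin{aligned}
t_{ip}\big|_{w_{ip}=3,\ f_{ip}=100\,\mathrm{MHz}} &= \frac{48}{100\,\mathrm{MHz}} = 0.48\,\mu\mathrm{s}, \\
t_{ip}\big|_{w_{ip}=12,\ f_{ip}=25\,\mathrm{MHz}} &= \frac{12}{25\,\mathrm{MHz}} = 0.48\,\mu\mathrm{s}, \\
BW_{ip} &= 12 \times 25\,\mathrm{MHz} = 300\,\mathrm{MHz}, \\
p_{ip} &= \frac{40 \times 25}{100} = 10.
\end{aligned}
$$

Both configurations use the same 300 MHz of bandwidth and finish at the same time, but the second one dissipates a quarter of the power.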

## 5.5 Step 4: Filling Idle Space by Decreasing Shift Frequency

Reaching Step 4 means that no cube satisfies $p_{avail}$. There is no choice but to lower the shift frequency of a *P-vc* to fit the available idle space in the bin. The algorithm determines $T_{endmax}$ and makes a copy of $L_M$ called $L_{tmp}$. Then the shift frequencies of all cubes remaining in $L_{tmp}$ are lowered until their power is less than or equal to $p_{avail}$. For each cube, the new scan-shift time $t_{ipnew}$ is computed given a new bandwidth $BW_{ipnew} = BW_{avail}$. The shift frequency is halved and the virtual test bus width is doubled to minimize power as long as $t_{ipnew}$ remains constant. Similar to Step 3, the algorithm looks for a cube that satisfies $t_{ipnew} \le T_{endmax} \times (1 + emax)$ and is closest to $T_{endmax}$, as we found during experimentation that this gives better results than simply packing the first cube that satisfies the first condition. The cube is packed into the bin and $L_M$ is updated as before. Step 4 is repeated until no cubes are found or until $BW_{avail}$ and/or $p_{avail}$ becomes zero. The algorithm then updates *curr_time* and goes back to Step 1. Note that *emax* is a heuristic value independent of *dmax* which expresses how much $t_{ipnew}$ can go over $T_{endmax}$. Steps 1 through 4 are repeated until $L_M$ is empty. In Fig. 6, although there is a large $BW_{avail}$, $p_{avail}$ is only 302 (see Fig. 7(b)), so the $f_{ip}$ of *P-vc* 2 is decreased to 50 MHz; this lowers $p_{ip}$ to 225 while $w_{ip}$ remains the same and $t_{ip}$ doubles to $3.00\,\mu$s. Figure 7 shows the finished schedule separated into two 2-D graphs, (a) bandwidth vs. time and (b) power vs. time, to make it easier to see that there are times when available bandwidth could not be utilized due to very low available power. Also note that by partitioning the clock domains, we increase the chance that a *P-vc* can be scheduled (due to the presence of smaller cubes) during instances of low available power and/or bandwidth.

**Fig. 7** Finished test schedule for hTCADT01 at $BW_{ext} = 1600$ and $P_{max} = 3000$: (a) bandwidth vs. time, (b) power vs. time.
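The frequency-lowering part of Step 4 follows directly from Eq. (5): each halving of the shift frequency halves the power. The sketch below shows only that part, under the assumption that the allowed frequencies are successive halvings of $f_1$; the function name `lower_until_fits` and the parameter `f_min` (the lowest frequency in $F$) are introduced here for illustration.

```python
def lower_until_fits(p_at_f1, f1, p_avail, f_min):
    """Halve the shift frequency until the Eq. (5) power fits p_avail.

    Returns (f_shift, power), or None if even f_min draws too much power.
    `f_min` is a hypothetical parameter naming the lowest frequency in F.
    """
    f_shift = f1
    while f_shift >= f_min:
        power = p_at_f1 * f_shift / f1
        if power <= p_avail:
            return f_shift, power
        f_shift /= 2
    return None


# Domain 2 of Table 1 dissipates 450 at 100 MHz; p_avail = 302 as in Fig. 6.
print(lower_until_fits(450, 100, 302, 12.5))  # (50.0, 225.0)
```

With the bus width unchanged, the single halving from 100 MHz to 50 MHz is what doubles the scan-shift time in the example above.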

# **6. Experimental Results**

The experiments were done using the benchmark multi-clock domain IP core *hTCADT01*, which has seven clock domains, and we assumed that each domain can be further partitioned into sub-domains as shown in Table 1. In the table, $sd\#$ denotes the sub-domain number, $f_{func}$ is the functional frequency, $N_{in}$, $N_{out}$, $N_{bi}$ and $N_{sc}$ are the numbers of inputs, outputs, bidirectional I/Os and scan chains in the specific sub-domain, respectively, $Lsc_{ij}$ is the length of each scan chain, and $P$ is the power dissipation of the sub-domain when shifting at 100 MHz, made equal to the sum of all the scan chain lengths $Lsc_{ij}$ belonging to that sub-domain to simplify the setup.

The experiment was conducted under four different power constraints  $P_{max}$ : 1500, 3000, 4500 and  $\infty$ . The maximum allowed frequency  $f_1$  is 100 MHz, which is equal to  $f_{ATE}$  to synchronize the internal shift frequencies with the ATE. To see the effectiveness of the proposed method, the **Table 2** Comparison of scan-shift time for hCADT01 under various *Pmax*.









resulting shift times denoted by  $T_{new}$  are compared to the results from [9], marked  $T$  [9], and from [10], marked  $T$  [10], in Table 2. All times are in microseconds. %*dif f* [9] and %*dif f* [10] are the differences in percentage with respect to [9] and [10] respectively. During the experiments, *dmax* and *emax* were independently varied from 0% to 200% to find the optimal combination that yields a minimum scan-shift time. The experiments were done using a Sun Fire V490 1.35 GHz UltraSPARC IV workstation with 32 GB memory and the 40,000 looped reruns of the program didn't take more than 1 sec. of CPU time.

At the tightest power constraint of  $P_{max} = 1500$ , our algorithm was able to decrease the shift time with a maximum of 24.42% compared to [9]. The average gain in test time, as *Pmax* is decreased, increases dramatically from 8.69% at  $P_{max} = \infty$  to 18.36% at  $P_{max} = 1500$ . This can be attributed to the fact that wider *Wext* and lower *Pmax* makes partitioning the domains more effective because of the extra freedom it gives during scheduling. Compared to [10], the trend is different and there is an almost constant average

<sup>1</sup> 116.04 98.44 98.63 15.00 <sup>−</sup>0.<sup>19</sup> gain of around 4% across all power constraints and a maximum gain of 14.25% at  $P_{max} = 3000$ . Small differences in time (0-1%) are attributed to discrepancies in rounding-off among the programs used and makes them negligible, so our algorithm matches or exceeds [10] in 90% of the cases.

In [8], [9], the area increase due to the demultiplexingmultiplexing circuitry, scan-control modules and other necessary logic to implement the multi-clock domain wrapper was stated to be less than 10% the area size taken by the IEEE 1500 wrapper and other scan logic. Since our approach only requires a slight modification of the scan control circuitry in [8], [9], it is safe to assume that the added overhead would be minimal. Furthermore, as manufacturing processes become smaller and transistor count becomes higher, the probable DFT overhead becomes more and more negligible in light of the possible gains in test application time. Also, the added flexibility of domain partitioning and partitioned test scheduling would be of greater benefit as designers start to re-use older generation multi-clock domain circuits as IP-cores in newer, more complex designs.

813



| | $T$ [9] | $T$ [10] | $T_{new}$ | %*diff* [9] | %*diff* [10] |
|---|---|---|---|---|---|
| 3 | 38.36 | 34.06 | 34.08 | 11.16 | -0.06 |
| 2 | 58.02 | 50.36 | 50.48 | 13.00 | -0.24 |
| 1 | 116.04 | 98.44 | 98.63 | 15.00 | -0.19 |

#### **7. Conclusion**

We have presented a novel method of designing a test wrapper for multi-clock domain cores by effectively utilizing clock domain partitioning, bandwidth matching and gated clocks. With minimal hardware overhead for gated-clock control, we have improved substantially upon earlier methods that concurrently activate all the domains during test, especially under tight power constraints, where the shift time was reduced by up to 24.42% compared to [9]. Overall, the division of clock domains enabled us to give better results than the previous methods with comparable area overhead.

#### **Acknowledgements**

This work was supported in part by the Japan Society for the Promotion of Science (JSPS) under Grants-in-Aid for Scientific Research (B) (No. 15300018), by JSPS under Grants-in-Aid for Young Scientists (B) (No. 18700046), and by the JSPS Research Fellowship grant (No. S06089).

#### **References**

- [1] E.J. Marinissen, S.K. Goel, and M. Lousberg, "Wrapper design for embedded core test," Proc. IEEE International Test Conference (ITC), pp.911–920, 2000.
- [2] S.K. Goel and E.J. Marinissen, "Effective and efficient test architecture design for SoCs," Proc. IEEE International Test Conference (ITC), pp.529–538, 2002.
- [3] V. Iyengar, K. Chakrabarty, and E.J. Marinissen, "Test wrapper and test access mechanism co-optimization for system-on-chip," J. Electron. Test., Theory Appl., vol.18, pp.213–230, April 2002.
- [4] V. Iyengar, K. Chakrabarty, and E.J. Marinissen, "Test access mechanism optimization, test scheduling, and tester data volume reduction for system-on-chip," IEEE Trans. Comput., vol.52, no.12, pp.1619–1632, Dec. 2003.
- [5] G. Hetherington, T. Fryars, N. Tamarapalli, M. Kassab, A. Hassan, and J. Rajski, "Logic BIST for large industrial designs: Real issues and case studies," Proc. IEEE International Test Conference (ITC), pp.358–367, 1999.
- [6] K. Hatayama, M. Nakao, and Y. Sato, "At-speed built-in test for logic circuits with multiple clocks," Proc. 11th Asian Test Symposium (ATS), pp.292–297, 2002.
- [7] V. Jain and J. Waicukauski, "Scan test volume reduction in multiclocked designs with safe capture technique," Proc. IEEE International Test Conference (ITC), pp.148–153, 2002.
- [8] Q. Xu and N. Nicolici, "Wrapper design for testing IP cores with multiple clock domains," Proc. Design, Automation and Test in Europe (DATE), pp.416–421, 2004.
- [9] Q. Xu, N. Nicolici, and K. Chakrabarty, "Multi-frequency wrapper design and optimization for embedded cores under average power constraints," Proc. ACM/IEEE Design Automation Conference (DAC), pp.123–128, 2005.
- [10] D. Zhao and U. Chandran, "Design of a time-gated multi-frequency wrapper architecture for modular SoC testing," Proc. IEEE 15th North Atlantic Test Workshop (NATW), May 2006.
- [11] Y. Xia, M. Chrzanowska-Jeske, B. Wang, and M. Jeske, "Using a distributed rectangle bin-packing approach for core-based SoC test scheduling with power constraints," Proc. International Conference on Computer-Aided Design (ICCAD), pp.100–105, 2003.
- [12] Y. Huang, et al., "Optimal core wrapper width selection and SOC test scheduling based on 3-D bin packing algorithm," Proc. IEEE International Test Conference (ITC), pp.74–82, 2002.



**Thomas Edison Yu** received his B.S. degree in Physics from Ateneo de Manila University, Philippines, in 2000. He received his B.S. degree in Computer Engineering from the same university in 2001. In 2006, he received his M.E. degree in Information Science from the Nara Institute of Science and Technology, Japan, and is currently pursuing a doctorate degree at the same institute. His research interests include SoC, embedded core-based systems, and low power system design and testing. He is also a student member of IEEE.



**Tomokazu Yoneda** received the B.E. degree in information systems engineering from Osaka University, Osaka, Japan, in 1998, and the M.E. and Ph.D. degrees in information science from Nara Institute of Science and Technology, Nara, Japan, in 2001 and 2002, respectively. Presently he is an assistant professor in the Graduate School of Information Science, Nara Institute of Science and Technology. His research interests include VLSI CAD, design for testability, and SoC test scheduling. He is a member of the IEEE Computer Society.



**Danella Zhao** is an Assistant Professor with the Center for Advanced Computer Studies (CACS), University of Louisiana at Lafayette. She received her Ph.D. and M.S. degrees in Computer Sciences and Engineering from the State University of New York at Buffalo in 2004 and 2001, respectively. She gained her B.E. degree from Zhejiang University, China. Dr. Zhao's current research interests are in the broad area of computer-aided design and test with special emphasis on system-on-chip (SoC) design and test, high-performance intra-chip interconnect and communication, nanoscale application and system architecture, and design automation techniques for MEMS/NEMS and biochips. Dr. Zhao received the Japan Society for the Promotion of Science (JSPS) Fellowship Award in 2006 and is a member of IEEE.



**Hideo Fujiwara** received the B.E., M.E., and Ph.D. degrees in electronic engineering from Osaka University, Osaka, Japan, in 1969, 1971, and 1974, respectively. He was with Osaka University from 1974 to 1985 and Meiji University from 1985 to 1993, and joined Nara Institute of Science and Technology in 1993. Presently he is a Professor at the Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan. His research interests are logic design, digital systems design and test, VLSI CAD and fault tolerant computing, including high-level/logic synthesis for testability, test synthesis, design for testability, built-in self-test, test pattern generation, parallel processing, and computational complexity. He is the author of *Logic Testing and Design for Testability* (MIT Press, 1985). He has received many awards, including the Okawa Prize for Publication, IEEE CS (Computer Society) Meritorious Service Awards, the IEEE CS Continuing Service Award, and the IEEE CS Outstanding Contribution Award. He has served as an editor and associate editor of several journals, including the IEEE Trans. on Computers and the Journal of Electronic Testing: Theory and Applications, and as a guest editor of several special issues of the IEICE Transactions on Information and Systems. Dr. Fujiwara is a fellow of the IEEE, a Golden Core member of the IEEE Computer Society, and a fellow of the IPSJ.