

## Reliability of multi-core systems-on-chip by interacting Markovian agents

<u>D. Cerotti<sup>1</sup></u>, A. Miele<sup>2</sup>, M. Gribaudo<sup>2</sup>, A. Bobbio<sup>1</sup>, C. Bolchini<sup>2</sup>

1 DiSIT, Università del Piemonte Orientale

2 DEIB, Politecnico di Milano

UNIVERSITÀ DEL PIEMONTE ORIENTALE

### Motivations

- Advances in chip manufacturing processes have
  - shrunk transistor dimensions to tens of nanometers
  - increased aging phenomena
- Integrating several components on the same chip increases
  - current densities
  - operating temperature
- These effects reduce device reliability and lifetime



### Modeling challenges

- The model should handle:
  - Realistic non-Markovian reliability and lifetime distributions
  - Workload variations
  - Interdependencies between:
    - Workload → Temperature
    - Temperature → Reliability
    - Geometry-dependent heat diffusion among cores




$$\begin{cases} \frac{d \boldsymbol{\pi}_{c}^{D}(t;v)}{dt} = \boldsymbol{\pi}_{c}^{D}(t;v) \boldsymbol{K}_{c}(t;v;[\boldsymbol{\Pi}_{V}]) \\ \frac{d \boldsymbol{\pi}_{c}^{R}(t;v)}{dt} = \boldsymbol{d}_{c}(t;v;\boldsymbol{\Pi}_{V}) \end{cases}$$

InfQ 2018 - November 23, 2018



#### Aging model


- The lifetime reliability of a digital component is modeled as a Weibull distribution\*

$$R(t,T) = e^{-\left(\frac{t}{\alpha(T)}\right)^{\beta}}$$

with

$$\alpha(T)_{EM} = \frac{A_0 (J - J_{\text{crit}})^{-n} e^{\frac{E_a}{kT}}}{\Gamma\left(1 + \frac{1}{\beta}\right)}$$

The numerator is Black's equation for the electromigration mean time to failure; dividing by $\Gamma\left(1 + \frac{1}{\beta}\right)$ turns that mean into the Weibull scale parameter.

When the workload changes, T may change as well, but *reliability is conserved* across the change.
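A minimal sketch of one way to read reliability conservation, assuming a single temperature change at time `t_switch` that moves the scale parameter from `alpha_old` to `alpha_new` while β stays fixed. Requiring R to be continuous at the switch gives an equivalent age under the new temperature. The function names and all numeric values are illustrative assumptions, not taken from the slides.

```python
import math


def weibull_reliability(t, alpha, beta):
    """R(t) = exp(-(t / alpha)^beta)."""
    return math.exp(-((t / alpha) ** beta))


def equivalent_age(t_switch, alpha_old, alpha_new):
    """Time under alpha_new that gives the same reliability as
    t_switch under alpha_old: t_eq / alpha_new = t_switch / alpha_old."""
    return t_switch * alpha_new / alpha_old


def reliability_after_switch(t, t_switch, alpha_old, alpha_new, beta):
    """Reliability at t >= t_switch when alpha changes once at t_switch."""
    t_eq = equivalent_age(t_switch, alpha_old, alpha_new)
    return weibull_reliability(t_eq + (t - t_switch), alpha_new, beta)


# Hypothetical values: a core runs cool, then hot (smaller alpha).
beta = 2.0
alpha_cool, alpha_hot = 10.0, 5.0
t_switch = 3.0

r_before = weibull_reliability(t_switch, alpha_cool, beta)
r_after = reliability_after_switch(t_switch, t_switch, alpha_cool, alpha_hot, beta)
```

Here `r_before` and `r_after` agree at the switching instant, so the reliability curve has no jump; only its slope changes once the hotter scale parameter takes over.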



\* JEDEC Solid State Technology Association. Failure mechanisms and models for semiconductor devices. JEDEC Publication JEP122G, 2010



#### Aging model

$$A = H_Y(t) = \int_0^t h_Y(u) du$$

[Figure: state-transition diagram with transition rates $\lambda(v,\Pi_{V})$, $\delta(v,\Pi_{V})$, $\gamma(v,\Pi_{V})$ and $\mu_{I}(v,\Pi_{V})$, and states labeled F.]

$$f_W(t) = \frac{\beta}{\alpha} \left(\frac{t}{\alpha}\right)^{\beta-1} e^{-\left(\frac{t}{\alpha}\right)^{\beta}} \qquad F_W(t) = 1 - e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$

$$h_W(t) = \frac{\beta}{\alpha} \left(\frac{t}{\alpha}\right)^{\beta-1} \qquad H_W(t) = \left(\frac{t}{\alpha}\right)^{\beta}$$

$$H_W^{-1}(A) = \alpha A^{\frac{1}{\beta}} \qquad \lambda_W(A) = \frac{\beta}{\alpha} A^{\frac{\beta-1}{\beta}}$$
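The Weibull formulas above can be checked numerically. The sketch below codes $h_W$, $H_W$, $H_W^{-1}$ and $\lambda_W$ as written and confirms that the age-dependent rate equals the hazard at the matching time, $\lambda_W(H_W(t)) = h_W(t)$. The helper `advance_age` is a hypothetical addition: it advances the age A over an interval with constant α by mapping back to the time domain, which is one possible way to track A when α changes between intervals. Parameter values are illustrative.

```python
def hazard(t, alpha, beta):
    """h_W(t) = (beta / alpha) * (t / alpha)^(beta - 1)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1.0)


def cumulative_hazard(t, alpha, beta):
    """H_W(t) = (t / alpha)^beta."""
    return (t / alpha) ** beta


def inverse_cumulative_hazard(age, alpha, beta):
    """H_W^{-1}(A) = alpha * A^(1 / beta)."""
    return alpha * age ** (1.0 / beta)


def age_rate(age, alpha, beta):
    """lambda_W(A) = (beta / alpha) * A^((beta - 1) / beta)."""
    return (beta / alpha) * age ** ((beta - 1.0) / beta)


def advance_age(age, dt, alpha, beta):
    """Hypothetical helper: advance the age A by dt at constant alpha,
    going through the equivalent time H_W^{-1}(A)."""
    t_eq = inverse_cumulative_hazard(age, alpha, beta)
    return cumulative_hazard(t_eq + dt, alpha, beta)


# Illustrative parameters, not taken from the slides.
alpha, beta = 8.0, 2.5
t = 3.0
age = cumulative_hazard(t, alpha, beta)

rate_in_age_domain = age_rate(age, alpha, beta)
rate_in_time_domain = hazard(t, alpha, beta)
```

Stepping through `advance_age` rather than integrating $\lambda_W(A)$ with a naive Euler scheme avoids getting stuck at A = 0, where $\lambda_W(0) = 0$ for β > 1.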


#### Thermal model

$$\begin{split} U_{S}(i,j) &= \pi_{W}(t;v_{i+1,j}) + \pi_{W}(t;v_{i-1,j}) + \\ &+ \pi_{W}(t;v_{i,j+1}) + \pi_{W}(t;v_{i,j-1}) \\ U_{C}(i,j) &= \pi_{W}(t;v_{i+1,j+1}) + \pi_{W}(t;v_{i+1,j-1}) + \\ &+ \pi_{W}(t;v_{i-1,j+1}) + \pi_{W}(t;v_{i-1,j-1}) \\ T_{W}(i,j) &= c_{0} + c_{2} \cdot U_{S}(i,j) + c_{3} \cdot U_{C}(i,j) \\ T_{I}(i,j) &= c_{0} + c_{1} + c_{2} \cdot U_{S}(i,j) + c_{3} \cdot U_{C}(i,j) \end{split}$$
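The thermal equations can be sketched as follows, assuming the cores sit on a rectangular grid and that `pi_w[i][j]` holds $\pi_W(t; v_{i,j})$ at a fixed time. Two things here are assumptions rather than statements from the slides: neighbours outside the grid contribute 0, and the coefficients `C0` to `C3` are placeholder values chosen only to make the example run.

```python
def neighbor_sums(pi_w, i, j):
    """Return (U_S, U_C) for core (i, j).

    U_S sums pi_W over the four side neighbours, U_C over the four
    corner neighbours. Assumption: cells outside the grid count as 0.
    """
    rows, cols = len(pi_w), len(pi_w[0])

    def at(r, c):
        # Explicit bounds check so negative indices do not wrap around.
        return pi_w[r][c] if 0 <= r < rows and 0 <= c < cols else 0.0

    u_s = at(i + 1, j) + at(i - 1, j) + at(i, j + 1) + at(i, j - 1)
    u_c = (at(i + 1, j + 1) + at(i + 1, j - 1)
           + at(i - 1, j + 1) + at(i - 1, j - 1))
    return u_s, u_c


def core_temperatures(pi_w, i, j, c0, c1, c2, c3):
    """Return (T_W, T_I) for core (i, j), following the slide's equations."""
    u_s, u_c = neighbor_sums(pi_w, i, j)
    t_w = c0 + c2 * u_s + c3 * u_c
    t_i = c0 + c1 + c2 * u_s + c3 * u_c
    return t_w, t_i


# Placeholder coefficients (hypothetical, not fitted values).
C0, C1, C2, C3 = 40.0, 1.0, 5.0, 2.0

# Assumed 6 x 6 layout with a uniform pi_W of 0.4 on every core.
grid = [[0.4] * 6 for _ in range(6)]

t_w_center, t_i_center = core_temperatures(grid, 2, 2, C0, C1, C2, C3)
t_w_corner, t_i_corner = core_temperatures(grid, 0, 0, C0, C1, C2, C3)
```

With a uniform grid, an interior core sees four side and four corner neighbours while a corner core sees only two and one, so the geometry alone produces different temperatures for cores under the same workload.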



- In the first scenario we assume:
  - A uniform workload over a 36-core system
  - Per-core utilization of 40%





- In the second scenario we assume:
  - A 144-core CPU with primary and spare cores
  - Per-core utilization of 60% and 90%




