# Response Time Analysis of Synchronous Data Flow Programs on a Many-Core Processor

Hamza Rihani, Matthieu Moy, Claire Maiza, Robert I. Davis, Sebastian Altmeyer

RTNS'16, October 19, 2016













Single-core code generation

```
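/* Node functions (NA ... NF) are assumed to be defined elsewhere and to
   exchange int values; the slide leaves the types implicit. */
int main_app(int i1, int i2)
{
    int na = NA(i1);
    int ne = NE(i2);
    int nb = NB(na);
    int nd = ND(na);
    int nf = NF(ne);
    int o = NC(nb, nd, nf);
    return o;
}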
```

static non-preemptive scheduling
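The generated code above fixes one valid static order for the nodes. As a minimal sketch (not the authors' code generator), and assuming the dependencies that can be read off the calls in `main_app`, such an order can be derived with a topological sort. The `deps` dictionary and the `static_order` helper are illustrative names.

```python
from graphlib import TopologicalSorter

# Each node maps to the nodes whose outputs it consumes, as read off
# the calls in main_app (an assumption about the underlying SDF graph).
deps = {
    "NA": [],
    "NE": [],
    "NB": ["NA"],
    "ND": ["NA"],
    "NF": ["NE"],
    "NC": ["NB", "ND", "NF"],
}

def static_order(deps):
    """Return one execution order in which every node runs after its producers."""
    return list(TopologicalSorter(deps).static_order())

order = static_order(deps)
print(order)
```

Several orders are valid; the one in `main_app` (NA, NE, NB, ND, NF, NC) is one of them.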









#### Contributions

1. Precise accounting for interference on shared resources in a many-core processor
2. Model of a multi-level arbiter to the shared memory
3. Response time and release date analysis respecting dependencies

#### Outline

1. Motivation and Context
2. Models Definition
   - Architecture Model
   - Execution Model
   - Application Model
3. Multicore Response Time Analysis of SDF Programs
4. Evaluation
5. Conclusion and Future Work

## Models Definition



- Kalray MPPA 256 Bostan
- 16 compute clusters + 4 I/O clusters
- Dual NoC



#### Per cluster:

- 16 cores + 1 Resource Manager
- NoC Tx, NoC Rx, Debug Unit
- 16 shared memory banks per cluster (total 2 MB)
- Multi-level bus arbiter per memory bank











- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial isolation: different tasks go to different memory banks
- Interference from communications
- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank

Single phase: execute and write data.

memory access pattern







Two phases: execute then write data.





- Directed Acyclic Task Graph
- Mono-rate (or at least harmonic rates)
- Fixed mapping and execution order

Each task $\tau_i$:

- Processor Demand, Memory Demand
- Release date ($rel_i$), response time ($R_i$)

- Find $R_i$ (including the interference)
- Find $rel_i$ respecting precedence constraints
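The second goal can be phrased as a simple check. The sketch below reads "respecting precedence constraints" as: a consumer is released no earlier than its producer's release date plus response time. The names `edges`, `rel`, `resp` and all numeric values are illustrative assumptions, not data from the slides.

```python
def respects_precedence(edges, rel, resp):
    """True if every consumer is released no earlier than its producer finishes."""
    return all(rel[dst] >= rel[src] + resp[src] for src, dst in edges)

# Hypothetical task graph: (producer, consumer) pairs.
edges = [("NA", "NB"), ("NA", "ND"), ("NB", "NC"), ("ND", "NC")]
resp = {"NA": 40, "NB": 30, "ND": 50, "NC": 20}  # made-up response times R_i
rel = {"NA": 0, "NB": 40, "ND": 40, "NC": 90}    # made-up release dates rel_i

print(respects_precedence(edges, rel, resp))
```

Releasing `NC` any earlier than 90 would violate the edge from `ND`, which finishes at 40 + 50.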

## Multicore Response Time Analysis of SDF Programs

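Response time:

$$R = PD + I^{\mathit{BUS}}(R)$$

where $PD$ is the Processor Demand of the task and $I^{\mathit{BUS}}(R)$ is the delay it suffers on the bus.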









- Recursive formula ⇒ fixed-point algorithm.
- Multiple shared resources (memory banks)

$$I^{\mathit{BUS}}(R) = \sum_{b \in B} I_b^{\mathit{BUS}}(R)$$

where $B$ is the set of memory banks.

Requires a model of the bus arbiter.

## Model of the MPPA Bus



- $I_b^{\mathit{BUS}}$: delay from all accesses + concurrent ones
- $S_i^b$: number of accesses of task $\tau_i$ to bank $b$ (its Memory Demand to bank $b$)
- $A_i^{y,b}$: number of concurrent accesses from core $y$ to bank $b$ (the sum of the overlapping concurrent accesses)

$$Lv_1 = S_i^b$$

$$Lv_2 = Lv_1 + \sum_{y=1}^{15} \min(A_i^{y,b}, Lv_1)$$

$$Lv_3 = Lv_2 + \min(A_i^{G2,b}, Lv_2)$$

$$Lv_4 = Lv_3 + A_i^{G3,b}$$
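The four levels can be evaluated directly. This is a minimal sketch for a single bank, with the last level taken as $Lv_3 + A_i^{G3,b}$ so that each level builds on the previous one. The function name and the sample numbers are illustrative assumptions. The slides do not state how $Lv_4$ is converted into a delay, so the sketch stops at the access count.

```python
def bank_accesses(s, a_cores, a_g2, a_g3):
    """Lv_4 for one bank: the task's accesses plus those that may delay them.

    s       -- S_i^b, accesses of the task to bank b
    a_cores -- A_i^{y,b} for each of the other cores y
    a_g2    -- A_i^{G2,b}, concurrent accesses from group G2
    a_g3    -- A_i^{G3,b}, concurrent accesses from group G3
    """
    lv1 = s
    # Each other core can delay at most as many accesses as the task issues.
    lv2 = lv1 + sum(min(a, lv1) for a in a_cores)
    # Same bounding against group G2, one level up.
    lv3 = lv2 + min(a_g2, lv2)
    # Every access from G3 is counted at the last level.
    lv4 = lv3 + a_g3
    return lv4

# Hypothetical numbers: 10 own accesses, two busy cores, 13 idle ones.
print(bank_accesses(10, [3, 20] + [0] * 13, a_g2=5, a_g3=4))
```

With these numbers the levels are 10, 23, 28 and 32: the core issuing 20 accesses contributes only 10, because of the `min` with $Lv_1$.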







1. Start with initial release dates.
2. Compute response times, iterating until a fixed-point is reached.
3. Update the release dates.
4. Repeat until no release date changes (another fixed-point iteration).
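The four steps can be sketched as two nested loops. This is a simplified illustration, not the authors' algorithm. It assumes a single memory bank, a made-up `BUS_DELAY`, and a toy interference rule in which each task on another core whose execution window overlaps adds at most `min(S_j, S_i)` accesses. The execution order on a core is assumed to be encoded in `preds` together with the data dependencies.

```python
from dataclasses import dataclass, field

BUS_DELAY = 10  # cycles per memory access (assumed value)

@dataclass
class Task:
    name: str
    core: int
    pd: int                                     # Processor Demand (cycles)
    md: int                                     # Memory Demand (accesses)
    preds: list = field(default_factory=list)   # names of predecessor tasks

def overlaps(rel_a, r_a, rel_b, r_b):
    """True if [rel_a, rel_a + r_a) and [rel_b, rel_b + r_b) intersect."""
    return rel_a < rel_b + r_b and rel_b < rel_a + r_a

def response_times(tasks, rel):
    """Step 2: inner fixed point, release dates held constant."""
    R = {t.name: t.pd + BUS_DELAY * t.md for t in tasks}  # isolation values
    changed = True
    while changed:
        changed = False
        for t in tasks:
            extra = sum(min(o.md, t.md) for o in tasks
                        if o.core != t.core
                        and overlaps(rel[t.name], R[t.name], rel[o.name], R[o.name]))
            new_r = t.pd + BUS_DELAY * (t.md + extra)
            if new_r > R[t.name]:   # windows only grow, so R only grows
                R[t.name] = new_r
                changed = True
    return R

def analyse(tasks, max_iter=100):
    """Steps 1 to 4: outer fixed point on the release dates."""
    rel = {t.name: 0 for t in tasks}                      # step 1
    for _ in range(max_iter):
        R = response_times(tasks, rel)                    # step 2
        new_rel = {t.name: max((rel[p] + R[p] for p in t.preds), default=0)
                   for t in tasks}                        # step 3
        if new_rel == rel:                                # step 4
            return rel, R
        rel = new_rel
    raise RuntimeError("no fixed point within max_iter iterations")

tasks = [Task("a", 0, 100, 2), Task("b", 1, 100, 3), Task("c", 0, 50, 1, ["a"])]
rel, R = analyse(tasks)
print(rel, R)
```

The `max_iter` guard is a safety net for the sketch only; it is not part of the method described on the slides.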





- Convergence of the 1<sup>st</sup> fixed-point iteration:
  - Monotonic and bounded
- Convergence of the 2<sup>nd</sup> fixed-point iteration:
  - No monotonicity: $R_i$ and $rel_i$ may grow or shrink at each iteration.

#### **Theorem**

At each iteration, at least one task finds its final release date.

Full proof in our technical report:

http://www-verimag.imag.fr/TR/TR-2016-1.pdf



## Evaluation



- Flight management system controller
- Receive from sensors and transmit to actuators
- Assumptions:
  - Tasks are mapped on 5 cores
  - Debug Support Unit is disabled
  - Context switches are over-approximated constants

<sup>1</sup> Pagetti et al., RTAS 2014

| Task       | Processor Demand (cycles) | Memory Demand (accesses) |
|------------|---------------------------|--------------------------|
| altitude   | 275                       | 22                       |
| az_filter  | 274                       | 22                       |
| h_filter   | 326                       | 24                       |
| va_control | 303                       | 24                       |
| va_filter  | 301                       | 23                       |
| vz_control | 320                       | 25                       |
| vz_filter  | 334                       | 25                       |

Table: Task profiles of the FMS controller

- Profile obtained from measurements
- Memory Demand: data and instruction cache misses + communications
- Moreover:
  - NoC Rx: writes 5 words
  - NoC Tx: reads 2 words
- Experiments: find the smallest schedulable hyper-period
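As a rough illustration of how the profile feeds the analysis, the sketch below computes a response time in isolation (no concurrent access) for each task of the table, reading it as Processor Demand plus Memory Demand times the bus delay. The value of `BUS_DELAY` is an assumption made for the example; the table itself does not give it.

```python
BUS_DELAY = 10  # cycles per access: an assumed value, not given with the table

# task: (Processor Demand in cycles, Memory Demand in accesses), from the table
profiles = {
    "altitude": (275, 22),
    "az_filter": (274, 22),
    "h_filter": (326, 24),
    "va_control": (303, 24),
    "va_filter": (301, 23),
    "vz_control": (320, 25),
    "vz_filter": (334, 25),
}

def isolation_response_time(pd, md, bus_delay=BUS_DELAY):
    """Response time when no other requester competes for the bus."""
    return pd + md * bus_delay

for name, (pd, md) in profiles.items():
    print(name, isolation_response_time(pd, md))
```

Under this assumed delay the memory accesses account for a large share of each bound, which is why the interference on the banks matters for the analysis.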

## **Evaluation: Experiments**



Pessimistic assumption: High priority tasks are bounded by 1 access per bank

Phases are modeled as

Smallest schedulable hyper-period:

- E5: All accesses interfere
- E4, E3: We don't use the release dates
- E2, E1: Our approach. We use the release dates

Taking into account the memory banks improves the analysis by a factor in [1.77, 2.52]





|      | E5/E1 | E5/E2 | E3/E1 | E4/E2 | E2/E1 | E4/E3 |
|------|-------|-------|-------|-------|-------|-------|
| MPPA | 4.15  | 4.12  | 1.68  | 1.29  | ~1.01 | 0.77  |
| RR   | 3.3   | 3.29  | 1.24  | 1.13  | ~1.01 | 0.91  |

Speedup factors

## Conclusion and Future Work

- A response time analysis of SDF on the Kalray MPPA 256
- Given:
  - Task profile
  - Mapping of Tasks
  - Execution Order
- We compute:
  - Tight response times taking into account the interference (model of the multi-level arbiter)
  - Release dates respecting the dependency constraints (double fixed-point algorithm)
- Not restricted to SDF

- Model of the Resource Manager: tighter estimation of context switches and other interrupts.
- Model of the NoC accesses: use the output of …
- Memory access pipelining.
- Model blocking and non-blocking accesses.





# Questions?



Example: Fixed Priority bus arbiter, PE1 > PE0, bus access delay = 10

$$R_0 = 10 + 3 \times 10 = 40$$
 (response time in isolation)

$$R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$$

$$R_2 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 = 80$$

$$R_3 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 + 0 = 80$$
 (fixed-point)

<sup>1</sup> Altmeyer et al., RTNS 2015
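The iteration can be replayed in a few lines. The sketch reads the example as a task on PE0 with a processor demand of 10 and 3 bus accesses, delayed by every PE1 access issued inside the current window. The access instants of PE1 were shown in a figure that is not available here, so `PE1_ACCESS_TIMES` is a hypothetical trace chosen to be consistent with the numbers above (2 accesses before 40, 2 more before 60, none after).

```python
BUS_DELAY = 10                       # bus access delay from the example
PE1_ACCESS_TIMES = [5, 25, 45, 55]   # hypothetical trace, see the note above

def higher_priority_accesses(window, access_times):
    """Count the PE1 accesses issued in [0, window)."""
    return sum(1 for t in access_times if t < window)

def response_time(pd, own_accesses, access_times):
    """Iterate R = PD + I^BUS(R) until two successive values agree."""
    r = pd + own_accesses * BUS_DELAY          # R_0: response time in isolation
    trace = [r]
    while True:
        interfering = higher_priority_accesses(r, access_times)
        nxt = pd + (own_accesses + interfering) * BUS_DELAY
        trace.append(nxt)
        if nxt == r:                           # fixed point reached
            return r, trace
        r = nxt

r, trace = response_time(10, 3, PE1_ACCESS_TIMES)
print(r, trace)
```

The successive values in `trace` correspond to $R_0$ through $R_3$.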

## The Global Picture

