# Elzar

# Triple Modular Redundancy using Intel AVX

# X

# Dmitrii Kuvaiskii

Oleksii Oleksenko Pramod Bhatotia Christof Fetzer







• Online services run in **huge data centers** 





- Online services run in **huge data centers**
- Hardware faults are the norm rather than the exception





- Online services run in **huge data centers**
- Hardware faults are the norm rather than the exception
- These faults can lead to **arbitrary state corruptions**





- Online services run in huge data centers
- Hardware faults are the norm rather than the exception
- These faults can lead to **arbitrary state corruptions**



"...There were a handful of messages that had **a single bit corrupted** such that the message was **still intelligible**, but the system state information was **incorrect**..."





- Online services run in huge data centers
- Hardware faults are the norm rather than the exception
- These faults can lead to **arbitrary state corruptions**







Amazon Web Services » Service Health Dashboard » Amazon S3 Availability Event: July 20, 2008

#### Amazon S3 Availability Event: July 20, 2008

We wanted to provide some additional detail about the problem we experienced on Sunday, July 20th.

"...There were a handful of messages that had **a single bit corrupted** such that the message was **still intelligible**, but the system state information was **incorrect**..."

#### Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing

Ashish Gupta, Fan Yang, Jason Govig, Adam Kirsch, Kelvin Chan Kevin Lai, Shuo Wu, Sandeep Govind Dhoot, Abhilash Rajesh Kumar, Ankur Agiwal Sanjay Bhansali, Mingsheng Hong, Jamie Cameron, Masood Siddiqi, David Jones Jeff Shute, Andrey Gubarev, Shivakumar Venkataraman, Divyakant Agrawal Google, Inc.

"...Corruption can occur **transiently in CPU** or RAM. Guarding against such corruptions is an **important goal** in Mesa's overall design..."





**Byzantine Fault Tolerance** 



Ad-hoc approaches

#### **Byzantine Fault Tolerance**

- 🙂 Tolerates arbitrary faults
- 😕 Pessimistic fault model
- 😕 High resource overheads



- C Tolerates arbitrary faults
- 😕 Pessimistic fault model
- 😕 High resource overheads

DSN'16

#### Principled approaches



#### **Byzantine Fault Tolerance**

- 🙂 Tolerates arbitrary faults
- 😕 Pessimistic fault model
- 🙁 High resource overheads

#### **Checksums / Assertions**

- 🙂 Low performance overheads
- 😕 Only anticipated faults
- 😕 Manual and error-prone











Native

z = add x, y



| Native |            | TMR                  |  |
|--------|------------|----------------------|--|
| Z      | = add x, y | z = <b>add</b> x, y  |  |
|        |            | $z_2 = add x_2, y_2$ |  |
|        |            | $z_3 = add x_3, y_3$ |  |

# Triple Modular Redundancy



| Native              | TMR                                                          | TMR with SIMD                   |
|---------------------|--------------------------------------------------------------|---------------------------------|
| z = <b>add</b> x, y | z = add x, y<br>$z_2 = add x_2, y_2$<br>$z_3 = add x_3, y_3$ | Zwide = <b>add</b> Xwide, Ywide |



| Native              | TMR                                                          | TMR with SIMD                   |  |
|---------------------|--------------------------------------------------------------|---------------------------------|--|
| z = <b>add</b> x, y | z = add x, y<br>$z_2 = add x_2, y_2$<br>$z_3 = add x_3, y_3$ | Zwide = <b>add</b> Xwide, Ywide |  |

#### Idea SIMD uses less instructions → less perf overhead?

### Can we use **SIMD** for **efficient fault tolerance**?

# Can we use **SIMD** for **efficient fault tolerance**?

Implementation using Intel AVX





#### Implementation Intel AVX for triple modular redundancy



# ImplementationIntel AVX for triple modular redundancyOutcomeMixed results 🔅 (AVX not general-purpose)



| Implementation | Intel AVX for triple modular redundancy          |
|----------------|--------------------------------------------------|
| Outcome        | <b>Mixed results</b> 🙁 (AVX not general-purpose) |
| Investigation  | Two bottlenecks in current AVX                   |

Motivation
Intel AVX
Design
Evaluation
Discussion

# Intel AVX Background

#### Intel AVX Background

256 bit YMM0 YMM1 YMM2 ... YMM15 64-bit 64-bit 64-bit 64-bit

#### New registers

#### Intel AVX Background

256 bit YMM0 YMM1 YMM2 ... YMM15 64-bit 64-bit 64-bit

**New instructions** 

**New registers** 









#### **AVX** is not actively used? → **Reuse for fault tolerance!**



Motivation
Intel AVX
Design
Evaluation
Discussion

# **Fault Model**

### – Protect against **transient faults in CPU**

- corruptions in CPU registers
- miscomputations in CPU execution units

- Memory is protected by other means
  - DRAM protected by ECC
  - CPU caches protected by ECC and parity





#### Native

- x = load a
- z = add x, 1

| Native              | TMR                                                      |  |
|---------------------|----------------------------------------------------------|--|
| x = <b>load</b> a   | x = <b>load</b> a                                        |  |
| z = <b>add</b> x, 1 | z = add x, 1<br>$z_2 = add x_2, 1$<br>$z_3 = add x_3, 1$ |  |
|                     | <pre>majority(z, z<sub>2</sub>, z<sub>3</sub>)</pre>     |  |

| Native              | TMR                                                      | Elzar                   |
|---------------------|----------------------------------------------------------|-------------------------|
| x = <b>load</b> a   | x = <b>load</b> a                                        | x = <b>avx-load</b> a   |
| z = <b>add</b> x, 1 | z = add x, 1<br>$z_2 = add x_2, 1$<br>$z_3 = add x_3, 1$ | z = <b>avx-add</b> x, 1 |
|                     | <pre>majority(z, z<sub>2</sub>, z<sub>3</sub>)</pre>     | <pre>avx-check(z)</pre> |

| Native              | TMR                                                      | Elzar                   |
|---------------------|----------------------------------------------------------|-------------------------|
| x = <b>load</b> a   | x = <b>load</b> a                                        | x = <b>avx-load</b> a   |
| z = <b>add</b> x, 1 | z = add x, 1<br>$z_2 = add x_2, 1$<br>$z_3 = add x_3, 1$ | z = <b>avx-add</b> x, 1 |
|                     | <pre>majority(z, z<sub>2</sub>, z<sub>3</sub>)</pre>     | avx-check(z)            |

#### **Common instructions** introduce lower overhead $\bigcirc$

| Native              | TMR                                                      | Elzar                   |
|---------------------|----------------------------------------------------------|-------------------------|
| x = <b>load</b> a   | x = <b>load</b> a                                        | x = <b>avx-load</b> a   |
| z = <b>add</b> x, 1 | z = add x, 1<br>$z_2 = add x_2, 1$<br>$z_3 = add x_3, 1$ | z = <b>avx-add</b> x, 1 |
|                     | <pre>majority(z, z<sub>2</sub>, z<sub>3</sub>)</pre>     | <pre>avx-check(z)</pre> |

#### Bottleneck 1: Memory accesses require wrappers 🙁

| Native              | TMR                                                      | Elzar                   |
|---------------------|----------------------------------------------------------|-------------------------|
| x = <b>load</b> a   | x = <b>load</b> a                                        | x = avx-load a          |
| z = <b>add</b> x, 1 | z = add x, 1<br>$z_2 = add x_2, 1$<br>$z_3 = add x_3, 1$ | z = avx-add x, 1        |
|                     | <pre>majority(z, z<sub>2</sub>, z<sub>3</sub>)</pre>     | <pre>avx-check(z)</pre> |

#### Bottleneck 2: Checks are expensive 🙁

### x = avx-load a — No such thing as avx-load! 🙁

| a a | а | а |
|-----|---|---|
|-----|---|---|







z = avx-add x, 1
avx-check(z)



#### **Memory accesses** require wrappers 🙁

x = avx-load a
z = avx-add x, 1
avx-check(z)

x = avx-load a
z = avx-add x, 1
avx-check(z)

| z <sub>1</sub> z <sub>2</sub> | <b>z</b> <sub>3</sub> | z <sub>4</sub> |
|-------------------------------|-----------------------|----------------|
|-------------------------------|-----------------------|----------------|











Ζ

z'

I.





#### Checks introduce additional 55% overhead 😕

Motivation
Intel AVX
Design
Evaluation
Discussion

## **Performance overheads**







Average performance overhead is **4**×

# **Performance overheads**



Amortized by very poor memory access pattern



(1) Native benefits from AVX vectorization(2) Most of time spent in mem-intensive bzero()







Elzar performs **46%** worse than SWIFT-R on average 😕



Dominated by memory accesses, Elzar inserts many wrappers 🙁



Motivation
Intel AVX
Design
Evaluation
Discussion

# **Bottlenecks and Proposed Solution**

### **AVX** lacks certain instructions

- Need wrappers for memory accesses
- Need shuffle-xor-ptest for checks

Problem

# **Bottlenecks and Proposed Solution**

### **Problem AVX** lacks certain instructions

- Need wrappers for memory accesses
- Need shuffle-xor-ptest for checks

Solution

### Offload to FPGA accelerator

– Intel's Xeon-FPGA pairing

# **Bottlenecks and Proposed Solution**

### **Problem AVX** lacks certain instructions

- Need wrappers for memory accesses
- Need shuffle-xor-ptest for checks

### Solution

### Offload to **FPGA accelerator**

– Intel's Xeon-FPGA pairing



# **Bottlenecks and Proposed Solution**

### **Problem AVX** lacks certain instructions

- Need wrappers for memory accesses
- Need shuffle-xor-ptest for checks

### Solution

### Offload to **FPGA accelerator**

– Intel's Xeon-FPGA pairing



Potentially **71% better** than SWIFT-R

### Conclusion





### Implementation

Intel AVX for triple modular redundancy



# ImplementationIntel AVX for triple modular redundancyHypothesisLess instructions → less perf overhead



# ImplementationIntel AVX for triple modular redundancyHypothesisLess instructions → less perf overheadOutcome46% worse than SWIFT-R



| Implementation | Intel AVX for triple modular redundancy              |
|----------------|------------------------------------------------------|
| Hypothesis     | <b>Less</b> instructions → <b>less</b> perf overhead |
| Outcome        | 46% worse than SWIFT-R                               |
| Discussion     | With FPGA, <b>~71%</b> better than SWIFT-R           |

### Thank you! dmitrii.kuvaiskii@tu-dresden.de

GitHub repo: https://github.com/tudinfse/elzar



### backup slides



### Intel Haswell CPU microarchitecture



### Intel Haswell CPU microarchitecture



### Elzar: Check on Branch

Native

cmp x, y
jne truebranch

### Elzar: Check on Branch

Native

cmp x, y
jne truebranch



### **Elzar: Check on Branch**

Native

cmp x, y
jne truebranch



Impact Checks on branches introduce only 4% overhead 🙂

#### **Problem AVX** instruction set lacks certain instructions

- Loads/stores require extracting AVX-replicated address
  - Elzar creates expensive wrappers around loads/stores
  - AVX-512 introduces promising **gather/scatter** instructions
- AVX has only one control-flow instruction ptest
  - Elzar has to insert **ptest** for each control-flow decision
  - AVX could add **cmp** instructions directly affecting control flow
- AVX misses integer division, integer truncation, etc...
  - Elzar produces very ineffective code in these cases



### **Estimation of Proposed Changes**



### **Estimation of Proposed Changes**



# **Estimation of Proposed Changes**



(71% improvement over SWIFT-R)