# **HAFT** Hardware-Assisted Fault Tolerance

<u>Dmitrii Kuvaiskii</u> Rasha Faqeh Pramod Bhatotia Christof Fetzer Technische Universität Dresden

Pascal Felber Université de Neuchâtel

Eurosys'16

• Online services run in **huge data centers** 





- Online services run in **huge data centers**
- Hardware faults are the norm rather than the exception





- Online services run in **huge data centers**
- Hardware faults are the norm rather than the exception
- Focus on faults that result in **arbitrary data corruptions**





- Online services run in **huge data centers**
- Hardware faults are the norm rather than the exception
- Focus on faults that result in **arbitrary data corruptions**





Amazon S3 Availability Event http://status.aws.amazon.com/s3-20080720.html

- Online services run in **huge data centers**
- Hardware faults are the norm rather than the exception
- Focus on faults that result in **arbitrary data corruptions**





Amazon S3 Availability Event

http://status.aws.amazon.com/s3-20080720.html

Data corruption with Opteron CPUs https://bugzilla.kernel.org/show\_bug.cgi?id=7768

- Online services run in **huge data centers**
- Hardware faults are the norm rather than the exception
- Focus on faults that result in **arbitrary data corruptions**





Amazon S3 Availability Event

http://status.aws.amazon.com/s3-20080720.html

Data corruption with Opteron CPUs

https://bugzilla.kernel.org/show\_bug.cgi?id=7768

Defective S3 load balancer

https://forums.aws.amazon.com/thread.jspa?threadID=22709

- Online services run in **huge data centers**
- Hardware faults are the norm rather than the exception
- Focus on faults that result in **arbitrary data corruptions**





#### Amazon S3 Availability Event

http://status.aws.amazon.com/s3-20080720.html

#### Data corruption with Opteron CPUs

https://bugzilla.kernel.org/show\_bug.cgi?id=7768

#### Defective S3 load balancer

https://forums.aws.amazon.com/thread.jspa?threadID=22709

#### Google's Mesa Data Warehousing System

" ...corruption can occur **transiently in CPU** or RAM. Guarding against such corruptions is an **important goal** in Mesa's overall design... "





**Byzantine Fault Tolerance** 



Ad-hoc approaches

#### **Byzantine Fault Tolerance**

- Tolerates arbitrary faults
- X Pessimistic fault model
- X High resource overheads



- X Pessimistic fault model
- **X** High resource overheads

Eurosys'16





#### **Byzantine Fault Tolerance**

- Tolerates arbitrary faults
- X Pessimistic fault model
- X High resource overheads

**Checksums / Assertions** 

- Low performance overheads
- X Only anticipated faults
- X Manual and error-prone













- X Non-transparent
  - Manual changes in source code
  - Specific languages / programming models



- 🗙 Non-transparent
  - Manual changes in source code
  - Specific languages / programming models
- × Impractical
  - Only single-threaded programs
  - Only fail-stop execution



- 🗙 Non-transparent
  - Manual changes in source code
  - Specific languages / programming models
- 🗙 Impractical
  - Only single-threaded programs
  - Only fail-stop execution
- × Inefficient
  - Requires spare cores / deterministic execution
  - Memory overhead

#### Practical



- No changes in source code
- Shared-memory programming model

Practical



- No changes in source code
- Shared-memory programming model

Practical

- Multithreaded programs
- Fault detection *and* fault recovery

Efficient

- No changes in source code
- Shared-memory programming model

Practical

- Multithreaded programs
- Fault detection and fault recovery

Efficient

- No spare cores, no deterministic execution
- No memory overhead (rely on ECC)

### - Motivation

- Design
- Optimizations
- Evaluation















### Fault Model

### - Protect against transient faults in CPU

- corruptions in CPU registers
- miscomputations in CPU execution units

|     | 4 |
|-----|---|
|     | × |
| CPU |   |

### **Fault Model**

### – Protect against transient faults in CPU

- corruptions in CPU registers
- miscomputations in CPU execution units

- Memory is protected by other means
  - DRAM protected by ECC
  - CPU caches protected by ECC and parity





#### Native

z = add x, y

store z, 0x10

| Native               | Instr Level Replication<br>ILR  |
|----------------------|---------------------------------|
| z = <b>add</b> x, y  | z = add x, y<br>z2 = add x2, y2 |
| <b>store</b> z, 0x10 | <b>store</b> z, 0x10            |

| Native               | Instr Level Replication<br>ILR                       |
|----------------------|------------------------------------------------------|
| z = <b>add</b> x, y  | z = add x, y<br>z2 = add x2, y2<br>d = cmp neq z, z2 |
| <b>store</b> z, 0x10 | <pre>br d, crash store z, 0x10</pre>                 |

#### HAFT: Code Transformation

| Native               | Instr Level Replication                                      | Transactification<br>Tx                                            |
|----------------------|--------------------------------------------------------------|--------------------------------------------------------------------|
| z = <b>add</b> x, y  | z = add x, y<br>$z^2 = add x^2, y^2$<br>$d = cmp neq z, z^2$ | <pre>tx-begin z = add x, y z2 = add x2, y2 d = cmp neq z, z2</pre> |
| <b>store</b> z, 0x10 | <pre>br d, crash store z, 0x10</pre>                         | <pre>br d, tx-abort store z, 0x10 tx-end</pre>                     |

## Challenge of Transactification

#### **Commodity HTM** (Intel TSX) implementations

- for synchronization **not** fault recovery
- for small-sized well-behaved transactions



## Challenge of Transactification

#### **Commodity HTM** (Intel TSX) implementations

- for synchronization **not** fault recovery
- for small-sized well-behaved transactions

#### Need **right size** of HW transactions

- large and rare  $\rightarrow$  high abort rate
- small and frequent  $\rightarrow$  high perf overhead

| L1 cache |  |
|----------|--|
| 1        |  |
| CPU HTM  |  |

## **Challenge of Transactification**

#### **Commodity HTM** (Intel TSX) implementations

- for synchronization **not** fault recovery
- for small-sized well-behaved transactions

#### Need **right size** of HW transactions

- large and rare  $\rightarrow$  high abort rate
- small and frequent  $\rightarrow$  high perf overhead

#### Solution: dynamic transaction boundaries

- track number of instructions executed
- start new transaction whenever exceed predefined threshold



- Motivation
- <del>Design</del>
- Optimizations
- Evaluation

#### - Motivation

– <del>Design</del>

## – Optimizations

1. Shared memory accesses

- 2. Lock elision
- Evaluation

#### - Motivation

– <del>Design</del>

# – Optimizations

1. Shared memory accesses

- 2. Lock elision
- Evaluation

See the paper for other optimizations

 Motivation Checks on memory accesses are expensive – 40% instructions are loads and stores

 Motivation Checks on memory accesses are expensive – 40% instructions are loads and stores

```
d = cmp neq adr, adr2
br d, xabort
val = load adr
```

- Motivation Checks on memory accesses are expensive
   40% instructions are loads and stores
- Idea Distinguish atomic and non-atomic accesses

```
d = cmp neq adr, adr2
br d, xabort
val = load adr
```

- Motivation Checks on memory accesses are expensive
   40% instructions are loads and stores
- Idea Distinguish atomic and non-atomic accesses

#### Explicit atomic

d = cmp neq adr, adr2
br d, xabort
val = load atomic adr

Protected non-atomic

val = **load** adr val2 = **load** adr2

- Motivation Checks on memory accesses are expensive
   40% instructions are loads and stores
- Idea Distinguish atomic and non-atomic accesses
- Impact Up to 40% better performance

#### Explicit atomic

d = cmp neq adr, adr2
br d, xabort
val = load atomic adr

Protected non-atomic

val = **load** adr val2 = **load** adr2

### **Optimization 2: Lock Elision**

# • Motivation Small critical sections are expensive – **3** transactions for each

Tiny critical section

...
lock(&l)
counter++
unlock(&l)

• • •

# • Motivation Small critical sections are expensive – **3** transactions for each



- Motivation Small critical sections are expensive **3** transactions for each
- Idea Use Tx for recovery *and* lock elision



- Motivation Small critical sections are expensive **3** transactions for each
- Idea Use Tx for recovery *and* lock elision

**Tiny critical section** 

```
...
lock_wrapper(&l)
counter++
unlock_wrapper(&l)
```

• •

- Motivation Small critical sections are expensive **3** transactions for each
- Idea Use Tx for recovery *and* lock elision
- Impact Up to 30% better throughput

```
Tiny critical section
```

```
...
lock_wrapper(&l)
counter++
unlock_wrapper(&l)
```

• •

- Motivation
- <del>Design</del>
- Optimizations
- Evaluation

- Motivation
- <del>Design</del>
- <del>Optimizations</del>
- Evaluation
  - 1. Performance overheads
  - 2. Reliability
  - 3. Real-world application: Memcached

- Motivation
- <del>Design</del>
- <del>Optimizations</del>
- Evaluation
  - 1. Performance overheads
  - 2. Reliability
  - 3. Real-world application: Memcached

See the paper for more results









Amortized by very low cache locality



(1) small-sized transactions and (2) high original ILP







Out of 2500 fault injections, 83% led to data corruption in output





Injected faults: only 1.1% undetected

# Availability







Out of 2500 fault injections, **12%** resulted in correct execution





# Injected faults: **91.2**% corrected (best-effort nature of Intel TSX)









### Summary

#### Transparent

- no changes in source code
- general programming model

#### Transparent

- no changes in source code
- general programming model
- Practical
  - Shared-memory multithreaded programs
  - Fault detection *and* fault recovery

#### Transparent

- no changes in source code
- general programming model
- Practical
  - Shared-memory multithreaded programs
  - Fault detection *and* fault recovery
- Efficient
  - Low performance overhead
  - Relies on commodity-hardware HTM (Intel TSX)

# **Thank you!** dmitrii.kuvaiskii@tu-dresden.de

Source code available: https://github.com/tudinfse/haft

Eurosys'16

### Backup slides



[Designing Reliable Systems from Unreliable Components, S. Borkar, Micro'05]

## Performance evaluation



#### Average performance overhead is **2**× (less with more threads)

Eurosys'16



HAFT outperforms SEI by **30-40**%

### HAFT: Run-time Execution



tx-end

## Comparison with State-of-the-Art

| Approach                           | Resources                             | Multith. | Perf overhead                                 | Fault coverage                                  |
|------------------------------------|---------------------------------------|----------|-----------------------------------------------|-------------------------------------------------|
| <b>PLR [8]</b><br>DSN'07           | 2-3× memory usage<br>2-3× spare cores | No       | Detection: 16.9%<br>Recovery: 41.1%           | Detection: very high<br>Recovery: N/A           |
| <b>SWIFT [6]</b><br>CGO'05         | _                                     | No       | Detection: 41%<br>Recovery: N/A               | Detection: high<br>Recovery: N/A                |
| <b>Shoestring [5]</b><br>ASPLOS'10 | _                                     | No       | Detection: 15-<br>30%<br>Recovery: N/A        | Detection: medium<br>Recovery: N/A              |
| <b>DAFT [10]</b><br>PACT'10        | 2× spare cores                        | No       | Detection: 38%<br>Recovery: N/A               | Detection: high<br>Recovery: N/A                |
| <b>RAFT [9]</b><br>CGO'12          | 2× memory usage<br>2× spare cores     | No       | Detection: 3%<br>Recovery: N/A                | Detection: very high<br>Recovery: N/A           |
| <b>RomainMT [4]</b><br>EMSOFT'14   | 2-3× memory usage<br>2-3× spare cores | Yes      | Detection: 13-<br>22%<br>Recovery: 24-<br>65% | Detection: N/A<br>Recovery: N/A                 |
| <b>SEI [2]</b><br>NSDI'15          | –<br>(manual code changes)            | Yes      | Detection: 20-<br>50%<br>Recovery: N/A        | Detection: very high<br>Recovery: N/A           |
| <b>HAFT</b><br>(this work)         | _                                     | Yes      | Detection: <b>52%</b><br>Recovery: <b>89%</b> | Detection: <b>high</b><br>Recovery: <b>high</b> |



#### Limitations:

- X Non-transparent
  - Manual changes in source code [1] [2]
  - Specific languages / programming models [3]
- 🗙 Impractical
  - Only single-threaded programs [1] [5-10]
  - Only fail-stop execution [1] [2] [5] [6] [8-10]
- × Inefficient
  - Requires spare cores / deterministic execution [4] [8-10]
  - Memory overhead [8] [9]

### References

- [1] M. Correia, D. G. Ferro, F. P. Junqueira, and M. Serafini. *Practical hardening of crash-tolerant systems*. ATC'12
- [2] D. Behrens, M. Serafini, S. Arnautov, F. P. Junqueira, and C. Fetzer. Scalable error isolation for distributed systems. NSDI'15
- [3] P. Bhatotia, A. Wieder, R. Rodrigues, F. Junqueira, and B. Reed. *Reliable data-center scale computations.* LADIS'10
- [4] B. Döbel and H. Härtig. Can we put concurrency back into redundant multithreading? EMSOFT'14
- [5] S. Feng, S. Gupta, A. Ansari, and S. Mahlke. Shoestring: Probabilistic soft error reliability on the cheap. ASPLOS'10
- [6] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. *SWIFT: Software implemented fault tolerance*. CGO'05
- [7] G. A. Reis, J. Chang, and D. I. August. *Automatic instruction-level software-only recovery.* Micro'07
- [8] A. Shye, T. Moseley, V. Reddi, J. Blomstedt, and D. Connors. Using process-level redundancy to exploit multiple cores for transient fault tolerance. DSN'07
- [9] Y. Zhang, S. Ghosh, J. Huang, J. W. Lee, S. A. Mahlke, and D. I. August. *Runtime asynchronous fault tolerance via speculation.* CGO'12
- [10] Y. Zhang, J. W. Lee, N. P. Johnson, and D. I. August. *DAFT: Decoupled acyclic fault tolerance*. PACT'10