Schedule - Reliability Seminar

Note that the schedule is tentative; please check regularly for changes.

Some resources may require access permission from your institution (e.g., ACM Library). You need to connect to the UVA on-campus network or VPN to view their contents.

Week 1	Week 2	Week 3	Week 4	Week 5	Week 6	Week 7	Week 8
8/22	8/29	9/5	9/12	9/19	9/26	10/3	10/10
8/24	8/31	9/7	9/14	9/21	9/28	10/5	10/12

Week 9	Week 10	Week 11	Week 12	Week 13	Week 14	Week 15	Week 16
10/17	10/24	10/31	11/7	11/14	11/21	11/28	12/5
10/19	10/26	11/2	11/9	11/16	11/23	11/30

Course Introduction

Tue 8/22

Course Overview

Reliability

Thu 8/24

J. Gray, Why Do Computers Stop and What Can Be Done About It?, Technical Report 85.7, Tandem Computers, June 1985.
B. Maurer, Fail at Scale, Communications of the ACM (CACM), Vol. 58 No. 11, Pages 44-49, November 2015.
Optional:
- A. Chou et al., An Empirical Study of Operating Systems Errors, In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ‘01), Oct. 2001.

Cloud Failures

Tue 8/29

H. S. Gunawi et al., Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages, In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC ‘16), October 2016.
S. Ghosh et al., How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service, In Proceedings of the 13th Symposium on Cloud Computing (SoCC ‘22), November 2022.

Deadline for Team Registration! See Canvas Announcement.

Challenges

Thu 8/31 Hardware Faults

T. Do et al., Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems, In Proceedings of the 4th ACM Symposium on Cloud Computing (SoCC ‘13), October 2013.
P. H. Hochschild et al., Cores that Don’t Count, In Proceedings of the 18th Workshop on Hot Topics in Operating Systems (HotOS-XVIII), May 2021.
Optional:
- H. S. Gunawi et al., Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems, In Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST ‘18), February 2018.

Tue 9/5 Software Bugs

D. Yuan et al., Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems, In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘14), October 2014.
H. S. Gunawi et al., What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC ‘14), November 2014.
Optional:
- S. Lu et al., Learning from Mistakes — A Comprehensive Study on Real World Concurrency Bug Characteristics, In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ‘08), March 2008.

Thu 9/7 Misconfigurations

Z. Yin et al., An Empirical Study on Configuration Errors in Commercial and Open Source Systems, In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP ‘11), Oct. 2011.
T. Xu et al., Early Detection of Configuration Errors to Reduce Failure Damage, In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘16), November 2016.

Tue 9/12 Human Mistakes

A. B. Brown and D. A. Patterson, Undo for Operators: Building an Undoable E-mail Store, In Proceedings of the 2003 USENIX Annual Technical Conference (USENIX ATC ‘03), June 2003.
K. Nagaraja et al., Understanding and Dealing with Operator Mistakes in Internet Services, In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ‘04), December 2004.

Thu 9/14 Overloads

L. Huang et al., Metastable Failures in the Wild, In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘22), July 2022. (Guest lecturer)
H. Zhou et al., Overload Control for Scaling WeChat Microservices, In Proceedings of the 2018 ACM Symposium on Cloud Computing (SoCC ‘18), Oct. 2018.

Tue 9/19 Network Issues

A. Alquraan et al., An Analysis of Network-Partitioning Failures in Cloud Systems, In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘18), October 2018.
P. Bailis and K. Kingsbury, The Network Is Reliable, Communications of the ACM (CACM), Vol. 57 No. 9, Pages 48-55, September 2014.

Thu 9/21 Scale

F. McSherry et al., Scalability! But at what COST?, In Proceedings of the 15th Workshop on Hot Topics in Operating Systems (HotOS-XV), May 2015.
C. A. Stuardo et al., ScaleCheck: A Single-Machine Approach for Discovering Scalability Bugs in Large Distributed Systems, In Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST ‘19), Feb. 2019.

Deadline for Project Proposal Submission!

Tue 9/26 New Paradigms

X. Sun et al., Automatic Reliability Testing for Cluster Management Controllers, In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘22), July 2022. (Guest lecturer)
H. Zhang et al., Fault-tolerant and Transactional Stateful Serverless Workflows, In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘20), Nov. 2020.

Cloud Failures and Forward

Thu 9/28

P. Huang et al., Gray Failure: The Achilles’ Heel of Cloud-Scale Systems, In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HotOS ‘17), May 2017.
C. Lou et al., Understanding, Detecting and Localizing Partial Failures in Large System Software, In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI ‘20), Feb. 2020.

Fall Reading Days

Tue 10/3

No classes

Bug Finding

Thu 10/5 Static Analysis

D. Engler et al., Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code, In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ‘01), Oct. 2001.
Z. Li et al., CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code, In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘04), Dec. 2004.

Tue 10/10 Dynamic Analysis

S. Savage et al., Eraser: A Dynamic Data Race Detector for Multithreaded Programs, In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP ‘97), Oct. 1997.
S. Lu et al., AVIO: Detecting Atomicity Violations via Access Interleaving Invariants, In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ‘06), Oct. 2006.

Thu 10/12 Binary Analysis

C. Luk et al., Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation, In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (PLDI ‘05), June 2005.
N. Nethercote et al., Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation, In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI ‘07), June 2007.
Optional:
- Y. Shoshitaishvili et al., SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis, In Proceedings of the 37th IEEE Symposium on Security and Privacy (S&P ‘16), May 2016.

Tue 10/17 Fuzzing

M. Böhme et al., Coverage-based Greybox Fuzzing as Markov Chain, In Proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS ‘16), Oct. 2016.
S. Gong et al., Snowcat: Efficient Kernel Concurrency Testing using a Learned Coverage Predictor, In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ‘23), Oct. 2023. (Guest lecturer)

Formal Methods

Thu 10/19 Symbolic Execution

C. Cadar et al., KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs, In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI ‘08), Dec. 2008.
Y. Hu et al., Automated Reasoning and Detection of Specious Configuration in Large Systems with Symbolic Execution, In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI ‘20), November 2020.
Optional:
- V. Chipounov et al., S2E: A Platform for In Vivo Multi-Path Analysis of Software Systems, In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ‘11), March 2011.

Tue 10/24 Model Checking

J. Yang et al., Using Model Checking to Find Serious File System Errors, In Proceedings of the 6th USENIX Conference on Operating Systems Design and Implementation (OSDI ‘04), December 2004.
J. Bornholt et al., Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3, In Proceedings of the 28th ACM Symposium on Operating Systems Principles (SOSP ‘21), October 2021.

Thu 10/26 Verification

G. Klein et al., seL4: Formal Verification of an OS Kernel, In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP ‘09), Oct. 2009.
C. Hawblitzel et al., IronFleet: Proving Practical Distributed Systems Correct, In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP ‘15), Oct. 2015.
Optional:
- P. Fonseca et al., An Empirical Study on the Correctness of Formally Verified Distributed Systems, In Proceedings of the 12th European Conference on Computer Systems (EuroSys ‘17), April 2017.

Hacker Day I

Tue 10/31

No classes

Record and Replay

Thu 11/2

S. Park et al., PRES: Probabilistic Replay with Execution Sketching on Multiprocessors, In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP ‘09), Oct. 2009.
Z. Guo et al., R2: An Application-Level Kernel for Record and Replay, In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI ‘08), Dec. 2008.

Deadline for Checkpoint Report!

Election Day

Tue 11/7

No classes

Production Testing

Thu 11/9

P. Alvaro et al., Automating Failure Testing Research at Internet Scale, In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC ‘16), October 2016.
A. Basiri et al., Chaos Engineering, IEEE Software, Vol. 33, Issue 3, Pages 35-41, May 2016.

Failure Detection

Tue 11/14

J. B. Leners et al., Detecting Failures in Distributed Systems with the FALCON Spy Network, In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP ‘11), October 2011.
P. Huang et al., Capturing and Enhancing In Situ System Observability for Failure Detection, In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘18), October 2018.

Hacker Day II

Thu 11/16

No classes

Failure Diagnosis

Tue 11/21

X. Ren et al., Relational Debugging — Pinpointing Root Causes of Performance Problems, In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘23), July 2023. (Guest lecturer)
R. Bhagwan et al., Orca: Differential Bug Localization in Large-Scale Services, In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘18), Oct. 2018.
Optional:
- Y. Zhang et al., Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach, In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP ‘17), October 2017.

Thanksgiving Recess

Thu 11/23

No classes

Failure Recovery

Tue 11/28

G. Candea et al., Microreboot – A Technique for Cheap Recovery, In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ‘04), December 2004.
Z. Guo et al., Failure Recovery: When the Cure Is Worse Than the Disease, In Proceedings of the 14th Workshop on Hot Topics in Operating Systems (HotOS ‘13), May 2013.
Optional:
- M. Rinard et al., Enhancing Server Availability and Security Through Failure-Oblivious Computing, In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ‘04), December 2004.

Presentation I

Thu 11/30

Presentation II

Tue 12/5