Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 |
---|---|---|---|---|---|---|---|
8/27 | 9/3 | 9/10 | 9/17 | 9/24 | 10/1 | 10/8 | 10/15 |
8/29 | 9/5 | 9/12 | 9/19 | 9/26 | 10/3 | 10/10 | 10/17 |
Week 9 | Week 10 | Week 11 | Week 12 | Week 13 | Week 14 | Week 15 | Week 16 |
---|---|---|---|---|---|---|---|
10/22 | 10/29 | 11/5 | 11/12 | 11/19 | 11/26 | 12/3 | |
10/24 | 10/31 | 11/7 | 11/14 | 11/21 | 11/28 | 12/5 | |
Introduction
- J. Gray, Why Do Computers Stop and What Can Be Done About It?, Technical Report 85.7, Tandem Computers, June 1985.
- B. Maurer, Fail at Scale, Communications of the ACM (CACM), Vol. 58 No. 11, Pages 44-49, November 2015.
- Optional:
- A. Chou et al., An Empirical Study of Operating Systems Errors, In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ‘01), Oct. 2001.
- H. S. Gunawi et al., Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages, In Proceedings of the 7th ACM Symposium on Cloud Computing (SOCC ‘16), October 2016.
- S. Ghosh et al., How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service, In Proceedings of the 13th Symposium on Cloud Computing (SoCC ‘22), November 2022.
Deadline for Team Registration!
Challenges
- S. Wang et al., Understanding Silent Data Corruptions in a Large Production CPU Population, In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ‘23), Oct. 2023. Guest Lecture
- T. Do et al., Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems, In Proceedings of the 4th ACM Symposium on Cloud Computing (SOCC ‘13), October 2013.
- Optional:
- H. S. Gunawi et al., Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems, In Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST ‘18), February 2018.
- P. H. Hochschild et al., Cores that Don’t Count, In Proceedings of the 18th Workshop on Hot Topics in Operating Systems (HotOS-XVIII), May 2021.
- D. Yuan et al., Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems, In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘14), October 2014.
- H. S. Gunawi et al., What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, In Proceedings of the 5th ACM Symposium on Cloud Computing (SOCC ‘14), November 2014.
- Optional:
- S. Lu et al., Learning from Mistakes — A Comprehensive Study on Real World Concurrency Bug Characteristics, In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ‘08), March 2008.
- Z. Yin et al., An Empirical Study on Configuration Errors in Commercial and Open Source Systems, In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP ‘11), Oct. 2011.
- T. Xu et al., Early Detection of Configuration Errors to Reduce Failure Damage, In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘16), November 2016.
- A. B. Brown and D. A. Patterson, Undo for Operators: Building an Undoable E-mail Store, In Proceedings of the 2003 USENIX Annual Technical Conference (USENIX ATC ‘03), June 2003.
- K. Nagaraja et al., Understanding and Dealing with Operator Mistakes in Internet Services, In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ‘04), December 2004.
- L. Huang et al., Metastable Failures in the Wild, In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘22), July 2022.
- H. Zhou et al., Overload Control for Scaling WeChat Microservices, In Proceedings of the 2018 ACM Symposium on Cloud Computing (SOCC ‘18), Oct. 2018.
- A. Alquraan et al., An Analysis of Network-Partitioning Failures in Cloud Systems, In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘18), October 2018.
- P. Bailis and K. Kingsbury, The Network Is Reliable, Communications of the ACM (CACM), Vol. 57 No. 9, Pages 48-55, September 2014.
- F. McSherry et al., Scalability! But at what COST?, In Proceedings of the 15th Workshop on Hot Topics in Operating Systems (HotOS-XV), May 2015.
- C. A. Stuardo et al., ScaleCheck: A Single-Machine Approach for Discovering Scalability Bugs in Large Distributed Systems, In Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST ‘19), Feb. 2019.
Deadline for Proposal!
- J. Gu et al., Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management, In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ‘23), Oct. 2023. Guest Lecture
- X. Sun et al., Automatic Reliability Testing for Cluster Management Controllers, In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘22), July 2022.
Thu 10/3 Cloud Failures and Forward
- P. Huang et al., Gray Failure: The Achilles’ Heel of Cloud-Scale Systems, In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HotOS ‘17), May 2017.
- C. Lou et al., Understanding, Detecting and Localizing Partial Failures in Large System Software, In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI ‘20), Feb. 2020.
Bug Finding
- D. Engler et al., Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code, In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ‘01), Oct. 2001.
- Z. Li et al., CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code, In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘04), Dec. 2004.
- S. Savage et al., Eraser: A Dynamic Data Race Detector for Multithreaded Programs, In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP ‘97), Oct. 1997.
- S. Lu et al., AVIO: Detecting Atomicity Violations via Access Interleaving Invariants, In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ‘06), Oct. 2006.
Fall Reading Days: No classes
- C. Luk et al., Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation, In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (PLDI ‘05), June 2005.
- N. Nethercote et al., Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation, In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI ‘07), June 2007.
- Optional:
- Y. Shoshitaishvili et al., SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis, In Proceedings of the 37th IEEE Symposium on Security and Privacy (S&P ‘16), May 2016.
- M. Böhme et al., Coverage-based Greybox Fuzzing as Markov Chain, In Proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS ‘16), Oct. 2016.
- S. Gong et al., Snowcat: Efficient Kernel Concurrency Testing using a Learned Coverage Predictor, In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ‘23), Oct. 2023.
Formal Methods
- C. Cadar et al., KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs, In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘08), Dec. 2008.
- Y. Hu et al., Automated Reasoning and Detection of Specious Configuration in Large Systems with Symbolic Execution, In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘20), November 2020.
- Optional:
- V. Chipounov et al., S2E: A Platform for In Vivo Multi-Path Analysis of Software Systems, In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ‘11), March 2011.
- J. Yang et al., Using Model Checking to Find Serious File System Errors, In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘04), December 2004.
- J. Bornholt et al., Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3, In Proceedings of the 28th ACM Symposium on Operating Systems Principles (SOSP ‘21), October 2021.
- G. Klein et al., seL4: Formal Verification of an OS Kernel, In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP ‘09), Oct. 2009.
- C. Hawblitzel et al., IronFleet: Proving Practical Distributed Systems Correct, In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP ‘15), Oct. 2015.
- Optional:
- P. Fonseca et al., An Empirical Study on the Correctness of Formally Verified Distributed Systems, In Proceedings of the 12th European Conference on Computer Systems (EuroSys ‘17), April 2017.
Election Day: No classes
Deadline for Checkpoint report!
Runtime Techniques
Hacker Day I: No classes
- Y. Xiong et al., SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation, In Proceedings of the 2024 USENIX Annual Technical Conference (USENIX ATC ‘24), July 2024.
- A. Basiri et al., Chaos Engineering, IEEE Software, Vol. 33, Issue 3, Pages 35-41, May 2016.
- Optional:
- P. Alvaro et al., Automating Failure Testing Research at Internet Scale, In Proceedings of the 7th ACM Symposium on Cloud Computing (SOCC ‘16), October 2016.
- J. B. Leners et al., Detecting Failures in Distributed Systems with the FALCON Spy Network, In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP ‘11), October 2011.
- P. Huang et al., Capturing and Enhancing In Situ System Observability for Failure Detection, In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘18), October 2018.
- X. Ren et al., Relational Debugging — Pinpointing Root Causes of Performance Problems, In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘23), July 2023.
- R. Bhagwan et al., Orca: Differential Bug Localization in Large-Scale Services, In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ‘18), Oct. 2018.
- Optional:
- Y. Zhang et al., Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach, In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP ‘17), October 2017.
- G. Candea et al., Microreboot – A Technique for Cheap Recovery, In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ‘04), December 2004.
- Z. Guo et al., Failure Recovery: When the Cure Is Worse Than the Disease, In Proceedings of the 14th Workshop on Hot Topics in Operating Systems (HotOS ‘13), May 2013.
- Optional:
- M. Rinard et al., Enhancing Server Availability and Security Through Failure-Oblivious Computing, In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ‘04), December 2004.
Hacker Day II: No classes
Thanksgiving recess: No classes
Final Presentation I
Final Presentation II
Deadline for Final report!