Lab 2B: Raft – Log

Deadline: 4/8 23:59:59 EST

Introduction

A replicated service achieves fault tolerance by storing complete copies of its state (i.e., data) on multiple replica servers. Replication allows the service to continue operating even if some of its servers experience failures (crashes or a broken or flaky network). The challenge is that failures may cause the replicas to hold differing copies of the data.

Raft organizes client requests into a sequence, called the log, and ensures that all the replica servers see the same log. Each replica executes client requests in log order, applying them to its local copy of the service's state. Since all the live replicas see the same log contents, they all execute the same requests in the same order, and thus continue to have identical service state. If a server fails but later recovers, Raft takes care of bringing its log up to date. Raft will continue to operate as long as at least a majority of the servers are alive and can talk to each other. If there is no such majority, Raft will make no progress, but will pick up where it left off as soon as a majority can communicate again.

In this lab you'll implement the leader and follower code to append new log entries, so that the go test -run 2B tests pass. A set of Raft instances talk to each other with RPC to maintain replicated logs. Your Raft interface will support an indefinite sequence of numbered commands, also called log entries. The entries are numbered with index numbers. The log entry with a given index will eventually be committed. At that point, your Raft should send the log entry to the larger service for it to execute.
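
As a rough illustration of the interface described above, here is a minimal sketch of how Start() might append a command to the leader's log and report the index it was given. The three-value return signature, the LogEntry type, the NewRaft helper, and every field name are assumptions made for this example, not the lab's actual skeleton. The sketch omits replication entirely.

```go
package main

import (
	"fmt"
	"sync"
)

// LogEntry is a hypothetical representation of one numbered command.
type LogEntry struct {
	Term    int
	Command interface{}
}

// Raft holds only the fields this sketch needs; the names are assumptions.
type Raft struct {
	mu          sync.Mutex
	currentTerm int
	isLeader    bool
	log         []LogEntry // log[0] is a placeholder so real entries start at index 1
}

// NewRaft is a hypothetical constructor used only by this sketch.
func NewRaft() *Raft {
	return &Raft{log: []LogEntry{{Term: 0}}}
}

// Start appends command to the log if this peer believes it is the leader.
// It returns immediately, without waiting for the entry to be committed.
func (rf *Raft) Start(command interface{}) (index int, term int, isLeader bool) {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if !rf.isLeader {
		return -1, rf.currentTerm, false
	}
	rf.log = append(rf.log, LogEntry{Term: rf.currentTerm, Command: command})
	return len(rf.log) - 1, rf.currentTerm, true
}

func main() {
	rf := NewRaft()
	rf.currentTerm, rf.isLeader = 1, true
	index, term, ok := rf.Start("x")
	fmt.Println(index, term, ok)
}
```

Reserving log[0] as a placeholder is one common way to keep slice positions equal to log indices; other layouts work just as well.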

To start, remember to check out a new branch first:

$ git checkout -b lab2b

Hints

  • Your first goal should be to pass TestBasicAgree2B(). Start by implementing Start(), then write the code to send and receive new log entries via AppendEntries RPCs, following Figure 2. Send each newly committed entry on applyCh on each peer.
  • You will need to implement the election restriction (section 5.4.1 in the paper).
  • One way to fail to reach agreement in the early Lab 2B tests is to hold repeated elections even though the leader is alive. Look for bugs in election timer management, or for a failure to send out heartbeats immediately after winning an election.
  • Your code may have loops that repeatedly check for certain events. Don't have these loops execute continuously without pausing, since that will slow your implementation enough that it fails tests. Use Go's condition variables, or insert a time.Sleep(10 * time.Millisecond) in each loop iteration.
  • Do yourself a favor for future labs and write (or re-write) code that's clean and clear. For ideas, revisit our Guidance page, which has tips on how to develop and debug your code.
  • If you fail a test, look over the code for the test in config.go and test_test.go to get a better understanding of what the test is testing. config.go also illustrates how the tester uses the Raft API.
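
Two of the rules the hints refer to can be written as small pure functions, which makes them easy to reason about separately from locking and RPC plumbing. The sketch below assumes a hypothetical Entry type and a log whose index 0 holds a placeholder entry with term 0. The function names candidateUpToDate and appendEntries are invented for this example. The first function is the "at least as up-to-date" comparison from section 5.4.1; the second is the receiver-side log matching and conflict handling described in Figure 2.

```go
package main

import "fmt"

// Entry is a hypothetical log entry; only Term matters for these checks.
type Entry struct {
	Term    int
	Command interface{}
}

// candidateUpToDate reports whether a candidate's log is at least as
// up-to-date as the voter's: a later last term wins, and with equal
// last terms the longer (or equal-length) log wins.
func candidateUpToDate(candLastIndex, candLastTerm, myLastIndex, myLastTerm int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

// appendEntries applies the follower-side rules to a log and returns the
// resulting log plus a success flag. It assumes prevIndex >= 0.
func appendEntries(log []Entry, prevIndex, prevTerm int, entries []Entry) ([]Entry, bool) {
	// Reject if there is no entry at prevIndex with a matching term.
	if prevIndex >= len(log) || log[prevIndex].Term != prevTerm {
		return log, false
	}
	for i, e := range entries {
		idx := prevIndex + 1 + i
		if idx < len(log) && log[idx].Term == e.Term {
			continue // this entry is already present; do not truncate
		}
		// Conflict, or past the end of the log: drop the suffix from idx on
		// and append the remaining new entries. The three-index slice forces
		// a copy so the caller's backing array is not overwritten.
		log = append(log[:idx:idx], entries[i:]...)
		break
	}
	return log, true
}

func main() {
	log := []Entry{{Term: 0}, {Term: 1, Command: "a"}, {Term: 1, Command: "b"}}
	// A leader in term 2 overwrites the conflicting entry at index 2.
	log, ok := appendEntries(log, 1, 1, []Entry{{Term: 2, Command: "c"}})
	fmt.Println(ok, len(log), log[2].Term)
	fmt.Println(candidateUpToDate(2, 2, 5, 1))
}
```

Truncating only on an actual term conflict matters: a delayed AppendEntries RPC carrying a prefix of entries the follower already holds should leave the rest of its log untouched.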

The tests for upcoming labs may fail your code if it runs too slowly. You can check how much real time and CPU time your solution uses with the time command. Here's typical output:

$ time go test -run 2B
Test (2B): basic agreement ...
  ... Passed --   0.9  3   16    4572    3
Test (2B): RPC byte count ...
  ... Passed --   1.7  3   48  114536   11
Test (2B): agreement after follower reconnects ...
  ... Passed --   3.6  3   78   22131    7
Test (2B): no agreement if too many followers disconnect ...
  ... Passed --   3.8  5  172   40935    3
Test (2B): concurrent Start()s ...
  ... Passed --   1.1  3   24    7379    6
Test (2B): rejoin of partitioned leader ...
  ... Passed --   5.1  3  152   37021    4
Test (2B): leader backs up quickly over incorrect follower logs ...
  ... Passed --  17.2  5 2080 1587388  102
Test (2B): RPC counts aren't too high ...
  ... Passed --   2.2  3   60   20119   12
PASS
ok  	35.557s

real	0m35.899s
user	0m2.556s
sys	0m1.458s
$

The “ok 35.557s” means that Go measured the time taken for the 2B
tests to be 35.557 seconds of real (wall-clock) time. The “user
0m2.556s” means that the code consumed 2.556 seconds of CPU time, or
time spent actually executing instructions (rather than waiting or
sleeping). If your solution uses much more than a minute of real time
for the 2B tests, or much more than 5 seconds of CPU time, you may run
into trouble later on. Look for time spent sleeping or waiting for RPC
timeouts, loops that run without sleeping or waiting for conditions or
channel messages, or large numbers of RPCs sent.
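
One way to avoid a loop that spins without waiting is to block on a condition variable until there is work to do. The following sketch shows an applier goroutine that sleeps until commitIndex moves past lastApplied and then delivers entries on applyCh in order. The ApplyMsg type defined here is a stand-in for this example, and NewRaft, applier, commitThrough, and all field names are assumptions, not the lab's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// ApplyMsg is a stand-in message type defined for this sketch.
type ApplyMsg struct {
	CommandValid bool
	Command      interface{}
	CommandIndex int
}

// Raft holds only the fields this sketch needs; the names are assumptions.
type Raft struct {
	mu          sync.Mutex
	cond        *sync.Cond
	log         []interface{} // log[0] is an unused placeholder
	commitIndex int
	lastApplied int
	applyCh     chan ApplyMsg
}

// NewRaft is a hypothetical constructor used only by this sketch.
func NewRaft(applyCh chan ApplyMsg) *Raft {
	rf := &Raft{log: []interface{}{nil}, applyCh: applyCh}
	rf.cond = sync.NewCond(&rf.mu)
	return rf
}

// applier blocks on the condition variable instead of polling, then sends
// each newly committed entry on applyCh in index order.
func (rf *Raft) applier() {
	for {
		rf.mu.Lock()
		for rf.lastApplied >= rf.commitIndex {
			rf.cond.Wait() // releases mu while asleep, reacquires on wake-up
		}
		rf.lastApplied++
		msg := ApplyMsg{
			CommandValid: true,
			Command:      rf.log[rf.lastApplied],
			CommandIndex: rf.lastApplied,
		}
		rf.mu.Unlock()
		rf.applyCh <- msg // send without holding the lock
	}
}

// commitThrough stands in for whatever logic advances commitIndex.
// It wakes the applier after updating the index.
func (rf *Raft) commitThrough(index int) {
	rf.mu.Lock()
	if index > rf.commitIndex && index < len(rf.log) {
		rf.commitIndex = index
	}
	rf.mu.Unlock()
	rf.cond.Broadcast()
}

func main() {
	applyCh := make(chan ApplyMsg)
	rf := NewRaft(applyCh)
	rf.log = append(rf.log, "a", "b")
	go rf.applier()
	rf.commitThrough(2)
	for i := 0; i < 2; i++ {
		m := <-applyCh
		fmt.Println(m.CommandIndex, m.Command)
	}
}
```

The applier uses no CPU while waiting, and sending on the channel after releasing the lock keeps a slow receiver from stalling other goroutines that need the mutex. This sketch has no shutdown path; a real implementation would also need a way to stop the goroutine.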

Submission

Required: DESIGN DOC. Please fill in a new file named DESIGN_DOC and add it to your repo's root directory (alongside student_info). This document is for you to share your experience and will be graded as part of your submission. You can refer to this template. Do not forget to add it to git!

To submit, push the commits on your local lab2b branch to GitHub.

$ git push -u origin lab2b

Our grading scripts will automatically take a snapshot of your lab branch at the submission deadline (unless you use late tokens, in which case your code will be graded later). Make sure you check in all commits to the correct branch, and do not push new commits that may break compilation. If you decide to use the late-hour tokens, send an email to cs4740staff@virginia.edu by the deadline with the subject “[Late Request]: $GitHub_Repo_Name$” (an empty body is fine) so that we do not collect and grade your solution immediately. When you finish (within the token limit), send another email to cs4740staff@virginia.edu with the subject “[Late Finish]: $GitHub_Repo_Name$”.