While reading Lamport's Paxos and Byzantine Generals Problem papers we were struck by how Lamport relates a system design to a human analogy. However, what if the inverse approach was applied to designing a system? In this project, we attempt to extract parts of the Westminster System to create a distributed and hierarchical consensus and leader election algorithm.
While governments predate distributed systems research by millennia, the development of distributed systems has happened separately, and thus has potentially disregarded ideas and lessons therein. Even in "The Part-Time Parliament", it is clear that Lamport constructs a fictional government to fit a distributed systems scenario, and not the other way round.
Whether human analogies incept the ideas, or the ideas beget analogies we were interested to see what could be gained from applying human systems to distributed systems. It is clear that there are striking parallels between the design of distributed systems and the construction of government systems. Consensus and leader election are long standing requirements of many governments today.
We start with the Westminster System since it has several requirements that are clearly structured, and the system operates almost like an interface that has been extended and customized by several other governments for various purposes (a lot of the Eastern Bloc, India, Australia, Canada, Japan).
Next, we observe each part of the Westminster System to see which parts can be be adapted and which parts seem repetitive, inapplicable or superfluous. With this subset, we attempt to create a consensus algorithm that is close to the spirit of the original system.
Finally we make comparisons to existing consensus algorithms and draw some conclusions.
The Westminster System is a parliamentary system of government first developed in the United Kingdom, rooted in the 11th century but developing into the system we recognize in the 17th century. It was later adopted widely around the world and is one of the earliest iterations of a modern parliamentary model that has withstood the test of time, making it a reasonable choice to study as an archetypal example of a real world government consensus system.
Important characteristics of the Westminster System are:
The above description is an extremely reductive version of the real system as implemented. The first thing that struck us after researching the Westminster System is just how complicated it is, with many tedious details, exceptions and special cases along with many practices that are seemingly there only for historical reasons. So we included only the salient points. From what remains, it is worth highlighting a few key characteristics we want to embrace, and others we would like to omit before we describe the system in full.
One notable detail about how the actual Westminster System works is that the elected representatives represent different geographical regions. This reminded us of global, federated networks of computers with layers of structure (e.g. in-datacenter versus across the world), so we chose that as the scenario to build towards.
A Monarch will not be included in this system because we struggled to find a purpose for this mostly ceremonial role. Also in any distributed system, it is difficult to imagine a system in which any machine is as permanent as a Monarch, that is not selected or renewed by consensus.
In the Westminster System, The House of Commons serves a different, shorter cycle than the House of Lords. This is an interesting idea, because it allows for two different classes of members that have staggered leases, which might eliminate a situation where everyone voting or changing at the same time causes problems. The "thundering herd problem" is one example where every node in a system doing things simultaneously causes undesirable spikes in resource usage resources and bandwidth problems. The House of Lords can best represent local changes to a network if machines are added and dropped - this form of representation will be most responsive. However, choosing the representatives of a large scale system too often may delay the speed of decision making. This makes the bicameral structure favorable, where the House of Lords is less responsive but more stable over time.
Term limits are also an interesting idea - it prevents members of the system from becoming "special", which would imply an over-reliance on those members which might cause reliability issues. It is a truism in reliable systems that things that happen rarely are error-prone because they end up being not well-tested in the real world. For example, failover in traditional data systems is something that systems administrators fear. Netflix's Chaos Monkey is an interesting approach to this problem: by inducing failure on purpose, they make failover a routine scenario rather than an exceptional one. Term limits are a similar idea: nodes have to swap duties after a certain period of time, so that swapping duties becomes routine. Although term limits are worth integrating into the system, we did not include the proportion of each as it currently exists in the Parliament. The House of Lords in reality is larger than the House of Commons, but in this system, but House of Lords will be significantly smaller.
A vote of confidence regulates the term of the Prime Minster. These will be regularly initiated, and can serve to either terminate or renew the lease of a serving prime minister, which, in this case, is automatically enduring unless a Vote of No Confidence occurs.
There is a judiciary in the Westminster System whose job it is to interpret and apply the laws that the legislative branch agrees upon. Since in our case "laws" are merely values in a distributed system, there isn't really anything to interpret and apply.
We can now describe our Westminster-inspired system.
Network StructureWe assume a network in which there are geographically distributed machines across a broad scale. If not geographically distributed, this system is modeled after machines that represent diverse services, products or operations such that inclusion of different "districts" or zones is important. In this way, the use of constituencies to partition the network into smaller districts for hierarchical voting makes sense. If some decisions to be made are localized to specific districts, then a vote can occur internally, without continuing to deeper levels of the hierarchy before reaching an answer. This can improve efficiency and availability in the larger system, and better utilize the local system.
Such an assumption requires tools to partition the network into districts. If we wanted to do this automatically, a starting point could be algorithms (dubbed "community detection") that use different heuristics to separate topologies into smaller clusters:
A simpler and perhaps more practical scenario is that geographic or function based boundaries are already apparent to the operators and which district each node belongs to and an arbitrary leader for that district is manually bootstrapped into the system as it is started. After that, the system can determine things automatically.
Properties of the membersWe don't make any special assumptions regarding the reliability of machines: we assume the system is composed of heterogeneous commodity machines. This reflects the situation in the real world, and makes sense from a distributed systems point of view as it is impractical to try to build specialized hardware to eliminate failure at a large scale.
One way that individual reliability is maintained in real parliamentary systems is a reputation system: members who do bad things eventually are recognized as being bad. In this vein, we can include a mechanism where machines maintain reputation values for other machines which are based on desirable heuristics such as network uptime (similar to seniority), speed of response, quality of response, etc. These can be combined into a single "reputation" metric. These perhaps need not be consistent across different machines: different machines may care about different properties. Machines can these use these reputation metrics when deciding who to vote for in higher positions.
Relationships Between The MembersAfter machines are organized into districts, districts internally elect its representative for the house of commons. Then a group of representatives The purpose of this tiering can be twofold:
First, a federated system with many subsystems can have certain information that needs to only be shared with a local group of nodes, while others which concern everyone in the system. This might be due to geographical distance, or due to the fact that specific nodes perform specific functions that don't concern other nodes. Consensus over global values require more and longer communication, which makes them more expensive in terms of time and bandwidth usage. So mandating global consensus for every bit of data is inefficient. A tiered system allows the members to avoid this inefficiency whenever possible.
Second, it separates internal and external communication. This helps because flowing all non-local requests through a single node allows for efficiency improvements such as batching messages. This also helps with scaling: in a system where every node communicates with every other node, the cost of communication is n squared. Dividing up and grouping the nodes based on task or geography helps reduce this effect at the global level.
House of CommonsIn this visualization, we present a local district voting system. The system starts with Node 0 as its elected local leader. The first panel (center) represents the scoring metrics updating, establishing a reputation on the machines. The top scoring machine is offered a ballot position in the next election, and the current leader will broadcast to all nodes in the district that an election is occuring. Constituent nodes must respond within a fixed amount of time for their vote to count, and the leader announces the newly elected leader by broadcasting to all nodes. A log of the internal system actions is maintained on the right side.
The visualization features 3 specific operations. The first, in greater detail is the local election protocol. At a regular interval, the current leader (based on its own internal time) will initiate an election. This happens when the leader node (shown in blue) broadcasts a message to all nodes in the district. This model currently assumes all nodes in a district are completely connected. This is not a requirement, however, as messages can be relayed through other connected nodes. Once nodes receive the election notification, they submit a ballot within a certain period of time. The ballot confirms that the node will vote for the top performing node in the district, based on its reputation metrics. Those metrics are displayed at the top of the middle panel, in a dictionary. Once the leader (the district representative) receives and processes votes, it will broadcast the result, and leaders will transition. The transition includes the transfer of a ledger of pending requests from the current leader to the new leader, if the leader is changing.
The reputation scores are maintained in a dictionary, displayed in the middle panel. In our implementation, reputation is randomly selected from a distribution, and the scores are assigned the same way for the current leader and for the rest of the nodes in the district. However, we imagine that in a real system, a longer tenure may translate to a better score, and heuristics for this should be specifically selected for the needs of the design of that system.
The next operation is a read request. One benefit of the federated consensus model is that it can dramatically reduce the size of a network based on the work being done. In this case, reads are handled by their district representatives (House of Commons members for each constituency). Because they are not mutating anything, a district representative can simply collect requests, and operate as an intermediate layer between district nodes and data stores. Districts, by this nature, will also handle their own caches. This can be particularly useful if the network is zoned into districts based on the behaviors they exhibit. This leads to caches that may better serve the needs of a district than a more generic cache.
The final operation is a write request. A randomly selected node in the district sends a request to the current leader. This request is then handled, and either added to a ledger or discarded if it conflicts with another request already in the system (an example of this is account creation). The requests that do remain are appended to the Write Queue, which is displayed in the center panel. Write queues are managed at the district level, and are only emptied during Parliamentary Sessions. The action on district write queues is better explained in the visualization following this one.
This model has some strengths and weaknesses, and of course is a simplification of the nuances of a real implementation of such a system. However, where there are details that need to be filled in, we have abstracted components that represent the higher level system design. One key take away from this system is that it handles requests differently, depending on what they are, and that district representatives in the House of Commons can serve as load balancers for requests.
In this visualization, we cover the interactions between districts, between districts and the House, and internally in the house. The top section (directly below here) displays an example of a network divided into 12 districts. Each district has been partitioned based on some heuristic (we discuss this in greater detail in the analysis section). Within each district is a graph of the nodes included, a leader which has been highlighted in blue, and a write queue which has aggregated locally. The elements displayed in each of the 12 containers here are carried over from the previous model of a district's interactions. The leader in blue is elected the same way it was shown in the previous visualization, but the process is not shown in full detail here, because of scale.
Below the districts is the House of Commons, featuring one representative per district. One of these members is also the Prime Minister of the network. Each node in the House is labelled with [districtId].[memberId]. The Prime Minister is shown in blue, and in our representation, the network of the House of Commons is also fully connected, though as long as messages can reach all nodes from all other nodes in the House, such a model could work.
In the middle panel, we display another dictionary containing reputation scores for each district in the House. These scores are tied to the weighed and combined performance of the district at large, and its selected representative. Below the reputation board is another write queue. This one is maintained at the House level. We discuss this in further detail in the "Parliament Session" below.
The first operation we provide here is a Parliament Session. As pointed out in the previous section, districts collect write queues locally until a Parliament Session is called by the PM, at a regular interval. The Prime Minister initiates the meeting, by sending a request to each House member, one at a time. The members then return to the Prime Minister with their write requests, from the write queues they maintain for their districts. The Prime Minister sends out ballots in batches from those queues to all members of the House for voting. Once received, House members submit their votes in batches and the Prime Minister broadcasts the result. The PM then reviews the votes, and either can discard a request, process a request directly, or append it to the House ledger to pass on to the Lords for another round of voting to establish consensus when critical. Then, the Prime Minister follows by sending the next district a request for its queue. It performs this operation in round robin style, where each district is given a time slot based on its districtId.
The next operation is a Vote of Confidence. This can either be initiated by a node in the House upon request, or at a regular interval based on the lease of a prime minister. The Vote proceeds by the leader initiating a broadcast to the house for preserving its seat. The members receive the request, submit a vote (based on a performance heuristic, ie. scoreboard), and the prime minister processes this. Finally, if more than 50% of members approve, the prime minister stays. Otherwise, there is a transition of leadership to the highest reputation member of the House.
The final operation is the relection of members of the house. Each district staggers its votes, and will carry out the voting procedure outlined in the first visualization. As those districts update their leaders, they are updated in the house voting and will resume carrying out votes that in progress, if a parliament meeting was called or if the leader node failed. The node serving as prime minister is not re-elected during this time. Instead, if the prime minister fails a vote of confidence, the district will re-elect their representative and a new PM is selecting in the interim.
This system performs well when a network naturally is partitioned into districts, observes behavior that can be curtailed to districts and other operations that need large scale consensus, and has the scale to require such a system in the first place. For small scale networks, where establishing a hierarchy does not make sense, a flatter structure may be more efficient. The overhead costs of holding elections at each level may exceed the cost of communication between nodes in the network in a flatter structure, and the benefits only exist at a critical mass.
This extends to the notion that district leaders (or Commons representatives) should serve as load balancers of their districts. Having districts commit their read requests to their district representatives can improve the efficiency of the system, especially because the operations are not mutating any data and because a single master in a much larger network may quickly turn into a bottleneck if its queue is consistently full and dropping packets.
An extended idea we considered here is that even write requests can be localized, or contained at different levels of the network. Such a system can move critical requests up the hierarchy in the system, but other requests that do not require large scale consensus can funnel out of this bureaucracy much earlier. If consistency requirements can be managed differently for different kinds of activity, this system has a tiered structure to handle it.
Selecting a good zoning algorithm for districts is essential, and cannot be generalized in the scope of this paper. Though we have abstracted this process, gaining an understanding of the specific system this is being applied to can improve the implementation. For example, zoning based on the information different parts of the network commonly request or zoning based on operations may balance districts and make caches more universally relevant for activity within districts. The selection of heuristics is evident in each algorithm we considered above, where different design decisions were made based on the networks and optimization priorities.
The choice of staggered elections serves to keep districts online for other operations, and for voting such that they are not all in the process of electing new leaders at the same time. This is much the same way a technology company may handle a failover.
In this paper, we were concerned with attempting to design a distributed system that resembles a Westminster Parliament, in order to see if we could arrive at conclusions that apply to distributed systems.
One general realization was that consensus in a distributed system is a different beast than consensus in a government. Consensus in distributed systems is concerned with more essential matters such as ensuring every member of the system is on the same page on some values, handling node failure, handling communication problems, etc. For consensus in governments however, these are not concerns. For example, communication generally happens through people shouting in a room together, which has proven to be a fairly reliable (if not effective!) form of human to human communication. Here the concerns are more around competing interests. Disagreement is not an anomaly, it is almost an expected property of the system. And consensus is not about the whole system trying to cooperate to all agree on some implicitly correct value (that an all-seeing oracle could know) so much as it is about deciding what should be correct in the first place.
Similarly with leader election: In real life we care about electing leaders in a fair and equitable way, in a way that represents the interests of their constituencies. This introduces a lot of complexity and inefficiency, whereas a more arbitrary decision could have been easier to make. In the context of a distributed system, there aren't really any competing interests as much as just anomalies that we are trying to avoid, which ultimately might be an easier problem to deal with. Perhaps there is a lesson in here about whether antagonism is a desirable property of a government.
Trying to reconcile these differences was a constant source of difficulty for us. On one hand we could make decisions to design a system that resembles a Westminster parliament. On the other hand we could make decisions that favor a more parsimonious system where our model of the world is simpler and reasoning about reliability is easier. Perhaps this type of system is more useful in contexts where the concerns are more similar (towards decision making rather than agreement) - imagine a system where we're trying to elect the best album of 2016. The decision-making portion could be handled by a system like ours, whereas the reliability concerns could be handled with existing distributed systems techniques at a lower level.
The lessons from this system turned out to be more about system organization than consensus. In hindsight this makes sense: representative governments were built to handle the difficulties (or at the time of the first parliament, the impossibility) of a direct representative government. As a result, we arrived at a federated system that solved these types of problems, e.g. appointing an elected leader to simplify communication requirements.
Some future technical improvements to our model might include thinking about how to handle districts that are implicitly not uniform in size, wherein the leaders could become a bottleneck. Another one might include investigating what types of reputation metrics could be used in the real world and how.
More generally, one could investigate the relationship between decision making and consensus. Perhaps decision-making with misaligned incentives is an extreme case where most of the nodes are byzantine, or perhaps it's a simpler and more general problem of more data and better calculation.