Thursday, April 12, 2018

Notes from USENIX NSDI Days 2 & 3

Before talking about Days 2 & 3, here are my notes from NSDI day 1 if you are interested.

Again, all of the papers mentioned (and omitted) below are publicly accessible on the NSDI web page. Enjoy!

NSDI day 2 

Day 2 had the following sessions:

  • web and video
  • performance isolation and scaling
  • congestion control
  • cloud

Here are the papers I took detailed notes on. I ordered them by how interesting I found them.

Towards Battery-Free HD Video Streaming

This paper is by Saman Naderiparizi, Mehrdad Hessar, Vamsi Talla, Shyamnath Gollakota, and Joshua R Smith, University of Washington.

This work was mind-blowing for me. This is a strong group that developed the WISP motes earlier, but even so, the battery-free video-streaming sensors they presented were very impressive, and the talk included a live demo of them.

So here is the deal. The goal of this work is to design sticker-form-factor, battery-free cameras. But that is crazy, right? Video streaming is power hungry; how can you go battery-free?

The paper looks closely at the costs of the components of a video-streaming camera. It turns out the image sensor, at 85 microwatts, is very low power. The radio, on the other hand, at 100 milliwatts, is very high power.

If only we could offload the task of communication away from the sensor, we could pull this off!

The inspiration for the idea comes from the Great Seal bug. Inside the seal was a very thin membrane that vibrated when there was talking in the room. Soviet spies then used a remote directional radio to receive those analog vibrations and reconstruct the sound. The following is from the Wikipedia page on this seal bug, called "the Thing":
The "Thing" consisted of a tiny capacitive membrane connected to a small quarter-wavelength antenna; it had no power supply or active electronic components. The device, a passive cavity resonator, became active only when a radio signal of the correct frequency was sent to the device from an external transmitter. Sound waves (from voices inside the ambassador's office) passed through the thin wood case, striking the membrane and causing it to vibrate. The movement of the membrane varied the capacitance "seen" by the antenna, which in turn modulated the radio waves that struck and were re-transmitted by the Thing. A receiver demodulated the signal so that sound picked up by the microphone could be heard, just as an ordinary radio receiver demodulates radio signals and outputs sound.
The group applies the same principle to cameras to make them battery-free. They showed on stage the first demo of analog video backscatter, which sends pixels directly to the antenna. The range of the prototype system was 27 feet, and it streamed 112x112-pixel real-time video. A software-defined radio was used to recreate the backscattered analog video.

For the analog hardware, the speaker said they took inspiration from human brain signals and created pulse-width-modulated pixels. More engineering went into intra-frame compression, leveraging the fact that adjacent pixels are fairly similar.
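
I don't have the details of their compression scheme in my notes, but here is a minimal sketch of the general idea of exploiting adjacent-pixel similarity: delta-encode each row so that only small differences need to be conveyed. The frame size matches their prototype resolution, but everything else here is my own toy choice.

    import numpy as np

    def delta_encode_rows(frame):
        # Encode each row as (first pixel, differences between adjacent pixels).
        # Adjacent pixels are usually similar, so the deltas are small.
        frame = frame.astype(np.int16)       # widen to avoid uint8 wraparound
        return frame[:, :1], np.diff(frame, axis=1)

    def delta_decode_rows(first_col, deltas):
        # Reconstruct by cumulatively summing the deltas back onto the first column.
        return np.concatenate([first_col, first_col + np.cumsum(deltas, axis=1)], axis=1)

    frame = np.random.randint(0, 256, size=(112, 112), dtype=np.uint8)  # toy 112x112 frame
    first_col, deltas = delta_encode_rows(frame)
    assert np.array_equal(delta_decode_rows(first_col, deltas), frame.astype(np.int16))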

Vesper: Measuring Time-to-Interactivity for Web Pages

This paper is by Ravi Netravali and Vikram Nathan, MIT CSAIL; James Mickens, Harvard University; Hari Balakrishnan, MIT CSAIL.

This work asks the question of "what does it mean for a page to load quickly?" In other words, how should we define the load time?

The way page loads work is like this. The URL typed into the browser causes the server to send the HTML. The browser then runs the HTML + JavaScript (requesting other embedded items as it encounters them in the HTML, but that is the topic of Ravi and James's other paper in the session). In the browser there is a JavaScript engine and a rendering engine that constructs the DOM tree; the JS engine and the DOM tree interact via the DOM API.

The metric often used is page load time. Of course, this is very conservative: only some of the page content is immediately visible, and the invisible part doesn't matter, so why care about the load time of the invisible parts? Making this observation, Google came up with the speed index: the time to render the above-the-fold content, i.e., the part visible in the browser window.

But the speed index is also deficient, because it says nothing about JavaScript execution time. The JavaScript running time affects the user experience, say via autocomplete and other handlers. An interactive page is not usable without its JavaScript, and today a large fraction of pages are interactive: the talk cited a median of 182 event handlers, and 1252 handlers at the 95th percentile, in the pages surveyed.

To improve on the speed index, this work comes up with the ready index, which captures a page's time-to-interactivity in terms of both visibility and functionality.

But the challenge is that nobody knows a good way to automatically identify the interactive state: is the JavaScript functional yet?

The system, Vesper, uses a two-phase approach:

  • identify the visible elements, their event handlers, and the state those handlers access when fired
  • track the loading progress of the interactive state identified in phase 1

As a side note, the speaker, Ravi, spoke in a relaxed and clear way. The talk didn't feel rushed, but covered a lot of stuff. I really loved the delivery.

Performance Analysis of Cloud Applications

This paper is by Dan Ardelean, Amer Diwan, and Chandra Erdman, Google.

This work from Google considers the question of how to evaluate a change before deploying it in production. Yes, the accepted approach is to use A/B testing over a small fraction of the users, but is that enough?

The abstract has this to say:
"Many popular cloud applications are large-scale distributed systems with each request involving tens to thousands of RPCs and large code bases. Because of their scale, performance optimizations without actionable supporting data are likely to be ineffective: they will add complexity to an already complex system often without chance of a benefit. This paper describes the challenges in collecting actionable data for Gmail, a service with more than 1 billion active accounts.
Using production data from Gmail we show that both the load and the nature of the load changes continuously. This makes Gmail performance difficult to model with a synthetic test and difficult to analyze in production. We describe two techniques for collecting actionable data from a production system. First, coordinated bursty tracing allows us to capture bursts of events across all layers of our stack simultaneously. Second, vertical context injection enables us combine high-level events with low-level events in a holistic trace without requiring us to explicitly propagate this information across our software stack."
Vertical context injection roughly means collecting the trace at the kernel level, using ftrace, with the layers above the kernel injecting their information into the kernel trace via stylized syscalls carrying a payload.
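
I don't know the exact syscall encoding they use, but as a rough illustration of getting application-level events interleaved with kernel events in an ftrace trace, here is a sketch that writes annotations to ftrace's standard trace_marker file. This is a related but simpler mechanism than the paper's stylized syscalls, and the path and event names below are my own assumptions.

    TRACE_MARKER = "/sys/kernel/tracing/trace_marker"  # path varies by distro/mount

    def mark(event, **attrs):
        # Write a high-level event annotation into the kernel trace buffer,
        # so it appears inline with scheduler/syscall events in the ftrace output.
        payload = " ".join(f"{k}={v}" for k, v in attrs.items())
        with open(TRACE_MARKER, "w") as f:
            f.write(f"{event} {payload}\n")

    # e.g., bracket a high-level request so the kernel events in between
    # can be attributed to it when the trace is post-processed:
    mark("rpc_start", rpc_id=42, method="FetchThreadList")
    # ... handle the request ...
    mark("rpc_end", rpc_id=42)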

The paper concludes with these observations for meaningful performance experiments:

  • do experiments in production
  • use controlled A/B tests with 10 million users (fewer is not very meaningful)
  • use long-lived tests to capture the changing mix of requests
  • use creative approaches (such as vertical context injection) to collect rich data cheaply.

LHD: Improving Cache Hit Rate by Maximizing Hit Density

This paper is by Nathan Beckmann, Carnegie Mellon University; Haoxian Chen, University of Pennsylvania; Asaf Cidon, Stanford University and Barracuda Networks.

Who knew... cache eviction policies still leave room for big improvements.

To motivate the importance of cache hit rate research, Asaf mentioned the following: a key-value cache is 100x faster than the database. For Facebook, which already has a cache hit rate of 98%, improving the hit rate by just an additional 1% would improve performance by 35%.
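
A quick back-of-the-envelope calculation (with my own assumed latencies: 1 time unit for a cache hit, 100 for a database access, matching the "100x faster" figure) shows why a single extra point of hit rate matters so much:

    def avg_latency(hit_rate, t_cache=1.0, t_db=100.0):
        # Expected request latency given the hit rate.
        return hit_rate * t_cache + (1 - hit_rate) * t_db

    before = avg_latency(0.98)   # 0.98*1 + 0.02*100 = 2.98
    after = avg_latency(0.99)    # 0.99*1 + 0.01*100 = 1.99
    print(f"{before:.2f} -> {after:.2f}: {100 * (before - after) / before:.0f}% lower mean latency")
    # ~33% lower mean latency, in the same ballpark as the quoted 35%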

Here is the abstract:
Cloud application performance is heavily reliant on the hit rate of datacenter key-value caches. Key-value caches typically use least recently used (LRU) as their eviction policy, but LRU’s hit rate is far from optimal under real workloads. Prior research has proposed many eviction policies that improve on LRU, but these policies make restrictive assumptions that hurt their hit rate, and they can be difficult to implement efficiently.
We introduce least hit density (LHD), a novel eviction policy for key-value caches. LHD predicts each object’s expected hits-per-space-consumed (hit density), filtering objects that contribute little to the cache’s hit rate. Unlike prior eviction policies, LHD does not rely on heuristics, but rather rigorously models objects’ behavior using conditional probability to adapt its behavior in real time.
To make LHD practical, we design and implement RankCache, an efficient key-value cache based on memcached. We evaluate RankCache and LHD on commercial memcached and enterprise storage traces, where LHD consistently achieves better hit rates than prior policies. LHD requires much less space than prior policies to match their hit rate, on average 8x less than LRU and 2–3x less than recently proposed policies. Moreover, RankCache requires no synchronization in the common case, improving request throughput at 16 threads by 8x over LRU and by 2x over CLOCK.
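
Here is a minimal sketch of the least-hit-density idea as I understand it (a simplification: the real system estimates hit probabilities and lifetimes from learned age distributions, while the numbers below are made up). Hit density is the expected hits per unit of space-time an object occupies, and the eviction victim is the object with the lowest hit density.

    class CachedObject:
        def __init__(self, key, size, hit_prob, expected_lifetime):
            self.key = key
            self.size = size                            # bytes the object occupies
            self.hit_prob = hit_prob                    # estimated P(hit before eviction)
            self.expected_lifetime = expected_lifetime  # estimated time spent in the cache

        def hit_density(self):
            # Expected hits per (byte x time unit) consumed.
            return self.hit_prob / (self.size * self.expected_lifetime)

    def choose_victim(objects):
        # Evict the object contributing the fewest hits per space-time consumed.
        return min(objects, key=lambda o: o.hit_density())

    objs = [CachedObject("a", size=100, hit_prob=0.9, expected_lifetime=10),
            CachedObject("b", size=4000, hit_prob=0.5, expected_lifetime=50),
            CachedObject("c", size=100, hit_prob=0.1, expected_lifetime=80)]
    print(choose_victim(objs).key)   # "b": big and long-lived relative to its hit probability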

Poster session

There was a poster session at the end of day 2. I wish there had been more preliminary-but-bold work in the poster session; most of the posters simply accompanied papers presented in the main track.

I liked these two posters the most. They are both very interesting works in progress.
  • Distributed Test Case Generation using Model Inference. Stewart Grant and Ivan Beschastnikh, University of British Columbia
  • High Performance and Usable RPCs for Datacenter Fabrics. Anuj Kalia, Carnegie Mellon University; Michael Kaminsky, Intel Labs; David G. Andersen, Carnegie Mellon University 

NSDI Day 3

Day 3 had these sessions:

  • Network monitoring and diagnosis
  • Fault-Tolerance
  • Physical Layer
  • Configuration Management

Since the sessions were networking-specific, and since I was tired from the firehose of information in the first 2 days, I didn't take many notes on day 3. So I will just include my notes on the Plover paper from the fault-tolerance session.

PLOVER: Fast, Multi-core Scalable Virtual Machine Fault-tolerance

This paper is by Cheng Wang, Xusheng Chen, Weiwei Jia, Boxuan Li, Haoran Qiu, Shixiong Zhao, and Heming Cui, The University of Hong Kong.

This work builds on the Remus paper, which appeared in NSDI'08 and received a test-of-time award at NSDI this year.

The two limitations of the Remus primary/backup VM replication approach were that:

  • too many memory pages need to be copied and transferred, and
  • a split brain is possible due to partitioning.

This work, Plover, uses Paxos to address both problems. By using 3 nodes, it doesn't suffer from the split brain that the primary/backup approach suffers from. By totally ordering the requests seen by the replicas, it can avoid copying memory pages: replicas executing the same sequence of inputs should have the same state --- assuming deterministic execution, of course.

The drawback is that Paxos cannot handle non-determinism in request execution. To fix this, Plover periodically invokes VM page synchronization before releasing replies.

The good news is that using Paxos to totally order the requests makes most memory pages identical across replicas: the paper reports 97% of the pages being the same. So the VM synchronization is lightweight, because it only needs to take care of the remaining 3% of pages.
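
As a rough sketch of why this is cheap (my own illustration, not Plover's actual implementation), the synchronization only has to hash the pages on both sides and ship the few that differ:

    import hashlib

    PAGE_SIZE = 4096

    def page_digests(memory: bytes):
        return [hashlib.sha256(memory[i:i + PAGE_SIZE]).digest()
                for i in range(0, len(memory), PAGE_SIZE)]

    def pages_to_transfer(primary_mem: bytes, backup_digests):
        # Return (index, page) pairs for the pages whose contents differ on the backup.
        diffs = []
        for idx, digest in enumerate(page_digests(primary_mem)):
            if digest != backup_digests[idx]:
                diffs.append((idx, primary_mem[idx * PAGE_SIZE:(idx + 1) * PAGE_SIZE]))
        return diffs

    # With replicas executing the same totally-ordered inputs, ~97% of digests
    # match, so only the few divergent (non-deterministic) pages get shipped.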

Plover is available on github.

I wonder: since Plover already does VM synchronization, does it really need a 100% total order on the requests delivered to the replicas via Paxos? Would it be possible to use a relaxed but faster ordering? The TAPIR project explored relaxed ordering of operations for storage systems, and some of those lessons may be applicable here.

MAD questions

Ok, the MAD questions today pick up the thread from last time. How do you improve the conference experience? Are conferences cramming too many technical sessions into a day? What can be done differently to improve the interactivity and networking of the conference participants?

A major reason I go to conferences is to meet people doing interesting work, converse with them, and learn about their perspectives and thought processes.

During the 3 days at NSDI, I talked with 14 people. That is not a bad number for me. I am not from the networking/NSDI community, so I don't know most people there; I get more chances to interact when I go to a conference where I know more people. Unfortunately, since I kept switching research areas (theoretical distributed systems 98-00, wireless sensor networks 00-10, smartphones/crowdsourcing 10-13, cloud computing 13-18), I don't have a home conference where I know most people.

Out of these 14 people, I knew only 3 from before. Of the rest, I knew a couple from interacting on Twitter, but the majority were cold-hello first-time interactions.

The cold-hello interactions are hard, and as an introvert and a shy person (except when I am curious) I had to force myself to have these first-time interactions. I assume the people I approach are also interested in talking to people (that is what conferences are supposed to be about), and that we can have nice, interesting conversations since we share an interest in distributed systems, or at least in research. I would say 75% of the conversations I had were interesting and not superficial. But sometimes it bombs and that gets awkward. And instead of being happy about the nice interactions, it is easy to focus on the awkward ones and feel bad about them.

Although I am happy with meeting 14 interesting people, this is much lower than the number of people I meet and talk with at HPTS. If you look at my posts about HPTS, you can see that I made it a point to emphasize how much I enjoyed its interactivity.

I think a major way HPTS makes this happen is that it sets the intentions clearly and states them explicitly on the first day. Pat Helland takes the stage and says that "the point of HPTS is to meet other people and interact, and the sessions are just a break from meeting/networking with other people." Since HPTS makes the cold hello the norm, it does not feel weird anymore. I never had an awkward conversation at HPTS.

I am sure there are many ways to build interactivity and networking into conferences. Why don't we make the poster session a long session in the afternoon, rather than after 6pm? Are there ice-breaker activities that conferences could adopt? I remember that at an activity with 20 people, the moderator asked everyone to say something uniquely quirky about themselves. That broke the ice pretty quickly; I still remember some of the quirks people mentioned. To scale to larger groups, it may be possible to have open-mic crazy-ideas, dangerous-ideas, and hot-takes sessions. Maybe we need to get some professionals to help, say improv comedians or capable event ice-breakers. (I assume big companies like Google have skilled HR people who help tech people interact better, right?)

Ok, this can get long, and I am not really knowledgeable in this area, so I will stop here. But here is my ask. Next time if you see me at a conference, please approach me and say hello. I am sure we will have a nice conversation and we will have things to learn from each other.

Maybe next time I should make a badge to make this explicit: "Please say Hi to me, I love to meet and talk to you."

Wednesday, April 11, 2018

Notes from USENIX NSDI 18 First Day

I have been attending USENIX NSDI, one of the premier conferences on networking, in Seattle, WA. Here are some notes from the first day, Monday, April 9.

Pre-session announcements

NSDI accepted 40 papers out of 255 submissions. There was a mention of 1000 reviews done for the conference. That is a lot of reviews by very highly qualified people. It is a shame those reviews are not shared openly; they could be really useful for the community to learn from, and providing them may also expose any sub-par reviews. There is a movement toward an open review process, and I hope it catches on at a wider scale.

The best paper award went to "NetChain: Scale-Free Sub-RTT Coordination" by Xin Jin, Johns Hopkins University; Xiaozhou Li, Barefoot Networks; Haoyu Zhang, Princeton University; Nate Foster, Cornell University; Jeongkeun Lee, Barefoot Networks; Robert Soulé, Università della Svizzera italiana; Changhoon Kim, Barefoot Networks; Ion Stoica, UC Berkeley.

The community award (for the best paper whose code and dataset are made publicly available) went to "Stateless Datacenter Load-balancing with Beamer" by Vladimir Olteanu, Alexandru Agache, Andrei Voinescu, and Costin Raiciu, University Politehnica of Bucharest.

NSDI makes all papers available publicly. So if any of the papers here interest you, you can download and read the paper.

First session 

The first session was on new hardware. The main theme was how to get the performance that hardware solutions offer with the programmability of software solutions. I provide short summaries of the presentations of two papers. The session also included two other papers, "PASTE: A Network Programming Interface for Non-Volatile Main Memory" and "Azure Accelerated Networking: SmartNICs in the Public Cloud" (Microsoft).

Approximating fair queueing on reconfigurable switches

The paper is by Naveen Kr. Sharma and Ming Liu, University of Washington; Kishore Atreya, Cavium; Arvind Krishnamurthy, University of Washington.

Congestion control today is done via end-to-end protocols; the switches are dumb. What if they were smarter? That would provide benefits for the end hosts, fairness, etc.

But smart switches are challenging to realize for high-speed networks. In a high-speed network, it is hard to:

  • maintain a sorted packet buffer
  • store per-flow counters
  • access and modify current round number.

The work implements fair queueing without per-flow queues in high-speed switches. The approach, approximate fair queueing, simulates a bit-by-bit round-robin scheme with key approximations.

The approximate flow counters are stored using a variation of the count-min sketch. The results show that approximate fair queueing can be achieved with 12-14 queues, and the evaluation shows it leads to a 4-8x improvement in flow completion times.
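
For reference, here is a minimal count-min sketch keeping per-flow byte counters. This is my sketch of the generic data structure; the paper uses a hardware-friendly variation of it on the switch, and the sizes below are arbitrary.

    import random

    class CountMinSketch:
        def __init__(self, width=1024, depth=4, seed=42):
            rng = random.Random(seed)
            self.width = width
            self.rows = [[0] * width for _ in range(depth)]
            self.salts = [rng.getrandbits(32) for _ in range(depth)]

        def _index(self, flow_id, salt):
            return hash((salt, flow_id)) % self.width

        def update(self, flow_id, nbytes):
            for row, salt in zip(self.rows, self.salts):
                row[self._index(flow_id, salt)] += nbytes

        def estimate(self, flow_id):
            # Collisions can only inflate a counter, so the minimum over the
            # rows is the tightest (over-)estimate.
            return min(row[self._index(flow_id, salt)]
                       for row, salt in zip(self.rows, self.salts))

    cms = CountMinSketch()
    cms.update(("10.0.0.1", "10.0.0.2", 80), 1500)
    cms.update(("10.0.0.1", "10.0.0.2", 80), 1500)
    print(cms.estimate(("10.0.0.1", "10.0.0.2", 80)))  # >= 3000; exact unless collisions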

NetChain: Scale-Free Sub-RTT Coordination

The paper is by Xin Jin, Johns Hopkins University; Xiaozhou Li, Barefoot Networks; Haoyu Zhang, Princeton University; Nate Foster, Cornell University; Jeongkeun Lee, Barefoot Networks; Robert Soulé, Università della Svizzera italiana; Changhoon Kim, Barefoot Networks; Ion Stoica, UC Berkeley.

This work received the best paper award. It is also in the distributed coordination area I am interested in, so I have relatively long coverage of it.

Conventional wisdom says coordination is expensive. NetChain aims to provide lightning-fast coordination enabled by programmable switches. Applications of such a coordination service include configuration management, group membership, locking, and barrier synchronization.

Right now these are provided by a coordination service: a strongly-consistent, fault-tolerant key-value store that is typically replicated using Paxos. Can we do better in terms of latency and throughput?

The opportunity for in-network coordination is that distributed coordination is communication-heavy rather than computation-heavy. The idea is to run the coordination, using consensus, in the switches. The design goals are high throughput and strong consistency.

How do they build a strongly consistent, fault-tolerant, in-network (that is, in the switches) key-value store? They use the chain replication protocol: a master configurator manages the chain configuration, and the storage nodes on the chain replicate the key-value pairs.

The in-network (that is, on the switch) key-value storage builds on the SOSP 17 paper titled "NetCache: Balancing Key-Value Stores with Fast In-Network Caching" which leverages register arrays in the switches.

There is a problem of possible out-of-order delivery between consecutive switches in the chain, which can lead to inconsistency. The presentation says they solve this by serializing updates with sequence numbers and dropping out-of-order packets; the onus is on the client to retry when its request doesn't get a reply in time.
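
Here is a toy sketch of that write path (my simplification, in software; the real thing runs in the switch data plane): writes carry a per-key sequence number, propagate down the chain, get dropped if they arrive out of order, and are acknowledged only once they reach the tail, while reads are served by the tail.

    class ChainNode:
        def __init__(self, name):
            self.name = name
            self.store = {}      # key -> value
            self.last_seq = {}   # key -> last sequence number applied
            self.next = None     # successor in the chain (None for the tail)

        def write(self, key, value, seq):
            if seq != self.last_seq.get(key, 0) + 1:
                return False                 # out-of-order packet: drop, client retries
            self.store[key] = value
            self.last_seq[key] = seq
            if self.next is None:
                return True                  # reached the tail: write is committed
            return self.next.write(key, value, seq)

        def read(self, key):
            # Reads go to the tail, which holds only committed writes.
            return self.store.get(key) if self.next is None else self.next.read(key)

    head, mid, tail = ChainNode("s1"), ChainNode("s2"), ChainNode("s3")
    head.next, mid.next = mid, tail
    assert head.write("lock:/x", "client-7", seq=1)
    print(tail.read("lock:/x"))   # client-7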

Of course, the complicated part is handling a switch failure. The chain replication recovery technique is adapted, so that the master configurator (which is itself Paxos-maintained) can reconfigure the chain by removing the crashed node. Then, to restore fault tolerance, a new node needs to be added later. The paper says the master first copies the state to the newly added node and then adds it to the chain. Of course, there is a catching-up problem for the newly added switch if its predecessor keeps receiving and inserting new items; some blocking is needed to coordinate this, probably via two-phase commit. I quickly scanned the paper and didn't see this discussed.

The paper has a TLA+ specification in the extended arXiv version. I checked the TLA+ model, and it assumes that copying state to the new switch is done atomically, which seems to be an over-simplification of what needs to be implemented in reality.

The work is evaluated over 4 Barefoot Tofino switches and 4 commodity servers. I think the presenter said that up to 100K key-value pairs could be stored in the 8Mb of storage available on the switches. Compared to ZooKeeper, the solution provided 3 orders of magnitude improvement in throughput, 1 order of magnitude improvement in read latency, and 2 orders of magnitude in write latency.

The presenter concluded by asking what kinds of new applications could be enabled by a faster coordination service in the datacenter. I am still unclear on what sub-RTT means in the title, but I will read the paper and write a longer review later.

Second session

The second session was on distributed systems, my favorite topic. I have brief summaries of the three papers presented in this session.

zkLedger: Privacy-Preserving Auditing for Distributed Ledgers

This paper is by Neha Narula, MIT Media Lab; Willy Vasquez, University of Texas at Austin; Madars Virza, MIT Media Lab.

Neha presented this paper, and pulled it off without using the word blockchain even once.

Verification provided by distributed ledgers should not mean everything is in the open: companies require some privacy to keep certain business strategies and positions secret. On the other hand, too much secrecy is also a bad thing; there needs to be some auditability to prevent bad actors.

zkLedger provides practical privacy and complete auditing.

A big challenge here is that a bad-actor bank could omit transactions. zkLedger jointly addresses the privacy and auditability requirements by keeping an entry for every bank in every transaction. To hide values, zkLedger uses Pedersen commitments. The key insight is that the auditor audits every transaction. zkLedger also uses an interactive map/reduce paradigm over the ledger with non-interactive zero-knowledge proofs (NIZKs) to compute measurements that go beyond sums.
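
To give a flavor of the Pedersen commitments, here is a toy sketch with tiny, insecure parameters of my own choosing (zkLedger actually works over elliptic-curve groups): commit(v, r) = g^v * h^r mod p hides the value v, and commitments multiply to a commitment of the sum, which is what lets an auditor check totals across ledger entries without seeing the individual amounts.

    p, q = 1019, 509          # toy group with p = 2q + 1; q is the subgroup order
    g, h = 4, 9               # generators of the order-q subgroup (toy values)

    def commit(value, blinding):
        return (pow(g, value % q, p) * pow(h, blinding % q, p)) % p

    c1 = commit(30, 17)        # bank A's entry: +30, blinding factor 17
    c2 = commit(-30 % q, 25)   # bank B's entry: -30, blinding factor 25
    # The product commits to the sum of the values (0 for a balanced transfer)
    # under the sum of the blinding factors:
    assert (c1 * c2) % p == commit(0, 17 + 25)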

The abstract of the paper provides a good summary, so here it is:
Banks create digital asset transactions that are visible only to the organizations party to the transaction, but are publicly verifiable. An auditor sends queries to banks, for example "What is the outstanding amount of a certain digital asset on your balance sheet?" and gets a response and cryptographic assurance that the response is correct. zkLedger has two important benefits over previous work. First, zkLedger provides fast, rich auditing with a new proof scheme using Schnorr-type non-interactive zero-knowledge proofs. Unlike zk-SNARKs, our techniques do not require trusted setup and only rely on widely-used cryptographic assumptions. Second, zkLedger provides completeness; it uses a columnar ledger construction so that banks cannot hide transactions from the auditor, and participants can use rolling caches to produce and verify answers quickly. We implement a distributed version of zkLedger that can produce provably correct answers to auditor queries on a ledger with a hundred thousand transactions in less than 10 milliseconds.

Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization

This paper is by Yilong Geng, Shiyu Liu, and Zi Yin, Stanford University; Ashish Naik, Google Inc.; Balaji Prabhakar and Mendel Rosenblum, Stanford University; Amin Vahdat, Google Inc.

This work aims to provide accurate timestamping as a primitive (with 10-nanosecond accuracy) in datacenters at scale. Tightly synchronized clocks have applications in distributed systems, particularly for distributed databases such as Spanner and CockroachDB.

The system they develop, Huygens, takes NIC timestamps (supported by most current-generation NICs) and doesn't require specialized switches, as PTP does. Other than the NIC timestamping support, Huygens is software-based.

Huygens leverages three key ideas. First, coded probes are used to identify and reject impure probe data that suffered queueing delays. Huygens sends 2 consecutive probe packets with a known gap of 10 microseconds (using NIC timestamping) and checks the gap between them at the receiving end. Only probe pairs that arrive with the original 10-microsecond gap are accepted as pure, since they most likely experienced zero queueing delay: the queues change fast, so it is very unlikely that both consecutive packets were subject to the same nonzero queueing delay.
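
A minimal sketch of that filter (my simplification, with a made-up tolerance): keep only probe pairs whose receive spacing matches the known transmit spacing.

    TOLERANCE_US = 0.05      # assumed tolerance on spacing distortion

    def pure_pairs(probe_pairs):
        # probe_pairs: list of ((tx1, rx1), (tx2, rx2)) NIC timestamps in microseconds;
        # the two probes of each pair are sent a known 10 us apart.
        pure = []
        for (tx1, rx1), (tx2, rx2) in probe_pairs:
            if abs((rx2 - rx1) - (tx2 - tx1)) <= TOLERANCE_US:
                pure.append((tx1, rx1))   # keep one clean (tx, rx) sample per pure pair
        return pure

    pairs = [((0.0, 53.1), (10.0, 63.1)),       # spacing preserved -> likely no queueing
             ((100.0, 158.0), (110.0, 171.2))]  # spacing stretched -> queued, rejected
    print(pure_pairs(pairs))   # [(0.0, 53.1)]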

Second, Huygens processes the purified data with support vector machines, a widely used and powerful classifier, to accurately estimate one-way propagation times. (Huygens assumes the delays between two servers are symmetric.)

Finally, to detect and correct synchronization errors even further, Huygens exploits the network effect that a group of pairwise-synchronized clocks must also be transitively synchronized.
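
A tiny sketch of that check (my own illustration with made-up numbers): pairwise offset estimates around any loop of machines should sum to roughly zero, and a nonzero residual exposes estimation error that can then be corrected.

    offsets = {            # offsets[(a, b)] ~ clock_b - clock_a, in microseconds
        ("A", "B"): 4.2,
        ("B", "C"): -1.1,
        ("C", "A"): -2.5,  # ideally -(4.2 - 1.1) = -3.1; the loop exposes the error
    }

    def loop_residual(cycle):
        return sum(offsets[(cycle[i], cycle[(i + 1) % len(cycle)])]
                   for i in range(len(cycle)))

    residual = loop_residual(["A", "B", "C"])
    print(f"{residual:.1f} us of loop inconsistency")  # 0.6 us
    # A naive fix would spread the residual evenly back over the three links;
    # Huygens instead solves for corrections across the whole network.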

One of the questions was about the high probing rate employed by this work. Another was whether this could be extended to the WAN; the presenter mentioned 10-microsecond accuracy in a WAN experiment, but I wonder if that is due to Google datacenters having private, high-speed links.

One very interesting question was whether this could be used for measuring the temperature in datacenters. This is a great question, not a crazy one. High temperature makes the local clock run slower, and there is a predictable linear relationship. So if you have tightly synchronized clocks, you can measure the drift from the ideal rate and infer temperature increases/decreases.

I will read and summarize this paper later, since I am interested in mechanisms and applications of clock synchronization in distributed systems. After our work on hybrid logical clocks (HLC), we had built a distributed monitoring system, Retroscope, using HLC.

SnailTrail: Generalizing Critical Paths for Online Analysis of Distributed Dataflows

This paper is by Moritz Hoffmann, Andrea Lattuada, John Liagouris, Vasiliki Kalavri, Desislava Dimitrova, Sebastian Wicki, Zaheer Chothia, and Timothy Roscoe, ETH Zurich.

I had summarized this work earlier on my blog; please see that post.

Third session

The third session was on traffic management. I am not really a networking guy, but I listened to these talks to get a better understanding of what is going on in this domain.

The papers presented were:

  • "Balancing on the Edge: Transport Affinity without Network State", 
  • "Stateless Datacenter Load-balancing with Beamer", 
  • "Larry: Practical Network Reconfigurability in the Data Center", and
  • "Semi-Oblivious Traffic Engineering: The Road Not Taken"

Fourth session 

The fourth session was on network function virtualization and hardware.

The papers presented were:

  • Metron: NFV Service Chains at the True Speed of the Underlying Hardware
  • G-NET: Effective GPU Sharing in NFV Systems
  • SafeBricks: Shielding Network Functions in the Cloud
All of these papers are available on the NSDI'18 webpage.

MAD questions

Really, you read this far into this long post, and still expect me to write some MAD questions? Ok, here is just one, so that my promise is not broken.

The afternoon, and especially the late afternoon, isn't really great for listening to conference talks. Following talks requires a lot of mental effort, because the deliveries are very quick. Moreover, the context changes drastically from talk to talk, and that also depletes attention.

I wonder if the late afternoons, at least, would be better spent on panels or other more lively, less concentration-demanding activities.

I have heard of a book on these topics, "When: The Scientific Secrets of Perfect Timing" by Daniel Pink. I plan to check it out.

Tuesday, April 3, 2018

The Stellar Consensus Protocol: A Federated Model for Internet-level Consensus

Last week in our seminar we discussed the Stellar consensus paper.

The paper is long, 32 pages. It looks like the paper is written the way the protocol was conceived and reasoned about. First comes a section on the federated Byzantine agreement (FBA) model, which talks about quorum slices and the quorums that result from them. The next section talks about the requirements for quorum intersections and defines the dispensable sets, with associated theorems. Then comes the federated voting section, with subsections titled: Voting with open membership, Blocking sets, Accepting statements, Accepting is not enough, Statement confirmation, and Liveness and neutralization. On page 19, the Stellar Consensus Protocol (SCP) section starts, with more definitions and proofs intertwined with the protocol description. At this point the reader is already overwhelmed, trying to maintain theories about how the previous 19 pages might connect back to the protocol, and is confronted with the task of reading through a 9-page-long SCP protocol section.

The paper would be significantly improved if it were rewritten top-down: not in the way the protocol was conceived and proved, but in a reader-friendly manner prioritizing clear communication of the protocol basics.

It was hard reading through this paper, and I didn't read it thoroughly. Here is what I understand:
The Stellar Consensus Protocol is PBFT with quorums derived from the quorum slices of federated participants, instead of traditional quorums drawn from a closed set of participants.
It would be nice if SCP provided a mechanism that prevents participants from selecting bad quorum slices that lead to bad quorums.

Background and context

Traditional Byzantine agreement protocols have closed membership: the number of participating nodes and their identities (established via public/private keys or via non-transferable symmetric-key MACs) are fixed.

SCP is a federated Byzantine agreement (FBA) protocol, which allows open membership instead of requiring closed membership. In a federated open-membership model, we don't know the number of nodes in the entire system or all of their identities, and nodes may join and leave.

One way to deal with open membership is proof-of-work blockchains, as in Bitcoin. That approach has problems with excessive energy consumption, poor throughput scalability, and, due to the probabilistic nature of commits, long wait times to get a good guarantee of a transaction's irreversibility.

SCP does not use proof-of-work blockchains. It adapts the PBFT protocol to work in an open-membership, federated environment. PBFT is a 3-phase deterministic Byzantine consensus protocol. It has similarities with Paxos: in fact, if you extend Paxos, which tolerates only crash faults, to tolerate Byzantine faults, you get pretty much the PBFT protocol.


The federated model

The federated model means that a node can have a PBFT/consensus agreement with a set of nodes it specifies, without involving all the nodes in the network in this agreement.

To this end, each node specifies its quorum slices in its config file. A quorum slice consists of the nodes it trusts, hopefully a diverse, well-balanced portfolio. By declaring a quorum slice, the node says that it finds this consortium of nodes (not necessarily each one individually) trustworthy, and will rely on the consortium to convince itself of agreement and to bless/endorse its transactions. Traditional non-federated Byzantine agreement requires all nodes to accept the same slices; in FBA, the key innovation is enabling each node to choose its own set of quorum slices.

These quorum slices are used for constructing quorums. A quorum is a set of nodes sufficient to reach agreement: concretely, a non-empty set of nodes that contains at least one slice of each of its members.
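
Concretely, here is a toy check of that definition (the slices are my own made-up example): a non-empty set U is a quorum if it contains at least one quorum slice of every one of its members.

    slices = {
        # node -> list of its quorum slices (each slice contains the node itself)
        "a": [{"a", "b", "c"}],
        "b": [{"b", "c", "d"}],
        "c": [{"b", "c", "d"}],
        "d": [{"b", "c", "d"}],
    }

    def is_quorum(candidate):
        return bool(candidate) and all(
            any(s <= candidate for s in slices[node]) for node in candidate)

    print(is_quorum({"a", "b", "c", "d"}))  # True
    print(is_quorum({"b", "c", "d"}))       # True: every member has a slice inside
    print(is_quorum({"a", "b", "c"}))       # False: a's slice fits, but b's slice needs d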

For safety, any two quorums in the network need to intersect, and the intersection should contain non-Byzantine nodes. If the quorum intersection consists entirely of Byzantine nodes, then SCP cannot guarantee safety.

The onus is on the users to provide good quorum slices. SCP does not provide a way to check the soundness/integrity of the quorum slices that give rise to quorums. Again, if the quorum intersection consists entirely of Byzantine nodes, safety is violated, and SCP doesn't accept responsibility for that. To quote the paper: "SCP can only guarantee safety when nodes choose adequate quorum slices."

The SCP protocol starts with a nomination phase which, if run long enough, eventually produces the same set of candidate values at every intact node; this means nodes can combine the candidate values in a deterministic way to produce a single composite value for the slot. Upon the predicted/approximated convergence of the nomination phase, the nodes start the ballot phase and perform federated voting (PBFT-style) to commit and abort ballots associated with composite values.

When intact nodes agree to commit a ballot, the value associated with the ballot will be externalized for the slot in question. When they agree to abort a ballot, the ballot's value becomes irrelevant. If a ballot gets stuck in a state where one or more intact nodes cannot commit or abort it, then nodes try again with a higher ballot; they associate the new ballot with the same value as the stuck one in case any node believes the stuck ballot was committed.

Safety results from ensuring that all stuck and committed ballots are associated with the same value. Liveness follows from the fact that a stuck ballot can be neutralized by moving to a higher ballot. The good news about SCP is that provided that the quorum condition is satisfied, any committed decision is instantly irreversible.

Here are some videos on SCP (well mostly the motivation and setup of Federated Byzantine Agreement without the SCP protocol description): https://www.youtube.com/watch?v=mB9UW7HK8pc and https://www.youtube.com/watch?v=zTI1HAWDHIg.

MAD questions

1. What are the scalability limits of SCP? 
That the quorums need to intersect is a limitation. If the quorum selections are not done carefully, you may need majority quorums for the system, and a PBFT-based protocol would start to suffer with 20 nodes in a quorum and 40 nodes in the system. But there are better ways to select your quorum slices: instead of a flat system, if you use a hierarchical system with tier-1, tier-2, and tier-3 nodes, and choose your quorums through this hierarchy, you can satisfy the quorum property with about log(N) nodes instead of N/2 nodes. Hierarchies actually work pretty well for scaling: think of a tree with 10 children per node; at level 4 there will be 10,000 nodes, and at level 5, 100,000 nodes.


2. There has been a lot of work on quorum systems and probabilistic quorum systems. Can those be employed to help with the scalability problem in SCP? 

Maybe even some graph algorithms can be relevant, like the "Sparse Partitions" work by Baruch Awerbuch and David Peleg. 

3. Is it possible to come up with a service for SCP that provides checks and prevents nodes from selecting bad quorum slices that lead to bad quorums? 
But why would you trust that service? The service itself should be built in a trustless way.

4. How can we implement per-service sharding in SCP?
In the SCP model described in the paper, all transactions are related and potentially interfering with or dependent on each other, since no per-service sharding is considered. How can we implement sharding support for SCP that provides parallelism inside the services, but also allows occasional cross-service transactions and prevents double spending? Would it be possible to build something similar to the Aspen model for SCP?