Thursday, April 12, 2018

Notes from USENIX NSDI Days 2 & 3

Before talking about Days 2 & 3, here are my notes from NSDI day 1 if you are interested.

Again all of the papers mentioned (and omitted) below are accessible publicly at the NSDI web page. Enjoy!

NSDI day 2 

Day 2 had the following sessions:

  • web and video
  • performance isolation and scaling
  • congestion control
  • cloud

Here are the papers I took detailed notes on. I ordered them by how interesting I found them.

Towards Battery-Free HD Video Streaming

This paper is by Saman Naderiparizi, Mehrdad Hessar, Vamsi Talla, Shyamnath Gollakota, and Joshua R Smith, University of Washington.

This work was mind blowing for me. This is a strong group that developed the Wisp motes before, but even so, the battery-free video-streaming sensors they presented were very impressive, and the talk included a live demo of them.

So here is the deal. The goal of this work is to design sticker-form-factor, battery-free cameras. But that sounds crazy, right? Video streaming is power hungry; how can you go battery-free?

The paper looks closely at the costs of the components of a video streaming camera. It turns out the image sensor, at 85 microwatts, is very low power. On the other hand, the radio, at 100 milliwatts, is very high power.

If only we could offload the task of communication away from the sensor, we could pull this off!

The inspiration for the idea comes from the Russian Great Seal bug. Inside the seal was a very thin membrane that vibrated when there was talking in the room. The Russian spies then used a directional remote radio to receive those analog vibrations and reconstruct the sound. The following is from the Wikipedia page on this seal bug, called "the Thing":
The "Thing" consisted of a tiny capacitive membrane connected to a small quarter-wavelength antenna; it had no power supply or active electronic components. The device, a passive cavity resonator, became active only when a radio signal of the correct frequency was sent to the device from an external transmitter. Sound waves (from voices inside the ambassador's office) passed through the thin wood case, striking the membrane and causing it to vibrate. The movement of the membrane varied the capacitance "seen" by the antenna, which in turn modulated the radio waves that struck and were re-transmitted by the Thing. A receiver demodulated the signal so that sound picked up by the microphone could be heard, just as an ordinary radio receiver demodulates radio signals and outputs sound.
The group applies the same principle to cameras to make them battery-free. They showed on stage the first demo of analog video backscatter, which sends pixels directly to the antenna. The prototype had a range of 27 feet and streamed real-time video at 112x112 resolution. A software-defined radio was used to recreate the backscattered analog video.

For the analog hardware, the speaker said they took inspiration from human brain signals and created pulse-width-modulated pixels. More engineering went into intra-frame compression, leveraging the fact that adjacent pixels are fairly similar.
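
The compression itself is done in analog hardware, but the intuition is easy to illustrate digitally. Here is a toy sketch (entirely my own, not the paper's scheme) of delta-encoding a row of pixels, which yields small values when adjacent pixels are similar:

```python
# Toy digital illustration (not the paper's analog scheme): delta-encode a
# row of pixels, exploiting the similarity of adjacent pixels.

def delta_encode_row(row):
    """First pixel as-is, then successive differences."""
    return [row[0]] + [cur - prev for prev, cur in zip(row, row[1:])]

def delta_decode_row(deltas):
    """Invert delta_encode_row."""
    row = [deltas[0]]
    for d in deltas[1:]:
        row.append(row[-1] + d)
    return row

row = [120, 122, 121, 125, 200, 201]
print(delta_encode_row(row))            # [120, 2, -1, 4, 75, 1]
assert delta_decode_row(delta_encode_row(row)) == row
```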

Vesper: Measuring Time-to-Interactivity for Web Pages

This paper is by  Ravi Netravali and Vikram Nathan, MIT CSAIL; James Mickens, Harvard University; Hari Balakrishnan, MIT CSAIL.

This work asks the question of "what does it mean for a page to load quickly?" In other words, how should we define the load time?

The way page loads work is like this. The URL typed in the browser prompts the server to send the HTML. The browser runs the HTML + JavaScript (requesting other embedded objects as it encounters them in the HTML, but that is the topic of Ravi and James's other paper in the session). In the browser there is a JavaScript engine and a rendering engine, which constructs the DOM tree. The JS engine and the DOM tree interact via the DOM API.

Often the page-load metric used is the page load time. Of course, this is very conservative: only some of the page content is immediately visible, and the invisible part doesn't matter, so why care about its load time? Making this observation, Google came up with the speed index: time to render above-the-fold content, i.e., the part visible in the browser window.

But the speed index is also deficient because it says nothing about JavaScript execution time, which affects the user experience (autocomplete, etc.). An interactive page is not usable without its JavaScript, and today a large fraction of pages are interactive: the talk cited a median of 182 event handlers and a 95th percentile of 1252 handlers in the pages surveyed.

To improve on the speed index, this work comes up with the ready index, which captures a page's time-to-interactivity in terms of both visibility and functionality.

But the challenge is that nobody knows a good way to automatically identify the interactive state: is the JavaScript working yet?

The system, Vesper, uses a two-phase approach (a rough sketch of the resulting metric follows the list):

  • identify the page's interactive state: the visible elements, their event handlers, and the state those handlers access when fired
  • track the loading progress of the interactive state identified in phase 1
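
Roughly speaking, once phase 1 has produced a per-element time at which that element becomes interactive (visible, with its handlers and the state they access loaded), a speed-index-style metric falls out of integrating the not-yet-ready fraction over time. A hypothetical sketch of that final step (my simplification, not Vesper's actual definition):

```python
# Hypothetical sketch (my simplification, not Vesper's exact definition):
# integrate the fraction of above-the-fold interactive elements that are not
# yet ready over time; this equals the average of the per-element ready times.

def ready_metric(element_ready_times_s):
    """element_ready_times_s: time (seconds) at which each element became
    visible *and* functional (handlers + state loaded). Lower is better."""
    return sum(element_ready_times_s) / len(element_ready_times_s)

print(ready_metric([0.4, 0.9, 2.1]))   # ~1.13 seconds
```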

As a side note, the speaker, Ravi, spoke in a relaxed and clear way. The talk didn't feel rushed, but covered a lot of stuff. I really loved the delivery.

Performance Analysis of Cloud Applications

This paper is by Dan Ardelean, Amer Diwan, and Chandra Erdman, Google.

This work from Google considers the question of how to evaluate a change before deploying it in production. Yes, the accepted approach is to use A/B testing over a small fraction of the users, but is that enough?

The abstract has this to say:
"Many popular cloud applications are large-scale distributed systems with each request involving tens to thousands of RPCs and large code bases. Because of their scale, performance optimizations without actionable supporting data are likely to be ineffective: they will add complexity to an already complex system often without chance of a benefit. This paper describes the challenges in collecting actionable data for Gmail, a service with more than 1 billion active accounts.
Using production data from Gmail we show that both the load and the nature of the load changes continuously. This makes Gmail performance difficult to model with a synthetic test and difficult to analyze in production. We describe two techniques for collecting actionable data from a production system. First, coordinated bursty tracing allows us to capture bursts of events across all layers of our stack simultaneously. Second, vertical context injection enables us combine high-level events with low-level events in a holistic trace without requiring us to explicitly propagate this information across our software stack."
Vertical context injection roughly means collecting the trace at the kernel level, using ftrace, where the layers above the kernel inject information into the kernel via stylized syscalls with payloads.
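
The paper's mechanism encodes the high-level information into the arguments of otherwise innocuous syscalls so that it shows up in the kernel trace. As a stand-in illustration of the general idea (not the paper's exact mechanism), here is a sketch that injects an application-level event into an ftrace timeline through the standard trace_marker file:

```python
# Stand-in illustration (not the paper's exact mechanism): inject a high-level
# application event into the kernel ftrace timeline, so it can be correlated
# with low-level kernel events captured in the same trace.

TRACE_MARKER = "/sys/kernel/debug/tracing/trace_marker"  # standard ftrace file

def inject_event(tag, payload):
    """Write an application-level marker into the kernel trace buffer.
    Requires tracing to be enabled and permission to write to tracefs."""
    with open(TRACE_MARKER, "w") as f:
        f.write(f"{tag} {payload}\n")

# Example: bracket the handling of one request, so the kernel events
# (scheduling, I/O) in between can be attributed to it.
inject_event("request_start", "id=42")
# ... handle the request ...
inject_event("request_end", "id=42")
```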

The paper concludes with these observations. For meaningful performance experiments:

  • do experiments in production
  • use controlled A/B tests with 10 million+ users (fewer is not very meaningful)
  • use long-lived tests to capture the changing mix of requests
  • use creative approaches (vertical context injections) for collecting rich data cheaply.

LHD: Improving Cache Hit Rate by Maximizing Hit Density

This paper is by Nathan Beckmann, Carnegie Mellon University; Haoxian Chen, University of Pennsylvania; Asaf Cidon, Stanford University and Barracuda Networks.

Who knew... Cache eviction policies still require work and you can achieve big improvements there.

To motivate the importance of cache hit rate research, Asaf mentioned the following. A key-value cache is 100x faster than the database behind it. For Facebook, improving its 98% cache hit rate by just one additional percentage point would improve performance by 35%.
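
The arithmetic behind that claim, under the simplifying assumption that a miss costs 100x a hit:

```python
# Back-of-envelope check, assuming a cache hit costs 1 unit and a miss
# (a trip to the database) costs 100 units.
def avg_cost(hit_rate, miss_penalty=100):
    return hit_rate * 1 + (1 - hit_rate) * miss_penalty

before, after = avg_cost(0.98), avg_cost(0.99)
print(before, after, 1 - after / before)   # 2.98 1.99 ~0.33 (i.e., ~33% faster)
```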

Here is the abstract:
Cloud application performance is heavily reliant on the hit rate of datacenter key-value caches. Key-value caches typically use least recently used (LRU) as their eviction policy, but LRU’s hit rate is far from optimal under real workloads. Prior research has proposed many eviction policies that improve on LRU, but these policies make restrictive assumptions that hurt their hit rate, and they can be difficult to implement efficiently.
We introduce least hit density (LHD), a novel eviction policy for key-value caches. LHD predicts each object’s expected hits-per-space-consumed (hit density), filtering objects that contribute little to the cache’s hit rate. Unlike prior eviction policies, LHD does not rely on heuristics, but rather rigorously models objects’ behavior using conditional probability to adapt its behavior in real time.
To make LHD practical, we design and implement RankCache, an efficient key-value cache based on memcached. We evaluate RankCache and LHD on commercial memcached and enterprise storage traces, where LHD consistently achieves better hit rates than prior policies. LHD requires much less space than prior policies to match their hit rate, on average 8x less than LRU and 2–3x less than recently proposed policies. Moreover, RankCache requires no synchronization in the common case, improving request throughput at 16 threads by 8x over LRU and by 2x over CLOCK.
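
To make the hit-density idea concrete, here is a highly simplified sketch (my own caricature, not RankCache's actual estimator): rank each object by its expected hits divided by the space-time it occupies, and evict the lowest-ranked one.

```python
# Simplified caricature of hit-density-based eviction (not RankCache's actual
# estimator): evict the object with the lowest expected hits per byte-second.

def hit_density(expected_hits, size_bytes, expected_lifetime_s):
    return expected_hits / (size_bytes * expected_lifetime_s)

def choose_victim(objects):
    """objects: dict name -> (expected_hits, size_bytes, expected_lifetime_s)."""
    return min(objects, key=lambda name: hit_density(*objects[name]))

cache = {
    "session-key": (0.9, 100, 10.0),    # small, likely to be hit again soon
    "thumbnail":   (0.5, 4000, 50.0),   # big and long-lived: low hit density
}
print(choose_victim(cache))             # "thumbnail"
```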

Poster session

There was a poster session at the end of day 2. I wish there had been more preliminary-but-bold work in the poster session, because most of the posters simply accompanied papers presented in the main track.

I liked these two posters the most. They are both very interesting works in progress.
  • Distributed Test Case Generation using Model Inference. Stewart Grant and Ivan Beschastnikh, University of British Columbia
  • High Performance and Usable RPCs for Datacenter Fabrics. Anuj Kalia, Carnegie Mellon University; Michael Kaminsky, Intel Labs; David G. Andersen, Carnegie Mellon University 

NSDI Day 3

Day 3 had these sessions:

  • Network monitoring and diagnosis
  • Fault-Tolerance
  • Physical Layer
  • Configuration Management

Since the sessions were networking specific, and since I was tired of the firehose of information spewed at me in the first two days, I didn't take many notes on day 3. So I will just include my notes on the Plover paper from the fault-tolerance session.

PLOVER: Fast, Multi-core Scalable Virtual Machine Fault-tolerance

This paper is by Cheng Wang, Xusheng Chen, Weiwei Jia, Boxuan Li, Haoran Qiu, Shixiong Zhao, and Heming Cui, The University of Hong Kong.

This work builds on the Remus paper which appeared in NSDI08 and received a test-of-time award this year at NSDI.

The two limitations of the Remus primary/backup VM replication approach were that:

  • too many memory pages need to be copied and transferred, and
  • a split brain is possible due to partitioning.

This work, Plover, uses Paxos to address these problems, and Paxos helps with both. By using three nodes, it doesn't suffer from the split brain that the primary-backup approach suffers from. By totally ordering the requests seen by the replicas, it can avoid copying memory pages: replicas executing the same sequence of inputs should end up in the same state --- well, of course, assuming deterministic execution, that is.

The drawback is that Paxos cannot handle non-determinism in request execution. To fix this, Plover invokes VM page synchronization periodically, before releasing replies.

The good news is that using Paxos to totally order requests makes most memory pages identical: the paper reports 97% of the pages being the same. So the VM synchronization is lightweight, because it only needs to take care of the remaining 3% of pages.
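
A highly simplified sketch of the scheme as I understand it (all names and details below are mine, not Plover's code): every replica applies the same Paxos-ordered input log, and the primary periodically ships only the pages that still differ before releasing replies.

```python
# Highly simplified sketch of the Plover idea as I understand it (names and
# details are mine, not Plover's code).

def apply_ordered_inputs(state, ordered_inputs, handler):
    """Every replica applies the same Paxos-ordered inputs; with mostly
    deterministic execution, most memory pages end up identical."""
    for inp in ordered_inputs:
        handler(state, inp)

def sync_divergent_pages(primary_pages, backup_pages):
    """Copy only the pages that differ (the ~3% left by non-determinism)."""
    diffs = {i: page for i, page in primary_pages.items()
             if backup_pages.get(i) != page}
    backup_pages.update(diffs)
    return len(diffs)   # pages actually transferred
```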

Plover is available on GitHub.

I wonder, since Plover already does VM synchronization, does it really need 100% totally-ordered delivery of requests to the replicas via Paxos? Would it be possible to use a relaxed but faster solution? The Tapir project explored relaxed ordering of operations for storage systems, and some of those lessons may be applicable here.

MAD questions

Ok, the MAD questions today pick up the thread from last time. How do you improve the conference experience? Are conferences cramming too many technical sessions into a day? What can be done differently to improve the interactivity and networking of the conference participants?

A major reason I go to conferences is to meet people doing interesting work and converse with them, to learn about their perspectives and thought processes.

During the three days at NSDI, I talked with 14 people. That is not a bad number for me. I am not from the networking and NSDI community, so I don't know most people there. I get more chances to interact with people if I go to a conference where I know more people. Unfortunately, since I kept switching research areas (theoretical distributed systems 98-00, wireless sensor networks 00-10, smartphones/crowdsourcing 10-13, cloud computing 13-18), I don't have a home conference where I know most people.

Out of these 14 people, I only knew 3 beforehand. Of the rest, I knew a couple from interacting on Twitter, but the remaining majority were cold-hello, first-time interactions.

The cold-hello interactions are hard, and as an introvert and a shy person (except when I am curious) I had to force myself to have these first-time interactions. I assume the people I approach are also interested in talking to people (that is what conferences are supposed to be about), and that we can have nice, interesting conversations since we share an interest in distributed systems, or at least in research. I would say 75% of the conversations I had were interesting and not superficial. But sometimes a conversation bombs and that gets awkward. And instead of being happy about the nice interactions you had, it is easy to focus on the awkward ones and feel bad about them.

Although I am happy with meeting 14 interesting people, this is so much lower than the people I meet and talk with at HPTS. If you look at my posts about HPTS, you can see that I made it a point to emphasize how much I enjoyed the interactivity of HPTS.

I think a major way HPTS makes this happen is that it sets the intention clearly and states it explicitly on the first day. Pat Helland takes the stage and says that "the point of HPTS is to meet other people and interact, and the sessions are just a break from meeting/networking with other people". Since HPTS makes the cold-hello the norm, it does not feel weird anymore. I never had an awkward conversation at HPTS.

I am sure there are many ways to build interactivity and networking into conferences. Why don't we make the poster session a long session in the afternoon, rather than after 6pm? Are there ice-breaker activities that conferences could adopt? I remember that at an activity with 20 people, the moderator asked everyone to say something uniquely quirky about themselves. That broke the ice pretty quickly; I still remember some of the quirks people mentioned. To scale to larger groups, maybe it is possible to have open-mic sessions for crazy ideas, dangerous ideas, and hot takes. Maybe we need to get some professionals to help, say improv comedians or skilled event ice-breakers. (I assume big companies like Google have skilled HR people who could help tech people interact better, right?)

Ok, this can get long, and I am not really knowledgeable in this area, so I will stop here. But here is my ask. Next time if you see me at a conference, please approach me and say hello. I am sure we will have a nice conversation and we will have things to learn from each other.

Maybe next time I should make a badge to make this explicit: "Please say Hi to me, I love to meet and talk to you."

Wednesday, April 11, 2018

Notes from USENIX NSDI 18 First Day

I have been attending USENIX NSDI, one of the premier conferences on networking, in Seattle, WA. Here are some notes from the first day, Monday, April 9.

Pre-session announcements

NSDI accepted 40 papers out of 255 submissions. There was a mention of 1000 reviews done for the conference. That is a lot of reviews by very highly qualified people. It is a shame those reviews are not shared openly: they could be really useful for the community to learn from, and publishing them might also expose any sub-par reviews. There is a movement toward an open review process, and I hope it catches on at a wider scale.

The best paper is awarded to "NetChain: Scale-Free Sub-RTT Coordination" by  Xin Jin, Johns Hopkins University; Xiaozhou Li, Barefoot Networks; Haoyu Zhang, Princeton University; Nate Foster, Cornell University; Jeongkeun Lee, Barefoot Networks; Robert Soulé, Università della Svizzera italiana; Changhoon Kim, Barefoot Networks; Ion Stoica, UC Berkeley.

The community award (best paper whose code and dataset made publicly available) went to "Stateless Datacenter Load-balancing with Beamer" by Vladimir Olteanu, Alexandru Agache, Andrei Voinescu, and Costin Raiciu, University Politehnica of Bucharest.

NSDI makes all papers available publicly. So if any of the papers here interest you, you can download and read the paper.

First session 

The first session was on new hardware. The main theme was how to get the performance of hardware solutions with the programmability of software solutions. I provide short summaries of the presentations of two papers. The session also included two other papers, "PASTE: A Network Programming Interface for Non-Volatile Main Memory" and "Azure Accelerated Networking: SmartNICs in the Public Cloud" (from Microsoft).

Approximating fair queueing on reconfigurable switches

The paper is by Naveen Kr. Sharma and Ming Liu, University of Washington; Kishore Atreya, Cavium; Arvind Krishnamurthy, University of Washington.

Congestion control today is done via end-to-end protocols; the switches are dumb. What if they were smarter? That could provide benefits for the end hosts, fairness, etc.

But, smart switches are challenging to realize for high-speed networks. In high-speed networks, it is hard to

  • maintain a sorted packet buffer
  • store per-flow counters
  • access and modify current round number.

The work implements approximate fair queueing (fair queueing without per-flow queues) in high-speed switches. The approach simulates a bit-by-bit round-robin scheme with some key approximations.

The approximate flow counters are stored using a variation of the count-min sketch. The results show that approximate fair queueing is achieved with 12-14 queues. The evaluation also shows that approximate fair queueing leads to a 4-8x improvement in flow completion times.
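
For reference, a count-min sketch keeps a few rows of counters indexed by independent hashes; an item's estimate is the minimum over its counters, which can over-count due to collisions but never under-counts. A minimal sketch of the textbook data structure (the paper uses a variation adapted to switch hardware):

```python
import random

class CountMinSketch:
    """Textbook count-min sketch: depth rows of width counters, one hash per
    row. Estimates never under-count; collisions can cause over-counting."""
    def __init__(self, width=1024, depth=4, seed=0):
        rnd = random.Random(seed)
        self.width, self.depth = width, depth
        self.seeds = [rnd.getrandbits(32) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        return hash((self.seeds[row], key)) % self.width

    def add(self, key, amount=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += amount

    def estimate(self, key):
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))

cms = CountMinSketch()
for _ in range(100):
    cms.add("flow-1", 1500)        # e.g., count bytes per packet of a flow
print(cms.estimate("flow-1"))      # 150000, unless another key collided
```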

NetChain: Scale-Free Sub-RTT Coordination

The paper is by Xin Jin, Johns Hopkins University; Xiaozhou Li, Barefoot Networks; Haoyu Zhang, Princeton University; Nate Foster, Cornell University; Jeongkeun Lee, Barefoot Networks; Robert Soulé, Università della Svizzera italiana; Changhoon Kim, Barefoot Networks; Ion Stoica, UC Berkeley.

This work received the best paper award. And it is also on the distributed coordination area I am interested in, so I have a relatively long coverage of this work.

Conventional wisdom says coordination is expensive. NetChain aims to provide lightning-fast coordination enabled by programmable switches. Some applications of such a coordination service are configuration management, group membership, locking, and barrier synchronization.

Right now these are handled by a coordination service running Paxos, often implemented as a strongly-consistent, fault-tolerant key-value store. Can we do better in terms of latency and throughput?

The opportunity for using in-network coordination is that distributed coordination is communication-heavy rather than computation-heavy. The idea is to run coordination in the switches using consensus. The design goals are to achieve high-throughput and strong consistency.

How do they build a strongly consistent, fault-tolerant, in-network (which means in the switches) key-value store? They use the chain replication protocol: a master configurator manages the chain configuration, and the storage nodes on the chain replicate the key-values.

The in-network (that is, on the switch) key-value storage builds on the SOSP 17 paper titled "NetCache: Balancing Key-Value Stores with Fast In-Network Caching" which leverages register arrays in the switches.

There is a problem of possible out-of-order delivery between consecutive switches in the chain, which can lead to inconsistency. The presentation says they solve this by serializing with sequence numbers and dropping out-of-order packets. The onus is on the client to retry when its request doesn't get a reply in time.
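
A toy sketch of how I understand that mechanism (names and structure are mine, not NetChain's data plane): the head assigns sequence numbers, each node in the chain applies a write only if it is the next expected one and drops anything out of order, and the client retries on timeout.

```python
# Toy sketch of in-order chain forwarding (my simplification, not NetChain's
# data plane): apply a write only if its sequence number is the next expected
# one; silently drop out-of-order packets and rely on client retries.

class ChainNode:
    def __init__(self, name):
        self.name, self.store, self.expected_seq = name, {}, 1

    def handle_write(self, seq, key, value):
        if seq != self.expected_seq:
            return False               # drop out-of-order packet
        self.store[key] = value
        self.expected_seq += 1
        return True                    # forward to the next node in the chain

chain = [ChainNode(n) for n in ("head", "middle", "tail")]

def chain_write(seq, key, value):
    for node in chain:
        if not node.handle_write(seq, key, value):
            return False               # client will time out and retry
    return True                        # tail replies to the client

print(chain_write(1, "x", "A"))        # True
print(chain_write(3, "y", "B"))        # False: seq 2 was never seen
```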

Of course, the complicated part is handling a switch failure. The chain replication recovery technique is adapted, so that the master configurator (which is Paxos-maintained) can reconfigure the chain by removing the crashed node. But then, to restore fault tolerance, a new node needs to be added later. The paper says the master first copies the state to the newly added node and then adds it to the chain. Of course, there is a catching-up problem for the newly added switch if the previous node keeps receiving and inserting new items. There needs to be some blocking to coordinate this, probably via two-phase commit. I quickly scanned the paper and didn't see this discussed.

The paper has a TLA+ specification in the extended arxiv version. I checked the TLA+ model, and it assumes copying state to the new switch is done atomically, which seems to be an over-simplification of what needs to be implemented in reality.

The work is evaluated over 4 Barefoot Tofino switches and 4 commodity servers. I think the presenter said that up to 100K key-value pairs could be stored in the 8Mb of storage available on the switches. Compared to ZooKeeper, the solution provides three orders of magnitude improvement in throughput, one order of magnitude improvement in read latency, and two orders of magnitude improvement in write latency.

The presenter concluded by asking what kinds of new applications could be enabled by a faster coordination service in the datacenter. I am still unclear on what sub-RTT means in the title, but I will read the paper and write a longer review later.

Second session

The second session was on distributed systems, my favorite topic. I have brief summaries of the three papers presented in this session.

zkLedger: Privacy-Preserving Auditing for Distributed Ledgers

This paper is by Neha Narula, MIT Media Lab; Willy Vasquez, University of Texas at Austin; Madars Virza, MIT Media Lab.

Neha presented this paper, and pulled it off without using the word blockchain even once.

Verification provided by distributed ledgers should not mean everything is in the open. Companies require some privacy to keep certain business strategies and positions secret. On the other hand, too much secrecy is also a bad thing; there needs to be some auditability to prevent bad actors.

zkLedger provides practical privacy and complete auditing.

A big challenge here is that a bad-actor bank could omit transactions. zkLedger jointly addresses the privacy and auditability requirements by keeping an entry for every bank in every transaction. To hide values, zkLedger uses Pedersen commitments. The key insight is that the auditor audits every transaction. zkLedger also uses an interactive map/reduce paradigm over the ledger with non-interactive zero-knowledge proofs (NIZKs) to compute measurements that go beyond sums.

The abstract of the paper provides a good summary, so here it is:
Banks create digital asset transactions that are visible only to the organizations party to the transaction, but are publicly verifiable. An auditor sends queries to banks, for example "What is the outstanding amount of a certain digital asset on your balance sheet?" and gets a response and cryptographic assurance that the response is correct. zkLedger has two important benefits over previous work. First, zkLedger provides fast, rich auditing with a new proof scheme using Schnorr-type non-interactive zero-knowledge proofs. Unlike zk-SNARKs, our techniques do not require trusted setup and only rely on widely-used cryptographic assumptions. Second, zkLedger provides completeness; it uses a columnar ledger construction so that banks cannot hide transactions from the auditor, and participants can use rolling caches to produce and verify answers quickly. We implement a distributed version of zkLedger that can produce provably correct answers to auditor queries on a ledger with a hundred thousand transactions in less than 10 milliseconds.
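
For intuition on the Pedersen commitments mentioned above: a commitment hides a value v as C = g^v · h^r, where r is a random blinding factor, and commitments are additively homomorphic, so the product of commitments opens to the sum of the committed values. That homomorphism is what lets an auditor check totals without seeing individual amounts. A toy sketch over a prime modulus (illustration only; real systems use elliptic curves and carefully chosen parameters):

```python
import random

# Toy Pedersen commitments modulo a large prime (illustration only; real
# systems such as zkLedger use elliptic-curve groups and vetted parameters).
p = 2**127 - 1        # a Mersenne prime, fine for a toy example
g, h = 5, 7           # stand-in generators (toy choice, not secure)

def commit(value, blinding):
    """C = g^value * h^blinding mod p."""
    return (pow(g, value, p) * pow(h, blinding, p)) % p

r1, r2 = random.randrange(1, p), random.randrange(1, p)
c1, c2 = commit(10, r1), commit(32, r2)

# Additive homomorphism: the product of two commitments is a commitment to
# the sum of the values under the sum of the blinding factors.
assert (c1 * c2) % p == commit(10 + 32, r1 + r2)
```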

Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization

This paper is by Yilong Geng, Shiyu Liu, and Zi Yin, Stanford University; Ashish Naik, Google Inc.; Balaji Prabhakar and Mendel Rosenblum, Stanford University; Amin Vahdat, Google Inc.

This work aims to provide accurate timestamping as a primitive (with 10 nanosecond accuracy) in datacenters at scale. Tightly-synchronized clocks have applications in distributed systems, particularly in distributed databases such as Spanner and CockroachDB.

The system they develop, Huygens, takes NIC timestamps (which most current-generation NICs support) and doesn't require specialized switches like PTP does. Other than the NIC timestamping support, Huygens is software-based.

Huygens leverages three key ideas. First, coded probes are used to identify and reject impure probe data that suffered queueing delays. To detect this, Huygens sends two consecutive probe packets with a known gap of 10 microseconds (NIC timestamped) and checks the gap between them on the receiving end. Only the probes that arrive with the original 10-microsecond gap are accepted as pure, since they most likely experienced zero queueing delay. Since the queues change fast, it is very unlikely that both consecutive packets were subject to the same nonzero queueing delay.
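
A rough sketch of that filtering step (my own simplification, with a made-up tolerance): keep only the probe pairs whose receive-side gap matches the known transmit gap.

```python
# Rough sketch of coded-probe filtering (my simplification; the tolerance is
# made up): keep only probe pairs whose receive-side gap matches the known
# 10-microsecond transmit gap, i.e. pairs that likely saw no queueing.

TX_GAP_US = 10.0        # known gap between the two probes at the sender
TOLERANCE_US = 0.1      # hypothetical acceptance tolerance

def pure_pairs(rx_pairs):
    """rx_pairs: list of (rx_time_first_us, rx_time_second_us)."""
    return [(t1, t2) for (t1, t2) in rx_pairs
            if abs((t2 - t1) - TX_GAP_US) <= TOLERANCE_US]

pairs = [(100.0, 110.0), (200.0, 213.5), (300.0, 310.05)]
print(pure_pairs(pairs))    # keeps the first and third pairs
```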

Second, Huygens processes the purified data with support vector machines, a widely-used and powerful classifier, to accurately estimate one-way propagation times. (Huygens assumes the delay between two servers is symmetric.)

Finally, to detect and correct synchronization errors even further, Huygens exploits the network effect that a group of pair-wise synchronized clocks must be transitively synchronized.

One of the questions asked was about the high probing rate employed by this work. Another question was whether this could be extended to the WAN. The presenter mentioned 10 microsecond accuracy in a WAN experiment, but I wonder if that is due to Google datacenters having private, high-speed links.

One very interesting question was whether this could be used for measuring the temperature in datacenters. This is a great question, not a crazy one. High temperature means the local clock starts to run slower, and there is a predictable linear relationship. So if you have tightly synchronized clocks, you can measure the drift from ideal and infer temperature increases/decreases.

I will read and summarize this paper later, since I am interested in mechanisms and applications of clock synchronization in distributed systems. After our work on hybrid logical clocks (HLC), we had built a distributed monitoring system, Retroscope, using HLC.

SnailTrail: Generalizing Critical Paths for Online Analysis of Distributed Dataflows

This paper is by Moritz Hoffmann, Andrea Lattuada, John Liagouris, Vasiliki Kalavri, Desislava Dimitrova, Sebastian Wicki, Zaheer Chothia, and Timothy Roscoe, ETH Zurich.

I had provided a summary of this work earlier in my blog. Please see that.

Third session

The third session was on traffic management. I am not really a networking guy, but I listened to these talks to get a better understanding of what is going on in this domain.

The papers presented were:

  • "Balancing on the Edge: Transport Affinity without Network State", 
  • "Stateless Datacenter Load-balancing with Beamer", 
  • "Larry: Practical Network Reconfigurability in the Data Center", and
  • "Semi-Oblivious Traffic Engineering: The Road Not Taken"

Fourth session 

The fourth session was on network function virtualization and hardware.

The papers presented were:

  • Metron: NFV Service Chains at the True Speed of the Underlying Hardware
  • G-NET: Effective GPU Sharing in NFV Systems
  • SafeBricks: Shielding Network Functions in the Cloud
All of these papers are available on the NSDI'18 webpage.

MAD questions

Really, you read this far into this long post, and still expect me to write some MAD questions? Ok, here is just one, so that my promise is not broken.

The afternoon, and especially the late afternoon, isn't really great for listening to conference talks. Following talks takes a lot of mental effort, because the deliveries are very fast. Moreover, the context changes drastically from talk to talk, which also depletes attention.

I wonder if the late afternoons would be better spent on panels, or on livelier and less concentration-demanding activities.

I have heard of a book on these topics, "When: The Scientific Secrets of Perfect Timing" by Daniel Pink. I plan to check it out.

Tuesday, April 3, 2018

The Stellar Consensus Protocol: A Federated Model for Internet-level Consensus

Last week in our seminar we discussed the Stellar consensus paper.

The paper is long, 32 pages. It looks like the paper is written the way the protocol is conceived and reasoned about. First comes a section on the federated Byzantine agreement (FBA) model, which talks about quorum slices and the quorums that result from them. The next section talks about the requirements for quorum intersections, and defines the dispensable sets with associated theorems. Then comes the federated voting section, with subsections titled: Voting with open membership, Blocking sets, Accepting statements, Accepting is not enough, Statement confirmation, and Liveness and neutralization. On page 19, the Stellar Consensus Protocol (SCP) section starts, with more definitions and proofs intertwined with the protocol description. At this point, the reader is already overwhelmed, trying to maintain in his mind theories about how the previous 19 pages might connect back to this protocol, and is confronted with the task of reading through a 9-page-long SCP protocol section.

The paper would be significantly improved if it was rewritten top-down: not in a way the protocol is conceived and proved, but in a reader-friendly manner prioritizing the clear communication of the protocol basics.

It was hard reading through this paper, and I didn't read it thoroughly. Here is what I understand:
Stellar Consensus Protocol is PBFT which uses quorums derived from quorum slices of federated participants, instead of traditional quorums from a closed set of participants.
It would be nice if SCP provided a mechanism that prevents participants from selecting bad quorum slices that lead to bad quorums.

Background and context

Traditional Byzantine Agreement protocols have closed membership: the number of participating nodes and their identities (via public/private keys or via non-transferable symmetric-key MACs) are fixed.

SCP is a Federation-based Byzantine Agreement (FBA) protocol which allows open membership instead of requiring closed membership. In a federated open membership model, we don't know the number of nodes in the entire system or all of their identities. Nodes may join and leave.

One way to deal with open membership is proof-of-work based blockchains, as in Bitcoin. That approach has problems: excessive energy consumption, poor throughput scalability, and, due to the probabilistic nature of commit, long wait times before a transaction can be considered irreversible with good confidence.

SCP does not use proof-of-work blockchains. It adapts the PBFT protocol to work in an open-membership, federated environment. PBFT is a 3-phase deterministic Byzantine consensus protocol. It has similarities with Paxos: in fact, if you extend Paxos, which only tolerates crash faults, to tolerate Byzantine faults, you get pretty much the PBFT protocol.


The federated model

The federated model means that a node can have a PBFT/consensus agreement with a set of nodes it specifies, without involving all the nodes in the network in this agreement.

To this end, each node specifies a quorum slice in its config file. The quorum slice consists of the nodes it trusts, and should hopefully be a diverse, well-balanced portfolio. By declaring its quorum slice, a node says that it finds the consortium of these nodes (not necessarily each one individually) trustworthy, and will rely on this consortium to convince itself of agreement and to bless/endorse its transactions. Traditional non-federated Byzantine agreement requires all nodes to accept the same slices; in FBA the key innovation is enabling each node to choose its own set of quorum slices.

These quorum slices are used for constructing quorums. A quorum is a set of nodes sufficient to reach agreement; in FBA, a non-empty set of nodes is a quorum if it contains at least one slice of each of its members.
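
That definition is easy to turn into a check. A small sketch (the node and slice names are made up):

```python
# Check the FBA quorum condition: a non-empty set U is a quorum iff every
# member of U has at least one of its quorum slices entirely contained in U.

def is_quorum(candidate, slices):
    """candidate: set of node names; slices: dict node -> list of slices (sets)."""
    if not candidate:
        return False
    return all(any(s <= candidate for s in slices[node]) for node in candidate)

slices = {
    "a": [{"a", "b"}],
    "b": [{"b", "c"}],
    "c": [{"b", "c"}],
}
print(is_quorum({"b", "c"}, slices))   # True: b and c each have a slice inside
print(is_quorum({"a", "b"}, slices))   # False: b's only slice also needs c
```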

For safety, any two quorums in the network need to intersect, and the intersection should contain non-Byzantine nodes. If the quorum intersection consists entirely of Byzantine nodes, then SCP cannot guarantee safety.

The onus is on the users to provide good quorum slices. SCP does not provide a way to check the soundness/integrity of the quorum slices that give rise to quorums. Again, if the quorum intersection consists entirely of Byzantine nodes, safety is violated, and SCP doesn't accept responsibility for that. To quote the paper: "SCP can only guarantee safety when nodes choose adequate quorum slices."

The SCP protocol starts with a nomination phase which, if run long enough, eventually produces the same set of candidate values at every intact node; this means nodes can combine the candidate values in a deterministic way to produce a single composite value for the slot. Upon predicted/approximate convergence of the nomination phase, the nodes start the ballot phase to perform federated voting (PBFT-style) to commit and abort ballots associated with composite values.

When intact nodes agree to commit a ballot, the value associated with the ballot will be externalized for the slot in question. When they agree to abort a ballot, the ballot's value becomes irrelevant. If a ballot gets stuck in a state where one or more intact nodes cannot commit or abort it, then nodes try again with a higher ballot; they associate the new ballot with the same value as the stuck one in case any node believes the stuck ballot was committed.

Safety results from ensuring that all stuck and committed ballots are associated with the same value. Liveness follows from the fact that a stuck ballot can be neutralized by moving to a higher ballot. The good news about SCP is that provided that the quorum condition is satisfied, any committed decision is instantly irreversible.

Here are some videos on SCP (well mostly the motivation and setup of Federated Byzantine Agreement without the SCP protocol description): https://www.youtube.com/watch?v=mB9UW7HK8pc and https://www.youtube.com/watch?v=zTI1HAWDHIg.

MAD questions

1. What are the scalability limits of SCP? 
That the quorums need to intersect is a limitation. If quorum selection is not done carefully, you may need majority quorums for the system, and a PBFT-based protocol would suffer beyond roughly 20 nodes in a quorum and 40 nodes in the system. But there are better ways to select your quorum slices: instead of a flat system, if you use a hierarchical system, with tier-1, tier-2, and tier-3 nodes, and choose your quorums through this hierarchy, you can satisfy the quorum property with about log(N) nodes instead of N/2 nodes. Hierarchies actually work pretty well for scaling. Think of a tree with 10 children per node: at level 4 there will be 10,000 nodes, and at level 5, 100,000 nodes.


2. There has been a lot of work on quorum systems and probabilistic quorum systems. Can those be employed to help with the scalability problem in SCP? 

Maybe even some graph algorithms can be relevant, like the "Sparse Partitions" work by Baruch Awerbuch and David Peleg. 

3. Is it possible to come up with a service for SCP that provides checks and prevents nodes from selecting bad quorum slices that lead to bad quorums? 
But why would you trust that service? That service itself would need to be built in a trustless way.

4. How can we implement sharding per services in SCP?
In the SCP model described in the paper, all transactions are related and potentially interfering/dependent on each other, since no sharding per service is considered. How can we implement sharding support for SCP that provides parallelism within services, but also allows occasional cross-service transactions and prevents double spending? Would it be possible to build something similar to the Aspen model for SCP?

Saturday, March 31, 2018

Book review: Last lecture by Randy Pausch

Of course you have watched the late Prof. Randy Pausch's last lecture. If for some reason you haven't, please right that wrong now.

I recently read the book "Last Lecture" by Randy Pausch. I got the book from my local library, yes, a physical library building. They still exist, and they are a great resource.

It is a short book, and I have a very brief review.

The book does not repeat the "last lecture" presentation. It tells a mostly disjoint and complementary story, and it gets into more personal territory. I learned several more interesting anecdotes about Randy from this book.

It is of course a very heartbreaking story. The book is a "memento mori".

It makes you think about "What would you want to be known for?" and "How would you prioritize your life if you had 5 years left?"

I am still thinking about those long after I returned the book to the library.

I am currently reading "Crypto: How the Code Rebels Beat the Government Saving Privacy" by Steven Levy and "Tribe of Mentors" by Tim Ferriss. Both courtesy of my local library, and both very interesting. I will write short summaries about them once I am done.

Tuesday, March 27, 2018

Master your tools

A couple of months ago, a friend sent me this article about "Speed learner Max Deutsch challenging chess grandmaster Magnus Carlsen". My friend called the story a life-hack story, and remarked that this is the "Silicon Valley bro" frame of mind in action: "Nobody deserves any respect because I am so confident and valley-savvy I could naturally write an app and do it better".

This was my reply on the article.
This is a very nice illustration of a hack versus an expert: the expert has years of battle scars and has internalized everything. Hacks/shortcuts will only take you to where the expert's domain starts.
Another takeaway is that we humans are dumb. We don't have much to be proud of regarding our mental prowess. I make so many stupid mistakes every week; it is frustrating. It is a shame we can't get much smarter on the spot. But in offline mode (doing research/thinking by writing), I think we can do better. We can only make up for the shortcomings of our brains with discipline (writing or symbolic/mathematical computing).

This story is a perfect illustration of the need for tooling and the need for mastering your tools.

There are many levels of learning/mastering

This is what I wrote in 2012 about the many levels of learning. It has a nice backstory as well, about how I almost flunked differential mathematics as a freshman.
There are many levels of knowing. The first level is "knowing by word". With passive learning (like I did for studying for the diff final), you get the most limited form of understanding/knowing. It is ephemeral, and you are not able to reconstruct that on your own. The second level is "knowing by first hand experience". At this level, you are able to understand/witness and if necessary rediscover the knowledge on your own. Finally, the third level is "knowing by heart" or grokking it. At this level, you internalized the knowledge so that it becomes part of your nature.
The best way of learning is by doing. You need to work hands-on, make a lot of mistakes so you can start to internalize that knowledge. That means you should make, and produce things continuously. For this to work in a sustainable manner, you need to change your default mode from the consumer mode to the producer mode.

When it comes to learning/mastering a tool, hacks/shortcuts do not cut it. Just having a passing familiarity with a tool doesn't empower you. You need to master and internalize your tools.

No pain, no gain. If you don't bend yourself, you have not mastered the tool. You are shortcutting, and you won't get the full benefits of the tool and level up.

You truly master a tool only when you master yourself and bend yourself to that tool's mode of thinking. Mastering the tool, and yourself through the tool, takes many weeks, and sometimes years. When you pay your dues and go deep, only then do you transcend and level up.

Learning/mastering is a gradient descent process

You (well, at least I) progress through the levels of learning gradually, almost via a gradient-descent-like process. Let me give an example from Paxos.

The first time I taught Paxos in my distributed systems class around 10 years ago, I don't think I understood it well. I was a hack. I felt very unsatisfied about how I couldn't explain it better. Through these regrets, I started getting better at it. This echoes what I wrote earlier about "what I learn from teaching".
In that down hour after teaching, what do I do? I have regret-flashbacks about some parts of the lecture I delivered. I have remorse about how I botched some parts of the lecture and how I couldn't answer a student's question better. The reason I botch up a part of the lecture is often because I haven't internalized that part yet. The reason I wasn't able to make a smooth transition from one part to another part is often because I didn't have a better understanding/insight about how those parts fit. It is likely I was missing some context about it. Or maybe more likely that I knew the context, but I didn't internalize it well enough. There are many different levels of knowing, and the more reliable/resilient/dependable level of knowing is knowing by experience. Teaching is a way to have hands-on training in your research area. You learn through your regrets about how you couldn't explain it better. 

So the way I got better at internalizing and explaining Paxos was by making mistakes and feeling like an idiot. Once in a while I would be saying "Oh, wait, this doesn't make sense in Paxos", or "Wait, how does Paxos deal with this?", and then work my way out of these problems. I would feel bad and say to myself "How come I didn't notice this after this many years?", but I now got used to this feeling. I know this is a natural part of the learning/mastering the tool. This is a good kind of screw up, because it makes me learn. When I screw up and learn from it, I am leveling up.

Tools grow your brain

A good tool is a force multiplier. So it is definitely worth the time to master a good tool, because you will get a lot of mileage out of that tool as multiplicative factor.

Young Steve Jobs was pitching computers as bicycles for the mind.
I think one of the things that really separates us from the high primates is that we’re tool builders. I read a study that measured the efficiency of locomotion for various species on the planet. The condor used the least energy to move a kilometer. And, humans came in with a rather unimpressive showing, about a third of the way down the list. It was not too proud a showing for the crown of creation. So, that didn’t look so good. But, then somebody at Scientific American had the insight to test the efficiency of locomotion for a man on a bicycle. And, a man on a bicycle, a human on a bicycle, blew the condor away, completely off the top of the charts.
And that’s what a computer is to me. What a computer is to me is it’s the most remarkable tool that we’ve ever come up with, and it’s the equivalent of a bicycle for our minds.

To make a transformative contribution, sometimes you will need to invent your own tools as I wrote earlier in "How to go 10X":
Adopt/Invent better tools "Give me six hours to chop down a tree and I will spend the first four sharpening the axe." -- Abraham Lincoln
I mean, not just better but transformative tools of course. Most often, you may need to invent that transformative tool yourself. When you are faced with an inherent worthy problem, don't just go for a point solution, generalize your solution, and ultimately make it in to a tool to benefit for the future. Generalizing and constructing the tool/system pays that technical debt and gets you to have truly 10X benefit. MapReduce as a tool comes to my mind as a good example for this. By generalizing on the kind of big data processing tasks/programs written at Google, Jeff Dean was able to see an underlying pattern, write a good tool to solve the problem once and for all.
Scientists spend decades to invent transformative tools (Hadron Collider, Hubble telescope) and then they get breakthrough results. As researchers in computer science, we should try to adopt/cultivate this mentality more.
The good news is mastering a tool makes it easier for you to pick up another one. As you become expert in one tool/domain, you can map other tools/domain to that and contextualize and master them much faster.

Acquiring new tools is the surest way to get smarter.

After you master a tool

After you master a tool, you internalize it and go minimalist, and reduce it to first principles. You not only internalize how to use it, you also internalize its limitations and learn when not to use it. You learn from your tools and start converging to more minimalistic tools and first principles.

As to methods, there may be a million and then some, but principles are few. The man who grasps principles can successfully select his own methods. The man who tries methods, ignoring principles, is sure to have trouble.
-- Harrington Emerson

MAD questions

1. What qualifies as a tool? 
Does critical reading qualify? How about writing well?

2. I want to learn from you, what are the tools that benefited you the most?

3. What does your mastering process look like?

4. What are some good tools for computer science students?
Here is John Regehr's answer, but this is just one answer and it is given with a pretty literal interpretation of "tool".

5. Refining from the tools/methods to principles, what are some good principles for "computer science" thinking?


Related links

I really liked "You are your tools" post by Daniel Lemire.

Thought-processes.

Intuition pumps. 

How to go 10X. 

Good engineers use good tools. 

Teach Yourself Programming in Ten Years.

Monday, March 26, 2018

Replying to why decentralization matters

Last month Chris Dixon published "Why decentralization matters".

I felt it was one-sided and lacked solid arguments. I picked up some populist vibes as well.
"There go my people. I must find out where they are going, so I can lead them." --Alexandre Auguste Ledru-Rollin

At the time, I had jotted down my criticisms of some of the paragraphs in that article in my drafts folder. Then I saw Todd Hoff's (of High Scalability fame) brief write-up about the article. It captures my reactions to the article nicely. So I will start with Todd's response and cover only the remaining parts of my criticisms in the rest of this post.
"Nerds always dream of decentralization, I certainly do, but every real world force aligns on the side of centralization. We still have NAT with IPv6! Ironically, the reason given why decentralization will win is exactly why it won't: "Decentralized networks can win the third era of the internet for the same reason they won the first era: by winning the hearts and minds of entrepreneurs and developers." Decentralization is harder for developers, it's harder to build great software on top of, it's harder to monetize, it's harder to monitor to control, it's harder to onboard users, everything becomes harder. The advantages are technical, ideological, and mostly irrelevant to users. Wikipedia is crowd-sourced, it's in no way decentralized, it's locked down with process like Fort Knox. Decentralization is great for dumb pipes, that is the original internet, but not much else. Cryptonetworkwashing can't change architectures that have long proven successful in the market."

The myth of the decentralized Internet

Chris says: "During the first era of the internet, internet services were built on open protocols that were controlled by the internet community."

Open protocols do not imply a decentralized system.

Also the myth of decentralized Internet (or early Internet) needs to die already. "The Internet is better viewed as being distributed, both in terms of technologies and governance arrangements. The shift in perspective, from decentralised to distributed, is essential to understand the past and present internet, and to imagine possible future internets which preserve and support the public good."

The excellent book "Where Wizards Stay Up Late: The Origins Of The Internet" tells the story of how ARPA and the involved universities built the origins of Internet and captures really well the thinking/perspectives at that time.

Render unto GAFA

Chris says:
> During the second era of the internet for-profit tech companies, Google, Apple, Facebook, and Amazon (GAFA), built software and services that rapidly outpaced the capabilities of open protocols.
Yes, this is because [logically] centralized coordination is effective and efficient.

> For 3rd parties, this transition from cooperation to competition feels like a bait-and-switch. Over time, the best entrepreneurs, developers, and investors have become wary of building on top of centralized platforms. We now have decades of evidence that doing so will end in disappointment. In addition, users give up privacy, control of their data, and become vulnerable to security breaches. These problems with centralized platforms will likely become even more pronounced in the future.
The first part is a fair point (below I talk a bit about what a better model could look like). But I am not sure I agree with the second part. Centralized services provide easy-to-use platforms for third parties to reach the millions of people that use the service. The iOS app store, Android app store, AWS, Facebook, and Google ad platforms are examples. These platforms give third parties a great resource to bootstrap and scale to millions of users, and therefore can afford to leverage that position and be whimsical and demanding. This comes with free-market, cut-throat capitalism. I am very disappointed (but not surprised) with how this is abused, and I hope we can converge to a better model that prevents the abuses. Hopefully some of the ideas from decentralized systems can be adopted to help converge to that model.

That said, I don't think decentralized systems would be immune to abuses and tyranny; only the type of power wielders would change.

> Decentralized systems start out half-baked but, under the right conditions, grow exponentially as they attract new contributors.
It is still unclear to me how a decentralized network can bootstrap easily. Decentralization does not automatically resolve the bootstrapping and traction problem, right? Why would I join a decentralized system in the beginning? I don't know how many Byzantine nodes are out there, and at the beginning of the network it is easy to fall into a Byzantine gang trap. I think the cryptocurrencies play to people's greed and FOMO while there is hype in this field. Initially it is cheap to join, and there is a chance to cash in by being an early adopter and making money off the latecomers. But that works only if this particular thing/platform becomes popular, and that is not a sustainable model. For the dozens of cryptocurrencies that became popular, there are thousands that tanked and are forgotten. I think we have a case of survivorship bias in evaluating the blockchain bootstrapping process.

Appeal of the emotional appeal

Chris says: "Decentralized networks can win the third era of the internet for the same reason they won the first era: by winning the hearts and minds of entrepreneurs and developers."

If you make an emotional appeal to substitute for a logical argument, you are not doing well. On a side note, I recently watched a speaker at an important venue make the following appeal for decentralization: "The cloud took away our personal computers and it is trying to move us back to the mainframe age. With blockchain and peer-to-peer computing we will take back control of our computers." Oh, please, cry me a river.

As for the response to Chris, Todd's assessment is right about the developers' perspective: "Decentralization is harder for developers, it's harder to build great software on top of, it's harder to monetize, it's harder to monitor to control, it's harder to onboard users, everything becomes harder. The advantages are technical, ideological, and mostly irrelevant to users."

Since the [logically] centralized cloud model has been around for a long time, it has made a lot of improvements in its developer support. Thanks to the logical centralization, the model is also easier for developers to build on, and it provides high availability and virtually unlimited scalability as well. But for decentralized systems, the current support is bad, and there is no indication it can improve quickly.

I think what attracts developers to fully-decentralized platforms currently is not that their minds and hearts are won, but the novelty, curiosity, and excitement factors. There is an opportunity to win big in this domain if you are one of the first movers. But due to the volatility of the domain, the lack of a killer application as of yet, and uncertainty about what could happen when regulations hit, or the bubble bursts, it is still a risky move.

> Cryptonetworks combine the best features of the first two internet eras: community-governed, decentralized networks with capabilities that will eventually exceed those of the most advanced centralized services.
This is another big unsubstantiated claim. Chris states this as a matter of fact and as inevitable, without providing any justification for it. I am not convinced that a fully decentralized architecture can be made to exceed the capabilities of [logically] centralized cloud services. Instead, I think the centralized solutions can quickly copy the best parts of the technology and outpace decentralized solutions. PayPal can just implement smart contracts, as the good idea there, and still do so as a centralized solution, avoiding all the problems of decentralization. Maybe they can throw in attestation and verifiable-claims exchange, which they can more easily implement on their logically-centralized platform. LinkedIn may implement attestation and identity services. And identity services will come to social network platforms inevitably/eventually, given the current rate of manipulation and propaganda going on. Dropbox can easily add content-based addressing and leave IPFS in obscurity. Kickstarter can easily implement the token/ICO idea as a better way to support projects with stakeholders.

> Today’s cryptonetworks suffer from limitations that keep them from seriously challenging centralized incumbents. The most severe limitations are around performance and scalability. The next few years will be about fixing these limitations and building networks that form the infrastructure layer of the crypto stack. After that, most of the energy will turn to building applications on top of that infrastructure.
As for the efficiency limitations of decentralized systems, it may not be possible to fix them: decentralization is at an inherent disadvantage from a coordination perspective.

What is a better model for preventing abuses with centralized services?

The hand of the free market helps somewhat to keep centralized services straight: when one big corporation starts playing foul and upsets its users, new companies and startups quickly move in to disrupt the space and fill the void. But the hand of the free market moves slowly.

This is not exactly my domain of expertise, but here are some things that may help prevent abuses from centralized services.

1. The services can provide some open core protocols/contracts/APIs that won't change. This would give these services an edge in adoption by developers compared to others. Maybe they could even make these enforceable by law, with some penalty. Cloud providers offer SLAs; maybe they should offer these contracts to developers as well.

2. Users and developers can be made stakeholders via initial token offerings. Why can't a centralized service offer tokens? As long as there is a real cost to earning a token (the user or developer makes some contribution to the growth of the platform) and there is eventually real value associated with a token, this incentive economy can also work for centralized services. Badges are a form of token, and they worked OK for a long time for some services, FourSquare for example. Why not take this to the next level with cryptotokens, tokens that can be exchanged/sold on and off the platform?

3. The services may be built closer to the Unix philosophy of specialized components with well-defined inputs/outputs. This can enable third-party services to integrate and transform, and the resulting collaboration and cooperation may bring win-win business to each service involved. Why do the services need to be closed walled gardens?

4. Finally, better regulations from law-makers can help, such as the EU's laws about customer privacy and the right to be forgotten.

I realize number 3 is a bit far-fetched. But if we could have it, we could use our contributions and our data to have some leverage over the services. I had thought about this problem for the smartphone crowdsourcing model back in 2013, but did not follow up because it is a very ambitious undertaking.

Other MAD questions

1. What if the decentralized approach wins a decisive victory quickly? What would that look like?
I find this unlikely, but let's say this black swan event happens: somehow the decentralized approach clicks really well with developers and users alike and takes off, and we see more and more blockchain-based services. What does that world look like? Does that mean there are more stakeholders in hit services? I am all for spreading the wealth. If blockchain wins, is it automatically guaranteed that we get more stakeholders and more accountable services? Or would we still see a relatively small number of big-money stakeholders dominating the ecosystem? How do you convince a skeptic this would not be the case?


2.  What are some very good applications for PoW blockchains? (I had the same question for IPFS recently.) Chris doesn't give any killer application, but insists that the applications will come. Even if we are still missing the killer application, at this point we should be able to name some reasonably good-fit applications from the general territory. Here is a checklist to eliminate some answers in advance.
  • It is a store of value against the devaluing currencies of developing countries. (They have been using gold, USD, and Euro for that for many decades, thank you.)
  • It is for international money transfer. (Does it beat Paypal, Western Union, banks, etc., significantly in the features that consumers value the most?)
  • But blockchains have this feature and that feature. (What is the killer application? Let's talk applications not features.)
And while we are at it, I repeat Todd's plea. Let's not call Wikipedia a decentralized system ever again. Crowdsourced does not imply decentralized, and certainly not the permissionless fully-hyperbolically-decentralized systems suggested in blockchain PoW approach.

Related links

The comments under Chris's tweet include a lot of interesting counterpoints and very valid concerns.

The comments under Chris's Medium blog post also present several counterpoints.

Blockchain applications in Supply Chain Management

Paper review. IPFS: Content addressed, versioned, P2P file system

Paper review. Blockchains from a distributed computing perspective

Saturday, March 24, 2018

Blockchain applications in Supply Chain Management

Here is the whitepaper that started this post, which is admittedly outside my domain of expertise. Also here is some discussion we had on this on Twitter. 

The white paper includes quotes like "7 in 10 consumer industry executives expect to have a blockchain production network in 3 years" and "Blockchains are expected to greatly reduce nine frictions. Including: inaccessible marketplaces, restrictive regulations, institutional inertia, invisible threats, imperfect information, inaccessible information."

Man, if that last statement is true, companies should just employ blockchain and fire their CEOs and management teams, as they are no longer needed. Blockchain is a miracle worker indeed. It enters inaccessible marketplaces, loosens restrictive regulations, eliminates institutional inertia, makes visible the invisible threats (watch out, ninjas), corrects/curates imperfect information, and makes all information accessible.

I know that I tend to go hypercritical when I am analyzing some new idea/application. It is a vocational disease. On another day, I may try to write a more positive article about blockchain applications in supply chains. But, since there are already many overly optimistic articles on the blockchain applications in supply chains, this is the article you get from me today.

Do you need blockchains for supply-chain management?

There is this nice paper, titled "Do you need a blockchain?", that explains the application contexts in which blockchains would be useful. This figure is really nice. I will keep it handy here and refer blockchain fanatics to it.


And this table provides a nice comparison of your options as well.


As a case study the paper looks at supply-chain management (SCM). In SCM, the flow of materials and services required in manufacturing a given product is managed, including various intermediate storage and production cycles across multiple companies. Consider how complex the SCM for Walmart is. And, while many people don't realize this, military SCM problems rival or even exceed those of giant corporations such as Walmart. (The military is the biggest consumer of RFIDs behind Walmart.) If the military screws up its SCM logistics, that may mean life and death: due to misallocation of resources, some troops in deployment may suffer shortages while others have surplus food/items that may need to be trashed. (Don't search for military applications of blockchain. I searched for it and regretted reading superficial first-mover articles hyping blockchain use for military supply chain management.)

Of course an immediate issue facing the use of blockchains for SCM is that too much transparency could be a bad thing for business strategy (even more dangerous for military strategy).

Here is another great takeaway from the "Do you need a blockchain?" paper. Harkening back to Figure 1, what is the nature of trust in the writer of a blockchain entry? How is integrity enforced at entry time (for recording the type, quantity, and condition of the goods being shipped)? If we trust the writer, what is the benefit of the blockchain system? We could just replicate the records across some append-only stores (that is how accounting does ledgers anyway).
"This reasoning leaves us with the question whether all writers can be trusted. SCM has the inherent problem of the interface between the digital and the physical world. A human, or some machine under the control of a single writer, typically is required to register that a certain good has arrived in a warehouse, and if for example its quality is appropriate. If there is no trust in the operation of these employees, then the whole supply chain is technically compromised as any data can be supplied by a malicious writer. If, on the other hand, all writers are trusted, a blockchain is not needed as a regular database with shared write access can be used instead. Note that if through some technical means, the connection between the digital and physical world could be realized in a secure manner, then the previous reasoning might change."
If you would like to avoid the too-much-transparency problem and have some trust in the writers (or can exert leverage on them via no-payment and no-future-business threats), then instead of a permissionless/public blockchain, going with a private blockchain could be a better option. But then why not just use a database, or Kafka streaming into a database? Hyperledger in its current form is pretty close to that actually. If you have trust issues with the writers, run PBFT on top of that.
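To make the append-only alternative concrete, here is a minimal sketch (my own toy code with made-up field names, not any particular product's API) of a hash-chained ledger with trusted writers. Replicate it across a few independently operated stores and you get tamper-evidence without any mining:

```python
import hashlib
import json
import time

class AppendOnlyLedger:
    """Toy hash-chained log: each entry carries the hash of the previous one,
    so any tampering with history is detectable by replaying the chain."""

    def __init__(self):
        self.entries = []

    def append(self, writer_id, payload):
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        entry = {
            "writer": writer_id,        # trusted writer, e.g. a warehouse scanner
            "payload": payload,         # type, quantity, condition of the goods
            "timestamp": time.time(),
            "prev_hash": prev_hash,
        }
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self.entries.append(entry)
        return entry["hash"]

    def verify(self):
        prev = "0" * 64
        for e in self.entries:
            body = dict(e)
            claimed = body.pop("hash")
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev_hash"] != prev or claimed != recomputed:
                return False
            prev = claimed
        return True

ledger = AppendOnlyLedger()
ledger.append("warehouse-7", {"item": "pallet-42", "qty": 120, "condition": "ok"})
assert ledger.verify()
```

The point is that once the writers are trusted, a replicated hash chain like this already gives auditability; the extra consensus machinery of a blockchain does not buy much on top.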

MAD questions

1. What are some SCM use-cases where (distributed) databases/datastores would not work, and we positively need blockchains?

2. How much data are SCM applications going to generate?
In my previous post, I wrote up my notes from Dr. Laura Haas's talk on data science challenges in food supply chains. The talk mentioned that from one food testing sample for metagenomics, you can expect to see 2400 microbial species. As a result, one metagenomics file is 250 GB, and 50 metagenomics samples result in 149 workflows invoked 42K times, producing 271K files and 116K graphs.

I was surprised by the size of this data. And this is only the food safety aspect of SCM. So the data sizes SCM generates have the potential to grow quickly, and we have a big data problem in SCM. Then how do we store this huge data on all blockchain nodes? Big data will not only overwhelm the storage resources of the nodes but also strain the network. Are we going to just store the hash of the big data? If so, why can't we store the hash in an append-only multiversion distributed datastore and call it a day?
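For illustration, here is a rough sketch of the "store only the hash" option, using a hypothetical file name and the toy ledger sketched above: stream the huge sample file through SHA-256 in chunks and record only the 32-byte digest, while the file itself stays in a regular (distributed) datastore.

```python
import hashlib

def file_digest(path, chunk_size=64 * 1024 * 1024):
    """Stream the file through SHA-256 so even a 250 GB sample never
    needs to fit in memory; only the digest gets recorded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

# digest = file_digest("sample_0042.metagenomics")              # hypothetical file
# ledger.append("lab-3", {"sample": "0042", "sha256": digest})  # anchor the digest
```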

Moreover, don't you need to run some analytics on this data? So you will need a database/datastore anyhow. Instead of investigating ways to do SCM with blockchains, wouldn't it make more sense to look for ways to augment existing database solutions with provenance/attestation? (I have been hearing about provenance work in databases but never had time to learn about it.)

3. How do we teach blockchains to keep secrets?
For SCM, organizations would not like other parties to know the information about the goods being transported, so as not to lose their strategic advantage. I guess there are some techniques to anonymize information, and some blockchains use them. But given that this information is permanently stored, how hard is it for an incentivized third party to data-mine (in the machine learning sense, not the blockchain sense :-) this information and de-anonymize it?

4. One chain to rule them all?
If we use blockchains for SCM, how do we get the public miners on board? I guess we give out some tokens via an ICO and ask them to mine for more tokens. People say blockchains facilitate crowdfunding new initiatives, and I certainly appreciate some of the advantages ICOs bring. On the other hand, if done responsibly (and not just playing to people's greed), I am sure this won't be as easy as it sounds. Would there be dozens of SCM blockchains? Would it be possible to maintain a healthy number of diverse miner pools? When things start to level off, would we end up with blockchain graveyards? A responsible startup should also have an end-of-life plan, right?

Thursday, March 22, 2018

Anatomical similarities and differences between Paxos and blockchain consensus protocols

Now that we read through some blockchain papers in our seminar, I started to develop some understanding of the "blockchain way". Since I already know about the "Paxos way", I thought it would be instructive and fun to compare & contrast Paxos and blockchain consensus protocols. If you know about either of these protocols well, this post can help you get a headstart on learning the other one.

I will be referring to the below Paxos figure for the protocol phases, so let's pin it here.


Leader election: silent vs loud


A couple of days ago I tweeted about this. That is how leader election is done in the blockchain consensus protocol: the miner that solves the puzzle first becomes the leader for the current round. For blockchain, it is par for the course that each round is served by a different leader (the one who solves the next puzzle first). In other words, instead of the Phase 1a "propose leadership" and Phase 1b "promise to follow" back-and-forth communication among the nodes, blockchain uses a proof-of-work puzzle to elect a leader. What if multiple miners solve the puzzle at about the same time? See the section below for that comparison.
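For concreteness, here is a toy version of that proof-of-work puzzle (my own sketch; the difficulty is set artificially low so it runs in about a second, whereas real networks tune it so the whole network needs roughly 10 minutes per solution):

```python
import hashlib

def mine(block_header: bytes, difficulty_bits: int = 18):
    """Find a nonce whose SHA-256 hash (with the header) falls below the target."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
        nonce += 1

nonce, block_hash = mine(b"prev_hash||merkle_root||timestamp")
print(f"won the silent election with nonce={nonce}, hash={block_hash}")
```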

For Paxos, leader election is done via Phase 1. A leader candidate sends a Phase 1a message with its ballot number (essentially a counter, or a logical clock if you like) and tries to get OKs from a majority of the participants in the quorum. Receiving even one "Not OK" message is a deal breaker, because it indicates there is another leader candidate with a higher ballot number reaching the participants.
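Here is a minimal sketch of the acceptor side of Phase 1 (my own toy notation, not any particular Paxos implementation): promise only to higher ballots, and report back any value already accepted.

```python
class Acceptor:
    def __init__(self):
        self.promised_ballot = -1     # highest ballot promised so far
        self.accepted_ballot = -1     # ballot of the last accepted value
        self.accepted_value = None

    def on_phase1a(self, ballot):
        if ballot > self.promised_ballot:
            self.promised_ballot = ballot
            # "OK": promise, plus the previously accepted (ballot, value), if any
            return ("OK", self.accepted_ballot, self.accepted_value)
        # "Not OK": a higher-ballot candidate is already out there
        return ("NACK", self.promised_ballot, None)
```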

In Paxos applications, we typically like to continue with the same leader for many rounds, actually for as long as we can. So, to avoid paying the cost of Phase 1 repeatedly, MultiPaxos skips Phase 1 and keeps iterating over Phase 2 messages for consecutive rounds. If a contending leader candidate arises, the Phase 1 leader election may need to happen again. An orthogonal mechanism, such as a failure detection service, is used to ensure that there are not many candidates running to become leader at a given time, as that may interfere with the progress of the protocol: the "dueling leaders" situation. But worry not. Even when there are multiple leader candidates, Paxos is safe, as we discuss below.

Multiple leaders case: fork vs no-fork

In blockchain, the difficulty of the puzzle makes sure that the probability of having two leaders at the same time is low. When there are multiple leaders, you get a fork. The fork is resolved eventually, within a couple more block additions, because most likely one fork will grow faster (the chance of both forks growing at the same speed becomes negligible after a couple of blocks) and the participants prefer to mine on the longest chain to have a chance of receiving the incentive: the mining reward if they become the leader for that round. In other words, blockchain provides an eventually-consistent/stabilizing consensus protocol.
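Here is a back-of-the-envelope estimate, with assumed numbers for propagation delay and block interval, of why a fork rarely survives more than a block or two: once one tip gets the next block and the news propagates, honest miners switch to it, so the fork survives another round only if both tips are extended within roughly one propagation delay of each other.

```python
PROPAGATION_DELAY_S = 10   # assumed time for a new block to reach most miners
BLOCK_INTERVAL_S = 600     # average proof-of-work solution time

# Rough chance that the race stays tied for another block
p_tie_again = PROPAGATION_DELAY_S / BLOCK_INTERVAL_S

for extra_blocks in (1, 2, 3):
    print(f"fork still alive after {extra_blocks} more block(s): "
          f"~{p_tie_again ** extra_blocks:.1e}")
# With these assumptions the fork is all but guaranteed to resolve within a
# couple of blocks, which is why miners simply follow the longest chain.
```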

In Paxos, a failure detector service (which relies on heartbeats and often has some randomized backoff implementation) helps converge the number of leaders/candidates to 1. However, Paxos has a built-in mechanism (the ballot numbers and majority acknowledgments of Phase 1 and Phase 2) that ensures the safety of agreement even when there are multiple active leaders/candidates. If a participant is aware of an alternative candidate/leader with a higher ballot number, it sends a "Not OK" message to the incumbent leader. The incumbent leader waits to receive acknowledgments from a majority of participants before committing a value. If it receives a "Not OK" message, it needs to go back to Phase 1 to duel in the leader election again. On the other hand, it is safe for the incumbent leader to commit a value after a majority of acknowledgments is received, because even if the incumbent leader is dethroned, the newcomer is bound by the Paxos protocol rules to repropose and accept the same value as the consensus decision. Therefore, even when there are multiple leaders, you won't have forks in Paxos, because only one of them --if any at all-- will be able to complete Phase 2 and commit a value.
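Continuing the toy Acceptor sketch from the Phase 1 section, here is roughly why Paxos cannot fork: a new leader is forced to adopt the highest-ballot accepted value reported by its Phase 1 quorum, and acceptors reject Phase 2a proposals with stale ballots.

```python
from types import SimpleNamespace

def pick_value(phase1_replies, my_value):
    """Leader rule: re-propose the value with the highest accepted ballot, if any."""
    accepted = [(b, v) for _, b, v in phase1_replies if v is not None]
    return max(accepted)[1] if accepted else my_value

def on_phase2a(acceptor, ballot, value):
    """Acceptor rule: accept only if the ballot is at least what was promised."""
    if ballot >= acceptor.promised_ballot:
        acceptor.promised_ballot = ballot
        acceptor.accepted_ballot = ballot
        acceptor.accepted_value = value
        return "ACCEPTED"
    return "NACK"

# An acceptor that already accepted "X" at ballot 5:
acc = SimpleNamespace(promised_ballot=5, accepted_ballot=5, accepted_value="X")
print(on_phase2a(acc, 4, "Y"))                                # NACK: stale ballot
print(on_phase2a(acc, 7, pick_value([("OK", 5, "X")], "Y")))  # ACCEPTED, and the value is "X" again
```

Since the incumbent commits only after a majority accepts, and any later leader's Phase 1 quorum intersects that majority, the later leader necessarily sees and re-proposes the committed value: no fork.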

Communication in Accept phase: 1-way vs 2-way 

The blockchain protocol employs communication only once and only in one direction, from the leader to the participants; this corresponds to Phase 2a of Paxos, the Accept message phase.

Since the blockchain consensus protocol skips Phase 2b and Phase 3, it unavoidably provides a probabilistic consensus. (I had talked about why this is the case for leader-based distributed consensus such as Paxos in this previous post.) Since blockchain consensus is inherently probabilistic, you need to wait about 6 more blocks to see whether the block sticks. At that point the work required to rewrite history is so large that you can conclude the block is finally committed.
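As a back-of-the-envelope, here is the catch-up probability calculation from the Bitcoin whitepaper (my quick transcription, so treat the numbers as indicative): the chance that an attacker controlling a fraction q of the hash power rewrites a block that already has z confirmations on top of it.

```python
from math import exp, factorial

def attacker_success(q, z):
    p = 1.0 - q                 # honest fraction of the hash power
    lam = z * (q / p)
    prob = 1.0
    for k in range(z + 1):
        poisson = lam ** k * exp(-lam) / factorial(k)
        prob -= poisson * (1 - (q / p) ** (z - k))
    return prob

for z in (1, 3, 6):
    print(f"q=0.10, z={z}: {attacker_success(0.10, z):.6f}")
# With 10% attacker hash power, 6 confirmations push the success probability
# below 0.1% -- hence the folklore of waiting for 6 blocks.
```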

For Phase 2 in Paxos, the leader uses two-way acknowledged communication. The two-way communication works OK for the leader because the number of participants is typically less than half a dozen (for the reasons explained below).

On the other hand, if thousands of nodes participated in a Paxos protocol, the leader would not be able to keep up with receiving thousands of ack messages for each round. (Well, I guess it could, but the phases would take several seconds rather than the typical tens of milliseconds.) To avoid this problem with thousands of participants, blockchain uses only one-way broadcast communication, propagated from the leader to the participants via a gradual gossip protocol.
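A rough estimate of why one-way gossip scales here (the fanout of 8 is an assumption picked just for illustration): with fanout f, a block reaches N nodes in about log_f(N) gossip rounds, whereas a Paxos-style leader would have to process N acks itself in every round.

```python
from math import ceil, log

def gossip_rounds(n_nodes, fanout=8):
    """Approximate rounds for a block to reach all nodes via gossip."""
    return ceil(log(n_nodes, fanout))

for n in (5, 1_000, 10_000):
    print(f"{n:>6} nodes -> ~{gossip_rounds(n)} gossip rounds (fanout 8)")
```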

So to recap, and to refer back to the figure, here is what we have. Blockchain does a silent Phase 1 (via proof-of-work) and follows that with only Phase 2a.

The environment: public vs private

The typical use case of Paxos is to replicate state to make it consistently and safely survive crash failures in a private environment. So keeping the number of participants small (around 5 nodes) is OK for this use case.

The blockchain applications bring new constraints and requirements to the consensus problem. In blockchains, the participants can be Byzantine, motivated by financial gains. So it is not sufficient to limit the consensus participants to 3-5 nodes, because the rest of the network does not necessarily trust those nodes. In blockchains, for reasons of attestability and tolerating colluding groups of Byzantine participants, it is preferred to keep the number of participants in the thousands to tens of thousands. Thus the problem becomes: how do you design a Byzantine-tolerant consensus algorithm that scales to thousands of participants?

Moreover, the consensus problems solved are also slightly different. The Bitcoin-NG paper provided a nice formulation of the blockchain consensus problem in its model section.

Comparing the Nakamoto consensus problem with the classic Byzantine consensus problem is very instructive. Nakamoto consensus has probabilistic guarantees for Termination, Agreement, and Validity, whereas classic Byzantine consensus (which a Byzantinized version of Paxos solves) has deterministic guarantees for them. (To keep the discussion simple in this post, I didn't get into a Byzantine-tolerant Paxos protocol, which is more complicated. By utilizing randomization/probabilities, blockchain gets away with a really simple protocol --I have to give it that.)

MAD questions

(This section is here due to my New Year's resolution.)
1. Well, this entire post started as a result of a MAD question: "What if I try to map Blockchain protocol components to that of Paxos protocol components?" So I chalk that up as my first MAD question :-)

2. Now that we have this mapping, is there a way to leverage it to synthesize a new insight? What is the next step? Sometimes providing such a mapping can give someone a novel insight, which can lead to a new protocol.

Tuesday, March 20, 2018

Leveraging data and people to accelerate data science (Dr. Laura Haas)

Last week Dr. Laura Haas gave a distinguished speaker series talk at our department. Laura is currently the Dean of the College of Information and Computer Sciences at the University of Massachusetts Amherst. Before that, she was at IBM for many years, most recently serving as the Director of IBM Research's Accelerated Discovery Lab. Her talk was on her experiences at the Accelerated Discovery Lab and was titled "Leveraging data and people to accelerate data science".

Accelerated Discovery Lab

The mission of the lab was to "help people get insight from data -- quickly".
The lab aimed to manage the technology and data complexity, so that clients can focus on solving their challenges. This job involved providing the clients with:

  1. expertise: math, ml, computing
  2. environment: hosted big data platforms
  3. data: curated & governed data sets to provide context
  4. analytics: rich collection of analytics and tools.

By providing these services, the lab "accelerated" around 30 projects. The talk highlighted 3 of them:

  • Working with CRM and social media information, which involved managing complex data.
  • Working with a major seed company, which involved analyzing satellite imaging, and deciding what/how to plant, and choosing the workflows that provide auditability, quality, accuracy.
  • Working with a drug company, which involved "big" collaboration across diverse teams and organizations.

The Food Safety Project

The bulk of the talk focused on the food-safety case-study application that the lab worked on.

Food safety is an important problem. Every year, in the USA alone, food poisoning affects 1 in 6 people, causing around 50 million illnesses, 128K hospitalizations, and 3K deaths, and costing 8 billion dollars.

Food poisoning is caused by contaminants/pathogens introduced in the supply chain. Laura mentioned that the state of the art in food testing was based on developing a suspicion and then testing a culture for it. This means you need to know what you are looking for and when you are looking for it.

Recently DNA sequencing became affordable and fast, which enabled the field of metagenomics. Metagenomics is the study of genetic material (across many organisms rather than a single organism) recovered directly from environmental samples. This enables us to build a database of what the normal bacteria pattern is for each type of food. Bacteria are the tiny witnesses to what is going on in the food! They are the canary in the coal mine. A change in the bacteria population may point to several pathologies, including lead poisoning. (Weimer et al., IBM Journal of R&D, 2016.)

The big data challenge in food safety

If you have a safe sample for metagenomics, you can expect to see 2400 microbial species. As a result, one metagenomics file is 250 GB! And 50 metagenomics samples result in 149 workflows invoked 42K times producing 271K files and 116K graphs.
(Murat's inner voice: Umm, good luck blockchainizing that many big files. Ok, it may be possible to store only the hashes for integrity.)

Another big challenge here is that contextual metadata is needed for the samples: when was the sample collected, how, under what conditions, etc.

A data lake helps with the management tasks. A data lake is a data ecosystem to acquire, catalogue, govern, find, and use data in context. It includes components such as

  • schema: fields, columns, types, keys
  • semantics: lineage, ontology
  • governance: owner, risk, guidelines, maintenance
  • collaboration: users, applications, notebooks

Since the data collected involves sensitive information, the employees who had access to the data were made to sign away rights, to ensure the privacy of the data. Violating these terms constituted a fireable offence. (This doesn't seem to be a rock-solid process though. It relies on the good intentions of people and the coverage of the monitoring to save the day.)

The big analytics challenge in food safety

Mapping a 60 GB raw test-sample data file against a 100 GB reference database may take hours to days! To make things worse, the data and reference files keep getting updated.

The lab developed a workbench for metagenomic computational analytics. This multipurpose, extensible analysis platform tracks 10K datasets and their derivations, performs automatic parallelization across compute clusters, and provides an interactive UI as well as repeatability/transparency/accountability. (Edund et al., IBM Journal, 2016)

The big collaboration challenge in food safety

Four organizations worked together on the food safety project: IBM, the Mars pet-food company (pet-food chains were the most susceptible to contamination), UC Davis, and Bio-Rad labs. Later Cornell also joined. The participants spanned the US, UK, and China, and they didn't speak the same language: the disciplines spanned business, biology, bioinformatics, physics, chemistry, CS, ML, statistics, math, and operations research.

For collaboration, email is not sufficient. There are typically around 40K datasets; which ones would you be emailing? Moreover, email doesn't provide enough context about the datasets and metadata.

To support collaboration, the lab built a labbook integration hub. (Inner voice: I am relieved it is not Lotus Notes.) The labbook is a giant knowledge graph that is annotatable, searchable, and interfaced with many tools, including Python, notebooks, etc. Sort of like Jupyter notebooks on steroids?

Things learned

Laura finished with some lessons learned. Her take: data science needs to be holistic, incorporating all three of people, data, and analytics.

  • People: interdisciplinary work is hard; social practices and tools can help
  • Data: data governance is hard; it needs policies and tools
  • Analytics: many heterogeneous tools need to be integrated to handle uncertainty

As for the future, Laura mentioned that being very broad does not work/scale well for a data science organization, since it is hard to please everyone. She said that the Accelerated Discovery Lab will focus on metagenomics and materials science to stay more relevant.

MAD questions

(This section is here due to my New Year's resolution.)

1. At the question-and-answer time for the talk, I asked Laura about IBM's recent push for blockchains in food safety and supply-chain applications. Given that the talk outlined many hard challenges for food safety supply chains, I wanted to learn her views on what roles blockchains can play here. Laura said that in food supply chains there were some problems with faked provenance of sources, and blockchains may help address that issue. She also said that she won't comment on whether blockchain is the right way to address it, since it is not her field of expertise.

2. My question was prompted by IBM's recent tweet of this whitepaper, which I found to be overly exuberant about the role blockchains can play in supply-chain problems. (https://twitter.com/Prof_DavidBader/status/964223296005967878) Below is the Twitter exchange that ensued after I took issue with the report. Notice how mature the @IBMServices account is about this? They pressed on to learn more about the criticism, which is a good indicator of intellectual honesty. (I wish I could have been a bit more restrained.)


I would like to write a post about the use of blockchains in supply chains. Granted, I don't know much about the domain, but when did that ever stop me?

3. This is a little unrelated, but has there been any application of formal methods to the supply-chain problem before?

4. The talk also mentioned a bit about applying data science to measure the contributions of data science in these projects. This is done via collaboration trace analysis: who is working with whom, how much, and in what contexts? A noteworthy finding was the "emergence of new vocabulary right before discovery events". This rings very familiar from our research group meeting discussions. When we observe an interesting new phenomenon/perspective, we are forced to give it a made-up name, which sounds clumsy at first. When we keep referring to that given name, we know there is something interesting coming up from that front.