Tuesday, December 18, 2012

Our brains are radios

Ok, our brains are not actually radios, sorry for the provocative title. But as I will explain shortly, there are some interesting parallels. When I teach about wireless radios in my networking classes, I talk about this connection, and this always perks my students up. The connection stems from the fact that both radios and brains are signal processing systems. The concepts behind how radio receivers work explain some of the mysterious phenomena in our brains, particularly the hearing of phantom tunes.

This is how radios work

Most people have an incorrect model of how radios and wireless radio communication work. The layman thinks radios are simple. The transmitter does all the heavy lifting, and puts the signal on the air. Before the signal fades away completely, if the receiver device is in communication range, it will pick up the signal by listening. This naive understanding suggests that the receiver is a passive device, as listening is perceived mostly as a passive activity. But, nothing can be further from the truth. The receiver spends considerable energy to power its radio and amplify the signal in order to receive it.

When the receiver powers its radio to amplify signals, it is also amplifying noise. The signal to noise ratio (SNR) determines successful reception: if the signal has not faded, it will get amplified more, stand out, and decoded correctly. This gives raise to an interesting question: When there is no signal in the channel, what does the receiver radio hear? The answer is the receiver radio hears amplified noise. But the receiver radio does not have a way to know that this is noise. It picks up the signal as 0 and 1 bits, and passes this to the higher layer, MAC layer, which can have more context to differentiate between valid message and noise/junk message.

A peculiarity of the radios

In 2004, I was doing some MAC layer hacking on the wireless sensor nodes. For one experiment I bypassed the MAC layer on the node to listen/record the radio at the link layer. The lone node was recording bits, 0s, 1s, and so forth. I was surprised. I double-checked that there was no node transmitting in the vicinity, I tried again, and yes the receiver was still picking up phantom messages from thin air. This is because, the receiver radio is powered up and it is amplifying noise and receiving it as a phantom message. The reason this is not a problem in normal operation of the motes is because the MAC layer does the filtering of the phantom messages. To this end, the MAC layer imposes a preamble check (for at least a couple bytes) before recording a reception of a message. If the initial part of a stream of bits matches the preamble pattern (say 010101010101010101010101) than the rest of the message is recorded and delivered to the application layer. Otherwise, this is just noise, and should be ignored. Another important detail here is that if the preamble length is chosen too short (say 1 byte long), sooner or later the random noise stream will match the preamble pattern and a phantom (random noise) message will be received and passed to the application layer.

A peculiarity of our brains

This finally brings me to the brain and phantom tune connection. When you are lying in a dark room, deprived of any visual and audio stimulation, do you sometimes start hearing a tune (a phantom tune)? Tell me I am not the only one. It sometimes sounds very vivid, and you almost think you are actually hearing it play. (Sensory deprivation tanks is a good place to experience this. People who lost hearing also experience these in more intensity as reported in these articles.) This phantom tune is your brain lying to you, like the radio picking up phantom messages from thin air. This is because your brain is a signal processing machine and it is trying to latch on to some signals by boosting them. When there is no outside stimuli to latch on to, the noise gets amplified. In this case, the noise is the random firings of your neurons, which happens incessantly. And if it happens that a sequence of random firings of neurons matches the initial sequence (preamble!) of an old tune stored in your memory, your brain latches on to that signal and fills the rest of the signal playing from your memory, and you feel like you are vividly hearing that music again.

A related Kurt Vonnegut story from TimeQuake

I will of course not miss this opportunity to include a short piece on this theme from my favorite writer Kurt Vonnegut. This story appears in the TimeQuake book.
For the record: Dr. Fleon Sunoco at the NIH, who is independently rich, hires grave robbers tovbring him the brains of deceased members of Mensa, a nationwide club for persons with high Intelligence Quotients, or IQs, as determined by standardized tests of verbal and nonverbal skills, tests which pit the testees against the Joe and Jane Sixpacks, against the Lumpenproletariat.
His ghouls also bring him brains of people who died in really stupid accidents, crossing busy streets against the light, starting charcoal fires at cookouts with gasoline, and so on, for comparison. So as not to arouse suspicion, they deliver the fresh brains one at a time in buckets stolen from a nearby Kentucky Fried Chicken franchise. Needless to say, Sunoco's supervisors have no idea what he's really doing when he works late night after night.
They do notice how much he likes fried chicken, apparently, ordering it by the bucket, and that he never offers anybody else some. They also wonder how he stays so skinny. During regular working hours, he does what he is paid to do, which is develop a birth control pill that takes all the pleasure out of sex, so teenagers won't copulate.
At night, though, with nobody around, he slices up high-IQ brains, looking for little radios. He doesn't think Mensa members had them inserted surgically. He thinks they were born with them, so the receivers have to be made of meat. Sunoco has written in his secret journal: "There is no way an unassisted human brain, which is nothing more than a dog's breakfast, three and a half pounds of blood-soaked sponge, could have written 'Stardust,' let alone Beethoven's Ninth Symphony."
One night he finds an unexplained little snot-colored bump, no larger than a mustard seed, in the inner ear of a Mensa member, who as a junior high schooler had won spelling bee after spelling bee. Eureka!
He reexamines the inner ear of a moron who was killed when, she was grabbing door handles of fast-moving vehicles while wearing Rollerblades. Neither of her inner ears has a snot-colored bump. Eureka!
Sunoco examines fifty more brains, half from people so stupid you couldn't believe it, half from people so smart you couldn't believe it. Only the inner ears of the rocket scientists, so to speak, have bumps. The bumps have to have been the reason the smarties were so good at taking IQ tests. An extra piece of tissue that little, and as nothing but tissue, couldn't possibly have been much more help than a pimple. It has to be a radio! And radios like that have to be feeding correct answers to questions, no matter how recondite, to Mensas and Phi Beta Kappas, and to quiz show contestants.
This is a Nobel Prize-type discovery! So, even before he has published, Fleon Sunoco goes out and buys himself a suit of tails for Stockholm.
...
Trout said: "Fleon Sunoco jumped to his death into the National Institutes of Health parking lot. He was wearing his new suit of tails, which would never get to Stockholm. "He realized that his discovery proved that he didn't deserve credit for making it. He was hoist by his own petard! Anybody who did anything as wonderful as what he had done couldn't possibly have done it with just a human brain, with nothing but the dog's breakfast in his braincase. He could have done it only with outside help."

Thursday, November 29, 2012

From Clarity to Efficiency for Distributed Algorithms

Y. A. Liu, S. D. Stoller, B. Lin, M. Gorbovitski, Appeared in OOPSLA'12.

This paper describes a high-level language, DistAlgo, for clear description of distributed algorithms (e.g., Paxos, mutual execution). In contrast to embarrassingly parallel tasks which involve very regular and simple synchronization patterns (e.g., MapReduce tasks), distributed algorithms generally involve complex and tight synchronization patterns. DistAlgo specifies these complex synchronization conditions via using logical quantifications over message histories of processes involved in the algorithm. But, unfortunately, this turns out to be extremely inefﬁcient if executed straightforwardly: each quantiﬁer will cause a linear factor in running time, and any use of the history of messages sent and received will cause space usage to be unbounded.

To address this inefficiency, this paper focuses on optimizing implementations of DistAlgo programs by incrementalization of these expensive synchronization conditions. To this end, their method first transforms these quantifications are into set queries, and then transforms further to minimize their nesting. Finally, from here, the method systematically extracts single quantified order comparisons, and transforms them into efficient incremental updates of auxiliary variables as messages are sent and received. (An earlier paper that focuses more on the syntax of DistAlgo, and uses Paxos as a running example, appeared at SSS'12.).

To give a glance of DistAlgo and how it does this transformations, I am including some figures from the paper as is. The paper uses Lamport's mutual exclusion algorithm as a running example and shows its DistAlgo and automatically optimized DistAlgo implementation. DistAlgo fills a gap and seems like it can be very useful in the coming years for writing distributed programs. Distributed algorithms can be expressed in DistAlgo clearly at a high level, like in pseudocode, but also precisely, like in formal speciﬁcation languages, and be executed as part of real applications, as in programming languages. DistAlgo extends Python with processes as objects, send/receive of messages, yield points, synchronization conditions as logical quantifications, and configuration of distributed system topology. DistAlgo compiler and optimizer are implemented in Python (building on Python's parser module and AST package) and generates executable Python code.

Tuesday, November 20, 2012

Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary

Appeared in OSDI '12, Cheng Li, Daniel Porto, Allen Clement, Johannes Gehrke, Nuno Preguica, and Rodrigo Rodrigues.

In order to reduce latencies to geographically distributed users, big webservices companies (Google, Yahoo, Facebook, Twitter) replicate data across geographical regions. But replication across datacenters, create cross-site consistency problems, which is further complicated with the huge WAN latency delays. If you want to have strong consistent updates across sites, you have to pay the price in terms of latency (basically you revert to doing synchronous replication via say Paxos as in MDCC). This is the ELC part in the PACELC.

To alleviate this latency versus consistency tension, this paper proposes RedBlue consistency, which enables blue operations to be fast/asynchronous (and eventually consistent) while the remaining red operations are strongly-consistent/synchronous (and slow). So a program is partitioned into red and blue operations, which run with different consistency levels. While red operations must be executed in the same order at all sites (which make them slow), the order of execution of blue operations can vary from site to site (allowing them to be executed without requiring coordination across sites). "In systems where every operation is labeled red, RedBlue consistency is equivalent to serializability; in systems where every operation is labeled blue, RedBlue consistency allows the same set of behaviors as eventual consistency."

To facilitate this red blue partioning, each program operation u is split into two components: a generator operation g_u with no side effects, which is executed only at the primary site against some system state S, and produces a shadow operation h_u(S), which is executed at every site (including the primary site).

This is simply separation of concerns principle. The generator operation decides which state transitions should be made while the shadow operation applies the transitions in a state-independent manner. This separation leads to a fine-grained classification of operations, with potentially more execution paths leading to blue operations. Also this leads to a simple logic for shadow operations that can be based on operations that are intrinsically commutative (increment/decrement, insertion/removal) or that commute via last-writer-wins strategy.

Red Blue rules

Now here are the rules for labelling the [shadow] operations as red or blue.
1. For any pair of non-commutative shadow operations u and v, label both u and v red.
2. For any shadow operation u that may result in an invariant being violated,  label u red.
3. Label all non-red shadow operations blue.
This is unfortunately not an automated process. Developer has to manually partition the operations to generator and shadow operations, and after that mark them as red and blue manually, by following the above rules. As such, this is an error-prone process and is hard to scale.

The below is the description from the paper of how to keep track of the order and consistency of red/blue operations:
When a generator operation completes, the coordinator must determine if the operation reads a coherent system snapshot and obeys the ordering constraints of a causal serialization. To do this, the coordinator checks the timestamps of the data items read and written by the completing operation, and compares them to the timestamps associated with operations completing concurrently and the remote shadow operations that were being applied simultaneously at that site. Upon successful completion of a generator operation the coordinator assigns the corresponding shadow operation a timestamp that is component-wise equal to the latest operation that was incorporated at its site, and increments its blue and, if this shadow operations is red, the red component of the logical timestamp. This timestamp determines the position of the shadow operation in the RedBlue order, with the normal rules that determine that two operations are partially ordered if one is equal to or dominates the other in all components.
The paper reports an evaluation of this idea by modifying the TPC-W and RUBiS benchmarks on an online social network case study (Twitter is the Hello World of georeplication :-). The experimental results show that RedBlue consistency provides substantial performance gains without sacrificing consistency.

Discussion

Although the paper does not mention it, a very relevant work to this is the CALM conjecture and the Bloom project from UC Berkeley. The CALM principle says that (1) logically monotonic distributed code is eventually consistent without any need for coordination protocols (distributed locks, two-phase commit, paxos, etc.) and (2) eventual consistency can be guaranteed in any program by protecting non-monotonic statements ("points of order") with coordination protocols. It is easy to see that the logically monotonic operations correspond to the blue operations, and non-monotonic operations correspond to red operations in the RedBlue work.

After the OSDI presentation of the paper, there were a couple of concerns raised about the approach. Mike Freedman (Princeton) asked the question: "blue cannot overtake the red, so you cannot vary their order, doesn't this degrade performance significantly?". Marcos Aguilera (HP Research) commented that similar approaches have been around; he referred to the generic broadcast work, and the later and more general Generalized Paxos work. The Generalized Paxos work seems to be very related indeed, and I am not sure what in the RedBlue work constitute the major differences. Maybe the RedBlue work provides a case study and more principled approach to identify commutable actions in Generalized Paxos. Another shortcoming of the RedBlue work is that it does not have any fault-tolerance build in. It may be hard to add fault-tolerance as an after thought, so maybe it is best to think of this work in the Generalized Paxos framework.

Friday, November 9, 2012

Sensys'12 (day2)

Here are my much anticipated (at least by Ted :-) notes from the second day of Sensys.

Kyun queue: a sensor network system to monitor road traffic queues

I missed this talk, but this sounds like an interesting paper. The idea seems to be to deploy a transmitter and a receiver node on the opposite sides of a road. Depending on the density and the speed of the cars passing between the transmitter and the receiver, the received radio signal has different signatures. The work studies these signals and figures out the road conditions based on these.

Low cost crowdcounting using audio tones

Crowd counting can be useful for event planning, learning popularity of aisles, etc. This work assumes that everyone in the crowd has a smartphone and this counting app running on their phone. (This is the biggest limitation of the work.) This work uses the audio tones (again ultrasound tones) emitted by the phones to count the people. The app employs the smartphone mic and speaker, no additional hardware is needed. The system uses an initiation tone to start. The phone that hears this tone picks a frequency using uniform or better geometric choosing method and forwards the tone. The frequencies are picked in the ultrasound region (that same 4khz band alluded to in yesterday's work.) Using uniform choosing method is not prone to collisions, the counts can be off due to the losses in collisions. The geometric choosing method employs geometric distributions and hence is more prone to collisions. (My work on thresholding counting is very applicable for this problem: Singlehop Collaborative Feedback Primitives for Threshold Querying in Wireless Sensor Networks.) They did evaluations with 25+ phones, but in pretty controlled environments (phones at hand). The phones don't work well at pockets or behind clothing. The audio tone method is said to be low-power, but it is actually similar cost to the wifi. The authors didn't use wifi because its range can be too long for the region to be counted. They said that, if ZigBee were available on the phones, they would use it.

How refreshing! The presenter and a lead author of this work is an undergraduate student. This work aims to build a platform (hardware and software components) to collect analog sensor (any sensor) data through the audio port of any phone. But the problem is there is almost no standard I/O connector, no standard programming interface, no third party app support on the smartphones. HiJack did a great job on the iPhone using the audio headset port (the Energy-efficient GPS and the MusicalHeart works at this Sensys used the HiJack approach). But HiJack is iPhone specific. This work extends to other phones using available power from "mic bias voltage" at the audio headset port. Powered this way, the analog data/voltage from the sensor attached to the audio port are encoded as the amplitude of a square wave. This work uses the phone's builtin voice recording app (which is already optimized to be low-energy and use hardware compression) to capture and store the attached sensor data and the phone's ability to send the voice recording via email/mms for exporting the captured data. The software to process the signal and interpret it is at the cloud. The platform is evalued with an EKG application. Now you can use your smartphone as a holter device! The authors estimate that the hardware board for this platform would cost $5 in mass production. Prius: generic hybrid trace compression for WSNs The goal of this work is to compress traces efficiently. Traces are important for debugging WSN deployments, but trace sizes can become large, so they need to be compressed otherwise whether you write to flash (most often the case) or exfiltrate to the basestation you are wasting a lot of precious energy. Traditional compression techniques (prediction-based, dictionary-based grammar-based) are not applicable for WSNs because they require more memory and computation than is available in WSNs. The insights behind Prius work are that traces are highly repetitive, traces evolve slightly, and program memory is not as scarce as data/var memory. The Prius approach is to mine trace patterns offline (static-analysis of the program code?) and to store them in program memory. In the offline learning phase, Prius creates a compression dictionary, this dictionary is then processed by data structure generator to obtain an efficient data structure data structure to be used in the headerfile of the deployed program. DoubledipL leveraging thermoelectric harvesting for low power monitoring of sporadic water use This work does waterflow monitoring. The motes they built include acceleration and temperature sensors and use wireless communication. The motes are harvesting their energy on the pipes using the thermal gradients of the hot/cold water in the pipe and the environment temperature. Business meeting There was a short business meeting in the afternoon. We learned that next year Sensys will be in Rome, Italy at the Sapienza University, Nov 11-15. The rest of the business meeting was spent for some soul-searching about Sensys. Is the excitement for Sensys diminishing as maybe indicated by the Sensys submission and attendance numbers are going down. With the smartphone going big, and also getting a lot of focus at Sensys, how can Sensys distinguish itself from Ubicomp or Ubisys? Andrew Campbell (Darthmouth) said he is nervous and somewhat pessimistic about the future of Sensys. He called on to the PhD students, Assistant professors, to get up and be more engaged/central in Sensys community. Also several people voiced interest for having a workshop on big data / machine learning as part of Sensys. Demo session A lot of the papers in the conference got demoed in the demo and poster session. Probably there were about 10 demos, and 20 more posters. The most interesting demo must have been the proximity social interaction monitoring sensors demo from University of Utah. They had brought in 300 motes. Their large programming board, can program upto 100 motes at the same time. They produced their own hardware, and said each mote cost about$30. They distributed upto 100 motes for monitoring, about half of the people in attendance got a mote from them. They also had stationary motes under each poster table. Although the reception area was quite large, is L shaped, and a lot of humans were in the way (all very detrimental to wireless communication), the stationary motes talk to singlehop to their basestation connected to the laptop. The mobile nodes had much smaller communication radius, they captured signals they heard from nearby motes, and broadcasted these so that stationary motes can pick these up and forward them to the basestation.

The muscle fatigue monitoring demo was also very interesting. Yes, this was the guy with the yoga pants. I got to take a closer look at the Cleon mote from the best paper (energy efficient GPS). I tried the MusicalHeart demo. I was told that music would not be able to slow my heart rate at high activity level (jogging) as I suspected. I was told that I had to set my tempo to the music, and then my heart rate will slow. So the music only becomes the cue to speed up or slow down my activity. Sort of a bummer.

You could understand if a talk/paper is well received depending on the number of people asking questions after the talk. The talks that receive no questions mean something was not quite right. (On related news, if I do not receive any questions after my presentations, I will probably have a breakdown and start crying.)

The tea (especially the vanilla orchard variety) was very good and there was a good supply of tea. I observed that more people were drinking tea relatively to previous years. I like this trend as a tea person.

The lunch arrangement was the weak point in the otherwise great organization. Lunch is picked up, and since there wasn't a lunch ballroom, we had to eat lunch in the conference room... facing the screen. This made for a lonely and awkward lunch. Having a lunch room which forces people to sit together helps create chance acquaintances, which adds a lot to the conference experience. I still find it hard to meet new people at the conferences. I think it would be really nice if conferences organized some nerdy ice-breaking events. The demo and poster session in Sensys helped a lot for this. Conferences should try more playful things for ice-breaking. Maybe, instead of writing conference name on nametags (duh, everybody in attendance knows conference name), write a twitter bio-like self-describing line about the person, or what that person wants to talk about: "ask me about crowdsourcing coffee shop line waiting times". Another idea could be to get people collaborating on an acitivity, game, or challenge. For example get people in attendance collaboratively write/edit/produce a dozen different conference summaries! Editors and webservices already exists for collaborative editing, and maybe the conference should randomly or purposefully cluster people into these editing groups.

Hmm, maybe I should start a conference critiquing business. Just pay my registration and travel, and I am set :-) I will write conference critics and reviews for your conference.

Finally, this is my motivation level after this conference.

Thursday, November 8, 2012

Sensys'12 (day 1)

I am attending Sensys'12  at Toronto, which is a short distance from Buffalo. Sensys is a premiere conference for wireless sensor network and, more recently, smartphone sensing research. Here are my rough notes from the first day of conference.

This paper is really good, it blew me away. Later at the conference banquet Wednesday night, the paper got selected as the Best Paper. This work aims to break up the GPS as a blackbox and see how we can make it more energy-efficient. The paper goes into fairly technical detail about how GPS works, which I cannot replicate here. Some basics are: GPS uses CDMA to transmit 50bps information over a carrier signal of 1.575Ghz. Seeing more satellites improves accuracy. GPS localization requires the following phases on the device: acquisition, tracking, decoding, code-phase, least-square calculation.

The insight of the paper is to offload some of these phases to the cloud, so the device does less work and spends less power. For example, the codephase involves getting the "ephemeris" from the signal for learning the location of satellites, but this look up table is already made available by NASA and alike in cloud. So that can be easily offloaded to the cloud, and the device does not need to do it. Another trick is coarse time navigation. If a nearby landmark has same/close nms for the GPS signals (nms= number of milliseconds for time of flight of GPS signal), instead of calculating the location, use the landmark's location. If there is no such known location, then leastsquare calculation may give several possible locations, but there is still hope. It is possible to use the elevation information (the height dimension often ignored in the lattitude-longitude pair) as a checksum and eliminate the spurious possibilities and to figure out the correct location. The elevation data and the computation power to do this is available at the cloud side.

Using similar tricks, they reduce the duration of the acquisition phase at the device and the amount of information to be gathered by the device from the satellites. They develop hardware for this (they call this Co-GPS design). The Cleon prototype is comprised of max2769 gps receiver + msp430 as the chip, and talks to the phone or device (ipod?) using audio communication through the headphone jack. Cleon ends up sampling GPS only for 2ms (16 mHz sampling, 2 bit/sample, 2ms) as that is enough for accurate localization (within 30meters) via offloading to the cloud-side. The awesome thing about Cleon is that it can run on 1.5 years on 2AA battery with continuous sampling. Wow! (This calculation does not take into account the communication cost to cloud as this is done by the device Cleon connects to and can be over wifi, bluetooth, etc.). A byproduct of Cleon is atomic time stamping for outdoor sensors. They will opensource the software and the hardware soon.

IOdetector: a generic service for indoor outdoor detection

This paper aims to provide a service for the smartphone to detect whether it is currently outdoor, semi-outdoor, indoor. The application of this service is to automatically and in "an energy-efficient manner" detect when the phone should not even bother activating the GPS for an app because the phone is indoor or semi-outdoor and the GPS error will be big, and it would not justify the energy wasted by the GPS. The service uses three modalities: light signal, cellular signal, magnetic field signal. Also secondarily, acceleration and proximity & time (to sanitize light readings) are employed. These are all very low energy sensors. The light detector accounts for day or night, and can detect indoor fluorescent lighting. The cellular signal detector uses signal from multiple celltowers to avoid the corner-turning effect. However, each subdetector cannot completely classify the three types (indoor, outdoor, semi), so they fuse these subdetectors employing the confidence results provided by them.

Musical Heart : A hearty way to listen to music

It is a proven fact that music has profound effect on human heart rates. Some music calms, some music gets you worked up. This work presents a biofeedback-based context-aware music recommendation app for smartphones, MusicalHeart. This app requires its own special sensor equipped earphones to achieve the biofeedback part. More specifically, there is an IMU (inertial measurement unit) and a microphone embedded in the headphone. The microphone is used for heart-rate detection (requires some filtering), and the IMU for activity detection (employs k-means clustering). These sensors communicates to the smartphone app via the audio jack, and they are powered by a thin film battery. In addition to the sensors in the headphone MusicalHeart app also uses phone GPS and WAP information for context detection.

The Banquet

The conference banquet was held at the CN tower (the very tall thin tower that is the symbol of Toronto). This banquet is up there for me with the Mobicom'11 banquet. The CN tower was a very interesting and exciting venue to have the banquet. Especially with the glass floor!

Sunday, November 4, 2012

An abundance of solutions

My son started the Kindergarten this year. I try to be a good parent and work with him closely on his sight-words and handwriting. Today we were reviewing his weekly reader magazine, and I read to him the following question from the magazine.
There is only one pumpkin left at the pumpkin patch. Mingo and Zip both want the pumpkin. What should they do?
I guess the easy answer (maybe, the answer the magazine tries to teach) is that they share the pumpkin. But my son thought about a little bit, and said they should ask the farmer if the farmer has more pumpkins somewhere. I went along, and together we thought of several more solutions to the problem.

• Mingo and Zip carve the pumpkin nicely, and sell the carved pumpkin at a good price at the bazaar and buy two pumpkins with that money.
• They divide the pumpkin into half. For dividing fairly, you know the trick right. One of them gets to cut in two pieces, and the other gets to choose. (However, my wife argues this is not fair if the person to cut is clumsy and cannot manage to cut into equal pieces. Then the chooser has an obvious advantage.)
• Mingo and Zip share the pumpkin in the time domain. Mingo displays it on the odd numbered days, Zip on the even-numbered. (I wonder if it is possible to share the pumpkin on the frequency domain or via code division.)
• Mingo paints and displays the pumpkin on Halloween, then Zip gets the pumpkin and cooks it.
• They do a coin toss about who gets the pumpkin.
• Mingo pays Zip and gets the pumpkin.
• They take the seeds of the pumpkin, sow the seeds, and next year they have many pumpkins each.
• They both forgo the pumpkin, and present it to one of their mutual friends.
• One of them gets up early and takes the pumpkin, the other gets a good life lesson.
• They help each other and make a nice pumpkin pie together. They apply mapreduce to the problem, make a business out of it, and get rich. See mapreduce in simple terms about how.
OK, I didn't really try to discuss the mapreduce idea with my son. I am waiting a couple more years before we have the mapreduce and writing robust programs talk.

Saturday, November 3, 2012

MDCC: Multi-Data Center Consistency

In order to reduce latencies to geographically distributed users, Google, Yahoo, Facebook, Twitter, etc., replicate data across geographical regions. But replication across datacenters, over the WAN, is expensive. WAN delays are in the hundreds of milliseconds and vary significantly, so common wisdom is that synchronous wide-area replication is unfeasible, which means strong consistency should be diluted/relaxed. (See COPS paper for an example.)

In the MDCC work (Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, March 2012), the authors describe an "optimistic commit protocol, that does not require a master or partitioning, and is strongly consistent at a cost similar to eventually consistent protocols". Optimistic and strongly-consistent is an odd/unlikely couple. The way MDCC achieves this is by starting with Paxos, which is a strongly consistent protocol. MDCC then adds optimizations to Paxos to obtain an optimistic commit protocol that does not require a master and that is strongly consistent at a very low-cost.

Paxos optimized

Classic Paxos is a 2-phase protocol. In Phase 1, a master M tries to establish the mastership for an update for a specific record r. Phase 2 tries to accept a value: the master M requests the storage nodes to store r next (using the ballot number M established in the first phase), and waits for a majority number of ACKs back.

Multi-Paxos optimization employs the lease idea to avoid the need for phase 1 in the fault-free executions (timely communication and master does not crash). The master puts a lease on other nodes on being a master for many seconds, so there is no need to go through Phase 1. (This is still fault-tolerant. If the master fails, other nodes need to wait until the lease expires, but then they are free to chose a new master as per classic Paxos protocol.) Fast-Paxos optimization is more complex. In fast-paxos a node that wants to commit a value first try without going through the master (saving a message round, potentially over the WAN), directly communicating with the other nodes optimistically. If a fast-quorum number of nodes (more than majority number of nodes needed in classic quorum) reply, the fast round has succeeded. However, in such fast rounds, since updates are not issued by the master, collisions may occur. When a collision is detected, the node then goes through the master which resolves the situation with a classic round.

Finally building over these earlier optimizations, the Generalized-Paxos optimization (which is a superset of Fast-Paxos) makes the additional observation that some collisions (different orderings of operations) are really not conflicts, as the colliding operations commute, or the order is not important. So this optimization does not enforce reverting to classic Paxos for those cases. MDCC uses Generalized Paxos, and as a result, MDCC can commit transactions in a single round-trip across data centers in the normal operation case.

The good thing about using Paxos over the WAN is you /almost/ get the full CAP  (all three properties: consistency, availability, and partition-freedom). As we discussed earlier (Paxos taught), Paxos is CP, that is, in the presence of a partition, Paxos keeps consistency over availability. But, Paxos can still provide availability if there is a majority partition. Now, over a WAN, what are the chances of having a partition that does not leave a majority? WAN has a lot of redundancy. While it is possible to have a data center partitioned off the Internet due to a calamity, what are the chances of several knocked off at the same time. So, availability is also looking good for MDCC protocol using Paxos over WAN.

MDCC details

Now, while the optimistic and strongly-consistent protocol is nice, one may argue that there has not been much novel (at least academically) in that part of the paper; MDCC basically puts together known optimizations over Paxos to achieve that optimistic and strongly-consistent replication protocol. The claim to novelty in the MDCC paper comes from their proposed new programming model which empowers the application developer to handle longer and unpredictable latencies caused by inter-data center communication. The model allows developers to specify certain callbacks that are executed depending on the different phases of a transaction. MDCC’s transaction programming model provides a clean and simple way for developers to implement user-facing transactions with potentially wildly varying latencies, which occur when replicating across data centers. Figure below shows an example of a transaction for checking out from a web store.

Evaluation of MDCC is done using the TPC-W benchmark with MDCC deployed across 5 geographically diverse data centers. The evaluation shows that "MDCC is able to achieve throughput and latency similar to eventually consistent quorum protocols and that MDCC is able to sustain a data center outage without a significant impact on response times while guaranteeing strong consistency."

Conclusion

This is a nice and useful paper because it tells a simpler united story about using Paxos (more accurately, Generalized Paxos) for executing transactions across datacenters. This paper also provides a very nice summary of the Paxos and optimizations to Paxos, as well as providing a case study where the usefulness of these optimizations are presented. So even if you are not interested with the details and quantitative measurements about the actual multi datacenter replication problem, the paper is still worth a read in this respect.

Friday, November 2, 2012

I often download pdf presentations from the web, but then when I reopen them in my laptop later, I wonder if they got updated/changed in the meanwhile. It would be nice to have a service where my local copy also gets automatically updated/synced (a la dropbox) when the authoritative source copy gets changed. (The difference from the dropbox model is here the source does not need to cooperate by deploying a sync software, e.g. dropbox, or even be aware that its document is being synced.)

So here is my project idea: auto-sync all web/url downloaded documents, so the downloaded local copy of a file is kept up-to-date with the authoritative copy on the web. I think many people will find this useful. Some other examples of documents downloaded from the web and would benefit from auto-updating include: city documents (e.g., street-parking rules, garbage collection rules, etc.), event calendars, tax documents, CVs. (Do you have any other examples?)

This shouldn't be too hard to build. A completely client-side solution is possible. A client-side software that keeps track of web-downloaded documents, and periodically checks with http to detect whether any of these documents got changed would do it. If a change is detected, the software should download a copy, and should prompt the user when she opens the local document next about whether the updated or the old copy to be used.

Of course a cloud-hosted push-based synchronization solution would be more scalable and efficient. This would be also more gentle for the original content provider as instead of thousands of client-side software periodically checking for updates, only the cloud-hosted synchronization service will check for the update to the document. Upon detecting an update the cloud service will push the updated document to all clients that posses this document. I am sure there are a lot of details to work out to make this service efficient. And there may even be a way to monetize this service eventually, who knows? This cloud hosted synchronization service idea may be seen as extending the CDNs to reach out and embrace the pc/laptops as the last hop for content-replication.

A further generalization of this "auto-sync documents downloaded from the web" idea is to allow any host to modify/update the document and yet still maintain the other copies in sync/up-to-date. This then turns into an application-level virtual-filesystem that is web-scale. And implementing that would be more tricky.

Thursday, October 4, 2012

Building fault-tolerant applications on AWS

Below is my summary of the Amazon Web Services (AWS) whitepaper released on Oct 2011 by Jeff Barr, Attila Narin, and Jinesh Varia.

AWS aims to simplify the task of building and maintaining fault-tolerant distributed systems/services for its customers. For this, AWS prescribes the customers to embrace the following philosophy when they are building their applications on AWS.

1. Make computing nodes disposable/easily replaceable

AWS (as with any cloud provider actually) employs a level of indirection over the physical computer, called virtual machine (VM), to make computing nodes easily replaceable. You then need a template to define your service instance over a VM, and this is called Amazon Machine Image (AMI). The first step towards building fault-tolerant applications on AWS is to create your own AMIs. Starting your application then is simply a matter of launching VM instances on Amazon Elastic Compute Cloud (EC2) using your AMI. Once you have created an AMI, replacing a failing instance is very simple; you can just launch a replacement instance that uses the same AMI as its template. This can be done programmatically through an API invocation.

In short, AWS wants you to embrace an instance as the smallest unit of failure and make it easily replaceable. AWS helps you to automate and make this process more transparent by providing elastic IP addresses and elastic load balancing. To minimize downtime, you may keep a spare instance running, and easily fail over to this hot instance by quickly remapping your elastic IP address to this new instance. Elastic load balancing further facilitates this process by detecting unhealthy instances within its pool of Amazon EC2 instances and automatically rerouting traffic to healthy instances, until the unhealthy instances have been restored.

2. Make state/storage persistent

When you launch replacement service instances on EC2 as above, you also need to provide persistent state/data that these instances have access to. Amazon Elastic Block Store (EBS) provides block level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are off-instance storage that persists independently from the life of an instance. Any data that needs to persist should be stored on Amazon EBS volumes, not on the local hard-disk associated with an Amazon EC2 instance because that disappears when the instance die. If the Amazon EC2 instance fails and needs to be replaced, the Amazon EBS volume can simply be attached to the new Amazon EC2 instance. EBS volumes store data redundantly, making them more durable than a typical hard drive. To further mitigate the possibility of a failure, backups of these volumes can be created using a feature called snapshots.

Of course this begs the question of "what is the sweet-point in storing to EBS" as it comes with a significant penalty over EC2 RAM, and a slight penalty over EC2 disk. I guess this depends on how you can stretch the definition of "the data that needs to persist" in your application.

3. Rejuvenate your system by replacing instances

If you follow the first two principles, you can (and should) rejuvenate your system by periodically replacing old instances with new server instances transparently. This ensures that any potential degradation (software memory leaks, resource leaks, hardware degradation, filesystem fragmentation, etc.) does not adversely affect your system as a whole.

4. Use georeplication to achieve disaster tolerance

Amazon Web Services are available in 8 geographic "regions". Regions consist of one or more Availability Zones (AZ), are geographically dispersed, and are in separate geographic areas or countries. The Amazon EC2 service level agreement commitment is 99.95% availability for each Amazon EC2 Region. But in order to achieve the same availability in your application, you should deploy your application over multiple availability zones, for example by maintaining a fail-over site in another AZ as in the figure.

5. Leverage other Amazon Web Services as fault-tolerant building blocks

Amazon Web Services offers a number of other services (Amazon Simple Queue Service, Amazon Simple Storage Service, Amazon SimpleDB, and Amazon Relational Database Service.) that can be incorporated into your application development. These services are fault-tolerant, so by using them, you will be increasing the fault tolerance of your own applications.

Let's take the Amazon Simple Queue Service (SQS) example. SQS is a highly reliable distributed messaging system that can serve as the backbone of your fault-tolerant application. Once a message has been pulled by an instance from an SQS queue, it becomes invisible to other instances (consumers) for a configurable time interval known as a visibility timeout. After the consumer has processed the message, it must delete the message from the queue. If the time interval specified by the visibility timeout has passed, but the message isn't deleted, it is once again visible in the queue and another consumer will be able to pull and process it. This two-phase model ensures that no queue items are lost if the consuming application fails while it is processing a message. Even in an extreme case where all of the worker processes have failed, Amazon SQS will simply store the messages for up to four days.

Conclusions:

Netflix (one of the largest and most prominent customers of AWS) embraced all of the above points. In fact Netflix has developed a chaos monkey tool to constantly and unpredictably kill some of its service instances in an effort to enforce that their services are built in a fault-tolerant and resilient manner and expose and resolve hidden problems with their services.

OK, after reading this, you can say that at some level the cloud computing fault-tolerance is boring: the prescribed fault correction action is to just replace the failed instance with a new instance. And if you say this, this will make the AWS folks happy, because this is the goal that they try to attain. They want to make faults uninteresting, and automatically dealt with. Unfortunately, not all faults can be isolated at the instance level, the real world isn't that simple. There are many different types of faults, such as misconfigurations, unanticipated faults, application-level heisenbugs, and bohrbugs, that won't fit into this mold. I intend to investigate these remaining nontrivial types of faults, and how to deal with them. I am particularly interested in exploring what role can self-stabilization play here. Another point, that didn't get coverage in this whitepaper is about how to detect faults and unhealthy instances. I would be interested to learn what techniques are employed in practice by AWS applications for this.

Monday, September 17, 2012

Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, David G. Andersen
Proc. 23rd ACM Symposium on Operating Systems Principles
(SOSP ’11) Cascais, Portugal, October 2011.

Geo-replicated, distributed data stores are all the rage now. These stores provide wide area network (WAN) replication of all data across geographic regions in all the datacenters. Geo-replication is very useful for global-scale web-applications, especially social networking applications. In social networking, your updates to your account (tweets, posts, etc) are replicated across several regions because you may have followers there, and they should get low-latency access to your updates. The paper refers to properties desired of such a geo-replication service with the acronym ALPS (Availability, low Latency, Partition-tolerance, and high Scalability).

Of course, in addition to ALPS, we need to add some consistency requirement to the geo-replication service, because otherwise the updates to a record can arrive at replicas in different/arbitrary orders and expose inconsistent views of the updates to the record. The eventual consistency property that Dynamo advocates is too weak for the programmer to build on to provide other services. The inconsistent views of the updates can escape and poison those other services. On the other hand, strong-consistency (linearizability) is impossible due to CAP. So, causal consistency is what the COPS paper shoots for. Actually, a later paper shows that this is the best you can get under these constraints.

Yahoo!’s PNUTS provides per-record timeline consistency (or per-key sequential consistency), which is a pretty good consistency guarantee. All the writes to a record are serialized by the per-record-primary server responsible for that record and these updates are replicated in the same order (the order the per-record primary determines) to the other replicas. PNUTS, however, does not provide any consistency across records/keys. Achieving such a consistency (more specifically, causal consistency) across records/keys introduces scalability challenges and is the aim of this paper, as we explain in the rest of this summary. (Spoiler: To achieve scalability, instead of using vector clocks, COPS explicitly encodes dependencies in metadata associated with each key's version. When keys are replicated remotely, the receiving datacenter performs dependency checks before committing the incoming version.)

Causal consistency model

Dave Anderson explains the causal-consistency model nicely in his blog post.
Consider three updates by a single thread: write A=1, write B=dog, write B=cow, write A=2.
In Dynamo, the order at which these updates are applied is unconstrained. In a
remote datacenter, read(A), read(A) might return A=2, A=1
In PNUTS, the value of “A” could never go backwards: A=1, A=1, … A=2, A=2, …
But in both Dynamo and PNUTS, the relative values of A and B are unconstrained, and each of these systems could read(A), read(B): {2, dog}. Even though the originating thread set B=cow before it set A=2.
In COPS, this latter read sequence could not happen. The allowable reads are: {1, }, {1,dog}, {1,cow}, {2, cow}
In all three systems (in PNUTS, using Read-any), concurrent updates at different datacenters could cause conflicts that invoke conflict handlers, and in this way the three systems are not different.
And this brings us to the issue of conflict handling in COPS.

Convergent conflict handling

In causal consistency you can still have conflicts for concurrent events. (Inherent in a distributed system, two nodes may start updates concurrently, and this leads to a "logical" conflict as both updates are uninformed of each other.) For resolving these and to achieve converge at all datacenters, COPS employs convergent conflict handling, which refers to using an associative and commutative handler. Thomas's last-write-wins rule satisfies this constraint and is used by default, or developer can write application-specific conflict-resolution rules provided that they are convergent.

More concretely, convergent conflict handling requires that all conflicting puts be handled in the same manner at all replicas, using a handler function h. This handler function h must be associative and commutative, so that replicas can handle conflicting writes in the order they receive them and that the results of these handlings will converge (e.g., one replica’s h(a, h(b, c)) and another’s h(c, h(b, a)) agree).

The paper coins the term "causal+" to refer to causal-consistency plus convergence conflict handling together.

Scalability problems in causal+ model

Previous causal+ work (Bayou, Practi) was not scalable, because they achieved causal+ via log-based replay. Log-exchange-based serialization inhibits replica scalability, as it relies on a single serialization point in each replica to establish ordering. Thus, causal dependencies between keys are limited to the set of keys that can be stored on one node, or alternatively, a single node must provide a commit ordering and log for all operations across a cluster.

Inner-workings of COPS

A get operation for a record is local at the closest datacenter and is non-blocking. Since all data is replicated at each datacenter, we can have local get.

A put operation for a record is
1) translated to put-after-dependencies based on the dependencies seen in this site
2) queued for asynchronous replication to other sites/replicas
3) returns done to the client at this point (early reply)
4) asynchronous replication to other sites/datacenters occur.

Each operation maintains dependencies for operations. Replication dependencies are checked at each datacenter, and when they are satisfied the value is updated there.

Nodes in each datacenter are responsible for different partitions of the keyspace, but the system can track and enforce dependencies between keys stored on different nodes. COPS explicitly encodes dependencies in metadata associated with each key’s version. When keys are replicated remotely, the receiving datacenter performs dependency checks before committing the incoming version. The paper shows that by only checking the nearest dependencies you can reduce the number of dependency checks during replication and have a faster solution.

Subleties in definition of availability for put operation

The availability of get and put operations are defined at the datacenter scope and not at the whole system scope. The paper defines availability as: "All operations issued to the data store complete successfully. No operation can block indefinitely or return an error signifying that data is unavailable."

What this means is that the availability of put is satisfied at step 3 when early reply is returned to the client of the datacenter. However, the replication of this put operation to other datacenters in fact can be blocked until put-after-dependencies for the operation is satisfied at other datacenters. Consider a put operation originated at site C, which introduced put-after-dependencies to updates originated at site B. When this put operation (now a put-after operation) is asynchronously replicated to site A at step 4, this put-after will block at site A until site A receives the dependent-upon updates originated at site B. And if site B is partitioned away from site A, the put-after operation originated at C will block at A. Although site C and A are connected the put operation will not be able to complete step 4 as long as B is disconnected from A. I think this is a subtle point, which many readers may not notice even after reading the paper a couple of times.

COPS-GT (Get Transactions) operation

Reading a set of dependent keys using a single-key get interface cannot ensure causal+ consistency, even though the data store itself is causal+ consistent. To illustrate the problem the paper gives a photo album example, where Alice first changes her album access control list (ACL) to "friends only", and then writes a new description of her travels and adds more photos to the album. A natural (but incorrect!) implementation of code to read Alice's album might (1) fetch the ACL, (2) check permissions, and (3) fetch the album description and photos. This approach contains a straightforward "time-to check-to-time-to-use" race condition: when Eve accesses the album, her get(ACL) might return the old ACL, which permitted anyone (including Eve) to read it, but her get(album contents) might return the "friends only" version.

To address this issue, a two-round COPS-GT (Get Transactions) algorithm is introduced in the paper. "In the first round, the library issues concurrent get by version operations to the local cluster for each key the client listed in get transaction. These values, however, may not be consistent with one another, as in the example above. Each get by version operation returns a <value, version, deps> tuple, where deps is a list of keys and versions. The client library then examines every dependency entry <key, version>. The causal dependencies for that result are satisfied if either the client did not request the dependent key, or if it did, the version it retrieved was greater than or equal to the version in the dependency list. For all keys that are not satisfied, the library issues a second round of concurrent get by version operations. The version requested will be the newest version seen in any dependency list from the first round. These versions satisfy all causal dependencies from the first round."

The paper notes that COPS-GT comes with additional cost: compared to the regular version of COPS, COPS-GT is less efficient for certain workloads (e.g., write-heavy) and is less robust to network partitions and datacenter failures.
￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼￼

Concluding remarks

The paper is a long and dense read. While the model is formalized well, I wish some issues got better coverage in the paper. One problem I had with the paper is that just when I think the paper is going to deal with a hard problem, it backs out and starts talking about other peripheral issues. For example, in Section 3.4, the problems with log exchange-based serialization is not elaborated sufficiently for me; the paper skipped to discussing system design too quickly. In Section 4.1, when I thought the paper will be explaining composability of linearizability, the paper skipped to presenting the system components. (Here some papers by Herlihy are referenced saying that "linearizable systems can be implemented scalably by partitioning the keyspace into N linearizable partitions and having clients access each partition independently". But details are missing.)

UPDATE:

So how big can these dependencies get? It turns out, the dependencies are tracked by the client library based on the client's history. It seems like the dependency list can grow very long as the client reads many different keys. As a result, one can end up with a pretty long list of dependencies: to every other key read by the client.

Instead, to keep dependencies sweet and short, we can use the Hybrid Logical Clocks (HLC). The loosely synchronized real-time component of HLC would help in truncating the dependency list. The logical clock component of HLC would help in maintaining this dependency list precisely even in the uncertainty region of loosely synchronized clocks.

Finally, the COPS client library solution also did not account for whether the client has backchannels. Using real-time windows can help account for the backchannels as well.

High Scalability post on COPS and followup
COPS project page
Q&A for COPS
Dave's post clarifying some issues

Friday, September 14, 2012

Scalable distributed data structures for internet service construction

I think this 2000 paper (by Gribble, Brewer, Hellerstein, and Culler) may as well be the original NoSQL paper. The paper starts off by identifying the problems with RDBMS that prohibit scalability. (This is how you would motivate a NoSQL key-value store system even today :-)
1. RDBMSs have not been designed with Internet service workloads, service properties, and cluster environments in mind, and as a result they fail to provide the right scaling, consistency, or availability guarantees.
2. RDBMSs permit users to decouple the logical structure of their data from its physical layout, which is a good thing, but excessive data independence (isolation of application from modifying the layout of data definition and organization) can make parallelization and therefore scaling hard.
3. RDBMSs always choose consistency over availability.
The paper then advocates a design sweet-point for achieving scalable and consistent data management for web services. RDBMS is out because it provides a too-high-a-level abstraction with ACID and SQL. Filesystems, on the other hand, expose a too low-level interface with little data independence and less strictly defined consistency guarantees where filesystem elements (files and directories) are directly exposed to the clients and the clients are responsible for logically structuring their application data to using these elements. The paper aims to choose a level of abstraction that provides a well-defined and simple consistency model somewhere in between that of an RDBMS and a filesystem. As a solution, the paper proposes a distributed data structure (DDS) ---in this case a distributed hash table--- and argues that DDS interfaces, while not as general as SQL, are rich enough to successfully build sophisticated services. DDS is touted to achieve the desired web service properties: scalability, fault-tolerance, availability, consistency, durability, concurrency.

DDS and key-value stores

Figure: High-level view of a DDS: a DDS is a self-managing, cluster-based data repository. All service instances (S) in the cluster see the same consistent image of the DDS; as a result, any WAN client (C) can communicate with any service instance.

DDS is basically a key-value store as we understand it today. As the paper puts it, DDS provides a persistent data management layer designed to simplify cluster-based Internet service construction. A distributed hash-table underlies this data management layer, and it simplifies Internet service construction by decoupling service-specific logic from the complexities of persistent, consistent state management. This allows services to inherit the necessary service properties from the DDS rather than having to implement them themselves. DDS presents a conventional single-host data structure interface to service authors, but in fact it partitions and replicates the data across a cluster. DDS is a cluster-level data structure and is not designed for a WAN.

The novel aspects of a DDS are the level of abstraction it presents to service authors (by providing a data structure at the programming language level), the consistency model it supports, the access behavior (concurrency and throughput demands) that it presupposes, and its design and implementation choices that are made based on its expected runtime environment and the types of failures that it should withstand. SEDA is employed for implementing DDS to achieve high throughput and high concurrency.

DDS architecture

Figure: Distributed hash table architecture: each box in the diagram represents a software process.

Services using DDS may keep soft-state but they rely on the hash table to manage all persistent state. DDS library contains only soft-state, including metadata about the cluster's current configuration and the partitioning of data in the distributed hash tables across the bricks. The DDS library acts as the 2-phase commit coordinator for update operations on the distributed hash tables. (Dynamo forgoes this consistency step, and avoids the complications discussed next.) The paper explains recovery mechanisms for what happens when coordinator fails during this 2-phase commit. However, this unavoidably leads to many corner cases and complicated to manage and may lead to recovery-induced inconsistencies. The 2-phase commit would also slow down write operations and limit scalability.

Figure: Distributed hash table metadata maps: The key is used to traverse the DP map trie and retrieve the name of the key's replica group. The replica group name is then used looked up in the RG map to find the group's current membership.

The DDS key-lookup uses a trie-based mapping that can deal nicely with overloaded and hot keys. (For this Dynamo employs a ring-based consistent hashing.) To find the partition that manages a particular hash table key, and to determine the list of replicas in partitions' replica groups, the DDS libraries consults two metadata maps that are replicated on each node of the cluster. First is DP map maintained as trie. And the second map is replica group membership table. These two maps are soft-state and self-cleaning. Instead of enforcing consistency synchronously, the libraries can drift out of date, but lazily updated when they are used to perform operations on the bricks.

Conclusion

I think this paper is very nice introduction to the NoSQL key-value store area, in that you can see the original issues and original design decisions that led to the NoSQL key-value store approach in this paper. The DDS approach of providing a simple data-structure abstraction to the service authors and enabling them to inherit scalability, consistency, fault-tolerance, availability properties from the underlying careful distributed implementation of the data structure ultimately gave us BigTable, MegaStore, and similar distributed data structures.

Wednesday, August 8, 2012

Crowdsourced line wait-time forecasting using smartphones

Have you ever been frustrated with the unpredictability of the coffee shop line waiting times? I have, and far too many times. (I am fast growing into a grumpy old man.) I frequent the Tim Hortons coffee shop at our Student Union at University at Buffalo. Sometimes I would walk there to grab a quick mocha only to find a long waiting line and I would have to walk all the way back with empty hands. One day it hit me that this is a perfect opportunity to put our research on crowdsourced coordination and collaboration using smartphones to good use. We proceeded to develop Android and iPhone apps that forecasts the current (and near future) line waiting times at this coffee shop.

In the first version of our app, we asked users to manually provide line wait-times when they are waiting in line and tried to serve other users with the data input by these. Our model was: "if you used our app 5 times to get wait-time forecasted, before you can use it again, we ask you to report the wait time from the coffee-shop". (We checked the accuracy of the report using the localization information from the phone to prevent users to makeup wait-times.) We quickly noticed that this is an extra work for the users and the data arriving from the volunteers is too sparse to serve satisfactory results to the queriers. In the later versions of our service, we automated the line wait-time detection by using the localization capabilities of the smartphones in an energy-efficient manner.

Here is how our service works now. When the user wants to learn the current wait-time for the coffee shop, she fires our app and our app contacts our AWS cloud backend to provide this information. Then our app takes this as a clue that the user is interested in going to the coffee shop and starts tracking the user in the background. (This is done in an energy-efficient manner, the app gets a localization on the user more frequently as the user gets closer to the coffee shop, or else the localization requests gets more spread out and eventually dropped. The phone localization module fuses information from GPS, wifi, and cell tower triangulation to this end.) When the user enters the coffee shop, the app records the arrival timestamp. After the user is served and leaves the coffee shop, the app records the departure timestamp, and transmits this information to the cloud backend as another data point.

The backend stores and processes all the reported data to obtain models for waiting times at the coffee shop. For this, we first devised a solution based on a constrained nearest-neighbor search in a multi-dimensional (week of the year, day of the week, time of the day) space. Later, we improved on this solution by adapting two statistical time-series forecasting methods, namely exponential smoothing and Holt Winters. This actually turned out to be an interesting modeling problem due to the sparseness and false-positives in the collected data. (Some users choose to sit at the coffee shop after being served and this creates false-positives that we need to weed-out.)

Our line wait time forecasting apps for Android and iPhone platforms are downloaded by more than 1000 users in our university, and are used on a daily basis by hundreds of users to monitor the line wait-times of the Tim Hortons coffee shop. (We have recently added Starbucks to our service as well.) And, the punchline is our service provides line wait-time estimates that has mean absolute error of less than 2 minutes!

The paper (to appear at MobiCase'12) is available at: http://www.cse.buffalo.edu/~demirbas/publications/lineking.pdf

Wednesday, May 23, 2012

My advice to the 2012 class

I feel sad at graduations. When I graduated college in 1997, all I felt was sadness and melancholy, rather than joy. I chose not to attend my MS and PhD graduation ceremonies, but I remember I felt sad after both defenses. It turns out, I also feel sad at my students' graduations. I graduated 3 PhD students and a couple MS students with the thesis option. It was always sad to see them depart.

This semester, I have been teaching distributed systems to the graduating class (seniors) at my sabbatical institute, Bilkent University. They are bright and talented (for example, Ollaa was developed by 3 of them). For the last class of their semester and their college lives as well, instead of discussing yet more about distributed systems, we walked out of the classroom, sat on the grass outside, and had a nice chat together.

I gave my students some parting advice. I tried to convey what I thought would be most useful to them as they pursue careers as knowledge/IT workers and researchers. Here is that advice in 3 stories and lessons.

1st Story: How I almost flunked the differential equations course

I was taking differential equations course in the first year of college. I was feeling comfortable with the course because I had seen some of the content in high school. I had gotten an A in the first midterm, so I didn't pay attention to the course until the finals. For the finals I had to go through a lot of content. I read/followed the practice questions in the textbook rather than solving questions on my own. As I followed the solutions of the practice questions, they seemed obvious and easy to do. Of course, when I took the final exam I was baffled since I couldn't solve any of the questions on my own. What seemed obvious when I was following the book turned out to be not obvious at all when I had to come up with them myself. I did so bad that I almost flunked the course. I retook the course in the summer, this time I didn't repeat the same mistake and got an A.

1st Lesson: The best way of learning is by doing

There are different levels of knowing. The first level is "knowing by word". With passive learning (like I did for studying for the diff final), you get the most limited form of understanding/knowing. It is ephemeral, and you are not able to reconstruct that on your own. The second level is "knowing by first hand experience". At this level, you are able to understand/witness and if necessary rediscover the knowledge on your own. Finally, the third level is "knowing by heart". At this level, you internalized the knowledge so that it becomes part of your nature. (See Grok).

The best way of learning is by doing. You need to work hands-on, make a lot of mistakes so you can start to internalize that knowledge. That means you should make, and produce things continuously. For this to work in a sustainable manner, you need to change your default mode from the consumer mode to the producer mode. Always have a side-project to sharpen the saw. Start your day by working for an hour or so on your side-project, instead of browsing news/twitter/etc. At night, spare some time for your side-project as well.

2nd Story: (Ok, a quotation actually, not a story)

Once you get a B.S., you think "you know everything". Once you get an M.S., you realize "you know nothing". Once you get a Ph.D., you realize that "yes, you know nothing, but that is not a problem, because nobody knows anything!"

2nd Lesson: We are all equally stupid

This is the summary of the PhD experience, so this bears repeating: "you know nothing, but that is not a problem, because nobody knows anything!"

We are all equally stupid. Our brains consist of 1.3 kg (3pounds) of gooey material. It is hard to hold 10 variables in our brains simultaneously. Our brains are very poor at reasoning about even very short concurrent programs (I know from first-hand experience :-). Our brains are also very susceptible to biases and fallacies.

The people you see as experts/geniuses are that way because they have been working/thinking/internalizing these topics for more than a decade. Those experts become baffled if you ask them a question a little bit outside of the frame they are accustomed to.

So here is the trick, "you can level the playing field by work on new things/technologies".

3rd Story: Kung-fu master and his student

M: You must pass one more test: What is the meaning of the Black Belt?
S: The end of my journey, a well-deserved reward for my hard work.
M: You are not ready for the Black Belt. Return in one year.

One year passes...
S: It is a symbol of distinction and the highest achievement in our art
M: You are not ready for the Black Belt. Return in one year.

Another year passes...
S: The Black Belt represents not the end, but the beginning, the start of a never-ending journey of discipline, work and the pursuit of an ever higher standard.
M: You are now ready to receive the Black Belt and begin your work.

3rd Lesson: Just ask

There are actually two lessons to this story. The first one is that, like the black belt, the Ph.D. is the beginning, not the culmination, of your career. But the more important second lesson is: If you bother to talk to and learn from the people who have already gone through this process, you might graduate two years earlier :-) :-)

So, for whatever goal you are chasing, be effective. Learn the criteria for the next step and optimize for it. This again connects back to the first lesson. A shortcut to getting the first-hand knowledge is to ask to the people with the first-hand knowledge. This is of course not the same thing as first-hand knowledge, but it is the closest it gets to that. (Steve Jobs also has the following advice about asking.)

I will write another blog post specifically about how to succeed in PhD, but that may take some time.

Wednesday, May 2, 2012

Replicated/Fault-tolerant atomic storage

The cloud computing community has been showing a lot of love for replicated/fault-tolerant storage these days. Examples of replicated storage at the datacenter level are GFS, Dynamo, Cassandra, and at the WAN level PNUTS, COPS, and Walter. I was searching for foundational distributed algorithms on this topic, and  found this nice tutorial paper on replicated atomic storage: Reconfiguring Replicated Atomic Storage: A Tutorial, M. K. Aguilera, I. Keidar, D. Malkhi, J-P Martin, and A. Shraer, 2010.

Replication provides masking fault-tolerance to crash failures. However, this would be a limited/transient fault-tolerance unless you reconfigure your storage service to add a new replica to replace the crashed node. It turns out, this on-the-fly reconfiguration of a replicated storage service is a subtle/complicated issue due to the concurrency and fault-tolerance issues involved. This team at MSR @ Slicon Valley has been working on reconfiguration issues in replicated storage for some time. But, in this post I am not going to talk about their reconfiguration work, and instead will just focus on the replicated/fault-tolerant atomic storage part.

Majority replication algorithm
There is this elegant algorithm for replicated/fault-tolerant atomic storage that I think every distributed systems researcher/developer should know about. It is simple and powerful. And, it is fun; I promise your brain will feel better about itself after you learn this majority replication algorithm. The algorithm originally appeared in: Sharing memory robustly in message-passing systems, H. Attiya, A. Bar-Noy, and D. Dolev, 1995. Here I will summarize the algorithm based on the discussion provided in the tutorial paper.

The algorithm employs majority replication to provide atomic read/write operations in the presence of crash failures under an asynchronous execution model. (Of course, the FLP result states the impossibility of solving consensus under this model, but this is a weaker problem than solving consensus.) Here, atomic means that the system provides linearizability, a strong type of consistency that guarantees that a read returns the most recent version of data. This single-copy consistency is stronger than Amazon Dynamo's eventual consistency and even GFS's consistency. The algorithm is on the CP side of the CAP triangle; availability is sacrificed when a majority of replicas are unreachable.

Write operation
Each storage node keeps a local copy of what it believes to be the most recent value stored by a client, together with a timestamp indicating the freshness of the value. A vt-pair refers to a pair of (value, timestamp), which a storage node keeps. To execute a write(v) operation, the client proceeds in two phases: it executes a get phase followed by a set phase.

get phase:
vt-set= read vt pairs from majority of storage nodes
select unique t' such that t' > max (t in vt-set)

set phase:
write_request (v, t') on storage nodes
storage nodes store vt' only if t' > their stored t
storage nodes send ack
when majority acks are received, return OK

(Uniqueness of t' can be ensured by adjoining the client-id to the timestamp, so that a timestamp consists of a pair with a number and a client-id, ordered lexicographically.)

The read() operation is very similar to the write operation. The client also executes the get and set phases back to back. The only difference is that in the set phase, the client writes back the maximum timestamp vt pair it learns in the get phase.

get phase:
vt-set= read vt pairs from majority of storage nodes
select vt' such that t' = max (t in vt-set)

set phase:
write_request (v,t') on storage nodes
storage nodes store vt' only if t' > their stored t
storage nodes send ack
when majority acks are received, return v

The Set phase in read() is needed to prevent oscillating reads due to storage node failures, in which successive reads oscillate between a new and an old value while a write is in progress-- which is a violation of atomicity. The Set phase ensures that a subsequent read() will return a value at least as recent as the value returned by the current read().  The key intuition here is that any two majorities of storage nodes always have at least one storage node in common. Therefore if some client stores value v at a majority of storage nodes then another client is guaranteed to see v when it queries any majority.

Relation to Paxos
The majority replication algorithm seems closely related to the Paxos consensus algorithm. The t in the vt-pair corresponds to ballot number in Paxos. The Get and Set phases correspond to the phase1 and phase2 of Paxos. (Of course since majority replication is not for consensus, there is nothing corresponding to phase3:commit of Paxos.) Finally, the read operation corresponds to the learning operation in Paxos. Now the differences. In majority replication clients do not coordinate for the write operation, whereas in Paxos, leaders are constrained to re-propose/rewrite the value with the highest t. Also to avoid the dueling-leaders problem, Paxos relies on a leader election service so that the system eventually converges to one leader that can safely anchor/finalize a value as the decision value. (The leader election service in Paxos needs some partial-synchrony to make progress, so consensus is achieved only then.) In summary, majority replication is a building block for Paxos consensus.

This relation is explained in more detail in the "Perspectives on the CAP theorem" paper.

Concluding remarks
The nice thing about this elegant algorithm is that it can tolerate/mask the crash of a minority of storage nodes and an arbitrary number of client nodes, and it works in an "asynchronous" system. That the correctness of this algorithm does not depend on a synchronous system makes this algorithm really robust for deployment in distributed systems, especially WAN systems.

Consensus/Paxos based algorithms can make reconfiguration of replication service possible. Examples are RAMBO algorithm, and FAB: Building Distributed Enterprise Disk Arrays from Commodity Components, which provides an implementation of these ideas. But, the reconfiguration tutorial paper explains that it is also possible to implement reconfiguring of replication under the asynchronous model (without consensus)!!