## Thursday, November 29, 2012

### From Clarity to Efficiency for Distributed Algorithms

Y. A. Liu, S. D. Stoller, B. Lin, M. Gorbovitski, Appeared in OOPSLA'12.

This paper describes a high-level language, DistAlgo, for clear description of distributed algorithms (e.g., Paxos, mutual execution). In contrast to embarrassingly parallel tasks which involve very regular and simple synchronization patterns (e.g., MapReduce tasks), distributed algorithms generally involve complex and tight synchronization patterns. DistAlgo specifies these complex synchronization conditions via using logical quantifications over message histories of processes involved in the algorithm. But, unfortunately, this turns out to be extremely inefﬁcient if executed straightforwardly: each quantiﬁer will cause a linear factor in running time, and any use of the history of messages sent and received will cause space usage to be unbounded.

To address this inefficiency, this paper focuses on optimizing implementations of DistAlgo programs by incrementalization of these expensive synchronization conditions. To this end, their method first transforms these quantifications are into set queries, and then transforms further to minimize their nesting. Finally, from here, the method systematically extracts single quantified order comparisons, and transforms them into efficient incremental updates of auxiliary variables as messages are sent and received. (An earlier paper that focuses more on the syntax of DistAlgo, and uses Paxos as a running example, appeared at SSS'12.).

To give a glance of DistAlgo and how it does this transformations, I am including some figures from the paper as is. The paper uses Lamport's mutual exclusion algorithm as a running example and shows its DistAlgo and automatically optimized DistAlgo implementation. DistAlgo fills a gap and seems like it can be very useful in the coming years for writing distributed programs. Distributed algorithms can be expressed in DistAlgo clearly at a high level, like in pseudocode, but also precisely, like in formal speciﬁcation languages, and be executed as part of real applications, as in programming languages. DistAlgo extends Python with processes as objects, send/receive of messages, yield points, synchronization conditions as logical quantifications, and configuration of distributed system topology. DistAlgo compiler and optimizer are implemented in Python (building on Python's parser module and AST package) and generates executable Python code.

## Tuesday, November 20, 2012

### Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary

Appeared in OSDI '12, Cheng Li, Daniel Porto, Allen Clement, Johannes Gehrke, Nuno Preguica, and Rodrigo Rodrigues.

In order to reduce latencies to geographically distributed users, big webservices companies (Google, Yahoo, Facebook, Twitter) replicate data across geographical regions. But replication across datacenters, create cross-site consistency problems, which is further complicated with the huge WAN latency delays. If you want to have strong consistent updates across sites, you have to pay the price in terms of latency (basically you revert to doing synchronous replication via say Paxos as in MDCC). This is the ELC part in the PACELC.

To alleviate this latency versus consistency tension, this paper proposes RedBlue consistency, which enables blue operations to be fast/asynchronous (and eventually consistent) while the remaining red operations are strongly-consistent/synchronous (and slow). So a program is partitioned into red and blue operations, which run with different consistency levels. While red operations must be executed in the same order at all sites (which make them slow), the order of execution of blue operations can vary from site to site (allowing them to be executed without requiring coordination across sites). "In systems where every operation is labeled red, RedBlue consistency is equivalent to serializability; in systems where every operation is labeled blue, RedBlue consistency allows the same set of behaviors as eventual consistency."

To facilitate this red blue partioning, each program operation u is split into two components: a generator operation g_u with no side effects, which is executed only at the primary site against some system state S, and produces a shadow operation h_u(S), which is executed at every site (including the primary site).

This is simply separation of concerns principle. The generator operation decides which state transitions should be made while the shadow operation applies the transitions in a state-independent manner. This separation leads to a fine-grained classification of operations, with potentially more execution paths leading to blue operations. Also this leads to a simple logic for shadow operations that can be based on operations that are intrinsically commutative (increment/decrement, insertion/removal) or that commute via last-writer-wins strategy.

#### Red Blue rules

Now here are the rules for labelling the [shadow] operations as red or blue.
1. For any pair of non-commutative shadow operations u and v, label both u and v red.
2. For any shadow operation u that may result in an invariant being violated,  label u red.
3. Label all non-red shadow operations blue.
This is unfortunately not an automated process. Developer has to manually partition the operations to generator and shadow operations, and after that mark them as red and blue manually, by following the above rules. As such, this is an error-prone process and is hard to scale.

The below is the description from the paper of how to keep track of the order and consistency of red/blue operations:
When a generator operation completes, the coordinator must determine if the operation reads a coherent system snapshot and obeys the ordering constraints of a causal serialization. To do this, the coordinator checks the timestamps of the data items read and written by the completing operation, and compares them to the timestamps associated with operations completing concurrently and the remote shadow operations that were being applied simultaneously at that site. Upon successful completion of a generator operation the coordinator assigns the corresponding shadow operation a timestamp that is component-wise equal to the latest operation that was incorporated at its site, and increments its blue and, if this shadow operations is red, the red component of the logical timestamp. This timestamp determines the position of the shadow operation in the RedBlue order, with the normal rules that determine that two operations are partially ordered if one is equal to or dominates the other in all components.
The paper reports an evaluation of this idea by modifying the TPC-W and RUBiS benchmarks on an online social network case study (Twitter is the Hello World of georeplication :-). The experimental results show that RedBlue consistency provides substantial performance gains without sacrificing consistency.

#### Discussion

Although the paper does not mention it, a very relevant work to this is the CALM conjecture and the Bloom project from UC Berkeley. The CALM principle says that (1) logically monotonic distributed code is eventually consistent without any need for coordination protocols (distributed locks, two-phase commit, paxos, etc.) and (2) eventual consistency can be guaranteed in any program by protecting non-monotonic statements ("points of order") with coordination protocols. It is easy to see that the logically monotonic operations correspond to the blue operations, and non-monotonic operations correspond to red operations in the RedBlue work.

After the OSDI presentation of the paper, there were a couple of concerns raised about the approach. Mike Freedman (Princeton) asked the question: "blue cannot overtake the red, so you cannot vary their order, doesn't this degrade performance significantly?". Marcos Aguilera (HP Research) commented that similar approaches have been around; he referred to the generic broadcast work, and the later and more general Generalized Paxos work. The Generalized Paxos work seems to be very related indeed, and I am not sure what in the RedBlue work constitute the major differences. Maybe the RedBlue work provides a case study and more principled approach to identify commutable actions in Generalized Paxos. Another shortcoming of the RedBlue work is that it does not have any fault-tolerance build in. It may be hard to add fault-tolerance as an after thought, so maybe it is best to think of this work in the Generalized Paxos framework.

## Friday, November 9, 2012

### Sensys'12 (day2)

Here are my much anticipated (at least by Ted :-) notes from the second day of Sensys.

#### Kyun queue: a sensor network system to monitor road traffic queues

I missed this talk, but this sounds like an interesting paper. The idea seems to be to deploy a transmitter and a receiver node on the opposite sides of a road. Depending on the density and the speed of the cars passing between the transmitter and the receiver, the received radio signal has different signatures. The work studies these signals and figures out the road conditions based on these.

#### Low cost crowdcounting using audio tones

Crowd counting can be useful for event planning, learning popularity of aisles, etc. This work assumes that everyone in the crowd has a smartphone and this counting app running on their phone. (This is the biggest limitation of the work.) This work uses the audio tones (again ultrasound tones) emitted by the phones to count the people. The app employs the smartphone mic and speaker, no additional hardware is needed. The system uses an initiation tone to start. The phone that hears this tone picks a frequency using uniform or better geometric choosing method and forwards the tone. The frequencies are picked in the ultrasound region (that same 4khz band alluded to in yesterday's work.) Using uniform choosing method is not prone to collisions, the counts can be off due to the losses in collisions. The geometric choosing method employs geometric distributions and hence is more prone to collisions. (My work on thresholding counting is very applicable for this problem: Singlehop Collaborative Feedback Primitives for Threshold Querying in Wireless Sensor Networks.) They did evaluations with 25+ phones, but in pretty controlled environments (phones at hand). The phones don't work well at pockets or behind clothing. The audio tone method is said to be low-power, but it is actually similar cost to the wifi. The authors didn't use wifi because its range can be too long for the region to be counted. They said that, if ZigBee were available on the phones, they would use it.

How refreshing! The presenter and a lead author of this work is an undergraduate student. This work aims to build a platform (hardware and software components) to collect analog sensor (any sensor) data through the audio port of any phone. But the problem is there is almost no standard I/O connector, no standard programming interface, no third party app support on the smartphones. HiJack did a great job on the iPhone using the audio headset port (the Energy-efficient GPS and the MusicalHeart works at this Sensys used the HiJack approach). But HiJack is iPhone specific. This work extends to other phones using available power from "mic bias voltage" at the audio headset port. Powered this way, the analog data/voltage from the sensor attached to the audio port are encoded as the amplitude of a square wave. This work uses the phone's builtin voice recording app (which is already optimized to be low-energy and use hardware compression) to capture and store the attached sensor data and the phone's ability to send the voice recording via email/mms for exporting the captured data. The software to process the signal and interpret it is at the cloud. The platform is evalued with an EKG application. Now you can use your smartphone as a holter device! The authors estimate that the hardware board for this platform would cost $5 in mass production. #### Prius: generic hybrid trace compression for WSNs The goal of this work is to compress traces efficiently. Traces are important for debugging WSN deployments, but trace sizes can become large, so they need to be compressed otherwise whether you write to flash (most often the case) or exfiltrate to the basestation you are wasting a lot of precious energy. Traditional compression techniques (prediction-based, dictionary-based grammar-based) are not applicable for WSNs because they require more memory and computation than is available in WSNs. The insights behind Prius work are that traces are highly repetitive, traces evolve slightly, and program memory is not as scarce as data/var memory. The Prius approach is to mine trace patterns offline (static-analysis of the program code?) and to store them in program memory. In the offline learning phase, Prius creates a compression dictionary, this dictionary is then processed by data structure generator to obtain an efficient data structure data structure to be used in the headerfile of the deployed program. #### DoubledipL leveraging thermoelectric harvesting for low power monitoring of sporadic water use This work does waterflow monitoring. The motes they built include acceleration and temperature sensors and use wireless communication. The motes are harvesting their energy on the pipes using the thermal gradients of the hot/cold water in the pipe and the environment temperature. #### Business meeting There was a short business meeting in the afternoon. We learned that next year Sensys will be in Rome, Italy at the Sapienza University, Nov 11-15. The rest of the business meeting was spent for some soul-searching about Sensys. Is the excitement for Sensys diminishing as maybe indicated by the Sensys submission and attendance numbers are going down. With the smartphone going big, and also getting a lot of focus at Sensys, how can Sensys distinguish itself from Ubicomp or Ubisys? Andrew Campbell (Darthmouth) said he is nervous and somewhat pessimistic about the future of Sensys. He called on to the PhD students, Assistant professors, to get up and be more engaged/central in Sensys community. Also several people voiced interest for having a workshop on big data / machine learning as part of Sensys. #### Demo session A lot of the papers in the conference got demoed in the demo and poster session. Probably there were about 10 demos, and 20 more posters. The most interesting demo must have been the proximity social interaction monitoring sensors demo from University of Utah. They had brought in 300 motes. Their large programming board, can program upto 100 motes at the same time. They produced their own hardware, and said each mote cost about$30. They distributed upto 100 motes for monitoring, about half of the people in attendance got a mote from them. They also had stationary motes under each poster table. Although the reception area was quite large, is L shaped, and a lot of humans were in the way (all very detrimental to wireless communication), the stationary motes talk to singlehop to their basestation connected to the laptop. The mobile nodes had much smaller communication radius, they captured signals they heard from nearby motes, and broadcasted these so that stationary motes can pick these up and forward them to the basestation.

The muscle fatigue monitoring demo was also very interesting. Yes, this was the guy with the yoga pants. I got to take a closer look at the Cleon mote from the best paper (energy efficient GPS). I tried the MusicalHeart demo. I was told that music would not be able to slow my heart rate at high activity level (jogging) as I suspected. I was told that I had to set my tempo to the music, and then my heart rate will slow. So the music only becomes the cue to speed up or slow down my activity. Sort of a bummer.

You could understand if a talk/paper is well received depending on the number of people asking questions after the talk. The talks that receive no questions mean something was not quite right. (On related news, if I do not receive any questions after my presentations, I will probably have a breakdown and start crying.)

The tea (especially the vanilla orchard variety) was very good and there was a good supply of tea. I observed that more people were drinking tea relatively to previous years. I like this trend as a tea person.

The lunch arrangement was the weak point in the otherwise great organization. Lunch is picked up, and since there wasn't a lunch ballroom, we had to eat lunch in the conference room... facing the screen. This made for a lonely and awkward lunch. Having a lunch room which forces people to sit together helps create chance acquaintances, which adds a lot to the conference experience. I still find it hard to meet new people at the conferences. I think it would be really nice if conferences organized some nerdy ice-breaking events. The demo and poster session in Sensys helped a lot for this. Conferences should try more playful things for ice-breaking. Maybe, instead of writing conference name on nametags (duh, everybody in attendance knows conference name), write a twitter bio-like self-describing line about the person, or what that person wants to talk about: "ask me about crowdsourcing coffee shop line waiting times". Another idea could be to get people collaborating on an acitivity, game, or challenge. For example get people in attendance collaboratively write/edit/produce a dozen different conference summaries! Editors and webservices already exists for collaborative editing, and maybe the conference should randomly or purposefully cluster people into these editing groups.

Hmm, maybe I should start a conference critiquing business. Just pay my registration and travel, and I am set :-) I will write conference critics and reviews for your conference.

Finally, this is my motivation level after this conference.

## Thursday, November 8, 2012

### Sensys'12 (day 1)

I am attending Sensys'12  at Toronto, which is a short distance from Buffalo. Sensys is a premiere conference for wireless sensor network and, more recently, smartphone sensing research. Here are my rough notes from the first day of conference.

This paper is really good, it blew me away. Later at the conference banquet Wednesday night, the paper got selected as the Best Paper. This work aims to break up the GPS as a blackbox and see how we can make it more energy-efficient. The paper goes into fairly technical detail about how GPS works, which I cannot replicate here. Some basics are: GPS uses CDMA to transmit 50bps information over a carrier signal of 1.575Ghz. Seeing more satellites improves accuracy. GPS localization requires the following phases on the device: acquisition, tracking, decoding, code-phase, least-square calculation.

The insight of the paper is to offload some of these phases to the cloud, so the device does less work and spends less power. For example, the codephase involves getting the "ephemeris" from the signal for learning the location of satellites, but this look up table is already made available by NASA and alike in cloud. So that can be easily offloaded to the cloud, and the device does not need to do it. Another trick is coarse time navigation. If a nearby landmark has same/close nms for the GPS signals (nms= number of milliseconds for time of flight of GPS signal), instead of calculating the location, use the landmark's location. If there is no such known location, then leastsquare calculation may give several possible locations, but there is still hope. It is possible to use the elevation information (the height dimension often ignored in the lattitude-longitude pair) as a checksum and eliminate the spurious possibilities and to figure out the correct location. The elevation data and the computation power to do this is available at the cloud side.

Using similar tricks, they reduce the duration of the acquisition phase at the device and the amount of information to be gathered by the device from the satellites. They develop hardware for this (they call this Co-GPS design). The Cleon prototype is comprised of max2769 gps receiver + msp430 as the chip, and talks to the phone or device (ipod?) using audio communication through the headphone jack. Cleon ends up sampling GPS only for 2ms (16 mHz sampling, 2 bit/sample, 2ms) as that is enough for accurate localization (within 30meters) via offloading to the cloud-side. The awesome thing about Cleon is that it can run on 1.5 years on 2AA battery with continuous sampling. Wow! (This calculation does not take into account the communication cost to cloud as this is done by the device Cleon connects to and can be over wifi, bluetooth, etc.). A byproduct of Cleon is atomic time stamping for outdoor sensors. They will opensource the software and the hardware soon.

#### IOdetector: a generic service for indoor outdoor detection

This paper aims to provide a service for the smartphone to detect whether it is currently outdoor, semi-outdoor, indoor. The application of this service is to automatically and in "an energy-efficient manner" detect when the phone should not even bother activating the GPS for an app because the phone is indoor or semi-outdoor and the GPS error will be big, and it would not justify the energy wasted by the GPS. The service uses three modalities: light signal, cellular signal, magnetic field signal. Also secondarily, acceleration and proximity & time (to sanitize light readings) are employed. These are all very low energy sensors. The light detector accounts for day or night, and can detect indoor fluorescent lighting. The cellular signal detector uses signal from multiple celltowers to avoid the corner-turning effect. However, each subdetector cannot completely classify the three types (indoor, outdoor, semi), so they fuse these subdetectors employing the confidence results provided by them.

#### Musical Heart : A hearty way to listen to music

It is a proven fact that music has profound effect on human heart rates. Some music calms, some music gets you worked up. This work presents a biofeedback-based context-aware music recommendation app for smartphones, MusicalHeart. This app requires its own special sensor equipped earphones to achieve the biofeedback part. More specifically, there is an IMU (inertial measurement unit) and a microphone embedded in the headphone. The microphone is used for heart-rate detection (requires some filtering), and the IMU for activity detection (employs k-means clustering). These sensors communicates to the smartphone app via the audio jack, and they are powered by a thin film battery. In addition to the sensors in the headphone MusicalHeart app also uses phone GPS and WAP information for context detection.

#### The Banquet

The conference banquet was held at the CN tower (the very tall thin tower that is the symbol of Toronto). This banquet is up there for me with the Mobicom'11 banquet. The CN tower was a very interesting and exciting venue to have the banquet. Especially with the glass floor!

## Sunday, November 4, 2012

### An abundance of solutions

My son started the Kindergarten this year. I try to be a good parent and work with him closely on his sight-words and handwriting. Today we were reviewing his weekly reader magazine, and I read to him the following question from the magazine.
There is only one pumpkin left at the pumpkin patch. Mingo and Zip both want the pumpkin. What should they do?
I guess the easy answer (maybe, the answer the magazine tries to teach) is that they share the pumpkin. But my son thought about a little bit, and said they should ask the farmer if the farmer has more pumpkins somewhere. I went along, and together we thought of several more solutions to the problem.

• Mingo and Zip carve the pumpkin nicely, and sell the carved pumpkin at a good price at the bazaar and buy two pumpkins with that money.
• They divide the pumpkin into half. For dividing fairly, you know the trick right. One of them gets to cut in two pieces, and the other gets to choose. (However, my wife argues this is not fair if the person to cut is clumsy and cannot manage to cut into equal pieces. Then the chooser has an obvious advantage.)
• Mingo and Zip share the pumpkin in the time domain. Mingo displays it on the odd numbered days, Zip on the even-numbered. (I wonder if it is possible to share the pumpkin on the frequency domain or via code division.)
• Mingo paints and displays the pumpkin on Halloween, then Zip gets the pumpkin and cooks it.
• They do a coin toss about who gets the pumpkin.
• Mingo pays Zip and gets the pumpkin.
• They take the seeds of the pumpkin, sow the seeds, and next year they have many pumpkins each.
• They both forgo the pumpkin, and present it to one of their mutual friends.
• One of them gets up early and takes the pumpkin, the other gets a good life lesson.
• They help each other and make a nice pumpkin pie together. They apply mapreduce to the problem, make a business out of it, and get rich. See mapreduce in simple terms about how.
OK, I didn't really try to discuss the mapreduce idea with my son. I am waiting a couple more years before we have the mapreduce and writing robust programs talk.

## Saturday, November 3, 2012

### MDCC: Multi-Data Center Consistency

In order to reduce latencies to geographically distributed users, Google, Yahoo, Facebook, Twitter, etc., replicate data across geographical regions. But replication across datacenters, over the WAN, is expensive. WAN delays are in the hundreds of milliseconds and vary significantly, so common wisdom is that synchronous wide-area replication is unfeasible, which means strong consistency should be diluted/relaxed. (See COPS paper for an example.)

In the MDCC work (Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, March 2012), the authors describe an "optimistic commit protocol, that does not require a master or partitioning, and is strongly consistent at a cost similar to eventually consistent protocols". Optimistic and strongly-consistent is an odd/unlikely couple. The way MDCC achieves this is by starting with Paxos, which is a strongly consistent protocol. MDCC then adds optimizations to Paxos to obtain an optimistic commit protocol that does not require a master and that is strongly consistent at a very low-cost.

#### Paxos optimized

Classic Paxos is a 2-phase protocol. In Phase 1, a master M tries to establish the mastership for an update for a specific record r. Phase 2 tries to accept a value: the master M requests the storage nodes to store r next (using the ballot number M established in the first phase), and waits for a majority number of ACKs back.

Multi-Paxos optimization employs the lease idea to avoid the need for phase 1 in the fault-free executions (timely communication and master does not crash). The master puts a lease on other nodes on being a master for many seconds, so there is no need to go through Phase 1. (This is still fault-tolerant. If the master fails, other nodes need to wait until the lease expires, but then they are free to chose a new master as per classic Paxos protocol.) Fast-Paxos optimization is more complex. In fast-paxos a node that wants to commit a value first try without going through the master (saving a message round, potentially over the WAN), directly communicating with the other nodes optimistically. If a fast-quorum number of nodes (more than majority number of nodes needed in classic quorum) reply, the fast round has succeeded. However, in such fast rounds, since updates are not issued by the master, collisions may occur. When a collision is detected, the node then goes through the master which resolves the situation with a classic round.

Finally building over these earlier optimizations, the Generalized-Paxos optimization (which is a superset of Fast-Paxos) makes the additional observation that some collisions (different orderings of operations) are really not conflicts, as the colliding operations commute, or the order is not important. So this optimization does not enforce reverting to classic Paxos for those cases. MDCC uses Generalized Paxos, and as a result, MDCC can commit transactions in a single round-trip across data centers in the normal operation case.

The good thing about using Paxos over the WAN is you /almost/ get the full CAP  (all three properties: consistency, availability, and partition-freedom). As we discussed earlier (Paxos taught), Paxos is CP, that is, in the presence of a partition, Paxos keeps consistency over availability. But, Paxos can still provide availability if there is a majority partition. Now, over a WAN, what are the chances of having a partition that does not leave a majority? WAN has a lot of redundancy. While it is possible to have a data center partitioned off the Internet due to a calamity, what are the chances of several knocked off at the same time. So, availability is also looking good for MDCC protocol using Paxos over WAN.

#### MDCC details

Now, while the optimistic and strongly-consistent protocol is nice, one may argue that there has not been much novel (at least academically) in that part of the paper; MDCC basically puts together known optimizations over Paxos to achieve that optimistic and strongly-consistent replication protocol. The claim to novelty in the MDCC paper comes from their proposed new programming model which empowers the application developer to handle longer and unpredictable latencies caused by inter-data center communication. The model allows developers to specify certain callbacks that are executed depending on the different phases of a transaction. MDCC’s transaction programming model provides a clean and simple way for developers to implement user-facing transactions with potentially wildly varying latencies, which occur when replicating across data centers. Figure below shows an example of a transaction for checking out from a web store.

Evaluation of MDCC is done using the TPC-W benchmark with MDCC deployed across 5 geographically diverse data centers. The evaluation shows that "MDCC is able to achieve throughput and latency similar to eventually consistent quorum protocols and that MDCC is able to sustain a data center outage without a significant impact on response times while guaranteeing strong consistency."

#### Conclusion

This is a nice and useful paper because it tells a simpler united story about using Paxos (more accurately, Generalized Paxos) for executing transactions across datacenters. This paper also provides a very nice summary of the Paxos and optimizations to Paxos, as well as providing a case study where the usefulness of these optimizations are presented. So even if you are not interested with the details and quantitative measurements about the actual multi datacenter replication problem, the paper is still worth a read in this respect.