This week, Amazon announced their data ingress service, Kinesis, for data acquisition tasks in Big Data scenarios. Coincidentally, my open source library, Windows Azure Cache Extension Library (WACEL), had been released just two weeks before the announcement, and data acquisition is one of the key scenarios supported by WACEL. This post introduces how to use WACEL + Windows Azure Cache to implement data acquisition scenarios on Windows Azure. It then compares the differences between WACEL and Kinesis. Before I continue, I need to clarify that WACEL is not an official product and this post reflects only my personal opinions. The post doesn’t reflect or suggest Microsoft’s product roadmaps in any way.

The problem

Both WACEL + Cache and Kinesis attempt to solve the same problem: providing an easy-to-use, high-throughput, scalable, and available data acquisition solution for Big Data scenarios. Data acquisition is often the starting point of solving (or creating) a Big Data problem. A feasible solution not only needs to provide efficient, high-throughput data pipelines that allow large amounts of data to be pumped into the system, but also needs to provide mechanisms for backend data processors to handle the data, often at a much slower rate than the data comes in, without being overwhelmed by the data flood.

Kinesis focuses on providing a general, hosted solution for high-throughput data ingress. WACEL + Cache, on the other hand, is built with sensor data collection in mind. For a large number of sensors, a single fat data pipeline is not necessarily desirable. A data processor has to filter through the large data stream and group information by sensor. When scaling out, this means either a large number of processors scanning the same stream, each taking only a very small portion of it, or some sort of data filtering and dispatching mechanism has to be built.

WACEL examines sensor data acquisition more closely. It classifies sensor data into two broad categories: streams and events, each with different characteristics and processing requirements, as summarized in the following table:

                   Streams            Events
  Throughput       High               Low
  Timeliness       Temporal           Possibly time-sensitive
  Data Loss        Allowed            Not allowed
  Data Retention   Processed          Raw
  Data Processing  Simple             Complex
  Data Structure   Circular buffer    Queue

In the case of streams, because sensor data is constantly pumped into the system, higher and more stable throughput is required. Examples of sensor streams include GPS sensors and various environmental sensors. In the case of events, because events happen only occasionally, the throughput requirement is not as high in most cases. Examples of sensor events include motion detections and other threshold violations. Streamed data is often temporal, which means that if it’s not examined within a short period of time, it loses its value quickly. For example, the coordinate reported by a GPS a minute ago is most likely irrelevant to an auto-piloting program. On the other hand, events may or may not be time-sensitive. It’s often acceptable to lose some of the streamed data, but it’s usually unacceptable to lose events. And for streamed sensor data, it’s often pointless to record the raw data for an extended period of time; instead, it’s more useful to record characteristics abstracted from the raw data, such as moving averages, trends, and summaries.

WACEL provides two data structures for ingesting data from sensors: a circular buffer that keeps only the latest n items, which is ideal for handling streams; and a queue, which is ideal for handling events. Unlike Kinesis, you don’t have to pump data through a few fat data pipelines. WACEL allows you to create and tear down pipelines instantly, whenever you like. You can have a separate data pipeline for a group of sensors, or even for each individual sensor. And for different types of sensors, you can pick between circular buffers and queues to better fit your scenarios. Separate data pipelines also simplify data processing: you can attach processors to specific data streams instead of having to filter the data yourself.
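To make the per-sensor pipeline idea concrete, here is a minimal sketch using the CachedCompressedCircularBuffer class introduced below. The sensor naming convention is my own choice, and I’m assuming the library’s root namespace matches that of its test console (Microsoft.Ted.Wacel):

     using System.Collections.Generic;
     using Microsoft.Ted.Wacel;

     // One lightweight pipeline per sensor; the buffer name doubles as the pipeline id.
     var buffers = new Dictionary<string, CachedCompressedCircularBuffer>();
     foreach (string sensorId in new[] { "gps-001", "gps-002", "door-001" })
     {
         // Capacity of 5: keep only the latest 5 readings for each sensor.
         buffers[sensorId] = new CachedCompressedCircularBuffer("sensor-" + sensorId, 5);
     }

     // A processor can attach to exactly the stream it cares about,
     // with no scanning or filtering of a shared fat pipeline.
     buffers["gps-001"].Add("$GPGGA,151119.00,4307.0241,N,07729.2249,W,1,06,03.2,+00125.5,M,,,,*3F");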

Getting Started: A Simple Scenario

Now let’s see a simple scenario. In this scenario, a GPS sensor is connected to a client computer and sends standard NMEA statements. The client computer listens to the stream and forwards all $GPGGA statements to a backend server. The GPS data is kept in a circular buffer, which holds the last 5 position fixes. On the server side, a data processor gets the average location from the last 5 statements (as moving averages often provide better positioning results by reducing drift).

Client implementation

  1. In your client program, add a reference to the WACEL NuGet package.
  2. Modify your application configuration file to point to a Windows Azure Cache cluster. Many WACEL data structures support either Windows Azure Cache or Windows Azure Table Storage as the backend (with extensibility to support other backends). In this case we’ll use a cache cluster.
  3. Create a new CachedCompressedCircularBuffer instance and start sending data. That’s really all you need to do; WACEL takes care of everything else, including batching, compression, transient error handling, and, in the near future, offline support.
     CachedCompressedCircularBuffer buffer = new CachedCompressedCircularBuffer("mygps", 5, batchSize: 5);
     buffer.Add("$GPGGA,151119.00,4307.0241,N,07729.2249,W,1,06,03.2,+00125.5,M,,,,*3F");

Note that although we’ve specified a 5-item batch, the client code doesn’t need to explicitly manage batches. It simply adds items to the buffer. WACEL automatically batches the items up, compresses them, and sends them to the server – this is one of the benefits of using a client library. On the other hand, if you want to commit a batch before it’s filled up, you can call the Flush() method at any time to commit a partially filled batch.
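For example, with the buffer created above, a partially filled batch can be committed at any point (statement payloads abbreviated for readability):

     buffer.Add("$GPGGA,151120.00,...");
     buffer.Add("$GPGGA,151121.00,...");
     // Only 2 of the 5 batch slots are filled; commit them now
     // instead of waiting for 3 more items to arrive.
     buffer.Flush();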

Server implementation

The server implementation is just as easy:

  1. Add a reference to the WACEL NuGet package.
  2. Modify your application configuration to point to the same Windows Azure Cache cluster.
  3. Create a new CachedCompressedCircularBuffer instance and start reading data. WACEL offers two easy ways to consume data from a circular buffer: first, you can use the Get() method to get all data items that are currently in the buffer; second, you can simply use an indexer to access individual items, for instance buffer[0] for the latest item, buffer[-1] for the second latest, and so on. Note that WACEL uses negative indexes for older items to reflect the fact that you are tracing back to older data. In this case, we’ll simply read all statements in the buffer and calculate the average lat/long location.
    CachedCompressedCircularBuffer buffer = new CachedCompressedCircularBuffer("mygps", 5, batchSize: 5);
    string[] coordinates = buffer.Get(0);
    //parse statements and calculate average lat/long

The above Get() call returns all 5 $GPGGA statements in the buffer, and the processor can calculate the average lat/long based on these statements.
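To complete the picture, here is a minimal sketch of the parsing step. It assumes well-formed sentences and skips checksum validation; the field positions follow the standard $GPGGA layout, with latitude in field 2 (ddmm.mmmm) and its N/S hemisphere in field 3, and longitude in field 4 (dddmm.mmmm) with its E/W hemisphere in field 5:

    using System;
    using System.Globalization;

    double latSum = 0, lonSum = 0;
    foreach (string statement in coordinates)
    {
        string[] fields = statement.Split(',');
        latSum += ToDegrees(fields[2], fields[3]); // e.g. "4307.0241", "N"
        lonSum += ToDegrees(fields[4], fields[5]); // e.g. "07729.2249", "W"
    }
    double avgLat = latSum / coordinates.Length;
    double avgLon = lonSum / coordinates.Length;

    // Converts an NMEA (d)ddmm.mmmm value plus hemisphere into signed decimal degrees.
    static double ToDegrees(string value, string hemisphere)
    {
        double raw = double.Parse(value, CultureInfo.InvariantCulture);
        double degrees = Math.Floor(raw / 100) + (raw % 100) / 60;
        return (hemisphere == "S" || hemisphere == "W") ? -degrees : degrees;
    }

If you only need the newest fix rather than the whole window, the indexer described above works too: buffer[0] for the latest statement, buffer[-1] for the one before it.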

Throughput and scaling

The throughput of WACEL is bound by your bandwidth. When a 5-item batch is used, the above code provides roughly 50 tps, which is far more than sufficient for capturing and transferring data from a single GPS sensor. If you choose a larger batch size such as 50, the code allows 470+ tps, which is roughly 33 KB of data per second. And remember, this is from a single thread on a single client. The WACEL library itself is stateless, so it scales as you scale out your applications, and its throughput is determined by your network speed and how fast Windows Azure Cache can handle data operations. WACEL’s automatic batching and compression also improve throughput. In a load test (CompressedCircularPerfTest.AddWithLargeBatch), I was able to achieve 6 MB/second (and 2,000 tps) using a single client. Both metrics exceed Kinesis’s throughput promises, by the way. Of course, these numbers were acquired under optimal conditions. You can find the source code of these tests on WACEL’s CodePlex site – look for the CompressedCircularPerfTest class under Microsoft.Ted.Wacel.TestConsole.
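If you want a rough feel for the numbers on your own connection, a simple loop like the following will do. This is my own sketch, not the actual test code from the CodePlex site, and the capacity and batch size are arbitrary test values:

    var buffer = new CachedCompressedCircularBuffer("perftest", 1000, batchSize: 50);
    const int count = 10000;
    var watch = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < count; i++)
    {
        buffer.Add("$GPGGA,151119.00,4307.0241,N,07729.2249,W,1,06,03.2,+00125.5,M,,,,*3F");
    }
    buffer.Flush(); // push out the last partial batch before stopping the clock
    watch.Stop();
    Console.WriteLine("{0:F0} tps", count / watch.Elapsed.TotalSeconds);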

WACEL + Windows Azure Cache vs. Kinesis

The following is a quick comparison between the WACEL + Windows Azure Cache solution and Kinesis:

  • Client. WACEL + Cache: WACEL data structures (.NET). Kinesis: KCL (Java) or HTTP PUT.
  • Server. WACEL + Cache: a managed Windows Azure Cache cluster. Kinesis: managed Kinesis shards.
  • Throughput. WACEL + Cache: Windows Azure Cache allows a maximum of about 600,000 ops/second with a 32-VM cluster, with no throttling on data sizes. Kinesis: 1 MB/s and 1,000 ops/second per shard.
  • Scale. WACEL + Cache: 1 small VM to 32 extra-large VMs. Kinesis: 1 to 10 shards.
  • Price. WACEL + Cache: no transaction fees; you are charged only by the amount of data in cache, and when a circular buffer is used, each sensor uses only the number of items specified on the data structure. Kinesis: $0.015/hour per shard.
  • Data persistence. WACEL + Cache: data can stay temporal in memory or be persisted to Windows Azure Table Storage, and additional data providers can be chained in to support other stores. As an experiment, I wrote an S3-based provider in about 20 minutes so that WACEL could work with S3; I’ll talk about that extension in another post. Kinesis: data is stored for 24 hours; data persistence is supported when the Redis engine is used.
  • Latency. WACEL + Cache: sub-second latency; data is immediately available for consumption. Kinesis: a delay of seconds, about 5 seconds on average.
  • Client-side optimization. WACEL + Cache: automatic batching and gzip compression. Kinesis: unknown with KCL.

Summary

WACEL + Windows Azure Cache is a potential data acquisition solution on the Windows Azure platform. Check out WACEL today!


When we learn classical computing algorithms, we don't start by learning how a computer is built. We don't start with how a transistor works or how to build integrated circuits. Instead, we go straight to the abstractions: bits, commands, programs, and such. I think we should take the same approach when we learn quantum computing. Instead of trying to understand the bizarre quantum world, we should take some quantum behaviors for granted and go straight to higher-level abstractions such as qubits and quantum gates. And once we grasp these basic concepts, we should go a level higher still: use a high-level language like Q# and focus on how quantum algorithms work, and how we can apply them to practical problems.

Bono is an open source quantum algorithm visualizer I'm building in the open. The project is inspired by a few existing systems such as Quirk, IBM Q Experience, and the Programming Quantum Computers book. Bono is a JavaScript-based visualizer that features:

  • Drag-and-drop quantum circuit editing.
  • Dynamic circuit evaluation.
  • Works offline in a browser. No server is needed.
  • Generates Q# code.

I've also created a YouTube channel dedicated to quantum algorithms. My goal is to create an easily digestible course for regular software developers to learn about quantum computing without needing to understand any underlying quantum physics.

Bono is in its infancy, and I'm still a new student in the quantum world. I intentionally develop both Bono and the video series in the open, hoping the early results can inspire collaboration so that we can explore the quantum computing world together.

Last but not least, you can find more Q#-related content at the Q# Advent Calendar 2019.