In-Stream Big Data Query Processing Pipeline: Storm, Cassandra, Kafka
Interesting read on the shortcomings and drawbacks of batch-oriented data processing over on Highly Scalable. Data partitioning and shuffling are discussed as solutions to large data sets. Lineage tracking is also discussed 🙂 The article concludes with a brief overview of the solution built with Storm, Cassandra and Kafka:
The system naturally uses Kafka 0.8 as a partitioned fault-tolerant event buffer to enable stream replay and improve system extensibility by easy addition of new event producers and consumers. Kafka’s ability to rewind read pointers also enables random access to the incoming batches and, consequently, Spark-style lineage tracking. It is also possible to point the system input to HDFS to process the historical data.
Cassandra is employed for state checkpointing and in-store aggregation, as described earlier. In many use cases, it also stores the final results.
Twitter’s Storm is a backbone of the system. All active query processing is performed in Storm’s topologies that interact with Kafka and Cassandra. Some data flows are simple and straightforward: the data arrives to Kafka; Storm reads and processes it and persist the results to Cassandra or other destination. Other flows are more sophisticated: one Storm topology can pass the data to another topology via Kafka or Cassandra.