Market Data – Big Data Thoughts
Exegy’s Market Data Peaks statistics (both European and US) provide a nice reminder of the quantity of market data that forms part of the overall financial services big data problem. Apart from market data, there is also the ever-growing volume of order/trade data and risk data. As has been blogged about before, data volumes are only heading in one direction: up.
It’s thus always interesting to see new products launched in the Big Data space. Last month Nodeable launched StreamReduce. StreamReduce is interesting, as at its heart is Storm. Storm exists in the same space as “Complex Event Processing” (CEP) systems such as Esper, StreamBase, Microsoft StreamInsight (which I blogged about extensively a few years ago) and S4. StreamReduce is basically Storm in the cloud, with a few extras such as connectors to Apache Hadoop.
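To give a flavour of what these CEP-style engines do, here is a minimal sketch of a sliding-window aggregation over a stream of trade events. This is toy Python, not Storm's or Esper's actual API; the operator, window size, and VWAP calculation are all illustrative assumptions:

```python
from collections import deque

class SlidingWindowVWAP:
    """Toy CEP-style operator: maintains a volume-weighted average price
    over the last `window` trade events as they arrive on the stream."""

    def __init__(self, window=5):
        self.window = window
        self.trades = deque()  # (price, size) tuples, oldest first

    def on_trade(self, price, size):
        self.trades.append((price, size))
        if len(self.trades) > self.window:
            self.trades.popleft()  # evict the oldest event from the window
        notional = sum(p * s for p, s in self.trades)
        volume = sum(s for _, s in self.trades)
        return notional / volume  # current VWAP over the window

vwap = SlidingWindowVWAP(window=3)
for price, size in [(100.0, 10), (101.0, 20), (99.0, 10)]:
    latest = vwap.on_trade(price, size)
print(round(latest, 2))  # VWAP over the last three trades: 100.25
```

A real CEP engine adds distribution, fault tolerance and a query/topology language on top, but the core idea is the same: incremental computation over unbounded streams rather than batch scans.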
In the early days of Big Data, Hadoop was king, and batch processing was the way forward. However, as with many software engineering roads, we have moved past batch processing and are now in a world of real-time (or near real-time) processing, which essentially establishes CEP as one of the pillars of big data analysis. Storm and S4 have gained much attention in web land, with some debate as to which is the better of the two. Yahoo! Labs has a somewhat old but useful read on S4 – Distributed Stream Computing Platform:
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail, describe various applications, including real-life deployments. Our design is primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.
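The key-affinity routing described above can be sketched in a few lines. This is a hypothetical single-process illustration, not S4's actual API: each key gets its own PE instance holding local state, so all events for a given key land on the same PE:

```python
class ProcessingElement:
    """Toy S4-style PE: one instance per key, holding encapsulated
    local state. A real PE could also emit events to downstream PEs."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        self.count += 1  # consume the event, update local state

class KeyedRouter:
    """Routes keyed events with affinity: the same key always reaches
    the same PE instance, so no shared state or locking is needed."""

    def __init__(self):
        self.pes = {}

    def route(self, key, event):
        pe = self.pes.setdefault(key, ProcessingElement(key))
        pe.process(event)

router = KeyedRouter()
for symbol in ["IBM", "MSFT", "IBM", "IBM"]:
    router.route(symbol, {"symbol": symbol})
print(router.pes["IBM"].count)  # 3 -- every IBM event hit the same PE
```

The Actor-like encapsulation is what makes the model massively concurrent: because PEs never share state, a cluster can distribute keys across nodes and run PEs in parallel without coordination.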
So if CEP is one of the Big Data pillars, the others are probably storage and visualization/analysis. In the Big Data storage space, Cassandra is one of the many contenders. Acunu offers a Cassandra variant, and Acunu’s web site hints at Storm usage by its clients:
Storm, MQ and other event frameworks fit together well with Acunu Reflex with Analytics. You can use them to pre-process or filter incoming data, which is then processed, stored and served to a front-end application using Acunu
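The "pre-process or filter incoming data" pattern mentioned in the quote can be sketched simply. This is an illustrative toy, not Acunu's or Storm's API; the odd-lot filter and in-memory "store" are assumptions standing in for a stream processor fronting a column store:

```python
def pre_filter(events, min_size=100):
    """Toy pre-processing step: drop odd-lot trades before they reach
    storage, in the spirit of fronting a store with a stream processor."""
    return [e for e in events if e["size"] >= min_size]

store = []  # stand-in for a Cassandra-like column store
incoming = [
    {"symbol": "IBM", "size": 50},   # odd lot -- filtered out
    {"symbol": "IBM", "size": 500},  # block trade -- persisted
]
store.extend(pre_filter(incoming))
print(len(store))  # 1 -- only the block-sized trade is persisted
```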
Metamarkets in 2010/11 clearly viewed the relational database and NoSQL technology offerings of the time as inappropriate for Big Data, and went their own way with Druid, leveraging a data storage schema very similar to Twitter’s Rainbird. Metamarkets’ architecture design principles are interesting. The second and third principles are expected, with ZooKeeper providing coordination. The first principle, specifically around partial aggregation, is the most interesting.
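Partial aggregation is worth a quick illustration. The idea is that ingest nodes roll raw events up into partial aggregates, and query time only has to merge those partials rather than rescan raw data. A minimal sketch, with hypothetical (count, sum) aggregates rather than Druid's actual internals:

```python
from collections import defaultdict

def partial_aggregate(events):
    """Roll raw (key, value) events up into partial (count, sum)
    aggregates per key, as an ingest node might do."""
    partials = defaultdict(lambda: [0, 0.0])
    for key, value in events:
        partials[key][0] += 1
        partials[key][1] += value
    return {key: tuple(agg) for key, agg in partials.items()}

def merge(partials_a, partials_b):
    """Merge two partial aggregates -- cheap at query time, since the
    expensive scan over raw events already happened at ingest."""
    merged = dict(partials_a)
    for key, (count, total) in partials_b.items():
        c, t = merged.get(key, (0, 0.0))
        merged[key] = (c + count, t + total)
    return merged

node1 = partial_aggregate([("IBM", 100.0), ("IBM", 102.0)])
node2 = partial_aggregate([("IBM", 98.0), ("MSFT", 30.0)])
combined = merge(node1, node2)
print(combined["IBM"])  # (3, 300.0) -> the average is recoverable as 100.0
```

Note that (count, sum) merges associatively, which is exactly what lets the work be pushed to ingest time and distributed across nodes; a metric like median would not decompose this way.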
Finally, to end on visualization. If we start with R, Metamarkets again provides us with some thoughts. DataStax and Pentaho provide yet another Cassandra solution, with a web-based interface to access, interactively analyze, and visualize big data, and then report on it and create dashboards. However, it’s unclear how real-time the web solution is, and whether it’s leveraging WebSockets to facilitate streaming of data over the internet. I suspect Pentaho’s visualization solution will not support the usual trading desk requirement for real-time drill-down into trade/risk data.