MiFID II – Clock Synchronisation

February 1, 2016

The LMAX blog has a great read on Precision Time Protocol (PTP).  Sensibly, LMAX already had “GPS satellite antennae on the roof of some of our data centres” – very wise.  Interesting to see InfluxDB and Puppet in use.  Clock synchronisation, like log files and entitlements, is often left until too late in a project.  If you're involved in MiFID II or distributed systems, their blog posts are a must-read.

Logstash: Log Management – Part 3

January 29, 2016

As is, I suspect, the usual course when venturing down the ELK road, at some point you realise you don't have enough disk space for all the logs you are consuming.   Luckily, Curator comes to the rescue.  One of the most basic and simple commands is:

curator show indices --all-indices

It's also worth knowing the directory layout of the Elasticsearch nodes – data, logs and config paths in particular.
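The core of what Curator does when reclaiming disk space – selecting indices whose date suffix is older than a cutoff – can be sketched in plain Python (a rough illustration of the idea, not Curator's actual implementation):

```python
from datetime import datetime, timedelta

def indices_older_than(indices, days, fmt="%Y.%m.%d", now=None):
    """Return index names whose date suffix is older than `days` days.

    Mimics Curator's time-based selection: the suffix after the last '-'
    is parsed with `fmt` and compared against the cutoff.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=days)
    old = []
    for name in indices:
        try:
            suffix = name.rsplit("-", 1)[1]
            stamp = datetime.strptime(suffix, fmt)
        except (IndexError, ValueError):
            continue  # no parseable date suffix; leave the index alone
        if stamp < cutoff:
            old.append(name)
    return old

indices = ["logstash-2016.01.01", "logstash-2016.01.25", ".kibana"]
print(indices_older_than(indices, days=7, now=datetime(2016, 1, 29)))
# ['logstash-2016.01.01']
```

In Curator itself the equivalent would be a delete command with a time filter, run from cron once a day.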

Real-time Data Refinery

January 21, 2016

First, I must give credit to Hortonworks for “Data Refinery” – a cool data buzzword.  Hortonworks' “Storm and Kafka Together: A Real-time Data Refinery” article provides a great overview of data processing, and why Storm and Kafka work so well together:

Apache Storm is a distributed real-time computation engine that reliably processes unbounded streams of data. While Storm processes stream data at scale, Apache Kafka processes messages at scale. Kafka is a distributed pub-sub real-time messaging system that provides strong durability and fault tolerance guarantees.
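The division of labour described above – Kafka as the durable message log, Storm as the per-tuple processor – can be illustrated very loosely (no Storm or Kafka APIs involved, just a toy generator pipeline standing in for a spout-and-bolt topology):

```python
def spout(messages):
    """Stand-in for a Kafka-fed spout: emits a stream of messages."""
    for msg in messages:
        yield msg

def split_bolt(stream):
    """Bolt 1: split each message into word tuples."""
    for msg in stream:
        for word in msg.split():
            yield word

def count_bolt(stream):
    """Bolt 2: running word counts (the classic Storm example)."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

messages = ["kafka feeds storm", "storm processes streams"]
print(count_bolt(split_bolt(spout(messages))))
```

In the real thing, the spout would consume from a Kafka topic (which replays on failure – the durability guarantee mentioned above) and each bolt would run distributed across workers.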

A tutorial is available here to get your hands dirty with Storm and Kafka.

Along similar lines, “Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform”  provides further reading material.

Finally, “Real Time Streaming with Apache Storm and Apache Kafka” offers the classic Twitter Stream Sentiment Analysis.

Logstash: Geoip for Internal Networks – Part 2

January 20, 2016

Continuing with Logstash.  If you're doing anything with Topbeat, then consider the dashboards, available here.  Using Filebeat and Topbeat to feed Logstash effectively means Logstash receives both streams via port 5044.  In your Logstash config, you may want to insert the data into different Elasticsearch indices.  One way to do this is to check the input type of the data from Beats:

filter {
  if [type] == "system" or [type] == "filesystem" or [type] == "process" {
    mutate {
      add_field => { "_IndexName" => "%{type}" }
    }
  }
}

And then in the output section, change the index name as appropriate:

output {
  elasticsearch {
    index => "%{_IndexName}-%{+YYYY.MM.dd}"
  }
}
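Logstash expands `%{_IndexName}-%{+YYYY.MM.dd}` per event, substituting the field value and the event's date.  The effect can be mimicked with a small Python sketch (the real expansion uses the event's `@timestamp`; here a plain datetime stands in):

```python
from datetime import datetime

def index_name(event, now=None):
    """Rough equivalent of the Logstash sprintf "%{_IndexName}-%{+YYYY.MM.dd}"."""
    now = now or datetime.utcnow()
    return "%s-%s" % (event["_IndexName"], now.strftime("%Y.%m.%d"))

event = {"type": "system", "_IndexName": "system"}
print(index_name(event, now=datetime(2016, 1, 20)))
# system-2016.01.20
```

So a Topbeat `system` event lands in `system-2016.01.20`, a Filebeat `log` event in its own daily index, and the two streams stay separated despite sharing port 5044.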

Microservices and DDD

January 20, 2016

Couple of interesting presentations on InfoQ that are worth watching around the hype of microservices:

  • The Seven Deadly Sins of Microservices – some great comments in this presentation around the shiny stuff, with some good references to monolithic applications as well.  Boring technology reduces risk, and in many ways is the ideal tool.
  • Enabling Microservices with Domain Driven Design and Ports & Adapters – often not talked about enough, Domain-Driven Design is nicely discussed in the context of microservices.

Deep Learning – H2O

January 19, 2016

Interesting interview with Arno Candel around Deep Learning.  Specifically, the uptake of H2O:

For example, Cisco built a Propensity to Buy Model Factory using H2O. Paypal uses H2O for their Big Data Analytics initiatives and H2O Deep Learning for Fraud Detection. Ebay deploys H2O on their data science clusters with Mesos. ShareThis uses H2O for Conversion Estimation in Display Advertising to predict performance indicators such as CPA, CTR and RPM. MarketShare uses H2O to generate marketing plans and What-If scenarios for their customers. Vendavo is using it to build Pricing Engines for products and Trulia for finding fixer-uppers in luxury neighborhoods. Some retailers and insurance companies are using it to do nationwide modeling and prediction of demand to manage just-in-time inventories and recommendations.

Cloudera has an interesting article on “How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark”.  The key takeaway from this article is how Spark and H2O integrate:

the data import, ad-hoc data munging (parsing the date column, for example), and joining of tables by leveraging the power of Spark. We then publish the Spark RDD as an H2O Frame
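The “ad-hoc data munging (parsing the date column)” step reads, in plain Python terms (no Spark or H2O involved – the rows and column names below are made up for illustration), something like:

```python
from datetime import datetime

# Hypothetical rows; in the Cloudera article this is a Spark RDD and the
# transform below would be expressed as a per-record map().
rows = [
    {"Date": "01/19/2016", "Delay": 12},
    {"Date": "01/20/2016", "Delay": 3},
]

def parse_date(row):
    """Turn the string Date column into year/month/day feature columns."""
    d = datetime.strptime(row["Date"], "%m/%d/%Y")
    row.update(Year=d.year, Month=d.month, Day=d.day)
    return row

munged = [parse_date(r) for r in rows]
print(munged[0]["Year"], munged[0]["Month"], munged[0]["Day"])
# 2016 1 19
```

In Sparkling Water the munged RDD is then published as an H2O Frame, and H2O's Deep Learning takes over from there.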


Spark, Spring XD and Apache Geode – Machine Learning

January 11, 2016

The stock inference project over on Pivotal's GitHub offers a nice starting point for machine learning with an interesting technology stack, which sensibly includes Spark :)  I've not used Apache Geode previously – the transaction support could be interesting within the low-latency container.  Spring XD offers a nice pipeline for aiding both real-time and batch processing of ingested data.  Interested to know if any readers have used Spring XD in earnest.

