Building a Data Science Team

September 19, 2016

Kaggle recently pointed me at an interesting article on how Airbnb built their data science team, “Building a Team from the Inside Out: Alok Gupta on the Evolution of Data Science at Airbnb”.

There are a few interesting takeaways from the article worth noting:

A move from a centralized team of data scientists to smaller embedded teams which sit within product areas.

The data scientist role is broken down into four specialized roles:

  • Data engineers – They take messy data and transform it for analysis.
  • Product builders – People who build data products that are user-facing. For example, they may build a recommender engine.
  • Data analysts – They provide chief analyses outlining where opportunities lie for the business.
  • Data experimenters – Scientists who know how to design and perform an experiment.

A number of tools are referenced in the article:

Key, however, seems to be Airbnb’s “knowledge sharing tool” that allows data scientists to “write up analyses from start to finish”, effectively a blog of research. I would assume this comes with a fully repeatable way for anyone in Airbnb to take the research and branch down other roads, offering an easy Return on Investment (ROI). In many ways it is similar to Kaggle Kernels and Datasets. Some details of the Airbnb knowledge-sharing pipeline are discussed here.

The GROWS Method

September 19, 2016

Andy Hunt provides a good overview of GROWS, touching on the issues of Scrum and XP practices:

if you and/or your organization are following the XP practices, or the Scrum practices, then you are not agile

The Dreyfus Model looks like a great framework to aid “agile” teams. It offers five stages of progression to gain specific skills:

  • Novice – Rules-based, just wants to accomplish the immediate goal
  • Advanced Beginner – Needs small, frequent rewards; big picture is confusing
  • Competent – Can develop conceptual models, can troubleshoot
  • Proficient – Driven to seek larger conceptual model, can self-correct
  • Expert – Intuition-based, always seeking better methods

This leads to a great comment by Andy, which is utterly true:

change needs to come from one’s own desire

Having read Andy’s article, coupled with the GROWS web site, it would appear that GROWS is quite an interesting method to provide guidance for teams to experiment, provide, learn and move forwards. It’s particularly nice to see checklists called out in GROWS.

Deploying an Analytical Database

September 16, 2016

O’Reilly has a short but interesting article on “5 mistakes to avoid when deploying an analytical database”.

Point one is pretty much the case of the engineer in the sweet shop. Stop and think before you decide on the technology stack, and whether it’s really going to help. 🙂

Point four is critical from my perspective. The best TLA to describe this is the ELT (Extract, Load, Transform) paradigm compared to the old ETL (Extract, Transform, Load) world. Extract all data elements from the source systems and store them in your analytical database. Zero information loss.
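As a rough illustration of the ELT idea, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an analytical database. The table, view and field names (raw_orders, usd_orders, amount, currency, note) are hypothetical and not taken from the O’Reilly article. The point is only that every source field is loaded untouched, and the transformation happens afterwards inside the database.

```python
import sqlite3

# Hypothetical source records; the field names are illustrative only.
source_rows = [
    {"id": 1, "amount": "19.99", "currency": "USD", "note": "first order"},
    {"id": 2, "amount": "5.00", "currency": "EUR", "note": None},
]

conn = sqlite3.connect(":memory:")

# Extract + Load: keep every source field as-is, with no filtering or cleaning.
conn.execute(
    "CREATE TABLE raw_orders (id TEXT, amount TEXT, currency TEXT, note TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (:id, :amount, :currency, :note)",
    source_rows,
)

# Transform: done later, inside the database, on top of the raw copy.
conn.execute(
    """
    CREATE VIEW usd_orders AS
    SELECT CAST(id AS INTEGER) AS id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE currency = 'USD'
    """
)

usd_orders = conn.execute("SELECT id, amount FROM usd_orders").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]

print("Transformed rows:", usd_orders)
print("Raw rows retained:", raw_count)
```

Because the raw table is never modified, a different transformation (say, one that also needs the note field) can be added later without going back to the source systems.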

Continuous Delivery Coding Patterns

September 12, 2016

Yet another resource from InfoQ, “Continuous Delivery Overview” married with “Continuous Delivery Coding Patterns: Latent-to-Live Code & Forward Compatible Interim Versions”.

Trunk-based development (TBD) is probably one of the main changes that teams need to get their heads around if today they are working off branches.

The latent-to-live code pattern is, in my view, the only sensible road, since until your code gets into production there is zero Return on Investment (ROI).
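One common way to keep new code latent on trunk is a feature toggle. The sketch below assumes that approach, which may differ from the exact technique in the InfoQ article. The function names and the new_greeting flag are hypothetical. The new code path ships to production alongside the old one but stays inactive until the flag is switched on.

```python
def legacy_greeting(name):
    """Existing behaviour that is live in production today."""
    return f"Hello, {name}"


def new_greeting(name):
    """New behaviour: deployed with every release, but latent by default."""
    return f"Welcome back, {name}"


def greeting(name, flags):
    """Pick the code path based on a (hypothetical) feature flag."""
    if flags.get("new_greeting", False):
        return new_greeting(name)
    return legacy_greeting(name)


# Latent: the new code is deployed but not executed.
print(greeting("Ada", {}))

# Live: flipping the flag activates it without a new deployment.
print(greeting("Ada", {"new_greeting": True}))
```

The design choice here is that going live becomes a configuration change rather than a merge, which is what makes working directly on trunk practical.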

Pancake Stack: End-to-End, Real-time ML and AI Pipeline

September 12, 2016

InfoQ has an interesting read on the last mile of the Apache Spark machine learning pipeline, “Chris Fregly on the PANCAKE STACK Workshop and Data Pipelines”.

It’s interesting to read that even Netflix and others struggled in this space:

The idea for PipelineIO dates back a few years to my time at Netflix where we were forced to build a custom ML prediction/serving layer. There wasn’t – and still isn’t, in my opinion – a production-ready, fault-tolerant, and low latency open source system to serve Netflix-scale predictions and recommendations in real-time

PipelineIO offers a good architecture overview on its home page:


Serverless – Real World Usage

August 17, 2016

Great posting by Pete Johnson on a serverless application, “30K Page Views for $0.21: A Serverless Story”. It’s interesting to read about a real-world application that leverages AWS Lambda functions coupled with S3 storage, providing real data on the extremely low charges for the number of page views. If nothing else, read the “What I’ve Found is Cool (and Not) About Lambda” section. 🙂
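To put the headline figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the two numbers in the article title (30,000 page views and $0.21) and assumes the cost scales evenly per view, which real AWS billing may not do exactly.

```python
# Figures taken from the article title; even per-view scaling is an assumption.
page_views = 30_000
total_cost_usd = 0.21

cost_per_view = total_cost_usd / page_views
cost_per_thousand = cost_per_view * 1000

print(f"Cost per page view: ${cost_per_view:.6f}")
print(f"Cost per 1,000 page views: ${cost_per_thousand:.3f}")
```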

Gamification of Kanban – Part 2

July 28, 2016

Ashish Parkhi offers an interesting read on his experience of gamifying agile, “Gamifying Agile Adoption – An experiment”. Some interesting behaviour from the experiment:

  • The team members started following processes well.
  • There was more interaction and collaboration between SD, QA, Product Owners and Product Managers.

Software available here