Data Analytics Platform Reading

September 29, 2016

A few recent articles worth a read, specifically around the (fast) data platform space:

  • A Guided Way To Manage Data In Motion For Streaming Applications
  • Swisscom Q/A On Choosing Scala And Spark For New Streaming Data Platform – particularly interesting, as it's a real-world application.  Nice to see Kafka in the stack 🙂.  Also nice to see User Experience (UX) and data privacy touched on by the article.
  • The Next Generation Data Science Toolkit
  • Spark ML Data Pipelines – provides the usual stats on data cleanup.  Nice list of a few tools that readers may find useful, e.g. ActiveClean.  Word2Vec seems to be in a lot of conversations these days.  I have not used BlinkDB.
  • Why data is the new coal
  • At the bleeding edge of AI: Quantum grocery picking and transfer learning
  • PaddlePaddle – text classification system overview
  • The Barclays Data Science Hackathon: Using Apache Spark and Scala for Rapid Prototyping
  • DevOps and Big Data: Rapid Prototyping for Data Science and Analytics

Do you pass the “Joel Test” for Data Science?

September 22, 2016

I'd be interested to know how many teams score well on the “Joel test” for Data Science.

The number one killer question for any project is “Can new hires get set up in the environment to run analyses on their first day?”.  Most projects I’ve seen fail on this.  Docker may be your friend here 🙂

“Can predictive models be deployed to production without custom engineering or infrastructure work?” is a little ambiguous, but hits the nail on the head with regards to “Done” and getting into production to achieve a Return on Investment (ROI).

The Remote Manifesto

September 21, 2016

Great read over on GitLab with further details provided here.

Point 6 of the remote manifesto offers some good advice for teams that are in estimation pain, with stakeholders demanding concrete estimates and delivery dates:

Don’t get fixated on trying to estimate workloads. This is mostly a waste of time and usually inaccurate. If it’s to get a general idea, use T-shirt sizes for a measure. S, M, or L. If you get to XL then you can guarantee it’s inaccurate.

Point 8 is extremely important in all aspects of life:

At GitLab we have a Slack channel #thanks for this purpose. It always feels good to give and receive a thanks.

“iOS” MacBook

September 19, 2016

Interesting read over on OSNews about the death of macOS.  In many ways this could be the solution to Apple being rid of Intel.

On macOS, maybe it's a numbers game – iOS, watchOS, and tvOS vs macOS?

Building a Data Science Team

September 19, 2016

Kaggle recently pointed me at an interesting article on how Airbnb built their data science team, “Building a Team from the Inside Out: Alok Gupta on the Evolution of Data Science at Airbnb”.

There are a few interesting takeaways from the article worth noting:

centralized team of data scientists to smaller embedded teams which sit within product area

The Data Scientist role is broken down into 4 specialized roles:

  • Data engineers – They take messy data and transform it for analysis.
  • Product builders – People who build data products that are user-facing. For example, they may build a recommender engine.
  • Data analysts – They provide chief analyses outlining where opportunities lie for the business.
  • Data experimenters – Scientists who know how to design and perform an experiment.

A number of tools are referenced in the article:

Key, however, seems to be Airbnb’s “knowledge sharing tool” that allows data scientists to “write up analyses from start to finish” – effectively a blog of research.  I would assume this comes with a fully repeatable way for anyone at Airbnb to take the research and branch down other roads, offering easy Return on Investment (ROI).  In many ways it is similar to Kaggle Kernels and Datasets.  Some details of the Airbnb knowledge-sharing pipeline are discussed here.

The GROWS Method

September 19, 2016

Andy Hunt provides a good overview of GROWS, touching on the issues of Scrum and XP practices:

if you and/or your organization are following the XP practices, or the Scrum practices, then you are not agile

The Dreyfus Model looks like a great framework to aid “agile” teams.  It offers five stages of progression to gain specific skills:

  • Novice – Rules-based, just wants to accomplish the immediate goal
  • Advanced Beginner – Needs small, frequent rewards; big picture is confusing
  • Competent – Can develop conceptual models, can troubleshoot
  • Proficient – Driven to seek larger conceptual model, can self-correct
  • Expert – Intuition-based, always seeking better methods

This leads to a great comment by Andy, which is utterly true:

change needs to come from one’s own desire

Having read Andy’s article, coupled with the GROWS web site, it would appear that GROWS is quite an interesting method to provide guidance for teams to experiment, provide, learn and move forwards – it's particularly nice to see checklists called out in GROWS.

Deploying an Analytical Database

September 16, 2016

O’Reilly has a short but interesting article on “5 mistakes to avoid when deploying an analytical database”.

Point one is pretty much the case of the engineer in the sweet shop.  Stop and think before you decide on the technology stack, and whether it's really going to help 🙂

Point four is critical from my perspective.  The best TLA to describe this is the ELT paradigm compared to the old ETL world.  Extract all data elements from the source systems and store them, untransformed, in your analytical database; transformation happens afterwards, on top of the stored raw data.  Zero information loss.
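
To make the ELT idea a bit more concrete, here is a minimal sketch in Python using an in-memory SQLite database as a stand-in for an analytical database.  Everything in it is an assumption for illustration: the source records, the table names (raw_orders, orders) and the function names are hypothetical and do not come from the O’Reilly article.  The transform step is done in Python purely to keep the example dependency-free; in a real analytical database it would more likely be SQL run inside the database.

```python
import json
import sqlite3

# Hypothetical source records; field names are invented for illustration.
SOURCE_ROWS = [
    {"order_id": 1, "amount": "19.99", "currency": "GBP", "note": "gift"},
    {"order_id": 2, "amount": "5.00", "currency": "GBP", "coupon": "AUTUMN"},
]


def load_raw(conn, rows):
    """Extract + Load: store every source record verbatim, dropping nothing."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (payload TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO raw_orders (payload) VALUES (?)",
        [(json.dumps(row),) for row in rows],
    )


def transform(conn):
    """Transform: derive a typed table from the raw payloads, after loading."""
    conn.execute("DROP TABLE IF EXISTS orders")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    payloads = conn.execute("SELECT payload FROM raw_orders").fetchall()
    for (payload,) in payloads:
        record = json.loads(payload)
        conn.execute(
            "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
            (record["order_id"], float(record["amount"])),
        )


def run_elt(rows):
    """Run load then transform against a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    load_raw(conn, rows)
    transform(conn)
    return conn
```

The point of the ordering is that fields the first transform ignores (note and coupon above) are still sitting in raw_orders, so a later analysis can re-derive a different table without going back to the source systems.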