Artificial Intelligence – Data Quality

•June 23, 2016

Most corporations, as soon as they venture down the road of “big data” and AI, realise that they often don’t have a big data issue; they have a data quality issue, which is probably coupled with holes in the corporate data set.  This is driven by a number of issues, including:

  • No Chief Data Officer
  • No Data Strategy
  • Lack of thought as to how the data from an application (division, department etc.) will be used outside of the application (user experience) itself
  • No Acceptance Criteria on stories around data quality

Data hygiene is key to deriving useful predictions and classifications from AI models – obvious :)

What follows are a few pointers that may aid in the area of data hygiene:

  • Identification of the source of truth (SoT) of data e.g. market data, trades, orders, recruitment.  Using a secondary copy of the SoT can often lead to “issues”
  • Context around data changes in the SoT.  Specifically, who changed what, when, and ideally why.  “Why” can be difficult in certain instances, but ideally would provide some context on the path that led to a data change e.g. a phone call from a client requesting an amendment to a trade
  • Taxonomy/ontology – if you are doing anything around LDA (Latent Dirichlet Allocation) to extract topics, then it’s going to help considerably if the input data leverages a taxonomy to reduce the surface area of data.
  • Applications are often built with no thought around any of the above points.  Further, if you are using ELK or similar as a data source for AI models, it will not be uncommon to find that application development didn’t consider the logs during development 😦  In this scenario, I’d advise mandating ELK to development teams :)  At a minimum, this will aid the reduction of support tickets, as support staff will at least have meaningful log files to work with :)
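The “who changed what, when, and ideally why” pointer can be sketched as a small audit trail wrapped around the SoT.  This is only a minimal illustration: `ChangeRecord`, `AuditedStore` and the trade identifier are hypothetical names, and a real SoT would persist the history rather than hold it in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class ChangeRecord:
    """One audited change to a field in the SoT (hypothetical schema)."""
    entity_id: str
    field_name: str
    old_value: Any
    new_value: Any
    changed_by: str              # who
    changed_at: datetime         # when
    reason: Optional[str] = None  # why, if it is known


class AuditedStore:
    """In-memory stand-in for a source of truth that records every change."""

    def __init__(self) -> None:
        self._data: dict = {}
        self.history: list = []

    def update(self, entity_id, field_name, new_value, changed_by, reason=None):
        entity = self._data.setdefault(entity_id, {})
        record = ChangeRecord(
            entity_id=entity_id,
            field_name=field_name,
            old_value=entity.get(field_name),  # None on first write
            new_value=new_value,
            changed_by=changed_by,
            changed_at=datetime.now(timezone.utc),
            reason=reason,
        )
        entity[field_name] = new_value
        self.history.append(record)
        return record

    def get(self, entity_id):
        # Return a copy so callers cannot bypass the audit trail.
        return dict(self._data.get(entity_id, {}))


store = AuditedStore()
store.update("trade-42", "quantity", 100, changed_by="jsmith")
store.update("trade-42", "quantity", 150, changed_by="jsmith",
             reason="phone call from client requesting an amendment")
```

Keeping the old value alongside the new one means the history can later be replayed as training data without having to diff snapshots.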

It’s truly amazing how much time can be wasted on collecting and cleaning data prior to training AI models 😦

AI Chatbots

•June 15, 2016

“Deep Learning For Chatbots, Part 1 – Introduction” provides a good overview of the techniques needed to develop your own chatbot.  Clearly, a closed domain problem is easier.

Microsoft’s Bot Framework also provides some good resources.  Particularly nice is the fact that Bot Builder is available for Node.js.  Microsoft has gone for one approach to understanding natural language – LUIS.

Botkit also looks interesting, but doesn’t seem to have particularly sophisticated NLP capabilities.

ChatterBot has a training mode.  I’ve not used it, but it would be interesting, for example, to play in traders’ Bloomberg conversations or similar, and see how the bot fared :)

Polyglot Persistence

•June 14, 2016

Slide 8 of Martin Fowler’s deck provides clarity on what polyglot persistence means.

using multiple data storage technologies, chosen based upon the way data is being used by individual applications. Why store binary images in relational database, when there are better storage systems?
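As a rough sketch of the binary image example in the quote: structured metadata goes into a relational store, while the image bytes go to a blob store.  Everything here is an assumption for illustration only; in-memory SQLite and a temporary directory stand in for real products, and `save_image`/`load_image` are hypothetical helpers.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

# Relational store for structured, queryable metadata (in-memory SQLite here).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE images (name TEXT PRIMARY KEY, blob_key TEXT, size INTEGER)"
)

# Stand-in for an object/blob store: a temporary directory keyed by content hash.
blob_dir = Path(tempfile.mkdtemp())


def save_image(name: str, content: bytes) -> str:
    """Write the bytes to the blob store and only the metadata to SQL."""
    key = hashlib.sha256(content).hexdigest()
    (blob_dir / key).write_bytes(content)
    db.execute("INSERT INTO images VALUES (?, ?, ?)", (name, key, len(content)))
    return key


def load_image(name: str) -> bytes:
    """Look up the blob key relationally, then fetch the bytes from the blob store."""
    row = db.execute(
        "SELECT blob_key FROM images WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return (blob_dir / row[0]).read_bytes()


save_image("logo.png", b"\x89PNG fake bytes")
```

The relational side stays small and fast to query, and the blob side can be swapped for whichever storage system suits the access pattern.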

PaaS: Serverless/Nanoservices

•June 13, 2016

A few articles worth a read on serverless architecture:

  • OpenWhisk Vies With AWS Lambda As Developer Service
  • Google has quietly launched its answer to AWS Lambda
  • How I decided to use Serverless/Nanoservices Architecture with AWS to make CAPI
  • Is “Serverless” architecture just a finely-grained rebranding of PaaS?


Ingesting Documents and RDF

•June 10, 2016

Strangely, MarkLogic appears to have some of the best documentation on dealing with documents and RDF in a single data hub, even though the concept isn’t unique to MarkLogic.  As I’ve blogged about before, there are a number of other products, and it’s conceivable to build your own using various open source frameworks and libraries (Apache Jena etc.).

The BBC News sample is probably the best I’ve found so far.

We started with each article as a single XHTML document. We then used OpenCalais to analyze the articles and find the entities (real-world things) within them. OpenCalais spotted entities like people, their roles, places (cities and countries) and organizations. On top of this it linked individuals with their role(s) and also determined the subject headings (categories) of the documents. For example, for one news article, OpenCalais generated triples for us that indicated the item was about war, identified the places mentioned in the article, and provided geo-location information for those places

The results can be seen in the downloadable zip, which provides a directory for the source (XHTML articles), and a directory with the generated RDF file from OpenCalais.  The RDF file uses rdf:Description to reference the original BBC URL of the article.  Both the XHTML and the RDF files are then ingested into MarkLogic – as expected.
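To illustrate how an rdf:Description ties triples back to an article URL, here is a minimal sketch using only the Python standard library.  The sample RDF, the `ex:` vocabulary and the example.org URLs are made up for illustration and are not real OpenCalais output; the parser also covers only this simple flat subset of RDF/XML, so a proper RDF library (Apache Jena or similar) would be the sensible choice for real data.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample, NOT real OpenCalais output.
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms#">
  <rdf:Description rdf:about="http://example.org/news/article-1">
    <ex:category>War</ex:category>
    <ex:mentionsPlace rdf:resource="http://example.org/place/example-city"/>
  </rdf:Description>
</rdf:RDF>
"""


def extract_triples(rdf_xml: str) -> list:
    """Return (subject, predicate, object) tuples from flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml.strip())
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")  # e.g. the article URL
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; join them into one URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            # The object is either a linked resource or a literal value.
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


triples = extract_triples(SAMPLE_RDF)
```

Each triple’s subject is the article URL, which is what lets the RDF be joined back to the XHTML document once both are in the same hub.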

Finally, “SPARQL and XQuery Together” shows how to leverage both content structures.

Interested in anyone’s experience of MarkLogic, as the documentation hints at a cool product.


deeplearning() -> anomaly()

•June 10, 2016

“Anomaly Detection: Increasing Classification Accuracy with H2O’s Autoencoder and R” offers some great ideas around using the h2o.deeplearning() algorithm to detect anomalies.

Further, a read of this may aid in boundary detection, and further coolness via h2o.feature here.

Once you’ve got your model, export it as an H2O POJO if required, and hook it up to your stream of data.
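The general idea behind autoencoder-based anomaly detection is to flag rows that the model reconstructs poorly.  Below is a minimal sketch of that thresholding step only.  It does not use H2O: `mean_reconstruct` is a deliberately trivial stand-in for a trained autoencoder, and the threshold value is an arbitrary assumption that would need tuning on real data.

```python
from statistics import mean


def reconstruction_mse(row, reconstructed):
    """Mean squared error between a row and its reconstruction."""
    return mean((a - b) ** 2 for a, b in zip(row, reconstructed))


def flag_anomalies(rows, reconstruct, threshold):
    """Return one boolean per row: True where the reconstruction error is too high."""
    return [reconstruction_mse(r, reconstruct(r)) > threshold for r in rows]


# Stand-in "model": reconstructs every row as the column means of the training data.
# A real autoencoder would replace this function.
training = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]
column_means = [mean(col) for col in zip(*training)]


def mean_reconstruct(row):
    return column_means


# Hypothetical stream: one normal-looking row, one far from the training data.
stream = [[1.0, 2.0], [5.0, 9.0]]
flags = flag_anomalies(stream, mean_reconstruct, threshold=0.5)
```

With a real model, only `mean_reconstruct` changes; the scoring and thresholding stay the same whether the rows arrive as a batch or from a stream.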

Operational Data Hub

•June 10, 2016

Although biased towards MarkLogic, “Decimating Data Silos With Multi-Model Databases” provides an interesting read.  A few interesting takeaways:

60 percent of the cost of a new data warehousing project is allocated to ETL and corporations spend $36 billion annually on creating relational data silos.

Of no surprise is the “semantic metadata” concept, and also the callout to ELT over ETL.

Three approaches to avoiding ETL hell:

  1. A semantics triple store/graph-only architecture
  2. A document store-only architecture
  3. A multi-model approach that combines a document store and triple store

On Option 2, “Defensive programming for unexpected data structures” is so true based on what I’ve seen.

Agree that Option 3 is the preference, and the best of both worlds (document storage and ontology).  Not being a MarkLogic expert, it would appear that in many ways this is similar to combining MongoDB or similar with D2RQ?
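A toy sketch of what Option 3 amounts to: documents and triples living side by side, with the triples used to find the documents.  `MultiModelStore` and the sample URIs are hypothetical, and this in-memory class says nothing about how MarkLogic, MongoDB or D2RQ actually implement it.

```python
class MultiModelStore:
    """Toy store holding documents plus triples that describe them."""

    def __init__(self):
        self.documents = {}   # uri -> document body
        self.triples = set()  # (subject, predicate, object)

    def add_document(self, uri, body, facts=()):
        """Store the document and any (predicate, object) facts about it."""
        self.documents[uri] = body
        for predicate, obj in facts:
            self.triples.add((uri, predicate, obj))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the given pattern; None acts as a wildcard."""
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]

    def documents_where(self, predicate, obj):
        """Use the triples to locate documents, then return their bodies."""
        return {
            s: self.documents[s]
            for s, _, _ in self.match(p=predicate, o=obj)
            if s in self.documents
        }


store = MultiModelStore()
store.add_document("doc:1", "<article>...</article>", facts=[("about", "war")])
store.add_document("doc:2", "<article>...</article>", facts=[("about", "sport")])
war_docs = store.documents_where("about", "war")
```

The document side keeps the original structure intact, while the triple side carries the semantics, so neither has to be bent into the other’s shape.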

