Data Lake: Which RDF Data Store?

• May 3, 2016

A number of triple stores are listed here. GraphDB looks interesting, but I suspect its enterprise edition tips the scale towards an Open Source solution. In the world of big data, I suspect you really want SPARQL over Hadoop?

store data in a key value store in Hadoop [Hbase], but then index it with our graph database so that we can do these SPARQL queries

This is echoed by “Avoiding Three Common Pitfalls of Data Lakes”:

Smart data technologies substantially reduce the complexity of data lake implementations while accelerating the time-to-value they produce. The graph-based model and detailed descriptions of data elements they enable substantially enhance integration efforts, enabling business users to link data according to relevant attributes that provide pivotal context across data sources and business models. Resource Description Framework (RDF) graphs are designed to incorporate new data and sources without having to re-configure existing representations. The result is considerably decreased time to a more profound form of analytics, in which users can not only ask more questions more expediently than before, but also determine relationships and data context to issue ad-hoc queries for specific needs.

Although older, “Storing (and querying) RDF in NoSQL database managers” is still of interest:

The paper then describes the storage and querying of RDF using HBase with Jena for querying, HBase with Hive as the query engine (with Jena’s ARQ to parse the queries before converting them to HiveQL), CumulusRDF (Cassandra with Sesame), and Couchbase

Which leads to the following possible options:

  • RDF data store used to index your data lake
  • D2R on top of your data lake
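The first option can be sketched in a few lines. This is only a toy illustration, assuming a plain dict as a stand-in for the key-value store (HBase in the quote) and a hypothetical `TripleIndex` class as a stand-in for the graph database; the `ex:` predicates and sample records are invented for the example, and no real HBase or triple store API is used.

```python
from collections import defaultdict

# Stand-in for the key-value store: row key -> raw record (hypothetical data).
kv_store = {
    "row:1": {"source": "crm", "payload": {"name": "Alice", "city": "Leeds"}},
    "row:2": {"source": "web", "payload": {"name": "Bob", "city": "York"}},
    "row:3": {"source": "crm", "payload": {"name": "Carol", "city": "Leeds"}},
}


class TripleIndex:
    """Toy index of (subject, predicate, object) triples, looked up by predicate and object."""

    def __init__(self):
        self._by_po = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._by_po[(predicate, obj)].add(subject)

    def subjects(self, predicate, obj):
        # .get avoids creating empty entries for unknown lookups.
        return sorted(self._by_po.get((predicate, obj), set()))


def build_index(store):
    """Describe every stored record as triples whose subject is the row key."""
    index = TripleIndex()
    for row_key, record in store.items():
        index.add(row_key, "ex:source", record["source"])
        for field, value in record["payload"].items():
            index.add(row_key, f"ex:{field}", value)
    return index


def lookup(store, index, predicate, obj):
    """Resolve a triple pattern to row keys, then fetch the raw records."""
    return [store[key]["payload"] for key in index.subjects(predicate, obj)]


index = build_index(kv_store)
leeds_records = lookup(kv_store, index, "ex:city", "Leeds")
print(leeds_records)
```

The raw data never leaves the key-value store; the index only holds row keys, which is the appeal of this option over copying everything into a triple store.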

Anyone got a view?

Ontology Matching

• May 3, 2016

Interesting read on matching multiple ontologies using machine learning. With the advent of semantic data lakes, I can only assume such ideas will gain traction, reducing the manual processes used today to link data.
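To give a feel for the problem, here is a minimal sketch of matching terms from two ontologies. It is not the machine learning approach from the article: it is a plain string-similarity baseline using Python's `difflib`, and the two term lists, the `normalise` helper and the 0.5 threshold are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical term labels from two ontologies describing the same domain.
ontology_a = ["Person", "familyName", "givenName", "emailAddress"]
ontology_b = ["Individual", "family_name", "given_name", "email", "postcode"]


def normalise(label):
    """Ignore case and underscores so naming conventions do not block a match."""
    return label.replace("_", "").lower()


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised labels."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def match_terms(source, target, threshold=0.5):
    """Map each source term to its most similar target term, if similar enough."""
    matches = {}
    for term in source:
        best = max(target, key=lambda candidate: similarity(term, candidate))
        if similarity(term, best) >= threshold:
            matches[term] = best
    return matches


candidates = match_terms(ontology_a, ontology_b)
for term, match in candidates.items():
    print(f"{term} -> {match}")
```

A baseline like this misses synonyms such as Person and Individual, which is exactly the gap a learned matcher is meant to close.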

Culture Impacts Application Performance

• April 30, 2016

Yet another classic article from Joe Duffy, this time on Performance Culture. Joe sums the whole performance culture issue up quite well in the early part of his essay:

The cultural transformation must start from the top – management taking an active role in performance, asking questions, seeking insights, demanding rigor – while it simultaneously comes from the bottom – engineers actively seeking to understand performance of the code they are writing, ruthlessly taking a zero-tolerance stance on regressions, and being ever-self-critical and on the lookout for proactive improvements.

For me, the main source of performance problems in engineering is management. The drive for features at the expense of performance, and the failure to understand the engineering effort required to achieve performance, come back to Joe’s classic comment, “Management must care”. In many cases, I truly don’t think management cares, or cares to understand.

It’s hard to disagree with Joe’s point about the single appointed test/performance team member. The same applies outside of performance, in the general quality space, with the “tester” silo mentality 😦

engineers learned bad habits by outsourcing the basic quality of their code, assuming that someone else would catch any problems that arise.

“My number one rule is ‘no jerks,'” – hard to disagree with this. Shame when jerks are forced on teams by management 😦

In every team with a poor performance culture, it’s management’s fault. Period. End of conversation.

Finding the balance between performance and features is critical, else the cost of resolving performance problems later will be a multiple of the time that was never allocated in the first place:

Performance doesn’t come for free. It costs the team by forcing them to slow down at times, to spend energy on things other than cranking out features

Great to hear that Microsoft is following the path of the “elimination of ‘test’ as a discipline” mentioned in the essay, and a “renewed focus on engineering systems”.

In my view, management in a lot of companies needs to wake up from the sleepy drug of features, features, features. Without a stable, engineering-backed culture within a team, driven by a sensible blend of features and engineering, products are doomed to fail, causing considerable financial loss.

SPARQL and Machine Learning

• April 29, 2016

As discussed in previous postings, with the industry trend appearing to head down the Semantic Data Lake road, it’s probably worth understanding the Return on Investment (ROI) for Semantic Machine Learning:

  • Today, following the CRISP-DM process, or at least something similar, a good deal of time is spent understanding the data and linking different data sets together to create the appropriate data frames (H2OFrame etc.) before you can start training models and tuning the model parameters. A Semantic Data Lake should at least reduce the time spent identifying data linkages.
  • Consuming raw data into a data lake is fine, as there is zero data loss – ELT rather than ETL. However, with changing resources on your Data Science team, there is a degree of “learning” time about the data. RDF and ontologies should reduce this.
  • Ontology Matching is something that might be quite interesting from the EL side of a data lake – reduced time to leverage new data sources in the lake.
  • With RDF underpinning your data lake, data scientists now have a degree of linkage resolution. This is briefly discussed in “SPARQL with R in less than 5 minutes”:

In English, our query says, “Give me the values for the attributes “fires”, “acres” and “year” wherever they are defined

I think the killer statement in this article is:

As more data becomes available in RDF format, automated solutions for mining and analyzing the Semantic Web will become more and more useful.

Which leads us to the need for a data lake, as discussed elsewhere on the web, to be underpinned by RDF. As an example, if your lake is consuming social network data (e.g. Facebook), then you probably want to look at Friend of a Friend (FOAF), since this will allow you to query the data lake (of many data sources) leveraging the FOAF ontology:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name (COUNT(?friend) AS ?count)
WHERE {
    ?person foaf:name ?name .
    ?person foaf:knows ?friend .
} GROUP BY ?person ?name
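To make clear what the FOAF query above computes, here is a rough pure-Python equivalent over a handful of in-memory triples. It is a hand-written illustration of the join and GROUP BY, not a SPARQL engine; the `ex:` identifiers and sample people are invented, and a real data lake would send the query to a triple store instead.

```python
from collections import Counter

FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical sample data as (subject, predicate, object) triples.
triples = [
    ("ex:alice", FOAF + "name", "Alice"),
    ("ex:bob", FOAF + "name", "Bob"),
    ("ex:carol", FOAF + "name", "Carol"),
    ("ex:alice", FOAF + "knows", "ex:bob"),
    ("ex:alice", FOAF + "knows", "ex:carol"),
    ("ex:bob", FOAF + "knows", "ex:alice"),
]


def friend_counts(triples):
    """Return (name, count) pairs: how many people each named person knows."""
    # Pattern 1: ?person foaf:name ?name
    names = [(s, o) for s, p, o in triples if p == FOAF + "name"]
    # Pattern 2: ?person foaf:knows ?friend, counted per ?person
    knows = Counter(s for s, p, o in triples if p == FOAF + "knows")
    # Join on ?person; people with no foaf:knows triple drop out,
    # just as they would from the SPARQL result.
    return sorted(
        (name, knows[person]) for person, name in names if knows[person] > 0
    )


result = friend_counts(triples)
print(result)
```

Carol has a name but no foaf:knows triples, so she does not appear in the result, which mirrors how both triple patterns must match in the query.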

Anyone using a triple store for their data lake? If so, how are you using it alongside the usual Hadoop/HDFS data lake technologies?

Blockchain: FXSwaps Valuation

• April 28, 2016

The OpenGamma and Clearmatics demo blockchain PoC appears to be a step in the right direction, both in technology terms and in terms of business relevance.

the key innovation is the use of a distributed virtual machine to compute business logic and financial models. Robert Sams, Founder and CEO of Clearmatics explains:

“Distributed ledger technology emphasises shared and replicated data storage, but some very specific and rigorous business logic must govern how those ledgers are amended. If the automation of that logic isn’t also distributed, then DLT will actually encourage more centralisation in post-trade intermediation and do little to increase transparency.”

I assume as it was using Ethereum, it was using Smart Contracts as part of the PoC?

Machine Learning on Streaming Data – Samza and Flink

• April 28, 2016

Based on a few comments, coupled with various web reading, I get the impression Spark and Storm are not the latest solution to use in a Streaming Data Machine Learning platform – maybe I’m wrong?  Apache Samza and Flink appear to be the new kids on the block.  There are a few comparisons of the various streaming engines – one here, and another here.

Samza is very interesting, since it uses a technology I like a lot, Apache Kafka:)  Flink however appears to be the newest kid on the block:) , and based on this simple code comparison, offers a clean API.

The dataArtisans article “Kafka + Flink: A practical, how-to guide” offers some direction on connecting Kafka and Flink, which in many ways might be the approach to take for running Machine Learning models against streaming data.

Finally, although old, slide 62 of “Apache Flink: API, runtime, and project roadmap” provides a view of the roadmap for Flink to integrate with Machine Learning libraries – see also slide 67, with H2O mentioned on slide 71.

The next bus: Apache Kudu?

Semantic Data Lake

• April 28, 2016

Continuing on the data lake road, Kafka Connect looks very interesting as a way to feed a data lake. This leads to “Prototype of Data Processing Infrastructure”, and its usage of RDF and CumulusRDF. RDF is interesting since it has gained uptake thanks to SPARQL and the semantic web.

Further searching of the web yields a very interesting article in the data analytics space, “Advanced Real-Time Healthcare Analytics with Apache Spark”. Thankfully the article provides an appropriate architecture diagram, which is nicely using Apache Kafka:) More interesting is that it’s using ontologies:

The architecture is hybrid and also includes a production rule engine and an ontology reasoner. This is done in order to leverage existing clinical domain knowledge available from evidence-based clinical practice guidelines (CPGs) and biomedical ontologies like SNOMED. This approach complements machine learning algorithms’ probabilistic approach to clinical decision making under uncertainty. The production rule system can translate CPGs into executable rules which are fully integrated with clinical processes (workflows) and events. Drools supports both forward and backward chaining as well as the modeling of business processes (clinical workflows) with the business process modeling notation (BPMN). There are patterns for integrating rules and processes.

Interestingly, there is a W3C Machine Learning Schema Community Group – RDF etc. There’s also a list of projects on the Machine Learning and Ontology Engineering site.

Moving on, we find “Hadoop, Triple Stores, and the Semantic Data Lake”:

“What we’re investing most of our time in now is the semantic data lake, where we store data in a key value store in Hadoop [Hbase], but then index it with our graph database so that we can do these SPARQL queries,”

