SPARQL and Machine Learning


As discussed in previous postings, with the industry trend appearing to be heading down the Semantic Data Lake road, it's probably worth understanding the Return on Investment (ROI) for Semantic Machine Learning:

  • Today, following the CRISP-DM process, or at least something similar, a good deal of time is spent understanding the data and linking different data sets together to create the appropriate data frames (H2OFrame etc.) before you can start training models and tuning the model parameters.  A Semantic Data Lake should at least reduce the time spent identifying data linkages.
  • Consuming raw data into a data lake is fine, as there is zero data loss – ELT rather than ETL.  However, as people join and leave your Data Science team, there is a degree of “learning” time about the data.  RDF and ontologies should reduce this.
  • Ontology Matching is something that might be quite interesting from the EL side of a data lake – reduced time in leveraging new data sources in the lake.
  • With RDF underpinning your data lake, data scientists now have a degree of linkage resolution.  This is briefly discussed in “SPARQL with R in less than 5 minutes”:

In English, our query says, “Give me the values for the attributes “fires”, “acres” and “year” wherever they are defined
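
To make the quoted description concrete, here is a minimal plain-Python sketch of the same idea: keep only the subjects for which all three attributes are defined, and turn them into rows that could feed a data frame. The triples, subject names and values are invented for illustration, and no SPARQL engine or triple store is involved.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; all values are invented.
TRIPLES = [
    ("obs1", "year", 2001), ("obs1", "fires", 10), ("obs1", "acres", 250),
    ("obs2", "year", 2002), ("obs2", "fires", 7),  # "acres" is not defined
    ("obs3", "year", 2003), ("obs3", "fires", 12), ("obs3", "acres", 400),
]


def select_rows(triples, attributes):
    """Return one dict per subject that defines every requested attribute."""
    by_subject = defaultdict(dict)
    for subject, predicate, obj in triples:
        by_subject[subject][predicate] = obj

    rows = []
    for subject in sorted(by_subject):
        values = by_subject[subject]
        # Mirror the "wherever they are defined" part of the quote:
        # a subject missing any attribute contributes no row.
        if all(attr in values for attr in attributes):
            rows.append({attr: values[attr] for attr in attributes})
    return rows


rows = select_rows(TRIPLES, ["fires", "acres", "year"])
```

Here obs2 is dropped because it has no "acres" value, so the resulting rows have no gaps to clean up before modelling.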

I think the killer statement in this article is:

As more data becomes available in RDF format, automated solutions for mining and analyzing the Semantic Web will become more and more useful.

Which leads us to the need for a data lake, as discussed elsewhere on the web, to be underpinned by RDF.  As an example, if your lake is consuming social network data (e.g. Facebook), then you probably want to look at Friend of a Friend (FOAF), since this will allow you to query the data lake (of many data sources) leveraging the FOAF ontology:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name (COUNT(?friend) AS ?count)
WHERE {
    ?person foaf:name ?name .
    ?person foaf:knows ?friend .
} GROUP BY ?person ?name
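
For readers who have not used SPARQL, the following plain-Python sketch shows roughly what the query above computes. It is an illustration over a handful of in-memory triples, not how a triple store evaluates the query; the ex: people are invented, and only the foaf:name and foaf:knows terms come from the FOAF vocabulary.

```python
from collections import defaultdict

FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical triples; the people and their relationships are invented.
TRIPLES = [
    ("ex:alice", FOAF + "name", "Alice"),
    ("ex:bob", FOAF + "name", "Bob"),
    ("ex:carol", FOAF + "name", "Carol"),
    ("ex:alice", FOAF + "knows", "ex:bob"),
    ("ex:alice", FOAF + "knows", "ex:carol"),
    ("ex:bob", FOAF + "knows", "ex:alice"),
]


def friend_counts(triples):
    """Count foaf:knows links per (person, name) pair."""
    names = defaultdict(set)    # person -> set of foaf:name values
    friends = defaultdict(set)  # person -> set of foaf:knows targets
    for subject, predicate, obj in triples:
        if predicate == FOAF + "name":
            names[subject].add(obj)
        elif predicate == FOAF + "knows":
            friends[subject].add(obj)

    counts = {}
    for person, person_names in names.items():
        # Both patterns must match, so a person with a name
        # but no foaf:knows triple produces no result row.
        if not friends[person]:
            continue
        for name in person_names:
            counts[(person, name)] = len(friends[person])
    return counts


counts = friend_counts(TRIPLES)
```

Carol has a name but knows nobody in this data, so she does not appear in the result, just as she would be absent from the SPARQL result set.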

Anyone using a triple store for their data lake?  If so, how are you using it with the usual Hadoop/HDFS data lake technologies?

~ by mdavey on April 29, 2016.
