Ontologies Are Not Scary

May 13, 2016

FOAF offers a great description of what an ontology is:

FOAF describes the world using simple ideas inspired by the Web. In FOAF descriptions, there are only various kinds of things and links, which we call properties. The types of the things we talk about in FOAF are called classes. FOAF is therefore defined as a dictionary of terms, each of which is either a class or a property. Other projects alongside FOAF provide other sets of classes and properties, many of which are linked with those defined in FOAF.
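
To make that concrete, here is roughly what a couple of the best-known FOAF terms look like in Turtle (paraphrased from the FOAF vocabulary rather than copied verbatim): a class, i.e. a kind of thing, and a property, i.e. a link from a thing to a value.


@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# A class: a kind of thing we can talk about
foaf:Person a owl:Class;
 rdfs:label "Person" .

# A property: a link from a thing to a value (here, its name)
foaf:name a owl:DatatypeProperty;
 rdfs:label "name";
 rdfs:domain owl:Thing;
 rdfs:range rdfs:Literal .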

In the D2RQ world, the key is the mapping.ttl file, which maps the relational world (tables and columns) to the ontology world (classes and properties).

Given the above, and a bit of reading, you can define your own ontology, or use readily available open source ontologies:


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

@prefix mydatalake: <http://www.something.com/mydatalake-schema#> .

mydatalake:SomeThing a owl:Class;
 rdfs:label "Something label" .

mydatalake:SomeThingProperty a owl:FunctionalProperty;
 rdfs:label "SomeThingProperty label";
 rdfs:domain mydatalake:SomeThing;
 rdfs:range rdfs:Literal .

This is followed by a D2RQ mapping.ttl to allow SPARQL queries against your datastore:


@prefix mydatalake: <http://www.something.com/mydatalake-schema#> .
@prefix db: <> .
@prefix vocab: <vocab/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix jdbc: <http://d2rq.org/terms/jdbc/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

mydatalake:database a d2rq:Database;
 d2rq:jdbcDriver "org.apache.phoenix.jdbc.PhoenixDriver";
 d2rq:jdbcDSN "jdbc:phoenix:localhost";
 jdbc:autoReconnect "true";
 jdbc:zeroDateTimeBehavior "convertToNull";
 .

mydatalake:STOCK_SYMBOL a d2rq:ClassMap;
 d2rq:dataStorage mydatalake:database;
 d2rq:uriPattern "STOCK_SYMBOL/@@STOCK_SYMBOL.SYMBOL@@";
 d2rq:class mydatalake:SomeThing;
 .
mydatalake:STOCK_SYMBOL_SYMBOL a d2rq:PropertyBridge;
 d2rq:belongsToClassMap mydatalake:STOCK_SYMBOL;
 d2rq:property foaf:name;
 d2rq:column "STOCK_SYMBOL.SYMBOL";
 .
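
With d2r-server running against that mapping, a query over the ontology terms might look like the sketch below (mydatalake:SomeThing and foaf:name are the terms mapped above; the results naturally depend on whatever your STOCK_SYMBOL table contains):


PREFIX mydatalake: <http://www.something.com/mydatalake-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?thing ?name
WHERE {
  ?thing a mydatalake:SomeThing ;
         foaf:name ?name .
}
ORDER BY ?name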

This then opens up the world of reasoners and rule engines through the Apache Jena project.
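
For example, adding a single RDFS axiom to the ontology is enough for a reasoner to infer facts that were never loaded explicitly. The FinancialInstrument class below is made up purely for illustration:


@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix mydatalake: <http://www.something.com/mydatalake-schema#> .

# Hypothetical parent class, purely for illustration
mydatalake:FinancialInstrument a owl:Class;
 rdfs:label "Financial instrument" .

# One axiom: every SomeThing is also a FinancialInstrument
mydatalake:SomeThing rdfs:subClassOf mydatalake:FinancialInstrument .

Load that alongside the mapped data into a Jena inference model, and a query for instances of mydatalake:FinancialInstrument returns every mapped row, even though no row was ever stated to be one directly.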

Phoenix and D2RQ Compatibility

May 12, 2016

If you happen to want to follow an AllegroGraph view of semantic data lakes, but with an open source stack, you may venture down the Phoenix and D2RQ road.  Be aware of an issue that isn't obvious: classpath hell.

If you follow the D2RQ instructions, you'll run into an issue with the Phoenix driver (the latest being phoenix-4.7.0-HBase-1.1-client.jar).  Loading the Phoenix driver too early on the classpath generates this error:

WARN ContextHandler :: Empty contextPath
WARN AbstractLifeCycle :: FAILED o.e.j.w.WebAppContext@1e0b4072{/,null,null},webapp: java.lang.NoSuchMethodError: org.eclipse.jetty.server.Connector.getHost()Ljava/lang/String;
java.lang.NoSuchMethodError: org.eclipse.jetty.server.Connector.getHost()Ljava/lang/String;
at org.eclipse.jetty.webapp.WebInfConfiguration.getCanonicalNameForWebAppTmpDir(WebInfConfiguration.java:598)
at org.eclipse.jetty.webapp.WebInfConfiguration.makeTempDirectory(WebInfConfiguration.java:343)
at org.eclipse.jetty.webapp.WebInfConfiguration.resolveTempDirectory(WebInfConfiguration.java:282)

However, if you move the Phoenix driver to the end of the classpath, or close to the end, all is resolved :) Almost certainly this is because the fat Phoenix client jar bundles its own (different) copy of Jetty, and whichever Jetty classes appear first on the classpath win over the version D2RQ's embedded web server expects.

If you use the STOCK_SYMBOL.sql found in the Phoenix examples folder, you can then run the following SPARQL query via the D2RQ Snorql web service:


SELECT ?r
WHERE {?n vocab:STOCK_SYMBOL_SYMBOL ?r }
ORDER BY DESC(?r)

Although basic, it at least shows that you've managed to get HBase data back via SPARQL, assuming your mapping.ttl looks, at a minimum, like the one below:


@prefix map: <#> .
@prefix db: <> .
@prefix vocab: <vocab/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix jdbc: <http://d2rq.org/terms/jdbc/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

map:database a d2rq:Database;
d2rq:jdbcDriver "org.apache.phoenix.jdbc.PhoenixDriver";
d2rq:jdbcDSN "jdbc:phoenix:localhost";
jdbc:autoReconnect "true";
jdbc:zeroDateTimeBehavior "convertToNull";
.

# Table STOCK_SYMBOL
map:STOCK_SYMBOL a d2rq:ClassMap;
d2rq:dataStorage map:database;
d2rq:uriPattern "STOCK_SYMBOL/@@STOCK_SYMBOL.SYMBOL@@";
d2rq:class vocab:STOCK_SYMBOL;
d2rq:classDefinitionLabel "STOCK_SYMBOL";
.
map:STOCK_SYMBOL_SYMBOL a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:STOCK_SYMBOL;
d2rq:property vocab:STOCK_SYMBOL_SYMBOL;
d2rq:propertyDefinitionLabel "STOCK_SYMBOL SYMBOL";
d2rq:column "STOCK_SYMBOL.SYMBOL";
.
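
As with the query above, Snorql already knows the vocab: prefix, so a slightly fuller query against the same mapping can pull back both the generated resource and its symbol:


SELECT ?stock ?symbol
WHERE {
  ?stock a vocab:STOCK_SYMBOL ;
         vocab:STOCK_SYMBOL_SYMBOL ?symbol .
}
ORDER BY ?symbol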

Stung by Python 3 – HBase and Phoenix

May 11, 2016

I was playing around late one night with Apache HBase and Apache Phoenix and couldn't figure out what the issue was with Phoenix connectivity.  I had followed the simple install instructions, and I knew HBase was running, as simple table creations worked.  However, Phoenix just didn't want to return data 😦

Resolution: downgrade from Python 3 to Python 2.7.  Problem solved. (At the time of writing, the Phoenix command-line scripts such as sqlline.py expect Python 2.)

Real World Microservices

May 10, 2016

“Real World Microservices: When Services Stop Playing Well and Start Getting Real” offers a great read on real-world experience.  A few takeaways:

At Twitter, we learned that we need tools that operate on the communication between services

Tools worth looking at:

What’s nice is the Mesos + Marathon readiness of both the above tools:)

SPARQL Data Platform

May 10, 2016

Given the various postings on SPARQL recently, I thought it worth noting down the various data platform options I’ve considered:

  1. For pure PoC'ing, MySQL using the file import facility in MySQL Workbench, with D2RQ providing SPARQL access.  Simple and easy to set up (see the mapping sketch after this list).
  2. For more of a Hadoop platform, HBase with Apache Phoenix offering a JDBC driver, again allowing D2RQ to be used as the SPARQL access layer.
  3. Apache Marmotta, in many ways an improvement on Option 1 above, since it sits on top of standard database technology.
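
For Option 1, the only D2RQ-specific work is pointing a d2rq:Database at MySQL rather than Phoenix; a minimal sketch, where the database name and credentials are placeholders:


@prefix map: <#> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .

map:database a d2rq:Database;
 d2rq:jdbcDriver "com.mysql.jdbc.Driver";
 d2rq:jdbcDSN "jdbc:mysql://localhost/mydatabase";
 d2rq:username "someuser";
 d2rq:password "somepassword";
 .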

Option 1 is probably the quickest to move forwards with once you've become annoyed with accessing corporate data that is spread across n systems and you're still in Machine Learning discovery land :)  If you've used Apache Marmotta, or have the time to set it up and learn the platform, Option 3 may be a better bet.

Option 2 is probably the production version, or at least a stab in the right direction, as it offers improvements in scaling, coupled with Hadoopness :)

Where's all this going?  "SPARQL with R in less than 5 minutes" provides a quick and interesting read on the power of SPARQL.  If you're building a data lake without a foundation (an ontology), you may be missing a trick.

Interested in anyone else's options.

Semantic Data Model

May 9, 2016

“When Hadoop Simply Isn't Enough: How to Purpose-Build Architecture for Industrial Data” offers an interesting read, but there is no mention of RDF.  Elasticsearch and Hadoop are mentioned as part of the solution, but I'm not clear on how linkage of data is achieved.  Am I missing something?

“The Data Lake Concept Is Maturing”, however, provides a more interesting read, with the Apache Hadoop Distributed File System being called out for storage, coupled with:

when selecting a NoSQL database with which to work with their Hadoop clusters. MongoDB, he said, is typically used for department-level cache applications, Apache Cassandra for highly distributed interactive applications, and Apache Hbase for analytic applications which Bodkin said “can tolerate a bit more latency, having a smaller number of places where machine-learned models sit right next to your compute cluster in Hadoop.”

From a graph database perspective, there is an interesting article on Neo4j and RDF: "Importing ttl (Turtle) ontologies in Neo4j".

The Sempala research is interesting, but I don't see the code anywhere.

Finally, on to Spark and SPARQL: "RDF Graphs and GraphX".

Which still leads to the question: what is the latest data lake software stack?  Is the road HBase holding the data and/or the RDF (Jena-HBase)?  Or is HBase paired with a separate graph database offering SPARQL queries?

Training the algorithm

May 9, 2016

Interesting article on the BBC recently, "Can a computer really recruit the best staff?".  Some interesting takeaways:

How we use our computers and phones logs our corporate activity, what we say on social networks does the same thing. When we use a keycard to get from department to department, that’s generating data about our movement across the workspace. Bill Nowaki calls all this data “artefacts”.

If the data is used in the right way, there are benefits:

But it would also be a great aid to improving someone’s performance… if an organisation uses the data of its most successful people to give tips and hints to the laggards.

And why the algorithm might be better in some scenarios, or at least help highlight the issues:

Study after study demonstrates a huge bias in the recruiting process… even in organisations which say they are committed to eliminating discrimination. White middle-aged men have a tendency to hire other white middle-aged men, whatever they intend. Robotised recruitment is blind to that sort of human influence.

 