In one of the best presentations of the conference, Lisa Goddard and Gillian Byrne, librarians at the Memorial University of Newfoundland, gave a clear explanation of the technology and issues relating to the semantic web and how it can be applied in libraries.
Why do we need a new Web? We often forget the kinds of problems we have with the tools available to us, such as high recall and low precision with Google. The web is very vocabulary dependent. Today’s Web search engines do not group web pages, pull out concepts, or understand them. There is no access to the deep web. Identity is a big issue: Google cannot disambiguate between alternate terms, and there is no way to do comparisons. Complex queries are impossible to do on Google. But we do have tools, such as Scopus, that can handle complex queries. These search engines can do this because they have clearly tagged relational databases on the back end. The semantic web solution is to turn the Web into something like a database, with structured data, controlled vocabularies, and linking. The point is to create machine-actionable data because computers visit websites as often as people do.
The basis of the semantic web is the Resource Description Framework (RDF). RDF statements are expressed as triples, each with a subject, a predicate, and an object. Here is an example:
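As a rough illustration of the triple structure, here is a minimal Python sketch. The statements about Shakespeare, the predicate names (`wrote`, `setIn`, `bornIn`, `locatedIn`), and the helper `objects_of` are assumptions chosen for illustration; real RDF would identify each term with a URI rather than a plain string.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical statements; in real RDF each term would be a URI.
triples = [
    Triple("Shakespeare", "wrote", "Macbeth"),
    Triple("Macbeth", "setIn", "Scotland"),
    Triple("Shakespeare", "bornIn", "Stratford-upon-Avon"),
    Triple("Stratford-upon-Avon", "locatedIn", "UK"),
]


def objects_of(subject: str, predicate: str) -> list[str]:
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]
```

Calling `objects_of("Shakespeare", "wrote")` looks up what the subject is linked to by that predicate, which is the basic question a triple store answers.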
Using these triples, we can construct a semantic network of terms:
This allows us to identify related terms, even though the relationship may not be explicitly specified by any of the RDF triples. In the example above, we can see that “Shakespeare” is related to “UK” and “Scotland”. Since every object must have a unique identifier (a URI), its relationship can be resolved.
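The idea of finding relationships that no single triple states can be sketched by treating the triples as a graph and following links. This is only an illustration: the triples and the function names `build_graph` and `related_terms` are assumptions, and plain strings stand in for URIs.

```python
from collections import defaultdict, deque

# Hypothetical triples (subject, predicate, object); strings stand in for URIs.
TRIPLES = [
    ("Shakespeare", "wrote", "Macbeth"),
    ("Macbeth", "setIn", "Scotland"),
    ("Shakespeare", "bornIn", "Stratford-upon-Avon"),
    ("Stratford-upon-Avon", "locatedIn", "UK"),
]


def build_graph(triples):
    """Map each subject to the set of objects it points to."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
    return graph


def related_terms(graph, start):
    """Breadth-first search: every term reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour != start and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

Here no triple links "Shakespeare" to "Scotland" or "UK" directly, yet both turn up in `related_terms(build_graph(TRIPLES), "Shakespeare")` because they are reachable through intermediate terms.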
An ontology is a model describing a particular knowledge domain. Ontologies help establish controlled vocabularies and model relationships between entities and concepts. They have built-in data types that support reasoning because every term must have its own URI. Ontologies are published on the Web and shared. Rules can be written that describe the relationships between terms. For example, “wrote” is the inverse of “written by”; “Anne Hathaway married Shakespeare” and “Shakespeare married Anne Hathaway” are symmetrical relationships; and “Shakespeare” and “William Shakespeare” are equivalent. Using these rules, ontologies help computers to reason, and new knowledge can be inferred from given knowledge. Ontologies are written in Web Ontology Language (OWL). The Protege Ontology Library has a list of many of those that are available. Many websites today use RDF, and almost every big technology player has some semantic capabilities.
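The three kinds of rules mentioned above (inverse, symmetric, and equivalent terms) can be sketched in a few lines of Python. This is a toy stand-in, not how an OWL reasoner works; OWL expresses similar ideas with constructs such as `owl:inverseOf`, `owl:SymmetricProperty`, and `owl:sameAs`. The rule tables and the `infer` function are assumptions made for illustration.

```python
# Hypothetical rule tables, loosely mirroring the examples in the text.
INVERSE = {"wrote": "writtenBy", "writtenBy": "wrote"}
SYMMETRIC = {"married"}
EQUIVALENT = {"William Shakespeare": "Shakespeare"}


def canonical(term):
    """Map an alternate name to its preferred form."""
    return EQUIVALENT.get(term, term)


def infer(triples):
    """Return the given facts plus those implied by the rules."""
    facts = {(canonical(s), p, canonical(o)) for s, p, o in triples}
    inferred = set(facts)
    for s, p, o in facts:
        if p in INVERSE:        # "A wrote B" implies "B writtenBy A"
            inferred.add((o, INVERSE[p], s))
        if p in SYMMETRIC:      # "A married B" implies "B married A"
            inferred.add((o, p, s))
    return inferred


known = [
    ("William Shakespeare", "wrote", "Hamlet"),
    ("Anne Hathaway", "married", "Shakespeare"),
]
facts = infer(known)
```

From two stated facts, `facts` ends up holding four: the originals (with the name normalized to "Shakespeare") plus the inverse and symmetric statements, which is the sense in which new knowledge is inferred from given knowledge.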
We talk a lot about search, but not much about data problems, like disconnected data in silos. Many documents are linked with no way to describe the relationships between them, and the links that are used (ISSNs and ISBNs) are not strong identifiers. Since computers cannot see relationships between disparate materials, we need to link concepts, not data. We must think about sharing data in a way that lets the rest of the world take advantage of it (and they are not going to adopt MARC!!).
Here are some of the obstacles in implementing the semantic web:
- Competing vocabularies: how many ways can you describe a book, article, or place? (See this article for a detailed explanation.)
- Co-referencing: different URIs are being created for the same thing. The sameAs tool developed at the University of Southampton helps find existing URIs and prevent this problem.
- There is lots and lots of data out there; how can we find it? The CKAN Data Hub is a helpful registry system for linked data, but people still must submit their data.
- Linked data sets are being released without good examples or good ways to search the data.
- We are good at sharing but not so good at trusting. We need the trust to link. External taxonomies are now beginning to be trusted, but work on the attribution of data sets is needed.
- Preservation. What happens when an ontology or linking hub disappears? Chaos could result.
- Ownership: Who owns the data? In an academic environment, we do not own the data; the vendors do.
- Licensing: There is no correlation between open and linked data. VoID (Vocabulary of Interlinked Datasets) is a schema to describe linked datasets and allows you to say that your dataset has a license.
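The co-referencing obstacle in the list above can be made concrete with a small sketch: given pairs of URIs asserted to identify the same thing, group them into clusters. The `example.org` and `example.net` URIs and the function `group_coreferences` are hypothetical, and this does not call the sameAs service; it only shows the kind of bookkeeping such a tool has to do.

```python
def group_coreferences(same_as_pairs):
    """Group URIs into clusters using union-find over 'same as' pairs."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


# Hypothetical URIs minted by different sources for the same entities.
pairs = [
    ("http://example.org/person/shakespeare", "http://example.net/id/W_Shakespeare"),
    ("http://example.net/id/W_Shakespeare", "http://example.org/authors/42"),
    ("http://example.org/place/uk", "http://example.net/id/United_Kingdom"),
]
clusters = group_coreferences(pairs)
```

With these pairs, the three Shakespeare URIs fall into one cluster and the two UK URIs into another, even though the first and third Shakespeare URIs were never paired directly.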
All of these issues involve hard work on top of what librarians do now. We are in an age of chaotic innovation in libraries. Fortunately, there are some “chaos tamers” available to help us.
CIL 2014 Blogger and Blog Coordinator
Editor, Personal Archiving: Preserving Our Digital Heritage