CLARIAH Linked Data Workshop

[This blog post is co-written with Marieke van Erp and Rinke Hoekstra and is cross-posted from the Clariah website]

Linked Data, RDF and Semantic Web are popular buzzwords in tech-land and within CLARIAH. But they may not be familiar to everyone within CLARIAH. On 12 september, CLARIAH therefore organized a workshop at the Vrije Universiteit Amsterdam to discuss the use of Linked Data as technology for connecting data across the different CLARIAH work packages (WP3 linguistics, WP4 structured data and WP5 multimedia).

Great turnout at Clariah LOD workshop

The goal of the workshop was twofold. First of all, to give an overview from the ‘tech’ side of these concepts and show how they are currently employed in the different work packages. At the same time we wanted to hear from Arts and Humanities researchers how these technologies would best suit their research and how CLARIAH can support them in familiarising themselves with Semantic Web tools and data.

The workshop
Monday afternoon, at 13:00 sharp, around 40 people showed up for the workshop at the Boelelaan in Amsterdam. The workshop included plenary presentations that laid the groundwork for discussions in smaller groups centred around the different types of data from the different WPs (raw collective notes can be found on this piratepad).

Rinke Hoekstra presented an Introduction Linked Data: What is it, how does it compare to other technologies and what is its potential for CLARIAH. [Slides]
In the discussion that followed, some concerns about the potential for Linked Data to deal with data provenance and data quality were discussed.
After this, three humanities researchers from each of the work packages discussed experiences, opportunities, and challenges around Linked Data. Our “Linked Data Champions” of this day were:

  • WP3: Piek Vossen (Vrije Universiteit Amsterdam) [Slides]
  • WP4: Richard Zijdeman (International Institute of Social History)
  • WP5: Kaspar Beelen and Liliana Melgar (University of Amsterdam) [Slides]

Marieke van Erp, Rinke Hoekstra and Victor de Boer then discussed how Linked Data is currently being produced in the different work packages and showed an example of how these could be integrated (see image). [Slides]. If you want to try these out yourself, here are some example SPARQL queries to play with.hisco integrated data example

Break out sessions
Finally, in the break out sessions, the implications and challenges for the individual work packages were further discussed.

  • For WP3, the discussion focused on formats. There are manynatural language annotation formats used, some with a long history, and these formats are often very closely connected to text analysis software. One of the reasons it may not be useful to WP3 to convert all tools and data to RDF is that performance cannot be guaranteed, and in some cases has already been proven to not be preserved when doing certain text analysis tasks in RDF. However, converting certain annotations, i.e. end results of processing to RDF could be useful here. We further talked about different types of use cases for WP3 that include LOD.
  • The WP4 break-out session consisted of about a dozen researchers, representing all working packages. The focus of the talk was on the expectations of the tools and data that were demonstrated throughout the day. Various persons were interested to apply QBer, the tool that allows one to turn csv files into Linked Data. The really exciting bit about this, is that the interest was shared by persons outside WP4, thus from persons usually working with text or audio-video sources. This does not just signal the interest in interdisciplinary research, but also the interest for research based on various data types. A second issue discussed was the need for vocabularies ((hierarchical) lists of standard terms). For various research fields such vocabularies do not yet exist. While some vocabularies can be derived relatively easily from existing standards that experts use, it will prove more difficult for a large range of variables. The final issue discussed was the quality of datasets. Should tools be able to handle ‘messy’ data? The audience agreed that data cleaning is the responsibility of the researcher, but that tools should be accompanied by guidelines on the expected format of the datafile.
  • In the WP5 discussion, issues around data privacy and copyrights were discussed as well as how memory institutions and individual researchers can be persuaded to make their data available as LOD (see image).

wp5 result

The day ended with some final considerations and some well-deserved drinks.

Share This:

Downscale2016 proceedings published


Proceedings screenshotI chose to publish the proceedings of Downscale2016 using Figshare. This gives a nice persistent place for the proceedings and includes a DOI. To cite the proceedings, use the text below. The proceedings is published using the CC-BY license.

Victor de Boer, Anna Bon, Cheah WaiShiang and Nana Baah Gyan (eds.) Proceedings of the 4th Workshop on Downscaling the Semantic Web (Downscale2016). Co-located with the 4th International Conference on ICT for Sustainability (ICT4S) Sep 1, 2016, Amsterdam, The Netherlands. doi:10.6084/m9.figshare.3827052.v1 



Share This:

Installing and Running the First Big-Data-Europe Health Pilot

[This blog post is reblogged from and written by Ronald Siebes and Victor de Boer]

As previously announced, the pilot implementation for the Big-Data-Europe platform for Societal Challenge 1 (the Health domain) facilitates the Open PHACTS discovery Platform functionality.  The Open PHACTS platform is built for researchers in Drug Discovery. It uses databases of physicochemical and pharmacological properties stored in a RDF Triple Store. This interconnected data is exposed through a Linked Data API composed of interoperable data. The system caches query results via a Memcached module. In the context of the SC1 pilot, most functionalities of the platform is now successfully replicated via Docker containers on the BDE infrastructure.

The Open PHACTS platform architecture
The Open PHACTS platform architecture

Please do try this at home! The pilot can be installed on Linux (through Docker compose) or Windows (through Docker toolbox). Installations instructions are available on the pilot’s GitHub page.  By design the technology itself is independent from the domain. Once you got familiar with the code and got it running by yourself, you should have enough experience to upload your own Linked Data, and create your own API.

Share This:

Crowd- and nichesourcing for film and media scholars

[This post describes Aschwin Stacia‘s MSc. project and is based on his thesis]

There are many online and private film collections that lack structured annotations to facilitate retrieval. In his Master project work, Aschwin Stacia explored the effectiveness of a crowd-and nichesourced film tagging platform,  around a subset of the Eye Open Beelden film collection.

Specifically, the project aimed at soliciting annotations appropriate for various types of media scholars who each have their own information needs. Based on previous research and interviews, a framework categorizing these needs was developed. Based on this framework a data model was developed that matches the needs for provenance and trust of user-provided metadata.

Fimtagging screenshot
Screenshot of the FilmTagging tool, showing how users can annotate a video

A crowdsourcing and retrieval platform (FilmTagging) was developed based on this framework and data model. The frontend of the platform allows users to self-declare knowledge levels in different aspects of film and also annotate (describe) films. They can also use the provided tags and provenance information for retrieval and extract this data from the platform.

To test the effectiveness of platform Aschwin conducted an experiment in which 37 participants used the platform to make annotations (in total, 319 such annotations were made). The figure below shows the average self-reported knowledge levels.

Average self-reported knowledge levels on a 5-point scale. The topics are defined by the framework, based on previous research and interviews.
Average self-reported knowledge levels on a 5-point scale. The topics are defined by the framework, based on previous research and interviews.

The annotations and the platform were then positively evaluated by media scholars as it could provide them with annotations that directly lead to film fragments that are useful for their research activities.

Nevertheless, capturing every scholar’s specific information needs is hard since the needs vary heavily depending on the research questions these scholars have.

  • Read more details in Aschwin’s thesis [pdf].
  • Have a look at the software at , and maybe start your own Filmtagging instance
  • Test the annotation platform yourself at or watch the screencast below

Share This: