Linked Data Scopes

At this year’s Metadata and Semantics Research Conference (MTSR2020), I just presented our work on Linked Data Scopes: an ontology to describe data manipulation steps. The paper was co-authored with Ivette Bonestroo, one of our Digital Humanities minor students, as well as Rik Hoekstra and Marijn Koolen from KNAW-HUC. The work builds on earlier work by the latter two co-authors and was conducted in the context of the CLARIAH-plus project.

This figure shows the envisioned use of the ontology: scholarly output is not only the research paper, but also an explicit data scope. This data scope includes (references to) datasets.

With the rise of data-driven methods in the humanities, it becomes necessary to develop reusable and consistent methodological patterns for dealing with the various data manipulation steps. This increases the transparency and replicability of the research. Data scopes present a qualitative framework for such methodological steps. In this work we present a Linked Data model to represent and share Data Scopes. The model consists of a central Data scope element, with linked elements for data Selection, Linking, Modeling, Normalisation and Classification. We validate the model by representing the data scope for 24 articles from two domains: Humanities and Social Science.
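As a rough illustration of the structure described above, the following sketch models a central data scope with linked steps of the five kinds. It is not the ontology itself: the class names, the `hasStep` property and the `http://example.org/datascope#` namespace are placeholders chosen for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepKind(Enum):
    """The five data manipulation steps named in the abstract."""
    SELECTION = "Selection"
    LINKING = "Linking"
    MODELING = "Modeling"
    NORMALISATION = "Normalisation"
    CLASSIFICATION = "Classification"


@dataclass
class Step:
    kind: StepKind
    description: str


@dataclass
class DataScope:
    """Central element that the individual steps are linked to."""
    identifier: str
    steps: list[Step] = field(default_factory=list)

    def triples(self, ns="http://example.org/datascope#"):
        """Yield (subject, predicate, object) tuples under a placeholder namespace."""
        scope = ns + self.identifier
        yield (scope, "rdf:type", ns + "DataScope")
        for i, step in enumerate(self.steps, start=1):
            node = f"{scope}/step{i}"
            yield (scope, ns + "hasStep", node)  # hypothetical property name
            yield (node, "rdf:type", ns + step.kind.value)
            yield (node, "rdfs:comment", step.description)


scope = DataScope("article-1", [
    Step(StepKind.SELECTION, "Selected newspaper articles from one decade"),
    Step(StepKind.NORMALISATION, "Normalised spelling variants of person names"),
])
for triple in scope.triples():
    print(triple)
```

Each step becomes its own node linked to the scope, which mirrors the "central element with linked elements" layout in the description.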

The ontology can be accessed at .

You can run live SPARQL queries on the extracted examples, represented as instances of this ontology, at

You can watch a pre-recorded video of my presentation below. Or you can check out the slides here [pdf]


The Benefits of Linking Metadata for Internal and External users of an Audiovisual Archive

[This post describes the Master Project work of Information Science students Tim de Bruyn and John Brooks and is based on their theses]

Audiovisual archives adopt structured vocabularies for their metadata management. With Semantic Web and Linked Data becoming more and more stable and commonplace technologies, organizations are now looking at linking these vocabularies to external sources, for example those of Wikidata, DBPedia or GeoNames.

However, the benefits of such endeavors to the organizations are generally underexplored. For their master project research, done in the form of an internship at the Netherlands Institute for Sound and Vision (NISV), Tim de Bruyn and John Brooks conducted a case study into the benefits of linking the “Common Thesaurus for Audiovisual Archives” (or GTAA) and the general-purpose dataset Wikidata. In their approach, they identified various use cases for user groups that are internal (Tim) as well as external (John) to the organization. Not only were use cases identified and matched to a partial alignment of GTAA and Wikidata, but several proof-of-concept prototypes that address these use cases were developed.


For the internal users, three cases were elaborated, including a calendar service where personnel receive a notification when 70 years have passed since the death of the author of a work, which changes the copyright status of that work. This information is retrieved from the Wikidata page of the author, aligned with the GTAA entry (see fig 1 above).
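The date check behind such a notification could look roughly like the sketch below. It is not the prototype's code. It assumes the dates of death have already been retrieved from Wikidata, and it treats the 70-year mark as the exact anniversary of the death, as the description above does. The function names and the example authors are made up for illustration.

```python
from datetime import date

COPYRIGHT_TERM_YEARS = 70  # term mentioned in the post


def term_end(date_of_death: date) -> date:
    """Return the day on which 70 years have passed since the author's death."""
    target_year = date_of_death.year + COPYRIGHT_TERM_YEARS
    try:
        return date_of_death.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in the target year: use 1 March instead.
        return date(target_year, 3, 1)


def authors_to_notify(deaths: dict[str, date], today: date) -> list[str]:
    """Return the names of authors whose 70-year mark falls on `today`."""
    return sorted(name for name, died in deaths.items() if term_end(died) == today)


# Hypothetical input: author name -> date of death.
deaths = {
    "Author A": date(1950, 5, 1),
    "Author B": date(1951, 5, 1),
}
print(authors_to_notify(deaths, today=date(2020, 5, 1)))
```

A calendar service would run this check daily, or precompute `term_end` for every aligned author and schedule the notifications ahead of time.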

A second internal case involves the new ‘story platform’ of NISV. Here Tim implemented a prototype end-user application to find stories related to the one currently shown to the user, based on persons occurring in that story (fig 2).

The external cases centered around the users of the CLARIAH Media Suite. For this extension, several humanities researchers were interviewed to identify worthwhile extensions with Wikidata information. Based on the outcomes of these interviews, John Brooks developed the Wikidata retrieval service (fig 3).

The research presented in the two theses is a good example of User-Centric Data Science, where affordances provided by data linkages are aligned with various user needs. The various tools were evaluated with end users to ensure they match their actual needs. The research was reported in a research paper which will be presented at the MTSR2018 conference: (Victor de Boer, Tim de Bruyn, John Brooks, Jesse de Vos. The Benefits of Linking Metadata for Internal and External users of an Audiovisual Archive. To appear in Proceedings of MTSR 2018 [Draft PDF])

Find out more:

See my slides for the MTSR presentation below



Big Data Europe Platform paper at ICWE 2017

With the launch of the Big Data Europe platform behind us, we are telling the world about our nice platform and the many pilots in the societal challenge domains that we have executed and evaluated. We wrote everything down in one comprehensive paper, which was accepted at the 17th International Conference on Web Engineering (ICWE 2017), to be held in Rome next month.

High-level BDE architecture (copied from the paper Auer et al.)

The paper “The BigDataEurope Platform – Supporting the Variety Dimension of Big Data” is co-written by a very large team (see below). It presents the BDE platform: an easy-to-deploy, easy-to-use and adaptable (cluster-based and standalone) platform for the execution of big data components and tools like Hadoop, Spark, Flink, Flume and Cassandra. To facilitate the processing of heterogeneous data, a particular innovation of the platform is the Semantic Layer, which makes it possible to process RDF data directly and to map and transform arbitrary data into RDF. The platform is based upon requirements gathered from seven of the societal challenges put forward by the European Commission in the Horizon 2020 programme and targeted by the BigDataEurope pilots. It is validated through pilot applications in each of these seven domains. A draft version of the paper can be found here.


The full reference is:

Sören Auer, Simon Scerri, Aad Versteden, Erika Pauwels, Angelos Charalambidis, Stasinos Konstantopoulos, Jens Lehmann, Hajira Jabeen, Ivan Ermilov, Gezim Sejdiu, Andreas Ikonomopoulos, Spyros Andronopoulos, Mandy Vlachogiannis, Charalambos Pappas, Athanasios Davettas, Iraklis A. Klampanos, Efstathios Grigoropoulos, Vangelis Karkaletsis, Victor de Boer, Ronald Siebes, Mohamed Nadjib Mami, Sergio Albani, Michele Lazzarini, Paulo Nunes, Emanuele Angiuli, Nikiforos Pittaras, George Giannakopoulos, Giorgos Argyriou, George Stamoulis, George Papadakis, Manolis Koubarakis, Pythagoras Karampiperis, Axel-Cyrille Ngonga Ngomo, Maria-Esther Vidal. The BigDataEurope Platform – Supporting the Variety Dimension of Big Data. Proceedings of The International Conference on Web Engineering (ICWE), ICWE2017, LNCS, Springer, 2017



Paper about automatic labeling in IJDL

Our paper “Evaluating Unsupervised Thesaurus-based Labeling of Audiovisual Content in an Archive Production Environment” was accepted for publication in the International Journal on Digital Libraries (IJDL). This paper, co-authored with Roeland Ordelman and Josefien Schuurman, reports on a series of information extraction experiments carried out at the Netherlands Institute for Sound and Vision (NISV). Specifically, in the paper we report on a two-stage evaluation of unsupervised labeling of audiovisual content using subtitles. We look at how such an approach can provide acceptable results given requirements with respect to archival quality, authority and service levels to external users.


For this, we developed a text extraction pipeline (TESS), pictured here, which extracts key terms and matches them to the NISV thesaurus, the GTAA. This journal paper is an extended version of the paper previously accepted at the TPDL conference. Here we add an analysis of the term extraction after it was taken into production, focusing on performance variation with respect to term types and television programs. Having implemented the procedure in our production workflow allows us to gradually develop the system further and to also assess the effect of the transformation from manual to automatic annotation from an end-user perspective.
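To give a feel for the general idea of extracting key terms from subtitles and matching them to a thesaurus, here is a deliberately naive sketch. It is not the TESS pipeline: it uses plain word frequency as a stand-in for term extraction and exact label lookup as a stand-in for thesaurus matching. The function names, the stopword list and the concept identifier are invented for this example.

```python
import re
from collections import Counter


def extract_key_terms(subtitles: str, stopwords: set[str], top_n: int = 5) -> list[str]:
    """Return the most frequent non-stopword tokens (a naive stand-in for term extraction)."""
    # [^\W\d_]+ matches runs of letters only, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", subtitles.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return [term for term, _ in counts.most_common(top_n)]


def match_to_thesaurus(terms: list[str], thesaurus: dict[str, str]) -> dict[str, str]:
    """Map each extracted term to a concept id when a thesaurus label matches exactly."""
    return {t: thesaurus[t] for t in terms if t in thesaurus}


# Hypothetical subtitle fragment and thesaurus excerpt (label -> made-up concept id).
subtitles = "De koningin bezoekt Amsterdam. Amsterdam verwelkomt de koningin."
thesaurus = {"amsterdam": "example:concept/1"}

terms = extract_key_terms(subtitles, stopwords={"de"}, top_n=2)
labels = match_to_thesaurus(terms, thesaurus)
print(terms, labels)
```

A production pipeline needs far more than this, for example handling multi-word terms and ambiguous labels, which is why the evaluation in the paper looks at performance per term type.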

The paper will appear on the Journal site shortly. A final draft version of the paper can be found here: deboer_ijdl2016evaluating_draft [PDF].




Two TPDL papers accepted!

Today, the TPDL (International Conference on Theory and Practice of Digital Libraries) results came in and both papers on which I am a co-author got accepted. Today is a good day 🙂

In the first paper, we present work done during my stay at the Netherlands Institute for Sound and Vision on automatic term extraction from subtitles. The interesting thing about this paper is that it mainly looks at how these algorithms function in a ‘real’ context, that is, within a larger media ecosystem. The paper was co-authored with Roeland Ordelman and Josefien Schuurman.

Screenshot of the QHP tool

On the second paper, I am one of the co-authors. In the paper “Supporting Exploration of Historical Perspectives across Collections”, we present an exploratory search application that highlights different perspectives on World War II across collections (including Verrijkt Koninkrijk). The project is funded through an Amsterdam Data Science seed project, together with Daan Odijk, research assistants Cristina Gârbacea and Thomas Schoegje, VU/CWI colleagues Laura Hollink and Jacco van Ossenbruggen, and historian Kees Ribbens (NIOD). You can read more about it on Daan’s blog.
