I am an assistant professor (UD) at the User-Centric Data Science group at the Computer Science department of the Vrije Universiteit Amsterdam (VU). I am also a senior research fellow at Netherlands Institute for Sound and Vision. In my research, I combine (Semantic) Web technologies with Human-Computer Interaction, Knowledge Representation and Information Extraction to tackle research challenges in various domains. These include Cultural Heritage, Digital Humanities and ICT for Development (ICT4D). More information on these projects can be found on this site or through my CV .
[This post describes the research of Michelle de Böck and is based on her MSc Information Sciences thesis.]
Digitization of cultural heritage content allows for the digital archiving, analysis and other processing of that content. The practice of scanning and transcribing books, newspapers and images, 3d-scanning artworks or digitizing music has opened up this heritage for example for digital humanities research or even for creative computing. However, with respect to the performing arts, including theater and more specifically dance, digitization is a serious research challenge. Several dance notation schemes exist, with the most established one being Labanotation, developed in 1920 by Rudolf von Laban. Labanotation uses a vertical staff notation to record human movement in time with various symbols for limbs, head movement, types and directions of movements.
Where for musical scores, good translations to digital formats exist (e.g. MIDI), for Lanabotation, these are lacking. While there are structured formats (LabanXML, MovementXML), the majority of content still only exists either in non-digitized form (on paper) or in scanned images. The research challenge of Michelle de Böck’s thesis therefore was to identify design features for a system capable of recognizing Labanotation from scanned images.
Michelle designed such a system and implemented this in MATLAB, focusing on a few movement symbols. Several approaches were developed and compared, including approaches using pre-trained neural networks for image recognition (AlexNet). This approach outperformed others, resulting in a classification accuracy of 78.4%. While we are still far from developing a full-fledged OCR system for Labanotation, this exploration has provided valuable insights into the feasibility and requirements of such a tool.
As part of the ESWC 2019 conference program, the ESWC PhD Symposium was held in wonderful Portoroz, Slovenia. The aim of the symposium, this year organized by Maria-Esther Vidal and myself, is to provide a forum for PhD students in the area of Semantic Web to present their work and discuss their projects with peers and mentors.
Even though this year, we received 5 submissions, all of the submissions were of high quality, so the full day symposium featured five talks by both early and middle/late stage PhD students. The draft papers can be found on the symposium web page and our opening slides can be found here. Students were mentored by amazing mentors to improve their papers and presentation slides. A big thank you to those mentors: Paul Groth, Rudi Studer, Maria Maleshkova, Philippe Cudre-Mauroux, and Andrea Giovanni Nuzzolese.
The program also featured a keynote by Stefan Schlobach, who talked about the road to a PhD “and back again”. He discussed a) setting realistic goals, b) finding your path towards those goals and c) being a responsible scientist and person after the goal is reached.
Students also presented their work through a poster session and the posters will also be found at the main conference poster session on tuesday 4 June.
On 23 May, as part of the VU ICT4D course, for the 6th time, W4RA and SIKS organized the annual symposium “Perspectives on ICT4D“. This year’s theme was how to tackle “Global Challenges” in a collaborative, trans-disciplinary way. Food Security is one of the Global Challenges Lia van Wesenbeeck – Director of the Amsterdam Centre for World Food Studies – gave a great presentation on “Tackling World Food Challenges”.
Our international speaker on the same topic, Mr. Seydou Tangara, coordinator of the AOPP, was unfortunately not able to join due to visa problems. He was replaced by prof. Hans Akkermans, who presented the Vienna manifesto on digital humanism and its relation to ICT4D.
Andre Baart from UvA talked about the CARPA project and challenges in developing applications for people in Mali while Jaap Gordijn discussed the need for business modelling for developing sustainable services, with interesting case studies from Sarawak, Malaysia.
The ICT4D students presented their voice application services during the coffee break. They demonstrated applications ranging from equipment-lending services to seed markets and weather services.
Last friday, the students of the class of 2018/2019 of the course Digital Humanities and Social Analytics in Practice presented the results of their capstone internship project. This course and project is the final element of the Digital Humanities and Social Analytics minor programme in which students from very different backgrounds gain skills and knowledge about the interdisciplinary topic.
The course took the form of a 4-week internship at an organization working with humanities or social science data and challenges and student groups were asked to use these skills and knowledge to address a research challenge. Projects ranged from cleaning, indexing, visualizing and analyzing humanities data sets to searching for bias in news coverage of political topics. The students showed their competences not only in their research work but also in communicating this research through great posters.
The complete list of student projects and collaborating institutions is below:
- “An eventful 80 years’ war” at Rijksmuseum identifying and mapping historical events from various sources.
- An investigation into the use of structured vocabularies also at the Rijksmuseum
- “Collecting and Modelling Event WW2 from Wikipedia and Wikidata” in collaboration with Netwerk Oorlogsbronnen (see poster image below)
- A project where an search index for Development documents governed by the NICC foundation was built.
- “EviDENce: Ego Documents Events modelliNg – how individuals recall mass violence” – in collaboration with KNAW Humanities Cluster (HUC)
- “Historical Ecology” – where students searched for mentions of animals in historical newspapers – also with KNAW-HUC
- Project MIGRANT: Mobilities and connection project in collaboration with KNAW-HUC and Huygens ING
- Capturing Bias with media data analysis – an internal project at VU looking at indentifying media bias
- Locating the CTA Archive Amsterdam where a geolocation service and search tool was built
- Linking Knowledge Graphs of Symbolic Music with the Web – also an internal project at VU working with Albert Merono
In the context of our ArchiMediaL project on Digital Architectural History, a number of student projects explored opportunities and challenges around enriching the colonialarchitecture.eu dataset. This dataset lists buildings and sites in countries outside of Europe that at the time were ruled by Europeans (1850-1970).
Patrick Brouwer wrote his IMM bachelor thesis “Crowdsourcing architectural knowledge: Experts versus non-experts” about the differences in annotation styles between architecture historical experts and non-expert crowd annotators. The data suggests that although crowdsourcing is a viable option for annotating this type of content. Also, expert annotations were of a higher quality than those of non-experts. The image below shows a screenshot of the user study survey.
Rouel de Romas also looked at crowdsourcing , but focused more on the user interaction and the interface involved in crowdsourcing. In his thesis “Enriching the metadata of European colonial maps with crowdsourcing” he -like Patrick- used the Accurator platform, developed by Chris Dijkshoorn. A screenshot is seen below. The results corroborate the previous study that the in most cases the annotations provided by the participants do meet the requirements provided by the architectural historian; thus, crowdsourcing is an effective method to enrich the metadata of European colonial maps.
[this post is based on Frank Walraven‘s Master thesis]
Who uses DBPedia anyway? This was the question that started a research project for Frank Walraven. This question came up during one of the meetings of the Dutch DBPedia chapter, of which VUA is a member. If usage and users are better understood, this can lead to better servicing of those users, by for example prioritizing the enrichment or improvement of specific sections of DBPedia Characterizing use(r)s of a Linked Open Data set is an inherently challenging task as in an open Web world, it is difficult to know who are accessing your digital resources. For his Msc project research, which he conducted at the Dutch National Library supervised by Enno Meijers , Frank used a hybrid approach using both a data-driven method based on user log analysis and a short survey of know users of the dataset. As a scope Frank selected just the Dutch DBPedia dataset.
For the data-driven part of the method, Frank used a complete user log of HTTP requests on the Dutch DBPedia. This log file (see link below) consisted of over 4.5 Million entries and logged both URI lookups and SPARQL endpoint requests. For this research only a subset of the URI lookups were concerned.
As a first analysis step, the requests’ origins IPs were categorized. Five classes can be identified (A-E), with the vast majority of IP addresses being in class “A”: Very large networks and bots. Most of the IP addresses in these lists could be traced back to search engine
indexing bots such as those from Yahoo or Google. In classes B-F, Frank manually traced the top 30 most encounterd IP-addresses, concluding that even there 60% of the requests came from bots, 10% definitely not from bots, with 30% remaining unclear.
The second analysis step in the data-driven method consisted of identifying what types of pages were most requested. To cluster the thousands of DBPedia URI request, Frank retriev
ed the ‘categories’ of the pages. These categories are extracted from Wikipedia category links. An example is the “Android_TV” resource, which has two categories: “Google” and “Android_(operating_system)”. Following skos:broader links, a ‘level 2 category’ could also be found to aggregate to an even higher level of abstraction. As not all resources have such categories, this does not give a complete image, but it does provide some ideas on the most popular categories of items requested. After normalizing for categories with large amounts of incoming links, for example the category “non-endangered animal”, the most popular categories where 1. Domestic & International movies, 2. Music, 3. Sports, 4. Dutch & International municipality information and 5. Books.
Frank also set up a user survey to corroborate this evidence. The survey contained questions about the how and why of the respondents Dutch DBPedia use, including the categories they were most interested in. The survey was distributed using the Dutch DBPedia websitea and via twitter however only attracted 5 respondents. This illustrates
the difficulty of the problem that users of the DBPedia resource are not necessarily easily reachable through communication channels. The five respondents were all quite closely related to the chapter but the results were interesting nonetheless. Most of the users used the DBPedia SPARQL endpoint. The full results of the survey can be found through Frank’s thesis, but in terms of corroboration the survey revealed that four out of the five categories found in the data-driven method were also identified in the top five resulting from the survey. The fifth one identified in the survey was ‘geography’, which could be matched to the fifth from the data-driven method.Frank’s research shows that although it remains a challenging problem, using a combination of data-driven and user-driven methods, it is indeed possible to get an indication into the most-used categories on DBPedia. Within the Dutch DBPedia Chapter, we are currently considering follow-up research questions based on Frank’s research.
[This post describes the Master Project work of Information Science students Tim de Bruyn and John Brooks and is based on their theses]
Audiovisual archives adopt structured vocabularies for their metadata management. With Semantic Web and Linked Data now becoming more and more stable and commonplace technologies, organizations are looking now at linking these vocabularies to external sources, for example those of Wikidata, DBPedia or GeoNames.
However, the benefits of such endeavors to the organizations are generally underexplored. For their master project research, done in the form of an internship at the Netherlands Institute for Sound and Vision (NISV), Tim de Bruyn and John Brooks conducted a case study into the benefits of linking the “Common Thesaurus for Audiovisual Archives” (or GTAA) and the general-purpose dataset Wikidata. In their approach, they identified various use cases for user groups that are both internal (Tim) as well as external (John) to the organization. Not only were use cases identified and matched to a partial alignment of GTAA and Wikidata, but several proof of concept prototypes that address these use cases were developed.
For the internal users, three cases were elaborated, including a calendar service where personnel receive notifications when an author of a work has passed away 70 years ago, thereby changing copyright status of the work. This information is retrieved from the Wikidata page of the author, aligned with the GTAA entry (see fig 1 above).
A second internal case involves the new ‘story platform’ of NISV. Here Tim implemented a prototype enduser application to find stories related to the one currently shown to the user, based on persons occuring in that story (fig 2).
The external cases centered around the users of the CLARIAH Media Suite. For this extension, several humanities researchers were interviewed to identify worthwile extensions with Wikidata information. Based on the outcomes of these interviews, John Brooks developed the Wikidata retrieval service (fig 3).
The research presented in the two theses are a good example of User-Centric Data Science, where affordances provided by data linkages are aligned with various user needs. The various tools were evaluated with end users to ensure they match their actual needs. The research was reported in a research paper which will be presented at the MTSR2018 conference: (Victor de Boer, Tim de Bruyn, John Brooks, Jesse de Vos. The Benefits of Linking Metadata for Internal and External users of an Audiovisual Archive. To appear in Proceedings of MTSR 2018 [Draft PDF])
Find out more:
See my slides for the MTSR presentation below
[This post describes the Information Sciences Master Project of Hameedat Omoine and is based on her thesis.]
In the quest to improve the lives of farmers and improve agricultural productivity in rural Burkina Faso, meteorological data has been identified as one of the is key information needs for local farmers. Various online weather information services are available, but many are not tailored specifically to tis target user group. In a research case study, Hameedat Omoine designed a weather information system that collects not only weather but also related agricultural information and provides the farmers with this information to allow them to improve agricultural productivity and the livelihood of the people of rural Burkina Faso.
The research and design of the system was conducted at and in collaboration with 2CoolMonkeys, a Utrecht-based Open data and App-development company with expertise in ICT for Development (ICT4D).
Following the design science research methodology, Hameedat investigated the requirements for a weather information system, and the possible options for ensuring the sustainability of the system. Using a structured approach, she developed the application and evaluated it in the field with potential Burkinabe end users. The mobile interface of the application featured weather information and crop advice (seen in the images above). A demonstration video is shown below
Hameedat developed multiple alternative models to investigate the sustainability of the application. For this she used the e3value approach and language. The image below shows a model for the case where a local radio station is involved.
At the DHBenelux 2018 conference, students from the VU minor “Digital Humanities and Social Analytics” presented their final DH in Practice work. In this video, the students talk about their experience in the minor and the internship projects. We also meet other participants of the conference talking about the need for interdisciplinary research.
All good things come to an end, and that also holds for our great Horizon2020 project “Big Data Europe“, in which we collaborated with a broad range of techincal and domain partners to develop (Semantic) Big Data infrastructure for a variety of domains. VU was involved as work package leader in the Pilot and Evaluation work package and co-developed methods to test and apply the BDE stack in Health, Traffic, Security and other domains..
You can read more about the end of the project in this blog post at the BDE website.