Who uses DBPedia anyway?

[This post is based on Frank Walraven's Master's thesis]

Who uses DBPedia anyway? This was the question that started a research project for Frank Walraven. The question came up during one of the meetings of the Dutch DBPedia chapter, of which VUA is a member. If usage and users are better understood, this can lead to better servicing of those users, for example by prioritizing the enrichment or improvement of specific sections of DBPedia.

Characterizing the use(r)s of a Linked Open Data set is an inherently challenging task: in an open Web world, it is difficult to know who is accessing your digital resources. For his MSc research project, which he conducted at the Dutch National Library under the supervision of Enno Meijers, Frank used a hybrid approach: a data-driven method based on user log analysis, combined with a short survey of known users of the dataset. Frank scoped his research to the Dutch DBPedia dataset.

For the data-driven part of the method, Frank used a complete user log of HTTP requests on the Dutch DBPedia. This log file (see link below) consisted of over 4.5 million entries and logged both URI lookups and SPARQL endpoint requests. For this research, only a subset of the URI lookups was considered.

As a first analysis step, the requests' origin IPs were categorized. Five classes were identified (A-E), with the vast majority of IP addresses falling in class "A": very large networks and bots. Most of the IP addresses in this class could be traced back to search engine indexing bots such as those from Yahoo or Google. For classes B-E, Frank manually traced the top 30 most frequently encountered IP addresses, concluding that even there 60% of the requests came from bots and 10% definitely did not, with the remaining 30% unclear.
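As a rough illustration of this triage step, bot detection on a request log can be sketched as follows; the tab-separated log format and the crawler patterns are hypothetical stand-ins, not the actual heuristics from the thesis.

```python
import re

# Hypothetical tab-separated log format: IP, request line, user agent.
BOT_PATTERNS = re.compile(r"googlebot|slurp|bingbot|spider|crawler", re.I)

def parse_log_line(line):
    ip, request, agent = line.split("\t")
    return {"ip": ip, "request": request, "agent": agent}

def is_bot(entry):
    # Flag entries whose user agent matches a known crawler pattern.
    return bool(BOT_PATTERNS.search(entry["agent"]))

log = [
    "66.249.66.1\tGET /resource/Android_TV\tMozilla/5.0 (compatible; Googlebot/2.1)",
    "145.100.1.2\tGET /resource/Amsterdam\tMozilla/5.0 (X11; Linux x86_64)",
]
entries = [parse_log_line(line) for line in log]
bots = [e for e in entries if is_bot(e)]
```

In practice user agents alone do not settle the question, which is exactly why the manual tracing of top IP addresses was needed.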

The second analysis step in the data-driven method consisted of identifying which types of pages were requested most. To cluster the thousands of DBPedia URI requests, Frank retrieved the 'categories' of the pages. These categories are extracted from Wikipedia category links. An example is the "Android_TV" resource, which has two categories: "Google" and "Android_(operating_system)". By following skos:broader links, a 'level 2 category' could also be found, to aggregate to an even higher level of abstraction. As not all resources have such categories, this does not give a complete picture, but it does provide an idea of the most popular categories of requested items. After normalizing for categories with large numbers of incoming links, for example the category "non-endangered animal", the most popular categories were: 1. domestic and international movies, 2. music, 3. sports, 4. Dutch and international municipality information, and 5. books.

Frank also set up a user survey to corroborate this evidence. The survey contained questions about the how and why of the respondents' Dutch DBPedia use, including the categories they were most interested in. The survey was distributed through the Dutch DBPedia website and via Twitter, but attracted only 5 respondents. This illustrates the difficulty of the problem: users of the DBPedia resource are not necessarily easily reachable through communication channels. The five respondents were all quite closely related to the chapter, but the results were interesting nonetheless. Most of the users used the DBPedia SPARQL endpoint. The full results of the survey can be found in Frank's thesis, but in terms of corroboration the survey revealed that four of the five categories found in the data-driven method also appeared in the survey's top five. The fifth category identified in the survey was 'geography', which could be matched to the fifth from the data-driven method.

Frank's research shows that, although it remains a challenging problem, a combination of data-driven and user-driven methods can indeed give an indication of the most-used categories on DBPedia. Within the Dutch DBPedia chapter, we are currently considering follow-up research questions based on Frank's research.

Share This:

SEMANTiCS2017

This year, I was conference chair of the SEMANTiCS conference, which was held 11-14 Sept in Amsterdam. The conference was in my view a great success, with over 310 visitors across the four days, 24 parallel sessions including academic and industry talks, six keynotes, three awards, many workshops and lots of cups of coffee. I will be posting more looks back soon, but below is a storify item giving an idea of all the cool stuff that happened in the past week.


A look back at Downscale2016

On 29 August, the 4th International Workshop on Downscaling the Semantic Web (Downscale2016) was held as a full-day workshop in Amsterdam, co-located with the ICT4S conference. The workshop attracted 12 participants and we received 4 invited paper contributions, which were presented and discussed in the morning session (slides can be found below). These papers describe issues regarding the sustainability of ICT4D approaches, specific downscaled solutions for two ICT4D use cases, and a system for distributed publishing and consuming of Linked Data. The afternoon session was reserved for demonstrations and discussions. An introduction to the Kasadaka platform was followed by an in-depth how-to on developing voice-based information services using Linked Data. The papers and the descriptions of the demos are gathered in a proceedings (published online at figshare: doi:10.6084/m9.figshare.3827052.v1).

Downscale2016 participants (photo: Kim Bosman)

During the discussions the issue of sustainability was addressed. Different dimensions of sustainability were discussed (technical, economical, social and environmental). The participants agreed that a holistic approach is needed for successful and sustainable ICT4D, and that most of these dimensions were indeed present in the four presentations and in the design of the Kasadaka platform. A question remains as to how different architectural solutions for services (centralized, decentralized, cloud services) relate to each other in terms of sustainability, and when a choice for one of these is most suited. Discussion then moved towards different technical opportunities for green power supplies, including solar panels.

The main presentations and slides are listed below:

  • Downscale2016 introduction (Victor and Anna) (slides)
  • Jari Ferguson and Kim Bosman. The Kasadaka Weather Forecast Service (slides)
  • Aske Robenhagen and Bart Aulbers. The Mali Milk Service – a voice-based platform for enabling farmer networking and connections with buyers. (slides)
  • Anna Bon, Jaap Gordijn et al. A Structured Model-Based Approach To Preview Sustainability in ICT4D (slides)
  • Mihai Gramada and Christophe Gueret. Low profile data sharing with the Entity Registry System (ERS) (slides)


MSc project: Low-Bandwidth Semantic Web

[This post is based on the Information Sciences MSc. thesis by Onno Valkering]

To make widespread knowledge sharing possible in rural areas in developing countries, the notion of the Web has to be downscaled to the specific low-resource infrastructure in place. In this work, we introduce SPARQL over SMS, a solution for exchanging RDF data in which HTTP is substituted by SMS, enabling Web-like data exchange over cellular networks.

SPARQL over SMS architecture

The solution uses converters that take outgoing SPARQL queries sent over HTTP and convert them into SMS messages sent to phone numbers (see architecture image). On the receiver-side, the messages are converted back to standard SPARQL requests.

The converters use various data compression strategies to ensure optimal use of the SMS bandwidth. These include both zip-based compression and the removal of redundant data through the use of common background vocabularies. The thesis presents the design and implementation of the solution, along with evaluations of the different data compression methods.
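A minimal sketch of the zip-based strategy (the framing, per-message budget and helper names are my own assumptions, not Onno's implementation): compress the query, encode it for SMS-safe transport, and split it into message-sized segments.

```python
import base64
import zlib

SMS_CHARS = 140  # assumed per-message budget for the encoded payload

def to_sms(query: str) -> list[str]:
    # Compress, then base64-encode so the payload survives SMS transport.
    payload = base64.b64encode(zlib.compress(query.encode())).decode()
    return [payload[i:i + SMS_CHARS] for i in range(0, len(payload), SMS_CHARS)]

def from_sms(parts: list[str]) -> str:
    # Receiver side: reassemble, decode and decompress back to SPARQL.
    return zlib.decompress(base64.b64decode("".join(parts))).decode()

query = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE { ?s rdfs:label ?label } LIMIT 10"""
parts = to_sms(query)
```

The vocabulary-based removal of redundant data would go a step further, replacing common URIs and prefixes with short codes before compression.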

Test setup with two Kasadakas

The application is validated in two real-world ICT for Development (ICT4D) cases that both use the Kasadaka platform: 1) an extension of the DigiVet application allows sending information related to veterinary symptoms and diagnoses across different distributed systems; 2) an extension of the RadioMarche application involves retrieving and adding current offerings in the market information system, including the phone numbers of the advertisers.

For more information:

  • Download Onno’s Thesis. A version of the thesis is currently under review.
  • The slides for Onno’s presentation are also available: Onno Valkering
  • View the application code at https://github.com/onnovalkering/sparql-over-sms

 


Connecting collections across national borders

As audiovisual archives digitize their collections and make them available online, the need arises to also establish connections between different collections and to allow for cross-collection search and browsing. Structured vocabularies can serve as connecting points by aligning thesauri from different institutions. The project "Gemeenschappelijke Thesaurus voor Uniforme Ontsluiting" was funded by the Taalunie (a cross-national organization focusing on the Dutch language) and executed by the Netherlands Institute for Sound and Vision and the Flemish VIAA archive. It involved a case study in which partial collections of the two archives were connected by aligning their thesauri. This required converting the VRT thesaurus to the SKOS format and linking it to Sound and Vision's GTAA thesaurus. The interactive alignment tool CultuurLINK, made by the Dutch company Spinque, was used to align the two thesauri (see the screenshot above).

 

The links between the collections can be explored using a cross-collection browser, also built by Spinque. This allows users to search and explore connections between the two collections. Unfortunately, the collections are not publicly available so the demonstrator is password-protected, but a publicly accessible screencast (below) shows the functionalities.

The full report can be accessed through the VIAA site. There, you can also find a blog post in Dutch.

Update: a paper about this has been accepted for publication:

  • Victor de Boer, Matthias Priem, Michiel Hildebrand, Nico Verplancke, Arjen de Vries and Johan Oomen. Exploring Audiovisual Archives through Aligned Thesauri. To appear in Proceedings of the 10th Metadata and Semantics Research Conference. [Draft PDF]


Clarin video showcases Dutch Ships and Sailors project

The CLARIN framework commissioned the production of dissemination videos showcasing the outcomes of the individual CLARIN projects. One of these projects was the Dutch Ships and Sailors project, a collaboration between VU Computer Science, VU humanities and the Huygens Institute for National History. In this project, we developed a heterogeneous linked data cloud connecting many different maritime databases. This data cloud allows for new types of integrated browsing and new historical research questions. In the video, we (Victor de Boer together with historians Jur Leinenga and Rik Hoekstra) explain how the data cloud was formed and how it can be used by maritime historians.

CLARIN Dutch Ships & Sailors from CLARIN-NL (Dutch, with Dutch or English subtitles). See also other DSS-related posts on this website.

 


CultuurLINK Linking Award

Happy and surprised to find the first (and so far only) CultuurLINK Linking Award in my mailbox yesterday! I checked with the nice people over at Spinque.com and it turns out it was a token of appreciation for being a prolific CultuurLINK user 🙂

I think the vocabulary alignment tool is great and easy to work with, so I can recommend it to anyone with a SKOS vocabulary who wants to match it with any of the major cultural thesauri in the ‘Hub’. Thanks to the people at Spinque for the great tool and the nice gesture!



Linked Data for International Aid Transparency Initiative

In August 2013, VU MSc student Kasper Brandt finished his thesis on developing, implementing and testing a Linked Data model for the International Aid Transparency Initiative (IATI). Now, more than a year later, that work has been accepted for publication in the Journal on Data Semantics. We are very happy with this excellent result.

Model fragment

IATI is a multi-stakeholder initiative that seeks to improve the transparency of development aid and that, to this end, developed an open standard for the publication of aid information. Hundreds of NGOs and governments have registered with the IATI registry by publishing their aid activities in this XML standard. Taking the IATI model as input, we created a Linked Data model based on requirements elicited from qualitative interviews, using an iterative requirements engineering methodology. We converted the IATI open data from the central registry to Linked Data and linked it to various other datasets such as World Bank indicators and DBPedia information. This dataset is made available for re-use at http://semanticweb.cs.vu.nl/iati.

Screenshot of an application bringing together information from multiple datasets

To demonstrate the added value of this Linked Data approach, we created several applications that combine information from the IATI dataset and the datasets it was linked to. We have shown that creating Linked Data for the IATI dataset and linking it to other datasets gives valuable new insights into aid transparency. Based on actual information needs of IATI users, we were able to show that linking IATI data adds significant value to the data and can fulfill the needs of IATI users.

A draft of the paper can be found here.


DIVE wins 3rd prize in Semantic Web Challenge!

During last week's International Semantic Web Conference (ISWC2014) in Riva del Garda, the DIVE team presented a demonstration prototype of the DIVE tool (which you can play around with live at http://dive.beeldengeluid.nl). We submitted DIVE to the Open Track of the yearly Semantic Web Challenge for Semantic Web tools and applications. Initially, we were invited to give a poster presentation on the first day of the conference, and after very positive reviews we progressed to the challenge final.

For this final we were asked to present the tool and give a live demonstration in front of the ISWC2014 crowd. Apparently the jury appreciated the effort, since DIVE was awarded the third prize. The prize included a nice certificate as well as $1000,- sponsored by Elsevier.

This was a real team effort, but I think much of the praise goes to our partners at Frontwise. They built a very cool, very responsive and intuitive User Experience on top of our SPARQL endpoint. Great work! Also thanks to the people at Beeld en Geluid and KB for their assistance with delivering data in a timely fashion and of course the people at VU for their enrichment of the data. Great teamwork everyone! Embedded below you find the poster and the presentation. The paper is found here.

The presentation:

[slideshare id=40588578&w=425&h=355&style=border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&sc=no]

The poster:

[slideshare id=40542300&w=477&h=510&style=border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&sc=no]


Master Project Andrea Bravo Balado: Linking Historical Ship Records to Newspaper Archives

[This post was written by Andrea Bravo Balado and is cross-posted at her own blog. It describes her MSc project, supervised by me]

Linking historical datasets and making them available on the Web has increasingly become a subject of research in the field of digital humanities. In the Netherlands, history is intimately related to maritime activity, which has been essential in the development of economic, social and cultural aspects of Dutch society. As such an important sector, it has been well documented by shipping companies, governments, newspapers and other institutions.

In this master project we assume that, given the importance of maritime activity in everyday life in the 19th and 20th centuries, announcements of the departures and arrivals of ships, or mentions of accidents and other events, can be found in newspapers.

We have taken a two-stage approach: first, a heuristic-based method for record linkage, and then machine-learning algorithms for article classification, used for filtering in combination with domain features. Evaluation of the linking method has shown that certain domain features were indicative of mentions of ships in newspapers. Moreover, the classifier methods scored near-perfect precision in predicting ship-related articles.
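As a sketch of the classification stage, here is a bag-of-words classifier over toy snippets that predicts whether an article is ship-related; the training data, features and model choice are illustrative assumptions, not Andrea's actual corpus or setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled snippets: 1 = ship-related, 0 = not ship-related.
texts = [
    "ship Anna arrived from Batavia with cargo",
    "the barque departed Rotterdam for New York",
    "municipal council debates new tax measures",
    "theatre review of last night's performance",
]
labels = [1, 1, 0, 0]

# Tf-idf features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
pred = clf.predict(["schooner arrived in port from Java"])[0]
```

In the actual pipeline such predictions act as a filter, so that only ship-related articles are considered as link candidates for the historical ship records.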

Enriching historical ship records with links to newspaper archives is significant for the digital history community, since it connects two datasets that would otherwise have required extensive annotation work and many hours to align. Our work is part of the Dutch Ships and Sailors Linked Data Cloud project. Check out Andrea's thesis [pdf].

[googleapps domain=”docs” dir=”presentation/d/1HSzQIWc5SX4AGjOsOlja6gF-n44OwGJRxixklUSQ6Gs/embed” query=”start=false&loop=false&delayms=30000″ width=”680″ height=”411″ /]
