Who uses DBPedia anyway?

[This post is based on Frank Walraven’s Master’s thesis]

Who uses DBPedia anyway? This was the question that started a research project for Frank Walraven. It came up during one of the meetings of the Dutch DBPedia chapter, of which VUA is a member. If usage and users are better understood, this can lead to better servicing of those users, for example by prioritizing the enrichment or improvement of specific sections of DBPedia. Characterizing use(r)s of a Linked Open Data set is an inherently challenging task: in an open Web world, it is difficult to know who is accessing your digital resources. For his MSc research project, which he conducted at the Dutch National Library supervised by Enno Meijers, Frank used a hybrid approach combining a data-driven method based on user log analysis with a short survey of known users of the dataset. As a scope, Frank selected just the Dutch DBPedia dataset.

For the data-driven part of the method, Frank used a complete user log of HTTP requests on the Dutch DBPedia. This log file (see link below) consisted of over 4.5 million entries and logged both URI lookups and SPARQL endpoint requests. For this research, only a subset of the URI lookups was considered.

As a first analysis step, the requests’ origin IPs were categorized. Five classes can be identified (A-E), with the vast majority of IP addresses being in class “A”: very large networks and bots. Most of the IP addresses in this class could be traced back to search engine indexing bots such as those from Yahoo or Google. In the remaining classes, Frank manually traced the top 30 most encountered IP addresses, concluding that even there 60% of the requests came from bots and 10% definitely did not, with 30% remaining unclear.
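A first pass over such a log could look like the sketch below. It is only an illustration: the post does not say what format the Dutch DBPedia log uses, so the Combined Log Format pattern, the `/resource/` path prefix and the sample lines are all assumptions, and `top_ips` is a hypothetical helper name. It counts URI lookups per IP address so that the most active addresses can then be traced by hand.

```python
import re
from collections import Counter

# Assumed layout: Common/Combined Log Format. The real log may differ.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+'
)


def top_ips(lines, n=30, path_prefix="/resource/"):
    """Return the n IP addresses with the most URI lookups.

    Only requests whose path starts with path_prefix are counted, so
    SPARQL endpoint requests are left out (prefix is an assumption).
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(path_prefix):
            counts[match.group("ip")] += 1
    return counts.most_common(n)


# Made-up sample lines, for illustration only.
sample = [
    '66.249.66.1 - - [01/Jan/2017:00:00:01 +0100] "GET /resource/Amsterdam HTTP/1.1" 200 512',
    '66.249.66.1 - - [01/Jan/2017:00:00:02 +0100] "GET /resource/Android_TV HTTP/1.1" 200 498',
    '192.0.2.7 - - [01/Jan/2017:00:00:03 +0100] "GET /sparql?query=... HTTP/1.1" 200 1024',
    '192.0.2.7 - - [01/Jan/2017:00:00:04 +0100] "GET /resource/Amsterdam HTTP/1.1" 200 512',
]

for ip, hits in top_ips(sample):
    print(ip, hits)
```

Deciding whether an address belongs to a bot still needs a manual or external lookup; the count only tells you where to look first.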

The second analysis step in the data-driven method consisted of identifying what types of pages were most requested. To cluster the thousands of DBPedia URI requests, Frank retrieved the ‘categories’ of the pages. These categories are extracted from Wikipedia category links. An example is the “Android_TV” resource, which has two categories: “Google” and “Android_(operating_system)”. Following skos:broader links, a ‘level 2 category’ could also be found to aggregate to an even higher level of abstraction. As not all resources have such categories, this does not give a complete picture, but it does provide some idea of the most popular categories of items requested. After normalizing for categories with large numbers of incoming links, for example the category “non-endangered animal”, the most popular categories were 1. Domestic & International movies, 2. Music, 3. Sports, 4. Dutch & International municipality information and 5. Books.
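The aggregation described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the procedure from the thesis: the function names are hypothetical, the normalization (dividing hits by the number of resources linked to a category) is one plausible reading of "normalizing for categories with large numbers of incoming links", and apart from the Android_TV categories all data below is made up.

```python
from collections import Counter


def count_category_hits(requests, categories):
    """Count one hit per category for every requested resource.

    Resources without categories are skipped, which is why the
    resulting picture is incomplete.
    """
    hits = Counter()
    for resource in requests:
        for category in categories.get(resource, []):
            hits[category] += 1
    return hits


def roll_up(hits, broader):
    """Move hits one step up via skos:broader style links.

    Categories without a broader category keep their own hits.
    """
    rolled = Counter()
    for category, n in hits.items():
        for parent in broader.get(category, [category]):
            rolled[parent] += n
    return rolled


def normalize(hits, category_size):
    """Divide hits by category size (assumed normalization)."""
    return {
        category: n / category_size[category]
        for category, n in hits.items()
        if category_size.get(category)
    }


# Android_TV and its categories come from the post; the rest is invented.
requests = ["Android_TV", "Android_TV", "Amsterdam", "Unknown_Page"]
categories = {
    "Android_TV": ["Google", "Android_(operating_system)"],
    "Amsterdam": ["Capitals_in_Europe"],
}
broader = {
    "Google": ["Technology_companies"],
    "Android_(operating_system)": ["Google_software"],
}
category_size = {
    "Google": 4,
    "Android_(operating_system)": 2,
    "Capitals_in_Europe": 10,
}

hits = count_category_hits(requests, categories)
level2 = roll_up(hits, broader)
scores = normalize(hits, category_size)
```

Without the normalization step, very large categories would dominate the ranking simply because many resources link to them.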

Frank also set up a user survey to corroborate this evidence. The survey contained questions about the how and why of the respondents’ Dutch DBPedia use, including the categories they were most interested in. The survey was distributed through the Dutch DBPedia website and via Twitter, but it attracted only five respondents. This illustrates the difficulty of the problem: users of the DBPedia resource are not necessarily easy to reach through communication channels. The five respondents were all quite closely related to the chapter, but the results were interesting nonetheless. Most of the users used the DBPedia SPARQL endpoint. The full results of the survey can be found through Frank’s thesis, but in terms of corroboration the survey revealed that four out of the five categories found in the data-driven method were also identified in the top five resulting from the survey. The fifth one identified in the survey was ‘geography’, which could be matched to the fifth from the data-driven method.

Frank’s research shows that, although it remains a challenging problem, a combination of data-driven and user-driven methods makes it possible to get an indication of the most-used categories on DBPedia. Within the Dutch DBPedia Chapter, we are currently considering follow-up research questions based on Frank’s research.


Big Data Europe Project ended

All good things come to an end, and that also holds for our great Horizon2020 project “Big Data Europe“, in which we collaborated with a broad range of technical and domain partners to develop (Semantic) Big Data infrastructure for a variety of domains. VU was involved as work package leader in the Pilot and Evaluation work package and co-developed methods to test and apply the BDE stack in Health, Traffic, Security and other domains.

You can read more about the end of the project in this blog post at the BDE website.


SEMANTiCS2017

This year, I was conference chair of the SEMANTiCS conference, which was held 11-14 Sept in Amsterdam. The conference was in my view a great success, with over 310 visitors across the four days, 24 parallel sessions including academic and industry talks, six keynotes, three awards, many workshops and lots of cups of coffee. I will be posting more retrospectives soon, but below is a Storify item giving an idea of all the cool stuff that happened in the past week.


Big Data Europe Platform paper at ICWE 2017

With the launch of the Big Data Europe platform behind us, we are telling the world about our nice platform and the many pilots in the societal challenge domains that we have executed and evaluated. We wrote everything down in one comprehensive paper, which was accepted at the 17th International Conference on Web Engineering (ICWE 2017), to be held in Rome next month.

High-level BDE architecture (copied from the paper Auer et al.)

The paper “The BigDataEurope Platform – Supporting the Variety Dimension of Big Data” is co-written by a very large team (see below). It presents the BDE platform, an easy-to-deploy, easy-to-use and adaptable (cluster-based and standalone) platform for the execution of big data components and tools like Hadoop, Spark, Flink, Flume and Cassandra. To facilitate the processing of heterogeneous data, a particular innovation of the platform is the Semantic Layer, which makes it possible to process RDF data directly and to map and transform arbitrary data into RDF. The platform is based upon requirements gathered from seven of the societal challenges put forward by the European Commission in the Horizon 2020 programme and targeted by the BigDataEurope pilots. It is validated through pilot applications in each of these seven domains. A draft version of the paper can be found here.

 

The full reference is:

Sören Auer, Simon Scerri, Aad Versteden, Erika Pauwels, Angelos Charalambidis, Stasinos Konstantopoulos, Jens Lehmann, Hajira Jabeen, Ivan Ermilov, Gezim Sejdiu, Andreas Ikonomopoulos, Spyros Andronopoulos, Mandy Vlachogiannis, Charalambos Pappas, Athanasios Davettas, Iraklis A. Klampanos, Efstathios Grigoropoulos, Vangelis Karkaletsis, Victor de Boer, Ronald Siebes, Mohamed Nadjib Mami, Sergio Albani, Michele Lazzarini, Paulo Nunes, Emanuele Angiuli, Nikiforos Pittaras, George Giannakopoulos, Giorgos Argyriou, George Stamoulis, George Papadakis, Manolis Koubarakis, Pythagoras Karampiperis, Axel-Cyrille Ngonga Ngomo, Maria-Esther Vidal. The BigDataEurope Platform – Supporting the Variety Dimension of Big Data. Proceedings of the International Conference on Web Engineering (ICWE), ICWE2017, LNCS, Springer, 2017.

 


Web and Media at ICT.OPEN2017

On 21 and 22 March, researchers from VU’s Web and Media group attended ICT.OPEN, the principal ICT research conference in the Netherlands. Here over 500 scientists from all ICT research disciplines and interested researchers from industry come together to learn from each other, share ideas and network. The conference featured some great keynote speeches, including one from Nissan’s Erik Vinkhuyzen on the role of anthropological and sociological research in developing better self-driving cars. Barbara Terhal from Aachen University gave a demanding but well-presented talk on the challenges regarding robustness in quantum computing.

As in the previous year, the Web and Media group was well represented through multiple oral presentations with accompanying posters and demonstrations:

  • Oana Inel, Carlos Martinez and Victor de Boer presented DIVEplus. Oana did such a good job presenting the project in the main programme (see Oana’s DIVE+@ICTOpen2017 slides), through the demo and in front of a poster that the poster was selected as best poster in the SIKS track.
  • Benjamin Timmermans, Tobias Kuhn and Tibor Vermeij presented the ControCurator project with a demonstration and poster presentation. The demo showed the ControCurator human-machine framework for identifying controversy in multimodal data.
  • Tobias Kuhn discussed “Genuine Semantic Publishing” in the Computer Science track on the first day. His slides can be found here. After the talk there was a very interesting discussion about the role of the narrative writing process and how it would relate to semantic publishing.
  • Ronald Siebes and Victor de Boer then discussed how Big and Linked Data technologies developed in the Big Data Europe project are used to deliver pharmacological web-services for drug discovery. You can read more in Ronald’s blog post.
  • Benjamin Timmermans and Zoltan Szlavik also presented the CrowdTruth demonstrator, which is shown in this short demonstrator video.
  • Sabrina Sauer presented the MediaNow project with a nice poster titled MediaNow – using a living lab method to understand media professionals’ exploratory search.

 


Hands on BDE Health at ICT.OPEN 2017

[This post was written by Ronald Siebes and crossposted at big-data-europe.eu and wm.cs.vu.nl]

Last week, BigDataEurope was present at the principal ICT research conference in the Netherlands, ICT.OPEN, where over 500 scientists from all ICT research disciplines & interested researchers from industry come together to learn from each other, share ideas and network.

This was the first time that NWO, the Netherlands Organisation for Scientific Research, included a “Health” track, a recognition of the increased importance of ICT in the domains of diagnosis, drug discovery and health care. We presented a short paper written by Ronald Siebes, Victor de Boer, Bryn Williams-Jones, Kiera McNeice and Stian Soiland-Reyes covering the current state of the SC1 “Health, demographic change and well-being” pilot, which implements the Open PHACTS functionality on the Big Data Europe infrastructure.

We succeeded in demonstrating the ease of use and practical value of the SC1 pilot for researchers in the domain of Drug Discovery and developers of Big Linked Data solutions and are looking forward to further strengthening our collaboration with the various. The paper was accepted as a poster presentation and was also selected for an oral presentation in the “Health & ICT” track.

 


Big Data Europe Youtube channel

For those who are curious about the Big Data Europe technology stack and would rather watch videos than read descriptions and documentation, we have started a YouTube channel where BDE researchers explain the how, why and what of the BDE stack. Embedded below is a short clip of Hajira Jabeen explaining how BDE enables someone to get started with Big Data. More clips are available on the channel.


A Look Back at the 2nd BDE Workshop on Big Data in Health, Demographic Change and Wellbeing

[reblogged from Big-Data-Europe.eu]

On 9 December 2016, the second workshop for the Big Data Europe Health, Demographic Change and Wellbeing societal challenge was held in Brussels. The aim of this workshop was to highlight progress from the BigDataEurope project in building the foundations of a generically applicable big data platform which can be applied across all Horizon 2020 societal challenges. This workshop specifically focused on health, and showcased our first pilot’s application to early bioscience research data.

The workshop in full effect

The workshop had 15 participants, from within the health domain and outside it, including many participants from the European Commission. Together we discussed different perspectives on how we may use appropriate H2020 instruments and work programmes to better integrate the ecosystem of linked data repositories, data management services and virtual collaboration environments to increase the pace of knowledge sharing in health.

The workshop featured presentations from BDE’s Simon Scerri and Aad Versteden on the general goals and progress of the BigDataEurope project and the BDE infrastructure respectively. After lunch, Ronald Siebes (BDE / VU Amsterdam) presented the first pilot in this specific domain. More information on that pilot can be found here. An extensive round-table discussion followed, in which possible options for new applications and connections were considered.

Snapshot of the SC1 pilot interface, as presented by Ronald Siebes

One question raised was whether the generic BDE infrastructure can be used by European SMEs. The fact that the BDE infrastructure is completely Open Source, very easy to install and features intuitive interface components makes re-use relatively simple even for smaller institutions and companies.

A significant part of the discussion focussed on possible new use cases for expanding the scope of the pilot. One suggestion was to look at post-hoc integration of clinical data, which represents a typical problem of data ‘variance’. This would require integrating information from different versions of medical questionnaires, which may be recorded or stored in different ways. Data provenance is also a key concern, as keeping a trail of what has happened to clinical data is crucial to tracking patients’ histories. Once integrated, this data could then be mined to identify biases or data patterns.

Finally, the workshop participants discussed potential connections to other European projects. Here many projects were mentioned including the MIDAS project, the Big-O project on childhood obesity, the PULSE projects and IMI / IMI2 projects including EMIF. We will be seeking collaborations with these projects and will continue to develop new and interesting Big Data use cases in this domain in the coming year.

More images can be found below: BDE Health Workshop SC 1.2


Web of Voices and W4RA video at the Webscience@10 TV Channel

For its 10th anniversary, the Web Science Trust organized the Webscience@10 event. For this event, a Webscience@10 TV channel was launched to showcase different research and education initiatives around the world. On behalf of the VU Network Institute and W4RA, we submitted our Web of Voices video as well as a short introduction to the W4RA team.

You can watch the ~10 hours of video content at  http://www.webscience.org/webscience10/tv-channel-webscience10/. You can find us (listed under Netwerk Institute Amsterdam) at 2h31mins:


Installing and Running the First Big-Data-Europe Health Pilot

[This blog post is reblogged from big-data-europe.eu and written by Ronald Siebes and Victor de Boer]

As previously announced, the pilot implementation for the Big-Data-Europe platform for Societal Challenge 1 (the Health domain) facilitates the Open PHACTS Discovery Platform functionality. The Open PHACTS platform is built for researchers in Drug Discovery. It uses databases of physicochemical and pharmacological properties stored in an RDF triple store. This interconnected data is exposed through a Linked Data API composed of interoperable data. The system caches query results via a Memcached module. In the context of the SC1 pilot, most of the platform’s functionality has now been successfully replicated via Docker containers on the BDE infrastructure.
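To give a rough idea of what caching query results in front of a triple store means, here is a minimal cache-aside sketch in Python. It is not the Open PHACTS implementation: `CachedApi`, `fake_triple_store` and the query string are hypothetical, the returned data is made up, and a plain dictionary stands in for Memcached so that the example runs without any server.

```python
import hashlib


class CachedApi:
    """Cache-aside wrapper around a query function (illustrative only)."""

    def __init__(self, run_query, cache=None):
        self._run_query = run_query
        # A dict stands in for a Memcached client in this sketch.
        self._cache = {} if cache is None else cache

    def get(self, query):
        # Hash the query text: Memcached keys are limited to 250 bytes
        # and may not contain whitespace, so raw queries are unsuitable.
        key = hashlib.sha1(query.encode("utf-8")).hexdigest()
        if key in self._cache:
            return self._cache[key]
        result = self._run_query(query)
        self._cache[key] = result
        return result


calls = []


def fake_triple_store(query):
    """Hypothetical backend that records how often it is queried."""
    calls.append(query)
    return [{"compound": "aspirin"}]  # made-up result


api = CachedApi(fake_triple_store)
first = api.get("SELECT ?compound WHERE { ?compound a ?type }")
second = api.get("SELECT ?compound WHERE { ?compound a ?type }")
print(len(calls))  # the backend is only hit on the first call
```

The second call is answered from the cache, so repeated API requests for the same data do not reach the triple store again.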

The Open PHACTS platform architecture

Please do try this at home! The pilot can be installed on Linux (through Docker Compose) or Windows (through Docker Toolbox). Installation instructions are available on the pilot’s GitHub page. By design, the technology itself is independent of the domain. Once you are familiar with the code and have it running yourself, you should have enough experience to upload your own Linked Data and create your own API.
