Comparing Synthetic Data Generation Tools for IoT Data

[This post is based on the Bachelor Information Sciences project of Darin Pavlov and reuses text from his thesis. The research is part of VU’s effort in the InterConnect project and was supervised by Roderick van der Weerdt]

The concepts and technologies behind the Internet of Things (IoT) make it possible to establish networks of interconnected smart devices. Such networks can produce large volumes of data transmitted through sensors and actuators. Machine Learning can play a key role in processing this data for use cases in domains such as automotive, healthcare and manufacturing. However, access to data for developing and testing Machine Learning models is often hindered by the sensitivity of the data and by privacy concerns.

One solution to this problem is to use synthetic data that resembles real data as closely as possible. In his study, Darin Pavlov conducted a set of experiments investigating the effectiveness of synthetic IoT data generation by three different tools: Mostly AI, Gretel.ai and SDV.

This table shows the results of one of the two Machine Learning detection tests, which measure how difficult it is for a Machine Learning model to differentiate the synthetic data from the real data. For two datasets, the result is calculated as 1 minus the average ROC AUC score.
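To make the "1 minus the average ROC AUC" score concrete, here is a minimal sketch in plain Python. It is not the code used in the thesis: the function names and the detector outputs are hypothetical, and the ROC AUC is computed with the standard pairwise-ranking formulation rather than with any particular library.

```python
from statistics import mean


def roc_auc(labels, scores):
    """ROC AUC via the pairwise-ranking (Mann-Whitney U) formulation.

    labels: 1 for real rows, 0 for synthetic rows.
    scores: the detector's predicted probability that a row is real.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one real and one synthetic row")
    # Count pairs where the real row is scored above the synthetic row;
    # a tie counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def detection_score(folds):
    """Return 1 minus the average ROC AUC over cross-validation folds."""
    return 1.0 - mean(roc_auc(labels, scores) for labels, scores in folds)


# Hypothetical detector outputs for two folds (illustrative numbers only).
folds = [
    ([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]),
    ([1, 1, 0, 0], [0.8, 0.7, 0.3, 0.2]),
]
print(detection_score(folds))
```

Under this definition, a detector that separates real from synthetic rows perfectly (ROC AUC of 1) yields a score of 0, while a detector that does no better than chance (ROC AUC of 0.5) yields 0.5.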

Darin compared the tools on various distinguishability metrics. He observed that Mostly AI outperforms the other two generators, although Gretel.ai shows similar satisfactory results on the statistical metrics. The output of SDV on the other hand is poor on all metrics. Through this study we aim to encourage future research within the quickly developing area of synthetic data generation in the context of IoT technology.

More details can be found in Darin’s thesis.

Interconnect Project kickoff

On 1 October 2019, the Horizon2020 Interconnect project started. The goal of this large and ambitious project is to achieve a relevant milestone in the democratization of efficient energy management, through a flexible and interoperable ecosystem where distributed energy resources can be soundly integrated with effective benefits to end-users.

To this end, its 51 partners (!) will develop an interoperable IoT and smart-grid infrastructure, based on Semantic technologies, that includes various end-user services. The results will be validated using 7 pilots in EU member states, including one in the Netherlands with 200 apartments.

The role of VU is to extend and validate, in close collaboration with TNO, the SAREF ontology for IoT as well as other relevant ontologies. VU will lead a task on developing Machine Learning solutions on Knowledge Graphs and will extend these solutions towards usable middle layers for user-centric ML services in the pilots, specifically in the aforementioned Dutch pilot, where VU will collaborate with TNO, VolkerWessel iCity and Hyrde.

Interconnect team photo, taken at the location of the kickoff meeting: the FC Porto stadium

The ESWC2019 PhD Symposium

As part of the ESWC 2019 conference program, the ESWC PhD Symposium was held in wonderful Portoroz, Slovenia. The aim of the symposium, this year organized by Maria-Esther Vidal and myself, is to provide a forum for PhD students in the area of Semantic Web to present their work and discuss their projects with peers and mentors.

Even though we received only 5 submissions this year, all of them were of high quality, so the full-day symposium featured five talks by both early and middle/late stage PhD students. The draft papers can be found on the symposium web page and our opening slides can be found here. Students were mentored by amazing mentors to improve their papers and presentation slides. A big thank you to those mentors: Paul Groth, Rudi Studer, Maria Maleshkova, Philippe Cudre-Mauroux, and Andrea Giovanni Nuzzolese.

The program also featured a keynote by Stefan Schlobach, who talked about the road to a PhD “and back again”. He discussed a) setting realistic goals, b) finding your path towards those goals and c) being a responsible scientist and person after the goal is reached.

Students also presented their work through a poster session, and the posters can also be found at the main conference poster session on Tuesday 4 June.

Who uses DBPedia anyway?

[This post is based on Frank Walraven’s Master thesis]

Who uses DBPedia anyway? This was the question that started a research project for Frank Walraven. The question came up during one of the meetings of the Dutch DBPedia chapter, of which VUA is a member. If usage and users are better understood, this can lead to better servicing of those users, for example by prioritizing the enrichment or improvement of specific sections of DBPedia.

Characterizing use(r)s of a Linked Open Data set is an inherently challenging task, as in an open Web world it is difficult to know who is accessing your digital resources. For his MSc project research, which he conducted at the Dutch National Library supervised by Enno Meijers, Frank used a hybrid approach that combines a data-driven method based on user log analysis with a short survey of known users of the dataset. As a scope, Frank selected just the Dutch DBPedia dataset.

For the data-driven part of the method, Frank used a complete user log of HTTP requests on the Dutch DBPedia. This log file (see link below) consisted of over 4.5 million entries and logged both URI lookups and SPARQL endpoint requests. For this research, only a subset of the URI lookups was considered.
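As an illustration of how such a log could be split into URI lookups and SPARQL endpoint requests, here is a minimal Python sketch. The log format (Common Log Format), the path prefixes and the sample lines are all assumptions made for the example; the post does not describe the actual format of the Dutch DBPedia log.

```python
from collections import Counter
from urllib.parse import urlsplit


def request_kind(path):
    """Classify a request path as a SPARQL query, a URI lookup, or other.

    The path prefixes are assumptions for this sketch.
    """
    clean = urlsplit(path).path
    if clean.startswith("/sparql"):
        return "sparql"
    if clean.startswith(("/resource/", "/page/", "/data/")):
        return "uri_lookup"
    return "other"


def count_kinds(log_lines):
    """Count request kinds in (assumed) Common Log Format lines."""
    counts = Counter()
    for line in log_lines:
        try:
            # The request is the first quoted field: "GET /path HTTP/1.1"
            request = line.split('"')[1]
            path = request.split()[1]
        except IndexError:
            counts["malformed"] += 1
            continue
        counts[request_kind(path)] += 1
    return counts


# Hypothetical log lines using documentation-only IP addresses.
sample = [
    '192.0.2.1 - - [01/Jan/2016:00:00:01 +0100] "GET /resource/Android_TV HTTP/1.1" 200 512',
    '198.51.100.7 - - [01/Jan/2016:00:00:02 +0100] "GET /sparql?query=SELECT+%2A HTTP/1.1" 200 1024',
    'garbage line',
]
print(count_kinds(sample))
```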

As a first analysis step, the origin IP addresses of the requests were categorized. Five classes can be identified (A-E), with the vast majority of IP addresses being in class “A”: very large networks and bots. Most of the IP addresses in these lists could be traced back to search engine indexing bots such as those from Yahoo or Google. In classes B-F, Frank manually traced the top 30 most encountered IP addresses, concluding that even there 60% of the requests came from bots and 10% definitely did not come from bots, with the remaining 30% unclear.
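The post does not define the classes. If they correspond to the traditional classful IPv4 address ranges (an assumption, not something the thesis summary confirms), the categorization step could be sketched as follows, using only Python's standard `ipaddress` module. The function names and sample addresses are hypothetical.

```python
import ipaddress
from collections import Counter


def ipv4_class(address):
    """Return the traditional classful IPv4 class (A-E) of an address.

    Assumption: the A-E classes in the text are the classful ranges.
    """
    first_octet = int(ipaddress.IPv4Address(address)) >> 24
    if first_octet < 128:
        return "A"
    if first_octet < 192:
        return "B"
    if first_octet < 224:
        return "C"
    if first_octet < 240:
        return "D"
    return "E"


def class_histogram(addresses):
    """Count how many request origins fall into each class."""
    return Counter(ipv4_class(a) for a in addresses)


# Hypothetical request origins.
origins = ["10.0.0.1", "10.0.0.2", "172.16.0.1", "192.0.2.1"]
print(class_histogram(origins))
```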

The second analysis step in the data-driven method consisted of identifying what types of pages were most requested. To cluster the thousands of DBPedia URI requests, Frank retrieved the ‘categories’ of the pages. These categories are extracted from Wikipedia category links. An example is the “Android_TV” resource, which has two categories: “Google” and “Android_(operating_system)”. By following skos:broader links, a ‘level 2 category’ could also be found, to aggregate to an even higher level of abstraction. As not all resources have such categories, this does not give a complete picture, but it does provide some idea of the most popular categories of items requested. After normalizing for categories with large numbers of incoming links, for example the category “non-endangered animal”, the most popular categories were 1. Domestic & International movies, 2. Music, 3. Sports, 4. Dutch & International municipality information and 5. Books.
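The aggregation described above can be sketched in a few lines of Python. This is only an illustration: apart from the “Android_TV” example taken from the post, all resources, broader categories and request counts are hypothetical, and the category data is held in small in-memory dictionaries rather than retrieved from DBPedia (where categories are typically attached to resources via dct:subject).

```python
from collections import Counter

# Resource -> categories. The Android_TV entry comes from the post;
# the Chromecast entry is hypothetical.
RESOURCE_CATEGORIES = {
    "Android_TV": ["Google", "Android_(operating_system)"],
    "Chromecast": ["Google"],
}

# Category -> broader categories (skos:broader). Hypothetical values.
BROADER = {
    "Google": ["Internet_companies"],
    "Android_(operating_system)": ["Operating_systems"],
}


def category_counts(requested_resources):
    """Count requests per category; resources without categories are skipped."""
    counts = Counter()
    for resource in requested_resources:
        for category in RESOURCE_CATEGORIES.get(resource, []):
            counts[category] += 1
    return counts


def roll_up(counts):
    """Aggregate category counts to their 'level 2' (broader) categories."""
    level2 = Counter()
    for category, n in counts.items():
        for parent in BROADER.get(category, []):
            level2[parent] += n
    return level2


# Hypothetical sequence of requested resources.
requests = ["Android_TV", "Android_TV", "Chromecast", "Unknown_page"]
level1 = category_counts(requests)
level2 = roll_up(level1)
print(level1.most_common())
print(level2.most_common())
```

Note that "Unknown_page" contributes nothing to either count, which mirrors the caveat that resources without categories are missing from the picture.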

Frank also set up a user survey to corroborate this evidence. The survey contained questions about the how and why of the respondents’ Dutch DBPedia use, including the categories they were most interested in. The survey was distributed through the Dutch DBPedia website and via Twitter, but attracted only 5 respondents. This illustrates the difficulty of the problem: users of the DBPedia resource are not necessarily easy to reach through communication channels. The five respondents were all quite closely related to the chapter, but the results were interesting nonetheless. Most of the users used the DBPedia SPARQL endpoint. The full results of the survey can be found in Frank’s thesis, but in terms of corroboration the survey revealed that four out of the five categories found in the data-driven method were also identified in the top five resulting from the survey. The fifth one identified in the survey was ‘geography’, which could be matched to the fifth from the data-driven method.

Frank’s research shows that, although it remains a challenging problem, a combination of data-driven and user-driven methods makes it possible to get an indication of the most-used categories on DBPedia. Within the Dutch DBPedia Chapter, we are currently considering follow-up research questions based on Frank’s research.

Big Data Europe Project ended

All good things come to an end, and that also holds for our great Horizon2020 project “Big Data Europe”, in which we collaborated with a broad range of technical and domain partners to develop (Semantic) Big Data infrastructure for a variety of domains. VU was involved as work package leader in the Pilot and Evaluation work package and co-developed methods to test and apply the BDE stack in Health, Traffic, Security and other domains.

You can read more about the end of the project in this blog post at the BDE website.

SEMANTiCS2017

This year, I was conference chair of the SEMANTiCS conference, which was held 11-14 Sept in Amsterdam. The conference was in my view a great success, with over 310 visitors across the four days, 24 parallel sessions including academic and industry talks, six keynotes, three awards, many workshops and lots of cups of coffee. I will be posting more looks back soon, but below is a storify item giving an idea of all the cool stuff that happened in the past week.

Big Data Europe Platform paper at ICWE 2017

With the launch of the Big Data Europe platform behind us, we are telling the world about our nice platform and the many pilots in the societal challenge domains that we have executed and evaluated. We wrote everything down in one comprehensive paper, which was accepted at the 17th International Conference on Web Engineering (ICWE 2017), to be held in Rome next month.

High-level BDE architecture (copied from the paper Auer et al.)

The paper “The BigDataEurope Platform – Supporting the Variety Dimension of Big Data” is co-written by a very large team (see below). It presents the BDE platform: an easy-to-deploy, easy-to-use and adaptable (cluster-based and standalone) platform for the execution of big data components and tools like Hadoop, Spark, Flink, Flume and Cassandra. To facilitate the processing of heterogeneous data, a particular innovation of the platform is the Semantic Layer, which makes it possible to process RDF data directly and to map and transform arbitrary data into RDF. The platform is based upon requirements gathered from seven of the societal challenges put forward by the European Commission in the Horizon 2020 programme and targeted by the BigDataEurope pilots. It is validated through pilot applications in each of these seven domains. A draft version of the paper can be found here.
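To give a feel for what “mapping arbitrary data into RDF” means, here is a small, generic Python sketch that turns a flat record into N-Triples lines. It is not the BDE Semantic Layer or any of its components: the function name, the `example.org` namespace and the sample record are hypothetical, and every value is emitted as a plain string literal.

```python
def to_ntriples(subject_iri, record, vocab="http://example.org/vocab/"):
    """Map a flat record (dict) to N-Triples lines, one triple per field.

    Hypothetical helper: each key becomes a predicate in the given
    vocabulary namespace and each value a plain string literal.
    """
    lines = []
    for key, value in sorted(record.items()):
        # Escape the characters that are special inside N-Triples literals.
        literal = (
            str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'<{subject_iri}> <{vocab}{key}> "{literal}" .')
    return lines


# Hypothetical sensor reading.
reading = {"temperature": 21, "unit": "Celsius"}
for triple in to_ntriples("http://example.org/sensor/1", reading):
    print(triple)
```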

 

The full reference is:

Sören Auer, Simon Scerri, Aad Versteden, Erika Pauwels, Angelos Charalambidis, Stasinos Konstantopoulos, Jens Lehmann, Hajira Jabeen, Ivan Ermilov, Gezim Sejdiu, Andreas Ikonomopoulos, Spyros Andronopoulos, Mandy Vlachogiannis, Charalambos Pappas, Athanasios Davettas, Iraklis A. Klampanos, Efstathios Grigoropoulos, Vangelis Karkaletsis, Victor de Boer, Ronald Siebes, Mohamed Nadjib Mami, Sergio Albani, Michele Lazzarini, Paulo Nunes, Emanuele Angiuli, Nikiforos Pittaras, George Giannakopoulos, Giorgos Argyriou, George Stamoulis, George Papadakis, Manolis Koubarakis, Pythagoras Karampiperis, Axel-Cyrille Ngonga Ngomo, Maria-Esther Vidal. The BigDataEurope Platform – Supporting the Variety Dimension of Big Data. Proceedings of The International Conference on Web Engineering (ICWE), ICWE2017, LNCS, Springer, 2017.

 

Web and Media at ICT.OPEN2017

On 21 and 22 March, researchers from VU’s Web and Media group attended ICT.OPEN, the principal ICT research conference in the Netherlands. Here over 500 scientists from all ICT research disciplines & interested researchers from industry come together to learn from each other, share ideas and network. The conference featured some great keynote speeches, including one from Nissan’s Erik Vinkhuyzen on the role of anthropological and sociological research in developing better self-driving cars. Barbara Terhal from Aachen University gave a challenging but well-presented talk on the challenges regarding robustness for quantum computing.

As last year, the Web and Media group was well represented this year, through multiple oral presentations with accompanying posters and demonstrations:

  • Oana Inel, Carlos Martinez and Victor de Boer presented DIVEplus. Oana did such a good job presenting the project in the main programme (see Oana’s DIVE+@ICTOpen2017 slides), through the demo and in front of a poster that the poster was selected as best poster in the SIKS track.
  • Benjamin Timmermans, Tobias Kuhn and Tibor Vermeij presented the Controcurator project with a demonstration and poster presentation. The demo shows the ControCurator human-machine framework for identifying controversy in multimodal data.
  • Tobias Kuhn discussed “Genuine Semantic Publishing” in the Computer Science track on the first day. His slides can be found here. After the talk there was a very interesting discussion about the role of the narrative writing process and how it would relate to semantic publishing.
  • Ronald Siebes and Victor de Boer then discussed how Big and Linked Data technologies developed in the Big Data Europe project are used to deliver pharmacological web-services for drug discovery. You can read more in Ronald’s blog post.
  • Benjamin Timmermans and Zoltan Zslavik also presented the CrowdTruth demonstrator, which is shown in this short demonstrator video.
  • Sabrina Sauer presented the MediaNow project with a nice poster titled MediaNow – using a living lab method to understand media professionals’ exploratory search.

 

Hands on BDE Health at ICT.OPEN 2017

[This post was written by Ronald Siebes and crossposted at big-data-europe.eu and wm.cs.vu.nl]

Last week, BigDataEurope was present at the principal ICT research conference in the Netherlands, ICT.OPEN, where over 500 scientists from all ICT research disciplines & interested researchers from industry come together to learn from each other, share ideas and network.

This is the first time that the NWO, the Netherlands Organisation for Scientific Research, added the “Health” track, a recognition of the increased importance of ICT in the domain of diagnosis, drug discovery and health-care. We presented a short paper written by Ronald Siebes, Victor de Boer, Bryn Williams-Jones, Kiera McNeice and Stian Soiland-Reyes covering the current state of the SC1 “Health, demographic change and well-being” pilot which implements the Open PHACTS functionality on the Big Data Europe infrastructure.

We succeeded in demonstrating the ease of use and practical value of the SC1 pilot for researchers in the domain of Drug Discovery and developers of Big Linked Data solutions, and we are looking forward to further strengthening our collaboration with the various. The paper was accepted as a poster presentation and was also selected for an oral presentation at the “Health & ICT” track.

 

Big Data Europe Youtube channel

For those who are curious about the Big Data Europe technology stack and would rather watch videos than read descriptions and documentation, we have started a YouTube channel where BDE researchers explain the how, why and what of the BDE stack. Embedded below is a short clip of Hajira Jabeen explaining how BDE enables someone to get started with Big Data. More clips are available on the channel.
