Generating Synthetic Time-Series Data For Smart-Building Knowledge Graphs Using Generative Adversarial Networks

[This blog post is based on Jesse van Haaster’s bachelor thesis in Artificial Intelligence at VU]

Knowledge Graphs (KGs) represent data as triples, connecting related data points. This form of representation is widely used in applications such as querying information and drawing inferences from data. Fine-tuning such applications requires actual KGs. However, in domains such as medical records or smart home devices, creating large-scale public knowledge graphs is challenging due to privacy concerns. Generating synthetic knowledge graph data that mimics the original while preserving privacy can address this problem.
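As a minimal illustration of the triple representation, the sketch below stores a sensor measurement as (subject, predicate, object) tuples and looks up a value. All identifiers with the "ex:" prefix and the values are hypothetical and were chosen for this example only; they are not taken from OfficeGraph or from the thesis.

```python
# Hypothetical triples describing one temperature measurement.
# The "ex:" names are illustrative, not an actual ontology.
triples = [
    ("ex:sensor1", "rdf:type", "ex:TemperatureSensor"),
    ("ex:sensor1", "ex:locatedIn", "ex:room7"),
    ("ex:measurement1", "ex:madeBy", "ex:sensor1"),
    ("ex:measurement1", "ex:hasValue", 21.5),
    ("ex:measurement1", "ex:hasTimestamp", "2022-03-01T09:00:00"),
]


def objects(triples, subject, predicate):
    """Return all objects linked to `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Query: which value did measurement1 record?
value = objects(triples, "ex:measurement1", "ex:hasValue")
print(value)
```

A time series in such a graph is then a sequence of measurement nodes, each carrying its own value and timestamp triples.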

Jesse’s thesis explored the feasibility of generating meaningful synthetic time series data for knowledge graphs. He did this specifically in the smart building / IoT domain, building on our previous work on IoT knowledge graphs, including OfficeGraph.

To this end, two existing generative adversarial networks (GANs), CTGAN and TimeGAN, were evaluated for their ability to produce synthetic data that retains key characteristics of the original OfficeGraph dataset. Among other things, Jesse compared the distributions of values for key features, such as humidity, temperature and CO2 levels, between the synthetic and the original data, as seen below.

Key value distributions for CTGAN-generated data vs original data
Key value distributions for TimeGAN-generated data vs original data
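One way to put a number on how far a synthetic value distribution is from the original is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The sketch below is an assumption made for illustration: it is not necessarily the comparison used in the thesis, and the temperature readings are made up rather than taken from OfficeGraph.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest absolute gap between the empirical CDFs of two samples.

    0.0 means the samples have identical empirical distributions,
    1.0 means their value ranges do not overlap at all.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The gap can only change at observed values, so check each of them.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Made-up temperature readings (degrees Celsius), for illustration only.
real_temps = [20.5, 21.0, 21.5, 22.0, 22.5]
synthetic_temps = [20.0, 21.0, 21.5, 23.0, 24.0]
print(ks_statistic(real_temps, synthetic_temps))
```

A per-feature score like this complements the visual comparison, but it ignores temporal order, so it says nothing about whether the time-series structure is preserved.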

The experiment results indicate that while both models capture some important features, neither is able to replicate all of the original data’s properties. Further research is needed to develop a solution that fully meets the requirements for generating meaningful synthetic knowledge graph data.

More details can be found in Jesse’s thesis (found below) and his GitHub repository: https://github.com/JaManJesse/SyntheticKnowledgeGraphGeneration


Comparing Synthetic Data Generation Tools for IoT Data

[This post is based on the Bachelor Information Sciences project of Darin Pavlov and reuses text from his thesis. The research is part of VU’s effort in the InterConnect project and was supervised by Roderick van der Weerdt]

The concepts and technologies behind the Internet of Things (IoT) make it possible to establish networks of interconnected smart devices. Such networks can produce large volumes of data transmitted through sensors and actuators. Machine Learning can play a key role in processing this data for use cases in specific domains such as automotive, healthcare and manufacturing. However, access to data for developing and testing Machine Learning models is often hindered by the sensitivity of the data and by privacy issues.

One solution to this problem is to use synthetic data that resembles real data as closely as possible. In his study, Darin Pavlov conducted a set of experiments investigating the effectiveness of synthetic IoT data generation by three different tools: Mostly AI, Gretel.ai and SDV.

This table shows the results of one of the two Machine Learning detection tests, which measure how difficult it is for a Machine Learning model to distinguish the synthetic data from the real data. For two datasets, the result is calculated as 1 minus the average ROC AUC score, so a higher value means the synthetic data is harder to tell apart from the real data.
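The score described above can be sketched in a few lines. The example assumes a classifier has already assigned each record a probability of being real; the labels, the scores, and the choice of which AUC values to average are all illustrative assumptions, not the tools’ or the thesis’ actual evaluation code.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (real, synthetic) pairs ranked correctly.

    labels: 1 for real records, 0 for synthetic records.
    scores: classifier's predicted probability that a record is real.
    Ties between a real and a synthetic record count as half a win.
    """
    real = [s for l, s in zip(labels, scores) if l == 1]
    synth = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for r in real:
        for s in synth:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(real) * len(synth))


def detection_score(auc_values):
    """1 minus the average ROC AUC; higher means harder to detect."""
    return 1 - sum(auc_values) / len(auc_values)


# Made-up classifier outputs for two evaluation runs.
labels = [1, 1, 0, 0]
easy_to_detect = roc_auc(labels, [0.9, 0.8, 0.2, 0.1])   # perfect separation
harder_to_detect = roc_auc(labels, [0.9, 0.2, 0.8, 0.1])  # one pair misranked
print(detection_score([easy_to_detect, harder_to_detect]))
```

With this definition, a classifier that separates real from synthetic records perfectly (AUC of 1) yields a score of 0, while one that does no better than chance (AUC of 0.5) yields 0.5.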

Darin compared the tools on various distinguishability metrics. He observed that Mostly AI outperforms the other two generators, although Gretel.ai shows similarly satisfactory results on the statistical metrics. The output of SDV, on the other hand, is poor on all metrics. Through this study we aim to encourage future research within the quickly developing area of synthetic data generation in the context of IoT technology.

More details can be found in Darin’s thesis.
