Information Extraction and Knowledge Graph Creation from Handwritten Historical Documents

[This post is based on the Bachelor Project AI of Annriya Binoy]

In her bachelor thesis “Evaluating Methodologies for Information Extraction and Knowledge Graph Creation from Handwritten Historical Documents”, Annriya Binoy provides a systematic evaluation of various methodologies for extracting and structuring information from historical handwritten documents, with the goal of identifying the most effective strategies.

As a case study, the research investigates several methods on scanned pages from the National Archive of the Netherlands, specifically the service records and pension registers of the late 18th century and early 19th century of the Koninklijk Nederlands Indisch Leger (KNIL), see the example below. The task was defined as that of extracting birth events.


Four approaches are analyzed:

  1. Handwritten Text Recognition (HTR) using the Transkribus tool,
  2. a combination of Large Language Models (LLM) and Regular Expressions (Regex),
  3. Regex alone, and
  4. Fuzzy Search.

HTR and the LLM-Regex combination show strong performance and adaptability with F1 measure values of 0.88. While Regex alone delivers high accuracy, it lacks comprehensiveness. Fuzzy Search proves effective in handling transcription errors common in historical documents, offering a balance between accuracy and robustness. This research offers initial but practical solutions for the digitization and semantic enrichment of historical archives, and it also addresses the challenges of preserving contextual integrity when constructing knowledge graphs from extracted data.
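To make the contrast between Regex alone and Fuzzy Search more concrete, here is a minimal sketch. It is not the thesis implementation: the keyword "geboren" (Dutch for "born"), the sentence layout, the similarity threshold and the sample strings are all assumptions made for illustration. It uses only Python's standard library (`re` and `difflib`).

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern: keyword "geboren", a place after "te", a date after "op".
# Real register pages will be laid out differently.
BIRTH_PATTERN = re.compile(
    r"geboren\s+te\s+(?P<place>[A-Za-z' -]+?)\s+op\s+(?P<date>\d{1,2}\s+\w+\s+\d{4})",
    re.IGNORECASE,
)


def extract_birth_regex(text):
    """Strict extraction: returns place and date, or None if the pattern fails."""
    match = BIRTH_PATTERN.search(text)
    if match is None:
        return None
    return {"place": match.group("place"), "date": match.group("date")}


def fuzzy_find_keyword(text, keyword="geboren", threshold=0.8):
    """Return the index of the first token that resembles the keyword, else None."""
    for index, token in enumerate(text.split()):
        score = SequenceMatcher(None, token.lower(), keyword).ratio()
        if score >= threshold:
            return index
    return None


# Invented example lines: one clean, one with a transcription error.
clean = "Jan Jansen geboren te Utrecht op 12 Maart 1795"
noisy = "Jan Jansen gebooren te Utrecht op 12 Maart 1795"

print(extract_birth_regex(clean))   # strict pattern matches the clean line
print(extract_birth_regex(noisy))   # strict pattern misses the misspelled keyword
print(fuzzy_find_keyword(noisy))    # fuzzy matching still locates the keyword
```

The strict pattern is precise but fails as soon as the keyword is mis-transcribed, while the fuzzy lookup tolerates small spelling differences at the cost of a threshold that has to be tuned.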

More details can be found in Annriya’s thesis below.


Exploring Culinary Links with NLP and Knowledge Graphs

[This post is based on Nour al Assali‘s bachelor AI thesis]

Nour’s research explores the use of Natural Language Processing (NLP) and Knowledge Graphs to investigate the historical connections and cultural exchanges within global cuisines. The thesis “Flavours of History: Exploring Historical and Cultural Connections Through Ingredient Analysis Using NLP and Knowledge Graphs” describes a method for analyzing ingredient usage patterns across various cuisines by processing a dataset of recipes. Its goal is to trace the diffusion and integration of ingredients into different culinary traditions. The primary aim is to establish a digital framework for addressing questions related to culinary history and cultural interactions.

The methodology involves applying NLP to preprocess recipe data, focusing on extracting and normalizing ingredient names. The pipeline contains steps for stop word removal, tokenization, lemmatization, character replacements, etc.
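The preprocessing steps named above could look roughly like the sketch below. The stop-word list, lemma table and character replacements are tiny hand-made stand-ins chosen for illustration. The actual pipeline in the thesis repository may use different resources and a different order of steps.

```python
import re

# Illustrative assumptions: a real pipeline would use much larger resources.
STOP_WORDS = {"fresh", "chopped", "large", "of", "cup", "cups"}
LEMMAS = {"tomatoes": "tomato", "olives": "olive", "leaves": "leaf"}
CHAR_REPLACEMENTS = {"é": "e", "ñ": "n", "-": " "}


def normalize_ingredient(raw):
    """Normalize one raw ingredient string to a canonical ingredient name."""
    text = raw.lower()
    # Character replacements (accents, hyphens).
    for old, new in CHAR_REPLACEMENTS.items():
        text = text.replace(old, new)
    # Tokenization: keep alphabetic tokens, dropping quantities and punctuation.
    tokens = re.findall(r"[a-z]+", text)
    # Stop word removal.
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # Lemmatization via a lookup table.
    tokens = [LEMMAS.get(token, token) for token in tokens]
    return " ".join(tokens)


print(normalize_ingredient("2 cups chopped fresh Tomatoes"))
print(normalize_ingredient("Sun-dried tomatoes"))
```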

With the results, a Knowledge Graph is constructed to map relationships between ingredients, recipes, and cuisines. The approach also includes visualizing these connections, with an interactive map and other tools designed to provide insights into the data and answer key research questions. The figure below shows a visualisation of top ingredients per cuisine.
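As a rough illustration of how such a graph can answer a question like "what are the top ingredients per cuisine?", the sketch below stores the graph as plain (subject, predicate, object) triples. The recipes, the predicate names `hasIngredient` and `belongsToCuisine`, and the helper functions are all hypothetical and are not taken from the thesis.

```python
from collections import Counter, defaultdict

# Hypothetical toy data, already normalized.
recipes = [
    {"id": "r1", "cuisine": "italian", "ingredients": ["tomato", "basil", "olive"]},
    {"id": "r2", "cuisine": "italian", "ingredients": ["tomato", "garlic"]},
    {"id": "r3", "cuisine": "indian", "ingredients": ["cardamom", "tomato"]},
]


def build_triples(recipes):
    """Turn recipe records into a set of (subject, predicate, object) triples."""
    triples = set()
    for recipe in recipes:
        triples.add((recipe["id"], "belongsToCuisine", recipe["cuisine"]))
        for ingredient in recipe["ingredients"]:
            triples.add((recipe["id"], "hasIngredient", ingredient))
    return triples


def top_ingredients(triples, n=1):
    """Count ingredient usage per cuisine and return the n most common ones."""
    cuisine_of = {s: o for s, p, o in triples if p == "belongsToCuisine"}
    counts = defaultdict(Counter)
    for s, p, o in triples:
        if p == "hasIngredient":
            counts[cuisine_of[s]][o] += 1
    return {
        cuisine: [name for name, _ in counter.most_common(n)]
        for cuisine, counter in counts.items()
    }


triples = build_triples(recipes)
print(top_ingredients(triples)["italian"])
```

A dedicated triple store or RDF library would be the more realistic choice for a dataset of real size. Plain tuples are used here only to keep the example self-contained.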

Case studies on ingredients such as pistachios, tomatoes, basil, olives, and cardamom illustrate distinct usage patterns and origins. The findings reveal that certain ingredients—like pistachios, basil, and tomatoes—associated with specific regions have gained widespread international popularity, while others, such as olives and cardamom, maintain strong ties to their places of origin. This research underscores the influence of historical trade routes and cultural exchanges on contemporary culinary practices and offers a digital foundation for future investigations into culinary history and food culture.

The code and dataset used in this research are available on GitHub: https://github.com/Nour-alasali/BPAI. The complete thesis can be found below.


Generating Synthetic Time-Series Data For Smart-Building Knowledge Graphs Using Generative Adversarial Networks

[This blog post is based on Jesse van Haaster‘s bachelor thesis Artificial Intelligence at VU]

Knowledge Graphs represent data as triples, connecting related data points. This form of representation is widely used for various applications, such as querying information and drawing inferences from data. For fine-tuning such applications, actual KGs are needed. However, in certain domains like medical records or smart home devices, creating large-scale public knowledge graphs is challenging due to privacy concerns. To address this, generating synthetic knowledge graph data that mimics the original while preserving privacy is highly beneficial.

Jesse’s thesis explored the feasibility of generating meaningful synthetic time series data for knowledge graphs. He specifically does this in the smart building / IoT domain, building on our previous work on IoT knowledge graphs, including OfficeGraph.

To this end, two existing generative adversarial networks (GANs), CTGAN and TimeGAN, are evaluated for their ability to produce synthetic data that retains key characteristics of the original OfficeGraph dataset. Among other things, Jesse compared the distributions of values for key features, such as humidity, temperature and CO2 levels, between the original and the generated data, as seen below.

Key value distributions for CTGAN-generated data vs original data
Key value distributions for TimeGAN-generated data vs original data

The experiment results indicate that while both models capture some important features, neither is able to replicate all of the original data’s properties. Further research is needed to develop a solution that fully meets the requirements for generating meaningful synthetic knowledge graph data.
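One simple way to put a number on the difference between an original and a synthetic value distribution is the one-dimensional Wasserstein (earth mover's) distance. The sketch below is only an illustration under stated assumptions: the thesis may well use other measures, the function name is made up, and the temperature values are invented rather than taken from OfficeGraph.

```python
from statistics import mean


def wasserstein_1d(real, synthetic):
    """1-D Wasserstein distance between two equal-sized samples.

    For equal-sized samples this equals the mean absolute difference
    between the sorted values of both samples.
    """
    if len(real) != len(synthetic):
        raise ValueError("this simple version assumes equal sample sizes")
    return mean(abs(r - s) for r, s in zip(sorted(real), sorted(synthetic)))


# Invented temperature readings in degrees Celsius.
real_temperature = [20.0, 21.0, 22.0, 23.0]
synthetic_temperature = [24.0, 21.0, 23.0, 22.0]

print(wasserstein_1d(real_temperature, synthetic_temperature))
```

A distance of zero means the two samples have identical value distributions. It says nothing about temporal ordering, which matters for time-series data and would need a separate check.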

More details can be found in Jesse’s thesis (found below) and his GitHub repository: https://github.com/JaManJesse/SyntheticKnowledgeGraphGeneration


Hybrid Intelligence for Digital Humanities

For deep and meaningful integration of AI tools in the Digital Humanities (DH) discipline, we propose Hybrid Intelligence (HI) as a research paradigm. In DH research, the use of digital methods, and specifically that of Artificial Intelligence, is subject to a set of requirements and constraints. In our position paper, which we presented at the HHAI2024 conference in Malmö, we argue that these are well-supported by the capabilities and goals of HI. Our paper identifies five such DH requirements: successful AI systems need to be able to

  1. collaborate with the (human) scholar;
  2. support data criticism;
  3. support tool criticism;
  4. be aware of and cater to various perspectives; and
  5. support distant and close reading.

In our paper, we take the CARE principles of Hybrid Intelligence (collaborative, adaptive, responsible and explainable) as a theoretical framework and map these to the DH requirements. In this mapping, we include example research projects. Finally, we address how insights from DH can be applied to HI and discuss open challenges for the combination of the two disciplines.

You can find the paper here: Victor de Boer and Lise Stork. “Hybrid Intelligence for Digital Humanities.” HHAI 2024: Hybrid Human AI Systems for the Social Good. pp. 94-104. Frontiers in Artificial Intelligence and Applications. Vol. 386. IOS Press. DOI: 10.3233/FAIA240186 

…and our presentation below:


ESWC2024 Trip report

Last week, I joined the 21st edition of the Extended Semantic Web Conference (ESWC2024), held in Heraklion, Crete. The 2004 edition was my first scientific conference ever, and I have attended many editions since, so this feels a bit like my ‘home conference’. General Chair Albert Meroño and his team did a great job and it was overall a very nice conference. Paul Groth wrote a very nice trip report here, but I wanted to collect some thoughts and personal highlights in a short blogpost anyway.

The workshops

The workshops overall were very well organized and the ones I joined were well attended. This has been different in previous editions! The PhD symposium was very lively and I had nice chats with PhD candidates during the symposium lunch.

I joined part of the Genesy Workshop, where there were various talks about the potential of generative AI (a definite and unsurprising theme of the conference) and Semantic Web processes and technologies. The paper from Bouchouras et al., “LLMs for the Engineering of a Parkinson Disease Monitoring and Alerting Ontology”, looked at using LLMs for Knowledge Engineering.

I was asked to give a keynote speech at the 2nd edition of the Workshop on Semantic Methods for Events and Stories (SEMMES), at ESWC2024. I talked about work on polyvocality in cultural heritage knowledge graphs. You can find my slides here.

There were very nice talks in the workshop, including the (best paper winning) “Let the fallen voussoirs of Notre-Dame de Paris speak: Scientific Narration and 3D Visualization of Virtual Reconstruction Hypotheses and Reasoning” from Guillem Anais, John Samuel, Gilles Gesquière, Livio De Luca and Violette Abergel, which looked at a combination of modelling, argumentation and visualisation for architectural reconstruction.

I then joined the SemDH workshop on Semantic Digital Humanities and its panel discussion in the afternoon, which was really nice. One observation is that many of the talks in SEMMES could have been very interesting for SemDH as well and vice versa. Maybe merging the two would make sense in the future?

The Keynotes

There were three nice keynote speeches, each with its own angle and interesting points.

Elena Simperl gave a somewhat personal history of Knowledge Engineering and the role that machines and humans have in this process. This served as a prelude for the special track on this topic organized by her, Paul Groth and others. Elena called for tools and data for proper benchmarking, introduced the ProVe tool for provenance verification and explored what the roles are of AI (LLM) with respect to Knowledge engineers, domain experts and prompt engineers.

Katariina Kari reflected on 7 Years of Building Enterprise Knowledge Graphs at Zalando and Ikea. This was a very interesting talk about the impact of Knowledge Graphs in industry (she mentioned seven-figure sales increases) and about what works (SKOS, SHACL, OntoClean, Reasoning) and what doesn’t work or isn’t needed (OWL, Top level ontologies, big data).

Peter Clark of the Allen Institute for AI gave my favorite talk, on Structured Reasoning with Language. He discussed their research on Knowledge Graphs and reasoning, but also on Belief Graphs, which consist of atomic statements with textual entailment relations. LLMs can be used to ‘reason’ over such Belief Graphs, for example to explain decisions or search results.

Main Conference

The main conference had many interesting talks in all the tracks. The industry track and resource track were quite competitive this year. In terms of quality and number of submissions, they seemed equal to the research track to me this year. Also, the special track on LLMs for Knowledge Engineering was a great success.

I was a bit hesitant with respect to this clear theme of the conference, fearing lots of “we did LLM” talks, but that was not the case at all. Most papers showed genuine interest in the strengths and weaknesses of various LLMs and how they can be used in several Semantic Web tasks and pipelines. There was clearly a renewed interest in methodologies (Neon, Ontology Engineering 101, Methontology, etc.) and how LLMs can fit here. There were, for example, several talks on how LLMs can be used to generate competency questions: “Can LLMs Generate Competency Questions?” [pdf] by Youssra Rebboud et al. and “The Role of Generative AI in Competency Question Retrofitting” [pdf] by Reham Alharbi et al.

Roderick presenting our Resource paper

Roderick van der Weerdt presented our paper “OfficeGraph: A Knowledge Graph of Office Building IoT Measurements” [pdf], which was nominated for best Resource paper. Roderick did a great job presenting this nice result from the InterConnect project and it was well-received. The winner of the Resource track best paper award was, however, “PyGraft: Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips” [pdf] by Nicolas Hubert et al. (in my view deservedly so).

The in-use track also had very nice papers, including a quite holistic system to map the German financial system with Knowledge Graphs [pdf] by Markus Schröder et al. Oh, and I won an award 🙂

With more focus on applications, in use, resources, methods for knowledge engineering and of course LLMs, some topics seem to get less attention. Ironically, I missed both Semantics and the Web: Semantics and reasoning did not get a lot of attention in the talks I attended and most applications were about singular knowledge graphs, rather than distributed datasets. Maybe this means that we have solved most of the challenges around these two topics, but possibly it also means that these two elements are less important for actual implementation of Knowledge Graphs. It makes one wonder about the name of the conference though…

With a truly great demo and poster session (near the beach), a great dinner, really nice people and the wonderful surroundings, ESWC2024 was a great success. See you next year in Portoroz!?


SEMMES keynote: more than one side to the story

I was honored to be asked to give the keynote address for the 2nd edition of the Workshop on Semantic Methods for Events and Stories (SEMMES), at ESWC2024. I talked about work on polyvocality in cultural heritage knowledge graphs:

There is more than one side to every story. This common saying is not only true for works of fiction. In the global data space that is the Semantic Web, views and perspectives from different people, organizations and cultures should be available. I identify three challenges towards such a polyvocal Semantic Web. I will talk about ways to identify various voices, to model different perspectives and to make these perspectives available to end users. I will give examples from the cultural heritage domain, both in how semantic technologies can be of use to make available various perspectives on people, objects and events there but also how insights from the domain can help to shape the polyvocal Semantic Web.

You can find my slides below.


HEDGE-IoT project kickoff

The HorizonEurope project HEDGE-IoT started in January 2024. The 3.5-year project will build on existing technology to develop a Holistic Approach towards Empowerment of the DiGitalization of the Energy Ecosystem through adoption of IoT solutions. For VU, this project allows us to continue with the research and development initiated in the InterConnect project on data interoperability and explainable machine learning for smart buildings.

Researchers from the User-Centric Data Science group will participate in the project mostly in the context of the Dutch pilot, which will run in Arnhems Buiten, the former testing location of KEMA in the east of the Netherlands. In the pilot, we will collaborate closely with the other Dutch partners: TNO and Arnhems Buiten. At this site, an innovative business park is being realized that has its own power grid architecture, allowing for exchange of data and energy, opening the possibility for various AI-driven services for end-users.

VU will research a) how such data can be made interoperable and enriched with external information and knowledge and b) how such data can be made accessible to services and end-users through data dashboards that include explainable AI.

The image above shows the Arnhems Buiten buildings and the energy grid (source: Arnhems Buiten)


SUMAC keynote on Knowledge Graphs for Cultural Heritage and Digital Humanities

I was honored to be invited as a keynote speaker for the 5th edition of the SUMAC 2023 workshop (analySis, Understanding and proMotion of heritAge Contents) held in conjunction with ACM Multimedia in Ottawa, Canada. In the keynote, I sketched how Knowledge Graphs as a technology can be applied to the cultural heritage domain with examples of opportunities for new types of research in the field of digital humanities specifically with respect to analyses and visualisation of such (multi-modal) data.

In the talk, I discussed the promises and challenges of designing, constructing and enriching knowledge graphs for cultural heritage and digital humanities and how such integrated and multimodal data can be browsed, queried or analysed using state of the art machine learning.

I also addressed the issue of polyvocality, where multiple perspectives on (historical) information are to be represented. Especially in contexts such as that of (post-)colonial heritage, representing multiple voices is crucial.

You can find the complete abstract of my talk here and the (compressed) presentation slides themselves below.


Best NIAA project award for VR project

The award for the Best Network Institute Academy Assistant project for this year goes to the project titled “Between Art, Data, and Meaning – How can Virtual Reality expand visitors’ perspectives on cultural objects with colonial background?” This project was carried out by VU students Isabel Franke and Stefania Conte, supervised by Thilo Hartmann and UCDS researchers Claudia Libbi and myself. A project report and research paper are forthcoming, but you can see the poster below.


HAICu project funded

It has pleased NWO to award the HAICu consortium through the National Research Agenda programme. In the HAICu project, AI researchers, Digital Humanities researchers, heritage professionals and engaged citizens work together on scientific breakthroughs to open, link and analyze large-scale multimodal digital heritage collections in context.

At VU, researchers from the User-Centric Data Science group will research how to create compelling narratives as a way to present multiple perspectives in multimodal data and how to provide transparency regarding the origin of data and the ways in which it was created. These questions will be addressed in collaboration with the Museum for World Cultures, investigating how citizen-contributed descriptions can be combined with AI-generated labels into polyvocal narratives around objects related to the Dutch colonial past in Indonesia.
