Information Extraction and Knowledge Graph Creation from Handwritten Historical Documents

[This post is based on the Bachelor Project AI of Annriya Binoy]

In her bachelor thesis “Evaluating Methodologies for Information Extraction and Knowledge Graph Creation from Handwritten Historical Documents”, Annriya Binoy provides a systematic evaluation of various methodologies for extracting and structuring information from historical handwritten documents, with the goal of identifying the most effective strategies.

As a case study, the research investigates several methods on scanned pages from the National Archive of the Netherlands, specifically the service records and pension registers of the Koninklijk Nederlands Indisch Leger (KNIL) from the late 18th and early 19th centuries (see the example below). The task was defined as extracting birth events.


Four approaches are analyzed:

  1. Handwritten Text Recognition (HTR) using the Transkribus tool
  2. A combination of Large Language Models (LLM) and Regular Expressions (Regex)
  3. Regex alone
  4. Fuzzy Search

HTR and the LLM-Regex combination show strong performance and adaptability, with F1 scores of 0.88. Regex alone delivers high accuracy but lacks comprehensiveness. Fuzzy Search proves effective in handling the transcription errors common in historical documents, offering a balance between accuracy and robustness. This research offers initial but practical solutions for the digitization and semantic enrichment of historical archives, and it also addresses the challenges of preserving contextual integrity when constructing knowledge graphs from extracted data.
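To make the difference between strict Regex matching and Fuzzy Search concrete, here is a minimal sketch. The post does not give the patterns used in the thesis, so the line format ("geboren te <place> den <day> <month> <year>"), the sample sentences, the function names, and the similarity threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern for a birth event; the real records may be phrased differently.
BIRTH_RE = re.compile(
    r"geboren\s+te\s+(?P<place>[A-Za-z' -]+?)\s+den\s+"
    r"(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]+)\s+(?P<year>\d{4})",
    re.IGNORECASE,
)


def regex_extract(line):
    """Strict extraction: return a dict of fields, or None if the pattern fails."""
    match = BIRTH_RE.search(line)
    return match.groupdict() if match else None


def fuzzy_find_keyword(line, keyword="geboren", threshold=0.8):
    """Return the index of the first token that approximately matches the keyword."""
    for index, token in enumerate(line.split()):
        if SequenceMatcher(None, token.lower(), keyword).ratio() >= threshold:
            return index
    return None


def fuzzy_extract(line, keyword="geboren"):
    """Repair a misread keyword, then reuse the strict pattern."""
    tokens = line.split()
    index = fuzzy_find_keyword(line, keyword)
    if index is None:
        return None
    tokens[index] = keyword
    return regex_extract(" ".join(tokens))


# Invented sample lines; the second simulates a transcription error.
clean = "Jan de Vries geboren te Amsterdam den 12 Maart 1795"
noisy = "Jan de Vries gebooren te Amsterdam den 12 Maart 1795"

print(regex_extract(clean))
print(regex_extract(noisy))   # strict matching misses the misspelled keyword
print(fuzzy_extract(noisy))   # fuzzy matching recovers the event
```

In this toy setting the strict pattern is precise but brittle, while the fuzzy step trades some precision for robustness against misread characters, which mirrors the trade-off described above.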

More details can be found in Annriya’s thesis below.


Exploring Culinary Links with NLP and Knowledge Graphs

[This post is based on Nour al Assali’s bachelor AI thesis]

Nour’s research explores the use of Natural Language Processing (NLP) and Knowledge Graphs to investigate the historical connections and cultural exchanges within global cuisines. The thesis “Flavours of History: Exploring Historical and Cultural Connections Through Ingredient Analysis Using NLP and Knowledge Graphs” describes a method for analyzing ingredient usage patterns across various cuisines by processing a dataset of recipes. Its goal is to trace the diffusion and integration of ingredients into different culinary traditions. The primary aim is to establish a digital framework for addressing questions related to culinary history and cultural interactions.

The methodology involves applying NLP to preprocess recipe data, focusing on extracting and normalizing ingredient names. The pipeline includes steps such as stop word removal, tokenization, lemmatization, and character replacements.
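A minimal sketch of such a normalization pipeline is shown here, using only the Python standard library. The stop word list, the replacement table, and the crude plural-stripping function standing in for a real lemmatizer are all assumptions; the actual implementation is in the thesis repository and may differ substantially.

```python
import re
import unicodedata

# Assumed stop words and measurement words; a real pipeline would use a fuller list.
STOP_WORDS = {"of", "fresh", "chopped", "finely", "large", "cup", "cups", "tbsp", "tsp"}
# Assumed character replacements applied before tokenization.
REPLACEMENTS = {"&": " and ", "-": " "}


def strip_accents(text):
    """Character replacement step: fold accented letters to their base letter."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_lemma(token):
    """Rough stand-in for a lemmatizer: strips common English plural endings."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("oes") and len(token) > 4:
        return token[:-2]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def normalize_ingredient(raw):
    """Lowercase, replace characters, tokenize, drop stop words, lemmatize."""
    text = strip_accents(raw.lower())
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    tokens = re.findall(r"[a-z]+", text)  # tokenization; digits are dropped
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return " ".join(naive_lemma(t) for t in tokens)  # naive lemmatization


print(normalize_ingredient("2 cups chopped Tomatoes"))
print(normalize_ingredient("Pistachios, finely chopped"))
print(normalize_ingredient("Crème fraîche"))
```

With these assumed rules, the three calls print "tomato", "pistachio", and "creme fraiche". A dictionary-based lemmatizer from an NLP library would handle irregular forms that the plural-stripping shortcut gets wrong.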

With the results, a Knowledge Graph is constructed to map relationships between ingredients, recipes, and cuisines. The approach also includes visualizing these connections, with an interactive map and other tools designed to provide insights into the data and answer key research questions. The figure below shows a visualisation of top ingredients per cuisine.
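As a rough illustration of the graph structure, the sketch below stores recipes, ingredients, and cuisines as subject-predicate-object triples and derives the top ingredients per cuisine from them. The predicate names (`hasIngredient`, `belongsToCuisine`) and the toy recipes are invented for this example and are not taken from the thesis.

```python
from collections import Counter, defaultdict

# Invented toy recipes, only to illustrate the structure.
RECIPES = [
    {"id": "r1", "cuisine": "Italian", "ingredients": ["tomato", "basil", "olive"]},
    {"id": "r2", "cuisine": "Italian", "ingredients": ["tomato", "garlic"]},
    {"id": "r3", "cuisine": "Indian", "ingredients": ["cardamom", "tomato"]},
]


def build_triples(recipes):
    """Turn recipe records into a set of (subject, predicate, object) triples."""
    triples = set()
    for recipe in recipes:
        triples.add((recipe["id"], "belongsToCuisine", recipe["cuisine"]))
        for ingredient in recipe["ingredients"]:
            triples.add((recipe["id"], "hasIngredient", ingredient))
    return triples


def top_ingredients(triples, n=2):
    """Count ingredient usage per cuisine by joining the two predicates."""
    cuisine_of = {s: o for s, p, o in triples if p == "belongsToCuisine"}
    counts = defaultdict(Counter)
    for s, p, o in triples:
        if p == "hasIngredient":
            counts[cuisine_of[s]][o] += 1
    return {cuisine: counter.most_common(n) for cuisine, counter in counts.items()}


triples = build_triples(RECIPES)
print(len(triples))
print(top_ingredients(triples)["Italian"][0])
```

The same join could be expressed as a query over an RDF store; plain tuples are used here only to keep the sketch self-contained.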

Case studies on ingredients such as pistachios, tomatoes, basil, olives, and cardamom illustrate distinct usage patterns and origins. The findings reveal that certain ingredients—like pistachios, basil, and tomatoes—associated with specific regions have gained widespread international popularity, while others, such as olives and cardamom, maintain strong ties to their places of origin. This research underscores the influence of historical trade routes and cultural exchanges on contemporary culinary practices and offers a digital foundation for future investigations into culinary history and food culture.

The code and dataset used in this research are available on GitHub: https://github.com/Nour-alasali/BPAI. The complete thesis can be found below.


Modeling Ontologies for Individual Artists

[This post presents research done by Daan Raven in the context of his Master Project Information Sciences]

There is a long tradition in the Cultural Heritage domain of representing knowledge in structured, machine-interoperable form using semantic methods and tools. However, research into developing and using ontologies specific to the works of individual artists is lacking. Such knowledge graphs would improve access to heritage information by making reasoning and inferencing possible. In his research, Daan Raven developed and applied a reusable method, building on the ‘Methontology’ method for ontology development. We describe the steps of specification, conceptualization, integration, implementation and evaluation in a case study concerning ceramic-glass sculptor Barbara Nanning.

This work was presented at Digital Humanities Benelux 2021. The abstract and presentation as well as other digital resources related to the project can be found below:

Below are some examples of competency questions with pointers to SPARQL queries in YASGUI.

Which artworks in the Verre Églomisé collection of Nanning are currently stored in her private collection? https://api.triplydb.com/s/wKZG4UFq5
Show me a timeline of all processes that require the use of an annealing kiln: https://api.triplydb.com/s/j4Qk0tHzK
Show me all process steps that require the use of an annealing kiln and that have a landing page: https://api.triplydb.com/s/N5mo4uTM3
Show me (in Gallery) all objects made by “Jiří Pačinek Glass Lindava” (person in Wikidata): https://api.triplydb.com/s/C6LsEgiZF
Show me (in Geo) the locations of creation steps for various works (uses GeoNames): https://api.triplydb.com/s/THTkhOYjd


InTaVia project started

From November 1, 2020, we are collaborating on connecting tangible and intangible heritage through knowledge graphs in the new Horizon2020 project “InTaVia”.

To facilitate access to rich repositories of tangible and intangible assets, new technologies are needed to enable their analysis, curation and communication for a variety of target groups without computational and technological expertise. In the face of many large, heterogeneous, and unconnected heritage collections, we aim to develop supporting technologies to better access and manage in/tangible CH data and topics, to better study and analyze them, to curate, enrich and interlink existing collections, and to better communicate and promote their inventories.

tangible and intangible heritage (img from project proposal)

Our group will contribute to the shared research infrastructure and will be responsible for developing a generic solution for connecting linked heritage data to various visualization tools. We will work on various user-facing services and develop an application shell and front-end for this connection. We will also be responsible for evaluating the usability of the integrated InTaVia platform for specific users. This project will allow for novel user-centric research on topics of Digital Humanities, Human-Computer Interaction and Linked Data service design.

screenshot of the virtual kickoff meeting


Hearing (Knowledge) Graphs

[This post is based on Enya Nieland’s MSc thesis “Generating Earcons from Knowledge Graphs”]

Three earcons with varying pitch, rhythm, and both pitch and rhythm

Knowledge Graphs are becoming enormously popular, which means that users interacting with such complex networks are diversifying. This requires new and innovative ways of interacting. Several methods for visualizing, summarizing or exploring knowledge have been proposed and developed. In this student project we investigated the potential for interacting with knowledge graphs through a different modality: sound.

The research focused on the question of how to generate meaningful sound or music from (knowledge) graphs. The generated sounds should give users some insight into the properties of the network. Enya framed this challenge with the idea of “earcons”, the auditory version of icons.

Enya eventually developed a method that automatically produces these types of earcons for random knowledge graphs. Each earcon consists of three notes that differ in pitch and duration. As an example, listen to the three earcons which are shown in the figure on the left.

Earcon where pitch varies
Earcon where note duration varies
Earcon where both pitch and rhythm vary

The earcon parameters are derived from network metrics such as minimum, maximum and average indegree or outdegree. A tool with a user interface allowed users to design the earcons based on these metrics.
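The following sketch shows one way such a mapping could work: it computes minimum, maximum and average indegree from a directed edge list and maps each value linearly onto a MIDI pitch, giving a three-note earcon. The post does not specify the mapping used in the thesis, so the pitch range, the metric bounds, and the choice to vary only pitch (not duration) are assumptions.

```python
from collections import Counter


def indegree_stats(edges):
    """Return (min, max, mean) indegree over all nodes in a directed edge list."""
    nodes = {node for edge in edges for node in edge}
    indegree = Counter(target for _, target in edges)
    values = [indegree.get(node, 0) for node in nodes]
    return min(values), max(values), sum(values) / len(values)


def to_midi_pitch(value, lo=0.0, hi=10.0, pitch_lo=48, pitch_hi=84):
    """Linearly map a metric onto a MIDI note range, clamping out-of-range values."""
    clamped = max(lo, min(hi, value))
    return round(pitch_lo + (clamped - lo) / (hi - lo) * (pitch_hi - pitch_lo))


def earcon(edges):
    """Three notes, one per metric: minimum, maximum and mean indegree."""
    return [to_midi_pitch(value) for value in indegree_stats(edges)]


# Invented toy graph: node "c" has three incoming edges, "b" has one.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
print(indegree_stats(edges))
print(earcon(edges))
```

For this toy graph the indegree statistics are (0, 3, 1.0), which the assumed mapping turns into the MIDI notes [48, 59, 52]. Note duration could be derived from outdegree in the same way.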

The pipeline for creating earcons
The GUI

The different variants were evaluated in an extensive user test with 30 respondents to find out which variants were the most informative. The results show that the individual elements of earcons can indeed provide insights into these metrics, but that combining them is confusing to the listener. In this case, simpler is better.

This tool could complement a tool such as LOD Laundromat by providing instant insight into the complexity of KGs. It could additionally benefit people who are visually impaired and want to gain insight into the complexity of Knowledge Graphs.


A look back at UCDS at ICT.Open2018

Two weeks ago, ICT.Open2018 was held in Amersfoort. This event brings together Computer Science researchers from all over the Netherlands and our research group was present with many posters and presentations.

We even won a prize! (Well, a 2nd place prize, but awesome nonetheless). Xander Wilcke presented work on using Knowledge Graphs for Machine Learning. He was awarded the runner-up prize for best poster presentation at ICT.Open2018. Congrats!

Ronald Siebes presented work in the ArchiMediaL project on reconstructing 4D street views from historical images.

Oana Inel presented her work on Named Entity Recognition and Gold Standard critiquing. She also demonstrated the Clariah MediaSuite.

Anca Dumitrache talked about using crowdsourcing as part of the Machine Learning life cycle.

Tobias Kuhn talked about Reliable Granular References to Changing Linked Data, which was previously published at ISWC2017.

Cristina Bucur introduced Linkflows: enabling a web of linked semantic publishing workflows.

I myself talked a bit about current work in the ABC-Kb Network Institute project.

All in all, this was quite a nice edition of the yearly event for our group. See you next year in Amersfoort!
