Information Extraction and Knowledge Graph Creation from Handwritten Historical Documents

[This post is based on the Bachelor Project AI of Annriya Binoy]

In her bachelor thesis “Evaluating Methodologies for Information Extraction and Knowledge Graph Creation from Handwritten Historical Documents”, Annriya Binoy systematically evaluates methodologies for extracting and structuring information from handwritten historical documents, with the goal of identifying the most effective strategies.

As a case study, the research applies several methods to scanned pages from the National Archive of the Netherlands, specifically the late-18th- and early-19th-century service records and pension registers of the Koninklijk Nederlands Indisch Leger (KNIL); see the example below. The task was defined as extracting birth events.


Four approaches are analyzed:

  1. Handwritten Text Recognition (HTR) using the Transkribus tool,
  2. a combination of Large Language Models (LLMs) and Regular Expressions (Regex),
  3. Regex alone, and
  4. Fuzzy Search.

HTR and the LLM-Regex combination show strong performance and adaptability, both reaching an F1 score of 0.88. Regex alone delivers high accuracy but lacks coverage, while Fuzzy Search proves effective at handling the transcription errors common in historical documents, offering a balance between accuracy and robustness. This research offers initial but practical solutions for the digitization and semantic enrichment of historical archives, and it also addresses the challenges of preserving contextual integrity when constructing knowledge graphs from extracted data.
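To make the contrast between the last two approaches concrete, here is a minimal Python sketch; the record text and the Dutch phrasing are invented for illustration and are not taken from the KNIL registers. A strict regular expression misses a spelling variant that a fuzzy token match still recovers:

```python
import re
import difflib

# Invented HTR-style transcription of a service record entry; the
# Dutch phrasing is illustrative, not taken from the actual registers.
transcription = "Jan Pietersen, gebooren te Amsterdam den 5 Maart 1798"

# Regex approach: precise, but blind to spelling variants it does not
# anticipate ("gebooren" instead of "geboren").
birth_pattern = re.compile(
    r"geboren te (?P<place>\w+) den (?P<day>\d{1,2}) (?P<month>\w+) (?P<year>\d{4})"
)
print(birth_pattern.search(transcription))  # None: the strict pattern misses it

def find_fuzzy(text, keyword, threshold=0.8):
    """Return the index of the first token similar to `keyword`, or None."""
    for i, token in enumerate(text.split()):
        if difflib.SequenceMatcher(None, token.lower(), keyword).ratio() >= threshold:
            return i
    return None

# Fuzzy search tolerates the HTR spelling noise that defeats the regex.
idx = find_fuzzy(transcription, "geboren")
if idx is not None:
    tokens = transcription.split()
    print(tokens[idx:])  # ['gebooren', 'te', 'Amsterdam', 'den', '5', 'Maart', '1798']
```

In the thesis's setting, events recovered this way are subsequently linked into a knowledge graph, which is where preserving contextual integrity, such as keeping a link from each extracted event back to its source page, becomes important.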

More details can be found in Annriya’s thesis below.

Paper about automatic labeling in IJDL

Our paper “Evaluating Unsupervised Thesaurus-based Labeling of Audiovisual Content in an Archive Production Environment” was accepted for publication in the International Journal on Digital Libraries (IJDL). The paper, co-authored with Roeland Ordelman and Josefien Schuurman, reports on a series of information extraction experiments carried out at the Netherlands Institute for Sound and Vision (NISV). Specifically, it presents a two-stage evaluation of unsupervised labeling of audiovisual content using subtitles, looking at how such an approach can provide acceptable results given requirements with respect to archival quality, authority, and service levels to external users.

[Figure: the TESS term extraction pipeline]

For this, we developed a text extraction pipeline (TESS), pictured here, which extracts key terms and matches them to the NISV thesaurus, the GTAA. The journal paper is an extended version of a paper previously accepted at the TPDL conference; here we add an analysis of the term extraction after it was taken into production, focusing on performance variation with respect to term types and television programs. Having implemented the procedure in our production workflow allows us to gradually develop the system further and to assess the effect of the transformation from manual to automatic annotation from an end-user perspective.
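As a rough illustration of the idea behind such a pipeline (not the actual TESS implementation, and with an invented mini-thesaurus standing in for the GTAA), term extraction plus thesaurus matching can be sketched in a few lines of Python:

```python
from collections import Counter
import difflib

# Invented mini-thesaurus; the real GTAA is far larger and the
# production TESS pipeline is considerably more elaborate.
THESAURUS = {"verkiezingen", "voetbal", "koningshuis", "onderwijs"}
STOPWORDS = {"de", "het", "een", "en", "van", "in", "op", "is", "dus"}

def extract_and_match(subtitles, top_n=5):
    """Count candidate key terms in subtitle text and keep only
    those that (fuzzily) match a thesaurus term."""
    tokens = [t.strip(".,!?").lower() for t in subtitles.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    labels = []
    for term, _ in counts.most_common(top_n):
        # Exact hit, or close match to absorb subtitle spelling noise.
        hits = difflib.get_close_matches(term, THESAURUS, n=1, cutoff=0.85)
        if hits:
            labels.append(hits[0])
    return labels

print(extract_and_match(
    "De verkiezingen en het voetbal domineren het nieuws, de verkiezingen dus."
))  # ['verkiezingen', 'voetbal']
```

The two-stage evaluation in the paper essentially asks how well labels assigned along these lines hold up against the archive's quality and service requirements.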

The paper will appear on the journal's site shortly. A final draft version can be found here: deboer_ijdl2016evaluating_draft [PDF].