[This blog post is based on the Master's thesis in Information Sciences of Bram Schmidt, conducted at the KNAW Humanities cluster and IISG. It reuses text from his thesis.]
Place names (toponyms) are highly ambiguous and may change over time. This makes it hard to link mentions of places to their corresponding modern entities and coordinates, especially in a historical context. We focus on a historical Toponym Disambiguation approach: entity linking based on identified context toponyms.
The thesis specifically looks at the American Gazetteer. Each entry in this text contains fundamental information about major places in the vicinity of the place it describes. By identifying and exploiting these context toponyms, we aim to estimate the most likely position of a historical entry and accordingly link it to its contemporary counterpart.
In this case study, Bram Schmidt examined the toponym recognition performance of the state-of-the-art Named Entity Recognition (NER) tools spaCy and Stanza on historical texts, and tested two new heuristics to facilitate efficient entity linking to the geographical database GeoNames.
We tested our method against a subset of manually annotated records of the gazetteer. Results show that both NER tools perform insufficiently at automatically identifying relevant toponyms in the free text of a historical lemma. However, exploiting correctly identified context toponyms by calculating the minimal distance among them proves successful, and combining the approaches into one algorithm improves the recall score.
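To give an idea of the minimal-distance heuristic, here is a small sketch in Python. All place names, coordinates, and candidate lists below are invented for illustration; they are not the thesis code or actual GeoNames records. The idea: for each candidate entry in the gazetteer database, sum its great-circle distance to the context toponyms and pick the candidate with the smallest total.

```python
import math

def haversine(a, b):
    # Great-circle distance in km between two (lat, lon) pairs
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def disambiguate(candidates, context_points):
    """Pick the candidate whose summed distance to the context toponyms is minimal."""
    return min(candidates,
               key=lambda c: sum(haversine(c["coords"], p) for p in context_points))

# Hypothetical GeoNames candidates for the ambiguous entry "Springfield"
candidates = [
    {"name": "Springfield, MA", "coords": (42.10, -72.59)},
    {"name": "Springfield, IL", "coords": (39.80, -89.64)},
]
# Context toponyms mentioned in the same lemma (here: Boston, Worcester)
context = [(42.36, -71.06), (42.26, -71.80)]
print(disambiguate(candidates, context)["name"])  # Springfield, MA
```

The context toponyms anchor the entry geographically, so the Massachusetts candidate wins by a wide margin here.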
Since November 1, 2020, we have been collaborating on connecting tangible and intangible heritage through knowledge graphs in the new Horizon2020 project “InTaVia”.
To facilitate access to rich repositories of tangible and intangible assets, new technologies are needed that enable their analysis, curation, and communication for a variety of target groups without computational and technological expertise. In the face of many large, heterogeneous, and unconnected heritage collections, we aim to develop supporting technologies to better access and manage in/tangible CH data and topics, to better study and analyze them, to curate, enrich, and interlink existing collections, and to better communicate and promote their inventories.
Our group will contribute to the shared research infrastructure and will be responsible for developing a generic solution for connecting linked heritage data to various visualization tools. We will work on various user-facing services, develop an application shell and front-end for this connection, and be responsible for evaluating the usability of the integrated InTaVia platform for specific user groups. This project will allow for novel user-centric research on topics in Digital Humanities, Human-Computer Interaction, and Linked Data service design.
Authorship attribution is the process of correctly attributing a publication to its corresponding author, which is often done manually in real-life settings. This task becomes inefficient when there are many options to choose from because authors share the same name. Authors can be characterized by features found in their associated publications, which suggests that machine learning can potentially automate this process. However, authorship attribution introduces a typical class imbalance problem due to the vast number of possible labels in a supervised machine learning setting. To complicate this issue even more, we deliberately use problematic input data, as this mimics the data available to many institutions: data that is heterogeneous and sparse in nature.
The thesis searches for answers regarding how to automate authorship attribution given these known problems and this type of input data, and whether automation is possible in the first place. The thesis considers children’s literature and publications that can have between 5 and 20 potential authors (due to having the same exact name). We implement different types of machine learning methodologies for this task. In addition, we consider all available types of data (as provided by the National Library of the Netherlands), as well as the integration of contextual information.
Furthermore, we consider different computational representations for textual input (such as the title of the publication), in order to find the most effective representation for sparse text that can serve as input for a machine learning model. These experiments are preceded by a pipeline that consists of pre-processing data, feature engineering and selection, converting data to other vector space representations, and integrating linked data. This pipeline proves to actively improve performance when used with the heterogeneous data inputs.
Ultimately the thesis shows that automation can be achieved in up to 90% of the cases, and in a general sense can significantly reduce costs and time consumption for authorship attribution in a real-world setting, thus facilitating more efficient work procedures. Along the way, the thesis also arrives at the following key findings:
Two machine learning methodologies are compared: author classification and similarity learning. Author classification grants the best raw performance (F1 = 0.92), but similarity learning provides more robust predictions and increased explainability (F1 = 0.88). For a real-life setting with end users, the latter is recommended, as it presents a more suitable option for integrating machine learning into cataloguers’ workflows, with only a small hit to performance.
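The similarity-learning idea can be sketched roughly as follows: represent each candidate author by the publications already attributed to them, and attribute a new publication to the most similar candidate. All feature names and values below are invented for illustration and do not reflect the thesis implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse feature dicts
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def attribute(publication, profiles):
    """Return the candidate author whose profile is most similar to the publication."""
    return max(profiles, key=lambda name: cosine(publication, profiles[name]))

# Invented author profiles built from metadata of previously attributed works
profiles = {
    "Jan Jansen (1950)": {"publisher:A": 1.0, "term:konijn": 2.0},
    "Jan Jansen (1972)": {"publisher:B": 1.0, "term:ruimte": 3.0},
}
new_pub = {"publisher:A": 1.0, "term:konijn": 1.0}
print(attribute(new_pub, profiles))  # Jan Jansen (1950)
```

A cataloguer can inspect the similarity scores per candidate, which is where the increased explainability of this approach comes from.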
The addition of contextual information increases performance, but the gain depends on the type of information included. Publication metadata and biographical author information are considered for this purpose. Publication metadata yields the best performance (predominantly the publisher and year of publication), while biographical author information in contrast negatively affects performance.
We consider BERT, word embeddings (Word2Vec and fastText), and TF-IDF as representations of textual input. BERT ultimately grants the best performance: up to a 200% performance increase compared to word embeddings. BERT is a sophisticated transformer-based language model, which leads to more intricate semantic representations of text that can be used to identify associated authors.
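For reference, the TF-IDF baseline that BERT is compared against can be sketched in a few lines. The titles below are made up, and this is a toy re-implementation, not the thesis code: each title becomes a sparse vector in which terms shared by many titles are down-weighted and rare, distinctive terms dominate.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors for a list of tokenised documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

# Invented, deliberately sparse book titles
titles = [
    "de avonturen van pim".split(),
    "de reis van pim".split(),
    "sterren en planeten".split(),
]
vecs = tfidf_vectors(titles)
# The rare term "avonturen" outweighs "pim", which occurs in two titles
print(round(vecs[0]["pim"], 3), round(vecs[0]["avonturen"], 3))
```

With titles this short, such vectors are extremely sparse, which is exactly the setting in which a contextual model like BERT has the advantage.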
Based on surveys and interviews, we also find that end users mostly attribute importance to author-related information when engaging in manual authorship attribution. Looking more closely at the machine learning models, we see that they primarily base their predictions on publication metadata features. We find that such differences in perception of information need not lead to negative experiences, as multiple options exist for harmonizing both parties’ usage of information.
This year’s SEMANTiCS conference was a weird one. Like so many other conferences, we had to improvise to deal with the COVID-19 restrictions on travel and event organization. With the help of many people behind the scenes (including the wonderful program chairs Paul Groth and Eva Blomqvist), we did have a relatively normal reviewing process for the Research and Innovation track. In the end, 8 papers were accepted for publication in this year’s proceedings. The authors were then asked to present their work in pre-recorded videos. These were shown in a very nice webinar, together with contributions from industry. All in all, we feel this downscaled version of SEMANTiCS was quite successful.
Last week, I attended the second workshop of the ARIAS working group on AI and the Arts. ARIAS is a platform for research on Arts and Sciences and as such seeks to build a bridge between these disciplines. The new working group looks specifically at the interplay between Arts and AI. Interestingly, this is not only about using AI to make art, but also about exploring what art can do for AI (research). The workshop fell under the ARIAS theme “Art of Listening to the Matter” and consisted of a number of keynote talks and workshop presentations/discussions.
UvA university professor Tobias Blanke kicked off the meeting with an interesting overview of the different ‘schools’ of AI and how they relate to the humanities. Quite interesting was the talk by Sabine Niederer (professor of visual methodologies at HvA) and Andy Dockett. They presented the results of an experiment feeding Climate Fiction (cli-fi) texts to the famous GPT algorithm. The results were then aggregated, filtered, and visualized in a number of risoprint-like pamphlets.
My favourite talk of the day was by writer and critic Flavia Dzodan. Her talk was quite incendiary, presenting a post-colonial perspective on the whole notion of data science. Her point being that data science only truly started with the ‘discoveries’ of the Americas, the subsequent slave trade, and the counting of people this required. She then proceeded to point out some of the more nefarious examples of identification, classification, and other data-driven ways of dealing with humans, especially those from marginalized groups. Her activist/artistic angle on this problem was to me quite interesting, as it tied together themes around representation and participation that appear in the field of ICT4D and those found in AI and (Digital) Humanities. Food for thought at least.
The afternoon was reserved for talks from three artists that wanted to highlight various views on AI and art. Femke Dekker, S. de Jager and Martina Raponi all showed various art projects that in some way used AI technology and reflected on the practice and philosophical implications. Again, here GPT popped up a number of times, but also other methods of visual analysis and generative models.
It is so nice when two often very distinct research lines come together. In my case, Digital Humanities and ICT for Development rarely meet directly. But they sure did come together when Gossa Lô started her Master AI thesis. Gossa, a long-time collaborator in the W4RA team, chose to focus on the opportunities of Machine Learning and Natural Language Processing for West-African folk tales. Her research involved constructing a corpus of West-African folk tales, performing various classification and text generation experiments, and even included a field trip to Ghana to elicit information about folk tale structures. The work (done as part of an internship at Bolesian.ai) resulted in a beautiful Master AI thesis, which was awarded a very high grade.
As a follow up, we decided to try to rewrite the thesis into an article and submit it to a DH or ICT4D journal. This proved more difficult. Both DH and ICT4D are very multidisciplinary in nature and the combination of both proved a bit too much for many journals, with our article being either too technical, not technical enough, or too much out of scope.
The paper examines how machine learning (ML) and natural language processing (NLP) can be used to identify, analyze, and generate West African folk tales. Two corpora of West African and Western European folk tales were compiled and used in three experiments on cross-cultural folk tale analysis:
In the text generation experiment, two types of deep learning text generators are built and trained on the West African corpus. We show that although the generated texts vary in semantic and syntactic coherence, each of them contains West African features.
The second experiment further examines the distinction between the West African and Western European folk tales by comparing the performance of an LSTM (acc. 0.79) with a BoW classifier (acc. 0.93), indicating that the two corpora can be clearly distinguished in terms of vocabulary. An interactive t-SNE visualization of a hybrid classifier (acc. 0.85) highlights the culture-specific words for both.
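A bag-of-words classifier of the kind used in this experiment can be sketched as a tiny smoothed word-count model. The miniature “corpora” below are invented stand-ins for the two folk tale collections, and the code is a toy illustration, not the paper’s implementation.

```python
import math
from collections import Counter

def train_bow(texts_by_label):
    """Per-label word counts for a naive bag-of-words classifier."""
    return {label: Counter(w for t in texts for w in t.split())
            for label, texts in texts_by_label.items()}

def classify(text, model, alpha=1.0):
    # Pick the label with the highest additively smoothed log-likelihood
    vocab = set(w for counts in model.values() for w in counts)
    def score(label):
        counts = model[label]
        total = sum(counts.values())
        return sum(math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
                   for w in text.split())
    return max(model, key=score)

# Invented miniature "corpora" standing in for the two folk tale collections
model = train_bow({
    "west_african": ["anansi the spider tricked the leopard",
                     "the tortoise and the drum of the sky god"],
    "western_european": ["the princess slept in the castle",
                         "a knight rode through the dark forest"],
})
print(classify("the spider and the tortoise", model))  # west_african
```

Because the decision rests entirely on vocabulary counts, the high accuracy of such a classifier is direct evidence that the two corpora differ in vocabulary, which is exactly the point the experiment makes.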
The third experiment describes an ML analysis of narrative structures. Classifiers trained on parts of folk tales according to the three-act structure are quite capable of distinguishing these parts (acc. 0.78). Common n-grams extracted from these parts not only underline cross-cultural distinctions in narrative structures, but also show the overlap between verbal and written West African narratives.
[This post is based on Enya Nieland‘s MSc thesis “Generating Earcons from Knowledge Graphs”.]
Knowledge Graphs are becoming enormously popular, which means that users interacting with such complex networks are diversifying. This requires new and innovative ways of interacting. Several methods for visualizing, summarizing or exploring knowledge have been proposed and developed. In this student project we investigated the potential for interacting with knowledge graphs through a different modality: sound.
The research focused on the question of how to generate meaningful sound or music from (knowledge) graphs. The generated sounds should give users some insight into the properties of the network. Enya framed this challenge with the idea of “earcons”: the auditory version of icons.
Enya eventually developed a method that automatically produces such earcons for arbitrary knowledge graphs. Each earcon consists of three notes that differ in pitch and duration. As an example, listen to the three earcons shown in the figure on the left.
The earcon parameters are derived from network metrics such as minimum, maximum and average indegree or outdegree. A tool with user interface allowed users to design the earcons based on these metrics.
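To make the metric-to-note mapping concrete, here is a small sketch. The specific mapping (average indegree to pitch, maximum outdegree to pitch, node count to duration) is our own illustration and not Enya’s exact design.

```python
def graph_metrics(edges):
    """In/out-degree statistics for a directed graph given as (source, target) pairs."""
    indeg, outdeg = {}, {}
    for s, t in edges:
        outdeg[s] = outdeg.get(s, 0) + 1
        indeg[t] = indeg.get(t, 0) + 1
    nodes = set(indeg) | set(outdeg)
    return {
        "avg_indegree": sum(indeg.values()) / len(nodes),
        "max_outdegree": max(outdeg.values()),
        "num_nodes": len(nodes),
    }

def earcon(metrics):
    # Three notes as (MIDI pitch, duration in ms); the scaling factors are arbitrary
    base = 60  # middle C
    return [
        (base + round(metrics["avg_indegree"] * 4), 250),
        (base + metrics["max_outdegree"], 250),
        (base, 100 * min(metrics["num_nodes"], 8)),
    ]

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
print(earcon(graph_metrics(edges)))  # [(65, 250), (62, 250), (60, 300)]
```

A denser or larger graph thus produces higher and longer notes, so two knowledge graphs with different structure sound recognizably different.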
The different variants were evaluated in an extensive user test with 30 respondents to find out which variants were the most informative. The results show that the individual elements of earcons can indeed provide insight into these metrics, but that combining them confuses the listener. In this case, simpler is better.
This tool could complement a tool such as LOD Laundromat by providing instant insight into the complexity of KGs. It could additionally benefit people who are visually impaired and want to get an insight into the complexity of Knowledge Graphs.
The Virtual Human Rights Lawyer is a joint project of Vrije Universiteit Amsterdam and the Netherlands Office of the Public International Law & Policy Group to help victims of serious human rights violations obtain access to justice at the international level. It enables users to find out how and where they can access existing global and regional human rights mechanisms in order to obtain some form of redress for the human rights violations they face or have faced.
On 1 October 2019, the Horizon2020 Interconnect project started. The goal of this huge and ambitious project is to achieve a relevant milestone in the democratization of efficient energy management, through a flexible and interoperable ecosystem in which distributed energy resources can be soundly integrated, with effective benefits for end users.
To this end, its 51 partners (!) will develop an interoperable IoT and smart-grid infrastructure, based on semantic technologies, that includes various end-user services. The results will be validated in 7 pilots in EU member states, including one in the Netherlands with 200 apartments.
The role of VU is, in close collaboration with TNO, to extend and validate the SAREF ontology for IoT, as well as other relevant ontologies. VU will lead a task on developing Machine Learning solutions for Knowledge Graphs and on extending these solutions into usable middle layers for user-centric ML services in the pilots, specifically the aforementioned Dutch pilot, where VU will collaborate with TNO, VolkerWessel iCity, and Hyrde.
Last week, I attended the SEMANTiCS2019 conference in Karlsruhe, Germany. This was the 15th edition of the conference that brings together Academia and Industry around the topic of Knowledge Engineering and Semantic Technologies and the good news was that this year’s conference was the biggest ever with 426 unique participants.
I was not able to join the workshop day or the DBpedia day on Monday and Thursday respectively, but was there for the main programme. The first day opened with a keynote from Oracle’s Michael J. Sullivan about Hybrid Knowledge Management Architecture and how Oracle is betting on Semantic Technology in combination with data lake architectures.
The 2nd keynote, by Michel Dumontier of Maastricht University, covered the principles of FAIR data publishing and current advances in actually measuring the FAIRness of datasets.
During one of the parallel sessions, I attended the presentation of the eventual best paper winner: “RSP-QL*: Enabling Statement-Level Annotations in RDF Streams” by Robin Keskisärkkä, Eva Blomqvist, Leili Lind, and Olaf Hartig. This was a very nice talk about a very nice and readable paper. The paper describes how the current RDF stream processing language RSP-QL can be extended with the principles of RDF*, which allow statements about statements without traditional reification. The paper nicely mixes formal semantics, an elegant solution, working code, and a clear use case and evaluation. Congratulations to the winners.
Other winners included the best poster, which was won by our friends over at UvA.
The second day for me was taken up by the Special Track on Cultural Heritage and Digital Humanities, which consisted of research papers, use case presentations and posters that relate to the use of Semantic technologies in this domain. The program was quite nice, as the embedded tweets below hopefully show.
All in all, this year’s edition of SEMANTiCS was a great one. I hope next year’s will be even more interesting (I will be general chairing it).