As supervisor of many MSc and BSc theses, I find myself giving writing tips and guidelines quite often. Inspired by Jan van Gemert’s guidelines, I compiled my own document with tips and guidelines for writing a CS/AI/IS bachelor or master thesis. These are things that I personally care about; other lecturers might have different ideas. Also, this is by no means a complete list, and I will treat it as a living document. You can find it here: https://tinyurl.com/victorthesiswriting
It was great to see that one of this year’s Digital Humanities in Practice projects led to a conversation between the students in that project, Helene Ayar and Edith Brooks; their external supervisors, Willemien Sanders (UU) and Mari Wigham (NISV); and André Krouwel (VU), an advisor for another project. That conversation resulted in original research and the CLARIAH MediaSuite data story “‘Who’s speaking?’ – Politicians and parties in the media during the Dutch election campaign 2021”, in which the content of news programmes was analysed for politicians’ names, gender and party affiliation.
The results are very interesting and subsequently appeared on the Dutch news site NOS.nl, showing that right-wing politicians are more represented on radio and TV: “Onderzoek: Rechts domineert de verkiezingscampagne op radio en tv” (“Research: The right dominates the election campaign on radio and TV”). Well done and congratulations!
This year’s edition of the VU Digital Humanities in Practice course was, of course, a virtual one. In this course, students of the Minor Digital Humanities and Social Analytics put everything they have learned in the minor into practice, tackling a real-world DH or Social Analytics challenge. As in previous years, we had wonderful projects provided and supervised by colleagues from various institutes: projects related to the ODISSEI and CLARIAH research infrastructures, projects supervised by KNAW-HUC and Stadsarchief Amsterdam, projects from Utrecht University, UvA, Leiden University and our own Vrije Universiteit, a project related to Kieskompas, and even a project supervised by researchers from Bologna University. A wide variety of challenges, datasets and domains! We would like to thank all the supervisors and students for making this course a success.
The compilation video below shows all the projects’ results. It combines 2-minute videos produced by each of the 10 student groups.
After a very nice virtual poster session, everybody got to vote on the Best Poster Award. The winners are group 3, whose video is also included in the compilation above. Below we list all the projects and the external supervisors.
| Group | Project | Organisation – Supervisor(s) |
|---|---|---|
| 1 | Extracting named entities from Social Science data | ODISSEI project / VU CS – Ronald Siebes |
| 2 | Gender bias data story in the Media Suite | CLARIAH project / UU / NISV – Mari Wigham, Willemien Sanders |
| 3 | Food & Sustainability | KNAW-HUC – Marieke van Erp |
| 4 | Visualizing Political Opinion (Kieskompas) | Kieskompas – André Krouwel |
| 5 | Kickstarting the HTR revolution | UU – Auke Rijpma |
| 6 | Reconstructing the international crew and ships of the Dutch West India Company | Stadsarchief Amsterdam – Pauline van den Heuvel |
| 7 | Enriching audiovisual encyclopedias | NISV – Jesse de Vos |
| 8 | Using Social Media to Uncover How Patients Cope | LIACS Leiden – Anne Dirkson |
| 9 | Covid-19 Communities | UvA – Julia Noordegraaf, Tobias Blanke, Leon van Wissen |
| 10 | Visualizing named graphs | Uni Bologna – Marilena Daquino |
Place names (toponyms) are highly ambiguous and may change over time. This makes it hard to link mentions of places to their corresponding modern entities and coordinates, especially in a historical context. We focus on a historical toponym disambiguation approach: entity linking based on identified context toponyms.
The thesis specifically looks at the American Gazetteer. Its entries contain fundamental information about major places in the vicinity of the place described. By identifying and exploiting these context toponyms, we aim to estimate the most likely position of a historical entry and link it to its contemporary counterpart.
In this case study, Bram Schmidt examined the toponym recognition performance of the state-of-the-art Named Entity Recognition (NER) tools spaCy and Stanza on historical texts, and tested two new heuristics to facilitate efficient entity linking to the geographical database GeoNames.
We tested our method against a subset of manually annotated records of the gazetteer. The results show that both NER tools perform insufficiently at automatically identifying the relevant toponyms in the free text of a historical lemma. However, exploiting correctly identified context toponyms by calculating the minimal distance among them proves successful, and combining both approaches into one algorithm improves the recall score.
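To illustrate the minimal-distance heuristic, here is a minimal sketch in Python (not the thesis code; the candidate list and coordinates are hypothetical stand-ins for GeoNames query results):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def disambiguate(candidates, context_coords):
    """Pick the candidate whose total distance to the context toponyms is minimal.

    candidates: (name, (lat, lon)) tuples, e.g. GeoNames hits for one ambiguous toponym.
    context_coords: (lat, lon) pairs of toponyms already resolved from the same lemma.
    """
    return min(candidates,
               key=lambda c: sum(haversine_km(c[1], ctx) for ctx in context_coords))

# Hypothetical example: "Springfield" with two resolved context toponyms.
springfields = [("Springfield, MA", (42.10, -72.59)), ("Springfield, IL", (39.80, -89.64))]
context = [(42.36, -71.06), (41.76, -72.67)]  # Boston, Hartford
print(disambiguate(springfields, context)[0])  # -> "Springfield, MA"
```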
Authorship attribution is the process of correctly attributing a publication to its corresponding author, which is often done manually in real-life settings. This task becomes inefficient when there are many options to choose from because authors share the same name. Authors can be characterized by features found in their associated publications, which suggests that machine learning could automate this process. However, authorship attribution introduces a typical class-imbalance problem, due to the vast number of possible labels in a supervised machine learning setting. To complicate matters further, we also use problematic input data, as this mimics the data available to many institutions: heterogeneous and sparse in nature.
The thesis investigates how to automate authorship attribution given these known problems and this type of input data, and whether automation is possible in the first place. It considers children’s literature and publications with between 5 and 20 potential authors (who share the exact same name). We implement different machine learning methodologies for this task. In addition, we consider all available types of data (as provided by the National Library of the Netherlands), as well as the integration of contextual information.
Furthermore, we consider different computational representations of textual input (such as the title of a publication), in order to find the most effective representation of sparse text to serve as input for a machine learning model. These experiments are preceded by a pipeline that consists of pre-processing the data, feature engineering and selection, converting the data to vector space representations, and integrating linked data. This pipeline demonstrably improves performance on the heterogeneous data inputs.
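As a rough illustration of such a pipeline (a minimal sketch, not the thesis code; the column names and toy records are hypothetical), one could combine a TF-IDF representation of titles with categorical publication metadata:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy records: title text plus publication metadata, labeled with an author id.
books = pd.DataFrame({
    "title": ["De kleine kapitein", "Pluk van de Petteflet", "De brief voor de koning"],
    "publisher": ["Leopold", "Querido", "Leopold"],
    "year": [1970, 1971, 1962],
    "author_id": [0, 1, 2],
})

pipeline = Pipeline([
    ("features", ColumnTransformer([
        ("title_tfidf", TfidfVectorizer(), "title"),            # sparse text -> TF-IDF
        ("metadata", OneHotEncoder(handle_unknown="ignore"),
         ["publisher", "year"]),                                 # categorical metadata
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),           # author classification
])
pipeline.fit(books[["title", "publisher", "year"]], books["author_id"])
```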
Ultimately, the thesis shows that automation can be achieved in up to 90% of the cases, which can significantly reduce the cost and time of authorship attribution in a real-world setting and thus facilitate more efficient work procedures. Along the way, the thesis also establishes the following key findings:
- In the comparison of machine learning methodologies, two are considered: author classification and similarity learning. Author classification gives the best raw performance (F1: 0.92), but similarity learning provides the most robust predictions and better explainability (F1: 0.88). For a real-life setting with end users, the latter is recommended, as it is more suitable for integrating machine learning into cataloguers’ workflows, at only a small cost in performance.
- The addition of contextual information actively increases performance, but the effect depends on the type of information included. Publication metadata and biographical author information are considered for this purpose. Publication metadata yields the best performance (predominantly the publisher and the year of publication), while biographical author information, in contrast, negatively affects performance.
- We consider BERT, word embeddings (Word2Vec and fastText) and TF-IDF as representations of textual input. BERT ultimately gives the best performance: up to a 200% performance increase compared to word embeddings. BERT is a sophisticated transformer-based language model, which yields a more intricate semantic representation of the text that can be used to identify the associated authors (see the sketch after this list).
- Based on surveys and interviews, we also find that end users mostly attach importance to author-related information when performing manual authorship attribution. Looking more closely at the machine learning models, we see that these primarily base their predictions on publication metadata features. We argue that such differences in the perceived importance of information need not lead to negative experiences, as multiple options exist for harmonizing how both parties use the information.
- Nizar’s thesis can be found here.
- The code is found on GitHub: https://github.com/KBNLresearch/Demosaurus/tree/kinderboeken/ML_Nizar
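As a hedged illustration of the BERT-based title representation mentioned above (a minimal sketch, not Nizar’s code; the multilingual checkpoint and the mean-pooling step are assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: a multilingual BERT checkpoint; the thesis may have used a different model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed_title(title: str) -> torch.Tensor:
    """Return a fixed-size vector for a (short, sparse) publication title."""
    inputs = tokenizer(title, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings of the last hidden layer into one vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vector = embed_title("Pluk van de Petteflet")
print(vector.shape)  # torch.Size([768])
```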
[This post is based on Enya Nieland’s MSc Thesis “Generating Earcons from Knowledge Graphs”]
Knowledge Graphs are becoming enormously popular, which means that the users interacting with these complex networks are diversifying. This requires new and innovative ways of interaction. Several methods for visualizing, summarizing or exploring knowledge graphs have been proposed and developed. In this student project we investigated the potential for interacting with knowledge graphs through a different modality: sound.
The research focused on the question of how to generate meaningful sound or music from (knowledge) graphs. The generated sounds should give users some insight into the properties of the network. Enya framed this challenge using the idea of “earcons”: the auditory version of an icon.
Enya eventually developed a method that automatically produces such earcons for arbitrary knowledge graphs. Each earcon consists of three notes that differ in pitch and duration. As an example, listen to the three earcons shown in the figure on the left.
The earcon parameters are derived from network metrics such as the minimum, maximum and average in-degree or out-degree. A tool with a user interface allows users to design earcons based on these metrics.
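As a rough sketch of this idea (not Enya’s tool; the metric-to-note mapping below is a hypothetical choice), one could derive three notes from the degree statistics of a graph:

```python
import networkx as nx

def earcon_notes(graph: nx.DiGraph):
    """Map in-degree statistics of a graph to three (midi_pitch, duration_s) notes.

    Hypothetical mapping: min/avg/max in-degree each set the pitch of one note,
    scaled into one octave above middle C; higher values get shorter notes.
    """
    degrees = [d for _, d in graph.in_degree()]
    lo, hi = min(degrees), max(degrees)
    avg = sum(degrees) / len(degrees)
    span = (hi - lo) or 1
    notes = []
    for value in (lo, avg, hi):
        pitch = 60 + round(12 * (value - lo) / span)   # MIDI 60 = middle C
        duration = 0.6 - 0.4 * (value - lo) / span     # denser -> shorter note
        notes.append((pitch, round(duration, 2)))
    return notes

# Toy knowledge graph: subject -> object edges.
g = nx.DiGraph([("a", "b"), ("c", "b"), ("d", "b"), ("a", "c")])
print(earcon_notes(g))
```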
The different variants were evaluated in an extensive user test with 30 respondents to find out which were the most informative. The results show that the individual elements of an earcon can indeed provide insight into these metrics, but that combining them confuses the listener. In this case, simpler is better.
This tool could complement a tool such as LOD Laundromat, providing instant insight into the complexity of a KG. It could additionally benefit people who are visually impaired and want to get insight into the complexity of Knowledge Graphs.
In the past year, together with Ingrid Vermeulen (VU Amsterdam) and Chris Dijkshoorn (Rijksmuseum Amsterdam), I had the pleasure of supervising two students from VU, Babette Claassen and Jeroen Borst, who participated in a Network Institute Academy Assistant project around art provenance and digital methods. The growing number of datasets and digital services around art-historical information presents new opportunities for conducting provenance research at scale. The Linked Art Provenance project investigated to what extent it is possible to trace the provenance of art works using online data sources.
In this interdisciplinary project, Babette (Art Market Studies) and Jeroen (Artificial Intelligence) collaborated to create a workflow model, shown below, for integrating provenance information from various online sources, such as the Getty Provenance Index. This included an investigation into the potential for automatic extraction of structured data from these online sources.
The model was validated through a case study, in which we investigated whether we could capture information from selected sources about an auction (1804) during which the paintings from the former collection of Pieter Cornelis van Leyden (1732-1788) were dispersed. An example work, the Lacemaker, is shown above. Interviews with various art historians validated the produced workflow model.
The workflow model also provides a basic guideline for provenance research and, together with the Linked Open Data process, can help answer relevant research questions for studies in the history of collecting and the art market.
More information can be found in the Final report.
[This post describes the research of Michelle de Böck and is based on her MSc Information Sciences thesis.]
Digitization of cultural heritage content allows for digital archiving, analysis and other processing of that content. The practice of scanning and transcribing books, newspapers and images, 3D-scanning artworks, or digitizing music has opened up this heritage, for example for digital humanities research or even creative computing. For the performing arts, however, including theater and more specifically dance, digitization is a serious research challenge. Several dance notation schemes exist, the most established being Labanotation, developed in 1920 by Rudolf von Laban. Labanotation uses a vertical staff notation to record human movement in time, with various symbols for limbs, head movements, and types and directions of movement.
Where good digital formats exist for musical scores (e.g. MIDI), for Labanotation these are lacking. While structured formats exist (LabanXML, MovementXML), the majority of content still exists only in non-digitized form (on paper) or as scanned images. The research challenge of Michelle de Böck’s thesis was therefore to identify design features for a system capable of recognizing Labanotation in scanned images.
Michelle designed such a system and implemented it in MATLAB, focusing on a small set of movement symbols. Several approaches were developed and compared, including approaches using pre-trained neural networks for image recognition (AlexNet). The AlexNet-based approach outperformed the others, reaching a classification accuracy of 78.4%. While we are still far from a full-fledged OCR system for Labanotation, this exploration provided valuable insights into the feasibility and requirements of such a tool.
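The thesis itself was implemented in MATLAB; as a hedged illustration of the same transfer-learning idea in Python/PyTorch (the class count, frozen layers and training step below are assumptions for the sketch, not the thesis setup):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_SYMBOLS = 5  # assumption: a few Labanotation movement symbols

# Load AlexNet pre-trained on ImageNet and replace the final layer
# so it classifies Labanotation symbols instead of ImageNet classes.
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
for param in model.features.parameters():
    param.requires_grad = False  # keep the pre-trained convolutional features fixed
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_SYMBOLS)

optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One hypothetical training step on a batch of 224x224 RGB symbol crops.
images = torch.randn(8, 3, 224, 224)        # stand-in for scanned symbol images
labels = torch.randint(0, NUM_SYMBOLS, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```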
Last Friday, the students of the 2018/2019 class of the course Digital Humanities and Social Analytics in Practice presented the results of their capstone internship projects. This course is the final element of the Digital Humanities and Social Analytics minor programme, in which students from very different backgrounds gain skills and knowledge about this interdisciplinary topic.
The course took the form of a 4-week internship at an organization working with humanities or social science data and challenges, and the student groups were asked to use their skills and knowledge to address a research challenge. Projects ranged from cleaning, indexing, visualizing and analyzing humanities datasets to searching for bias in news coverage of political topics. The students showed their competences not only in their research work but also in communicating that research through great posters.
The complete list of student projects and collaborating institutions is below:
- “An eventful 80 years’ war” at the Rijksmuseum: identifying and mapping historical events from various sources
- An investigation into the use of structured vocabularies, also at the Rijksmuseum
- “Collecting and Modelling Event WW2 from Wikipedia and Wikidata” in collaboration with Netwerk Oorlogsbronnen (see poster image below)
- A project where a search index for development documents governed by the NICC foundation was built
- “EviDENce: Ego Documents Events modelliNg – how individuals recall mass violence” – in collaboration with KNAW Humanities Cluster (HUC)
- “Historical Ecology” – where students searched for mentions of animals in historical newspapers – also with KNAW-HUC
- Project MIGRANT: Mobilities and connection project in collaboration with KNAW-HUC and Huygens ING
- Capturing Bias with media data analysis – an internal project at VU looking at identifying media bias
- Locating the CTA Archive Amsterdam, where a geolocation service and search tool were built
- Linking Knowledge Graphs of Symbolic Music with the Web – also an internal project at VU, working with Albert Meroño
In the context of our ArchiMediaL project on Digital Architectural History, a number of student projects explored opportunities and challenges around enriching the colonialarchitecture.eu dataset. This dataset lists buildings and sites in countries outside of Europe that were ruled by Europeans at the time (1850-1970).
Patrick Brouwer wrote his IMM bachelor thesis “Crowdsourcing architectural knowledge: Experts versus non-experts” about the differences in annotation styles between architectural-historical experts and non-expert crowd annotators. The data suggests that crowdsourcing is a viable option for annotating this type of content, although expert annotations were of higher quality than those of non-experts. The image below shows a screenshot of the user study survey.
Rouel de Romas also looked at crowdsourcing, but focused more on the user interaction and the interface involved. In his thesis “Enriching the metadata of European colonial maps with crowdsourcing” he, like Patrick, used the Accurator platform, developed by Chris Dijkshoorn; a screenshot is shown below. The results corroborate the previous study: in most cases the annotations provided by the participants meet the requirements set by the architectural historian; thus, crowdsourcing is an effective method to enrich the metadata of European colonial maps.