Authorship attribution is the process of correctly attributing a publication to its corresponding author, which is often done manually in real-life settings. This task becomes inefficient when there are many options to choose from due to authors having the same name. Authors can be defined by characteristics found in their associated publications, which could mean that machine learning can potentially automate this process. However, authorship attribution tasks introduce a typical class imbalance problem due to a vast number of possible labels in a supervised machine learning setting. To complicate this issue even more, we also use problematic data as input data as this mimics the type of available data for many institutions; data that is heterogeneous and sparse of nature.
The thesis searches for answers regarding how to automate authorship attribution with its known problems and this type of input data, and whether automation is possible in the first place. The thesis considers children’s literature and publications that can have between 5 and 20 potential authors (due to having the same exact name). We implement different types of machine learning methodologies for this method. In addition, we consider all available types of data (as provided by the National Library of the Netherlands), as well as the integration of contextual information.
Furthermore, we consider different types of computational representations for textual input (such as the title of the publication), in order to find the most effective representation for sparse text that can function as input for a machine learning model. These different types of experiments are preceded by a pipeline that consists out of pre-processing data, feature engineering and selection, converting data to other vector space representations and integrating linked data. This pipeline shows to actively improve performance when used with the heterogeneous data inputs.
Ultimately the thesis shows that automation can be achieved in up to 90% of the cases, and in a general sense can significantly reduce costs and time consumption for authorship attribution in a real-world setting and thus facilitate more efficient work procedures. While doing so, the thesis also finds the following key notions:
Between comparison of machine learning methodologies, two methodologies are considered: author classification and similarity learning. Author classification grants the best raw performance (F1. 0.92), but similarity learning provides the most robust predictions and increased explainability (F1. 0.88). For a real life setting with end users the latter is recommended as it presents a more suitable option for integration of machine learning with cataloguers, with only a small hit to performance.
The addition of contextual information actively increases performance, but performance depends on the type of information inclusion. Publication metadata and biographical author information are considered for this purpose. Publication metadata shows to have the best performance (predominantly the publisher and year of publication), while biographical author information in contrast negatively affects performance.
We consider BERT, word embeddings (Word2Vec and fastText) and TFIDF for representations of textual input. BERT ultimately grants the best performance; up to 200% performance increase when compared to word embeddings. BERT is a sophisticated language model with an applied transformer, which leads to more intricate semantic meaning representation of text that can be used to identify associated authors.
Based on surveys and interviews, we also find that end users mostly attribute importance to author related information when engaging in manual authorship attribution. Looking more in depth into the machine learning models, we can see that these primarily use publication metadata features to base predictions upon. We find that such differences in perception of information should ultimately not lead to negative experiences, as multiple options exist for harmonizing both parties’ usage of information.
The course took the form of a 4-week internship at an organization working with humanities or social science data and challenges and student groups were asked to use these skills and knowledge to address a research challenge. Projects ranged from cleaning, indexing, visualizing and analyzing humanities data sets to searching for bias in news coverage of political topics. The students showed their competences not only in their research work but also in communicating this research through great posters.
The complete list of student projects and collaborating institutions is below:
“An eventful 80 years’ war” at Rijksmuseum identifying and mapping historical events from various sources.
An investigation into the use of structured vocabularies also at the Rijksmuseum
“Collecting and Modelling Event WW2 from Wikipedia and Wikidata” in collaboration with Netwerk Oorlogsbronnen (see poster image below)
A project where an search index for Development documents governed by the NICC foundation was built.
“EviDENce: Ego Documents Events modelliNg – how individuals recall mass violence” – in collaboration with KNAW Humanities Cluster (HUC)
“Historical Ecology” – where students searched for mentions of animals in historical newspapers – also with KNAW-HUC
Project MIGRANT: Mobilities and connection project in collaboration with KNAW-HUC and Huygens ING
Capturing Bias with media data analysis – an internal project at VU looking at indentifying media bias
At the DHBenelux 2018 conference, students from the VU minor “Digital Humanities and Social Analytics” presented their final DH in Practice work. In this video, the students talk about their experience in the minor and the internship projects. We also meet other participants of the conference talking about the need for interdisciplinary research.
At the Digital Humanities Benelux 2017 conference, the e-humanities Events working group organized a panel with the titel “A Pragmatic Approach to Understanding and Utilizing Events in Cultural Heritage”. In this panel, researchers from Vrije Universiteit Amsterdam, CWI, NIOD, Huygens ING, and Nationaal Archief presented different views on Events as objects of study and Events as building blocks for historical narratives.
The session was packed and the introductory talks were followed by a lively discussion. From this discussion it became clear that consensus on the nature of Events or what typology of Events would be useful is not to be expected soon. At the same time, a simple and generic data model for representing Events allows for multiple viewpoints and levels of aggregations to be modeled. The combined slides of the panel can be found below. For those interested in more discussion about Events: A workshop at SEMANTICS2017 will also be organized and you can join!
Last week, the Volkswagen Stiftung-funded “Mixed Methods’ in the Humanities?” programme had its kickoff meeting for all funded projects in in Hannover, Germany. Our ArchiMediaL project on enriching and linking historical architectural and urban image collections was one of the projects funded through this programme and even though our project will only start in September, we already presented our approach, the challenges we will be facing and who will face them (our great team of post-docs Tino Mager, Seyran Khademi and Ronald Siebes). Other interesting projects included analysing of multi-religious spaces on the Medieval World (“Dhimmis and Muslims”); the “From Bach to Beatles” project on representing music and schemata to support musicological scholarship as well as the nice Digital Plato project which uses NLP technologies to map paraphrasing of Plato in the ancient world. An overarching theme was a discussion on the role of digital / quantitative / distant reading methods in humanities research. The projects will run for three years so we have some time to say some sensible things about this in 2020.
I received a good news letter from Volkswagen Stiftung who decided to award us a research grant for a 3-year Digital Humanities project named “ArchiMediaL” around architectural history. This project will be a collaboration between architecture historians from TU Delft, computer scientists from TU Delft and VU-Web and Media. A number of German scholars will also be involved as domain experts. The project will combine image analysis software with crowdsourcing and semantic linking to create networks of visual resources which will foster understanding of understudied areas in architectural history.
From the proposal:In the mind of the expert or everyday user, the project detaches the digital images from its existence as a single artifact and includes it into a global network of visual sources, without disconnecting it from its provenance. The project that expands the framework of hermeneutic analysis through a quantitative reference system, in which discipline-specific canons and limitations are questions. For the dialogue between the history of architecture and urban form this means a careful balancing of qualitative and quantitative information and of negotiating new methodological approaches for future investigation.
The CLARIN framework commissioned the production of dissemmination videos showcasing the outcomes of the individual CLARIN projects. One of these projects was the Dutch Ships and Sailors project, a collaboration between VU Computer Science, VU humanities and the Huygens Institute for National History. In this project, we developed a heterogeneous linked data cloud connecting many different maritime databases. This data cloud allows for new types of integrated browsing and new historical research questions. In the video, we (Victor de Boer together with historians Jur Leinenga and Rik Hoekstra) explain how the data cloud was formed and how it can be used by maritime historians.
This is a nice companion piece to the more technical description of the dataset which was published in the proceedings of ISWC 2014. The new version highlights more the general setup of the project and the considerations and innovations of the project from a historical point of view.
Since submission of this ‘mid-term project description’, the DSS data cloud has been expanding, and the ‘development’ version of the triple store now hosts six datasets thanks to the work of Jeroen Entjes (see the datacloud figure).
[This post was written by Jeroen Entjes and describes his Msc Thesis research]
The Dutch maritime supremacy during the Dutch Golden Age has had a profound influence on the modern Netherlands and possibly other places around the globe. As such, much historic research has been done on the matter, facilitated by thorough documentation done by many ports of their shipping. As more and more of these documentations are digitized, new ways of exploring this data are created.
This master project uses one such way. Based on the Dutch Ships and Sailors project digitized maritime datasets have been converted to RDF and published as Linked Data. Linked Data refers to structured data on the web that is published and interlinked according to a set of standards. This conversion was done based on requirements for this data, set up with historians from the Huygens ING Institute that provided the datasets. The datasets chosen were those of Archangel and Elbing, as these offer information of the Dutch Baltic trade, the cradle of the Dutch merchant navy that sailed the world during the Dutch Golden Age.
Along with requirements for the data, the historians were also interviewed to gather research questions that combined datasets could help solve. The goal of this research was to see if additional datasets could be linked to the existing Dutch Ships and Sailors cloud and if such a conversion could help solve the research questions the historians were interested in.
Data visualization showing shipping volume of different datasets.
As part of this research, the datasets have been converted to RDF and published as Linked Data as an addition to the Dutch Ships and Sailors cloud and a set of interactive data visualizations have been made to answer the research questions by the historians. Based on the conversion, a set of recommendations are made on how to convert new datasets and add them to the Dutch Ships and Sailors cloud. All data representations and conversions have been evaluated by historians to assess the their effectiveness.
This year’s third issue of E-Data and Research magazine features an article about the Dutch Ships and Sailors project. The article (in Dutch) describes how our project provides new ways of interacting with Dutch maritime data. So far, four datasets are present in the DSS data cloud but we are currently extending the dataset with two new datasets. More on that later…
In the same issue, there is an article about the workshop around newspaper data as provided by the National Library. This includes a picture of me presenting the DIVE project.