This post will provide all the participants of the Verrijkt Koninkrijk Hackathon with the information and data they need to start building great applications.
On this Piratepad, I suggest we all note down the progress and results: [http://piratepad.net/tzvFB5AGuk]
In this deliverable document [ dx1- deliverable pdf], you can find more detailed information about (the origin of) the data.
The Verrijkt Koninkrijk Data concerns Dr Loe de Jong’s Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog and was based on the PDFs as provided by NIOD at http://www.niod.knaw.nl/koninkrijk/ . The books have been OCRed and transformed to structured XML by researchers from the Universiteit van Amsterdam. This data is available through www.loedejongdigitaal.nl. A search interface is available at http://search.loedejongdigitaal.nl. A resolver server was installed which responds by presenting the structure (in XML) when presented with a URL. For example, http://resolver.loedejongdigitaal.nl/nl.vk.d.188.8.131.52 is resolved to the XML fragment of that paragraph. Removing the last number of the identifier (43) results in its broader section, etcetera. Paragraphs are the smallest logical units (also, a page is not a logical unit).
We provide two RDF ‘stepping stones’ into the book text. The ‘Back of the Book index’ and the ‘Named Entities index’. Both are SKOS vocabularies and consist of terms pointing to resolver.loedejongdigitaal.nl URIs. These vocabularies are linked to external sources as well. All RDF is available as Linked Data at the VK Semantic Layer at http://semanticweb.cs.vu.nl/verrijktkoninkrijk/ The base namespace for the VK/NIOD triples is http://purl.org/collections/nl/niod/ (abbreviated as niod:). Datasets, mapping sets and schemata are all loaded as separate named graphs (http://semanticweb.cs.vu.nl/verrijktkoninkrijk/browse/list_graphs).
A SPARQL endpoint is also available at http://semanticweb.cs.vu.nl/verrijktkoninkrijk/sparql/ with an interactive SPARQL editor available at http://semanticweb. cs.vu.nl/verrijktkoninkrijk/flint/ You can login with “hacker”/”hacker” (if needed).
Back of the book index (BotB index)
The BotB index consists of 15,234 SKOS Concepts, consolidated from the manual index. They link to RDF blank nodes using the niod:pageRef predicate. The blank node links to individual paragraphs using niod:parRef predicates. An example is shown below.
The BotB index is partially aligned with the NIOD thesaurus (see below), GeoNames, Cornetto and AATNed. The BotB index, the schema, the alignments are found in separate RDF turtle files
Named Entity index (NE index)
This SKOS vocabulary consists of 88,243 concepts, resulting of Named Entity recognition. The NEs are of type person, location, organisation, misc, product and event. They link into the text through direct niod:pRef links. An example is shown below.
The NE concepts are partially aligned with DBPedia (through wikilinks established during the NER process), with GeoNames (locations only), with GTAA and the NIOD thesaurus. There is also mapping to the BotB index (can be used to use a ‘higher quality’ subset).
Example SPARQL Queries
This page lists a number of sparql queries that exploit some of the links presented above. It accompanies a paper and deliverable.
Other data sources in the semantic layer
- NIOD thesaurus: This in-house thesaurus consists of 1241 concepts that are used to annotate various objects in the NIOD archives. This includes:
- BeeldbankWO2, an image archive with over 175,000 images gathered from various Dutch national war- and resistance-museums . Unfortunately, the high-quality images are not available as Open Data, only as thumbnails.
- Vier en Vijf Mei monument data: This data concerns all the war monuments listed by Comite vier en vijf mei as Linked Data (RDF).
- GeoNames (NL): only the Dutch subset of GeoNames
- DBpedia (partial): only the subset of resourses that are linked to from de NE index
Within the project, we are very much interested in the concept of pillarization and how Loe de Jong describes it. For this reason, we have added a turtle file what links Pillar concepts (Protestants, Jews, Communists, etc) to persons, organisations etc. found in the BotB index. This is a manual list of 60 links, which was semi-automatically expanded to 254 links. You can find the original list here and the expanded one here.
We did a number of analyses using this data, a (Dutch) PDF document describing the results can be found here [zuilen (pdf)].
Het Koninkrijk is licensed under the Creative Commons Naamsvermelding 3.0 Nederland licentie. The VK Linked Data are NIOD thesaurus are also available under that licenseV. The VM monument data is available under the CC0 Publieke Domein Dedicatie verklaring license