Building a Knowledge Base of Linguistic Resources for Latin
Despite the increase in the quantity and coverage of linguistic resources, most of these are locked in data silos, which prevents users from exploiting their individual and joint potential in interoperable ways.
A current approach to interlinking linguistic resources takes up the principles of the Linked Data paradigm, originally developed for the purposes of the Semantic Web. What this lively area of research still lacks, however, is a fine-grained level of interaction between linguistic resources, one that extends beyond descriptive metadata to individual word occurrences in a text or entries in a lexicon.
The LiLa: Linking Latin project (2018-2023) was awarded funding from the European Research Council (ERC) to build a Knowledge Base of linguistic resources for Latin based on the Linked Data paradigm, i.e. a collection of multifarious data sets described with the same vocabulary of knowledge description and linked together following the Resource Description Framework (RDF).
The LiLa Knowledge Base uses the lemma as the ideal interface between lexical and textual resources. As a consequence, the core of LiLa consists of a collection of more than 200k Latin lemmas, called the Lemma Bank: interoperability is achieved by linking all those entries in lexical resources and tokens in textual resources that point to the same lemma. The Lemma Bank is modeled as a collection of Lexical Forms of the OntoLex-Lemon model. Lexical resources are connected to the Lemma Bank by linking their Lexical Entries to the corresponding lemma in the Lemma Bank, using the ontolex:canonicalForm property. Textual resources are connected by linking their tokens to the corresponding lemma in the Lemma Bank, using the lila:hasLemma property from the LiLa ontology.
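The linking scheme described above can be illustrated with a minimal sketch in plain Python, where RDF triples are represented as tuples. Only the two property names (ontolex:canonicalForm and lila:hasLemma) come from the description; every identifier for lemmas, entries, tokens and resources is a hypothetical placeholder, not an actual LiLa URI.

```python
from collections import defaultdict

# Hypothetical identifiers, for illustration only (not real LiLa URIs).
LEMMA_SULFUR = "lemma:sulfur"
LEMMA_AMO = "lemma:amo"

triples = [
    # Lexical resources: Lexical Entry -> lemma via ontolex:canonicalForm
    ("dictA:entry_sulfur", "ontolex:canonicalForm", LEMMA_SULFUR),
    ("lexB:entry_amo", "ontolex:canonicalForm", LEMMA_AMO),
    ("dictA:entry_amo", "ontolex:canonicalForm", LEMMA_AMO),
    # Textual resources: token -> lemma via lila:hasLemma
    ("corpusX:token_17", "lila:hasLemma", LEMMA_AMO),
    ("corpusY:token_203", "lila:hasLemma", LEMMA_AMO),
]


def index_by_lemma(triples):
    """Group lexical entries and tokens under the lemma they point to."""
    index = defaultdict(lambda: {"entries": [], "tokens": []})
    for subject, predicate, obj in triples:
        if predicate == "ontolex:canonicalForm":
            index[obj]["entries"].append(subject)
        elif predicate == "lila:hasLemma":
            index[obj]["tokens"].append(subject)
    return index


def entries_for_token(token, triples):
    """Return all lexical entries that share a lemma with the given token."""
    index = index_by_lemma(triples)
    lemmas = [o for s, p, o in triples if s == token and p == "lila:hasLemma"]
    return sorted(e for lemma in lemmas for e in index[lemma]["entries"])


if __name__ == "__main__":
    # A token in one corpus reaches entries in two different lexical resources.
    print(entries_for_token("corpusX:token_17", triples))
```

Because both kinds of resource point to the same lemma node, a token in a corpus and an entry in a lexicon become connected without either resource referring to the other directly.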
In order to harmonize the different lemmatization criteria that can be found in resources, the Lemma Bank includes the graphical variants of the forms (modeled as ontolex:writtenRep; e.g., sulfur/sulphur) and a specific subclass of the lila:lemma class called lila:hypolemma, which is used for participles and deadjectival adverbs, connected to their base verb and base adjective respectively. Moreover, a specific property, lila:lemmaVariant, connects to each other the different forms that can be used to lemmatize the same lexical item (for instance, diametros and diametrus).
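A minimal sketch of how these three devices could work together when looking up a form is given below. It is an illustration under stated assumptions, not the project's implementation: the `Lemma` class, the `resolve` function, the `base` field (standing in for the hypolemma-to-base link, whose property name is not given here) and all identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Lemma:
    uri: str
    written_reps: Set[str]                            # graphical variants of the form
    variants: Set[str] = field(default_factory=set)   # lila:lemmaVariant targets
    base: Optional[str] = None                        # set only for hypolemmas


# Hypothetical miniature Lemma Bank.
bank: Dict[str, Lemma] = {
    "lemma:sulfur": Lemma("lemma:sulfur", {"sulfur", "sulphur"}),
    "lemma:diametros": Lemma(
        "lemma:diametros", {"diametros"}, variants={"lemma:diametrus"}
    ),
    "lemma:diametrus": Lemma(
        "lemma:diametrus", {"diametrus"}, variants={"lemma:diametros"}
    ),
    "lemma:amo": Lemma("lemma:amo", {"amo"}),
    # Participle treated as a hypolemma connected to its base verb.
    "lemma:amans": Lemma("lemma:amans", {"amans"}, base="lemma:amo"),
}


def resolve(form: str, bank: Dict[str, Lemma]) -> Set[str]:
    """Return every lemma reachable from a written form.

    Direct hits come from the written representations; related lemmas
    are added through lemma variants and hypolemma bases.
    """
    hits = {uri for uri, lemma in bank.items() if form in lemma.written_reps}
    related: Set[str] = set()
    for uri in hits:
        lemma = bank[uri]
        related |= lemma.variants
        if lemma.base is not None:
            related.add(lemma.base)
    return hits | related


if __name__ == "__main__":
    for form in ("sulphur", "diametrus", "amans"):
        print(form, sorted(resolve(form, bank)))
```

With this arrangement, a resource that lemmatizes a participle under the participle itself and one that lemmatizes it under the verb still end up connected to the same neighbourhood of the Lemma Bank.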
The linguistic resources for Latin interlinked so far in LiLa are the following:
- Word Formation Latin: a derivational morphology lexicon, where lemmas are analyzed into their formative components, and relationships between them are established on the basis of word formation rules;
- Latin Vallex 2.0 and Latin WordNet: a valency lexicon linking a valency frame to each Latin WordNet synset assigned to an entry;
- Lewis & Short dictionary: a Latin-English bilingual dictionary;
- Glossary of Latin loanwords from the Italian works of Dante Alighieri: a collection of Latin loanwords attested in the four Italian works by Dante Alighieri (Rime, Vita Nova, Convivio and Commedia);
- Velez dictionary: a Latin-Portuguese bilingual dictionary curated by Antonio Velez as an index for the Latin Grammar of Manuel Alvarez;
- Principal Parts Latin: a lexicon listing the principal parts of Latin lexemes, i.e., a set of wordforms from which all the other paradigm cells can be inferred;
- Etymological Dictionary of Latin & the Other Italic Languages: an etymological dictionary that covers the entire Latin lexicon of Indo-European origin. Via LiLa, Proto-Indo-European and Proto-Italic roots are accessible;
- Index Graecorum Vocabulorum in Linguam Latinam Translatorum: a collection of Ancient Greek loanwords in Latin;
- LatinAffectus: a prior polarity lexicon that can be used for Sentiment Analysis purposes;
- Neulateinische Wortliste: a lexical resource that collects entries from Latin texts written in Europe between 1300 and 1600.
The following textual resources are interlinked in the LiLa Knowledge Base: UDante Treebank; Querolus sive Aulularia; LASLA corpus; Augustini Confessiones; Computational Historical Semantics Corpus; Fibonacci’s Liber Abbaci (chapter VIII); Corpus for Latin Sociolinguistic Studies on Epigraphic textS (CLaSSES); Lucani Pharsalia; The Index Thomisticus Treebank.
The data page of the LiLa website lists the resources interlinked by the project. For each resource, it provides links to the GitHub repository from which the source data (most importantly, the Turtle RDF file) can be downloaded and to the page of the resource in LiLa.
Currently, the LiLa Knowledge Base includes approximately 145k entries from lexical resources and 3.5M tokens from textual resources, the latter drawn from a total of 158 works. Overall, LiLa consists of more than 70M RDF triples.
The project built a few services to query and populate the LiLa Knowledge Base:
(1) TextLinker, a tool that automatically tokenizes, lemmatizes, PoS-tags and links to LiLa the tokens of an input raw text in Latin;
(2) LISP, a graphical platform to run queries on the resources interlinked in the Knowledge Base;
(3) an interface for querying the Lemma Bank;
(4) a SPARQL access point with a number of ready-made queries on the resources made interoperable through LiLa.
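As an illustration of the kind of query such an access point could answer, the sketch below builds a SPARQL query that retrieves every lexical entry and token linked to a lemma with a given written form, and encodes it as a GET request URL. Several details are assumptions rather than facts from this description: the endpoint URL is a placeholder, the namespace bound to the lila: prefix is assumed, and the helper names `build_query` and `build_request_url` are hypothetical. The query uses ontolex:writtenRep, the written-representation property of OntoLex-Lemon. No request is actually sent.

```python
from urllib.parse import urlencode

# Placeholder endpoint: replace with the project's actual SPARQL access point.
ENDPOINT = "https://example.org/sparql"

QUERY_TEMPLATE = """\
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX lila: <http://lila-erc.eu/ontologies/lila/>

SELECT ?lemma ?resource WHERE {{
  ?lemma ontolex:writtenRep ?rep .
  FILTER(STR(?rep) = "{form}")
  {{ ?resource ontolex:canonicalForm ?lemma . }}
  UNION
  {{ ?resource lila:hasLemma ?lemma . }}
}}
LIMIT {limit}
"""


def build_query(form: str, limit: int = 50) -> str:
    """Fill the template, escaping characters that would break the literal."""
    escaped = form.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.format(form=escaped, limit=int(limit))


def build_request_url(form: str, endpoint: str = ENDPOINT, limit: int = 50) -> str:
    """Encode the query as a GET request, as in the SPARQL protocol."""
    return endpoint + "?" + urlencode({"query": build_query(form, limit)})


if __name__ == "__main__":
    print(build_query("sulfur", limit=10))
    print(build_request_url("sulfur", limit=10))
```

The UNION of the two linking properties mirrors the design of the Knowledge Base: lexical entries and tokens are reached through the same lemma node, so one query spans both kinds of resource.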