LiLa: Linking Latin


Building a Knowledge Base of Linguistic Resources for Latin

Despite the increase in the quantity and coverage of linguistic resources, most of them remain locked in data silos, which prevents users from exploiting their individual and joint potential in an interoperable way.

A current approach to interlinking linguistic resources adopts the principles of the Linked Data paradigm, originally developed for the Semantic Web. What this active area of research still lacks, however, is fine-grained interaction between linguistic resources, one that extends beyond descriptive metadata to individual word occurrences in a text or entries in a lexicon.

The LiLa: Linking Latin project (2018-2023) was awarded funding from the European Research Council (ERC) to build a Knowledge Base of linguistic resources for Latin based on the Linked Data paradigm, i.e. a collection of multifarious data sets described with the same vocabulary of knowledge description and linked together following the Resource Description Framework (RDF).

The LiLa Knowledge Base uses the lemma as the ideal interface between lexical and textual resources. As a consequence, the core of LiLa is a collection of more than 200k Latin lemmas, called the Lemma Bank: interoperability is achieved by linking all the entries in lexical resources and all the tokens in textual resources that point to the same lemma. The Lemma Bank is modeled as a collection of Lexical Forms of the Ontolex-lemon model. Lexical resources are connected to the Lemma Bank by linking their Lexical Entries to the corresponding lemma, using the ontolex:canonicalForm property. Textual resources are connected by linking their tokens to the corresponding lemma, using the lila:hasLemma property from the LiLa ontology.
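The linking scheme described above can be illustrated with a minimal sketch in plain Python, where RDF triples are represented as tuples. Only the two property names come from the text; every URI (the `ex:` identifiers) and the helper `index_by_lemma` are hypothetical placeholders, not actual LiLa data or tooling.

```python
from collections import defaultdict

# Property names as given in the text; all "ex:" identifiers are invented.
CANONICAL_FORM = "ontolex:canonicalForm"
HAS_LEMMA = "lila:hasLemma"

# (subject, predicate, object) triples from two toy lexicons and two toy corpora.
triples = [
    ("ex:dictA/entry/amo", CANONICAL_FORM, "ex:lemma/amo"),
    ("ex:dictB/entry/amare", CANONICAL_FORM, "ex:lemma/amo"),
    ("ex:corpus1/token/17", HAS_LEMMA, "ex:lemma/amo"),
    ("ex:corpus1/token/42", HAS_LEMMA, "ex:lemma/rosa"),
    ("ex:corpus2/token/3", HAS_LEMMA, "ex:lemma/amo"),
]

def index_by_lemma(triples):
    """Group lexical entries and corpus tokens under the lemma they point to."""
    index = defaultdict(lambda: {"entries": [], "tokens": []})
    for subject, predicate, obj in triples:
        if predicate == CANONICAL_FORM:
            index[obj]["entries"].append(subject)
        elif predicate == HAS_LEMMA:
            index[obj]["tokens"].append(subject)
    return index

index = index_by_lemma(triples)
amo = index["ex:lemma/amo"]
print(amo["entries"])  # entries from both lexicons
print(amo["tokens"])   # tokens from both corpora
```

Because both kinds of resources point to the same lemma node, a single lookup on that node retrieves dictionary entries and text occurrences together, which is the interoperability the Lemma Bank is meant to provide.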

To harmonize the different lemmatization criteria found across resources, the Lemma Bank records the graphical variants of each form (modeled as ontolex:writtenRep; e.g., sulfur/sulphur). It also defines a specific subclass of the lila:lemma class, called lila:hypolemma, which is used for participles and deadjectival adverbs; these are connected to their base verb and base adjective, respectively. Moreover, a dedicated property, lila:lemmaVariant, connects different forms that can be used to lemmatize the same lexical item (for instance, diametros and diametrus).
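As a rough illustration of how these three mechanisms reconcile diverging lemmatization choices, the sketch below models them with plain Python dictionaries. The identifiers (`lemma:…`, `hypo:…`), the toy data beyond the examples in the text, and the helper names are assumptions made for the example, not the actual structure of the Lemma Bank.

```python
# Hypothetical miniature Lemma Bank; all identifiers are placeholders.
written_reps = {
    "lemma:sulfur": {"sulfur", "sulphur"},   # graphical variants of one lemma
    "lemma:diametros": {"diametros"},
    "lemma:diametrus": {"diametrus"},
    "lemma:amo": {"amo"},
    "lemma:clarus": {"clarus"},
    "hypo:amans": {"amans"},                 # participle
    "hypo:clare": {"clare"},                 # deadjectival adverb
}

# Pairs of lemmas that lemmatize the same lexical item (lemma variants).
lemma_variants = [("lemma:diametros", "lemma:diametrus")]

# Hypolemma -> base lemma (participle -> verb, adverb -> adjective).
hypolemma_of = {"hypo:amans": "lemma:amo", "hypo:clare": "lemma:clarus"}

def lookup(form):
    """Return the ids of all lemmas having `form` as a written representation."""
    return sorted(i for i, reps in written_reps.items() if form in reps)

def equivalents(lemma_id):
    """Return the lemma together with its lemma variants (symmetric relation)."""
    result = {lemma_id}
    for a, b in lemma_variants:
        if a == lemma_id:
            result.add(b)
        elif b == lemma_id:
            result.add(a)
    return sorted(result)

def base_lemma(lemma_id):
    """Map a hypolemma to its base lemma; ordinary lemmas map to themselves."""
    return hypolemma_of.get(lemma_id, lemma_id)

print(lookup("sulphur"))
print(equivalents("lemma:diametrus"))
print(base_lemma("hypo:amans"))
```

With this arrangement, a resource that lemmatizes a participle under the participle itself and one that lemmatizes it under the verb still end up reachable from the same base lemma.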

The textual resources for Latin interlinked so far in the LiLa Knowledge Base are the following: UDante Treebank; Querolus sive Aulularia; LASLA corpus; Augustini Confessiones; Computational Historical Semantics Corpus; Fibonacci’s Liber Abbaci (chapter VIII); Corpus for Latin Sociolinguistic Studies on Epigraphic textS (CLaSSES); Lucani Pharsalia; The Index Thomisticus Treebank.

The data page of the LiLa website lists the resources interlinked by the project. For each resource, it provides a link to the GitHub repository where the source data (most importantly, the Turtle RDF file) can be downloaded, and a link to the page of the resource in LiLa.

Currently, the LiLa Knowledge Base includes approximately 145K entries from lexical resources and 3.5M tokens from textual resources, drawn from a total of 158 works. Overall, LiLa consists of more than 70M RDF triples.

The project built a few services to query and populate the LiLa Knowledge Base:

(1) TextLinker, a tool that automatically tokenizes, lemmatizes, PoS-tags and links to LiLa the tokens of an input raw text in Latin;

(2) LISP, a graphical platform to run queries on the resources interlinked in the Knowledge Base;

(3) an interface for querying the Lemma Bank;

(4) a SPARQL access point with a number of ready-made queries on the resources made interoperable through LiLa.
