Corpora of Italian language

This page is no longer actively maintained; an up-to-date overview of Italian language corpora can be found in the dedicated catalog section.

Corpora of Italian usage

Corpora of written Italian

CORIS/CODIS online
Corpus of contemporary written Italian, with around 100 million words. Processed and produced by R. Rossini Favretti (University of Bologna) in 1998, it requires a licence for access to the complete version.

Dizionario italiano multimediale e multilingue d'Ortografia e di Pronunzia della RAI (DOP) (Multimedia and multilingual Italian dictionary of Orthography and Pronunciation of the RAI)
Online version of the dictionary edited from 1959 by Bruno Migliorini, Carlo Tagliavini and Piero Fiorelli, revised and enlarged by P. Fiorelli and T. F. Borri for the RAI headquarters in Florence, and now available in a multimedia version. The corpus, with over 92,000 lexical items of the Italian language and over 37,000 from around sixty other languages, also provides a phonetic recording of each item.

The la Repubblica Corpus
A very large corpus (around 380 million words) of the lexis of the daily newspaper la Repubblica. In the project, curated by the University of Bologna, the corpus has been lemmatised, indexed and categorised by type and topic; each article in the corpus is divided into the following parts: title, subtitle, summary, text.

Corpora of spoken Italian

Corpora e Lessici di Italiano Parlato e Scritto (CLIPS) (Corpora and Lexicons of Spoken and Written Italian)
A wide-ranging project co-ordinated by Federico Albano Leoni (CIRASS – Naples), with participants including the "Federico II" University of Naples, the Scuola Normale Superiore, the Fondazione Bordoni of Rome and the Istituto Superiore Poste e Telecomunicazioni (now ISCOM). Its authors are F. Albano Leoni, F. Cutugno and R. Savy.
Published in 2006, it contains around 100 hours of spoken Italian of various types, subdivided into 5 subcorpora (radio and television, dialogic, read, orthophonic, telephonic). A preliminary sociolinguistic survey across Italy was used to select the 15 cities where the materials for the corpus were collected.
Direct access to the corpus.

KIParla
Corpus of spoken Italian; it collects over 100 hours of conversations in Italian recorded in Bologna and Turin, transcribed and aligned with the audio. The search interface allows results to be filtered by sociodemographic parameters such as age, gender, city of origin and level of education.

Lessico di frequenza dell'Italiano Parlato (LIP) (Frequency lexicon of Spoken Italian)
The LIP corpus, hosted on the web site of the Banca Dati dell'Italiano Parlato (BaDIP) (Database of Spoken Italian), is the most important and most widely used collection of spoken Italian texts in linguistic research. Built in 1990-1992 by a group of linguists directed by Tullio De Mauro, it served to construct the first frequency lexicon of spoken Italian. Its 469 texts, containing a total of around 490,000 words, were collected in four cities (Milan, Florence, Rome and Naples) and are drawn from five macro-classes and numerous subclasses of speech.
Direct access to the search interface of the lemmas.

Corpora of sectorial Italian

LinguaGiovani
A project on the language of young people curated by the Department of Romance Studies of the University of Padua and co-ordinated by Prof. M. Cortelazzo. The project, which aims to co-ordinate research on the topic and to collect both published and unpublished materials, also involves the creation of an online dictionary of youth-language terms gathered through spontaneous online submissions.

Corpora of old Italian

Archivio Datini (Datini Archive)
A lemmatised corpus, curated by the Opera del Vocabolario Italiano, of the correspondence of Francesco Datini (1335-1410), consisting of almost 150,000 letters and accompanied by a commentary that incorporates references to the editorial notes.

Archivio digitale veneto: biblioteca online dei testi veneti dalle origini al XVIII sec. (Venetian digital archive: online library of Venetian texts from the origins to the 18th century)
A corpus of old Venetian literary texts. The project is curated by Ivano Paccagnella and Andrea Cecchinato (Department of Linguistic and Literary Studies of the University of Padua) and involves numerous authoritative collaborations with other scholars. The database allows the texts to be read in full and also provides introductory cards, bibliography and chronological data. An advanced search engine supports the study of the texts and enables the retrieval of philologically and linguistically reliable editions of some important authors of Venetian literature, including Ruzante and Andrea Calmo.

ARchivio TEstuale del SIciliano Antico (ARTESIA) (Textual Archive of Old Sicilian)
This corpus is part of the wider Progetto Artesia, a multi-part tool for the study of medieval Sicilian.
The corpus includes texts of differing types, covering a chronological span from the start of the 14th century, to which the first texts in vernacular Sicilian date back, to the first half of the 16th century, when Sicilian was being progressively replaced by Tuscan as the language of public communication. Project co-ordinated by Mario Pagano.

Corpus dei Classici LAtini VOlgarizzati (CLAVO) (Corpus of Latin Classics Translated into the Vernacular)
The CLAVO database collects the Latin classics translated into the vernacular that are included in the DiVo corpus (DIzionario dei Volgarizzamenti) (Dictionary of Vernacular Translations) and makes nearly 100 translated Latin texts available for consultation. The Latin text is accompanied paragraph by paragraph by the vernacular text. Project curated by the Scuola Normale Superiore of Pisa and the Opera del Vocabolario Italiano.

Corpus Epistolare Ottocentesco Digitale (CEOD) (Digital nineteenth century epistolary corpus)
An epistolary corpus of around 1,350 letters, almost entirely unpublished, by 75 different writers of varying social background; the letters document a considerable range of subjects, geographical origins and socio-cultural levels among the writers. Project co-ordinated by Massimo Palermo.

Corpus OVI dell'italiano antico (OVI corpus of old Italian)
Complete collection of old Italian texts made accessible by the Opera del Vocabolario Italiano (OVI), with 23 million occurrences for more than 450,000 distinct graphic forms. Brief citations may be downloaded for research purposes, but it is prohibited to download the texts. Scientific direction by Pär Larson, Elena Artale and Diego Dotto.

Corpus ReMediA - REpertorio di MEDIcina Antica (Corpus ReMediA – Repertory of Ancient Medicine)
A corpus of medico-scientific texts (in particular various medical and surgical treatises and prescriptions) in various Romance languages or in vulgar Latin, curated by Elena Artale and Ilaria Zamuner.

Corpus Taurinense: an old Italian corpus
A collection of Florentine texts of the 13th century, ordered by lemma, parts of speech, literary genre and philological forms. Scientific direction by Manuel Barbera and Carla Marello.

Morfologia dell'Italiano in DIAcronia (MIDIA) (Morphology of Italian in diachrony)
A corpus of texts written in Italian, ranging from the start of the 13th century to the first half of the 20th century. It includes over 7 million occurrences taken from around 800 texts. Created as part of the PRIN 2009 project "La storia della formazione delle parole in italiano" ("The history of word formation in Italian"), funded by the MIUR, MIDIA offers search tools that facilitate data extraction. These are particularly useful for the diachronic study of word formation in Italian, but can also be used for various other types of linguistic research.

Tesoro della lingua italiana delle origini (TLIO) (Treasury of the Italian language from its origins)
Online version of the renowned historical dictionary of Italian curated by the Opera del Vocabolario Italiano (OVI), an institute of the Consiglio Nazionale delle Ricerche (National Research Council) based at the Accademia della Crusca in Florence. The online version, with over 12,000 entries, is based on the OVI's textual corpus of old Italian. This textual database offers effective search tools, including an interface for retrieving the entries, forms, editors and definitions present in the corpus, and a database containing bibliographical data on the authors cited, the bibliography cited in the entries and other data.