Database and Digital Sourcebook

Transkribus models

Update: 13th May 2020 – Noscemus GM v2 (fully revised and extended version) now available

A revised and extended version of our Transkribus model is now available. For the second version, a substantial number of new pages were added, including prints from the 15th century, texts set in Fraktur and texts with a considerable number of Greek passages.

The model is available to every registered Transkribus user under the name Noscemus GM 2.2. At present (May 2020), Noscemus GM 2.2 comprises around 1,600 fully corrected pages and reads texts set in Antiqua-based typefaces from the 15th, 16th, 17th and 18th centuries with a high level of accuracy. Although it is tailored to transcribing (Neo-)Latin texts, Noscemus GM 2.2 also provides convincing results for other languages such as French, Italian and English. The Noscemus model can therefore help not only Neo-Latinists, but all kinds of researchers dealing with larger text corpora from the Early Modern Period.

[…] 

In its current state, the model has a handful of known issues: there are occasional inconsistencies in the transcription of quotation marks and diacritics, and the error rate for Greek words and passages is still high.

 

Update: 15th December 2019 – Noscemus GM v1 published

The first Transkribus model of NOSCEMUS, trained by Stefan Zathammer, was published on 15th December 2019. It is available to every registered Transkribus user under the name Noscemus GM v1. The model reads texts set in Antiqua-based typefaces from the 16th, 17th and 18th centuries with a high level of accuracy and consistently outperforms most of the standard OCR engines. Although it is tailored to transcribing (Neo-)Latin texts, Noscemus GM v1 also provides convincing results for other languages such as French, Italian and English. The Noscemus model can therefore help not only Neo-Latinists, but all kinds of researchers dealing with larger text corpora from the Early Modern Period.

The model is based on training data from the project’s Digital Sourcebook and comprises at present (December 2019) around 1,000 fully corrected pages. To give users as much freedom as possible, standardization in the transcription process has been kept to a minimum. Normalization has been applied only in the following cases: ligatures (e.g. æ, œ, ct, ff) and abbreviations (e.g. -que, -us, -tur, …mm…, …nn…) have been expanded, long s (ſ) has been transcribed as a normal s, and small caps have been transcribed as majuscules.
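For users who want to compare the model's output with a diplomatic transcription of their own, the normalization rules above can be approximated in a few lines of Python. This is only a minimal sketch and not the project's own tooling: the function name `normalize_transcription` and the mapping table are hypothetical, the table covers only ligatures that have a standard Unicode code point (the ct ligature, the abbreviation signs and small caps are left out), and expanding capital Æ and Œ to AE and OE is an assumption.

```python
# Hypothetical mapping that mirrors some of the normalization rules described above.
# Only characters with a standard Unicode code point are covered.
EXPANSIONS = {
    "\u00e6": "ae",  # æ ligature
    "\u0153": "oe",  # œ ligature
    "\u00c6": "AE",  # Æ (assumption: capitals expand to two capitals)
    "\u0152": "OE",  # Œ (same assumption)
    "\ufb00": "ff",  # ff ligature
    "\ufb01": "fi",  # fi ligature
    "\ufb02": "fl",  # fl ligature
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",  # st ligature
    "\u017f": "s",   # long s (ſ) becomes a normal s
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(EXPANSIONS)


def normalize_transcription(text: str) -> str:
    """Expand the listed ligatures and replace long s with a normal s."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_transcription("c\u00e6le\u017ftis philo\u017fophia"))
```

Applying the same function to a reference transcription before comparing it with the model's output keeps purely typographic differences from being counted as recognition errors.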

In its current state, the model has a handful of known issues: there are occasional inconsistencies in the transcription of quotation marks, and the error rate for Greek words and passages is still high; to a lesser degree, the same applies to words set in (German) Fraktur.

For more information, see also the post on the Transkribus homepage.

How-to guides: Official Wiki | LaTeX-Ninja (English) | B. Denicolò (German)


Semantic Database
Heffter, Johann Carl, Museum disputatorium, vol. 1, Zittau, 1756.

A tripartite semantic database of authors, works (its centerpiece) and secondary literature is compiled by all team members and in turn serves them all as a working tool. Representativeness is ensured by using the categories of era, literary form and scientific discipline as a heuristic grid. The database will keep growing throughout the project and will ultimately comprise c. 1,500 works.

Link to the Database

 

Digital Sourcebook

From the works listed in the database, c. 200 particularly typical items will be published online, resulting in a digital sourcebook – the first systematic selection of early modern scientific literature in Latin, providing a clear idea of the whole breadth of the field. Each work will be presented with a short introduction based on the information given in the database. In addition to a facsimile, the text will be converted into a digitally searchable format using the transcription platform Transkribus, run by the Digitisation and Digital Preservation Group (DEA) of the University of Innsbruck. If a freely available translation exists, a link to it will be added. Each item will be cross-referenced with similar records in the database so that it can serve as a starting point for research in a particular field. After the end of the project, the database and the sourcebook will remain accessible via the Central Computer Service of the University of Innsbruck and the research data repository Zenodo.

Link to the Digital Sourcebook
