Modern Search in Old Newspapers

Old newspapers are among the most important research resources for academics and journalists, and they offer a fascinating pool of information for anyone interested in history. In the project “Europeana Newspapers”, Günter Mühlberger and his team at the University of Innsbruck converted 10 million historic newspaper pages into searchable full text.
Image: Peepholes into history – Europeana Newspapers offers access to historic newspapers from 23 European countries to everyone. Credit: www.europeana.eu

“Considering the good will and moderation of the Western powers, peace is probable; however, considering the Russian attitude towards the Eastern Question and the demands made by Petersburg politics, war is probable. The latter would result in a completely restructured European map.” This was French statesman Édouard Drouyn de Lhuys’ assessment of the prospects for peace in the Crimean War, published in the Viennese newspaper “Morgen-Post” on April 14th, 1855. That his words, written 160 years ago and particularly interesting in view of the current political situation, can be quoted here is the result of a Europe-wide digitization campaign. In a massive collaborative effort by all partners in the project “Europeana Newspapers”, newspapers, some dating back to the 17th century, have been digitized and are now accessible as full text. In addition, a search interface was developed that gives academics, journalists and other interested people simple, free access to historic newspapers from 25 libraries in 23 countries. The research group Digitalization and Electronic Archiving at the University of Innsbruck was a major project partner and contributed significantly to developing and implementing new state-of-the-art digitization tools.

Just a few clicks

Accessing information in archives has become much easier: a search now takes a short query and a few clicks. Searching in archives used to be a time-consuming and highly formal procedure: one had to travel to the library in question, request the newspapers and then read them on site under strict terms of use. Günter Mühlberger, head of the group Digitalization and Electronic Archiving at the Institute for German Studies, believes that Europeana Newspapers will make a big difference to modern humanities research. “We now have search possibilities far exceeding the classic library catalogues, where your search is limited to the date or the title of the article,” says Mühlberger, whose team was responsible for converting a total of 10 million newspaper pages into full text.

The team in Innsbruck has both the expertise in character recognition and the infrastructure for such an undertaking: approximately 300 terabytes of scan data submitted from all over Europe had to be processed and handled. “We are looking at two years of pure processing time on 32 cores,” says technical project leader Günter Hackl, describing the scale of the effort. He proudly adds that the biggest cluster for optical character recognition (OCR) in Europe is located at the University of Innsbruck.
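To get a feel for these numbers, they can be broken down per page. The following is a back-of-the-envelope sketch, not a project figure: it assumes that the two years of processing and the 300 terabytes both refer to the same 10 million pages, that all 32 cores were busy the whole time, that a year has 365 days, and that a terabyte is 10^12 bytes. Under those assumptions a page takes roughly 200 core-seconds and about 30 megabytes of scan data.

```python
# Figures quoted in the article.
PAGES = 10_000_000
CORES = 32
YEARS = 2
SCAN_BYTES = 300 * 10**12  # assumption: decimal terabytes (10^12 bytes)

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day years


def per_page_estimates(pages=PAGES, cores=CORES, years=YEARS,
                       scan_bytes=SCAN_BYTES):
    """Return rough per-page processing time and scan size."""
    wall_seconds = years * SECONDS_PER_YEAR
    core_seconds = wall_seconds * cores  # assumption: all cores fully used
    return {
        "core_seconds_per_page": core_seconds / pages,
        "wall_seconds_per_page": wall_seconds / pages,
        "megabytes_per_page": scan_bytes / pages / 10**6,
    }


estimates = per_page_estimates()
for name, value in estimates.items():
    print(f"{name}: {value:.1f}")
```

The function takes all figures as parameters, so the estimate can be redone if, for example, the scan data covered more pages than were converted to full text.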

High textual quality

The challenges of recognizing old typefaces have been a research topic at the University of Innsbruck for many years, and Günter Mühlberger has contributed substantially to building and improving OCR software. “We were project coordinator of the METADATA ENGINE project (METAe), in whose framework the first OCR software for Fraktur – a form of blackletter typeface, a sub-type of Gothic letters – was developed. For METAe Gregor Retti, Birgit Stehno and Alexander Egger developed the metadata standard ALTO, which encodes data from character recognition in a machine-readable format. In the meantime, ALTO has become an international standard. The Library of Congress is now its official maintenance agency and advocates the use of the software,” says Mühlberger about the success of one of the group’s earlier digitization projects. The group’s reputation in this field qualified the Innsbruck team to take part in the demanding Europeana project.
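ALTO files are XML documents in which each recognized word is stored as a String element with its text and its position on the page. The sketch below shows how such a file might be read. The sample data is invented for illustration, and the namespace is the one used by ALTO version 2; files written against other versions of the standard use a different namespace URI, so this is an assumption that would need adjusting.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO v2 namespace; other versions use a different URI.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Invented sample page, not taken from the Europeana Newspapers data.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Morgen-Post" HPOS="100" VPOS="120"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Wien" HPOS="100" VPOS="200"/>
            <SP/>
            <String CONTENT="1855" HPOS="260" VPOS="200"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def words_with_positions(alto_xml):
    """Return (text, horizontal position, vertical position) per word."""
    root = ET.fromstring(alto_xml)
    result = []
    for string in root.findall(".//alto:String", ALTO_NS):
        result.append((
            string.get("CONTENT"),
            float(string.get("HPOS")),
            float(string.get("VPOS")),
        ))
    return result


words = words_with_positions(SAMPLE_ALTO)
for text, hpos, vpos in words:
    print(text, hpos, vpos)
```

Because the position of every word is kept alongside its text, a search hit can be highlighted on the scanned page image rather than only shown as plain text.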
Newspaper pages are particularly challenging because of their complicated layout and poor paper quality: “Running text is recognized the easiest, ads and titles are more complicated,” says Mühlberger. The results speak for themselves: on average, precision is about 80 percent, which means that eight out of ten words are identified correctly. According to Mühlberger, this is a prerequisite for a useful keyword search.
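A figure like “eight out of ten words” can be measured by comparing the OCR output with a manually corrected transcription of the same text. The sketch below is one simple way to do this and is not the project's evaluation method: it aligns the two word sequences with Python's difflib and counts the words that match. The function name and the sample sentences are made up for illustration.

```python
from difflib import SequenceMatcher


def word_accuracy(ground_truth, ocr_output):
    """Share of ground-truth words that the OCR output reproduces exactly."""
    truth_words = ground_truth.split()
    ocr_words = ocr_output.split()
    if not truth_words:
        raise ValueError("ground truth must contain at least one word")
    # Align the two word sequences and add up the matching stretches.
    matcher = SequenceMatcher(a=truth_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


# Invented example: two of the ten words are misread.
truth = "peace is probable however war is probable considering the demands"
ocr = "peace is probahle however war is probable confidering the demands"

accuracy = word_accuracy(truth, ocr)
print(f"word accuracy: {accuracy:.0%}")
```

Aligning the sequences, rather than comparing word by word at the same position, keeps a single dropped or inserted word from counting every following word as an error.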

Centralized data

Research through the Europeana browser is practical for another reason: all data is centralized in one location, so newspapers published in various European countries can be searched for place names, persons or keywords. This also opens up exciting new possibilities for cross-national and comparative research. Mühlberger, a graduate of German Studies, considers himself a mediator between the humanities and computer science as well as between archives and libraries and their users. “Digital Humanities will only be possible and

Europeana: borderless browsing

For three years, 18 institutions from all over Europe worked together closely as main project partners to realize the vision of borderless browsing, which they called “Europeana Newspapers”. Eleven associated and 35 networking partners rounded out the international collaboration. The main coordinator was the Berlin State Library, with the University of Innsbruck and the National Library of Austria as Austria’s institutional project partners. The project ended in March 2015 and was mainly funded under the European Commission’s Competitiveness and Innovation Framework Programme (CIP 2007-2014). Info: http://www.europeana-newspapers.eu/