An Archive Unearthed

Newspaper Navigator indexes visual elements in archival text.

Published May 13, 2020

An algorithm indexed photos, ads, and other images embedded in 170 years of American newspapers.

What’s new: Created by researchers at the University of Washington and the U.S. Library of Congress, Newspaper Navigator uses object recognition to organize visual features in 16 million pages of newspapers dating back to 1789. The tool makes it easy to search this archive, and hopefully others before long, for visual elements.

How it works: The researchers fine-tuned Faster R-CNN to flag seven types of visual content in newspapers, from cartoons to maps.

They trained the system on Beyond Words, an annotated archive of World War I-era newspapers. The dataset also includes transcriptions of headlines and captions to help calibrate optical character recognition.
The researchers added labels for headlines and advertisements.
The system uses optical character recognition to append titles and captions to illustrations and photos. It also produces machine-readable versions of headlines.
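The caption step above can be pictured with a small sketch. It is not the researchers' code: it assumes the detector returns labeled bounding boxes with confidence scores and that OCR returns words with pixel coordinates, then attaches to each box the words whose centers fall inside it. The names `OcrWord`, `Detection`, `attach_text`, and the 0.5 score threshold are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in page pixels; an assumed convention.
Box = Tuple[float, float, float, float]


@dataclass
class OcrWord:
    """One word from OCR output (hypothetical structure)."""
    text: str
    box: Box


@dataclass
class Detection:
    """One region predicted by the object detector (hypothetical structure)."""
    label: str
    score: float
    box: Box
    text: str = ""


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(box: Box, point: Tuple[float, float]) -> bool:
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def attach_text(detections: List[Detection], words: List[OcrWord],
                min_score: float = 0.5) -> List[Detection]:
    """Keep confident detections and attach the OCR words inside each one."""
    kept = []
    for det in detections:
        if det.score < min_score:
            continue  # drop low-confidence regions
        inside = [w for w in words if contains(det.box, center(w.box))]
        # Simplified reading order: by top edge, then left edge.
        # Real OCR output would need line grouping to handle uneven baselines.
        inside.sort(key=lambda w: (w.box[1], w.box[0]))
        det.text = " ".join(w.text for w in inside)
        kept.append(det)
    return kept


if __name__ == "__main__":
    words = [
        OcrWord("Troops", (12, 210, 60, 225)),
        OcrWord("return", (65, 210, 110, 225)),
        OcrWord("home", (12, 230, 50, 245)),
        OcrWord("WEATHER", (300, 10, 380, 30)),
    ]
    detections = [
        Detection("photograph", 0.92, (10, 20, 200, 250)),
        Detection("map", 0.30, (290, 0, 400, 100)),
    ]
    for det in attach_text(detections, words):
        print(det.label, "|", det.text)
```

Matching on word centers rather than full overlap is a design choice that tolerates words straddling a box edge. A production system would likely need a more careful rule.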

Behind the news: A number of researchers are using AI to mine the mountains of information locked in digitized newspapers and other historical sources.

PageNet recognizes page boundaries in handwritten historical documents.
Swiss researchers devised dhSegment, a neural network that handles a range of tasks on historical images, such as analyzing a document’s layout and detecting the ornamental initial letters that open chapters in many old texts.
The University of Nebraska-Lincoln’s Aida project seeks out poetry in old newspapers.

Why it matters: Newspapers are invaluable resources for historians, journalists, and other researchers. Newspaper Navigator’s creators open-sourced their work so it can be used to search other digital archives.

We’re thinking: Sometimes we have a hard time finding old GIFs from The Batch. Maybe the Library of Congress could give us a hand too?
