About This Corpus

Historical Encyclopaedia Britannica Digital Archive

The Corpus

This digital corpus contains OCR-processed text from seven editions of the Encyclopaedia Britannica published between 1771 and 1860. Read together, the editions show how the encyclopaedia's account of human knowledge changed across the Enlightenment and the early Industrial Revolution.

Editions Included

Sources

The source PDF documents come from two collections:

Technical Details

Text extraction was performed using olmOCR, an OCR toolkit built on a vision-language model and developed by the Allen Institute for AI. olmOCR preserves document structure and provides a character-level mapping from the extracted text to source page numbers, enabling precise provenance tracking for each article.
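
As an illustration of how a character-level page mapping can support provenance lookups, here is a minimal Python sketch. It assumes the mapping is available as a list of (start_char, end_char, page_number) spans sorted by start offset, with the end offset exclusive. That layout and the function names `page_for_offset` and `pages_for_range` are assumptions made for this example, not a description of the corpus's actual schema.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Assumed span layout: (start_char, end_char, page_number), sorted by
# start_char, with end_char exclusive. The real data may be laid out differently.
PageSpan = Tuple[int, int, int]


def page_for_offset(spans: List[PageSpan], offset: int) -> Optional[int]:
    """Return the page whose span contains `offset`, or None if no span does."""
    starts = [start for start, _, _ in spans]
    # Index of the last span that starts at or before `offset`.
    i = bisect_right(starts, offset) - 1
    if i < 0:
        return None
    start, end, page = spans[i]
    return page if start <= offset < end else None


def pages_for_range(spans: List[PageSpan], start: int, end: int) -> List[int]:
    """Return the sorted pages overlapped by the character range [start, end)."""
    return sorted({page for s, e, page in spans if s < end and start < e})


if __name__ == "__main__":
    # Made-up spans for demonstration only.
    example_spans = [(0, 100, 1), (100, 250, 2), (250, 400, 3)]
    print(page_for_offset(example_spans, 120))
    print(pages_for_range(example_spans, 90, 260))
```

Given an article's start and end offsets in the extracted text, `pages_for_range` would list the source pages it spans, which is the kind of lookup needed to cite an article back to its scanned pages.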

Usage

This corpus is provided for research and educational purposes. Individual articles can be downloaded in Markdown format for offline use.

Data Format

The underlying data is available in JSONL format with full provenance including:

Acknowledgments

This project was made possible by the open data policies of the National Library of Scotland and the Internet Archive's commitment to universal access to knowledge.