Historical Encyclopaedia Britannica Digital Archive
This digital corpus contains OCR-processed text from seven editions of the Encyclopaedia Britannica published between 1771 and 1860. Together, the editions trace how recorded knowledge changed over the Enlightenment and the early Industrial Revolution.
The source PDF documents come from two collections: the National Library of Scotland and the Internet Archive.
Text extraction was performed using olmOCR, an OCR toolkit built on a vision-language model and developed by the Allen Institute for AI. olmOCR preserves document structure and maps character ranges in the extracted text to source page numbers, enabling precise provenance tracking for each article.
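To illustrate how a character-to-page mapping supports provenance lookups, here is a minimal sketch in Python. It assumes the mapping is a sorted list of (start_char, end_char, page_number) spans with an exclusive end; the span values and the function names are hypothetical and are not taken from the corpus itself.

```python
from bisect import bisect_right

# Hypothetical page spans: (start_char, end_char, page_number), end exclusive.
# Assumed sorted by start_char and non-overlapping.
PAGE_SPANS = [(0, 1200, 14), (1200, 2650, 15), (2650, 3100, 16)]


def page_for_offset(spans, offset):
    """Return the page number whose span contains `offset`, or None."""
    starts = [start for start, _end, _page in spans]
    # Index of the last span that starts at or before the offset.
    i = bisect_right(starts, offset) - 1
    if i < 0:
        return None
    start, end, page = spans[i]
    return page if start <= offset < end else None


def pages_for_range(spans, begin, end):
    """Return the sorted page numbers touched by the text range [begin, end)."""
    return sorted({page for s, e, page in spans if s < end and begin < e})
```

With spans like these, an article occupying characters 1000 to 2700 of the extracted text would be attributed to every page its range overlaps, which is the kind of lookup page-level provenance needs.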
This corpus is provided for research and educational purposes. Individual articles can be downloaded in Markdown format for offline use.
The underlying data is available in JSONL format, with full provenance recorded for each article.
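As a rough sketch of how JSONL data of this kind can be read, the Python snippet below parses one JSON object per line. The field names ("title", "edition", "text") and the sample records are made up for illustration; the actual schema of the corpus may differ.

```python
import io
import json


def iter_records(lines):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample standing in for a real file; field names are assumptions.
SAMPLE = io.StringIO(
    '{"title": "EXAMPLE ARTICLE A", "edition": 1, "text": "..."}\n'
    "\n"
    '{"title": "EXAMPLE ARTICLE B", "edition": 3, "text": "..."}\n'
)

records = list(iter_records(SAMPLE))
titles_by_edition = {r["edition"]: r["title"] for r in records}
```

The same function accepts an open file handle, for example one obtained with open(path, encoding="utf-8"), so a large file can be streamed line by line rather than loaded into memory at once.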
This project was made possible by the open data policies of the National Library of Scotland and the Internet Archive's commitment to universal access to knowledge.