About This Corpus

Historical Encyclopaedia Britannica Digital Archive

The Corpus

This digital corpus contains the complete text of eight editions of the Encyclopaedia Britannica published between 1771 and 1860, covering nearly a century of knowledge from the Scottish Enlightenment through the early Victorian era. The corpus includes over 80,000 articles and 23,000 cross-references, totalling approximately 120 million words.

Editions Included

Sources

The source PDF documents come from two collections:

Technical Pipeline

The text extraction and article segmentation pipeline consists of several stages:

  1. OCR: Raw text was extracted from scanned PDFs using OLMoCR (Allen Institute for AI), a state-of-the-art vision-language model that preserves document layout structure.
  2. Paragraph splitting: OCR output was segmented into paragraphs, preserving page boundaries and character offsets.
  3. LLM classification: A large language model (DeepSeek-R1-Distill-Llama-70B) classified each paragraph boundary as an article start, continuation, cross-reference, or section header.
  4. Article assembly: Paragraphs were assembled into complete articles based on LLM boundary classifications, with fragment merging and deduplication across overlapping source files.
  5. Export: Final deduplicated articles were exported as JSONL with full provenance (source file, character offsets, edition, volume).
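The assembly step (stage 4) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Paragraph` record and label strings are assumptions standing in for the pipeline's real data structures, which also carry page boundaries and character offsets.

```python
from dataclasses import dataclass

# Hypothetical paragraph record; the real pipeline also tracks page
# boundaries and character offsets (names here are assumptions).
@dataclass
class Paragraph:
    text: str
    label: str  # "article_start", "continuation", "cross_reference", "section_header"

def assemble_articles(paragraphs):
    """Group labelled paragraphs into articles: each 'article_start' opens
    a new article; continuations and section headers attach to the open
    article; cross-references are collected separately."""
    articles, xrefs, current = [], [], None
    for p in paragraphs:
        if p.label == "article_start":
            current = [p.text]
            articles.append(current)
        elif p.label == "cross_reference":
            xrefs.append(p.text)
        elif current is not None:  # continuation or section_header
            current.append(p.text)
    return ["\n".join(a) for a in articles], xrefs

paras = [
    Paragraph("ABACUS, an instrument for computation.", "article_start"),
    Paragraph("It was used by the ancients.", "continuation"),
    Paragraph("ABATIS. See FORTIFICATION.", "cross_reference"),
    Paragraph("ABBEY, a monastery.", "article_start"),
]
articles, xrefs = assemble_articles(paras)
```

Fragment merging and deduplication across overlapping source files happen after this grouping, using the provenance offsets to detect the same article extracted twice.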

Article Types

Data Format

The underlying data is available in JSONL format with fields including:
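A record can be read line by line with any JSON Lines reader. The field names in this sketch are illustrative assumptions; the source only guarantees that each record carries the article text plus provenance (source file, character offsets, edition, volume).

```python
import json
import io

# Illustrative JSONL record; key names (headword, char_start, etc.)
# are assumptions, not the corpus's documented schema.
sample = io.StringIO(
    '{"headword": "ABACUS", "text": "ABACUS, an instrument...", '
    '"edition": 1, "volume": 1, "source_file": "eb01_vol1.pdf", '
    '"char_start": 10234, "char_end": 10980}\n'
)

def read_articles(fh):
    """Yield one article dict per JSONL line, skipping blank lines."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(read_articles(sample))
```

Because JSONL keeps one record per line, the corpus can be streamed without loading all 120 million words into memory at once.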

Acknowledgements

This project was made possible by the open data policies of the National Library of Scotland and the Internet Archive. OCR processing was performed on Compute Canada HPC infrastructure (Nibi cluster, H100 GPUs). Article boundary detection used the DeepSeek-R1-Distill-Llama-70B model served via vLLM on the University of Saskatchewan's Plato cluster (A100 80GB GPU).