Historical Encyclopaedia Britannica Digital Archive
The Corpus
This digital corpus contains the complete text of eight editions of the
Encyclopaedia Britannica published between 1771 and 1860, covering nearly a century
of knowledge from the Scottish Enlightenment through the early Victorian era.
The corpus includes over 80,000 articles and 23,000 cross-references totalling
approximately 120 million words.
Editions Included
1st Edition (1771) — The original three-volume work published in Edinburgh by Andrew Bell and Colin Macfarquhar
2nd Edition (1778–1783) — Expanded to ten volumes
3rd Edition (1797) — Eighteen volumes plus supplement
4th Edition (1810) — Twenty volumes
5th Edition (1817) — Twenty volumes with corrections
6th Edition (1823) — Twenty volumes, edited by Charles Maclaren
7th Edition (1842) — Twenty-one volumes plus index
8th Edition (1860) — Twenty-two volumes
Sources
The source PDF documents come from two collections: the digitised holdings of the National Library of Scotland and the Internet Archive.
Processing Pipeline
The text extraction and article segmentation pipeline consists of several stages:
OCR: Raw text was extracted from scanned PDFs using
OLMoCR (Allen Institute for AI),
a state-of-the-art vision-language model that preserves document layout structure.
Paragraph splitting: OCR output was segmented into paragraphs,
preserving page boundaries and character offsets.
LLM classification: A large language model
(DeepSeek-R1-Distill-Llama-70B) classified each paragraph boundary as an
article start, continuation, cross-reference, or section header.
Article assembly: Paragraphs were assembled into complete
articles based on LLM boundary classifications, with fragment merging and
deduplication across overlapping source files.
Export: Final deduplicated articles were exported as JSONL
with full provenance (source file, character offsets, edition, volume).
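The assembly stage above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the label names and record fields are assumptions about what the LLM boundary classifier might emit.

```python
# Illustrative sketch of the article-assembly stage: paragraphs
# carry a boundary label from the LLM classifier, and consecutive
# paragraphs are grouped into articles. Label and field names
# ('label', 'text', 'article_start', ...) are assumptions.

def assemble_articles(paragraphs):
    """Group classified paragraphs into article records."""
    articles = []
    current = None
    for para in paragraphs:
        if para["label"] in ("article_start", "cross_reference"):
            # A new boundary closes the previous article, if any.
            if current is not None:
                articles.append(current)
            current = {"type": para["label"], "paragraphs": [para["text"]]}
        elif current is not None:
            # Continuations and section headers attach to the open article.
            current["paragraphs"].append(para["text"])
    if current is not None:
        articles.append(current)
    return articles

paras = [
    {"label": "article_start", "text": "ABACUS, an instrument for..."},
    {"label": "continuation", "text": "The Roman abacus consisted of..."},
    {"label": "cross_reference", "text": "COLOUR. See OPTICS."},
]
assembled = assemble_articles(paras)
```

Fragment merging and deduplication across overlapping source files would run on top of this grouping step.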
Article Types
Articles: Full encyclopaedia entries, from short definitions
to multi-thousand-word treatises on subjects like Chemistry, Agriculture, or Medicine
Cross-references: Redirects from one headword to another
(e.g., "COLOUR. See OPTICS.")
Treatises: Long-form articles (over 5,000 words or with multiple
sub-sections), identified heuristically
Data Format
The underlying data is available in JSONL format with fields including:
Article title and unique ID
Edition and volume
Article type (article or cross-reference)
Character offsets within source OCR text
Word count, paragraph count
Sub-sections, keywords, and author attribution (where detected)
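Since the export is one JSON object per line, it can be consumed with nothing beyond the standard library. A minimal sketch, assuming key names (`edition`, `type`) that mirror the field list above but are not documented exact keys:

```python
# Minimal sketch of reading the JSONL export and tallying
# cross-references per edition. Key names are assumptions.
import json
from collections import Counter

def load_articles(path):
    """Yield one article record per non-empty JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def xref_counts_by_edition(path):
    """Count cross-reference records in each edition."""
    return Counter(
        rec["edition"]
        for rec in load_articles(path)
        if rec["type"] == "cross-reference"
    )
```

Because `load_articles` is a generator, the full 120-million-word corpus never needs to be held in memory at once.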
Acknowledgements
This project was made possible by the open data policies of the National Library
of Scotland and the Internet Archive. OCR processing was performed on Compute Canada
HPC infrastructure (Nibi cluster, H100 GPUs). Article boundary detection used the
DeepSeek-R1-Distill-Llama-70B model served via vLLM on USask's Plato cluster
(A100 80GB GPU).