About This Corpus

Historical Encyclopaedia Britannica Digital Archive

The Corpus

This digital corpus contains the complete text of eight editions of the Encyclopaedia Britannica published between 1771 and 1860, covering nearly a century of knowledge from the Scottish Enlightenment through the early Victorian era. The corpus includes over 80,000 articles and 23,000 cross-references, totalling approximately 120 million words.

Editions Included

Sources

The source PDF documents come from two collections:

Technical Pipeline

The text extraction and article segmentation pipeline consists of several stages:

  1. OCR: Raw text was extracted from scanned PDFs using OLMoCR (Allen Institute for AI), a state-of-the-art vision-language model that preserves document layout structure.
  2. Paragraph splitting: OCR output was segmented into paragraphs, preserving page boundaries and character offsets.
  3. LLM classification: A large language model (DeepSeek-R1-Distill-Llama-70B) classified each paragraph boundary as an article start, continuation, cross-reference, or section header.
  4. Article assembly: Paragraphs were assembled into complete articles based on LLM boundary classifications, with fragment merging and deduplication across overlapping source files.
  5. Export: Final deduplicated articles were exported as JSONL with full provenance (source file, character offsets, edition, volume).
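The assembly step (stage 4) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Paragraph` record and label strings are assumptions standing in for the pipeline's real data structures, which also carry page boundaries and character offsets.

```python
from dataclasses import dataclass

# Hypothetical paragraph record; the real pipeline also tracks page
# boundaries and character offsets (names here are assumptions).
@dataclass
class Paragraph:
    text: str
    label: str  # "article_start", "continuation", "cross_reference", "section_header"

def assemble_articles(paragraphs):
    """Group labelled paragraphs into articles: each 'article_start' opens
    a new article; continuations and section headers attach to the open
    article; cross-references are collected separately."""
    articles, xrefs, current = [], [], None
    for p in paragraphs:
        if p.label == "article_start":
            current = [p.text]
            articles.append(current)
        elif p.label == "cross_reference":
            xrefs.append(p.text)
        elif current is not None:  # continuation or section_header
            current.append(p.text)
    return ["\n".join(a) for a in articles], xrefs

paras = [
    Paragraph("ABACUS, an instrument for computation.", "article_start"),
    Paragraph("It was used by the ancients.", "continuation"),
    Paragraph("ABATIS. See FORTIFICATION.", "cross_reference"),
    Paragraph("ABBEY, a monastery.", "article_start"),
]
articles, xrefs = assemble_articles(paras)
```

Fragment merging and deduplication across overlapping source files happen after this grouping, using the provenance offsets to detect the same article extracted twice.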

Article Types

Data Format

The underlying data is available in JSONL format with fields including:
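A record can be read line by line with any JSON Lines reader. The field names in this sketch are illustrative assumptions; the source only guarantees that each record carries the article text plus provenance (source file, character offsets, edition, volume).

```python
import json
import io

# Illustrative JSONL record; key names (headword, char_start, etc.)
# are assumptions, not the corpus's documented schema.
sample = io.StringIO(
    '{"headword": "ABACUS", "text": "ABACUS, an instrument...", '
    '"edition": 1, "volume": 1, "source_file": "eb01_vol1.pdf", '
    '"char_start": 10234, "char_end": 10980}\n'
)

def read_articles(fh):
    """Yield one article dict per JSONL line, skipping blank lines."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(read_articles(sample))
```

Because JSONL keeps one record per line, the corpus can be streamed without loading all 120 million words into memory at once.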

Acknowledgements

This project was made possible by the open data policies of the National Library of Scotland and the Internet Archive. OCR processing was performed on Compute Canada HPC infrastructure (Nibi cluster, H100 GPUs). Article boundary detection used the DeepSeek-R1-Distill-Llama-70B model served via vLLM on the University of Saskatchewan's Plato cluster (A100 80GB GPU).