Data Ethics and Privacy
OCAP, sensitive data, and cloud-based AI tools
The Core Issue
Claude Code, Codex, and all other cloud-based AI tools send your data to remote servers for processing. When you ask Claude Code to read a CSV file, the contents of that file leave your computer. This is fine for many datasets. It is not fine for all of them.
Before using any AI tool with your data, ask: who has rights over this data, and what am I allowed to do with it?
OCAP Principles
Historians working with Indigenous data in Canada need to understand OCAP — the First Nations principles of Ownership, Control, Access, and Possession:
- Ownership: First Nations communities collectively own their cultural knowledge, data, and information
- Control: First Nations communities have the right to control all aspects of data management and research processes
- Access: First Nations communities must have access to data about themselves and their communities
- Possession: First Nations communities must have physical control of data
Sending Indigenous community data to Anthropic’s or OpenAI’s servers almost certainly violates the Possession and Control principles. The data leaves the community’s physical control and is processed by a third-party corporation, regardless of what that corporation’s privacy policy says.
This is not a technical limitation to work around. It is a matter of Indigenous data sovereignty.
If your research involves Indigenous data, consult with the relevant community and your institution’s research ethics board before using any cloud-based AI tool with that data.
What Subscriber Privacy Actually Means
AI providers vary widely in how they handle your data, and the details matter:
- Anthropic uses an opt-in model: your data is not used for training unless you choose to allow it. This applies to both free and paid tiers.
- OpenAI, xAI, Perplexity, and others use an opt-out model: your data is used for training by default, even on paid plans, unless you find the setting and turn it off. Many users do not know this.
Do not assume that paying for a subscription protects your data. Check the specific provider’s settings and policies.
Even with training opt-out enabled, your data is not private in the way that data on your own computer is private:
- Your data is transmitted over the internet to the provider’s servers
- It is processed on infrastructure you do not control
- It may be stored temporarily for safety monitoring
- It is subject to the provider’s terms of service, which can change
- It is subject to the legal jurisdiction where the servers are located (typically the United States)
A paid subscription is not a data agreement. If your institution requires specific data handling guarantees — as is common for health data, personnel records, or data covered by research ethics protocols — you need a formal agreement between your institution and the AI provider.
What Is Safe to Use
There is no simple checklist. Every dataset requires you to consider who has rights over it, what restrictions apply, and whether sending it to a cloud service is appropriate.
As a rough guide, data that you created yourself — your own notes, your own writing, your own code — is straightforward. Open government datasets explicitly published for reuse (like the Canada boundaries data we use in this workshop) are generally fine, but check the licence. Beyond that, the answer depends on the specific data and context.
“Publicly available” does not mean “free to use however you like.” Newspaper archives may be under copyright. Government records are not open by default in every jurisdiction, and some carry access restrictions. Indigenous communities that shared data online did not consent to that data being ingested by AI companies. The fact that something is technically accessible does not settle the ethical question.
Use with care:
- Archival materials with access restrictions or donor agreements
- Oral history transcripts (informed consent may not have covered AI processing)
- Data collected under research ethics protocols — check whether your protocol permits cloud processing
- Unpublished research data
- Published materials that may be under copyright (newspaper archives, published books, etc.)
Do not use without explicit authorization:
- Data produced by or about Indigenous communities — even if publicly available online. OCAP principles apply to the relationship, not just the access method.
- Data held in trust under OCAP or community data governance agreements
- Health or medical records
- Personal information about living individuals
- Data covered by FIPPA, PIPEDA, or institutional data governance policies
- Data subject to NDA or contractual restrictions
Practical Strategies
Work with metadata instead of data
You can often describe your data to Claude Code without giving it the actual records:
I have a dataset with columns: name, community, date_of_birth,
traditional_territory. I need to model this in CIDOC-CRM.
What classes and properties should I use?
Claude Code can help you design a data model, write transformation scripts, and debug code without ever seeing the sensitive data itself.
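You can even generate that schema description locally, so nothing sensitive ever leaves your machine. The sketch below, using only Python's standard library, reads a CSV's header row and row count and produces text you could paste into a prompt; the file path is a placeholder for your own data.

```python
import csv

def describe_schema(path):
    """Summarize a CSV's structure without exposing any row contents.

    Returns only the column names and a row count: text that is
    safe to share with a cloud tool when the records themselves are not.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)          # first line: column names only
        row_count = sum(1 for _ in reader)  # count rows without keeping them
    return f"Columns: {', '.join(header)}\nRows: {row_count}"
```

Run this yourself, check the output contains nothing sensitive (column names can occasionally be revealing), and paste the result into your prompt.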
Use synthetic or anonymized data
Create a small sample dataset with fictional entries that mirrors the structure of your real data. Use that for developing your workflow with Claude Code, then run the final scripts on your real data locally, without AI assistance.
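As a minimal sketch of what that might look like, the following generates a small CSV of entirely fictional records. The column names reuse the hypothetical schema from the earlier prompt example; the names, communities, and territories are placeholders you would replace with values that mirror your own data's shape.

```python
import csv
import random

# Hypothetical schema mirroring the real dataset's columns.
# Every value below is fictional, so the output file is safe to
# share with a cloud-based tool while developing a workflow.
COLUMNS = ["name", "community", "date_of_birth", "traditional_territory"]
FAKE_NAMES = ["Alex Example", "Sam Sample", "Jo Placeholder"]
FAKE_COMMUNITIES = ["Community A", "Community B"]
FAKE_TERRITORIES = ["Territory X", "Territory Y"]

def make_synthetic_sample(path, n=10, seed=42):
    """Write n fictional rows with the same structure as the real data."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        for _ in range(n):
            writer.writerow({
                "name": rng.choice(FAKE_NAMES),
                "community": rng.choice(FAKE_COMMUNITIES),
                "date_of_birth": f"19{rng.randint(50, 99)}"
                                 f"-0{rng.randint(1, 9)}-1{rng.randint(0, 9)}",
                "traditional_territory": rng.choice(FAKE_TERRITORIES),
            })
```

A dozen fictional rows is usually enough to develop and debug a transformation script; the structure matters, not the volume.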
Run scripts locally
Claude Code can write a Python script that transforms your data. You can then run that script yourself, on your own machine, without Claude Code. The script does not send data anywhere — it is just a file on your computer.
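To make this concrete, here is an illustrative example of the kind of script you might develop this way: it replaces a sensitive column with stable hashed identifiers. This is a sketch, not a vetted anonymization method (the column name and `person-` prefix are assumptions, and a real workflow would need a keyed or salted hash plus ethics review), but it shows the pattern: the script reads and writes local files and makes no network calls.

```python
import csv
import hashlib

def pseudonymize(in_path, out_path, column="name"):
    """Replace a sensitive column with stable hashed identifiers.

    Runs entirely on your own machine; no data is sent anywhere.
    Note: an unsalted hash is vulnerable to dictionary attacks, so
    treat this as a development sketch, not production anonymization.
    """
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Same input value always maps to the same identifier,
            # so relationships between rows are preserved.
            digest = hashlib.sha256(row[column].encode("utf-8")).hexdigest()
            row[column] = f"person-{digest[:8]}"
            writer.writerow(row)
```

You would develop and test a script like this against the synthetic sample, then run it on the real data yourself, offline.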
Institutional agreements
Some Canadian universities are negotiating enterprise agreements with AI providers that include data residency and processing guarantees. Check with your institution’s IT or research services office. These agreements may allow cloud processing of data that would otherwise be off-limits.
The Bottom Line
These tools are powerful, but they are cloud services operated by American corporations. Treat them the way you would treat any other cloud service: useful for work that is already public or that you own outright, but not a place to put data that belongs to someone else or that carries legal or ethical restrictions.
When in doubt, describe your data to the tool instead of showing it. You get most of the benefit with far less of the risk.