# What is TerraCotta? AI crawler guide

Canonical URL: https://trakkr.ai/bots/terracotta
Published: 2026-06-11
Last updated: 2026-06-11

Learn what TerraCotta is, who operates it, its verified user-agent, robots.txt posture, and how blocking it can affect AI search, citations, training, or agent visibility.

TerraCotta is the user-agent token of Ceramic AI's crawler, which downloads data used to train LLMs.

## What is TerraCotta?

TerraCotta is a web crawler operated by Ceramic AI that downloads publicly available data for training large language models. It identifies itself with the user-agent token TerraCotta and honors the Robots Exclusion Protocol, so site owners can control its access through standard robots.txt rules. The crawler is documented on GitHub under the CeramicTeam organization, and its sole purpose is to collect training material for Ceramic AI's LLM development.

## What it's for

If you allow TerraCotta, your site's content may be used to train Ceramic AI's language models. Blocking it prevents your pages from being included in that training pipeline, which could affect how Ceramic AI's models understand or represent your domain.

## How to handle TerraCotta

To prevent TerraCotta from crawling your site, add a disallow rule for the TerraCotta user-agent in your robots.txt file. The crawler respects robots.txt, so this is the primary control mechanism. No additional steps are required.
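
If you want to check whether the crawler is still requesting pages after the rule is in place, one option is to count matching entries in your server's access log. The sketch below is only illustrative: it assumes a combined-style log format where the user-agent is the last quoted field, and the sample lines (including the exact user-agent strings) are made up for the example rather than taken from Ceramic AI's documentation.

```python
import re

# Assumption: the user-agent is the last double-quoted field on each log line,
# as in the common "combined" access log format.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_terracotta_hits(log_lines):
    """Count log lines whose user-agent field contains the TerraCotta token."""
    hits = 0
    for line in log_lines:
        match = UA_PATTERN.search(line)
        # Case-insensitive substring match on the token.
        if match and "terracotta" in match.group(1).lower():
            hits += 1
    return hits


# Hypothetical sample lines; real user-agent strings may carry extra detail.
sample_log = [
    '203.0.113.5 - - [11/Jun/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "TerraCotta"',
    '203.0.113.9 - - [11/Jun/2026:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

terracotta_hits = count_terracotta_hits(sample_log)
print(terracotta_hits)
```

A compliant crawler still fetches robots.txt itself, so requests for that file alone do not indicate the rule is being ignored.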

## robots.txt rule

User-agent: TerraCotta
Disallow: /
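
As a quick sanity check, the rule above can be run through Python's standard-library `urllib.robotparser` to confirm that it blocks the TerraCotta token while leaving other agents unaffected. This is a minimal sketch: the `example.com` URL and the `SomeOtherBot` name are placeholders, and the parser shows how a standards-following client would read the file, not how Ceramic AI's crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# The rule from this section, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: TerraCotta
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL and second agent name, used only for illustration.
page = "https://example.com/articles/1"
terracotta_allowed = is_allowed(ROBOTS_TXT, "TerraCotta", page)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", page)
print(terracotta_allowed, other_allowed)
```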

## Blocking cost

Blocking TerraCotta may exclude your content from Ceramic AI's training data, potentially reducing your site's representation in models built by Ceramic AI.

## Examples

- TerraCotta visits a news website and downloads article text for LLM training.
- TerraCotta crawls a public documentation site to collect technical writing samples.
- TerraCotta fetches pages from an e-commerce site to learn product descriptions.

## Related bots

- LAIONDownloader: Also tracked as a training crawler.
- AI2Bot: Also tracked as a training crawler.
- CCBot: Also tracked as a training crawler.
- cohere-training-data-crawler: Also tracked as a training crawler.
- img2dataset: Also tracked as a training crawler.
- PanguBot: Also tracked as a training crawler.
- Applebot-Extended: Also tracked as a training crawler.
- Google-Extended: Also tracked as a training crawler.
- GPTBot: Also tracked as a training crawler.
- AI Training Opt-Out: TerraCotta is a training crawler tied to this policy decision.
- Robots.txt: Robots.txt is the control file used to allow or block TerraCotta.

## Frequently Asked Questions

### What does TerraCotta do?

TerraCotta is a crawler from Ceramic AI that downloads web content to train large language models.

### Does TerraCotta obey robots.txt?

Yes, TerraCotta honors the Robots Exclusion Protocol and will follow disallow rules set in robots.txt.

### How can I block TerraCotta?

Add the line "User-agent: TerraCotta" followed by the line "Disallow: /" to your robots.txt file to block it completely.
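
If robots.txt is generated or maintained by a script, the block can be appended only when it is missing. The following is a rough sketch under that assumption; the function names are hypothetical, and an existing TerraCotta group is left untouched even if its rules differ from a full block.

```python
RULE = "User-agent: TerraCotta\nDisallow: /\n"


def has_terracotta_group(robots_txt: str) -> bool:
    """Return True if any User-agent line already names TerraCotta."""
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "field: value".
        field, _, value = line.split("#", 1)[0].partition(":")
        if field.strip().lower() == "user-agent" and value.strip().lower() == "terracotta":
            return True
    return False


def ensure_terracotta_block(robots_txt: str) -> str:
    """Append the TerraCotta disallow group unless one is already present."""
    if has_terracotta_group(robots_txt):
        return robots_txt
    prefix = robots_txt.rstrip("\n")
    # Separate groups with a blank line when there is existing content.
    separator = "\n\n" if prefix else ""
    return prefix + separator + RULE


existing = "User-agent: *\nAllow: /\n"
updated = ensure_terracotta_block(existing)
print(updated)
```

Running the function a second time on its own output returns the text unchanged, so it is safe to call on every deploy.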

### Who operates TerraCotta?

TerraCotta is operated by Ceramic AI, and its documentation is available on GitHub under the CeramicTeam organization.

### What happens if I block TerraCotta?

Blocking TerraCotta prevents your site's content from being used in Ceramic AI's LLM training, which may affect how their models handle your domain.

## Data And Sources

- [Ceramic AI documentation](https://github.com/CeramicTeam/CeramicTerracotta) - Primary source for TerraCotta crawler details.
