# What is ICC-Crawler? AI crawler guide

Canonical URL: https://trakkr.ai/bots/icc-crawler
Published: 2026-06-11
Last updated: 2026-06-11

Learn what ICC-Crawler is, who operates it, its verified user-agent, robots.txt posture, and how blocking it can affect AI search, citations, training, or agent visibility.

ICC-Crawler is NICT's crawler for collecting data used in artificial intelligence technologies and in third-party research and commercial applications.

## What is ICC-Crawler?

ICC-Crawler is a web crawler operated by NICT, Japan's National Institute of Information and Communications Technology. It collects publicly available web data to build datasets used in artificial intelligence technologies and for third-party research and commercial applications. The crawler identifies itself with the user-agent token ICC-Crawler and respects the Robots Exclusion Protocol, meaning it will obey directives placed in a site's robots.txt file. Its activity is focused on gathering text and other content that can support machine learning model development and broader AI research initiatives.
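Because the crawler announces itself with the ICC-Crawler token, one way to see whether it visits your site is to search your access logs for that token. The following is a minimal sketch, assuming logs in the common "combined" format; the function name `count_icc_hits`, the sample log lines, and the `ICC-Crawler/0.0 (sample)` user-agent string are made up for illustration and are not the crawler's real full user-agent string.

```python
import re
from collections import Counter

# Case-insensitive match on the user-agent token named in this guide.
ICC_TOKEN = re.compile(r"icc-crawler", re.IGNORECASE)
# Assumption: the request field looks like "GET /path HTTP/1.1".
REQUEST = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')


def count_icc_hits(log_lines):
    """Count requests per path for log lines that mention the ICC-Crawler token."""
    hits = Counter()
    for line in log_lines:
        if not ICC_TOKEN.search(line):
            continue  # Not an ICC-Crawler request.
        match = REQUEST.search(line)
        hits[match.group(1) if match else "<unparsed>"] += 1
    return hits


# Fabricated sample lines; IPs come from documentation-only address ranges.
SAMPLE_LOG = [
    '192.0.2.10 - - [11/Jun/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "ICC-Crawler/0.0 (sample)"',
    '192.0.2.10 - - [11/Jun/2026:10:00:05 +0000] "GET /articles/1 HTTP/1.1" 200 5120 "-" "ICC-Crawler/0.0 (sample)"',
    '198.51.100.7 - - [11/Jun/2026:10:01:00 +0000] "GET /articles/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

hits = count_icc_hits(SAMPLE_LOG)
for path, count in sorted(hits.items()):
    print(path, count)
```

A user-agent string can be set by any client, so a match in the logs shows only that a request claimed to be ICC-Crawler.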

## What it's for

For a site owner, a visit from ICC-Crawler means your content may be included in datasets used for AI development and research. Allowing the crawler contributes your content to that research and commercial work, but you have little say in how the data is used downstream. Blocking it keeps your content from being collected on future crawls.

## How to handle ICC-Crawler

To prevent ICC-Crawler from accessing your site, add a disallow rule for the user-agent token ICC-Crawler in your robots.txt file. Because the crawler honors robots.txt, this will stop it from crawling any disallowed paths. If you wish to permit crawling, simply omit the rule or explicitly allow it.
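The choice between blocking specific paths and explicitly allowing the crawler can be sketched as a small helper that generates the robots.txt group. This is only an illustration: the function name `build_robots_group` and the `/drafts/` and `/members/` paths are hypothetical, and the output still has to be merged into your real robots.txt by hand.

```python
def build_robots_group(user_agent, disallow_paths):
    """Build one robots.txt group for user_agent.

    An empty disallow_paths list produces an explicit "Allow: /" group.
    """
    lines = [f"User-agent: {user_agent}"]
    if disallow_paths:
        lines.extend(f"Disallow: {path}" for path in disallow_paths)
    else:
        lines.append("Allow: /")
    return "\n".join(lines) + "\n"


# Hypothetical paths, chosen only for illustration.
partial_block = build_robots_group("ICC-Crawler", ["/drafts/", "/members/"])
explicit_allow = build_robots_group("ICC-Crawler", [])

print(partial_block)
print(explicit_allow)
```

Passing `["/"]` as the path list yields a site-wide block for ICC-Crawler.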

## robots.txt rule

User-agent: ICC-Crawler
Disallow: /
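
Before deploying a rule like this, it can help to run it through a robots.txt parser and confirm that it blocks ICC-Crawler without affecting other crawlers. The sketch below uses Python's standard `urllib.robotparser` module; the `is_allowed` helper, the `example.com` URL, and the `SomeOtherBot` name are illustrative assumptions, and the crawler's own parser may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The site-wide block for ICC-Crawler as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: ICC-Crawler
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and SomeOtherBot are placeholders, not real targets.
icc_allowed = is_allowed(ROBOTS_TXT, "ICC-Crawler", "https://example.com/articles/1")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/articles/1")

print(icc_allowed, other_allowed)  # False True
```

The rule applies only to the group that names ICC-Crawler, so crawlers with no matching group remain unrestricted.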

## Blocking cost

Blocking ICC-Crawler keeps your site's content out of the AI training datasets and research corpora built from its crawls. The trade-off is that your content cannot inform AI systems or tools built on that data, which may limit your visibility in them.

## Examples

- ICC-Crawler visits a news website and collects article text for inclusion in a multilingual language model training set.
- A research institution allows ICC-Crawler to gather open-access papers, which are then used to improve academic search and summarization tools.
- An e-commerce site blocks ICC-Crawler, so its product descriptions are not included in a dataset for training product recommendation algorithms.

## Related bots

- CCBot: Also tracked as a training crawler.
- Ai2Bot-Dolma: Also tracked as a training crawler.
- LAIONDownloader: Also tracked as a training crawler.
- AI2Bot: Also tracked as a training crawler.
- img2dataset: Also tracked as a training crawler.
- ClaudeBot: Also tracked as a training crawler.
- GPTBot: Also tracked as a training crawler.
- SBIntuitionsBot: Also tracked as a training crawler.
- VelenPublicWebCrawler: Also tracked as a training crawler.
- Robots.txt: Robots.txt is the control file used to allow or block ICC-Crawler.
- AI Training Opt-Out: ICC-Crawler is a training crawler tied to this policy decision.

## Frequently Asked Questions

### Who operates ICC-Crawler?

ICC-Crawler is operated by NICT, the National Institute of Information and Communications Technology in Japan.

### What is the purpose of ICC-Crawler?

It crawls the web to collect data for artificial intelligence technologies and for use in third-party research and commercial applications.

### Does ICC-Crawler respect robots.txt?

Yes, ICC-Crawler honors the Robots Exclusion Protocol, so it will follow any disallow rules set for its user-agent token.

### How can I block ICC-Crawler?

Add a 'User-agent: ICC-Crawler' line followed by 'Disallow: /' in your robots.txt file to prevent it from crawling your site.

### What happens if I block ICC-Crawler?

Blocking ICC-Crawler will stop it from collecting your site's content for AI datasets, which may exclude your data from research and commercial AI development.

## Data and sources

- [ICC-Crawler source reference](https://github.com/ai-robots-txt/ai.robots.txt/blob/main/table-of-bot-metrics.md) - Source used to verify ICC-Crawler.
