OpenAI has a web crawler, GPTBot, that crawls websites to gather content that may be used to train its Generative Pre-trained Transformer (GPT) models, which power ChatGPT.
If having OpenAI crawl your site’s content and use it to train ChatGPT without your consent doesn’t sit well with you, you can block the crawler via your site’s robots.txt file.
robots.txt instruction to block GPTBot from crawling your site
Within your robots.txt, add these instructions to block GPTBot from crawling your site’s content.
User-agent: GPTBot
Disallow: /
User-agent: GPTBot
indicates which bot the instructions are for. In this case, the bot is OpenAI’s crawler, GPTBot.
Disallow: /
tells the bot that all pages on the site are disallowed. In other words, stay out of this site.
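If you want to sanity-check the rule before uploading it, here is a minimal sketch using Python's standard-library urllib.robotparser. It only shows how one robots.txt parser reads the two lines above; it says nothing about how GPTBot itself behaves. The helper name is_allowed, the example.com URL, and the SomeOtherBot user agent are placeholders made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from above, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules.

    Hypothetical helper for illustration; it relies on Python's own parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and SomeOtherBot are placeholders, not real targets.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/about/")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/about/")

print(gptbot_allowed)  # False
print(other_allowed)   # True
```

Because the rule names only GPTBot, crawlers with a different user agent are unaffected by it.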
Where can I find my site’s robots.txt file?
In this section, I assume that you have access to your site’s files that are hosted on a server. If you don’t have access to your site’s files, then reach out to the webmaster or developer of your site.
The robots.txt file should be found at your site’s root.
So, let’s take my website which is located at
https://michaelsoolee.com
My site’s robots.txt file could then be found at
https://michaelsoolee.com/robots.txt
What that means is that a file called robots.txt should sit at the very top of your site’s directory, i.e., not inside a folder.
.
├── about
│ └── index.html
├── contact
│ └── index.html
├── index.html
└── robots.txt
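To make the "root of the site" idea concrete, here is a small sketch that works out the robots.txt address for any page on a site using Python's standard-library urllib.parse. The function name robots_txt_url is my own, made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that `page_url` belongs to.

    Hypothetical helper: keeps the scheme and host, and replaces the
    path with /robots.txt, dropping any query string or fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://michaelsoolee.com/about/"))
# https://michaelsoolee.com/robots.txt
```

No matter how deep the page is, the result always points at the top level of the site.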
I’ve added the directions to my site’s robots.txt file. Is my site safe from being used to train ChatGPT?
¯\_(ツ)_/¯
I caught wind of GPTBot on August 9, 2023, but who knows when it was actually released into the wild Internet.
For all I know, OpenAI has already crawled my site prior to my implementing the instructions in my robots.txt.
If OpenAI abides by these instructions, GPTBot should be prevented from crawling your content from the point at which you add the robots.txt instructions onward.