How to block ChatGPT from scraping your website

#webdev #tutorial #ai #chatgpt

Introduction

Web scraping and parsing have become common techniques for data collection, content analysis, and, most recently, training AI models. These practices can be legitimate and beneficial, but there are cases where you don't want a language model trained on your data, whether because of privacy, security, or unauthorized data extraction. In this article, we will discuss how to block ChatGPT (OpenAI), a popular AI language model, from scraping your website and using its content to train AI models.

Understanding ChatGPT

ChatGPT is an AI language model developed by OpenAI, known for its ability to generate human-like text based on input prompts. It is widely used for chatbots, content generation, and data analysis. ChatGPT can access and parse information from websites through plugins and through OpenAI's web crawler, so website owners who don't want AI models trained on their data need to block both.

Blocking ChatGPT from your website

Method 1. Robots.txt file

The robots.txt file is a standard method for communicating with web crawlers and bots, including ChatGPT. You can specify which parts of your website are off-limits for crawling by adding rules to this file. To block ChatGPT from your entire website, add the following lines to your robots.txt file:

User-agent: ChatGPT-User
User-agent: GPTBot
Disallow: /

Or, if you would like to block ChatGPT only from certain parts of your website, you can do the following:

User-agent: ChatGPT-User
User-agent: GPTBot
Allow: /can-be-scraped/
Disallow: /will-be-blocked

However, one major downside of the robots.txt method is that it cannot enforce its instructions; compliance is up to the crawler. OpenAI states that it will not crawl your website if you disallow its user-agent, and I believe this to be true, but you can never be sure.
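
Because robots.txt is only advisory, it is at least worth checking that the rules say what you intend before you publish them. The sketch below uses Python's standard `urllib.robotparser` to evaluate the two rule sets from this section. The helper name `can_crawl` and the `example.com` URLs are made up for illustration, and the standard-library parser may not match every crawler's interpretation of the rules exactly, so treat the results as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets shown earlier in this section.
BLOCK_EVERYTHING = """\
User-agent: ChatGPT-User
User-agent: GPTBot
Disallow: /
"""

BLOCK_SOME_PATHS = """\
User-agent: ChatGPT-User
User-agent: GPTBot
Allow: /can-be-scraped/
Disallow: /will-be-blocked
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules allow user_agent to fetch url.

    Hypothetical helper: it parses the rules from a string, so nothing
    is fetched over the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain.
    for agent in ("GPTBot", "ChatGPT-User", "SomeOtherBot"):
        for path in ("/can-be-scraped/page", "/will-be-blocked/page"):
            url = "https://example.com" + path
            allowed = can_crawl(BLOCK_SOME_PATHS, agent, url)
            print(f"{agent:14} {path:24} allowed={allowed}")
```

A crawler that is not named in the file, such as the hypothetical `SomeOtherBot` above, is not affected by these rules at all.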

Method 2. Web Application Firewall (WAF)

A web application firewall is a specific form of application firewall that filters, monitors, and blocks traffic to your website. For ChatGPT and its GPTBot web crawler, you have two options for blocking access to your website, and you can use both together if you want.

The first method is similar to the robots.txt method: blocking the user-agent. OpenAI's web crawler identifies itself with the user-agent "GPTBot", and ChatGPT plugins identify themselves with the user-agent "ChatGPT-User". By blocking the user-agents "GPTBot" and "ChatGPT-User", you block their access to your website. If you use Cloudflare, here is a guide to user-agent blocking: https://developers.cloudflare.com/waf/tools/user-agent-blocking/

The second method is blocking IP ranges. OpenAI has published the IP ranges it uses for GPTBot. You can find them here: https://openai.com/gptbot-ranges.txt. Those ranges apply only to OpenAI's web crawler; if you would also like to block the ChatGPT plugins, you should also block 23.98.142.176/28. By blocking these IP ranges, you can be confident ChatGPT will be unable to scrape your website, even if OpenAI decides to ignore robots.txt or change its user-agent in the future.
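
If you are not behind a WAF, the same two checks can be approximated in application code. The following is a minimal sketch, not a WAF configuration: every function name is hypothetical, the user-agent match is a simple case-insensitive substring test, and `parse_ranges` assumes the published ranges file contains one CIDR block per line, which you should verify against the actual file. Only the plugin range 23.98.142.176/28 mentioned above is hard-coded; the GPTBot ranges would need to be loaded from OpenAI's published list.

```python
import ipaddress

# User-agent tokens named in this article.
BLOCKED_AGENT_TOKENS = ("GPTBot", "ChatGPT-User")

# The ChatGPT plugin range quoted in this article. GPTBot ranges are
# not hard-coded here; load them with parse_ranges() from the
# published list.
PLUGIN_NETWORKS = [ipaddress.ip_network("23.98.142.176/28")]


def is_blocked_agent(user_agent: str) -> bool:
    """Case-insensitive substring match against the blocked tokens."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_AGENT_TOKENS)


def parse_ranges(text: str) -> list:
    """Parse CIDR blocks, assuming one per line (format is an assumption)."""
    networks = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            networks.append(ipaddress.ip_network(line, strict=False))
    return networks


def is_blocked_ip(remote_addr: str, networks) -> bool:
    """Return True if remote_addr falls inside any of the networks."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in network for network in networks)


def should_block(user_agent: str, remote_addr: str, networks=PLUGIN_NETWORKS) -> bool:
    """Block when either the user-agent or the source IP matches."""
    return is_blocked_agent(user_agent) or is_blocked_ip(remote_addr, networks)


if __name__ == "__main__":
    # Illustrative request values, not captured traffic.
    print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.7"))
    print(should_block("Mozilla/5.0 (X11; Linux x86_64)", "23.98.142.180"))
    print(should_block("Mozilla/5.0 (X11; Linux x86_64)", "203.0.113.7"))
```

Checking both signals mirrors the advice above: the IP check still applies if the user-agent header changes, while the user-agent check still applies if a request arrives from an address that is not yet in your copy of the list.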

Written by: Rico van Zelst

Thank you for reading! If you enjoyed this post and want to explore more of my work, visit my personal website at rico.sh. There, you'll find in-depth articles, tutorials, and resources on web development. Don't miss out. Let's continue this journey together!