Mert Yazıcıoğlu


Filtering Googlebot IPs from a list of IP addresses

Hey there dev.to community!

I recently needed a simple script to filter Googlebot IPs from a list of IP addresses to be able to extract actual Googlebot visits from an access log.

Thankfully, Google provides a method to make sure a visitor is actually Googlebot.

If you have a similar need, here you go:

#!/bin/bash
#
# Performs reverse and forward DNS lookups to list Googlebot's IPs, given a list
# of IP addresses as a file. Useful for filtering access logs to find out actual
# Googlebot visits.
#
# An implementation of https://support.google.com/webmasters/answer/80553?hl=en

while IFS='' read -r IP_ADDRESS || [[ -n "$IP_ADDRESS" ]];
do
    IS_GOOGLEBOT=0
    # Reverse lookup: the PTR name must end in .google.com. or .googlebot.com.
    REVERSE_LOOKUP="$(host "$IP_ADDRESS")"
    echo "$REVERSE_LOOKUP" | grep -qE '\.(google|googlebot)\.com\.$' && IS_GOOGLEBOT=1

    if [[ "$IS_GOOGLEBOT" -eq 1 ]]; then
        # Forward lookup: the PTR name must resolve back to the original IP.
        FORWARD_LOOKUP="$(host "$(echo "$REVERSE_LOOKUP" | cut -d " " -f 5)" | cut -d " " -f 4)"
        if [[ "$FORWARD_LOOKUP" = "$IP_ADDRESS" ]];
        then
            echo "$IP_ADDRESS"
        fi
    fi
done < "$1"

You can save it as something like filter-googlebot-ips.sh, make it executable, and pass it a file containing the IP addresses to filter (one per line) as its argument. Like so:

$ ./filter-googlebot-ips.sh access-log-ips.txt > googlebot-ips.txt

This will perform reverse and forward DNS lookups for each of the IP addresses and print out the verified Googlebot IPs to STDOUT, which you can write to a file like in the example above.
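
If you are starting from a raw access log rather than a ready-made list of IPs, a couple of small helpers could cover the steps before and after the script. This is only a rough sketch: the function names `extract_ips` and `filter_log_by_ips` are made up for illustration, and it assumes the client IP is the first whitespace-separated field of every log line (as in the common and combined log formats), which may not match your setup.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original script.
# Assumption: the client IP is the first whitespace-separated field
# of each access log line.

# Print the unique client IPs found in an access log, one per line.
extract_ips() {
    awk '{ print $1 }' "$1" | sort -u
}

# Print only the log lines whose client IP appears in the given IP list.
# Arguments: <file with one IP per line> <access log>
filter_log_by_ips() {
    local ip_list="$1" access_log="$2"
    # While reading the first file, remember each IP; for the second file,
    # print a line only when its first field is one of the remembered IPs.
    awk 'NR == FNR { ips[$1] = 1; next } $1 in ips' "$ip_list" "$access_log"
}
```

With these, you could run `extract_ips access.log > access-log-ips.txt` to build the input file, run the script as shown above, and then run `filter_log_by_ips googlebot-ips.txt access.log` to get the log lines that came from verified Googlebot IPs.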

Hope it helps someone out there! 🙌

PS: Here is a GitHub Gist if you prefer that.
