I love scraping data.
I can write a script that, in a few seconds, pulls data from a site, filters out all the HTML tags and JavaScript mumbo jumbo, and spits out the exact data I want in a beautiful, usable format (preferably JSON).
Without web scraping that would take me HOURS of copying and pasting.
One frustrating part about web scraping, though, is that site owners generally don't want you scraping their site, which is totally fair enough.
However, if you're still hell bent on web scraping, you can use what's known as a 'proxy' to hide your IP address.
This makes it much harder for websites to stop you scraping them.
A proxy works by tunnelling all your requests through a separate server.
To the site owner, it looks like the separate server is making the request, and it is. But that server then relays the response right back to you. Sneaky!
Today I'm going to show you how to use any commercial VPN (NordVPN, ExpressVPN etc) with the requests library in Python to level up your web scraping game.
First off, we're going to import the libraries we want to use. In this tutorial we're just going to use the requests library.
import requests
Using a proxy with the requests library follows this structure:
requests.get(url, proxies=proxy)
That's it. How damn easy is that!
So what is that 'proxy' object we passed into the get function?
The proxy object is a dictionary that maps each protocol (http, https, ftp etc) to a specific proxy in the following format:
proxy = {
    'http': "username:password@host",
    'https': "username:password@host"
}
Now we just need to fill in the blanks here. I'm using NordVPN but any popular VPN service will work (ExpressVPN, SurfShark etc).
Your username and password will be the same ones you use to log in to your VPN.
Notice that in the proxy string, the characters : and @ are used to separate the username, password and host. If your username or password contains these characters, the proxy string can't be parsed correctly and the proxy won't work.
For this reason we need to percent-encode (URL-encode) our username and password; more info can be found on that here. For reference, @ becomes %40 and : becomes %3A.
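The encoding doesn't have to be done by hand. Here is a minimal sketch using `quote` from Python's standard `urllib.parse` module; the credentials are made up for illustration, and `safe=""` is passed so that every reserved character gets escaped.

```python
from urllib.parse import quote

# Made-up example credentials, not real ones
username = "tom@gmail.com"
password = "pa:ss@word"

# safe="" escapes every reserved character, including "/", ":" and "@"
encoded_username = quote(username, safe="")
encoded_password = quote(password, safe="")

print(encoded_username)  # tom%40gmail.com
print(encoded_password)  # pa%3Ass%40word
```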
Now we just need to fill in the 'host' part of the string.
If you navigate to your VPN provider's website, there should be a section that lists all their servers. With NordVPN there's a 'servers' link on the homepage that gives you all the information you need.
Using the above information, the host we're going to use is au473.nordvpn.com.
So our full proxy object becomes:
proxy = {
    'http': "tom%40gmail.com:password123@au473.nordvpn.com",
    'https': "tom%40gmail.com:password123@au473.nordvpn.com"
}
These aren't my real login details, but you knew that.
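If you'd rather not assemble the string by hand, the encoding and formatting can be wrapped in one small function. This is only a sketch: `build_proxy` is a hypothetical helper name, not part of requests. It also adds an explicit `http://` scheme to the proxy URL, which the strings above leave out; that's my assumption, based on the requests documentation asking for proxy URLs to include the scheme.

```python
from urllib.parse import quote


def build_proxy(username, password, host):
    """Return a proxies dict for requests, with the credentials percent-encoded.

    build_proxy is a hypothetical helper name, not part of requests.
    """
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    # Assumption: an explicit http:// scheme, which the article's strings omit
    proxy_url = f"http://{user}:{pwd}@{host}"
    # The same proxy URL is used for both plain HTTP and HTTPS traffic
    return {"http": proxy_url, "https": proxy_url}


proxy = build_proxy("tom@gmail.com", "password123", "au473.nordvpn.com")
print(proxy["https"])  # http://tom%40gmail.com:password123@au473.nordvpn.com
```

The returned dictionary can be passed straight to `requests.get(url, proxies=proxy)`.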
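Putting it all together we get:
import requests

# Map each protocol to the proxy URL (credentials are percent-encoded)
proxy = {
    'http': "tom%40gmail.com:password123@au473.nordvpn.com",
    'https': "tom%40gmail.com:password123@au473.nordvpn.com"
}

# Send the request through the proxy
response = requests.get('https://google.com', proxies=proxy)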
And that's it! Now every request you make with this proxy will LOOK like it's coming from NordVPN. Cool, huh!
We've managed to turn any VPN service into a proxy with a few short lines of code.
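It's worth checking that the proxy is actually being used. One way, sketched below, is to ask an IP echo service which address it sees, once through the proxy and once directly. The function names `current_ip` and `proxy_is_working` are hypothetical, and the sketch assumes httpbin.org/ip is reachable and answers with JSON containing an `origin` field.

```python
import requests


def current_ip(proxies=None, timeout=10):
    """Return the public IP address that httpbin.org sees for this request."""
    response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=timeout)
    response.raise_for_status()
    # httpbin.org/ip replies with JSON shaped like {"origin": "<ip address>"}
    return response.json()["origin"]


def proxy_is_working(proxies):
    """Return True if the IP seen through the proxy differs from the direct one."""
    try:
        return current_ip(proxies) != current_ip()
    except requests.exceptions.RequestException as error:
        # ProxyError, timeouts and connection failures are all RequestException subclasses
        print(f"Request failed: {error}")
        return False
```

Calling `proxy_is_working(proxy)` with your proxy dictionary returns False if the request fails or if both addresses match, which would suggest the traffic isn't going through the proxy at all.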
Hopefully you've learnt something new today :)
If you want to be EVEN more stealthy when web scraping, I'll be writing more articles here on the topic, so be sure to follow me to stay updated!
Top comments (13)
Hey Tom,
This doesn't work.
I'm getting "requests.exceptions.ProxyError: HTTPSConnectionPool(host='httpbin.org', port=443): Max retries exceeded with url: /ip (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))" as an error
Just an update on this: it turns out the "proxy technology" was causing the timeout error.
I've now connected successfully using the method discussed in the article.
Hey KBubu, how did you solve the problem please? I get the same error
Hey KBubu,
I get the same error. I've done everything as described in the article. It should work, but it doesn't. Could you please describe how you solved the problem?
Hey Matze0900, did you solve the problem? I get the same error..
Tom, you forgot to mention that the server needs to be proxy enabled. I think you got lucky: the first server you chose happened to be a proxy server as well, so you connected without problems. You first need to check whether it's a proxy server by clicking 'Show advanced options' on the Nord page and checking the proxy server option.
Now this brings a bigger problem: how do you find Nord's proxy servers? Their website only shows the top one, and since there aren't that many, it's not really useful.
I've followed every step of the thread but for some reason it doesn't let me connect. I do have @ in my password, which I have replaced with %40. It doesn't connect and I get "impossible to establish connection with the server".
Thanks Tom, I've been looking everywhere for this... I have a question: say you want to use 10 different VPNs, it wouldn't be so difficult to set up, right?
i.e connect to 1st vpn connection -> google.com
connect to 2nd vpn connection -> yahoo.com
etc etc
Hi, thanks for the great post.
So, if I understand correctly, every request will go to au473.nordvpn.com and then to the end domain of the request, but does it mean the IP changes per each request?
Thanks
Yeah, it works sometimes and sometimes it throws the error (mentioned by others), using the same account with the same server. Anyway, thanks :)
Hi Tom. Awesome article. I'm wondering if this is possible using a web automation tool like puppeteer.
I get response 200, but when I checked the IP address with requests.get('ifconfig.me', proxies=proxy).text
I still see the same IP
Don't publish this, lol! Nice work btw. I need a VPN that supports changing IP, like NordVPN, but that also has a list of IPs, like NordVPN's, from alternate sources.