Web scraping has always been Power Automate Desktop functionality, but did you know you can also do it with Power Automate Cloud?
Let me call out a few important caveats:
The Cloud approach doesn't work on all sites: it only works on static or server-generated HTML (e.g. PHP). Anything rendered on the client won't work; for that you need a headless browser or a full browser (that's where Power Automate Desktop comes in). Additionally, if the content is very dynamic (i.e. continually changing DOM IDs/names) it can be very difficult.
Another thing to be aware of is the legal limitations of web scraping as a whole; for example, scraping personal information is not allowed. Additionally, as sites don't want to be scraped, they play a few tricks to stop it (some even embed hidden personal data to make scraping them illegal).
With that out of the way, let me show you how.
As an example site I'm going to use https://www.gov.uk/vehicle-tax-rate-tables. I might want to extract the emissions table to update a cost forecast, links to navigate to other pages, when the page was last updated, or any other piece of information.
To get the web page we are going to use the HTTP connector, with minimum configuration (just the Method and URL).
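For reference, here's roughly what the action looks like if you use Peek code (a sketch assuming a simple GET; the JSON generated in your flow may differ slightly):

```json
{
  "type": "Http",
  "inputs": {
    "method": "GET",
    "uri": "https://www.gov.uk/vehicle-tax-rate-tables"
  }
}
```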
You should see the HTML content returned in the body.
If you get an error, or a JavaScript loader is returned instead, then this approach won't work, as the page requires a browser to run scripts before it renders; the example below is a Google search:
There are now 3 approaches we can use to extract the data from the page:
- Substring Expression (only for very simple pages/extracts)
- Script (more complex, but the data must have a repeated structure)
- GPT (most powerful but has challenges)
1. Substring Expression
If you can, this is the method to go with: it works consistently and is the easiest to create. It has one key requirement though: the data you are grabbing must have a consistent DOM element (e.g. <div id="data">).
So in the example, if we want the emissions table we are lucky, as there is only one table on the page, so we can extract from the DOM element <table>.
We then append another </table> to the result: because we use </table> as our end-of-data marker, the substring stops just before it, so it gets cut off.
Expression
substring(
  body('HTTP'),
  indexOf(body('HTTP'), '<table>'),
  sub(
    indexOf(body('HTTP'), '</table>'),
    indexOf(body('HTTP'), '<table>')
  )
)
Substring requires: text, start character number, and number of characters. So we pass the HTTP body, and use indexOf to find the character number of <table>. Then, to calculate the length to return, we use another indexOf to find </table>. This returns the number of characters from the start of the text though, not from the start of <table>, so we subtract the <table> character number from it.
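As a toy illustration (my own example, not the gov.uk page): in the text ab<table>X</table>cd, <table> starts at character 2 and </table> at character 10, so substring(text, 2, sub(10, 2)) returns <table>X, and appending </table> makes it valid HTML again.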
2. Script
Power Automate expressions are a little limited, so we can extend the flow's functionality with Office Scripts. We don't actually need the Excel file; we just use it as a placeholder to send and return data. For more about Office Scripts check out my previous blog - How to Master Power Automate Scripts.
With the power of JavaScript/TypeScript there are multiple ways to extract the data, but the most useful is a simple regex. Let's say in our example we want all the links (maybe to pass on to another web-scraping stage); we can pass the HTTP body to an Office Script, which then uses a regex to return all the links.
An empty Excel file is used as the placeholder for the script below.
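A minimal sketch of such a script, assuming the flow passes the HTTP body in as a string parameter (the href regex is one simple option, not the only way):

```typescript
function main(workbook: ExcelScript.Workbook, html: string): string[] {
  // Match the href value of every anchor tag in the raw HTML.
  const linkRegex = /href="([^"]+)"/g;
  const links: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = linkRegex.exec(html)) !== null) {
    links.push(match[1]); // capture group 1 holds the URL itself
  }
  // The workbook is only a placeholder; the links go straight back to the flow.
  return links;
}
```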
3. GPT
There will be certain times when the data or web page structure is too irregular, and when that happens we have an ace up our sleeve: the Create text with GPT action. I've done a full blog about what it can do here.
This connector can use natural language prompts to extract data from context data. But there is one big issue: there is a token/character limit on the context data, so we can't send the full webpage.
A token is approximately 4 characters, but token counts aren't easy to calculate on the fly, so I go with a safe 5000-character limit.
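If you want a rough in-flow estimate (my own approximation, using that 4-characters-per-token rule of thumb), an expression like this gives a ballpark token count:

div(length(body('HTTP')), 4)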
To work within this we need to combine GPT with either of the first two approaches to make the context data shorter. In the demo example I'm going to look for the contents list. I know it will always be near the top, so I start my context data at the title (which I know will always be the same) and pass the next 5000 characters.
Expression
substring(
  body('HTTP'),
  indexOf(body('HTTP'), 'Vehicle tax rates'),
  5000
)
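The prompt for Create text with GPT is then plain natural language; something along these lines (my own wording, not the exact prompt used, with [Substring output] standing in for the dynamic content):

From the following HTML, extract the contents list and return it as a bulleted list of link texts: [Substring output]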
And we get:
Below shows how simple all the approaches are (from an actions perspective):
You might also combine them in different ways, like using the GPT connector to convert the table, or a script to remove markup tags to decrease the token count; a sketch of the latter is below.
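A minimal sketch of such a tag-stripping script (same placeholder-workbook pattern as before; the regexes are a simple approximation that ignores edge cases):

```typescript
function main(workbook: ExcelScript.Workbook, html: string): string {
  // Drop <script> and <style> blocks first so their contents don't survive as text.
  const withoutBlocks = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "");
  // Strip every remaining tag and collapse the leftover whitespace.
  return withoutBlocks
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}
```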
As you can see, in certain circumstances it is possible to web scrape with a Cloud Flow, but just remember:
- Check the sites are legal to scrape
- Make sure the site is suitable
- Check your account has enough Power Platform API calls
- If using GPT, check you have enough AI Builder credits (it's free while in preview, but that won't last)
- And there is always Power Automate Desktop (nice blog on how here)