This post was originally published on my blog, jacklyons.me
Just recently I was asked to scrape a WordPress blog for a client so they could audit all of their posts. Naturally, my first thought was to just export all the posts. However, after a quick Google search I stumbled upon the WordPress REST API. Using the API, you can make direct requests to any WordPress site that exposes it and retrieve a list of blog posts as JSON.
Give it a try right now. Punch this into your browser and you should get a list of my 10 most recent blog posts:
https://jacklyons.me/wp-json/wp/v2/posts
It's that easy! Inside each post object there is a huge amount of data. You can extract things like post date, post status, and much more. The API documentation states that you can only retrieve a maximum of 100 posts per request. In this post I'll show you how to create a function that will get all your posts in a single go! This can be helpful when the site you're scraping has hundreds or thousands of posts.
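As a rough illustration of the two points above, here is a small sketch that builds a paged request URL and trims a post object down to a few fields. The helper names postsUrl and summarizePost are made up for this example, and it assumes WordPress is installed at the root of the site, so that the API lives at /wp-json/wp/v2/posts.

```javascript
// Build a posts URL with explicit paging parameters.
// per_page is capped at 100 by the API; page numbering starts at 1.
// Assumes WordPress is served from the site root (hypothetical helper).
function postsUrl(siteUrl, page = 1, perPage = 100) {
  const url = new URL('/wp-json/wp/v2/posts', siteUrl);
  url.searchParams.set('per_page', String(perPage));
  url.searchParams.set('page', String(page));
  return url.toString();
}

// Reduce a raw post object to a handful of fields an audit might need.
// Which fields matter is an assumption; the post object has many more.
function summarizePost(post) {
  return {
    id: post.id,
    date: post.date,
    status: post.status,
    link: post.link,
    title: post.title.rendered,
  };
}
```

For example, postsUrl('https://jacklyons.me', 2) points at the second batch of up to 100 posts.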
Below I've created a super simple HTML snippet that you can copy and paste into a basic HTML file. Note that I'm using some modern browser and ES2017 features, so you'll need a recent version of Chrome or Firefox. Also, it may take a little while if you are scraping a site with a few hundred or a few thousand posts.
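A minimal sketch of what such a function could look like, not necessarily the exact snippet from the original post. It assumes the site exposes the standard /wp-json/wp/v2/posts route and sends the x-wp-totalpages response header. The name getAllPosts and the fetchImpl parameter are made up for this example.

```javascript
// Fetch every post from a WordPress site, 100 at a time.
// siteUrl is assumed to have no trailing slash, e.g. 'https://jacklyons.me'.
// fetchImpl is a hypothetical hook so a fake fetch can be swapped in for testing.
async function getAllPosts(siteUrl, fetchImpl = fetch) {
  const endpoint = `${siteUrl}/wp-json/wp/v2/posts?per_page=100`;

  // The first response tells us how many pages exist in total.
  const first = await fetchImpl(`${endpoint}&page=1`);
  if (!first.ok) {
    throw new Error(`Request failed with status ${first.status}`);
  }
  // Fall back to a single page if the header is missing.
  const totalPages = Number(first.headers.get('x-wp-totalpages')) || 1;
  const posts = await first.json();

  // Request the remaining pages one after another.
  for (let page = 2; page <= totalPages; page++) {
    const res = await fetchImpl(`${endpoint}&page=${page}`);
    if (!res.ok) {
      throw new Error(`Page ${page} failed with status ${res.status}`);
    }
    posts.push(...(await res.json()));
  }
  return posts;
}
```

To try it in a page, put the function inside a script tag and call getAllPosts('https://jacklyons.me').then(posts => console.log(posts.length)). Fetching pages sequentially is slower than firing all requests at once, but it is gentler on the site being scraped.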
If you have any questions, comments or feedback to improve, please just leave a comment :)
Top comments (11)
To anyone stumbling across this post and trying the above code: make sure you give the posts enough time to be pulled. I kept thinking it wasn't working when it was actually still loading them all.
You can also get the x-wp-totalpages header by making a HEAD request for the posts URL (/wp-json/wp/v2/posts/). This will return all the headers for the request and none of the content. If you need the total number of posts, there's another header you can get: x-wp-total.
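A minimal sketch of that HEAD request, assuming the standard posts route and the two headers named above. The name getPostCounts and the fetchImpl parameter are made up for this example.

```javascript
// Read the post and page counts from the response headers only.
// siteUrl is assumed to have no trailing slash.
// fetchImpl is a hypothetical hook so a fake fetch can be swapped in for testing.
async function getPostCounts(siteUrl, fetchImpl = fetch) {
  const res = await fetchImpl(`${siteUrl}/wp-json/wp/v2/posts/`, { method: 'HEAD' });
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return {
    // A missing header comes back as null, which Number() turns into 0.
    totalPosts: Number(res.headers.get('x-wp-total')),
    totalPages: Number(res.headers.get('x-wp-totalpages')),
  };
}
```

Note that the page count depends on the page size of the request, so add a per_page query parameter to the URL if you plan to fetch in batches of 100.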
Nice, I've taken a similar approach on posts from multiple sources before. I would just like to point out that this is one of the use cases where the backend, assuming you control it (even though it's PHP, gross), is probably the better way to go, primarily because the response time will be much faster than doing multiple queries. You can do this with a filter on rest_[your-post-type]_collection_params.
Wow! This is great! Can I do this in React?
In componentDidMount(){}, I guess.
Regards
Sure if you wanna do this in react just pop it in a lifecycle hook. Let me know if you have any issues :)
Hey, thanks for the above code, but I have a question. When I send the GET request, the response that comes back has the required data. But when I look at the title and content of each post, they contain page builder shortcodes. Any idea how to remove them?
[vc_row type=”full_width_background” full_screen_row_position=”middle” scene_position=”center” text_color=”dark” text_align=”left” top_padding=”5%” bottom_padding=”13%” id=”team” overlay_strength=”0.3″ shape_divider_position=”bottom” bg_image_animation=”none” shape_type=””][vc_column column_padding=”no-extra-padding” column_padding_position=”all” background_color_opacity=”1″ background_hover_color_opacity=”1″ column_link_target=”_self” column_shadow=”none” column_border_radius=”none” width=”1/1″ tablet_width_inherit=”default” tablet_text_alignment=”default” phone_text_alignment=”default” column_border_width=”none” column_border_style=”solid” bg_image_animation=”none”][vc_column_text]
It looks like this.
Awesome Jack!
Quick question: How can I authenticate to access a secure site that I have a login and account for?
Not sure tbh ... but I'm sure some googling might provide an answer :) Otherwise you would need to make the content publicly accessible
I am trying to fetch data from my free WordPress blog but it's not working.
Can you share your code or the error message?
Wow, a wealth of data. Thank you!