This blog was originally posted to Crawlbase Blog
Houzz is a platform where homeowners, designers, and builders come together to find products, inspiration, and services. It’s one of the top online platforms for home renovation, interior design, and furniture shopping. With over 65 million unique users and 10 million product listings, Houzz is a treasure trove of data for businesses, developers, and researchers. The platform offers insights that can be used to build e-commerce applications, conduct market research, or analyze design trends.
In this blog, we’ll walk you through how to scrape Houzz search listings and product pages using Python. We’ll show you how to optimize your scraper using Crawlbase Smart Proxy so you can scrape smoothly and efficiently even from websites with anti-scraping measures.
Let’s get started!
Why Scrape Houzz Data?
Scraping Houzz data can be beneficial for a variety of reasons. With its large collection of home products, furniture, and decor, Houzz offers a lot of data that can help businesses and individuals make informed decisions. Here are some of the main reasons to scrape Houzz data:
- Market Research: If you’re in the home decor or furniture industry, you can analyze product trends, pricing strategies and customer preferences by scraping product details and customer reviews from Houzz.
- Competitor Analysis: For e-commerce businesses, scraping Houzz will give you competitor pricing, product availability and customer ratings so you can stay competitive.
- Product Data Aggregation: If you’re building a website or app that compares products across multiple platforms, scrape Houzz to include its massive product catalog in your data.
- Customer Sentiment Analysis: Collect reviews and ratings to analyze customer sentiment about specific products or brands. This can help brands improve their offerings and help buyers make better decisions.
- Data-Driven Decisions: Scrape Houzz to make informed decisions on what products to stock, how to price them and what customers are looking for.
Key Data Points to Extract from Houzz
When scraping from Houzz, you can focus on several key pieces of information. Here are the data points to extract from Houzz:
- Name: The product name.
- Price: The product price.
- Description: Full details on features and materials.
- Images: High-resolution images of the product.
- Ratings and Reviews: Customer feedback on the product.
- Specifications: Dimensions, materials, etc.
- Seller: Information on the seller or store.
- Company: Business name.
- Location: Business location.
- Phone: Business phone number.
- Website: Business website.
- Email: Business email (if on website).
Setting Up Your Python Environment
To get started scraping Houzz data you need to set up your Python environment. This involves installing Python, the necessary libraries and an Integrated Development Environment (IDE) to make coding easier.
Installing Python and Required Libraries
First, you need to install Python on your computer. You can download the latest version from python.org. After installing, open a terminal or command prompt and confirm that Python is available by typing:
python --version
Next, you’ll need to install the libraries for web scraping. The two main ones are requests for fetching web pages and BeautifulSoup for parsing the HTML. Install these by typing:
pip install requests beautifulsoup4
These libraries are essential for extracting data from Houzz's HTML structure and making the process smooth.
Choosing an IDE
An IDE makes writing and managing your Python code easier. Some popular options include:
- Visual Studio Code: A lightweight, free editor with great extensions for Python development.
- PyCharm: A dedicated Python IDE with many built-in features for debugging and code navigation.
- Jupyter Notebook: Great for interactive coding and seeing your results immediately.
Choose the IDE that suits you and your coding style. Once your environment is set up you’ll be ready to start building your Houzz scraper.
Scraping Houzz Search Listings
In this section, we will focus on scraping Houzz search listings, which display all the products on the site. We will cover how to find CSS selectors by inspecting the HTML, write a scraper to extract data, handle pagination, and store the data in a JSON file.
Inspecting the HTML Structure
First of all, you need to inspect the HTML of the Houzz page from which you want to scrape product listings. For example, to scrape bathroom vanities and sink consoles, use the URL:
https://www.houzz.com/products/bathroom-vanities-and-sink-consoles/best-sellers--best-sellers
Navigate to this URL in your browser and open the developer tools.
Here are some key selectors to focus on:
- Product Title: Found in an <a> tag with class hz-product-card__product-title, which contains the product name.
- Price: In a <span> tag with class hz-product-price, which displays the product price.
- Rating: In a <span> tag with class star-rating, which shows the product’s average rating (accessible via the aria-label attribute).
- Image URL: The product image is in an <img> tag, and you can get the URL from the src attribute.
- Product Link: Each product links to its detail page through an <a> tag, and the URL is available via the href attribute.
By looking at these selectors you can target the data you need for your scraper.
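As a quick illustration of how these selectors behave, here is a minimal sketch that runs them against a small hand-written HTML fragment. The markup is an assumption modeled on the class names listed above, not a copy of Houzz's real page, and extract_cards is a hypothetical helper name, so treat it only as a way to test selector logic offline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names described above.
SAMPLE_HTML = """
<div data-container="Product List">
  <div class="hz-product-card">
    <a class="hz-product-card__product-title" href="https://www.houzz.com/products/sample-vanity">Sample Vanity</a>
    <span class="hz-product-price">$100</span>
    <span class="star-rating" aria-label="Average rating: 4.5 out of 5 stars"></span>
    <img src="https://example.com/sample.jpg">
  </div>
</div>
"""

def extract_cards(html):
    """Apply the listing selectors to an HTML string and return one dict per card."""
    soup = BeautifulSoup(html, "html.parser")
    cards = []
    for item in soup.select('div[data-container="Product List"] > div.hz-product-card'):
        title = item.select_one("a.hz-product-card__product-title")
        price = item.select_one("span.hz-product-price")
        rating = item.select_one("span.star-rating")
        image = item.find("img")
        cards.append({
            "title": title.get_text(strip=True) if title else "N/A",
            "price": price.get_text(strip=True) if price else "N/A",
            "rating": rating.get("aria-label", "N/A") if rating else "N/A",
            "image_url": image.get("src", "N/A") if image else "N/A",
            "product_link": title.get("href", "N/A") if title else "N/A",
        })
    return cards

cards = extract_cards(SAMPLE_HTML)
print(cards)
```

Running the selectors on a saved or hand-written fragment like this makes it easier to notice when a class name no longer matches, without sending a request each time.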
Writing the Houzz Search Listings Scraper
Now that you know where the data is located, let's write the scraper. The following code uses the requests library to fetch the page and BeautifulSoup to parse the HTML.
import requests
from bs4 import BeautifulSoup

def scrape_houzz_search_listings(url):
    products = []
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        for item in soup.select('div[data-container="Product List"] > div.hz-product-card'):
            title = item.select_one('a.hz-product-card__product-title').text.strip() if item.select_one('a.hz-product-card__product-title') else 'N/A'
            price = item.select_one('span.hz-product-price').text.strip() if item.select_one('span.hz-product-price') else 'N/A'
            rating = item.select_one('span.star-rating')['aria-label'].replace('Average rating: ', '') if item.select_one('span.star-rating') else 'N/A'
            image_url = item.find('img')['src'] if item.find('img') else 'N/A'
            product_link = item.find('a')['href'] if item.find('a') else 'N/A'
            product_data = {
                'title': title,
                'price': price,
                'rating': rating,
                'image_url': image_url,
                'product_link': product_link,
            }
            products.append(product_data)
    else:
        print(f'Failed to retrieve the page: {response.status_code}')
    return products
Handling Pagination
To scrape multiple pages, we need to implement a separate function that will handle pagination logic. This function will check if there is a “next page” link and return the URL for that page. We can then loop through all the listings.
Here’s how you can write the pagination function:
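def get_next_page_url(soup):
    # Same selector as the complete example; the href is treated as site-relative
    next_button = soup.find('a', class_='hz-pagination-link--next')
    return 'https://www.houzz.com' + next_button['href'] if next_button else None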
We will call this function in our main scraping function to continue fetching products from all available pages.
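One way to make the pagination step more robust is to resolve the next-page link against the current page URL instead of assuming its form. The sketch below uses urljoin from Python's standard library. The function name resolve_next_url and the sample hrefs (including the /p/36 path) are hypothetical values chosen for illustration, not confirmed Houzz URL patterns.

```python
from urllib.parse import urljoin

def resolve_next_url(current_url, href):
    """Return an absolute URL for the next page, or None when there is no link."""
    if not href:
        return None
    # urljoin leaves absolute hrefs untouched and resolves relative ones
    return urljoin(current_url, href)

current = "https://www.houzz.com/products/bathroom-vanities-and-sink-consoles"

# Hypothetical hrefs: one site-relative, one already absolute, one missing
relative_result = resolve_next_url(current, "/products/bathroom-vanities-and-sink-consoles/p/36")
absolute_result = resolve_next_url(current, "https://www.houzz.com/products/other/p/72")
missing_result = resolve_next_url(current, None)

print(relative_result)
print(absolute_result)
print(missing_result)
```

With this approach the scraper keeps working whether the site emits relative or absolute pagination links.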
Storing Data in a JSON File
Next, we'll create a function to save the scraped data into a JSON file. This function can be called after retrieving the listings.
import json

def save_to_json(data, filename='houzz_products.json'):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f'Data saved to {filename} successfully!')
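Product titles often contain characters such as inch marks or accented letters, so it can help to write the file with an explicit encoding. The following is a minimal sketch, assuming UTF-8 output suits your use case. The helpers save_to_json_utf8 and load_from_json, and the sample record, are hypothetical and not part of the original scraper.

```python
import json
import os
import tempfile

def save_to_json_utf8(data, filename):
    # ensure_ascii=False keeps non-ASCII characters readable in the file
    with open(filename, "w", encoding="utf-8") as json_file:
        json.dump(data, json_file, indent=4, ensure_ascii=False)

def load_from_json(filename):
    with open(filename, "r", encoding="utf-8") as json_file:
        return json.load(json_file)

# Made-up record used only to demonstrate the round trip
sample = [{"title": 'Café Vanity, 30"', "price": "$100"}]
path = os.path.join(tempfile.gettempdir(), "houzz_products_sample.json")

save_to_json_utf8(sample, path)
restored = load_from_json(path)
print(restored == sample)
```

Reading the file back right after writing it is a cheap way to confirm that nothing was lost in serialization.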
Complete Code Example
Now, let’s combine everything, including pagination, into a complete code snippet.
import requests
from bs4 import BeautifulSoup
import json

def scrape_houzz_search_listings(url):
    products = []
    while url:
        print(f'Scraping {url}')
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            for item in soup.select('div[data-container="Product List"] > div.hz-product-card'):
                title = item.select_one('a.hz-product-card__product-title').text.strip() if item.select_one('a.hz-product-card__product-title') else 'N/A'
                price = item.select_one('span.hz-product-price').text.strip() if item.select_one('span.hz-product-price') else 'N/A'
                rating = item.select_one('span.star-rating')['aria-label'].replace('Average rating: ', '') if item.select_one('span.star-rating') else 'N/A'
                image_url = item.find('img')['src'] if item.find('img') else 'N/A'
                product_link = item.find('a')['href'] if item.find('a') else 'N/A'
                product_data = {
                    'title': title,
                    'price': price,
                    'rating': rating,
                    'image_url': image_url,
                    'product_link': product_link,
                }
                products.append(product_data)
            # Handle pagination
            url = get_next_page_url(soup)
        else:
            print(f'Failed to retrieve the page: {response.status_code}')
            break
    return products

def get_next_page_url(soup):
    next_button = soup.find('a', class_='hz-pagination-link--next')
    return 'https://www.houzz.com' + next_button['href'] if next_button else None

def save_to_json(data, filename='houzz_products.json'):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f'Data saved to {filename} successfully!')

# Main function to run the scraper
if __name__ == '__main__':
    start_url = 'https://www.houzz.com/products/bathroom-vanities-and-sink-consoles/best-sellers--best-sellers'
    listings = scrape_houzz_search_listings(start_url)
    save_to_json(listings)
This complete scraper will extract product listings from Houzz, handling pagination smoothly.
Example Output:
[
{
"title": "The Sequoia Bathroom Vanity, Acacia, 30\", Single Sink, Freestanding",
"price": "$948",
"rating": "4.9 out of 5 stars",
"image_url": "https://st.hzcdn.com/fimgs/abd13d5d04765ce7_1626-w458-h458-b1-p0--.jpg",
"product_link": "https://www.houzz.com/products/the-sequoia-bathroom-vanity-acacia-30-single-sink-freestanding-prvw-vr~170329010"
},
{
"title": "Bosque Bath Vanity, Driftwood, 42\", Single Sink, Undermount, Freestanding",
"price": "$1,249",
"rating": "4.699999999999999 out of 5 stars",
"image_url": "https://st.hzcdn.com/fimgs/4b81420b03f91a0a_3904-w458-h458-b1-p0--.jpg",
"product_link": "https://www.houzz.com/products/bosque-bath-vanity-driftwood-42-single-sink-undermount-freestanding-prvw-vr~107752516"
},
{
"title": "Render Bathroom Vanity, Oak White",
"price": "$295",
"rating": "4.5 out of 5 stars",
"image_url": "https://st.hzcdn.com/fimgs/4b31b0e601395a74_7516-w458-h458-b1-p0--.jpg",
"product_link": "https://www.houzz.com/products/render-bathroom-vanity-oak-white-prvw-vr~176775440"
},
{
"title": "The Wailea Bathroom Vanity, Single Sink, 42\", Weathered Fir, Freestanding",
"price": "$1,354",
"rating": "4.9 out of 5 stars",
"image_url": "https://st.hzcdn.com/fimgs/81e1d4ca045d1069_1635-w458-h458-b1-p0--.jpg",
"product_link": "https://www.houzz.com/products/the-wailea-bathroom-vanity-single-sink-42-weathered-fir-freestanding-prvw-vr~188522678"
},
.... more
]
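The scraped price and rating values are strings, which is awkward for sorting or analysis. The sketch below shows one possible way to convert them to numbers, using the formats seen in the example output. The helpers parse_price and parse_rating are hypothetical names, and the patterns assume prices and ratings look like the samples above.

```python
import re

def parse_price(price_text):
    """Convert a string like '$1,249' to a float, or None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text)
    return float(match.group().replace(",", "")) if match else None

def parse_rating(rating_text):
    """Convert a string like '4.5 out of 5 stars' to a float rounded to one decimal."""
    match = re.search(r"\d+(?:\.\d+)?", rating_text)
    return round(float(match.group()), 1) if match else None

price_value = parse_price("$1,249")
rating_value = parse_rating("4.699999999999999 out of 5 stars")
missing_price = parse_price("N/A")

print(price_value, rating_value, missing_price)
```

Rounding the rating also cleans up floating-point artifacts such as 4.699999999999999, and the 'N/A' placeholder maps to None so missing values are easy to filter.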
Next, we will explore how to scrape individual product pages for more detailed information.
Scraping Houzz Product Pages
After scraping the search listings, the next step is to gather more detail from individual product pages, including specs and extra images. In this section, we will look at the HTML of a product page, write a scraper to extract the data, and then store that data in a JSON file.
Inspecting the HTML Structure
To scrape product pages, you first need to look at the HTML structure of a specific product page.
https://www.houzz.com/products/the-sequoia-bathroom-vanity-acacia-30-single-sink-freestanding-prvw-vr~170329010
Navigate to this URL in your browser and open the developer tools.
Here are some key selectors to focus on:
- Product Title: Within a span with class view-product-title.
- Price: Within a span with class pricing-info__price.
- Description: Within a div with class vp-redesign-description.
- Images: Additional images within img tags inside div.alt-images__thumb.
Knowing this is key to writing your scraper.
Writing the Houzz Product Page Scraper
Now that we know where to find the data, we can create a function to scrape the product page. Here’s how you can write the code to extract the necessary details:
import requests
from bs4 import BeautifulSoup

def scrape_houzz_product_page(url):
    response = requests.get(url)
    product_data = {}
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.select_one('span.view-product-title').text.strip() if soup.select_one('span.view-product-title') else 'N/A'
        price = soup.select_one('span.pricing-info__price').text.strip() if soup.select_one('span.pricing-info__price') else 'N/A'
        description = soup.select_one('div.vp-redesign-description').text.strip() if soup.select_one('div.vp-redesign-description') else 'N/A'
        image_urls = [img['src'] for img in soup.select('div.alt-images__thumb > img')] if soup.select('div.alt-images__thumb > img') else 'N/A'
        product_data = {
            'title': title,
            'price': price,
            'description': description,
            'image_urls': image_urls,
            'product_link': url
        }
    else:
        print(f'Failed to retrieve the product page: {response.status_code}')
    return product_data
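The scraper above calls select_one twice for each field. A small helper can tidy this up. The sketch below is one possible refactor rather than the original author's code: safe_text is a hypothetical name, and the HTML fragment is made-up test data based on the class names listed earlier.

```python
from bs4 import BeautifulSoup

def safe_text(node, selector, default="N/A"):
    """Return the stripped text of the first match, or a default when absent."""
    found = node.select_one(selector)
    return found.get_text(strip=True) if found else default

# Hypothetical fragment for offline testing; not Houzz's real markup.
html = """
<span class="view-product-title"> Sample Vanity </span>
<span class="pricing-info__price">$100</span>
"""

soup = BeautifulSoup(html, "html.parser")
title = safe_text(soup, "span.view-product-title")
price = safe_text(soup, "span.pricing-info__price")
description = safe_text(soup, "div.vp-redesign-description")  # absent in the fragment

print(title, price, description)
```

Each selector is evaluated once, and the fallback value lives in a single place if you later prefer None over 'N/A'.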
Storing Data in a JSON File
Just like the search listings, we can save the data we scrape from the product pages into a JSON file for easy access and analysis. Here’s a function that takes the product data and saves it in a JSON file:
import json

def save_product_to_json(product_data, filename='houzz_product.json'):
    with open(filename, 'w') as json_file:
        json.dump(product_data, json_file, indent=4)
    print(f'Product data saved to {filename} successfully!')
Complete Code Example
To combine everything we've discussed, here's a complete code example that includes both scraping individual product pages and saving that data to a JSON file:
import requests
from bs4 import BeautifulSoup
import json

def scrape_houzz_product_page(url):
    response = requests.get(url)
    product_data = {}
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.select_one('span.view-product-title').text.strip() if soup.select_one('span.view-product-title') else 'N/A'
        price = soup.select_one('span.pricing-info__price').text.strip() if soup.select_one('span.pricing-info__price') else 'N/A'
        description = soup.select_one('div.vp-redesign-description').text.strip() if soup.select_one('div.vp-redesign-description') else 'N/A'
        image_urls = [img['src'] for img in soup.select('div.alt-images__thumb > img')] if soup.select('div.alt-images__thumb > img') else 'N/A'
        product_data = {
            'title': title,
            'price': price,
            'description': description,
            'image_urls': image_urls,
            'product_link': url
        }
    else:
        print(f'Failed to retrieve the product page: {response.status_code}')
    return product_data

def save_product_to_json(product_data, filename='houzz_product.json'):
    with open(filename, 'w') as json_file:
        json.dump(product_data, json_file, indent=4)
    print(f'Product data saved to {filename} successfully!')

# Main function to run the product page scraper
if __name__ == '__main__':
    # Same product page inspected above, so the output matches the example below
    product_url = 'https://www.houzz.com/products/the-sequoia-bathroom-vanity-acacia-30-single-sink-freestanding-prvw-vr~170329010'
    product_details = scrape_houzz_product_page(product_url)
    save_product_to_json(product_details)
This code will scrape detailed information from a single Houzz product page and save it to a JSON file.
Example Output:
{
"title": "The Sequoia Bathroom Vanity, Acacia, 30\", Single Sink, Freestanding",
"price": "$948",
"description": "The 30\" Sequoia single sink bathroom vanity will be the centerpiece of your bathroom remodel. Skillfully constructed of 100% solid fir wood to last a lifetime. Wood is skillfully finished with raised grain to give a distressed and reclaim wood look. One solid wood dovetail drawer with full extension glides gives you all the necessary storage room for your daily toiletries, coupled with a quartz countertop.Solid fir wood constructionBeautiful chevron front door designSolid wood dovetail drawers boxSoft closing drawer with full extension glidesWood finished to prevent warping, cracking and withstand bathroom humidity levelsWhite quartz countertopAssembled dimensions: 30 in. W x 22 in. D x 34.50 in. HBlack hardwarePre drilled for 8 inch widespread faucetFinished in Weathered Fir - rustic and reclaim wood look.",
"image_urls": [
"https://st.hzcdn.com/fimgs/abd13d5d04765ce7_1626-w100-h100-b0-p0--.jpg",
"https://st.hzcdn.com/fimgs/9c617c9c04765ce8_1626-w100-h100-b0-p0--.jpg",
"https://st.hzcdn.com/fimgs/7af1287304765cea_1626-w100-h100-b0-p0--.jpg",
"https://st.hzcdn.com/fimgs/a651c05404765ced_1626-w100-h100-b0-p0--.jpg",
.... more
],
"product_link": "https://www.houzz.com/products/the-sequoia-bathroom-vanity-acacia-30-single-sink-freestanding-prvw-vr~170329010"
}
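To connect the two scrapers, the product links collected from the search listings can be fed into the product page scraper one at a time. The sketch below is a minimal illustration, assuming a fixed pause between requests is acceptable. The function scrape_products, the delay value, and the stand-in fake_scraper are hypothetical and exist only so the example runs without network access.

```python
import time

def scrape_products(listings, scrape_fn, delay_seconds=1.0):
    """Call scrape_fn for each listing's product_link, pausing between requests."""
    results = []
    requests_made = 0
    for listing in listings:
        link = listing.get("product_link")
        if not link or link == "N/A":
            continue  # skip listings without a usable link
        if requests_made:
            time.sleep(delay_seconds)  # simple pause between consecutive requests
        details = scrape_fn(link)
        requests_made += 1
        if details:  # an empty dict signals a failed page
            results.append(details)
    return results

# Stand-in for scrape_houzz_product_page so the sketch runs offline.
def fake_scraper(url):
    return {"title": "Sample", "product_link": url}

listings = [
    {"product_link": "https://www.houzz.com/products/sample-a"},
    {"product_link": "N/A"},
    {"product_link": "https://www.houzz.com/products/sample-b"},
]

details = scrape_products(listings, fake_scraper, delay_seconds=0)
print(len(details))
```

In a real run you would pass scrape_houzz_product_page as scrape_fn and keep a non-zero delay so requests are spaced out.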
In the next section, we will discuss how to optimize your scraping process with Crawlbase Smart Proxy.
Optimizing with Crawlbase Smart Proxy
When scraping sites like Houzz, IP blocks and CAPTCHAs can slow you down. Crawlbase Smart Proxy helps bypass these issues by rotating IPs and handling CAPTCHAs automatically. This allows you to scrape data without interruptions.
Why Use Crawlbase Smart Proxy?
- IP Rotation: Avoid IP bans by using a pool of thousands of rotating proxies.
- CAPTCHA Handling: Crawlbase automatically bypasses CAPTCHAs, so you don’t have to solve them manually.
- Increased Efficiency: Scrape data faster by making requests without interruptions from rate limits or blocks.
- Global Coverage: You can scrape data from any location by selecting proxies from different regions worldwide.
How to Add It to Your Scraper?
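To integrate Crawlbase Smart Proxy, keep the target URL unchanged and route the request through the proxy using the proxies argument of requests.get:
import requests

# Replace _USER_TOKEN_ with your Crawlbase Token
# You can get one by creating an account on Crawlbase
CRAWLBASE_PROXY_URL = 'http://_USER_TOKEN_@smartproxy.crawlbase.com:8012'
PROXIES = {'http': CRAWLBASE_PROXY_URL, 'https': CRAWLBASE_PROXY_URL}

def scrape_houzz_product_page(url):
    # Request the Houzz URL itself; the proxy is passed separately via proxies.
    # verify=False follows Crawlbase's Smart Proxy examples; confirm against their current docs.
    response = requests.get(url, proxies=PROXIES, verify=False)
    # Scraper code as before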
This will ensure your scraper can run smoothly and efficiently while scraping Houzz.
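Even with a proxy, individual requests can fail transiently. As an optional addition that is not part of Crawlbase's own instructions, the sketch below builds a requests.Session that applies proxy settings once and retries selected status codes with backoff. The function build_session, the retry count, and the status list are assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(proxy_url=None, total_retries=3):
    """Create a requests.Session that retries transient errors with backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=1,  # waits grow between successive attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    if proxy_url:
        # Apply the same proxy to both schemes for every request on this session
        session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session

# _USER_TOKEN_ is the placeholder from the article; replace it with your own token.
session = build_session("http://_USER_TOKEN_@smartproxy.crawlbase.com:8012")
print(session.proxies)
```

You could then call session.get(url) in place of requests.get(url) inside the scraper functions; no request is sent until you do.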
Optimize Houzz Scraper with Crawlbase
Houzz provides valuable insights for your projects. You can explore home improvement trends and analyze market prices. By following the steps in this blog, you can easily gather important information like product details, prices, and customer reviews.
Using Python libraries like Requests and BeautifulSoup simplifies the scraping process. Plus, using Crawlbase Smart Proxy helps you access the data you need without facing issues like IP bans or CAPTCHAs.
If you're interested in exploring scraping from other e-commerce platforms, feel free to explore the following comprehensive guides.
📜 How to Scrape Amazon
📜 How to scrape Walmart
📜 How to Scrape AliExpress
📜 How to Scrape Zalando
📜 How to Scrape Costco
If you have any questions or feedback, our support team is always available to assist you. Good luck with your scraping journey!
Frequently Asked Questions
Q. Is it legal to scrape product data from Houzz?
Yes, scraping product data from Houzz is allowed as long as you follow their terms of service. Make sure to read Houzz’s TOS and respect their robots.txt file so you scrape responsibly and ethically.
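Checking robots.txt can be automated with Python's standard library. The sketch below parses a set of rules and tests URLs against them. The rules, the user agent name my-scraper, and the helper is_allowed are made up for illustration and do not reflect Houzz's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration; load the site's real file before scraping.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

allowed = is_allowed(sample_rules, "my-scraper", "https://www.houzz.com/products/sample")
blocked = is_allowed(sample_rules, "my-scraper", "https://www.houzz.com/private/page")
print(allowed, blocked)
```

To check a live site, RobotFileParser can also load the file directly with set_url and read before you call can_fetch.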
Q. Why should I use a proxy like Crawlbase Smart Proxy for scraping Houzz?
Using a proxy like Crawlbase Smart Proxy helps prevent IP bans, which can happen if you make too many requests to a website in a short span of time. Proxies can also help with CAPTCHA challenges and geographic restrictions so you can scrape data from Houzz or any other website smoothly.
Q. Can I scrape both product listings and product details from Houzz?
Yes, you can scrape both. In this blog, we’ve demonstrated how to extract essential information from Houzz’s search listings and individual product pages. By following similar steps, you can extend your scraper to gather various data points, such as pricing, reviews, specifications, and even business contact details.