This blog was originally posted on the Crawlbase Blog.
Healthline.com is one of the top health and wellness websites, offering detailed articles, tips, and insights from experts. From article lists to in-depth guides, it has content for many uses. Whether you’re researching, building a health database, or analyzing wellness trends, scraping data from Healthline can be super useful.
But scraping a dynamic website like healthline.com is not easy. The site uses JavaScript to render its pages, so traditional web scraping methods won’t work. That’s where the Crawlbase Crawling API comes in: it handles JavaScript-rendered content seamlessly, making the whole scraping process much easier.
In this blog, we will cover why you might want to scrape Healthline.com, the key data points to target, how to scrape it using the Crawlbase Crawling API with Python, and how to store the scraped data in a CSV file. Let’s get started!
Scraping Healthline.com Articles Listings
To scrape article listings from healthline.com, we’ll use the Crawlbase Crawling API for dynamic JavaScript rendering. Let’s break this down step by step, with professional yet easy-to-understand code examples.
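Before diving in, make sure the required libraries are installed. Assuming you’re using the Crawlbase Python SDK along with BeautifulSoup and pandas (the packages imported in the examples below), a typical setup looks like this:

pip install crawlbase beautifulsoup4 pandas

You’ll also need a Crawlbase account to get your JavaScript (JS) token, which is the one to use for sites that render content client-side.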
1. Inspecting the HTML Structure
Before writing code, open healthline.com and navigate to an article listing page. Use the browser’s developer tools (usually opened by pressing F12) to inspect the HTML structure.
Example of an article link structure:
<div class="css-1hm2gwy">
  <div>
    <a
      class="css-17zb9f8"
      data-event="|Global Header|Search Result Click"
      data-element-event="INTERNAL LINK|SECTION|Any Page|SEARCH RESULTS|LINK|/health-news/antacids-increase-migraine-risk|"
      href="https://www.healthline.com/health-news/antacids-increase-migraine-risk"
    >
      <span class="ais-Highlight">
        <span class="ais-Highlight-nonHighlighted">Antacids Associated with Higher Risk of </span>
        <em class="ais-Highlight-highlighted">Migraine</em>
        <span class="ais-Highlight-nonHighlighted">, Severe Headaches</span>
      </span>
    </a>
  </div>
  <div class="css-1evntxy">
    <span class="ais-Highlight">
      <span class="ais-Highlight-nonHighlighted">New research suggests that people who take antacids may be at greater risk for </span>
      <em class="ais-Highlight-highlighted">migraine</em>
      <span class="ais-Highlight-nonHighlighted"> attacks and severe headaches.</span>
    </span>
  </div>
</div>
Identify elements such as:

- Article titles: found in an `a` tag with class `css-17zb9f8`.
- Links: found in the `href` attribute of the same `a` tag.
- Description: found in a `div` element with class `css-1evntxy`.
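To confirm these selectors behave as expected, you can run them against the sample markup before touching the live site. A minimal check (using a trimmed version of the HTML above):

from bs4 import BeautifulSoup

# Trimmed version of the sample markup above, just enough to test the selectors
sample_html = '''
<div class="css-1hm2gwy">
  <div>
    <a class="css-17zb9f8" href="https://www.healthline.com/health-news/antacids-increase-migraine-risk">
      Antacids Associated with Higher Risk of Migraine, Severe Headaches
    </a>
  </div>
  <div class="css-1evntxy">New research suggests that people who take antacids may be at greater risk.</div>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
link = soup.find('a', class_='css-17zb9f8')
print(link.text.strip())                                    # article title
print(link['href'])                                         # article URL
print(soup.find('div', class_='css-1evntxy').text.strip())  # description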
2. Writing the Healthline.com Listing Scraper
We’ll use the Crawlbase Crawling API to fetch the page content and BeautifulSoup to parse it. To handle JavaScript-rendered content, we’ll pass the `ajax_wait` and `page_wait` parameters provided by the Crawling API. You can read more about these parameters here.
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Initialize the Crawlbase Crawling API with your JS token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

# Wait for AJAX calls and give the page 5 seconds to render
options = {
    'ajax_wait': 'true',
    'page_wait': '5000'
}

def scrape_article_listings(url):
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        articles = []
        # Each result link carries the class identified during inspection
        for item in soup.find_all('a', class_='css-17zb9f8'):
            article_title = item.text.strip()
            # urljoin handles both relative and absolute hrefs
            article_url = urljoin("https://www.healthline.com", item['href'])
            articles.append({'title': article_title, 'url': article_url})
        return articles
    else:
        print(f"Failed to fetch the page: {response['headers']['pc_status']}")
        return []

# Example usage
url = "https://www.healthline.com/search?q1=migraine"
article_listings = scrape_article_listings(url)
print(article_listings)
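Dynamic pages can occasionally time out or return a non-200 status. A minimal retry wrapper (a sketch, reusing the `crawling_api` and `options` objects defined above) makes the scraper more resilient:

import time

def fetch_with_retries(url, retries=3, delay=5):
    # Retry transient failures a few times before giving up
    for attempt in range(retries):
        response = crawling_api.get(url, options)
        if response['headers']['pc_status'] == '200':
            return response
        print(f"Attempt {attempt + 1} failed, retrying in {delay}s...")
        time.sleep(delay)
    return None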
3. Storing Data in a CSV File
You can use the `pandas` library to save the scraped data into a CSV file for easy access.
import pandas as pd
def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")
4. Complete Code
Combining everything, here’s the full scraper:
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

# Initialize the Crawlbase Crawling API
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

# Wait for AJAX calls and give the page 5 seconds to render
options = {
    'ajax_wait': 'true',
    'page_wait': '5000'
}

def scrape_article_listings(url):
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        articles = []
        for item in soup.find_all('a', class_='css-17zb9f8'):
            article_title = item.text.strip()
            article_url = urljoin("https://www.healthline.com", item['href'])
            articles.append({'title': article_title, 'url': article_url})
        return articles
    else:
        print(f"Failed to fetch the page: {response['headers']['pc_status']}")
        return []

def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

# Example usage
url = "https://www.healthline.com/search?q1=migraine"
articles = scrape_article_listings(url)
save_to_csv(articles, 'healthline_articles.csv')
Snapshot of `healthline_articles.csv`:
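Search results on healthline.com span multiple pages. The exact pagination mechanism is something to confirm in your browser’s developer tools, but assuming the search URL accepts a hypothetical `page` query parameter, a loop over several pages could look like this:

# Hypothetical pagination loop -- confirm the real page parameter in dev tools first
all_articles = []
for page in range(1, 4):
    page_url = f"https://www.healthline.com/search?q1=migraine&page={page}"  # assumed URL format
    all_articles.extend(scrape_article_listings(page_url))

save_to_csv(all_articles, 'healthline_articles.csv')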
Scraping Healthline.com Article Page
After collecting the listing of articles, the next step is to scrape details from individual article pages. Each article page typically contains detailed content, such as the title, publication date, and main body text. Here’s how to extract this data efficiently using the Crawlbase Crawling API and Python.
1. Inspecting the HTML Structure
Open an article page from healthline.com in your browser and inspect the page source using developer tools (F12).
Look for:
- Title: found in an `<h1>` tag with class `css-6jxmuv`.
- Byline: found in a `div` with the attribute `data-testid="byline"`.
- Body content: found in an `article` tag with class `article-body`.
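Class names like `css-6jxmuv` are auto-generated and can change between site deployments, so it’s worth guarding against missing elements. A small defensive helper (a sketch; `tag` is whatever `soup.find(...)` returns) keeps the scraper from crashing on a `None` result:

def safe_text(tag):
    # Return stripped text, or an empty string if the element wasn't found
    return tag.text.strip() if tag else ''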
2. Writing the Healthline.com Article Scraper
We’ll fetch the article’s HTML using the Crawlbase Crawling API and extract the desired information using BeautifulSoup.
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

# Initialize the Crawlbase Crawling API with your JS token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

options = {
    'ajax_wait': 'true',
    'page_wait': '5000'
}

def scrape_article_page(url):
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract details using the elements identified during inspection
        title_tag = soup.find('h1', class_='css-6jxmuv')
        title = title_tag.text.strip() if title_tag else ''

        byline_tag = soup.find('div', attrs={'data-testid': 'byline'})
        byline = byline_tag.text.strip() if byline_tag else ''

        # Restrict paragraphs to the article body when it's present
        body = soup.find('article', class_='article-body')
        paragraphs = body.find_all('p') if body else soup.find_all('p')
        content = ' '.join(p.text.strip() for p in paragraphs)

        return {
            'url': url,
            'title': title,
            'byline': byline,
            'content': content
        }
    else:
        print(f"Failed to fetch the page: {response['headers']['pc_status']}")
        return None

# Example usage
article_url = "https://www.healthline.com/health-news/antacids-increase-migraine-risk"
article_details = scrape_article_page(article_url)
print(article_details)
3. Storing Data in a CSV File
After scraping multiple article pages, save the extracted data into a CSV file using `pandas`.
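The same helper pattern from the listing section works here; the complete code below uses this function:

import pandas as pd

def save_article_data_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")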
4. Complete Code
Here’s the combined code for scraping multiple articles and saving them to a CSV file:
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import pandas as pd

# Initialize the Crawlbase Crawling API
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

options = {
    'ajax_wait': 'true',
    'page_wait': '5000'
}

def scrape_article_page(url):
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract details using the elements identified during inspection
        title_tag = soup.find('h1', class_='css-6jxmuv')
        title = title_tag.text.strip() if title_tag else ''

        byline_tag = soup.find('div', attrs={'data-testid': 'byline'})
        byline = byline_tag.text.strip() if byline_tag else ''

        body = soup.find('article', class_='article-body')
        paragraphs = body.find_all('p') if body else soup.find_all('p')
        content = ' '.join(p.text.strip() for p in paragraphs)

        return {
            'url': url,
            'title': title,
            'byline': byline,
            'content': content
        }
    else:
        print(f"Failed to fetch the page: {response['headers']['pc_status']}")
        return None

def save_article_data_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

# Example usage
article_urls = [
    "https://www.healthline.com/health-news/antacids-increase-migraine-risk",
    "https://www.healthline.com/health/migraine/what-to-ask-doctor-migraine"
]

# Scrape each URL once and keep only successful results
articles_data = [data for data in (scrape_article_page(url) for url in article_urls) if data]
save_article_data_to_csv(articles_data, 'healthline_articles_details.csv')
Snapshot of `healthline_articles_details.csv`:
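When scraping many article pages in a row, it’s also good practice to pace your requests. A simple alternative to the list comprehension above, with a short pause between fetches (a sketch built on the functions defined earlier):

import time

articles_data = []
for url in article_urls:
    data = scrape_article_page(url)
    if data:
        articles_data.append(data)
    time.sleep(2)  # brief pause between requests to avoid hammering the site

save_article_data_to_csv(articles_data, 'healthline_articles_details.csv')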
Final Thoughts
Scraping healthline.com can unlock valuable insights by extracting health-related content for research, analysis, or application development. Using tools like the Crawlbase Crawling API makes this process easier, even for websites with JavaScript rendering. With the step-by-step guidance provided in this blog, you can confidently scrape article listings and detailed pages while handling complexities like pagination and structured data storage.
Always remember to use the data responsibly and ensure your scraping activities comply with legal and ethical guidelines, including the website’s terms of service. If you want to do more web scraping, check out our guides on scraping other key websites.