This blog was originally posted on the Crawlbase Blog.
Healthline.com is one of the top health and wellness websites, offering detailed articles, tips, and insights from experts. From article lists to in-depth guides, it has content for many uses. Whether you’re researching, building a health database, or analyzing wellness trends, scraping data from Healthline can be super useful.
But scraping a dynamic website like healthline.com is not easy. The site uses JavaScript to render its pages, so traditional web scraping methods won’t work. That’s where the Crawlbase Crawling API comes in: it handles JavaScript-rendered content seamlessly, making the whole scraping process much easier.
In this blog, we will cover why you might want to scrape Healthline.com, the key data points to target, how to scrape it using the Crawlbase Crawling API with Python, and how to store the scraped data in a CSV file. Let’s get started!
Scraping Healthline.com Articles Listings
To scrape article listings from healthline.com, we’ll use the Crawlbase Crawling API for dynamic JavaScript rendering. Let’s break this down step by step, with professional yet easy-to-understand code examples.
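Before diving in, make sure the required libraries are installed. Assuming you’re using the Crawlbase Python SDK along with BeautifulSoup and pandas (the packages imported in the examples below), a typical setup looks like this:

pip install crawlbase beautifulsoup4 pandas

You’ll also need a Crawlbase account to get your JavaScript (JS) token, which is the one to use for sites that render content client-side.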
1. Inspecting the HTML Structure
Before writing code, open healthline.com and navigate to an article listing page. Use the browser’s developer tools (usually opened by pressing F12) to inspect the HTML structure.
Example of an article link structure:
<div class="css-1hm2gwy">
  <div>
    <a
      class="css-17zb9f8"
      data-event="|Global Header|Search Result Click"
      data-element-event="INTERNAL LINK|SECTION|Any Page|SEARCH RESULTS|LINK|/health-news/antacids-increase-migraine-risk|"
      href="https://www.healthline.com/health-news/antacids-increase-migraine-risk"
    >
      <span class="ais-Highlight">
        <span class="ais-Highlight-nonHighlighted">Antacids Associated with Higher Risk of </span>
        <em class="ais-Highlight-highlighted">Migraine</em>
        <span class="ais-Highlight-nonHighlighted">, Severe Headaches</span>
      </span>
    </a>
  </div>
  <div class="css-1evntxy">
    <span class="ais-Highlight">
      <span class="ais-Highlight-nonHighlighted">New research suggests that people who take antacids may be at greater risk for </span>
      <em class="ais-Highlight-highlighted">migraine</em>
      <span class="ais-Highlight-nonHighlighted"> attacks and severe headaches.</span>
    </span>
  </div>
</div>
Identify elements such as:

- Article titles: found in an `a` tag with class `css-17zb9f8`.
- Links: found in the `href` attribute of the same `a` tag.
- Description: found in a `div` element with class `css-1evntxy`.
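To confirm these selectors behave as expected, you can run them against the sample markup before touching the live site. A minimal check (using a trimmed version of the HTML above):

from bs4 import BeautifulSoup

# Trimmed version of the sample markup above, just enough to test the selectors
sample_html = '''
<div class="css-1hm2gwy">
  <div>
    <a class="css-17zb9f8" href="https://www.healthline.com/health-news/antacids-increase-migraine-risk">
      Antacids Associated with Higher Risk of Migraine, Severe Headaches
    </a>
  </div>
  <div class="css-1evntxy">New research suggests that people who take antacids may be at greater risk.</div>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
link = soup.find('a', class_='css-17zb9f8')
print(link.text.strip())                                    # article title
print(link['href'])                                         # article URL
print(soup.find('div', class_='css-1evntxy').text.strip())  # description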
2. Writing the Healthline.com Listing Scraper
We’ll use the Crawlbase Crawling API to fetch the page content and BeautifulSoup to parse it. To handle JavaScript-rendered content, we’ll pass the `ajax_wait` and `page_wait` parameters provided by the Crawling API. You can read more about these parameters here.
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Initialize the Crawlbase Crawling API with your JS token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

# Wait for AJAX calls and give the page 5 seconds to render
options = {
    'ajax_wait': 'true',
    'page_wait': '5000'
}

def scrape_article_listings(url):
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        articles = []
        # Each result link carries the class identified during inspection
        for item in soup.find_all('a', class_='css-17zb9f8'):
            article_title = item.text.strip()
            # urljoin handles both relative and absolute hrefs
            article_url = urljoin("https://www.healthline.com", item['href'])
            articles.append({'title': article_title, 'url': article_url})
        return articles
    else:
        print(f"Failed to fetch the page: {response['headers']['pc_status']}")
        return []

# Example usage
url = "https://www.healthline.com/search?q1=migraine"
article_listings = scrape_article_listings(url)
print(article_listings)
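Dynamic pages can occasionally time out or return a non-200 status. A minimal retry wrapper (a sketch, reusing the `crawling_api` and `options` objects defined above) makes the scraper more resilient:

import time

def fetch_with_retries(url, retries=3, delay=5):
    # Retry transient failures a few times before giving up
    for attempt in range(retries):
        response = crawling_api.get(url, options)
        if response['headers']['pc_status'] == '200':
            return response
        print(f"Attempt {attempt + 1} failed, retrying in {delay}s...")
        time.sleep(delay)
    return None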
3. Storing Data in a CSV File
You can use the `pandas` library to save the scraped data into a CSV file for easy access.
import pandas as pd
def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")
4. Complete Code
Combining everything, here’s the full scraper:
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

# Initialize the Crawlbase Crawling API
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

# Wait for AJAX calls and give the page 5 seconds to render
options = {
    'ajax_wait': 'true',
    'page_wait': '5000'
}

def scrape_article_listings(url):
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        articles = []
        for item in soup.find_all('a', class_='css-17zb9f8'):
            article_title = item.text.strip()
            article_url = urljoin("https://www.healthline.com", item['href'])
            articles.append({'title': article_title, 'url': article_url})
        return articles
    else:
        print(f"Failed to fetch the page: {response['headers']['pc_status']}")
        return []

def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

# Example usage
url = "https://www.healthline.com/search?q1=migraine"
articles = scrape_article_listings(url)
save_to_csv(articles, 'healthline_articles.csv')
Snapshot of `healthline_articles.csv`:
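Search results on healthline.com span multiple pages. The exact pagination mechanism is something to confirm in your browser’s developer tools, but assuming the search URL accepts a hypothetical `page` query parameter, a loop over several pages could look like this:

# Hypothetical pagination loop -- confirm the real page parameter in dev tools first
all_articles = []
for page in range(1, 4):
    page_url = f"https://www.healthline.com/search?q1=migraine&page={page}"  # assumed URL format
    all_articles.extend(scrape_article_listings(page_url))

save_to_csv(all_articles, 'healthline_articles.csv')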
Scraping Healthline.com Article Page
After collecting the listing of articles, the next step is to scrape details from individual article pages. Each article page typically contains detailed content, such as the title, publication date, and main body text. Here’s how to extract this data efficiently using the Crawlbase Crawling API and Python.
1. Inspecting the HTML Structure
Open an article page from healthline.com in your browser and inspect the page source using developer tools (F12).
Look for:
- Title: found in an `<h1>` tag with class `css-6jxmuv`.
- Byline: found in a `div` with the attribute `data-testid="byline"`.
- Body content: found in an `article` tag with class `article-body`.
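Class names like `css-6jxmuv` are auto-generated and can change between site deployments, so it’s worth guarding against missing elements. A small defensive helper (a sketch; `tag` is whatever `soup.find(...)` returns) keeps the scraper from crashing on a `None` result:

def safe_text(tag):
    # Return stripped text, or an empty string if the element wasn't found
    return tag.text.strip() if tag else ''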
2. Writing the Healthline.com Article Scraper
We’ll fetch the article’s HTML using the Crawlbase Crawling API and extract the desired information using BeautifulSoup.
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

# Initialize the Crawlbase Crawling API with your JS token
crawling_api = CrawlingAPI({'token': 'CRAWLBASE_JS_TOKEN'})

options = {
    'ajax_wait': 'true',
    'page_wait': '5000'
}

def scrape_article_page(url):
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract details using the elements identified during inspection
        title_tag = soup.find('h1', class_='css-6jxmuv')
        title = title_tag.text.strip() if title_tag else ''

        byline_tag = soup.find('div', attrs={'data-testid': 'byline'})
        byline = byline_tag.text.strip() if byline_tag else ''

        # Restrict paragraphs to the article body when it's present
        body = soup.find('article', class_='article-body')
        paragraphs = body.find_all('p') if body else soup.find_all('p')
        content = ' '.join(p.text.strip() for p in paragraphs)

        return {
            'url': url,
            'title': title,
            'byline': byline,
            'content': content
        }
    else:
        print(f"Failed to fetch the page: {response['headers']['pc_status']}")
        return None

# Example usage
article_url = "https://www.healthline.com/health-news/antacids-increase-migraine-risk"
article_details = scrape_article_page(article_url)
print(article_details)
3. Storing Data in a CSV File
After scraping multiple article pages, save the extracted data into a CSV file using `pandas`.
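The same helper pattern from the listing section works here; the complete code below uses this function:

import pandas as pd

def save_article_data_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")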
4. Complete Code
Here’s the combined code for scraping multiple articles and saving them to a CSV file:
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import pandas as pd

# Initialize the Crawlbase Crawling API
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

options = {
    'ajax_wait': 'true',
    'page_wait': '5000'
}

def scrape_article_page(url):
    response = crawling_api.get(url, options)
    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract details using the elements identified during inspection
        title_tag = soup.find('h1', class_='css-6jxmuv')
        title = title_tag.text.strip() if title_tag else ''

        byline_tag = soup.find('div', attrs={'data-testid': 'byline'})
        byline = byline_tag.text.strip() if byline_tag else ''

        body = soup.find('article', class_='article-body')
        paragraphs = body.find_all('p') if body else soup.find_all('p')
        content = ' '.join(p.text.strip() for p in paragraphs)

        return {
            'url': url,
            'title': title,
            'byline': byline,
            'content': content
        }
    else:
        print(f"Failed to fetch the page: {response['headers']['pc_status']}")
        return None

def save_article_data_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

# Example usage
article_urls = [
    "https://www.healthline.com/health-news/antacids-increase-migraine-risk",
    "https://www.healthline.com/health/migraine/what-to-ask-doctor-migraine"
]

# Scrape each URL once and keep only successful results
articles_data = [data for data in (scrape_article_page(url) for url in article_urls) if data]
save_article_data_to_csv(articles_data, 'healthline_articles_details.csv')
Snapshot of `healthline_articles_details.csv`:
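When scraping many article pages in a row, it’s also good practice to pace your requests. A simple alternative to the list comprehension above, with a short pause between fetches (a sketch built on the functions defined earlier):

import time

articles_data = []
for url in article_urls:
    data = scrape_article_page(url)
    if data:
        articles_data.append(data)
    time.sleep(2)  # brief pause between requests to avoid hammering the site

save_article_data_to_csv(articles_data, 'healthline_articles_details.csv')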
Final Thoughts
Scraping healthline.com can unlock valuable insights by extracting health-related content for research, analysis, or application development. Using tools like the Crawlbase Crawling API makes this process easier, even for websites with JavaScript rendering. With the step-by-step guidance provided in this blog, you can confidently scrape article listings and detailed pages while handling complexities like pagination and structured data storage.
Always remember to use the data responsibly and ensure your scraping activities comply with legal and ethical guidelines, including the website’s terms of service. If you want to do more web scraping, check out our guides on scraping other key websites.