DEV Community

ProspexAI
ScrapeGraphAI: Revolutionizing Web Scraping with AI

ScrapeGraphAI is a Python web scraping library that uses Large Language Models (LLMs) and direct graph logic to build scraping pipelines for websites and local documents. This guide has three parts: an introduction with setup instructions, usage examples for different scenarios, and advanced use cases with custom configurations.

Introduction to ScrapeGraphAI and Setup

Overview:
ScrapeGraphAI stands out in the data-intensive digital landscape by integrating LLMs and modular graph-based pipelines. This library automates the scraping of data from various sources like websites and local files (XML, HTML, JSON, etc.).

Why Choose ScrapeGraphAI?
Traditional scraping tools often require manual configuration and struggle when website structures change. ScrapeGraphAI relies on LLMs to adapt to these changes, which reduces the need for constant developer intervention.

Features:

  • Supports multiple LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models via Ollama.
  • Flexible and low-maintenance, adapting to website structure changes automatically.
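
Switching providers mostly comes down to changing the "llm" entry of the graph configuration. The following is a minimal sketch, assuming only the keys used in the examples later in this guide (model, api_key, base_url, temperature). The helper name build_llm_config is hypothetical and not part of the library.

```python
from typing import Optional


def build_llm_config(model: str, api_key: Optional[str] = None,
                     base_url: Optional[str] = None, temperature: float = 0) -> dict:
    """Build the "llm" section of a graph config (hypothetical helper)."""
    llm = {"model": model, "temperature": temperature}
    # Only include the keys that the chosen provider actually needs.
    if api_key is not None:
        llm["api_key"] = api_key
    if base_url is not None:
        llm["base_url"] = base_url
    return {"llm": llm}


# Local model served by Ollama, as in the SmartScraperGraph example.
local_config = build_llm_config("ollama/mistral", base_url="http://localhost:11434")

# Hosted model; the key string is a placeholder, not a real credential.
hosted_config = build_llm_config("groq/gemma-7b-it", api_key="GROQ_API_KEY")

print(local_config)
print(hosted_config)
```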

Library Diagram:
The diagram illustrates the high-level architecture of ScrapeGraphAI, showcasing its nodes, graphs, and models.

[Figure: ScrapeGraphAI Overview]

Installation:

  1. Prerequisites:

    • Python >= 3.9
    • Pip
    • Ollama (for local models)
  2. Install the Library:

   pip install scrapegraphai
  3. Optional: Using Rye for Dependency Management:
   rye pin 3.10
   rye sync
   rye build
  4. Additional Requirements for WSL on Windows:
   sudo apt-get -y install libnss3 libnspr4 libgbm1 libasound2
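
Before running the examples, it can be useful to check the prerequisites listed above from Python itself. This is a small sketch using only the standard library; it assumes the Ollama command-line binary is named ollama, and the function name check_prerequisites is hypothetical.

```python
import importlib.util
import shutil
import sys


def check_prerequisites() -> dict:
    """Report which prerequisites look satisfied on this machine."""
    return {
        "python>=3.9": sys.version_info >= (3, 9),
        # find_spec returns None when the package is not installed.
        "scrapegraphai installed": importlib.util.find_spec("scrapegraphai") is not None,
        # Only needed for local models; assumes the binary is called "ollama".
        "ollama on PATH": shutil.which("ollama") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```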

Using ScrapeGraphAI for Different Scenarios

1. SmartScraperGraph with Local Models:
Configure the graph for local models and run it to scrape a list of projects from a website.

Example:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the output format specified explicitly
        "base_url": "http://localhost:11434",  # URL of the local Ollama server
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # URL of the local Ollama server
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
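
Printing the result is enough for a quick check, but scraped data is usually stored for later use. The sketch below assumes the value returned by run() is a JSON-serializable dictionary; the helper save_result, the file name, and the example data are all hypothetical stand-ins.

```python
import json
import tempfile
from pathlib import Path


def save_result(result: dict, path: Path) -> Path:
    """Write a scraping result to disk as pretty-printed UTF-8 JSON."""
    path.write_text(json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


# Stand-in for the value returned by smart_scraper_graph.run(); the shape is illustrative.
example_result = {"projects": [{"title": "Example project", "description": "Placeholder"}]}

saved = save_result(example_result, Path(tempfile.gettempdir()) / "projects.json")
print(saved.read_text(encoding="utf-8"))
```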

2. SearchGraph with Mixed Models:
Combine Groq for the LLM and Ollama for embeddings to scrape multiple pages based on search results.

Example:

from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",  # placeholder: replace with your Groq API key
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "max_results": 5,
}

search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config
)

result = search_graph.run()
print(result)
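
Hosted providers need a real API key, and hardcoding it in the script is easy to leak. One common alternative is to read it from an environment variable. This is a sketch under that assumption; the helper names require_env and groq_llm_config are hypothetical, and the variable name GROQ_API_KEY simply mirrors the placeholder in the example above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the graph")
    return value


def groq_llm_config() -> dict:
    """Build the "llm" entry from the SearchGraph example, reading the key at call time."""
    return {
        "model": "groq/gemma-7b-it",
        "api_key": require_env("GROQ_API_KEY"),
        "temperature": 0,
    }
```

The returned dictionary can then be used as the "llm" value of graph_config in place of the literal one.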

3. SpeechGraph with OpenAI:
Generate an audio summary from a website using OpenAI’s LLM and TTS models.

Example:

from scrapegraphai.graphs import SpeechGraph

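graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",  # placeholder: replace with your OpenAI API key
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",  # placeholder: replace with your OpenAI API key
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "audio_summary.mp3",  # where the generated audio is saved
}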

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)
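
After the run, it can help to confirm that the audio file actually landed on disk. This is a small sketch, assuming the graph writes to the output_path given in the configuration above; the helper audio_was_written is hypothetical.

```python
from pathlib import Path


def audio_was_written(output_path: str) -> bool:
    """Return True if the file exists and is not empty."""
    path = Path(output_path)
    return path.is_file() and path.stat().st_size > 0


if audio_was_written("audio_summary.mp3"):
    print("Audio summary saved.")
else:
    print("No audio file found; check the run output for errors.")
```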

Advanced Use Cases and Custom Configurations

Advanced Configurations:
ScrapeGraphAI supports a range of configurations and customizations, such as proxy settings, detailed logging, and authentication handling.
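
For the logging side, a minimal sketch combines Python's standard logging module with the "verbose" flag shown in the SmartScraperGraph example. The helper names and the logger name "scraper" are hypothetical. Proxy and authentication keys are not shown here because this guide does not specify them.

```python
import logging


def configure_logging(debug: bool = False) -> logging.Logger:
    """Set up standard-library logging for the script that drives the graphs."""
    logging.basicConfig(
        level=logging.DEBUG if debug else logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger("scraper")  # hypothetical logger name


def with_verbose(graph_config: dict, verbose: bool) -> dict:
    """Return a copy of a graph config with the "verbose" flag set."""
    updated = dict(graph_config)
    updated["verbose"] = verbose
    return updated


logger = configure_logging(debug=True)
base_config = {"llm": {"model": "ollama/mistral", "base_url": "http://localhost:11434"}}
debug_config = with_verbose(base_config, True)
logger.info("verbose enabled: %s", debug_config["verbose"])
```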

Node Types:
ScrapeGraphAI provides several node types, such as ConditionalNode, FetchNode, and ParseNode, which allow fine-grained control over each step of the scraping process.
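
To make the node idea concrete, here is a toy pipeline in plain Python. It does not use ScrapeGraphAI's classes or API; the functions fetch_node, parse_node, and conditional_node are hypothetical stand-ins that only illustrate how each node reads a shared state and passes an updated one to the next.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]


def fetch_node(state: State) -> State:
    """Stand-in for fetching: treats the source string as the fetched document."""
    return {**state, "document": state["source"]}


def parse_node(state: State) -> State:
    """Stand-in for parsing: splits the document into non-empty lines."""
    lines = [line.strip() for line in str(state["document"]).splitlines() if line.strip()]
    return {**state, "chunks": lines}


def conditional_node(state: State) -> State:
    """Records whether later steps have anything to work on."""
    return {**state, "has_content": bool(state["chunks"])}


def run_pipeline(nodes: List[Node], state: State) -> State:
    """Run the nodes in order, threading the state through each one."""
    for node in nodes:
        state = node(state)
    return state


final_state = run_pipeline(
    [fetch_node, parse_node, conditional_node],
    {"source": "Project A\n\nProject B\n"},
)
print(final_state["chunks"])  # ['Project A', 'Project B']
```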

Use Cases:

  • E-commerce: Scrape product details, reviews, and pricing from multiple websites.
  • Real Estate: Gather property listings, descriptions, and prices for market analysis.
  • News Aggregation: Collect news articles, summaries, and metadata for a news aggregator.
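
For use cases like the e-commerce one, scraped values usually need some cleanup before analysis. The sketch below assumes the scraper returns a list of dictionaries with "name" and "price" keys; that shape, the Product class, and the helper names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Product:
    name: str
    price: Optional[float]


def parse_price(text: str) -> Optional[float]:
    """Extract a number from text such as "$1,299.00"; return None if there is none."""
    cleaned = "".join(ch for ch in text if ch in "0123456789.")
    try:
        return float(cleaned)
    except ValueError:
        return None


def to_products(raw_items: List[dict]) -> List[Product]:
    """Normalize raw scraped items into typed records."""
    return [
        Product(
            name=str(item.get("name", "")).strip(),
            price=parse_price(str(item.get("price", ""))),
        )
        for item in raw_items
    ]


products = to_products([{"name": " Lamp ", "price": "$1,299.00"}, {"name": "Desk"}])
print(products)
```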

Future Roadmap:
ScrapeGraphAI’s development roadmap includes integrating more LLM APIs, improving documentation, and adding features like proxy rotation and enhanced search engine support.

Conclusion

ScrapeGraphAI offers a flexible approach to web scraping that uses LLMs to adapt to changing web structures and simplify data extraction. With the setup steps and examples in this guide, you can apply it to a variety of use cases.

Sources:

- ScrapeGraphAI GitHub Repository
