ScrapeGraphAI is a Python web scraping library that uses Large Language Models (LLMs) and direct graph logic to build scraping pipelines for websites and local documents. This guide has three parts: an introduction with setup instructions, usage in different scenarios, and advanced use cases with custom configurations.
Introduction to ScrapeGraphAI and Setup
Overview:
ScrapeGraphAI combines LLMs with modular, graph-based pipelines to automate data extraction from websites and local files (XML, HTML, JSON, etc.).
Why Choose ScrapeGraphAI?
Traditional scraping tools often require manual configuration and tend to break when a website's structure changes. ScrapeGraphAI uses an LLM to locate the requested information in the page content, so it can tolerate many of these changes and reduces the need for constant developer intervention.
Features:
- Supports multiple LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models via Ollama.
- Flexible and low-maintenance, adapting to website structure changes automatically.
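Since the provider is chosen through the `model` string in the configuration, switching LLMs mostly comes down to swapping one dictionary. Here is a minimal sketch, assuming the `provider/model` naming used in this guide's examples. `build_llm_config` is a hypothetical helper written for illustration, not part of the library.

```python
from typing import Optional


def build_llm_config(model: str,
                     api_key: Optional[str] = None,
                     base_url: Optional[str] = None) -> dict:
    """Hypothetical helper: assemble the "llm" section of a graph config."""
    llm = {"model": model, "temperature": 0}
    if api_key is not None:
        llm["api_key"] = api_key      # hosted providers need a key
    if base_url is not None:
        llm["base_url"] = base_url    # local servers such as Ollama need a URL
    return {"llm": llm}


# Local model served by Ollama
local_config = build_llm_config("ollama/mistral", base_url="http://localhost:11434")

# Hosted model (the key is a placeholder, replace it with your own)
hosted_config = build_llm_config("groq/gemma-7b-it", api_key="YOUR_GROQ_API_KEY")
```

Either dictionary can then be merged with the other settings (embeddings, verbosity, and so on) before being passed to a graph.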
Library Diagram:
[Diagram: high-level architecture of ScrapeGraphAI, showing its nodes, graphs, and models]
Installation:
- Prerequisites:
  - Python >= 3.9
  - Pip
  - Ollama (for local models)
- Install the Library:
    pip install scrapegraphai
- Optional: Using Rye for Dependency Management:
    rye pin 3.10
    rye sync
    rye build
- Additional Requirements for WSL on Windows:
    sudo apt-get -y install libnss3 libnspr4 libgbm1 libasound2
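Before running the examples, it can help to confirm that the prerequisites are in place. The following is a small sketch using only the standard library. `check_prerequisites` is a hypothetical helper, and it only checks that the executables are on your `PATH`, not that Ollama is actually running.

```python
import shutil
import sys


def check_prerequisites(min_version=(3, 9)):
    """Return a dict saying whether each prerequisite appears to be available."""
    return {
        "python": sys.version_info[:2] >= min_version,
        "pip": shutil.which("pip") is not None or shutil.which("pip3") is not None,
        # Ollama is only required when you use local models
        "ollama": shutil.which("ollama") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```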
Using ScrapeGraphAI for Different Scenarios
1. SmartScraperGraph with Local Models:
Configure the graph for local models and run it to scrape a list of projects from a website.
Example:
from scrapegraphai.graphs import SmartScraperGraph

# Both models are served by a local Ollama instance
graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # ask the model for JSON output
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
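Printing is fine for a quick look, but you will usually want to keep the output. Below is a minimal sketch, assuming `run()` returns a JSON-serializable dictionary (the exact shape depends on your prompt and model). `sample_result` is made-up placeholder data standing in for a real result, and both helper functions are hypothetical.

```python
import json
from pathlib import Path


def result_to_json(result) -> str:
    """Serialize a scraping result to pretty-printed JSON text."""
    return json.dumps(result, indent=2, ensure_ascii=False)


def save_result(result, path) -> Path:
    """Write the result to a UTF-8 JSON file and return the file path."""
    out = Path(path)
    out.write_text(result_to_json(result), encoding="utf-8")
    return out


# Placeholder data, not real scraper output
sample_result = {
    "projects": [
        {"title": "Example project", "description": "Placeholder description"}
    ]
}

print(result_to_json(sample_result))
```

In a real script you would call `save_result(result, "projects.json")` right after `run()`.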
2. SearchGraph with Mixed Models:
Use Groq for the LLM and Ollama for the embeddings to search the web and scrape several pages from the search results.
Example:
from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",  # placeholder: replace with your Groq API key
        "temperature": 0,
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "max_results": 5,  # number of search results to scrape
}

# SearchGraph takes no source: pages come from the search results
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config,
)

result = search_graph.run()
print(result)
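The `api_key` value above is a placeholder string. One common way to avoid hardcoding the real key is to read it from an environment variable. This is a sketch under that assumption: `require_api_key` and `groq_llm_config` are hypothetical helpers, and the variable name `GROQ_API_KEY` is just a convention chosen here.

```python
import os


def require_api_key(env_var: str) -> str:
    """Read an API key from the environment, failing early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the graph"
        )
    return key


def groq_llm_config(model: str = "groq/gemma-7b-it") -> dict:
    """Build the "llm" section with the key taken from GROQ_API_KEY."""
    return {
        "model": model,
        "api_key": require_api_key("GROQ_API_KEY"),
        "temperature": 0,
    }
```

You would then write `"llm": groq_llm_config()` inside `graph_config`, which keeps the key out of your source code.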
3. SpeechGraph with OpenAI:
Generate an audio summary from a website using OpenAI’s LLM and TTS models.
Example:
from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",  # placeholder: replace with your OpenAI API key
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",  # placeholder: same key as above
        "model": "tts-1",
        "voice": "alloy",
    },
    "output_path": "audio_summary.mp3",  # where the audio file is written
}

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)
Advanced Use Cases and Custom Configurations
Advanced Configurations:
ScrapeGraphAI can be customized through its configuration, including proxy settings, detailed logging, and authentication handling.
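To give an idea of what such customization might look like, here is a hedged sketch that layers extra options on top of a base configuration. The `verbose` key appears in this guide's first example. The `headless` key and the `loader_kwargs` / `proxy` layout are assumptions that may differ between library versions, so check the documentation for the version you have installed. `with_advanced_options` is a hypothetical helper and the proxy URL is a placeholder.

```python
import copy
import logging


def with_advanced_options(base_config, proxy_server=None, headless=True, verbose=False):
    """Return a copy of base_config with optional browser and proxy settings."""
    config = copy.deepcopy(base_config)  # leave the caller's dict untouched
    config["verbose"] = verbose          # key used earlier in this guide
    config["headless"] = headless        # assumed key: run the browser without a window
    if proxy_server:
        # Assumed layout: proxy settings handed to the page loader
        config.setdefault("loader_kwargs", {})["proxy"] = {"server": proxy_server}
    return config


# Standard-library logging for your own script's messages
logging.basicConfig(level=logging.INFO)

base = {"llm": {"model": "ollama/mistral", "base_url": "http://localhost:11434"}}
config = with_advanced_options(
    base, proxy_server="http://proxy.example.com:8080", verbose=True
)
logging.info("Graph config keys: %s", sorted(config))
```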
Node Types:
Understand the different nodes available in ScrapeGraphAI, like ConditionalNode, FetchNode, ParseNode, and more, which allow for fine-grained control over the scraping process.
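To make the node idea concrete, here is a library-free sketch of a pipeline in which each node reads and updates a shared state dictionary. This is a conceptual illustration only: the functions below are hypothetical stand-ins and are not the library's FetchNode, ParseNode, or ConditionalNode classes, whose constructors and behavior are not shown here.

```python
import re
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]


def fetch_node(state: State) -> State:
    # Stand-in for a real fetch: the HTML is supplied directly, not downloaded
    return {**state, "doc": state["html"]}


def parse_node(state: State) -> State:
    # Strip tags and collapse whitespace to get plain text
    text = re.sub(r"<[^>]+>", " ", str(state["doc"]))
    return {**state, "text": " ".join(text.split())}


def summarize_node(state: State) -> State:
    # Stand-in for an LLM call: keep the first 40 characters
    return {**state, "answer": str(state["text"])[:40]}


def empty_node(state: State) -> State:
    return {**state, "answer": None}


def conditional_node(predicate, if_true: Node, if_false: Node) -> Node:
    """Build a node that picks one of two branches based on the state."""
    def node(state: State) -> State:
        return if_true(state) if predicate(state) else if_false(state)
    return node


def run_pipeline(nodes: List[Node], state: State) -> State:
    for node in nodes:
        state = node(state)
    return state


pipeline = [
    fetch_node,
    parse_node,
    conditional_node(lambda s: bool(s["text"]), summarize_node, empty_node),
]
final_state = run_pipeline(pipeline, {"html": "<h1>Projects</h1><p>Example project</p>"})
print(final_state["answer"])
```

The conditional step is what lets a graph skip the expensive answer-generation stage when a page turns out to be empty.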
Use Cases:
- E-commerce: Scrape product details, reviews, and pricing from multiple websites.
- Real Estate: Gather property listings, descriptions, and prices for market analysis.
- News Aggregation: Collect news articles, summaries, and metadata for a news aggregator.
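For use cases like these, LLM output can vary from run to run, so it may be worth checking scraped records before storing them. Here is a minimal sketch, assuming the result has been reduced to a list of dictionaries. The field names and sample records are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def find_incomplete(records: Iterable[Dict[str, object]],
                    required: List[str]) -> List[Tuple[int, List[str]]]:
    """Return (index, missing_fields) for each record lacking a required field."""
    problems = []
    for index, record in enumerate(records):
        # A field counts as missing if it is absent, None, or an empty string
        missing = [field for field in required if record.get(field) in (None, "")]
        if missing:
            problems.append((index, missing))
    return problems


# Made-up product records for illustration
products = [
    {"name": "Example lamp", "price": "19.99", "rating": "4.5"},
    {"name": "Example chair", "price": ""},
]
print(find_incomplete(products, ["name", "price", "rating"]))
```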
Future Roadmap:
ScrapeGraphAI’s development roadmap includes integrating more LLM APIs, improving documentation, and adding features like proxy rotation and enhanced search engine support.
Conclusion
ScrapeGraphAI offers a flexible solution for web scraping, using LLMs to adapt to changing web structures and simplify data extraction. By following this guide, you can set up ScrapeGraphAI and apply it to a variety of use cases.