Web scraping is an important tool for businesses and organizations today because it allows them to gather valuable information about their customers, competitors, the market, and more.
Web scraping automates the process of extracting valuable data from websites, transforming it from raw, unstructured content into a usable format like spreadsheets, JSON, or databases. This eliminates the time-consuming and error-prone task of manual data collection, making web scraping a cost-effective and powerful tool.
Imagine the possibilities of gathering your competitors' pricing data in real time, monitoring brand mentions across social media platforms, or collecting market research insights - all without lifting a finger. Web scraping makes all of this possible and gives your business a significant competitive advantage.
But how do you use this powerful technique? This article will guide you through the entire process of automating data collection with Apify, a user-friendly platform specifically designed for web scraping.
In this article, I will walk you through the whole process, from crafting your first scraping script (an Actor) with the Apify SDK for TypeScript to deploying it to the Apify platform for seamless data collection, and then running your deployed Actor there. With Apify, you don't need to be a programming pro to harness the power of web scraping and start gaining insights.
If you are as excited to learn about automated data collection with Apify as I was while writing this article, let's dive right in 🚀
What is Apify?
According to the Apify documentation,
Apify is a cloud platform that helps you build reliable web scrapers, fast, and automate anything you can do manually in a web browser.
Apify is designed to tackle large-scale and demanding web scraping and automation tasks. It offers a user-friendly interface and an API for accessing its features (a quick sketch of the API in action follows the list below):
- Compute instances (Actors): These let you run dedicated programs to handle your scraping or automation needs.
- Storage: Allows you to conveniently store both the requests you send and the results you obtain.
- Proxies: Apify proxies play a crucial role in web scraping, allowing you to anonymize your scraping activities, avoid IP address tracking, access geo-location-specific content, and more.
- Scheduling: Lets you run your scraping Actors and tasks at specific times or pre-defined intervals without manual initiation.
- Webhooks: Apify also lets you integrate other Apify Actors or external systems with your Actor or task runs; for example, you can send an alert when your Actor run succeeds or fails.
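As a quick taste of that API, here is a minimal sketch that starts an Actor run and reads its results using the official `apify-client` package. The Actor name `username/my-actor`, the input, and the `APIFY_TOKEN` environment variable are placeholders for illustration:

```typescript
import { ApifyClient } from 'apify-client';

// Authenticate with your personal API token (kept in an environment variable).
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start an Actor run and wait for it to finish.
// 'username/my-actor' is a placeholder - use your own Actor's name.
const run = await client.actor('username/my-actor').call({ url: 'https://www.apify.com' });

// Read the items the run stored in its default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```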
While Apify itself is a platform, it works seamlessly with Crawlee, an open-source library for web scraping. This means you can run Crawlee scraping jobs locally or on your preferred cloud infrastructure, offering flexibility alongside Apify's platform features.
Introducing the Apify SDK for JavaScript
Apify provides a powerful toolkit called the Apify SDK for JavaScript, designed to streamline the creation of Actors. These Actors function as serverless microservices, capable of running on the Apify platform or independently.
Previously, the Apify SDK offered a blend of crawling functionalities and Actor building features. However, a recent update separated these functionalities into two distinct libraries: Crawlee and Apify SDK v3. Crawlee now houses the web scraping and crawling tools, while Apify SDK v3 focuses solely on features specific to building Actors for the Apify platform. This distinction allows for a clear separation of concerns and enhances the development experience for various use cases.
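To make that separation concrete, here is a minimal sketch (distinct from the template we build later) in which Crawlee supplies the crawling machinery and the Apify SDK supplies the platform plumbing:

```typescript
import { Actor } from 'apify';             // Apify SDK v3: platform features
import { CheerioCrawler } from 'crawlee';  // Crawlee: crawling tools

await Actor.init();

const crawler = new CheerioCrawler({
    // Crawlee handles fetching and parsing each page...
    async requestHandler({ request, $ }) {
        // ...while the Apify SDK stores the results on the platform.
        await Actor.pushData({ url: request.url, title: $('title').text() });
    },
});

await crawler.run(['https://www.apify.com']);
await Actor.exit();
```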
What is an Actor?
According to the Apify Docs,
Actors are serverless programs running in the cloud. They can perform anything from simple actions (such as filling out a web form or sending an email) to complex operations (such as crawling an entire website or removing duplicates from a large dataset). Actor runs can be as short or as long as necessary. They could last seconds, hours, or even infinitely.
In the next section, I will walk you through building your scraper with Apify.
Prerequisites
To follow along with this article, you should satisfy the following conditions:
- An account with a Git provider like GitHub, GitLab, or Bitbucket to store your code repository.
- Node.js and npm or Yarn installed locally to manage dependencies and run commands.
- Basic terminal/command line knowledge to run commands for initializing projects, installing packages, deploying sites, etc.
- Apify CLI installed globally by running this command:

```bash
npm -g install apify-cli
```
- An account with Apify. Create a new account on the Apify Platform.
Building Your Scraper with Apify
The first step to building your scraper is to choose a code template from the many templates provided by Apify. Head over to the Actor templates repository and choose one.
The Actor templates help you quickly set up your web scraping projects, saving you development time and giving you immediate access to all the features the Apify platform has to offer.
For this article, I will be using the TypeScript Starter template. It comes with Node.js, Cheerio, and Axios already set up.
Click on your chosen template and you will be redirected to that template's page; there, click on "Use locally". This will display a popup with instructions on how to create your Actor from the chosen template.
With all the prerequisites satisfied, I will go ahead and create a new Actor from the TypeScript Starter template by running the following command in my terminal:

```bash
apify create my-actor -t getting_started_typescript
```

The above command uses the Apify CLI to create a new Actor called `my-actor` from the TypeScript Starter template and generates a set of files and folders. Below is the resulting folder structure:
```
├───.actor
├───.vscode
├───.gitignore
├───.dockerignore
├───node_modules
├───package.json
├───package-lock.json
├───README.md
├───tsconfig.json
├───src
│   └───main.ts
└───storage
    ├───datasets
    │   └───default
    ├───key_value_stores
    │   └───default
    │       └───INPUT.json
    └───request_queues
        └───default
```
Meanings of Selected Files
The `main.ts` file acts as the main script for your Apify project. It's written in TypeScript and uses several libraries to achieve its goal:

- **Fetching and Parsing Data:** It starts by importing libraries like `axios` to fetch data from the web and `cheerio` to parse the downloaded content (usually HTML).
- **Getting User Input:** It retrieves the URL you provide as input using `Actor.getInput()`. This URL likely points to the webpage you want to scrape data from.
- **Scraping Headings:** The script then fetches the webpage content using the provided URL and parses it with Cheerio. It extracts all heading elements (h1, h2, h3, etc.) and stores each heading's level and text content in an array.
- **Saving Results:** It saves the extracted headings (including level and text) to Apify's Dataset, which acts like a storage container for your scraped data.
- **Exiting Cleanly:** The script exits gracefully using `Actor.exit()` to ensure proper termination of your Apify scraping process.
The script fetches a webpage based on your input URL, extracts all headings, and stores them in Apify's Dataset for further use or analysis.
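For reference, here is a trimmed sketch of what the template's `main.ts` roughly looks like; the generated code in your project may differ slightly:

```typescript
import { Actor } from 'apify';
import axios from 'axios';
import * as cheerio from 'cheerio';

interface Input {
    url: string;
}

await Actor.init();

// Read the input defined in .actor/input_schema.json (INPUT.json locally).
const { url } = (await Actor.getInput<Input>()) ?? { url: 'https://www.apify.com' };

// Fetch the page and load the HTML into Cheerio.
const response = await axios.get(url);
const $ = cheerio.load(response.data);

// Collect every heading element with its level and text content.
const headings: { level: string; text: string }[] = [];
$('h1, h2, h3, h4, h5, h6').each((_, el) => {
    headings.push({ level: el.tagName.toLowerCase(), text: $(el).text().trim() });
});

// Store the results in the default Dataset.
await Actor.pushData(headings);

await Actor.exit();
```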
The `.actor/actor.json` file is where you will set up information about the Actor such as the `name`, `version`, `build tag`, `environment variables`, and more.
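As an illustration, a minimal `actor.json` for this project might look like the snippet below. The values are assumptions based on our Actor, so check your generated file for the exact content:

```json
{
    "actorSpecification": 1,
    "name": "my-actor",
    "title": "Getting started with TypeScript",
    "version": "0.0",
    "buildTag": "latest",
    "environmentVariables": {},
    "input": "./input_schema.json"
}
```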
The `.actor/input_schema.json` file defines the input of the Actor. In this case, I am using the Apify URL (https://www.apify.com). The content of this file is as shown below:
```json
{
    "title": "Scrape data from a web page",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "URL of the page",
            "type": "string",
            "description": "The URL of website you want to get the data from.",
            "editor": "textfield",
            "prefill": "https://www.apify.com"
        }
    },
    "required": ["url"]
}
```
Next, change into the `my-actor` directory by running this command:

```bash
# Go into the project directory
cd my-actor
```
Next, it is time to run your Actor locally. To do that, run this command:

```bash
# Run it locally
apify run
```
When you run this command, the Apify Actor reads the input URL from `storage/key_value_stores/default/INPUT.json` and saves the extracted headings in a special storage area called a Dataset. This Dataset lets you keep track of all the data you collect over time. A new Dataset is automatically created each time you run the script, ensuring your data stays organized.
Think of a Dataset like a table. Each heading you extract becomes a row in the table, with its level (h1, h2, etc.) and text content acting as separate columns. You can even view this data as a table within Apify, making it easy to browse and understand. Apify allows you to export your headings in a variety of formats, including JSON, CSV, and Excel. This lets you use the data in other applications or analyze it further with your favorite tools.
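Locally, the run's records land in `storage/datasets/default`. Each extracted heading is stored with its level and text, so the data has roughly this shape (the values below are made-up examples, not real output):

```json
[
    { "level": "h1", "text": "Example page title" },
    { "level": "h2", "text": "Example section heading" }
]
```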
Deploying and Running the Scraper
In this section, I will walk you through deploying your Actor to the Apify platform. To do this, we will make use of the Apify CLI.
First, you need to connect your local Actor with your Apify account, which requires your Apify API token. To get your token, navigate to the Apify Console, click "Settings" > "Integrations", and copy the API token shown there.
Next, in your terminal, change into the root directory of your created Actor, then run this command to sign in:

```bash
apify login -t YOUR_APIFY_TOKEN
```

Replace `YOUR_APIFY_TOKEN` with the actual token you just copied.
How to Deploy Your Actor
Apify CLI provides a command that you can use to deploy your Actor to the Apify platform:

```bash
apify push
```
This command will deploy and build the Actor on the Apify platform, printing the build logs in your terminal as it goes. You can find your newly created Actor under My Actors in the Apify Console.
Next, we will open the newly created Actor on the Apify platform and run it there. To do this, navigate to the My Actors page.
To run the Actor, click on it ("My Actor"), then click on "Build & Start".
The Actor tabs allow you to see the Code, Last build, Input, and Last run.
You can also export your dataset by clicking on the Export button. The supported formats include `JSON`, `CSV`, `XML`, `Excel`, `HTML Table`, `RSS`, and `JSONL`.
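If you prefer pulling the data programmatically, the Apify API exposes dataset items with a `format` parameter covering these same formats. Here is a minimal sketch, where `DATASET_ID` is a placeholder for your dataset's ID (you can find it on the run's Storage tab):

```typescript
// Download the dataset items as CSV through the Apify API.
// DATASET_ID and APIFY_TOKEN are placeholders for your own values.
const url = `https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv&token=${process.env.APIFY_TOKEN}`;

const response = await fetch(url);
console.log(await response.text());
```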
Wrapping Up
Through this article, you have learned how to automate data collection using Apify and JavaScript (TypeScript). You have also learned how to use Apify's Actor templates to scrape websites and efficiently store the extracted data in Apify's Dataset. With Apify handling the infrastructure and functionalities like scheduling and proxies, you can focus on crafting the core scraping logic using familiar JavaScript libraries like Cheerio and Puppeteer.
Apify offers a vast library of documentation and a supportive community to guide you on your path to becoming a web scraping expert.
Get started with Apify today 🚀
Further Readings
If you enjoyed reading this article, check out other great pieces from the Apify Blog, and follow the links below to learn more: