In our previous project, we used Python's requests-html library to scrape thehackernews.com and fetch the latest articles from every category. The project simply printed the title and link of every article to the terminal.
In this project, we will serve the scraped data through an API using Python's FastAPI framework. We will also tweak the way the scraping is done, to make it easier to serve the data and to trigger the scraping on demand.
So let's get started by cloning the THN_scraper project to our system.
git clone https://github.com/VishnuDileesh/THN_scraper
After moving into the folder, let's create a virtual environment to install our packages.
python3 -m venv env
source env/bin/activate
After activating the environment, let's install the needed packages from our requirements.txt
pip3 install -r requirements.txt
Now let's try running the project by calling the following command in our terminal.
python3 main.py
On running the project, the title and link of every article are printed to the terminal, which means we are all set with the base project. Now let's get started on serving the data through an API.
We will start by renaming our main.py file to scraper.py
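On Linux or macOS, the rename is a one-liner in the terminal:
mv main.py scraper.py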
Once the renaming is done, we need to make a few tweaks in our scraper.py file:
datas = []


def scrapeData():
    # reset results from any previous run before scraping again
    datas.clear()
    for category in categories:
        category = CategoryScrape(f'{baseURL}{category}', category)
        category.scrapeArticle()


def getScrapedData():
    return datas
Here, we put the scrape loop in a function named scrapeData so that we can call it from our API endpoint to do the actual scraping. We also create another function named getScrapedData, which our API endpoint will use to fetch the scraped data for serving.
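For context, CategoryScrape comes from the previous project and is what actually fills the datas list. A rough sketch of its shape, assuming requests-html is still doing the fetching (the CSS selectors below are placeholders, not copied from the repo):
from requests_html import HTMLSession


class CategoryScrape:
    """ Scrapes one category page and appends each article to the module-level datas list """

    def __init__(self, url, category):
        self.url = url
        self.category = category
        self.session = HTMLSession()

    def scrapeArticle(self):
        r = self.session.get(self.url)
        # '.body-post' stands in for whatever selector matches an article card
        for post in r.html.find('.body-post'):
            datas.append({
                'category': self.category,
                'title': post.find('h2', first=True).text,
                'link': post.find('a', first=True).attrs['href'],
            })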
That's all the work needed in the scraper.py file. Let's move forward by creating a new file named main.py, which will house all our API code.
Let's not forget to install FastAPI and uvicorn in our project.
pip3 install fastapi uvicorn
Now that we have everything installed, let's also update our requirements.txt file to reflect the newly installed packages:
pip3 freeze > requirements.txt
from fastapi import FastAPI, BackgroundTasks

from scraper import scrapeData, getScrapedData

app = FastAPI()


@app.get("/")
async def index():
    """ index route """
    return {
        "get-data": "visit /get-data to get scraped data",
        "scrape-data": "visit /scrape-data to activate scraping"
    }
In our main.py, we start by importing FastAPI and BackgroundTasks from the installed fastapi framework. Then we import the functions scrapeData and getScrapedData from our scraper.py file.
Then we create an app object by instantiating FastAPI(), and define our first route, which simply returns a small JSON response pointing to the other two routes.
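Once the server is up (we will wire that up in run.py below), the index route can be checked with a quick curl; the response is just our dictionary serialized to JSON (formatting may differ slightly):
curl http://localhost:8000/
{"get-data":"visit /get-data to get scraped data","scrape-data":"visit /scrape-data to activate scraping"}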
@app.get("/get-data")
async def get_data():
""" Get all scraped data as in json by visiting /get-data """
return getScrapedData()
@app.get("/scrape-data")
async def scrape_data(background_tasks: BackgroundTasks):
""" On doing a get request to '/scrape-data' you Activates scraping """
background_tasks.add_task(scrapeData)
return {"Status": "Activated Scraping"}
In our get-data route, we call the getScrapedData function imported from the scraper.py file. The route returns all the data we scraped from thehackernews.com.
In the scrape-data route, we pass the scrapeData function, imported from scraper.py, into background_tasks. This schedules scrapeData to run in the background after the response is sent, letting the script do all the scraping magic and fetch the latest hacking news articles without blocking the request.
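In practice, the flow looks like this (a sketch, assuming the server is running locally on port 8000): trigger a scrape, give it a few seconds to finish, then fetch the results.
curl http://localhost:8000/scrape-data
# responds immediately with {"Status":"Activated Scraping"}

curl http://localhost:8000/get-data
# once scraping finishes, returns the list of scraped articles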
Now we have our API ready to be served on a web server. We will use uvicorn as our web server. Let's create a new file named run.py, which will hold the web server configuration.
import uvicorn

if __name__ == "__main__":
    uvicorn.run("main:app",
                host='0.0.0.0',
                port=8000,
                reload=False,
                debug=False,
                workers=25)
Here in our run.py file, we import the installed uvicorn and, when the file is run directly, call uvicorn.run, passing in our app as the import string "main:app" (the app we initialized in the main.py file). We also specify the host and port for uvicorn to serve on, and set reload and debug to False, since they are development settings. With the workers parameter, we set the number of worker processes the server spawns to handle requests.
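As an aside, a roughly equivalent server (minus the debug flag) can be started straight from uvicorn's own command line, in case you prefer not to keep a run.py around:
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 25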
So, that's all the code we need for the project. Now we can run it with the following command in the terminal:
python3 run.py
Link to the project's GitHub repo:
VishnuDileesh / THN_scraper_api
A web API that scrapes and lists out the latest articles in all categories from thehackernews.com
Happy Coding, Keep Coding
Connect with me on LinkedIn
Connect with me on Instagram