LinkedIn is the world's largest professional network on the internet. You can use LinkedIn to find the right job or internship, connect and strengthen professional relationships, and learn the skills you need to succeed in your career. A complete LinkedIn profile can help you connect with opportunities by showcasing your unique professional story through experience, skills, and education.
In this tutorial, let's look at how to implement a web scraper in Python that gathers job details and company profiles from the posted jobs list on LinkedIn and saves them in a .json file.
This tutorial is a complete beginner's guide to web scraping using Python.
What is Web Scraping?
Web scraping refers to the extraction of data from a website. This information is collected and then exported into a format that is more useful for the user, be it a spreadsheet or an API. In most cases, automated tools are preferred when scraping web data, as they can be less costly and work at a faster rate.
Getting started
In order to complete this task, we need two Python libraries widely used in web scraping. The first one, Selenium, is used to navigate web pages and interact with them. The other one, BeautifulSoup, is used to scrape data from web pages.
So let's install them.
pip install selenium
pip install beautifulsoup4
Installing the web driver
To work with Selenium, you need to install the web driver for your browser. WebDriver is an open-source tool for automated testing of web apps across many browsers. It provides capabilities for navigating web pages, user input, JavaScript execution, and more. If you are using Chrome, you can download the driver from this link. It's important to check your browser version before downloading, as the driver version must match the browser version.
Task explanation
On LinkedIn, a user account is not required to search for jobs. We can simply navigate to linkedin.com/jobs and search for any job vacancies available in our area. So our task is to automatically search for jobs on LinkedIn and save the job list and company profiles as a .json file.
Navigating to the webpage
As mentioned earlier, we use Selenium for navigation purposes. Let's import it into our program.
from selenium import webdriver
Then we need to instantiate our web driver as a driver object.
driver = webdriver.Chrome(location)
Replace location with the path to your web driver executable. Also see the other supported browsers.
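Note that in Selenium 4, the positional path argument was deprecated and later removed; on a recent release, the driver path goes through a Service object instead, and with Selenium 4.6+ Selenium Manager can usually locate a matching driver automatically. A minimal sketch, where the path is a placeholder:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: wrap the driver path in a Service object.
driver = webdriver.Chrome(service=Service("/path/to/chromedriver"))

# With Selenium 4.6+, you can often omit the path entirely:
# driver = webdriver.Chrome()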
Now we can use the get() method to navigate to the website by its URL.
driver.get("https://www.linkedin.com/jobs") #URL
If you run the program now, you can see that it spins up a new browser and navigates to the URL. You'll also notice a banner under the address bar saying that the browser is being controlled by automated test software.
Interacting with the page
The next step is to search for jobs. First, let's save the job title and location that we want to search for in separate strings.
job_title = 'software engineer'
job_location = 'sri lanka'
A web page consists of HTML elements. In order to interact with the page, we need to find the elements we want to act on and then find the selector or locator information for those elements of interest. The easiest way is to inspect the page using the browser's developer tools. Place the cursor anywhere on the webpage, right-click to open the context menu, then select the Inspect option. In the Elements panel, move the cursor over the DOM structure of the page until it highlights the desired element. From there, we can find the HTML tag, the defined attributes, and the attribute values.
Next, we need to pass this information to the Selenium web driver to simulate user actions on those elements. Selenium provides various find_element methods to find elements based on the attribute/value criteria or selector value that we supply in our script. For that, the By class needs to be imported from Selenium.
from selenium.webdriver.common.by import By
These are the various ways attributes can be used to locate elements on a page (a short usage example follows the list).
find_element(By.ID, "id")
find_element(By.NAME, "name")
find_element(By.XPATH, "xpath")
find_element(By.LINK_TEXT, "link text")
find_element(By.PARTIAL_LINK_TEXT, "partial link text")
find_element(By.TAG_NAME, "tag name")
find_element(By.CLASS_NAME, "class name")
find_element(By.CSS_SELECTOR, "css selector")
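As a quick illustration of two of these strategies (the locator values here are hypothetical placeholders, not real elements on LinkedIn's page):
# Find a single element by its id attribute (placeholder id).
search_box = driver.find_element(By.ID, "search-input")

# find_elements (plural) returns a list of every matching element.
links = driver.find_elements(By.TAG_NAME, "a")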
In our case, I am using the XPATH strategy to locate the input tags for the title and location.
XPath is the language used for locating nodes in an XML document. As HTML can be an implementation of XML (XHTML), Selenium users can leverage this powerful language to target elements in their web applications.
You can copy the XPath of an element by right-clicking on it in the Elements panel.
Now save them in separate variables.
search_title = driver.find_element(By.XPATH, '//*[@id="JOBS"]/section[1]/input')
search_location = driver.find_element(By.XPATH, '//*[@id="JOBS"]/section[2]/input')
Now, to pass the string values to the inputs, we can use the send_keys() method.
search_title.send_keys(job_title)
search_location.clear()
search_location.send_keys(job_location, Keys.ENTER)
The location input is sometimes auto-filled with a default location based on your IP address. The clear() method is used to clear any default value from the input. The Keys.ENTER argument sends the ENTER key after the input values are typed. Before that, Keys should be imported.
from selenium.webdriver.common.keys import Keys
It is important to pause the program for some time, as the search results should be properly loaded before the next steps. For this, we can use the built-in time library.
import time
time.sleep(3) #sleeps for 3 seconds
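A fixed sleep either wastes time or fails when the page loads slowly. A more robust alternative is Selenium's explicit waits, which poll until the element actually appears; here is a sketch using the results-list class we locate below:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the results list to be present in the DOM.
jobs_list = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'jobs-search__results-list'))
)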
Finally, we can get the ul element which contains the job list by locating it with the By.CLASS_NAME strategy.
jobs_list = driver.find_element(By.CLASS_NAME,'jobs-search__results-list')
Scraping data from the web page
Now we can use BeautifulSoup to scrape the necessary data from the job list. Let's import it first.
from bs4 import BeautifulSoup
As an initial step, we need to pass the HTML of jobs_list to BeautifulSoup.
soup = BeautifulSoup(jobs_list.get_attribute('outerHTML'), 'html.parser')
Similar to Selenium, we can retrieve all li items in the ul by their tag name into a list. See the Bs4 documentation for more information.
jobs = soup('li')
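Calling the soup object directly is shorthand for find_all(), so the line above is equivalent to:
jobs = soup.find_all('li')  # same result as soup('li')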
Now let's make a list of the information we need to extract from every job item.
- job title
- location
- link to the job details
- link to the company profile
We use a for loop to iterate through each item in the jobs list and retrieve information.
data = []
for job in jobs:
    item = {}
    item["job_title"] = job.find("h3", class_="base-search-card__title").text.strip(" \n")
    item["company"] = job.find("h4", class_="base-search-card__subtitle").text.strip(" \n")
    item["location"] = job.find("span", class_="job-search-card__location").text.strip(" \n")
    job_details = job.find("a", class_="base-card__full-link")
    item["job_details"] = job_details["href"].split('?', 1)[0]
    company_profile = job.find("a", class_="hidden-nested-link")
    item["company_profile"] = company_profile.attrs["href"].split('?', 1)[0]
    data.append(item)
In the above code, I declared an empty data list to store the information. In every iteration of the for loop, I used the find method to locate the elements we need to get data from. The text attribute is used to get the text content of the HTML tags, and attrs[] is used to get the attribute values of an element. The strip() and split() methods are used for basic text formatting. In every iteration, all the information is saved in a separate dictionary, which becomes one JSON object in the output.
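One caveat: find() returns None when a matching tag is missing, and the code above would then raise an AttributeError. If LinkedIn ever returns a list item that is not a job card, a guarded version of the loop body (a sketch reusing the same class names) can simply skip it:
for job in jobs:
    title_tag = job.find("h3", class_="base-search-card__title")
    details_tag = job.find("a", class_="base-card__full-link")
    if title_tag is None or details_tag is None:
        continue  # not a job card; skip this list item
    # ... extract the remaining fields as shown above ...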
Save data in a JSON file
Now that all the retrieved data has been appended to the data list, we can use the built-in json library to save it in a new JSON file.
import json
with open("jobs.json", "w") as writeJSON:
json.dump(data, writeJSON, indent=4)
The final output will look like this.
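Here is an illustrative sketch of the structure, with hypothetical values (your results will differ):
[
    {
        "job_title": "Software Engineer",
        "company": "Example Corp",
        "location": "Colombo, Western Province, Sri Lanka",
        "job_details": "https://www.linkedin.com/jobs/view/software-engineer-at-example-corp-1234567890",
        "company_profile": "https://www.linkedin.com/company/example-corp"
    }
]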
Conclusion
The final program
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import json
import time

job_title = 'software engineer' #replace job title
job_location = 'sri lanka' #replace location

driver = webdriver.Chrome('webdriver/chrome/chromedriver') #replace the webdriver location
driver.get("https://www.linkedin.com/jobs")

search_title = driver.find_element(By.XPATH, '//*[@id="JOBS"]/section[1]/input')
search_location = driver.find_element(By.XPATH, '//*[@id="JOBS"]/section[2]/input')
search_title.send_keys(job_title)
search_location.clear()
search_location.send_keys(job_location, Keys.ENTER)

time.sleep(3) #wait for the search results to load

jobs_list = driver.find_element(By.CLASS_NAME, 'jobs-search__results-list')
soup = BeautifulSoup(jobs_list.get_attribute('outerHTML'), 'html.parser')
jobs = soup('li')

data = []
for job in jobs:
    item = {}
    item["job_title"] = job.find("h3", class_="base-search-card__title").text.strip(" \n")
    item["company"] = job.find("h4", class_="base-search-card__subtitle").text.strip(" \n")
    item["location"] = job.find("span", class_="job-search-card__location").text.strip(" \n")
    job_details = job.find("a", class_="base-card__full-link")
    item["job_details"] = job_details["href"].split('?', 1)[0]
    company_profile = job.find("a", class_="hidden-nested-link")
    item["company_profile"] = company_profile.attrs["href"].split('?', 1)[0]
    data.append(item)

with open("jobs.json", "w") as writeJSON:
    json.dump(data, writeJSON, indent=4)

driver.quit()
With this program, you can easily scrape job details and company profile URLs from LinkedIn.
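As an optional tweak that is not part of the original program, Chrome can also run headless so that no browser window pops up while scraping:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # use plain "--headless" on older Chrome versions
driver = webdriver.Chrome(options=options)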
You can also download the program from my GitHub repository.
Thank You