Oxylabs for Oxylabs

Posted on Jan 3, 2023 • Edited on Jun 2, 2023 • Originally published at dev.to

News Scraping: Everything You Need to Know

#webdev #javascript #programming #beginners

Public news data can help companies stay ahead of their competition. However, for companies whose core business isn’t news aggregation or analysis, reading and analyzing articles from thousands of news outlets worldwide is bound to take a lot of unnecessary time, regardless of the articles’ importance. Fortunately, news scraping addresses this problem.

In this post, we talk about everything you need to know about news scraping, including its benefits and use cases as well as how you can use Python to create an article scraper.

After you’re done reading, don’t forget to leave a comment below with any questions, suggestions, or impressions you might have.

What is news scraping?

News scraping is a subset of web scraping that mainly targets public online media websites. It refers to automatically extracting news updates and releases from news articles and websites. It also relates to extracting public news data from the news results tab on SERPs or dedicated news aggregator platforms.

By contrast, web scraping (or web data extraction) is the automatic retrieval of data from any website.

From a business point of view, news websites contain plenty of crucial public data, from reviews about newly released products to coverage of a company’s financial results and other vital announcements. These websites also cover several topics and industries, including technology, finance, fashion, science, health, politics, and more.

Benefits of news scraping

The benefits of news scraping include:

Risk identification and mitigation
Source of up-to-date, reliable, and verified information
Improves operations
Improves compliance

Risk identification and mitigation

A recent McKinsey article discussing risk and resilience proposed the use of digital technologies that integrate real-time data from several sources, including weather forecasts, to run scenarios to come up with the most effective solution to a problem. In doing so, the article indirectly recommended using news scraping as a source of real-time public data that can then be used to identify and mitigate risks.

Scraping public news websites increases a company’s ability to anticipate, predict, and observe threats more accurately and quickly.

Source of up-to-date, reliable, and verified information

News websites mainly strive to maintain credibility through their coverage of emerging news. They often have fact-checking departments and libraries against which to verify certain aspects of their updates. In this regard, public news scraping provides companies with access to up-to-date, accurate, and reliable information.

Improves operations

Companies don’t operate in a vacuum, meaning external factors can easily impact them. In this regard, scraping public news websites is a critical tool that ensures they constantly stay updated on emerging trends. It acts as a platform to make informed improvements to operations in a way that leverages favorable trends or counters unfavorable ones.

Improves compliance

News websites cover a wide range of topics, including regulations that have already been passed or those still awaiting enactment. Moreover, in some cases, the author of a news article discusses the implications of such laws for whole industries and interviews experts for a better picture.

Thus, when companies scrape public news articles and gather news regarding proposed or newly enacted regulations, they can better prepare for their implications, thereby improving compliance.

Use cases of news scraping

News scraping provides access to real-time updates on several issues and topics, which can be used in the following ways:

Reputation monitoring
Obtain competitive intelligence
Discover industry trends
Unearth fresh ideas
Content strategy improvement

Reputation monitoring

According to a 2020 Weber Shandwick study, companies with strong reputations enjoy customer loyalty, competitive advantage, better relationships with partners and suppliers, the attraction of high-quality talent, high employee retention, new market opportunities, higher stock price, and more. More specifically, 76% of a company’s market value is attributed to company reputation.

Media coverage may be positive or negative. Although the saying goes that ‘any publicity is good publicity,’ bad publicity can easily damage people’s perception of a company, significantly affecting its reputation. It could tank the market value substantially. Further, with most companies (87%) holding that customers’ perceptions are the most important to their reputation, it’s important to arrest a problem before it develops even further. Online reputation management and review monitoring are considered crucial processes for every company.

News scraping allows companies to monitor every newly published public news article and, therefore, their reputation.

Obtain competitive intelligence

The business world is synonymous with competition, which makes reliable ways of collecting competitive intelligence all the more important.

Multiple news articles cover topics such as product launches, rebranding initiatives, mergers and acquisitions, financial results, and more. Thus, scraping news websites that cover such business-oriented topics offers insights about competitors. It’s a convenient way of obtaining competitive intelligence.

Discover industry trends

Many factors and events can affect a company’s operations. As such, businesses must develop a mechanism that enables them to monitor trends and emerging issues.

Public news articles are a perfect place to start. They contain information that highlights where a particular industry is headed. For instance, articles summarizing market research reports offer insights into the current status of the industry and factors that are likely to promote growth throughout the forecast period. By web scraping all the public news articles containing such information, companies can discover new industry trends that, in turn, enhance competitiveness.

Additionally, by web scraping articles containing news data about their competitors, businesses can easily establish operational similarities, which automatically point to the industry trends.

Unearth fresh ideas

News websites publish insightful articles that contain input from industry experts or that are authored by acclaimed figures in their respective fields. For companies, such posts can be a source of ideas regarding new opportunities. They can also contain pointers on how to leverage such opportunities. Such articles can help businesses augment their ideation process.

Scraping public news websites provides a reliable way to automatically access these vital resources and, therefore, unearth fresh ideas.

Content strategy improvement

News websites aren’t limited only to conventional media outlets but also include newswire sites and public relations (PR) websites that distribute press releases and provide regular article-based coverage of client companies.

In this regard, companies can use news scraping to gain insights into how to improve their communication and content strategy. Simply put, this process highlights the best industry practices and what can make a company’s PR stand out.

How to scrape news data?

When it comes to public news scraping, Python offers one of the easiest ways to get started, especially given that it is an object-oriented language with extensive libraries. Scraping public news data involves two steps: downloading the web page and parsing the HTML.

One of the most popular libraries for downloading web pages is Requests. This library can be installed using the pip command on Windows. On Mac and Linux, we suggest using the pip3 command to ensure that you’re using Python 3. Open the terminal and run the following command:

pip3 install requests

Create a new Python file and enter the following code:

import requests

response = requests.get('https://quotes.toscrape.com')

print(response.status_code)

If you run this code, it will print the HTTP status code. If the web page is successfully downloaded, the status code will be 200. To access the HTML of the web page, access the text attribute of the response object.

print(response.text) # Prints the entire HTML of the webpage.
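The download step can also be wrapped in a small helper that fails loudly instead of silently returning an error page. The following is a minimal sketch, not the post's own code: the function name fetch_html and the 10-second default timeout are assumptions chosen for illustration, while timeout and raise_for_status() are standard parts of the Requests API.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML as a string.

    fetch_html is a hypothetical helper name; the timeout value is an
    arbitrary assumption and should be tuned for the target site.
    """
    # Without a timeout, requests can wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://quotes.toscrape.com') would return the same string as response.text above, or raise an exception if the request fails.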

The HTML returned by response.text is a string. It needs to be parsed into a Python object that can be queried for specific data. Multiple parsing libraries are available for Python. This example uses lxml along with the Beautiful Soup library. Beautiful Soup works as a wrapper over the parser, which makes extracting data from HTML more convenient.

To install these libraries, use the pip command. You should open the terminal and enter the following:

pip3 install lxml beautifulsoup4

In the code file, import Beautiful Soup and create an object as follows:

from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com')

soup = BeautifulSoup(response.text, 'lxml')

In this example, we’re working with a website with quotes. If you’re working with any other site, this method will still work. The only thing that will change is how you locate the element. To locate an HTML element, the find() method can be used. This method takes the tag name and returns the first match.

title = soup.find('title')

The text inside this tag can be extracted using the get_text() method.

print(title.get_text()) # Prints page title.

To fine-tune it further, other attributes such as class, id, etc. can be used as well.

soup.find('small', itemprop="author")

Note that to filter by the class attribute, you should use class_ because class is a reserved keyword in Python.

soup.find('small', class_="author")
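To see how these lookups behave without touching the network, the sketch below runs them against a small inline HTML sample. The sample markup is an assumption that only mimics the structure of a quote block, and it uses Python's built-in "html.parser" instead of lxml so that it runs with Beautiful Soup alone; passing 'lxml' should behave the same way for this input.

```python
from bs4 import BeautifulSoup

# Inline sample that imitates one quote block (assumed structure, not real data).
SAMPLE_HTML = """
<html><head><title>Quotes to Scrape</title></head>
<body>
  <div class="quote">
    <span class="text" itemprop="text">A sample quote.</span>
    <small class="author" itemprop="author">Jane Doe</small>
  </div>
</body></html>
"""

# "html.parser" ships with Python; it is swapped in here for lxml.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Three ways to locate the same element.
by_itemprop = soup.find("small", itemprop="author")
by_class = soup.find("small", class_="author")
by_attrs = soup.find("small", attrs={"class": "author"})  # dict form

# find() returns None when nothing matches, so check before get_text().
missing = soup.find("small", class_="publisher")
author = by_class.get_text(strip=True) if by_class is not None else None
```

Checking for None matters on real news sites, where some articles may lack a byline or other expected element.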

Similarly, to get more than one element, the find_all() method can be used. If these quotes are treated as news headlines, you can get all the headline elements using the following statement:

headlines = soup.find_all(itemprop="text")

You should note that the object headlines is a list of tags. To extract the text from these tags, a for loop can help you:

for headline in headlines:
    print(headline.get_text())
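In practice, it is often handy to collect each headline together with related fields rather than just printing the text. The sketch below is one possible way to do that, under stated assumptions: the inline sample markup, the extract_headlines function name, and the dictionary keys are all hypothetical, and "html.parser" is used in place of lxml so the example has fewer dependencies.

```python
from bs4 import BeautifulSoup

# Assumed markup: each item sits in its own container with itemprop attributes.
SAMPLE_HTML = """
<div class="quote">
  <span itemprop="text">First headline</span>
  <small itemprop="author">Author One</small>
</div>
<div class="quote">
  <span itemprop="text">Second headline</span>
  <small itemprop="author">Author Two</small>
</div>
"""


def extract_headlines(html):
    """Return a list of dicts, one per container that has a headline."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.find_all("div", class_="quote"):
        # Search inside each container so headline and author stay paired.
        text = block.find(itemprop="text")
        author = block.find(itemprop="author")
        if text is None:
            continue  # skip containers without a headline
        items.append({
            "headline": text.get_text(strip=True),
            "author": author.get_text(strip=True) if author is not None else None,
        })
    return items


headlines = extract_headlines(SAMPLE_HTML)
for item in headlines:
    print(item["headline"], "-", item["author"])
```

Searching within each container, instead of running two separate find_all() calls over the whole page, keeps the fields aligned even if one item is missing an author.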

Scraping public news data isn’t very difficult at a small scale. However, when collecting large amounts of public data, you can face issues such as IP blocks or CAPTCHAs. International news websites may also serve different content depending on the visitor’s country. In these cases, you should think about using Residential or Datacenter proxies.
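Requests accepts a proxies argument, so routing traffic through a proxy needs only a small change to the download step. The following is a rough sketch under stated assumptions: the helper names, the host proxy.example.com, the port, and the credentials are all placeholders, and the real values and URL format come from your proxy provider.

```python
import requests


def build_proxies(host, port, username=None, password=None):
    """Build the proxies mapping that requests expects (hypothetical helper)."""
    # Credentials with special characters would need URL-encoding first.
    credentials = f"{username}:{password}@" if username and password else ""
    proxy_url = f"http://{credentials}{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxies, timeout=10.0):
    """Download a page through the given proxies and return its HTML."""
    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.text


# Placeholder endpoint and credentials, not a real proxy service.
proxies = build_proxies("proxy.example.com", 8080, "user", "pass")
```

With a working proxy endpoint, fetch_via_proxy('https://quotes.toscrape.com', proxies) would return the page HTML as seen from the proxy's location.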

Is it legal to scrape news websites?

Web scraping is one of the least time-consuming methods to access large amounts of the latest public news articles and monitor multiple news websites. In fact, with the increased sophistication of article scrapers, it has become increasingly possible to bypass anti-scraping measures that websites put in place to stop web scraping efforts.

The unmatched convenience of news scraping, or web scraping in general, however, doesn’t negate the existence of a few legal questions regarding the practice. So, is it legal to scrape news websites or is web scraping legal?

Well, as our legal team would say, it depends. Web scraping isn’t illegal as such, but much depends on the intention behind the practice. As long as scraping news websites doesn’t violate any laws or infringe any intellectual property rights over the data you intend to scrape or the target source, it should be considered a legal activity. Accordingly, before engaging in any scraping activities, you should get appropriate professional legal advice regarding your specific situation.

Conclusion

Web scraping news websites provides a convenient and fast way to extract real-time, reliable, and accurate data about competitors, the weather, the economic environment, and more. Python is well suited to building tools that scrape news articles, thanks in part to its extensive libraries. And with news scraping being legal and ethical when used appropriately and for the right purpose, companies can use it to monitor their reputation, gather competitive intelligence, unearth fresh ideas, and more.

Click here and check out a repository on GitHub to find the complete code used in this post.

Found the post helpful? Leave your impressions in the comments below and click the like button!
