Introduction
A word cloud (also called a tag cloud or weighted list) is a visual representation of text data. Tags are usually single words, and the importance of each is shown with font size or color. This article discusses how to generate a word cloud using Python.
📚 Python Libraries
wordcloud
collections (Counter)
re
os
Input File
For the input file, you need a file that contains text. You can use a site like Project Gutenberg to find books that are available online. You can generate word clouds from famous books such as Alice in Wonderland by Lewis Carroll or Dracula by Bram Stoker. The possibilities are endless. This project used The Raven by Edgar Allan Poe. Simply copy the contents into a text file and save it in the same directory as the Python script.
Implementation
We need to write a function that opens this text file, iterates through the words, removes punctuation, and counts the frequency of each word. We must also ignore letter case, skip words that are not purely alphabetic, and drop common words like "and" or "the". We will follow these five basic steps to implement the word cloud:
- Import Necessary libraries
- Open Text file
- Clean Text file
- Generate Word cloud
- Save Word cloud
Step 1: Import the necessary libraries
from wordcloud import WordCloud
from collections import Counter
import re
import os
- The WordCloud class from the wordcloud library generates the word cloud.
- The Counter class (from the collections module) creates a dictionary that counts the frequency of each word.
- The re module is used to remove punctuation from the text.
- The os module is used for file handling.
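To make the role of Counter concrete, here is a minimal sketch of how it tallies words. The word list is a hypothetical sample chosen for illustration, not data from the article.

```python
from collections import Counter

# Hypothetical sample words, used only for illustration.
words = ['raven', 'nevermore', 'raven', 'door', 'raven', 'door']

# Counter maps each distinct word to the number of times it appears.
frequencies = Counter(words)

print(frequencies['raven'])        # count for a single word
print(frequencies.most_common(2))  # the two most frequent words
```

Because Counter is a dict subclass, it can be passed anywhere a word-to-frequency dictionary is expected.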
Step 2: Open the text file
def get_file(filename):
    with open(filename, encoding='utf-8') as file_object:
        content = [word.lower().strip() for word in file_object]
    return ' '.join(content)
The get_file() function opens the text file using UTF-8 encoding, lowercases each line and strips its surrounding whitespace, and returns the contents of the file as a single string.
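A quick way to check what get_file() returns is to run it on a small throwaway file. The sketch below repeats the function so it is self-contained; the sample file name and the temporary directory are assumptions made for the example.

```python
import os
import tempfile


def get_file(filename):
    # Same logic as the article's get_file(): lowercase and strip each line,
    # then join the lines with single spaces.
    with open(filename, encoding='utf-8') as file_object:
        content = [line.lower().strip() for line in file_object]
    return ' '.join(content)


# Write a tiny hypothetical sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as sample:
        sample.write('Once upon a midnight dreary,\n'
                     '  While I pondered, weak and weary,\n')
    text = get_file(sample_path)

print(text)
```

Note that punctuation is still present at this stage; it is removed in the cleaning step.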
Step 3: Clean Text file
def clean_file(data):
    data = re.sub(r'[^\w\s]', '', data)
    stopwords = ('a', 'an', 'and', 'as', 'at', 'but', 'by', 'from', 'he', 'him', 'i', 'is', 'my', 'of', 'or',
                 'on', 'said', 'that', 'the', 'there', 'this', 'to', 'with')
    return Counter([word for word in data.split() if word not in stopwords])
The clean_file() function takes one parameter, data, which is the text returned by get_file(). The re module is used to remove punctuation marks from the text. Additionally, any stopwords (i.e. commonly used words) are removed. The result is returned as a Counter object, which counts the frequency of the remaining words in the file.
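The implementation notes mention ignoring words that are not purely alphabetic, while clean_file() as shown only strips punctuation and stopwords. One possible way to add that filter is str.isalpha(). The sketch below is a variation, not the article's function: the name clean_text, the shortened stopword set, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Shortened, hypothetical stopword set for the example.
STOPWORDS = {'a', 'and', 'i', 'the', 'upon'}


def clean_text(data):
    # Remove every character that is neither a word character nor whitespace.
    data = re.sub(r'[^\w\s]', '', data)
    # Keep only purely alphabetic words that are not stopwords.
    return Counter(word for word in data.split()
                   if word.isalpha() and word not in STOPWORDS)


sample = 'once upon a midnight dreary, the raven, the raven 1845!'
counts = clean_text(sample)
print(counts.most_common(3))
```

The isalpha() check is needed because `\w` also matches digits and underscores, so a token such as "1845" survives the regular expression.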
Step 4: Generate Wordcloud
def generate_wordcloud(data):
    return WordCloud(height=800, width=1200).generate_from_frequencies(data)
The generate_wordcloud() function takes one parameter, data, which is the Counter object. It uses this word-frequency mapping to generate a word cloud image 1200x800 pixels in size (width by height). The result is the following Wordcloud image:
Step 5: Save Wordcloud
def save_wordcloud(data, filename):
    data.to_file(os.path.join(filename))
    print(f'{filename} has been successfully saved.')
Finally, we want to save this word cloud image. The save_wordcloud() function takes two parameters, data and filename: data is the WordCloud object that will be saved, and filename is the name the file will be saved as.
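With a single argument, os.path.join() returns the path unchanged, so it becomes useful mainly when the image should go into a separate folder. The sketch below shows one way that could look; build_output_path and the 'output' folder name are hypothetical and not part of the article's code.

```python
import os
import tempfile


def build_output_path(directory, filename):
    # Hypothetical helper: create the output directory if it is missing
    # and return the full path for the image file.
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)


# A temporary base directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as base_dir:
    output_path = build_output_path(os.path.join(base_dir, 'output'),
                                    'raven_cloud.jpg')
    directory_exists = os.path.isdir(os.path.dirname(output_path))

print(os.path.basename(output_path))
```

The returned path could then be passed as the filename argument of save_wordcloud().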
Implementing the Code
def main():
    # Get the path of the text file
    raven_path: str = os.path.join('the_raven.txt')
    # Open this text file:
    raven_file: str = get_file(raven_path)
    # Clean the text file:
    process_file: dict = clean_file(raven_file)
    # Generate wordcloud
    raven_cloud: WordCloud = generate_wordcloud(process_file)
    # Save wordcloud image as 'raven_cloud.jpg'
    save_wordcloud(raven_cloud, 'raven_cloud.jpg')


if __name__ == '__main__':
    main()
The Full Code
import os
import re
from collections import Counter
from wordcloud import WordCloud


def get_file(filename):
    with open(filename, encoding='utf-8') as fo:
        content = [i.lower().strip() for i in fo]
    return ' '.join(content)


def clean_file(data):
    data = re.sub(r'[^\w\s]', '', data)
    stopwords = ('a', 'an', 'and', 'as', 'at', 'but',
                 'by', 'from', 'he', 'him', 'i', 'is',
                 'my', 'of', 'or', 'on', 'said', 'that',
                 'the', 'there', 'this', 'to', 'with')
    return Counter([word for word in data.split()
                    if word not in stopwords])


def generate_wordcloud(data):
    return WordCloud(height=800, width=1200).generate_from_frequencies(data)


def save_wordcloud(data, filename):
    data.to_file(os.path.join(filename))
    print(f'{filename} has been successfully saved.')


def main():
    raven_path = os.path.join('the_raven.txt')
    raven_file = get_file(raven_path)
    process_file = clean_file(raven_file)
    raven_cloud = generate_wordcloud(process_file)
    save_wordcloud(raven_cloud, 'raven_cloud.jpg')


if __name__ == '__main__':
    main()
Conclusion
After reading this tutorial, you should be able to generate your own word cloud using Python. Use your imagination and have fun! Check out the WordCloud for Python documentation to learn more about what you can do with this library. Please leave a like or comment if you found this article interesting!
- Code available at GitHub