Seraph★776

Posted on Jul 26, 2022 • Edited on Aug 1, 2022

Generate Word Cloud using Python

Introduction

A word cloud (also called a tag cloud or weighted list) is a visual representation of text data. Each entry is usually a single word, and its importance is shown by font size or color. This article explains how to generate a word cloud using Python.

📚 Python Libraries

wordcloud
collections (Counter)
re
os

Input File

For the input file, you need a file that contains text. You can use a site like Project Gutenberg to find books that are available online. You can generate word clouds from famous books such as Alice in Wonderland by Lewis Carroll or Dracula by Bram Stoker. The possibilities are endless. This project used The Raven by Edgar Allan Poe. Simply copy the contents into a text file and save it in the same directory as the Python script.

Implementation

We need to write functions that open this text file, iterate through the words, remove punctuation, and count the frequency of each word. We must also ignore letter case, skip words that are not made up entirely of letters, and drop common words like "and" or "the". We will follow these five basic steps to implement the word cloud.

Import Necessary libraries
Open Text file
Clean Text file
Generate Word cloud
Save Word cloud

Step 1: Import the necessary libraries

from wordcloud import WordCloud
from collections import Counter
import re
import os

The WordCloud class from the wordcloud library generates the word cloud.
The Counter class from the collections module creates a dictionary-like object that counts the frequency of each word.
The re module is used to remove punctuation from the text.
The os module is used to build file paths.
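To make the role of Counter more concrete, here is a minimal sketch that uses a made-up sample sentence (not the contents of the input file). It only illustrates how Counter tallies words.

```python
from collections import Counter

# Sample text is made up for illustration only.
words = "the raven and the bust and the door".split()

# Counter maps each distinct word to the number of times it appears.
word_counts = Counter(words)

print(word_counts["the"])          # 3
print(word_counts.most_common(2))  # [('the', 3), ('and', 2)]
```

Because Counter is a subclass of dict, it can be passed anywhere a plain word-to-frequency dictionary is expected.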

Step 2: Open the text file

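def get_file(filename):
    # Read the file line by line, lowercasing and stripping each line
    with open(filename, encoding='utf-8') as file_object:
        content = [line.lower().strip() for line in file_object]
    # Join the lines into a single space-separated string
    return ' '.join(content)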

The get_file() function opens the text file using UTF-8 encoding, lowercases and strips each line, and returns the contents of the file as a single string.
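As a quick check of what get_file() returns, the sketch below writes a small two-line sample file to a temporary directory and reads it back. The sample text and the file name sample.txt are assumptions made for the demonstration.

```python
import os
import tempfile


def get_file(filename):
    # Same logic as get_file() above: lowercase and strip each line,
    # then join the lines with single spaces.
    with open(filename, encoding='utf-8') as file_object:
        content = [line.lower().strip() for line in file_object]
    return ' '.join(content)


# Write a two-line sample file to a temporary location (made-up content).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as sample_file:
        sample_file.write('Once upon a midnight dreary,\n  While I pondered\n')
    result = get_file(sample_path)

print(result)  # once upon a midnight dreary, while i pondered
```

Note that punctuation is still present at this stage; it is removed in the next step.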

Step 3: Clean Text file

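def clean_file(data):
    # Remove every character that is not a word character or whitespace
    data = re.sub(r'[^\w\s]', '', data)
    # Common words to exclude from the count
    stopwords = ('a', 'an', 'and', 'as', 'at', 'but', 'by', 'from', 'he', 'him', 'i', 'is', 'my', 'of', 'or',
                 'on', 'said', 'that', 'the', 'there', 'this', 'to', 'with')
    # Count how often each remaining word appears
    return Counter([word for word in data.split() if word not in stopwords])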

The clean_file() function takes one parameter, data, which is the text returned by get_file(). The re module is used to remove punctuation marks from the text. Any stopwords (i.e. commonly used words) are removed as well. The result is returned as a Counter object that holds the frequency of each remaining word.
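The implementation notes mention ignoring words that are not entirely alphabetic, but the pattern [^\w\s] only strips punctuation, so tokens containing digits or underscores are still counted. Below is a minimal sketch of one way to add that filter with str.isalpha(). The function name clean_text_alpha_only and the shortened stopword list are hypothetical and used only for illustration.

```python
import re
from collections import Counter

# Shortened stopword list, for illustration only.
STOPWORDS = ('a', 'and', 'the', 'of')


def clean_text_alpha_only(data):
    """Hypothetical variant of clean_file() that also drops non-alphabetic tokens."""
    # Remove every character that is not a word character or whitespace.
    data = re.sub(r'[^\w\s]', '', data)
    return Counter(
        word for word in data.split()
        # str.isalpha() is False for tokens containing digits or underscores.
        if word.isalpha() and word not in STOPWORDS
    )


counts = clean_text_alpha_only('the raven, the raven! chapter 2 of 18_45')
print(counts)  # Counter({'raven': 2, 'chapter': 1})
```

Whether to drop such tokens depends on the text; for a poem like The Raven the difference is likely small, so this is an optional refinement.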

Step 4: Generate Wordcloud

def generate_wordcloud(data):
    return WordCloud(height=800, width=1200).generate_from_frequencies(data)

The generate_wordcloud() function takes one parameter, data, which is the Counter object. It uses this word-frequency mapping to generate a word cloud image 1200 pixels wide and 800 pixels tall. The result is a WordCloud object that can be saved as an image.
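Since the frequencies determine how prominent each word is, it can help to preview the most common entries before generating the image. The sketch below uses made-up frequencies standing in for the output of clean_file(); the helper name preview_top_words is hypothetical.

```python
from collections import Counter


def preview_top_words(frequencies, limit):
    """Hypothetical helper: return the `limit` most frequent (word, count) pairs."""
    return Counter(frequencies).most_common(limit)


# Made-up frequencies, not real counts from any text.
frequencies = Counter({'door': 5, 'raven': 3, 'night': 1})

top_words = preview_top_words(frequencies, 2)
for word, count in top_words:
    print(word, count)
# door 5
# raven 3
```

If an unwanted word shows up at the top of this list, it is a good candidate for the stopword tuple in clean_file().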

Step 5: Save Wordcloud

def save_wordcloud(data, filename):
    data.to_file(os.path.join(filename))
    print(f'{filename} has been successfully saved.')

Finally, we want to save this word cloud image. The save_wordcloud() function takes two parameters, data and filename. data is the WordCloud object that will be saved, and filename is the name the file will be saved as.
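Note that os.path.join() called with a single argument returns that argument unchanged, so the file is saved relative to the current working directory. If you would rather save into a separate folder, a sketch along these lines could build the path first. The helper name build_output_path and the folder name output are assumptions, not part of the original script.

```python
import os
import tempfile


def build_output_path(directory, filename):
    """Hypothetical helper: create `directory` if needed and return the full path."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)


# Demonstrate inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = build_output_path(os.path.join(tmp_dir, 'output'), 'raven_cloud.jpg')
    directory_exists = os.path.isdir(os.path.dirname(output_path))
    print(os.path.basename(output_path))  # raven_cloud.jpg
```

The returned path could then be passed as the filename argument of save_wordcloud().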

Implementing the Code


def main():
    # Get the path of the text file
    raven_path: str = os.path.join('the_raven.txt')

    # Open this text file:
    raven_file: str = get_file(raven_path)

    # Clean the text file:
    process_file: dict = clean_file(raven_file)

    # Generate wordcloud
    raven_cloud: WordCloud = generate_wordcloud(process_file)

    # Save wordcloud image as 'raven_cloud.jpg'
    save_wordcloud(raven_cloud, 'raven_cloud.jpg')


if __name__ == '__main__':
    main()

The Full Code

import os
import re
from collections import Counter
from wordcloud import WordCloud


def get_file(filename):
    with open(filename, encoding='utf-8') as fo:
        content = [i.lower().strip() for i in fo]
    return ' '.join(content)


def clean_file(data):
    data = re.sub(r'[^\w\s]', '', data)
    stopwords = ('a', 'an', 'and', 'as', 'at', 'but',
                 'by', 'from', 'he', 'him', 'i', 'is',
                 'my', 'of', 'or', 'on', 'said', 'that',
                 'the', 'there', 'this', 'to', 'with')
    return Counter([word for word in data.split() 
                    if word not in stopwords])


def generate_wordcloud(data):
    return WordCloud(height=800, width=1200).generate_from_frequencies(data)


def save_wordcloud(data, filename):
    data.to_file(os.path.join(filename))
    print(f'{filename} has been successfully saved.')


def main():
    raven_path = os.path.join('the_raven.txt')
    raven_file = get_file(raven_path)

    process_file = clean_file(raven_file)
    raven_cloud = generate_wordcloud(process_file)
    save_wordcloud(raven_cloud, 'raven_cloud.jpg')


if __name__ == '__main__':
    main()

Conclusion

After reading this tutorial, you should be able to generate your own word cloud using Python. Use your imagination and have fun! Check out the WordCloud for Python documentation to learn more about what you can do with this library. Please leave a like or comment if you found this article interesting!

Code available at GitHub
