Seraph★776


Generate Word Cloud using Python

Introduction

A word cloud (also called a tag cloud or weighted list) is a visual representation of text data. Tags are usually single words, and the importance of each is shown by font size or color. This article shows how to generate a word cloud using Python.

📚 Python Libraries

  • wordcloud
  • collections (for Counter)
  • re
  • os

Input File

For the input file, you need a file that contains text. You can use a site like Project Gutenberg to find books that are available online. You can generate word clouds from famous books such as Alice in Wonderland by Lewis Carroll or Dracula by Bram Stoker. The possibilities are endless. This project used The Raven by Edgar Allan Poe. Simply copy the contents into a text file and save it in the same directory as the Python script.

Implementation

We need to write functions that open this text file, iterate through the words, remove punctuation, and count the frequency of each word. We must also ignore letter case, skip words that are not made up entirely of letters, and drop common words like "and" or "the". We will follow these 5 basic steps to implement the word cloud.

  1. Import the necessary libraries
  2. Open the text file
  3. Clean the text file
  4. Generate the word cloud
  5. Save the word cloud

Step 1: Import the necessary libraries

from wordcloud import WordCloud
from collections import Counter
import re
import os
  1. The WordCloud class from the wordcloud library generates the word cloud.
  2. The Counter class from the collections module creates a dictionary-like object that counts the frequency of each word.
  3. The re module is used to remove punctuation from the text.
  4. The os module is used to build file paths.
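
To make the roles of re and Counter concrete, here is a small sketch on a made-up sentence (the sample text is only an illustration and is not part of the project):

```python
import re
from collections import Counter

sample = "Nevermore, nevermore! Quoth the raven: nevermore."

# Remove every character that is neither a word character nor whitespace.
no_punctuation = re.sub(r'[^\w\s]', '', sample.lower())

# Counter maps each word to the number of times it appears.
frequencies = Counter(no_punctuation.split())

print(frequencies.most_common(1))  # [('nevermore', 3)]
```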

Step 2: Open the text file

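def get_file(filename):
    # Iterating over a file object yields one line at a time, not one word.
    with open(filename, encoding='utf-8') as file_object:
        content = [line.lower().strip() for line in file_object]
    # Join the cleaned lines into a single string.
    return ' '.join(content)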

The get_file() function opens the text file using UTF-8 encoding, lowercases each line and strips its surrounding whitespace, and returns the contents of the file as a single string.
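
A quick way to see what get_file() returns is to run it on a small temporary file. The sketch below includes its own copy of the function so that it runs on its own; the two sample lines are made up for illustration.

```python
import os
import tempfile


def get_file(filename):
    with open(filename, encoding='utf-8') as file_object:
        content = [line.lower().strip() for line in file_object]
    return ' '.join(content)


# Write two sample lines to a temporary file, then read them back.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('Once upon a midnight dreary,\n  While I pondered, weak and weary,\n')
    tmp_path = tmp.name

text = get_file(tmp_path)
os.remove(tmp_path)  # Clean up the temporary file.

print(text)  # once upon a midnight dreary, while i pondered, weak and weary,
```

Blank lines in the input become empty strings, which leaves extra spaces in the joined text. That is harmless here because the text is later split on whitespace.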

Step 3: Clean the text file

def clean_file(data):
    data = re.sub(r'[^\w\s]', '', data)
    stopwords = ('a', 'an', 'and', 'as', 'at', 'but', 'by', 'from', 'he', 'him', 'i', 'is', 'my', 'of', 'or',
                 'on', 'said', 'that', 'the', 'there', 'this', 'to', 'with')
    return Counter([word for word in data.split() if word not in stopwords])

The clean_file() function takes one parameter, data, which is the text returned by get_file(). The re module removes punctuation marks from the text, and any stopwords (i.e. commonly used words) are filtered out as well. The result is returned as a Counter object, which holds the frequency of each remaining word.
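
Note that \w also matches digits and underscores, so a token such as 1845 survives the regular expression. If you want to keep only purely alphabetic words, one option is to add a str.isalpha() check. The sketch below is a hypothetical variant: clean_file_alpha() and its shortened stopword list are illustrative names, not part of the original script.

```python
import re
from collections import Counter

# Shortened stopword list, for illustration only.
STOPWORDS = {'a', 'an', 'and', 'i', 'of', 'the', 'to'}


def clean_file_alpha(data):
    """Count words that are purely alphabetic and not stopwords.

    Assumes the text has already been lowercased, as get_file() does.
    """
    data = re.sub(r'[^\w\s]', '', data)
    return Counter(word for word in data.split()
                   if word.isalpha() and word not in STOPWORDS)


counts = clean_file_alpha('the raven, published in 1845, quoth nevermore')
print(sorted(counts))  # ['in', 'nevermore', 'published', 'quoth', 'raven']
```

A set is used for the stopwords here because membership tests on a set do not slow down as the list grows.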

Step 4: Generate the word cloud

def generate_wordcloud(data):
    return WordCloud(height=800, width=1200).generate_from_frequencies(data)

The generate_wordcloud() function takes one parameter, data, which is the Counter object. It uses this word-frequency mapping to generate a word cloud image 1200 pixels wide and 800 pixels high. The result is the following word cloud image:

The Raven Word Cloud
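
Because the input is just a mapping of words to counts, you can trim it before generating the image. As a sketch, the hypothetical helper top_words() below keeps only the n most frequent words using Counter.most_common(); the sample counts are made up.

```python
from collections import Counter


def top_words(frequencies, n):
    """Return a dict holding only the n most frequent words."""
    return dict(Counter(frequencies).most_common(n))


sample = Counter({'raven': 10, 'door': 14, 'nevermore': 11, 'lenore': 8})
print(top_words(sample, 2))  # {'door': 14, 'nevermore': 11}
```

The returned dict can be passed to generate_from_frequencies() in place of the full Counter. WordCloud also accepts a max_words argument that caps the number of words drawn.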

Step 5: Save the word cloud

def save_wordcloud(data, filename):
    data.to_file(os.path.join(filename))
    print(f'{filename} has been successfully saved.')

Finally, we want to save this word cloud image. The save_wordcloud() function takes two parameters, data and filename. data is the WordCloud object to be saved, and filename is the name the file will be saved as.
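
Called with a single argument, os.path.join() returns the path unchanged, so the image lands in the current working directory. If you would rather collect images in a separate folder, a sketch like the following builds the path first. build_output_path() and the 'output' directory name are assumptions for illustration.

```python
import os


def build_output_path(directory, filename):
    """Create the directory if needed and return the full file path."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)


path = build_output_path('output', 'raven_cloud.jpg')
print(path)  # output/raven_cloud.jpg on Linux and macOS
```

The resulting path can then be passed as the filename argument of save_wordcloud().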

Implementing the Code


def main():
    # Get the path of the text file
    raven_path: str = os.path.join('the_raven.txt')

    # Open this text file:
    raven_file: str = get_file(raven_path)

    # Clean the text file:
    process_file: Counter = clean_file(raven_file)

    # Generate wordcloud
    raven_cloud: WordCloud = generate_wordcloud(process_file)

    # Save wordcloud image as 'raven_cloud.jpg'
    save_wordcloud(raven_cloud, 'raven_cloud.jpg')


if __name__ == '__main__':
    main()


The Full Code

import os
import re
from collections import Counter
from wordcloud import WordCloud


def get_file(filename):
    with open(filename, encoding='utf-8') as fo:
        content = [i.lower().strip() for i in fo]
    return ' '.join(content)


def clean_file(data):
    data = re.sub(r'[^\w\s]', '', data)
    stopwords = ('a', 'an', 'and', 'as', 'at', 'but',
                 'by', 'from', 'he', 'him', 'i', 'is',
                 'my', 'of', 'or', 'on', 'said', 'that',
                 'the', 'there', 'this', 'to', 'with')
    return Counter([word for word in data.split() 
                    if word not in stopwords])


def generate_wordcloud(data):
    return WordCloud(height=800, width=1200).generate_from_frequencies(data)


def save_wordcloud(data, filename):
    data.to_file(os.path.join(filename))
    print(f'{filename} has been successfully saved.')


def main():
    raven_path = os.path.join('the_raven.txt')
    raven_file = get_file(raven_path)

    process_file = clean_file(raven_file)
    raven_cloud = generate_wordcloud(process_file)
    save_wordcloud(raven_cloud, 'raven_cloud.jpg')


if __name__ == '__main__':
    main()

Conclusion

After reading this tutorial, you should be able to generate your own word cloud using Python. Use your imagination and have fun! Check out the WordCloud for Python documentation to learn more about what you can do with this library. Please leave a like or a comment if you found this article interesting!

