
Jordan Kalebu


Using pysimilar to compute the similarity between texts

Hi guys,
I recently wrote an article titled How to detect plagiarism in text using python, in which I show how you can detect plagiarism between documents manually using cosine similarity.

I republished that article on multiple platforms, including here on dev.to and Hackernoon. It is one of my most viewed articles, and its GitHub repository is the most starred among my article repositories.

That made me want to refactor the code and the article to make them easier and friendlier to get started with, even for absolute beginners. The result is pysimilar, a Python library that I believe simplifies the task as much as possible.

Getting started with Pysimilar

To get started with pysimilar for comparing text documents, you first need to install it, either with pip or directly from GitHub.

Here is how to install pysimilar using pip

$ pip install pysimilar

Here is how to install directly from GitHub

$ git clone https://github.com/Kalebu/pysimilar
$ cd pysimilar
$ python setup.py install

With pysimilar you can compare text documents either as strings or by specifying the paths to the files that contain the text.

Comparing strings directly

You can compare strings with pysimilar using the compare() function, as illustrated below;

>>> from pysimilar import compare
>>> compare('very light indeed', 'how fast is light')
0.17077611319011649

Comparing strings contained in files

To compare strings contained in files, you just need to set the isfile parameter to True, as illustrated below;

>>> compare('README.md', 'LICENSE', isfile=True)
0.25545580376557886

You can also compare documents with a particular extension in a given directory. For instance, to compare all the documents ending in .txt in the documents directory, here is what I would do;

The documents directory used by the example below looks like this

documents/
├── anomalie.zeta
├── hello.txt
├── hi.txt
└── welcome.txt

Here is how to compare files with a particular extension

>>> import pysimilar
>>> from pprint import pprint
>>> pysimilar.extensions = '.txt'
>>> comparison_result = pysimilar.compare_documents('documents')
>>> pprint(comparison_result)
[['welcome.txt vs hi.txt', 0.6053485081062917],
 ['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0]]
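
To get a feel for which file pairs end up in a result like this, the pairing step can be sketched with the standard library alone. This is only a rough illustration: list_document_pairs is a hypothetical helper of my own naming, not part of pysimilar, and it assumes that every unordered pair of matching files is compared once. The library itself may select or order pairs differently.

```python
from itertools import combinations
from pathlib import Path


def list_document_pairs(directory, extensions=(".txt",)):
    """Return every unordered pair of file names in `directory`
    whose suffix is one of `extensions` (hypothetical helper)."""
    wanted = {ext.lower() for ext in extensions}
    # Keep only regular files with a matching suffix, sorted for stable output.
    names = sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )
    # combinations() yields each pair exactly once, never (a, a) or (b, a).
    return list(combinations(names, 2))
```

Called as list_document_pairs('documents') on the directory shown above, it returns the three .txt pairs and skips anomalie.zeta.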

You can also sort the comparison results by their score by changing the ascending parameter, as shown below;

>>> comparison_result = pysimilar.compare_documents('documents', ascending=True)
>>> pprint(comparison_result)
[['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0],
 ['welcome.txt vs hi.txt', 0.6053485081062917]]

You can also set pysimilar to include files with multiple extensions

>>> import pysimilar
>>> from pprint import pprint
>>> pysimilar.extensions = ['.txt', '.zeta']
>>> comparison_result = pysimilar.compare_documents('documents', ascending=True)
>>> pprint(comparison_result)
[['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0],
 ['anomalie.zeta vs hi.txt', 0.4968161174826459],
 ['welcome.txt vs hi.txt', 0.6292275146695526],
 ['welcome.txt vs anomalie.zeta', 0.7895651507603823]]
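
Since the result is a plain list of [label, score] entries, you can post-process it with ordinary Python. Here is a small sketch that keeps only the pairs above a cut-off. The flag_similar_pairs helper and the 0.5 threshold are my own assumptions for illustration, not something pysimilar provides or recommends.

```python
def flag_similar_pairs(results, threshold=0.5):
    """Keep [label, score] entries scoring at least `threshold`,
    highest score first (hypothetical helper, arbitrary threshold)."""
    flagged = [entry for entry in results if entry[1] >= threshold]
    return sorted(flagged, key=lambda entry: entry[1], reverse=True)


# The scores below are copied from the output shown above.
comparison_result = [
    ['welcome.txt vs hello.txt', 0.0],
    ['hi.txt vs hello.txt', 0.0],
    ['anomalie.zeta vs hi.txt', 0.4968161174826459],
    ['welcome.txt vs hi.txt', 0.6292275146695526],
    ['welcome.txt vs anomalie.zeta', 0.7895651507603823],
]

for label, score in flag_similar_pairs(comparison_result):
    print(f"{label}: {score:.2f}")
# welcome.txt vs anomalie.zeta: 0.79
# welcome.txt vs hi.txt: 0.63
```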


Well, that's all for this article. I'm excited to see what you will build with it.

Kalebu / pysimilar

A python library for computing the similarity between two strings (text) based on cosine similarity




How does it work?

It uses TfidfVectorizer to transform the texts into vectors, converts the resulting vectors into arrays of numbers, and finally computes the cosine similarity between them, producing a score that indicates how similar the texts are.
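
To make those steps concrete, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity. It is not pysimilar's actual code: the function names are my own, the tokenizer is a plain lowercase whitespace split (an assumption, simpler than scikit-learn's), and the IDF term uses the smoothed formula ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as the default for TfidfVectorizer.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build one {term: weight} dict per text using tf * smoothed idf."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with anything
    return dot / (norm_a * norm_b)


def similarity(text_a, text_b):
    """Score two texts between 0.0 (no shared terms) and 1.0 (same weights)."""
    vec_a, vec_b = tfidf_vectors([text_a, text_b])
    return cosine_similarity(vec_a, vec_b)


print(round(similarity('very light indeed', 'how fast is light'), 4))
# 0.1708
```

For this pair of strings the sketch lands on roughly the same score as the compare() example earlier in the article, which suggests the two follow the same general recipe, although the tokenization will differ on punctuation and one-letter words.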

Installation

You can either install it directly from GitHub or use pip. Here is how to install it directly from GitHub;

$ git clone https://github.com/Kalebu/pysimilar
$ cd pysimilar
$ python setup.py install

Installation with pip

$ pip install pysimilar

Example of usage

Pysimilar allows you to either specify the strings you want to compare directly or specify the paths to the files containing the strings you want to compare.

Here is an example of how to compare strings directly;

>>> from pysimilar import compare
>>> compare

Top comments (2)

Vincent Prytherch

This looks neat. I might have a project where I could use it. 👍

Jordan Kalebu

Thanks for finding it useful