This post was cross-published with OnePublish.
In this post we are going to build a web application which will compare the similarity between two documents.
Thanks for making the tutorial, coderasha. I am currently following your tutorial, and I think I found a typo in this part:
tf_idf = gensim.models.TfidfModel(corpus)
for doc in tfidf[corpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])
The "mydict" variable is probably a typo, so I changed it to "dictionary" based on the earlier declaration, and the code works.
Please verify this. Thanks!
Oh, I see. Yes, there is a typo in that part. Thank you for your attention :)
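For reference, a corrected version of that snippet, assuming the corpus and dictionary objects defined earlier in the post:

import gensim
import numpy as np

tf_idf = gensim.models.TfidfModel(corpus)
for doc in tf_idf[corpus]:
    # map each token id back to its word via `dictionary` (not `mydict`)
    print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])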
Really great tutorial, thanks!
Two points and a question:
1.
Should the second line of this:
tf_idf = gensim.models.TfidfModel(corpus)
for doc in tfidf[corpus]:
    print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])
read:
for doc in tf_idf[corpus]:
2.
To avoid having percentages over 100 and also to calculate the correct average, I think you have to divide the second total by the number of documents in the query corpus.
i.e.:
total_avg = ((np.sum(avg_sims, dtype=np.float)) / len(file2_docs))
3.
Any thoughts on how you would compare a corpus to itself? I.e. to see how unique each document is within the corpus?
I've tried a variety of corpora with your code and they all end up with the same similarity score, 8% (using the updated % calculation above).
Cheers,
Jamie
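Not the author, but on point 3: one rough way to compare a corpus against itself with the same gensim setup as in the tutorial is to build the Similarity index over the whole corpus and then query each document against that index, ignoring each document's perfect match with itself. A minimal sketch (the three example sentences are just placeholders):

import os
import gensim
import numpy as np
from nltk.tokenize import word_tokenize

# toy corpus; replace with your own documents
docs = ["Mars is the fourth planet in our solar system.",
        "Mars is a cold desert world.",
        "Earth is the third planet from the Sun."]

gen_docs = [[w.lower() for w in word_tokenize(d)] for d in docs]
dictionary = gensim.corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(d) for d in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)

os.makedirs('workdir', exist_ok=True)  # the index prefix directory must exist
sims = gensim.similarities.Similarity('workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))

for i, bow in enumerate(corpus):
    scores = sims[tf_idf[bow]]
    # drop the document's similarity with itself before averaging
    others = [s for j, s in enumerate(scores) if j != i]
    print(i, np.round(np.mean(others), 2))

How unique a document is would then be 1 minus that average (or 1 minus the maximum over the other documents, depending on what you are after).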
Thank you for making the tutorial. I have some questions about the code, as follows.
1) Do you know why, if I switch the query document (demofile2.txt) and demofile.txt, I cannot get the same similarity between the two documents?
2) If the document demofile.txt contains just one sentence, "Mars is the fourth planet in our solar system.", then print(doc) prints nothing. Do you know why? In other words, TF-IDF does not work in your code when the corpus is a single sentence.
import gensim
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize

file_docs = []
with open('~/demofile.txt') as f:
    tokens = sent_tokenize(f.read())
    for line in tokens:
        file_docs.append(line)
print("Number of documents:", len(file_docs))

gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in file_docs]
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary.token2id)

corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)
for doc in tf_idf[corpus]:
    print(doc)
    print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])
Please let me know if you have any comments about it. Thank you.
1) This "similarity" is asymmetric. Look at the definition of TFIDF, it calculates for whatever you consider the corpus, not the query. So when you switch query and corpus, you are changing the weights (IDF) or the "normalization" (I prefer to think of square-root(DF) as the denominator for both texts - the corpus as well as the query). Geometrically: You have two psuedo-vectors V1 and V2. Naively we think of similarity as some equivalent to cosine of the angle between them. But the "angle" is being calculated after a projection of both V1 and V2. Now if this projection is determined symmetrically you'll be fine. But actually the projection of both vectors is based on a component of the first vector. So it is not symmetric under exhange. Concretely, consider two vectors V1 = (3,4,5) and V2 = (3,1,2). Our rule is to calculate the angle after projecting perpendicular to the largest component of the first vector (to down weight or eliminate the most common tokens in the corpus). If V1 is the corpus, you are calculating the angle between V1' = (3,4,0) and V2' = (3,1,0). If V2 is the corpus you are calculating the angle between V1" = (0,4,5) and V2" = (0,1,2).
2) Again, think geometrically in terms of projections. If your corpus has only one document, every token in that document has the maximum DF, so when you project perpendicular to this, you get zero! Algebraically, I suspect that what people call IDF is actually log(IDF). A token that appears in every document in the corpus has a DF of 1, its inverse is 1, and the log of that is ... 0. So if you only have one document, every token satisfies this and you are left with LIDF = 0.
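You can check this directly with gensim: with a one-document corpus every token appears in every document, so its IDF is log(1/1) = 0 and the transformed vector comes out empty:

import gensim
from nltk.tokenize import word_tokenize

docs = ["Mars is the fourth planet in our solar system."]  # single-document corpus
gen_docs = [[w.lower() for w in word_tokenize(d)] for d in docs]
dictionary = gensim.corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(d) for d in gen_docs]

tf_idf = gensim.models.TfidfModel(corpus)
for doc in tf_idf[corpus]:
    print(doc)  # prints [] because every term's IDF, and hence its weight, is zero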
Why log? Probably something based on Zipf's law. But remember, this (LIDF) is not mathematically derived; it is just a heuristic that has become common usage. If you prefer to do geometry with distributions, you should use something like the symmetrized Kullback-Leibler divergence, or even better, the Euclidean metric in logit space.
Hi, this is very helpful! I wonder whether there is any documentation on how the number of documents is determined. I also need to read more about possible solutions for the case where the corpus has only one document.
Thanks for making this tutorial. Exactly what I was looking for!
A very very very helpful blog. Thanks a lot my friend.
I am wondering if it is possible to apply my idea with this approach. I am thinking of creating two folders, (Student_answers) and (Teacher_reference_answers), each containing a number of txt files. For example, the student answers folder has 30 txt files and the teacher reference answers folder has 5. The idea is to compare each student answer document against the 5 teacher answer documents to compute a score automatically (and choose the biggest score for each student)?
I would love to hear from you <3
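Not the author, but a rough sketch of how that could look, reusing the tutorial's approach: index the teacher reference answers, query each student answer against that index, and keep the best score. The folder names below are the ones from your comment; adjust paths as needed.

import glob
import os
import gensim
from nltk.tokenize import word_tokenize

def read_docs(folder):
    # one tokenized document per .txt file in the folder
    docs = {}
    for path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
        with open(path) as f:
            docs[os.path.basename(path)] = [w.lower() for w in word_tokenize(f.read())]
    return docs

teacher = read_docs('Teacher_reference_answers')   # e.g. 5 reference answers
students = read_docs('Student_answers')            # e.g. 30 student answers

dictionary = gensim.corpora.Dictionary(teacher.values())
corpus = [dictionary.doc2bow(doc) for doc in teacher.values()]
tf_idf = gensim.models.TfidfModel(corpus)

os.makedirs('workdir', exist_ok=True)
index = gensim.similarities.Similarity('workdir/', tf_idf[corpus],
                                       num_features=len(dictionary))

for name, doc in students.items():
    # words a student uses that never occur in the reference answers are simply ignored
    scores = index[tf_idf[dictionary.doc2bow(doc)]]
    print(name, '->', round(float(max(scores)), 2))  # best match among the references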
Oh wow, this is EXACTLY the kind of tutorial I've been looking for to dip my toes into NLP! Thank you so much for sharing this!!
Great! :D Glad you liked it!
It gives an error when calling this piece of code:
# perform a similarity query against the corpus
query_doc_tf_idf = tf_idf[query_doc_bow]
print(document_number, document_similarity)
print('Comparing Result:', sims[query_doc_tf_idf])
The error is:
FileNotFoundError: [Errno 2] No such file or directory: 'workdir/.0'
Can anyone help?
# building the index
sims = gensim.similarities.Similarity('workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))
Change 'workdir/' to the directory where your Python script is located.
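Another option, if you want to keep the relative 'workdir/' prefix from the tutorial, is to make sure that directory exists before building the index, since gensim writes its index shards under that prefix. A minimal sketch, reusing the tf_idf, corpus and dictionary objects from the tutorial:

import os
import gensim

os.makedirs('workdir', exist_ok=True)  # create the shard directory if it is missing
sims = gensim.similarities.Similarity('workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))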
Nice! Thank you for sharing, very useful. The power of NLTK!
I am glad you liked it! :)
Absolutely fabulous article!