If you're interested in Data Analytics, you will find learning about Natural Language Processing very useful. A good project to start learning ab...
Excellent post, you are absolutely amazing ❤️
I have one question though: when adding up the sentence values, why would you want the key in the sentenceValue dictionary to be only the first 12 characters of the sentence? It might cause trouble if the sentence is shorter than 12 characters or if two different sentences start with the exact same 12 characters.
I assume you did it as a way to reduce overhead, but to be honest, performance-wise I don't think the difference would be that significant. I would much rather drop the
[:12]
slice and use the whole sentence as the key, at the cost of a tiny performance increase.
I would love to hear your opinion on this matter.
If anyone got any errors running the code, copy paste my version.
That said, it does not work properly; it has some flaws. I tried to summarize this article as a test. Here is the result (the threshold is 1.5 * average):
"For example, the Center for a New American Dream envisions "... a focus on more of what really matters, such as creating a meaningful life, contributing to community and society, valuing nature, and spending time with family and friends."
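For readers unfamiliar with the threshold mentioned above, here is a minimal sketch of how a cutoff of 1.5 times the average sentence score could select sentences. The function name `select_sentences`, the `factor` parameter, and the sample scores are assumptions for illustration, not the article's exact code.

```python
def select_sentences(sentence_value, factor=1.5):
    """Keep the sentences whose score exceeds factor * average score."""
    if not sentence_value:
        return []
    average = sum(sentence_value.values()) / len(sentence_value)
    threshold = factor * average
    # Dicts preserve insertion order (Python 3.7+), so the summary
    # keeps the sentences in their original order.
    return [s for s, score in sentence_value.items() if score > threshold]


# Hypothetical scores, only to show the effect of the threshold.
scores = {"First sentence.": 10, "Second sentence.": 2, "Third sentence.": 3}
summary = select_sentences(scores)
```

Under this sketch, a text with a single sentence always produces an empty summary, because one score can never exceed 1.5 times its own average. A lower `factor` gives longer summaries.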
Thank you very much, Sebastian!
I agree with you -- using the whole sentence as the dictionary key makes the program more reliable than using the first 12 characters. My decision was mainly about overhead, but as you said, it is almost negligible. One bug that I would look out for is the use of special characters in the text, mainly quotes and braces, but this is easily fixable (I believe using triple quotes, as you are currently doing, will avoid it).
I summarized the same article and got the following summary:
Feel free to use my version for comparison!
The shortness of your summary may be a result of the way you are using the Stemmer; I would suggest testing the same article without it to verify this. Besides that, your code is looking on point -- clean and concise. If you are looking for ways to improve your results, I would suggest you explore the following ideas:
Thanks for the suggestion!
Cool website you got yourself there!
I got a question I forgot to ask. Why do you turn the 'stopwords' list into a
set()
? First I thought it was because you probably intended to remove duplicate items from the list, but then it struck me: why would there be duplicate items in a corpus list containing stop words? When I compared the length of the list before and after turning it into a set, there was no difference: len(stopwords.words("english")) == len(set(stopwords.words("english")))
Outputs: True
Tracing the variable throughout the script, I must admit I cannot figure out why you turned it into a set. I assume it is a mistake?
Or do you have any specific reason for it?
Hmm, I believe the first time I used the list of stop words from NLTK there were some duplicates, if not I am curious too, lol. It may be time to change it to a list.
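Apart from duplicates, one common reason to keep the set is lookup speed: a membership test on a set takes constant time on average, while a list is scanned element by element. A minimal sketch, using a small hypothetical stop word list in place of NLTK's so that it runs without any downloads:

```python
# Stand-in for stopwords.words("english"); the real list is much longer.
stop_words_list = ["the", "a", "an", "is", "of", "and"]
stop_words = set(stop_words_list)


def remove_stop_words(words, stop_words):
    """Drop every word that appears in stop_words (case-insensitive)."""
    return [w for w in words if w.lower() not in stop_words]


tokens = "The cat is on the mat".split()
filtered = remove_stop_words(tokens, stop_words)
```

The function behaves the same with a list or a set; only the cost of each `in` check differs, which matters more as the text and the stop word list grow.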
Thanks for the note!
If you ever try your implementation using TFIDF, let me know how it goes.
Excellent post!
@davidisrawi can you please help me with text extraction (3-4 Keywords) from each paragraph from an article.
I went through your article and got stuck with the error "string index out of range".
sentenceValue[sentence[:12]] += wordValue[1]
IndexError: string index out of range
I have tried changing [sentence[:12]] to 7,8,9 but unable to resolve my error.
Please help regarding this
Thank you very much Dhairya!
This bug could happen whenever your list of sentences contains one of length < 12. A good workaround is to remove the
[:12]
index completely and use the whole sentence as the sentenceValue
keys. Does that make sense? So instead it would be:
Let me know if that fixes the problem!
I have changed it, but it still gives me a KeyError and shows the first 3-4 lines of my text.
How can I solve this error?
Hm sounds like you may have forgotten to remove the [:12] index from the other parts of your code where you use
sentenceValue
, maybe that is the issue? If not, feel free to share a snippet of your code so we can be on the same page.
Hi, I am still facing the same index out of range error. Please help me with this.
The code should be like this..
for sentence in sentences:
    for wordValue in freqTable:
        if wordValue in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freqTable[wordValue]
            else:
                sentenceValue[sentence] = freqTable[wordValue]
sentenceValue[sentence[:12]] += wordValue[1]
string index out of range
Hey! You may have a sentence that is shorter than 12 characters. In this case, you can change the slice to sentence[:10], or a lower number depending on your shortest sentence.
Lowering the number of characters used to hash the sentence value can bring some issues: two sentences with the same 7, 8 or 9 starting characters will then store and retrieve their value from the same key in the dictionary. That's why it's important to keep the prefix length used for hashing as high as you can.
Any number I use gives the same error:
sentenceValue[sentence[:2]] += wordValue[1]
IndexError: string index out of range
Interesting. For debugging it, I would print all your sentences and find if there's an empty one (or a very short one), I think that may be the issue.
Let me know if that works or if you found the issue
I am also facing the string index out of range problem. What is the issue?
You may have a string that is shorter than your slice length, sentence[:2]. I would recommend printing the strings to see if this is the case.
I have solved it. In my case it was actually a punctuation problem: I just had to handle the dot (.) character when storing words as values.
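A fix along those lines might look like the sketch below. It rests on one plausible reading of the traceback, which is an assumption: if the frequency table is a plain dict, iterating over it yields string keys, and indexing a one-character key such as "." with [1] raises IndexError, whereas a slice like sentence[:12] never does. The function name `build_freq_table` and the regex tokenizer are illustrative, not the article's code.

```python
import re
from collections import Counter


def build_freq_table(text, stop_words=frozenset()):
    """Count words, ignoring punctuation and the given stop words."""
    # \w+ matches runs of letters, digits and underscores, so a bare
    # "." or "!" never becomes a key in the table.
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in stop_words)


freq_table = build_freq_table("The dogs bark. The dogs run!", stop_words={"the"})
```

Because `Counter` is a dict subclass, the resulting table can be used wherever the thread's code expects `freqTable`.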
Could you please help me? I'm facing the same problem here and I can't handle it. Thank you.
Is this your error?
IndexError: string index out of range
If so, potential solutions could be:
If that doesn't solve it, let me know!
It is still giving me an error when the text is more than 12 characters long and the sentence is (when printed through the loop) "You notice a wall of text in twitch chat and your hand instinctively goes to the mouse.", which is the first line in the paragraph. I found that even when you take out the range, the same error occurs.
The bug may be in how you are storing your sentences, make sure you print out the sentences as you store them instead of when you retrieve them, hopefully, that'll help you find the issue. If not, let me know if I can help!
Thanks @david Israwi for this simple and interesting text summarizer program.
I have read and analyzed your code.
The most common error I found is
index out of range
and many people seem to run into it.
The one thing I am confused about in this part of the code:
why and how is only 1.5 times the average used?
What about a large one-line text? It does not get summarized.
For example:
I am using Python 3 and I resolved the
index out of range
error as:
Thanks a lot! This post is really helpful! If you have other resources, including on making chatbots, they could be really helpful to me.
I am a little bit interested in how to implement the text summarizer using a machine learning model. I am looking for this too...
You can directly send information at
sushant1234gautam@gmail.com
Great post David,
I have been trying to wrap my head around machine learning and NLP for a few months now. Developing intuition has been a slow process. Articles like yours are a source of "aha moments". I am trying to build a blog post summary app. Being a newbie, I am using an API (AYLIEN) and following this summary generator tutorial. Having something working gives me motivation to read in-depth articles.
Thanks for your comments Vikram, best of luck with the summary app!
This isn't working right for me and I think it comes down to wordValue[0] not working for me the way you said. Do you know why that could be?
Like if I do:
for wordValue in freqTable:
    print(wordValue[0])
I only get the first letters:
q
b
f
j
m
.
s
b
s
l
It seems like your bug comes from separating the paragraphs into letters instead of words.
The program should do the following commands in the respective order:
I can't tell which step the bug is in, but it seems as if you are finding the frequency of each letter instead of each word. Make sure you keep track of your arrays by printing them throughout your code. It seems like you're almost there.
I too am confused about this. According to my understanding, when we use the in operator in a dictionary, it only iterates through the keys. Therefore it would make sense that the program is printing only the first letter.
I guess to get the key-value pairs, we need to use the items() function as:
for wordValue in freqTable.items():
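To illustrate the point above: iterating over a dict directly yields only its keys, so indexing a key with [0] returns its first character, while items() yields (word, count) pairs. The sample `freq_table` and the `score_sentences` helper are illustrative assumptions; the scoring uses the whole sentence as the key, as suggested earlier in the thread.

```python
freq_table = {"quick": 2, "brown": 1, "fox": 3}

# Plain iteration gives the keys, so key[0] is just the first letter.
first_chars = [key[0] for key in freq_table]

# items() gives (word, count) tuples instead.
pairs = list(freq_table.items())


def score_sentences(sentences, freq_table):
    """Sum the counts of the table words found in each sentence."""
    sentence_value = {}
    for sentence in sentences:
        lowered = sentence.lower()
        for word, count in freq_table.items():
            if word in lowered:  # substring check, as in the thread's code
                sentence_value[sentence] = sentence_value.get(sentence, 0) + count
    return sentence_value


scores = score_sentences(["The quick fox.", "A brown dog."], freq_table)
```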
Hello sir, could you suggest a way to make the summarizer more efficient? Sometimes a few sentences with lower sentence values can be very important for the summary. In that case, if we leave those out, the summary may not make sense.
That's a good point. I think what you might be referring to is some kind of an adjacency value - this sentence might be worth more than we think because it's next to this really important sentence.
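One way to read the adjacency idea is to add a fraction of each neighbor's score to a sentence's own score. This is only a sketch of that interpretation; the function name `boost_by_neighbors` and the weight of 0.25 are arbitrary assumptions.

```python
def boost_by_neighbors(scores, weight=0.25):
    """Add a share of the neighboring scores to each sentence score.

    scores: list of sentence scores in document order.
    """
    boosted = []
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else 0
        right = scores[i + 1] if i + 1 < len(scores) else 0
        boosted.append(score + weight * (left + right))
    return boosted


# The low-scoring sentences next to the strong one are lifted the most.
boosted = boost_by_neighbors([2, 10, 4])
```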
Another aspect you could change in the scoring algorithm is the use of TF-IDF. Let me know if you end up using it; I would like to see what that would look like.
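As a rough idea of what TF-IDF scoring could look like here, the sketch below treats each sentence as its own document. That choice, the regex tokenizer, and the function name `tfidf_sentence_scores` are all assumptions; this is not the article's method.

```python
import math
import re
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of tf * idf over its words."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences each word appears.
    df = Counter(word for words in tokenized for word in set(words))
    scores = {}
    for sentence, words in zip(sentences, tokenized):
        tf = Counter(words)
        scores[sentence] = sum(
            (count / len(words)) * math.log(n / df[word])
            for word, count in tf.items()
        )
    return scores


scores = tfidf_sentence_scores(
    ["the cat sat", "the dog sat", "the bird flew away"]
)
```

A word that appears in every sentence gets an idf of log(1) = 0, so it no longer contributes to any score, which is the main difference from plain frequency counting.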
In Python, a string is a sequence of characters. A "string index out of range" error means that the index you are trying to access does not exist: you are asking for a character at a position that is not inside the string. Indexes in Python start at 0, so the maximum valid index for any non-empty string is always length - 1. There are several ways to account for this. Knowing the length of your string (using the len() function) can help you avoid going past the last index.
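The difference between indexing and slicing is worth seeing side by side: an index past the end raises IndexError, while a slice past the end simply returns what is there. The helper `char_at` below is a hypothetical name used only for this sketch.

```python
def char_at(text, index, default=None):
    """Return text[index] if that position exists, else default."""
    if -len(text) <= index < len(text):
        return text[index]
    return default


word = "."
max_index = len(word) - 1   # 0 for a one-character string
prefix = "Hi."[:12]         # slicing past the end is allowed

try:
    word[1]                 # index 1 does not exist in a 1-char string
    index_failed = False
except IndexError:
    index_failed = True
```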
Hi..
I have a problem with that :
sumValues += sentenceValue[sentence]
TypeError: unsupported operand type(s) for +=: 'int' and 'str'
Hi Viqi. It seems like you are storing a string in your sentenceValue dictionary instead of a number; the values are supposed to be ints. Fixing that may solve the problem!
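The error message says that the running total is an int while the value read from the dictionary is a str. A small check like the one below can help locate the offending entry; `total_score` is a hypothetical helper, not part of the article.

```python
def total_score(sentence_value):
    """Sum all sentence scores, failing loudly on non-numeric values."""
    total = 0
    for sentence, value in sentence_value.items():
        if not isinstance(value, (int, float)):
            raise TypeError(
                f"score for {sentence!r} is {type(value).__name__}, "
                "expected a number"
            )
        total += value
    return total


sum_values = total_score({"First sentence.": 3, "Second sentence.": 5})
```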
Where do we have to import the text file to run this?
That would be up to you!
In my implementation, I put everything in one method, so I can just run it from the command line, passing the actual string of text. Having said that, it totally depends on your use or implementation; in some cases, it might be worth receiving the text file instead.
Here is my imp if you want to take a look at it: github.com/DavidIsrawi/SummarizeMe...
I'm getting a "syntax error" on any text that I try to pass through the program. How would I go about running the text I want to summarize through this program?
I would try to have the text be converted to UTF-8 before sending it through, maybe there are special characters or accents throwing it off
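If the syntax error comes from pasting the text directly into the source file, where quotes or special characters can end the string literal early, one option is to keep the text in a separate file and read it at run time. A minimal sketch; the function name `read_text` is an assumption:

```python
def read_text(path):
    """Read a whole text file as UTF-8."""
    # errors="replace" substitutes undecodable bytes instead of raising,
    # which keeps the summarizer running on imperfectly encoded input.
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()
```

The returned string can then be passed to the summarizer as-is, with no need to escape quotes by hand.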