Building Blocks - Text Pre-Processing in NLP using Python


Text Normalization

Text normalization covers the routine clean-up applied to raw text before modelling. This post walks through three of its building blocks using NLTK: tokenization, stop word removal, and morphological normalization (stemming and lemmatization).


Tokenization

from nltk.tokenize import sent_tokenize, word_tokenize

# Sentence Tokenization
print ('Following is the list of sentences tokenized from the sample review\n')
sample_text = """The first time I ate here I honestly was not that impressed. I decided to wait a bit and give it another chance.
I have recently eaten there a couple of times and although I am not convinced that the pricing is particularly on point the two mushroom and
swiss burgers I had were honestly very good. The shakes were also tasty. Although Mad Mikes is still my favorite burger around,
you can do a heck of a lot worse than Smashburger if you get a craving"""
tokenize_sentence = sent_tokenize(sample_text)
print (tokenize_sentence)
print ('---------------------------------------------------------\n')
print ('Following is the list of words tokenized from the sample review sentence\n')
tokenize_words = word_tokenize(tokenize_sentence[1])
print (tokenize_words)
Following is the list of sentences tokenized from the sample review

['The first time I ate here I honestly was not that impressed.', 'I decided to wait a bit and give it another chance.', 'I have recently eaten there a couple of times and although I am not convinced that the pricing is particularly on point the two mushroom and \nswiss burgers I had were honestly very good.', 'The shakes were also tasty.', 'Although Mad Mikes is still my favorite burger around, \nyou can do a heck of a lot worse than Smashburger if you get a craving']
---------------------------------------------------------

Following is the list of words tokenized from the sample review sentence

['I', 'decided', 'to', 'wait', 'a', 'bit', 'and', 'give', 'it', 'another', 'chance', '.']
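
Both tokenizers rely on NLTK's pre-trained Punkt models, so a fresh installation raises a LookupError on the calls above. A minimal one-time setup (resource names can vary slightly across NLTK versions) that also fetches the corpora used later in this post:

import nltk
nltk.download('punkt')      # sentence/word tokenizer models
nltk.download('stopwords')  # stop word lists used in the next section
nltk.download('wordnet')    # dictionary backing the WordNet lemmatizer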

Stop Words Removal

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# load the stop word list for the chosen language
stop_words = set(stopwords.words("english"))
print ("{0} stop words".format(len(stop_words)))
tokenize_words = word_tokenize(sample_text)
filtered_sample_text = [w for w in tokenize_words if w not in stop_words]
print ('\nOriginal Text:')
print ('------------------\n')
print (sample_text)
print ('\n Filtered Text:')
print ('------------------\n')
print (' '.join(str(token) for token in filtered_sample_text))
179 stop words

Original Text:
------------------

The first time I ate here I honestly was not that impressed. I decided to wait a bit and give it another chance.
I have recently eaten there a couple of times and although I am not convinced that the pricing is particularly on point the two mushroom and
swiss burgers I had were honestly very good. The shakes were also tasty. Although Mad Mikes is still my favorite burger around,
you can do a heck of a lot worse than Smashburger if you get a craving

Filtered Text:
------------------

The first time I ate I honestly impressed . I decided wait bit give another chance . I recently eaten couple times although I convinced pricing particularly point two mushroom swiss burgers I honestly good . The shakes also tasty . Although Mad Mikes still favorite burger around , heck lot worse Smashburger get craving
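
Notice that capitalized tokens such as "The" and "I" slip through, because NLTK's English stop word list is all lowercase. A small sketch of a case-insensitive variant, reusing sample_text and stop_words from above (the variable name filtered_lowercase is just for illustration):

filtered_lowercase = [w for w in word_tokenize(sample_text) if w.lower() not in stop_words]
print (' '.join(str(token) for token in filtered_lowercase))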

Morphological Normalization

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
tokenize_words = word_tokenize(sample_text)

stemmed_sample_text = []
for token in tokenize_words:
    stemmed_sample_text.append(ps.stem(token))

lemma_sample_text = []
for token in tokenize_words:
    lemma_sample_text.append(lemmatizer.lemmatize(token))

print ('\nOriginal Text:')
print ('------------------\n')
print (sample_text)
print ('\nFiltered Text: Stemming')
print ('------------------\n')
print (' '.join(str(token) for token in stemmed_sample_text))
print ('\nFiltered Text: Lemmatization')
print ('--------------------------------\n')
print (' '.join(str(token) for token in lemma_sample_text))
Original Text:
------------------

The first time I ate here I honestly was not that impressed. I decided to wait a bit and give it another chance. I have recently eaten there a couple of times and although I am not convinced that the pricing is particularly on point the two mushroom and swiss burgers I had were honestly very good. The shakes were also tasty. Although Mad Mikes is still my favorite burger around, you can do a heck of a lot worse than Smashburger if you get a craving.


Filtered Text: Stemming
------------------

the first time I ate here I honestli wa not that impress . I decid to wait a bit and give it anoth chanc . I have recent eaten there a coupl of time and although I am not convinc that the price is particularli on point the two mushroom and swiss burger I had were honestli veri good . the shake were also tasti . although mad mike is still my favorit burger around , you can do a heck of a lot wors than smashburg if you get a crave .
Filtered Text: Lemmatization
--------------------------------

The first time I ate here I honestly wa not that impressed . I decided to wait a bit and give it another chance . I have recently eaten there a couple of time and although I am not convinced that the pricing is particularly on point the two mushroom and swiss burger I had were honestly very good . The shake were also tasty . Although Mad Mikes is still my favorite burger around , you can do a heck of a lot worse than Smashburger if you get a craving .
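
The lemmatizer turns "was" into "wa" above because WordNetLemmatizer assumes every token is a noun unless told otherwise. Supplying an explicit part-of-speech tag gives much better verb handling; a quick sketch using the lemmatizer defined earlier:

# passing pos='v' tells the lemmatizer to treat the token as a verb
print (lemmatizer.lemmatize('was', pos='v'))    # be
print (lemmatizer.lemmatize('ate', pos='v'))    # eat
print (lemmatizer.lemmatize('eaten', pos='v'))  # eat

Stemming, by contrast, can be far more aggressive, as the next example shows.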
from nltk.stem import PorterStemmer
words = ["operate", "operating", "operates", "operation", "operative", "operatives", "operational"]
ps = PorterStemmer()
for token in words:
    print (ps.stem(token))
oper
oper
oper
oper
oper
oper
oper
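
The Porter stemmer collapses all seven variants to the non-word "oper". For contrast, the WordNet lemmatizer is much more conservative on the same list: treating each word as a verb, it only maps "operating" and "operates" back to "operate" and leaves the noun and adjective forms untouched. A minimal sketch:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for token in words:
    # pos='v' treats each token as a verb; non-verbs pass through unchanged
    print (lemmatizer.lemmatize(token, pos='v'))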