Why preprocess ?
- Helps make for better input data
- When performing machine learning or other statistical methods
- Examples:
- Tokenization to create a bag of words
- Lowercasting words
- Lemmetization/Stemming
- Shorten words to their root stems
- Removing stop words, punctuation or unwanted tokens
- Good to experiment with different approaches
Text preprocessing with Python:
from nltk.corpus import stopwords text = """The cat is in the box. The cat likes the box. The box is over the cat.""" tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()] no_stops = [t for t in tokens if t not in stopwords.words('english')] Counter(no_stops).most_common(2)
In the previous article, the results of a similar sample were different. We got more meaningful results in this example.
You can read the previous article below
Introduction to Natural Language Processing in Python – (Words counts with bag-of-words )