Introduction to Natural Language Processing in Python – (Simple text preprocessing)

Why preprocess?

  • Produces better input data
    • When performing machine learning or other statistical methods
  • Examples:
    • Tokenization to create a bag of words
    • Lowercasing words
  • Lemmatization/Stemming
    • Shorten words to their root stems
  • Removing stop words, punctuation, or unwanted tokens
  • It is good to experiment with different approaches
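
As a rough illustration of lowercasing, tokenization, and stemming, here is a minimal sketch that uses only the standard library. The `toy_stem` function and its suffix rules are hypothetical and deliberately crude; they are not NLTK's algorithm. In practice you would more likely reach for something like NLTK's `PorterStemmer` or `WordNetLemmatizer`.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only alphabetic runs as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    """Very rough stemmer (illustrative assumption, not a real algorithm)."""
    # Strip "ing" from longer words: "jumping" -> "jump"
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    # Strip "es" after x, ch, sh, ss: "boxes" -> "box"
    if word.endswith(("xes", "ches", "shes", "sses")):
        return word[:-2]
    # Strip a plain plural "s": "cats" -> "cat", "likes" -> "like"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


tokens = tokenize("The cats like boxes. A cat likes jumping!")
stems = [toy_stem(t) for t in tokens]
print(stems)
```

With this input, "cats" and "cat" collapse to the same token, as do "likes" and "like", which is the effect stemming is meant to have on a bag of words. The rules are too naive for real text (for example, they would mangle "this"), so treat this only as a picture of the idea.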

 

Text preprocessing with Python:

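from collections import Counter

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = """The cat is in the box. The cat likes the box.
The box is over the cat."""
# Lowercase, tokenize, and keep only alphabetic tokens (drops punctuation)
tokens = [w for w in word_tokenize(text.lower())
          if w.isalpha()]
# Remove English stop words (requires the NLTK stopwords data)
stop_words = set(stopwords.words('english'))
no_stops = [t for t in tokens
            if t not in stop_words]
# Show the two most common remaining tokens
print(Counter(no_stops).most_common(2))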

In the previous article, a similar sample produced different results. The results in this example are more meaningful because punctuation and stop words are removed before the tokens are counted.
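
To see how much preprocessing changes the counts, the following sketch compares a naive count with a preprocessed one on the same text. It is not the previous article's code. To stay runnable without any downloads it assumes a small hand-picked `STOP_WORDS` set and a regular-expression tokenizer in place of NLTK's stop word list and `word_tokenize`.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop word set, standing in for
# NLTK's stopwords.words('english') so the sketch needs no downloads.
STOP_WORDS = {"the", "is", "in", "over"}

text = """The cat is in the box. The cat likes the box.
The box is over the cat."""

# Naive counts: split on whitespace only, no lowercasing or filtering.
raw_counts = Counter(text.split())

# Preprocessed counts: lowercase, alphabetic tokens only, stop words removed.
tokens = re.findall(r"[a-z]+", text.lower())
clean_counts = Counter(t for t in tokens if t not in STOP_WORDS)

print(raw_counts.most_common(2))
print(clean_counts.most_common(2))
```

In the naive count, "The" and "the" are counted separately and lead the list, while "box." and "box" are treated as different tokens because of the trailing period. After preprocessing, "cat" and "box" come out on top with three occurrences each.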

You can read the previous article below:

Introduction to Natural Language Processing in Python – (Words counts with bag-of-words )

 

About Deniz Parlak

Hi, I’m a Security Data Scientist & Data Engineer at My Security Analytics. I have experience with advanced Python, machine learning, and big data tools. I have also worked on Oracle database administration, migration, and upgrade projects. For questions: [email protected]
