Introduction to Natural Language Processing in Python – (Word counts with bag-of-words)

Bag-of-words

Bag-of-words is a simple, basic method for finding topics in a text. You first split the text into tokens using tokenization, and then count how many times each token occurs.

The theory is that the more frequent a word or token is, the more central or important it is likely to be. Bag-of-words can be a great way to determine the significant words in a text based on the number of times they are used.
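To make the idea concrete, here is a minimal sketch of the two steps done by hand, using only the standard library. The regex tokenizer and the helper name count_tokens are assumptions made for illustration; a real tokenizer would also handle punctuation and contractions.

```python
import re


def count_tokens(text):
    """Return a dict mapping each token to the number of times it occurs."""
    counts = {}
    # Assumption: a run of word characters counts as one token
    for token in re.findall(r"\w+", text):
        counts[token] = counts.get(token, 0) + 1
    return counts


counts = count_tokens("the dog chased the ball and the dog barked")

# Rank tokens from most to least frequent
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(ranked[:2])  # [('the', 3), ('dog', 2)]
```

The most frequent token here is "the", which shows a limit of the theory: very common words can dominate the counts without saying much about the topic.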

 

Let’s walk through a quick example.

 

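from collections import Counter

# word_tokenize needs NLTK's Punkt tokenizer data, downloaded once via nltk.download()
from nltk.tokenize import word_tokenize

text = """The cat is in the box. The cat likes the box.
The box is over the cat."""

# Tokenize the text, then count how often each token occurs
counter = Counter(word_tokenize(text))

# The two most frequent tokens
counter.most_common(2)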

Counter objects also have a method called “most_common”, which takes an integer argument such as 2 and returns the top 2 tokens by frequency, as a list of (token, count) tuples.
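As a small illustration of what most_common returns, the sketch below uses a hand-written token list (an assumption, so that it runs without NLTK installed) in place of real tokenizer output.

```python
from collections import Counter

# Assumption: tokens are written out by hand instead of coming from a tokenizer
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
counter = Counter(tokens)

# The two most frequent tokens, as (token, count) tuples
top_two = counter.most_common(2)
print(top_two)  # [('the', 3), ('cat', 2)]

# With no argument, every token is returned, most frequent first
all_ranked = counter.most_common()
print(all_ranked[0])  # ('the', 3)
```

Each entry is a tuple, so the token and its count can be unpacked directly, for example with for token, count in top_two.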

About Deniz Parlak

Hi, I’m a Security Data Scientist & Data Engineer at My Security Analytics. I have experience with advanced Python, machine learning, and big data tools. I have also worked on Oracle Database administration, migration, and upgrade projects. For your questions: [email protected]
