Introduction to Natural Language Processing in Python – Word counts with bag-of-words


Bag of words is a simple, basic method for finding topics in a text. You first split the text into tokens using tokenization, and then count how many times each token appears.

The theory is that the more frequent a word or token is, the more central or important it might be. Bag of words can therefore be a useful way to find the significant words in a text, based on the number of times they are used.


Let’s walk through a quick example.


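from collections import Counter
from nltk.tokenize import word_tokenize

text = """The cat is in the box. The cat likes the box.
The box is over the cat."""

# Tokenize, then count each token (word_tokenize needs NLTK's Punkt tokenizer data).
tokens = word_tokenize(text)
counts = Counter(tokens)
print(counts)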



Counter objects also have a method called most_common, which takes an integer argument, such as 2, and returns that many of the most frequent tokens along with their counts.
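
As a rough illustration of most_common, here is a minimal sketch that runs on the same sample text. It makes two assumptions that are not part of the original example: a simple regular expression stands in for word_tokenize, so only the standard library is needed, and the text is lowercased first so that "The" and "the" count as the same token.

```python
import re
from collections import Counter

text = """The cat is in the box. The cat likes the box.
The box is over the cat."""

# Assumption: a plain regex replaces word_tokenize here. It lowercases the
# text and keeps only runs of letters, so punctuation is dropped.
tokens = re.findall(r"[a-z]+", text.lower())

# Build the bag of words: a mapping from each token to its count.
counts = Counter(tokens)

# Ask for the two most frequent tokens as (token, count) pairs.
top_two = counts.most_common(2)
print(top_two)
```

With this tokenization, "the" comes first with 6 occurrences. The second slot is a tie, since "cat" and "box" both appear 3 times. Very common words such as "the" tend to dominate raw counts, which is one reason bag of words is usually combined with some text preprocessing.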

