Bag-of-words
Bag of words is a simple, foundational method for finding topics in a text. To build a bag of words, you first split the text into tokens using tokenization, and then count how many times each token appears.
The theory is that the more frequent a word or token is, the more central or important it might be to the text. This makes bag of words a quick way to identify the significant words in a text based on how often they are used.
Let's walk through a quick example.
from nltk.tokenize import word_tokenize
from collections import Counter

text = """The cat is in the box. The cat likes the box. The box is over the cat."""

# Tokenize the text and count the frequency of each token
counter = Counter(word_tokenize(text))
counter.most_common(2)
# [('The', 3), ('cat', 3)]  (ties are broken by insertion order)
Counter objects also have a most_common() method, which takes an integer argument such as 2 and returns the top 2 tokens by frequency.
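Notice in the output above that "The" and "the" are counted as separate tokens, and punctuation like "." gets counted too. A common next step is to normalize the tokens before counting. Here is a minimal sketch of that idea, assuming lowercasing plus an isalpha() filter as the preprocessing (one common choice among several):

from nltk.tokenize import word_tokenize
from collections import Counter

text = """The cat is in the box. The cat likes the box. The box is over the cat."""

# Lowercase each token and keep only alphabetic tokens,
# so "The"/"the" merge into one count and "." is dropped
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

Counter(tokens).most_common(2)
# [('the', 6), ('cat', 3)]

After normalizing, "the" clearly dominates, which is why stopword removal is often the next preprocessing step after lowercasing.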