Bag-of-words
Bag of words is a simple, foundational method for finding topics in a text. To build a bag of words, you first split the text into tokens using tokenization, and then count how many times each token appears.
The theory is that the more frequent a word or token is, the more central or important it might be to the text. This makes bag of words a quick way to identify the significant words in a text based on how often they are used.
Let's walk through a quick example.
from nltk.tokenize import word_tokenize
from collections import Counter

text = """The cat is in the box. The cat likes the box. The box is over the cat."""

# Tokenize the text and count the frequency of each token
counter = Counter(word_tokenize(text))
counter.most_common(2)
# [('The', 3), ('cat', 3)]  (ties are broken by insertion order)
Counter objects also have a most_common() method, which takes an integer argument such as 2 and returns the top 2 tokens by frequency.
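Notice in the output above that "The" and "the" are counted as separate tokens, and punctuation like "." gets counted too. A common next step is to normalize the tokens before counting. Here is a minimal sketch of that idea, assuming lowercasing plus an isalpha() filter as the preprocessing (one common choice among several):

from nltk.tokenize import word_tokenize
from collections import Counter

text = """The cat is in the box. The cat likes the box. The box is over the cat."""

# Lowercase each token and keep only alphabetic tokens,
# so "The"/"the" merge into one count and "." is dropped
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

Counter(tokens).most_common(2)
# [('the', 6), ('cat', 3)]

After normalizing, "the" clearly dominates, which is why stopword removal is often the next preprocessing step after lowercasing.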