
Introduction to Natural Language Processing in Python – (Word counts with bag-of-words)

Deniz Parlak, September 7, 2020

Bag-of-words

Bag of words is a simple, basic method for finding topics in a text. To build a bag of words, you first split the text into tokens using tokenization, and then count how many times each token appears.

The theory is that the more frequent a word or token is, the more central or important it might be. Bag of words can be a great way to determine the significant words in a text based on the number of times they are used.

Let’s walk through a quick example.

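from nltk.tokenize import word_tokenize
from collections import Counter

text = """The cat is in the box. The cat likes the box.
The box is over the cat."""
# Tokenize the text, then count how often each token occurs.
# (word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first.)
counter = Counter(word_tokenize(text))

# Return the two most frequent tokens as (token, count) pairs.
counter.most_common(2)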

Counter objects also have a method called “most_common”, which takes an integer argument such as 2 and returns the top 2 tokens by frequency, each as a (token, count) pair.
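One thing to watch for: the counts are case-sensitive and include punctuation, so “The” and “the” are counted as different tokens and “.” is counted as a token too. The sketch below is one possible way to handle this, not part of the original example. It assumes a simple regular expression is an acceptable stand-in for word_tokenize, lowercases the text, and keeps only runs of word characters before counting.

```python
import re
from collections import Counter

text = """The cat is in the box. The cat likes the box.
The box is over the cat."""

# Assumption: a plain regex stands in for a real tokenizer here.
# Lowercasing merges "The" and "the"; \w+ skips punctuation such as ".".
tokens = re.findall(r"\w+", text.lower())

# Count every token, then take the two most frequent ones.
counter = Counter(tokens)
top_two = counter.most_common(2)
print(top_two)  # [('the', 6), ('cat', 3)]
```

“cat” and “box” both appear 3 times; most_common lists tied tokens in the order they were first seen, so “cat” comes first. Note that the top token is “the”, which says little about the topic. Very common words like this tend to dominate raw counts.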

About Deniz Parlak

Hi, I’m a Security Data Scientist & Data Engineer at My Security Analytics. I have experience with advanced Python, machine learning, and big data tools. I have also worked on Oracle database administration, migration, and upgrade projects. For your questions: [email protected]
