If you have taken it upon yourself to learn NLP, or Natural Language Processing, in Python, you have undoubtedly come across the term TF-IDF. In NLP, TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a way of measuring whether a certain word appears frequently in a specific document (term frequency) but rarely across all other documents (inverse document frequency).

This is the formula for term frequency:
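The original formula image is not reproduced here; in its most common normalized form (other variants exist), the term frequency of a term t in a document d is:

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
```

where f_{t,d} is the number of times t appears in d, and the denominator is the total number of terms in d.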

As you can see, the more frequently a word appears in a document, the higher that word’s term frequency will be.

This is the formula for inverse document frequency:
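The original formula image is not reproduced here; a common definition (scikit-learn, for example, uses a smoothed variant, ln((1 + N) / (1 + df(t))) + 1) is:

```latex
\mathrm{idf}(t) = \log\frac{N}{\lvert \{\, d : t \in d \,\} \rvert}
```

where N is the total number of documents and the denominator counts the documents that contain t.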

From this formula, we can gather that the fewer the documents that contain a certain word, the higher that word’s inverse document frequency will be.
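To make the two ideas concrete, here is a quick sketch using scikit-learn's TfidfVectorizer on three toy sentences (the sentences are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny toy documents; "the" appears in all of them.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# A word that appears in every document ("the") gets the lowest idf weight;
# the rarer a word is across documents, the higher its idf. Note that
# scikit-learn uses a smoothed idf: ln((1 + N) / (1 + df(t))) + 1.
for word, idx in sorted(vec.vocabulary_.items()):
    print(word, vec.idf_[idx])
```

Running this, "the" (in all three documents) scores lower than "sat" (in two), which in turn scores lower than "mat" (in one).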

Now let’s apply what we learned to a dataset I am currently working with, involving tweets about the Pfizer COVID-19 vaccine.

First, we import the necessary libraries:
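The original import cell is not shown; a plausible set for the steps that follow, assuming pandas and scikit-learn (the exact list is an assumption), is:

```python
# Assumed stack for this walkthrough: pandas for the data,
# scikit-learn for encoding, splitting, and TF-IDF.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
```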

Next, we read in our CSV file. We will only use two columns: the one that specifies whether a user is verified (which I will eventually be predicting) and the one that contains the text of each user's tweet about the Pfizer vaccine.

I will also be label encoding the verified-user column:
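A minimal sketch of the encoding step, using a few stand-in rows in place of the loaded DataFrame (the column names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in rows for the loaded DataFrame (column names are assumptions).
df = pd.DataFrame({
    "user_verified": [True, False, False, True],
    "text": ["tweet one", "tweet two", "tweet three", "tweet four"],
})

# LabelEncoder maps the boolean flag to integers: False -> 0, True -> 1.
le = LabelEncoder()
df["user_verified"] = le.fit_transform(df["user_verified"])
```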

Next, we separate the X and y values. Afterwards, we perform train_test_split:
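This step might look like the following, again using a few stand-in rows and assumed column names, with an assumed 80/20 split and a fixed random seed for reproducibility:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the encoded DataFrame (column names are assumptions).
df = pd.DataFrame({
    "text": ["got my pfizer shot", "second dose done",
             "vaccine news today", "feeling fine"],
    "user_verified": [1, 0, 0, 1],
})

# X is the raw tweet text, y is the encoded verified flag.
X = df["text"]
y = df["user_verified"]

# Hold out 20% of the rows for testing; random_state makes the
# split reproducible (both values are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```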
