TF-IDF Explained
If you have taken it upon yourself to learn Natural Language Processing (NLP) in Python, you have undoubtedly come across the term TF-IDF. TF-IDF, which stands for Term Frequency-Inverse Document Frequency, measures whether a certain word appears frequently in a specific document (term frequency) while appearing rarely across all other documents (inverse document frequency).
This is the formula for term frequency:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
As you can see, the more frequently a word appears in a document, the higher that word’s term frequency will be.
This is the formula for inverse document frequency:

IDF(t) = log(total number of documents / number of documents containing term t)
From this formula, we can gather that the fewer the documents that contain a certain word, the higher that word’s inverse document frequency will be.
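The two formulas can be sketched in plain Python on a toy corpus (the documents and terms below are made up for illustration):

```python
import math

# Toy corpus: each document is a list of lowercase tokens (illustrative data)
corpus = [
    ["the", "vaccine", "is", "effective"],
    ["the", "trial", "results", "are", "in"],
    ["vaccine", "trial", "results", "look", "good"],
]

def tf(term, doc):
    # Term frequency: occurrences of `term` divided by the document's length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of total documents over
    # the number of documents that contain `term`
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

doc = corpus[0]
print(tf("vaccine", doc))      # 1 occurrence out of 4 tokens -> 0.25
print(idf("vaccine", corpus))  # appears in 2 of 3 docs -> log(3/2)
print(tf("vaccine", doc) * idf("vaccine", corpus))  # the TF-IDF score
```

Note that scikit-learn's `TfidfVectorizer`, used below, applies a smoothed variant of this IDF formula, so its exact numbers will differ slightly.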
Now let’s apply what we’ve learned to a dataset I am currently working with: tweets about the Pfizer COVID-19 vaccine.
First we import the necessary libraries:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import string
import nltk
from nltk.corpus import stopwords
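The `string` and `stopwords` imports suggest a text-cleaning step before vectorizing. Here is a minimal sketch of what such a step might look like; the stopword set below is a tiny hardcoded stand-in for NLTK's English list, which would normally come from `stopwords.words('english')` after running `nltk.download('stopwords')`:

```python
import string

# Tiny stand-in for NLTK's English stopword list (illustrative only)
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and"}

def clean_text(text):
    # Strip punctuation, lowercase, tokenize on whitespace, drop stopwords
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean_text("The vaccine is safe, effective, and available!"))
# -> "vaccine safe effective available"
```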
Next we read in our CSV file. We will only be using two columns: the one that specifies whether a user is verified (which I will eventually be predicting) and the one that contains the text of each user’s tweet about the Pfizer vaccine.
Also, I will be label encoding the verified user column:
messages = pd.read_csv("vaccination_tweets.csv", usecols=[8, 10])  # the user_verified and text columns
LE = LabelEncoder()
messages['target'] = LE.fit_transform(messages["user_verified"])
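`LabelEncoder` simply maps each distinct value in the column to an integer. A self-contained sketch with made-up values standing in for the `user_verified` column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for the user_verified column (illustrative data)
verified = pd.Series([True, False, False, True, False])

LE = LabelEncoder()
encoded = LE.fit_transform(verified)

print(list(LE.classes_))  # the distinct values, sorted: [False, True]
print(list(encoded))      # True -> 1, False -> 0: [1, 0, 0, 1, 0]
```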
Next, we separate the X and y values. Afterwards, we perform train_test_split:
X = messages['text']
y = messages['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)