You may have recently come upon news regarding NFTs (non-fungible tokens) after an NFT artwork sold at auction for nearly 70 million dollars, but you might be asking yourself: what exactly is an NFT?
Before I get to that, you might be asking yourself another question as well: what on Earth does fungible mean? Understanding that term makes explaining the concept of an NFT far easier.
A fungible asset is one that is interchangeable with any other unit of the same asset. Let’s refer to the Ethereum blockchain, on which many NFTs are built, to better explain what that means. If I lend you an Ethereum…
As you embark on your data science journey, one of the first things you will have to learn in Pandas is how to join different datasets. This is an essential skill, because the data you need for analysis and machine learning is rarely contained in a single dataset.
As such, you will need to combine information from several datasets into a single readable dataset before you begin your exploratory data analysis. …
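As a minimal sketch of what joining looks like in practice, here are two made-up tables combined with `DataFrame.merge`. The table names, column names, and values are all hypothetical and only illustrate how an inner join differs from a left join.

```python
import pandas as pd

# Hypothetical example data: two small tables sharing a "customer_id" key.
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]}
)
orders = pd.DataFrame(
    {"customer_id": [1, 1, 3, 4], "amount": [20.0, 35.5, 12.0, 50.0]}
)

# Inner join: keep only customer_ids present in BOTH tables.
inner = customers.merge(orders, on="customer_id", how="inner")

# Left join: keep every customer, filling missing order columns with NaN.
left = customers.merge(orders, on="customer_id", how="left")

print(inner)
print(left)
```

The inner join drops Ben (no orders) and the order for customer 4 (no matching customer), while the left join keeps Ben with a missing `amount`.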
One of the most fundamental things a person learning Pandas in Python must grasp is the difference between apply, map, and applymap. Although the differences might seem confusing at first, some real-world examples help cement them.
When cleaning data in Pandas, map() works on a single series, applying a function to each of its values:
messages['tokenizer'] = messages['tokenizer'].map(lambda x: tokenizer.tokenize(x))
As you can see, I am only changing the values of one series in my data frame, which is why map worked in this instance.
The apply() function is interesting because it can also work exactly the same as above.
messages['tokenizer'] = messages['tokenizer'].apply(lambda x: tokenizer.tokenize(x))
But it’s not limited to working on a single series: called on a data frame, apply() passes each whole column or row to the function.
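To put the three side by side, here is a small sketch on made-up data. The `messages` frame below is a hypothetical stand-in with invented columns, and word counting replaces the tokenizer so the example runs on its own. One caveat: recent pandas releases expose the element-wise data frame method as `DataFrame.map` and deprecate `applymap`, so the sketch picks whichever one the installed version provides.

```python
import pandas as pd

# Hypothetical stand-in for the messages data frame (columns are made up).
messages = pd.DataFrame(
    {
        "text": ["hello world", "pandas is fun"],
        "reply": ["hi there", "yes it is"],
    }
)

# Series.map: element-wise over ONE column.
word_counts_map = messages["text"].map(lambda s: len(s.split()))

# Series.apply: gives the same result for a simple element-wise function.
word_counts_apply = messages["text"].apply(lambda s: len(s.split()))

# DataFrame.apply with axis=1: the function receives a whole row at a time,
# so it can combine several columns.
total_words = messages.apply(
    lambda row: len(row["text"].split()) + len(row["reply"].split()), axis=1
)

# Element-wise over EVERY cell of the frame: DataFrame.applymap in older
# pandas, DataFrame.map in newer releases.
elementwise = messages.map if hasattr(messages, "map") else messages.applymap
upper = elementwise(str.upper)

print(word_counts_map.tolist(), total_words.tolist())
print(upper)
```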
If you have taken it upon yourself to learn NLP, or Natural Language Processing, in Python, you have undoubtedly come across the term TF-IDF. TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a way of scoring how important a word is to a document: the score is high when the word appears frequently in that specific document (term frequency) but rarely across all the other documents (inverse document frequency).
This is the formula for term frequency:
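One common textbook definition, which may differ in detail from the variant used in the full post, takes term frequency as the count of a term in a document divided by the document's length, and inverse document frequency as the log of the number of documents divided by the number of documents containing the term. A pure-Python sketch under those assumptions, with a made-up three-document corpus and hypothetical function names:

```python
import math
from collections import Counter

# Hypothetical toy corpus, already split into tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "cats and dogs".split(),
]


def term_frequency(term, doc):
    """Share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)


def inverse_document_frequency(term, corpus):
    """log(N / number of documents containing `term`); 0.0 if none do."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(term, doc, corpus):
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)


# "the" is frequent in the first document but also appears elsewhere,
# while "cat" appears only in the first document.
print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```

Even though "the" occurs twice as often as "cat" in the first document, "cat" ends up with the higher score because it appears in no other document.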
If you have wondered about the Support Vector Machine classifier previously, but have been too intimidated to learn more, this explainer will get you started with the basics surrounding this machine learning algorithm.
Support Vector Machine, SVM for short, is a supervised learning classifier that aims to maximize the margin, meaning the distance between the decision boundary and the nearest points of each group it separates. To better understand what that exactly means, let’s look at some visualizations:
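The margin idea can also be checked numerically. The sketch below is an illustration on four made-up points, not the example from the full post: it fits scikit-learn's `SVC` with a linear kernel and a large `C` (an assumption used to approximate a hard margin) and computes the margin width as 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable data: class 0 sits at x=0, class 1 at x=2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C penalizes margin violations heavily (close to a hard margin).
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# For a linear SVM the decision function is w.x + b, and the distance
# between the two margin lines is 2 / ||w||.
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:", clf.support_vectors_)
print("margin width:", margin_width)
```

Here the widest possible gap between the two groups is the vertical strip between x=0 and x=2, so the margin width should come out close to 2.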
As I have previously mentioned, I did my third project in the Flatiron Data Science Bootcamp on predicting customer churn (how likely customers are to switch providers in the near future) for a cell phone provider. This project was designed for us to practice using various classification machine learning algorithms such as logistic regression, k-nearest neighbors, decision trees, and random forests.
For algorithms that are sensitive to feature scale, such as regularized logistic regression and k-nearest neighbors, the code can at times get long and tedious. Fortunately, there is a way in scikit-learn to streamline this process: pipelines!
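As a rough sketch of the idea, the pipeline below chains a scaler and a logistic regression into one estimator. It uses synthetic data from `make_classification` rather than the churn dataset, and the step names `"scaler"` and `"clf"` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the churn dataset from the project).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training data only, then feeds the
# scaled features to the classifier, all in a single fit() call.
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)
pipe.fit(X_train, y_train)

# score() scales X_test with the already-fitted scaler before predicting.
accuracy = pipe.score(X_test, y_test)
print("test accuracy:", accuracy)
```

Because scaling lives inside the pipeline, the test data is never used to fit the scaler, which avoids a common source of data leakage.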
What is the above image? It is a graph from my capstone project on food deserts (census tracts with low access to fresh, healthy food). It shows which features in my final model were most important in predicting whether a census tract was in fact a food desert. In this post, I will focus on the two most important features.
Low Vehicle Access
The TractHUNV feature represents how many housing units in a census tract have no access to a vehicle. …
The above image is a confusion matrix taken from my capstone food desert classification project, which capped my time at the Flatiron Data Science Bootcamp. Does it seem confusing? It’s supposed to be!
A confusion matrix is a plot used to represent the performance of a machine learning model. In the above confusion matrix, the x-axis represents my predicted labels and the y-axis represents the true labels. For example, the number 5450 in the bottom-right corner is the number of times the final model I chose in my capstone correctly predicted a census tract in the United States…
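To see how the cells of such a matrix are filled in, here is a small pure-Python sketch using made-up labels (not the capstone results). Rows are true labels and columns are predicted labels, matching the axes described above, so the bottom-right cell counts correct predictions of the positive class.

```python
# Hypothetical true and predicted labels for eight samples (0 or 1).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# matrix[true_label][predicted_label] counts how often each pair occurred.
matrix = [[0, 0], [0, 0]]
for true_label, predicted_label in zip(y_true, y_pred):
    matrix[true_label][predicted_label] += 1

true_negatives, false_positives = matrix[0]
false_negatives, true_positives = matrix[1]

for row in matrix:
    print(row)
print("bottom-right (true positives):", true_positives)
```

scikit-learn's `confusion_matrix` follows the same convention, with true labels along the rows and predicted labels along the columns.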
In essence, blockchain is a decentralized, fully transparent, and immutable ledger of transactions.
One of the most important features that blockchain has to offer is its full transparency. This means that if any data in the blockchain is altered in any way, the users of that blockchain can see that a change was made. How can this be useful in real life? …
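One way to picture why alterations are visible is a toy hash chain, where each block stores the hash of the block before it. The sketch below is a deliberately simplified illustration with hypothetical function names and made-up transactions; a real blockchain adds consensus, networking, and much more.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, data, prev_hash):
    """SHA-256 over the block's contents, including the previous hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(records):
    chain, prev_hash = [], GENESIS_HASH
    for index, data in enumerate(records):
        current = block_hash(index, data, prev_hash)
        chain.append(
            {"index": index, "data": data, "prev_hash": prev_hash, "hash": current}
        )
        prev_hash = current
    return chain


def is_valid(chain):
    """Recompute every hash; any edited block no longer matches."""
    prev_hash = GENESIS_HASH
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["data"], block["prev_hash"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


chain = build_chain(["Alice pays Bob 5", "Bob pays Carol 2"])
valid_before = is_valid(chain)

# Tamper with a recorded transaction without recomputing the hashes.
chain[0]["data"] = "Alice pays Bob 500"
valid_after = is_valid(chain)

print(valid_before, valid_after)
```

Anyone holding a copy of the chain can rerun the check, which is what makes a quiet edit to past data detectable.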
Data Scientist | Data Analyst