Combination Sampling Methods

Joe Ramirez
Feb 2, 2021


Last year, as I was finishing my capstone project on predicting food deserts at the Flatiron Data Science Bootcamp, I ran into a dilemma: my recall score was not as high as I would have liked. I chose recall as my metric because I wanted to reduce False Negatives as much as possible. A False Negative in the context of this project meant that my model would identify a census tract as not being a food desert when in fact it was. This was the most damaging kind of error my model could make, because the result would be a census tract not receiving desperately needed government subsidies to improve food access for its residents.
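To see why recall was the right fit, here is a quick sketch, using scikit-learn on made-up labels rather than my actual capstone data, of how recall directly counts the False Negatives I wanted to avoid:

```python
# Illustration only: labels are made up, 1 = "food desert", 0 = not.
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # two tracts wrongly cleared

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"False Negatives: {fn}")                       # 2 tracts missed
print(f"Recall: {recall_score(y_true, y_pred):.2f}")  # tp / (tp + fn) = 0.50
```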

Class Imbalance

Although there was a class imbalance in my dataset, the difference was not massive. In other words, both classes contained a substantial number of observations, which meant I could avoid the SMOTE oversampling technique. SMOTE tackles class imbalance by creating synthetic minority-class samples, each interpolated between an existing minority sample and one of its nearest minority-class neighbors. This technique is typically used when the minority class is very small compared to the majority class. In addition, because its synthetic points are built from existing ones, SMOTE can be prone to overfitting.
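For anyone curious what SMOTE looks like in practice, here is a minimal sketch using the imbalanced-learn library on a synthetic dataset; the class weights are illustrative, not those of my capstone data:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic data with a heavy imbalance, the scenario SMOTE targets.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))  # roughly 900 majority vs. 100 minority samples

# Each synthetic point is interpolated between a minority sample
# and one of its k nearest minority-class neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```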

As a result, I leaned towards the Tomek Links undersampling technique to tackle this moderate class imbalance. A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors; the technique removes the majority-class member of each such pair. The downside to Tomek Links is that it can cause significant data loss, since you are removing values outright. However, I was not worried about this: without a major class imbalance, only a small number of values would be removed from the majority class.
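Here is a minimal sketch of Tomek Links, again with imbalanced-learn on synthetic data, this time with a moderate imbalance like the one I was facing:

```python
from collections import Counter

from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

# A moderate imbalance, similar in spirit to my capstone data.
X, y = make_classification(n_samples=1000, weights=[0.6, 0.4], random_state=42)

# Only the majority-class member of each Tomek link is removed by
# default, so the data loss here stays small.
X_res, y_res = TomekLinks().fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```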

Enter SMOTEENN

Although my reasoning for using Tomek Links was sound, my recall score was not as high as I had hoped, so I began investigating alternative sampling techniques. That is when I discovered SMOTEENN, a combination sampling technique that first oversamples the minority class with SMOTE and then cleans the result with Edited Nearest Neighbours (ENN), which removes samples whose class disagrees with the majority of their nearest neighbors. With SMOTEENN, you essentially get the best of both worlds: you oversample the minority class while mitigating overfitting, and you undersample the majority class while avoiding significant data loss.
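Here is a minimal sketch of SMOTEENN with imbalanced-learn; the logistic regression model and synthetic data are placeholders for illustration, not my original capstone setup:

```python
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Resample the training data only: SMOTE oversamples the minority
# class first, then Edited Nearest Neighbours prunes noisy samples.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)

model = LogisticRegression().fit(X_res, y_res)
print(f"Recall: {recall_score(y_test, model.predict(X_test)):.2f}")
```

Note that resampling is applied only to the training split, so the test set still reflects the original class distribution and the recall score stays honest.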

While using Tomek Links, the best recall score I was able to achieve was 0.77. Using SMOTEENN, I improved this to 0.80, an appreciable increase.

Conclusion

As you can see, SMOTEENN gave my capstone project an appreciable boost in predictive performance. Anyone who is dealing with a class imbalance and struggling to decide between oversampling and undersampling methods should consider a combination sampling technique, which avoids the pitfalls of using either approach by itself.
