When I first learned about hypothesis testing during my data science bootcamp at Flatiron, I found it very interesting, but I did not realize just how useful this statistical method was until I did a classification project where I was predicting customer churn for a cell phone company.
For a cell phone company, customer churn means that a customer will leave the company in the near future. Of course, the company wants to keep its churn rate very low so that it retains as many customers as possible.
For my classification project, I predicted customer churn for the cell phone company SyriaTel. A typical customer churn rate in the cell phone business is around 2%, yet SyriaTel had a churn rate of 14%.
To investigate what was causing SyriaTel’s inability to retain customers, I performed some exploratory data analysis and statistical testing. One of the features in the data set was international plan, and I was curious to see whether having an international plan had any effect on whether a customer churned.
In hypothesis testing, we first set our null hypothesis to be what we are trying to disprove. In the SyriaTel case, the null hypothesis was that having an international plan had no effect on whether a customer would churn. Our alternative hypothesis is the exact opposite of the null hypothesis. If the data let us reject the null hypothesis, we have evidence in favor of the alternative hypothesis.
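This is what the notation looks like, where P₁ is the churn rate for customers with an international plan and P₂ is the churn rate for customers without one:
H₀: P₁ = P₂
Hₐ: P₁ ≠ P₂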
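This will be a two-proportion z-test, so I used the appropriate function from statsmodels:
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
count = np.array([137, 346])  # churned customers: [with plan, without plan]
nobs = np.array([323, 3010])  # total customers in each group
# Two-proportion z-test
stat, pval = proportions_ztest(count, nobs)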
Running the above code gave a test statistic of about 15 and a p-value that was essentially 0. This meant that the churn rates for customers with and without an international plan were statistically significantly different. I was therefore able to reject my null hypothesis that the two rates were equal in favor of my alternative hypothesis that they were not.
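To see where a test statistic of about 15 comes from, the same calculation can be sketched by hand with only the standard library. This is a rough check rather than the code from my project: it assumes the pooled-variance form of the two-proportion z-test, which, as far as I understand, is what proportions_ztest uses by default, and the variable names are my own for illustration.

```python
import math

# Counts from the article: churned customers and group sizes
churned_with_plan, n_with_plan = 137, 323
churned_without_plan, n_without_plan = 346, 3010

# Observed churn rate in each group
p1 = churned_with_plan / n_with_plan
p2 = churned_without_plan / n_without_plan

# Pooled churn rate, assuming the null hypothesis (P1 = P2) is true
p_pooled = (churned_with_plan + churned_without_plan) / (n_with_plan + n_without_plan)

# Standard error of the difference in proportions under the null hypothesis
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_with_plan + 1 / n_without_plan))

# z statistic and two-sided p-value from the standard normal distribution
z = (p1 - p2) / se
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"churn rate with plan:    {p1:.3f}")
print(f"churn rate without plan: {p2:.3f}")
print(f"z statistic: {z:.2f}, p-value: {p_value:.3g}")
```

The two churn rates printed here are also the numbers behind the comparison between the two groups: roughly 42% of customers with an international plan churned, against roughly 11% of customers without one.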
I then graphed the difference:
As you can see from the above graph, customers who had an international plan were far more likely to churn than customers who did not. This meant that the international plan feature was going to be an important one for my classification model. But more than that, it meant that I had used hypothesis testing to find business answers in a real-world application.
If I were a data scientist working for SyriaTel, I would recommend that the company look further into why customers with an international plan were far more likely to churn than customers without one. One possible explanation is that the company has very good rates on its international plan but less competitive rates on its domestic plans. If so, many customers may sign up for SyriaTel as a secondary number to make international calls and then cancel soon after. In that case, SyriaTel should make its domestic plans more competitive if it wants to retain its international plan customers.
Hypothesis testing is a powerful tool that can help a struggling business understand why it is unable to retain customers. Until I used hypothesis testing in my classification project, I did not fully appreciate just how powerful this statistical method can be for a business in the real world.