Pipelines in Python

As I mentioned previously, my third project in the Flatiron Data Science Bootcamp was on predicting customer churn for a cell phone provider, that is, how likely customers are to switch providers in the near future. The project was designed to give us practice with classification algorithms such as logistic regression, k-nearest neighbors, decision trees, and random forests.

For algorithms where scaling the data matters, such as logistic regression and k-nearest neighbors, the code can get long and tedious. Fortunately, scikit-learn has a way to streamline this process: pipelines!

Pipelines

Pipelines chain the different steps of a machine learning workflow into code that is far shorter and easier to read! Here is an example using logistic regression from my customer churn classification project:

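from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# X and y are the features and churn labels from the project
X_trainlr, X_testlr, y_trainlr, y_testlr = train_test_split(X, y, random_state=1)
# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_trainlr = scaler.fit_transform(X_trainlr)
X_testlr = scaler.transform(X_testlr)
logreg = LogisticRegressionCV(cv=3, penalty='l2', solver='saga', max_iter=100000, scoring='recall', n_jobs=-1)
logreg.fit(X_trainlr, y_trainlr)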

The above is a typical logistic regression workflow without a pipeline. Now watch what happens when we use one:

from sklearn.pipeline import Pipeline
X_trainlr, X_testlr, y_trainlr, y_testlr = train_test_split(X, y, random_state=1)
# Each step is a (name, estimator) pair; the last step is the model
pipe1 = Pipeline([('scaler', StandardScaler()),
                  ('logreg', LogisticRegressionCV(cv=3, penalty='l2', solver='saga', max_iter=100000, scoring='recall', n_jobs=-1))])
pipe1.fit(X_trainlr, y_trainlr)

As you can see, the second example takes fewer lines of code because the pipeline chains the scaler and the model into a single estimator: calling fit scales the training data and fits the model in one step.
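To check that the two workflows really do the same thing, here is a minimal, self-contained sketch. It makes a few assumptions that are not from my project: the data is a synthetic stand-in generated with make_classification, and it uses a plain LogisticRegression rather than LogisticRegressionCV to keep the run short. The variable names are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption: not the real dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Manual workflow: scale by hand, then fit and predict
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
manual_model = LogisticRegression(max_iter=1000)
manual_model.fit(X_train_scaled, y_train)
manual_preds = manual_model.predict(X_test_scaled)

# Pipeline workflow: the scaler is fit and applied inside fit/predict
pipe = Pipeline([('scaler', StandardScaler()),
                 ('logreg', LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
pipe_preds = pipe.predict(X_test)  # raw test data goes in; scaling happens inside

# Compare the two sets of predictions
same_predictions = bool(np.array_equal(manual_preds, pipe_preds))
print(same_predictions)
```

Notice that the pipeline receives the unscaled test set. It reuses the scaler it fit on the training data, so there is no separate transform call to forget.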

Now let’s try with k-nearest neighbors:

from sklearn.neighbors import KNeighborsClassifier
X_trainknn20, X_testknn20, y_trainknn20, y_testknn20 = train_test_split(X, y, random_state=1)
# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_trainknn20 = scaler.fit_transform(X_trainknn20)
X_testknn20 = scaler.transform(X_testknn20)
knn20 = KNeighborsClassifier(n_neighbors=20)
knn20.fit(X_trainknn20, y_trainknn20)

The above does not use a pipeline, so several separate lines of code are needed to scale the data. Now let’s see it with a pipeline:

X_trainknn20, X_testknn20, y_trainknn20, y_testknn20 = train_test_split(X, y, random_state=1)
pipe2 = Pipeline([('scaler', StandardScaler()),
                  ('knn20', KNeighborsClassifier(n_neighbors=20))])
pipe2.fit(X_trainknn20, y_trainknn20)

The difference here is even more stark! As you can see, a pipeline is an excellent way to save time and energy while making code more readable when you use machine learning algorithms that require scaled data!
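Once a pipeline is fit, it can be used like any other scikit-learn model. Here is a short sketch of scoring the k-nearest neighbors pipeline on the test set and looking inside its steps. As an assumption for illustration, the data is again a synthetic stand-in from make_classification rather than the churn dataset, and the variable names are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption: not the real dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

knn_pipe = Pipeline([('scaler', StandardScaler()),
                     ('knn20', KNeighborsClassifier(n_neighbors=20))])
knn_pipe.fit(X_train, y_train)

# score() scales X_test with the fitted scaler, then reports mean accuracy
test_accuracy = knn_pipe.score(X_test, y_test)
print(test_accuracy)

# Each step stays accessible by the name it was given in the pipeline
fitted_scaler = knn_pipe.named_steps['scaler']
fitted_knn = knn_pipe.named_steps['knn20']
print(fitted_scaler.mean_.shape)  # one learned mean per feature
print(fitted_knn.n_neighbors)
```

Naming the steps is what makes this lookup possible, which is one reason to pick descriptive names like 'scaler' and 'knn20'.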
