Sparkify

Zeevush · 6 min read · Mar 26, 2021

Definition.

Project Overview.
The problem of keeping users satisfied is relevant to any kind of service: it contributes to a positive overall image of the service, helps retain the existing audience, and attracts new users, all of which results in higher profits.
In this project, I attempt to predict which users of a music streaming service are about to cancel, based on the available statistics: the number of songs listened to per day, the rate of error occurrences, the number of songs listened to before logging out, and more. This is a classification problem.

Problem Statement.
The goal is to develop an ML model using Spark that predicts which users are about to unsubscribe. The task breaks down into a number of subtasks:
1. Loading the collected user data (a loading sketch follows this list).
2. Initial preprocessing of the data.
3. Exploratory analysis.
4. Designing features that separate the staying users from the cancelling ones.
5. Training a model on the collected data.
6. Assessing the results.
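As a minimal sketch of the first step, the event log can be loaded into a Spark dataframe as follows (the local session setup and the file name are assumptions for illustration):

```python
from pyspark.sql import SparkSession

# Create a local Spark session (cluster configuration will differ)
spark = SparkSession.builder.appName("Sparkify").getOrCreate()

# Load the event log; the file name is an assumption for illustration
df = spark.read.json("mini_sparkify_event_data.json")

df.printSchema()   # schema shown in the Data Exploration section below
print(df.count())  # 286500 rows for the small dataset
```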

Metrics.
The F1 score is chosen to assess the results.
We calculate F1 as follows:

F1 = 2·TP / (2·TP + FP + FN),

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
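For example, a model that produces 40 true positives, 10 false positives, and 12 false negatives scores F1 = 2·40 / (2·40 + 10 + 12) ≈ 0.78.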

This metric is chosen due to the small number of users (225 identifiable users in the small dataset) and the imbalance between the classes, which would make plain accuracy misleading.
False positives are less costly than false negatives, although the model should minimize them as well.

Analysis

Data Exploration.
The small dataset contains 286,500 rows representing 226 distinct user IDs (one of which is the empty string, see below)
and has the following schema:

|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)

The `userId` field contains empty strings, which most likely represent users who are not logged in yet or visitors still considering registering on the platform. These rows can be dropped, since such users cannot be identified. There are no NaN values in that field.
The `ts` field contains no incorrect values or NaNs, but its values are in milliseconds and should be divided by 1,000 to conform to the unix timestamp format.
The `page` field contains no incorrect values. Although the dataset contains a significant number of rows, after aggregation it only gives us information about a small number of users. In addition, the two categories of users (churning and staying) are highly imbalanced: 52 churning versus 173 staying users in the small dataset.
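A minimal sketch of these cleaning steps, assuming churn is flagged by the `Cancellation Confirmation` page event (the same event used for the `churn` label later on):

```python
from pyspark.sql import functions as F

# Drop events that cannot be attributed to a known user
df_clean = df.filter(F.col("userId") != "")

# `ts` is in milliseconds; divide by 1,000 for a unix timestamp in seconds
df_clean = df_clean.withColumn("ts_sec", (F.col("ts") / 1000).cast("long"))

# Users who confirmed cancellation at some point are the churning users
churn_ids = (df_clean
             .filter(F.col("page") == "Cancellation Confirmation")
             .select("userId")
             .distinct())
print(churn_ids.count())  # 52 churning users in the small dataset
```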

Exploratory Visualization.
All of the charts below relate to the small dataset. We divide the users into two groups and compare various aggregations based on daily activity, overall statistics, or the number of songs listened to between certain actions; a sketch of such a group-wise comparison follows the list. In general, most of the aggregations do not differ much between the two categories in either mean or standard deviation.
The following statistics were explored and assessed for both categories (visualizations are included for the most promising ones):
1. Overall/daily number of songs listened to over the period covered by the data.

The overall song counts for churning users (left) and staying users (right)

2. Average number of certain actions per day (Cancel, Submit Downgrade, Thumbs Down, Downgrade, Logout, Save Settings, Add to Playlist, Add Friend, etc.)
3. Gender distribution of users among the two categories.
4. Likes/dislikes ratio.

The overall likes/dislikes ratio for churning users (left) and staying users (right)

5. Overall/daily number of songs by the top 20 artists listened to by users.
6. Average number of songs listened to between certain actions.

The average number of songs between logouts for churning users (left) and staying users (right)
The average number of songs between rolling ads for churning users (left) and staying users (right)
The average number of songs between visiting the Help page for churning users (left) and staying users (right)
The average number of songs between error occurrences for churning users (left) and staying users (right)
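The group-wise comparisons behind these charts can be sketched as follows, reusing `df_clean` and `churn_ids` from the earlier snippet (the column names are illustrative):

```python
from pyspark.sql import functions as F

# Overall number of songs listened to per user
per_user = (df_clean
            .filter(F.col("page") == "NextSong")
            .groupBy("userId")
            .agg(F.count("*").alias("num_songs")))

# Attach the churn flag: 1 for churning users, 0 otherwise
labeled = (per_user
           .join(churn_ids.withColumn("churn", F.lit(1)), "userId", "left")
           .fillna(0, subset=["churn"]))

# Compare mean and standard deviation between the two categories
(labeled.groupBy("churn")
 .agg(F.mean("num_songs").alias("mean"), F.stddev("num_songs").alias("std"))
 .show())
```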

Methodology.

Data Preprocessing.
Rows containing an empty string as the user ID were deleted. Two new columns were added: `churn`, marking rows in which a user confirmed cancellation of the service, and `date`, derived from the timestamp field. Most of the work was done in the Feature Engineering section of the notebook. A new dataframe was created with one row per user; it includes some of the aggregations explored in the previous section:
— overall number of songs listened to over the period covered by the data
— likes/dislikes ratio
— average number of songs listened to between logouts, rolling ads, errors, and visits to the Help page.
The `churn` column was also included; it serves as the label for our ML model. A condensed sketch of this step is shown below.
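Here is one way it could look (the column names and the two features shown are illustrative; the notebook builds all of the features listed above):

```python
from pyspark.sql import functions as F

# Label: 1 if the user ever confirmed cancellation, 0 otherwise
labels = (df_clean
          .withColumn("is_cancel",
                      (F.col("page") == "Cancellation Confirmation").cast("int"))
          .groupBy("userId")
          .agg(F.max("is_cancel").alias("churn")))

# Overall number of songs listened to
songs = (df_clean
         .filter(F.col("page") == "NextSong")
         .groupBy("userId")
         .agg(F.count("*").alias("num_songs")))

# Likes/dislikes ratio (+1 in the denominator avoids division by zero)
ratio = (df_clean
         .groupBy("userId")
         .agg((F.sum((F.col("page") == "Thumbs Up").cast("int")) /
               (F.sum((F.col("page") == "Thumbs Down").cast("int")) + 1))
              .alias("like_ratio")))

features = labels.join(songs, "userId").join(ratio, "userId")
```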

Implementation.
Our problem is a binary classification problem, and the pyspark.ml library has a number of algorithms suited to that task. After dividing the dataset into three parts (training, validation, and test sets) in a 0.7/0.15/0.15 ratio, we
test the following algorithms: LogisticRegression, DecisionTreeClassifier, GBTClassifier, and RandomForestClassifier, and choose the one that shows the best results.
We use a Pipeline, parameter tuning, and cross-validation, all of which are also included in the pyspark.ml library.
The parameter grid for logistic regression is chosen as follows:
— elasticNetParam: [0.0, 0.5, 1.0]
— maxIter: [5, 7, 10]
— threshold: [0.3, 0.5, 0.7]
For cross-validation, the number of folds is set to 3.
The model is also trained and tested on various combinations of the chosen features to find the best result.
The F1 score is computed with the MulticlassClassificationEvaluator from the pyspark.ml library.
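A sketch of this setup, under the assumption that `features` is the per-user dataframe from the previous section and that logistic regression is the estimator being tuned:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# 0.7/0.15/0.15 split into training, validation and test sets
train, validation, test = features.randomSplit([0.7, 0.15, 0.15], seed=42)

# Assemble and scale the feature columns, then classify
assembler = VectorAssembler(inputCols=["num_songs", "like_ratio"],
                            outputCol="rawFeatures")
scaler = StandardScaler(inputCol="rawFeatures", outputCol="features")
lr = LogisticRegression(labelCol="churn", featuresCol="features")
pipeline = Pipeline(stages=[assembler, scaler, lr])

# The parameter grid described above
grid = (ParamGridBuilder()
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .addGrid(lr.maxIter, [5, 7, 10])
        .addGrid(lr.threshold, [0.3, 0.5, 0.7])
        .build())

evaluator = MulticlassClassificationEvaluator(labelCol="churn",
                                              metricName="f1")

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
model = cv.fit(train)
```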

Refinement.
All of the algorithms performed poorly on the initial data (for both the small and the medium datasets).
The best result was an F1 score of 0.19 on the validation set and 0.24 on the test set.
This is mostly a consequence of the imbalance between the two categories in the training set.
One way to fix that is simply to duplicate the rows of churning users in the training set.
The oversampling was done as explained in this article.
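A minimal sketch of that oversampling step, reusing the `train` split and `cv` from the training snippet: sample the minority class with replacement until the two classes are roughly balanced.

```python
from pyspark.sql import functions as F

major = train.filter(F.col("churn") == 0)  # staying users
minor = train.filter(F.col("churn") == 1)  # churning users
ratio = major.count() / minor.count()

# With replacement, fraction may exceed 1, duplicating minority rows
oversampled = minor.sample(withReplacement=True, fraction=ratio, seed=42)
train_balanced = major.union(oversampled)

model = cv.fit(train_balanced)  # re-fit the cross-validated pipeline
```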
Balancing the categories in the training set significantly improved the performance of the algorithms.
Various combinations of the selected parameters were also tested to obtain the best result.
The best results were shown by logistic regression using all of the initially selected features.
The final result is an F1 score of 0.44 on the validation set and 0.63 on the test set for the small dataset.

Results.

Model Evaluation and Validation.
Cross-validation and grid search were used to tune the parameters.
The best parameters for the logistic regression algorithm are as follows:
— maxIter: 10
— elasticNetParam: 0.0
— threshold: 0.3
The model was evaluated with the built-in MulticlassClassificationEvaluator using the F1 score metric.
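A sketch of the evaluation, reusing `model` and `evaluator` from the training snippet:

```python
# Score the held-out sets with the best model found by cross-validation
f1_val = evaluator.evaluate(model.transform(validation))
f1_test = evaluator.evaluate(model.transform(test))
print(f1_val, f1_test)  # roughly 0.44 and 0.63 for the small dataset

# Inspect the tuned logistic regression stage of the best pipeline
best_lr = model.bestModel.stages[-1]
print(best_lr.getMaxIter(), best_lr.getElasticNetParam(),
      best_lr.getThreshold())
```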

Justification.
The selected approach was tested on the small dataset, reaching an F1 score of 0.63 on the test set.
Logistic regression performed slightly better than the other algorithms, possibly as a result of better parameter tuning.

Conclusion.

Reflection.
During the project, the following steps were taken:
1. The data was downloaded and analyzed.
2. The data was preprocessed and reshaped, and suitable features were designed.
3. An ML model was built and trained.
4. The results were assessed and tested.
The most difficult task was finding suitable features, mostly because the two categories overlap strongly.
Almost all of the explored features had a significant standard deviation, and some of them were nearly identical across the two categories.

Improvement.
Another approach could be used to tackle the imbalance of the two classes, for example SMOTE, as explained in this article.
This method is preferable because it removes the imbalance not just by duplicating existing data, but by generating new, synthetic samples.
In addition, ensemble techniques can be used to further improve the performance of the model.
