Udacity Capstone Starbucks Project Analysis

Soni Pandey
Mar 31, 2021

Introduction

The Udacity Capstone offers several project options to choose from for the final submission. I chose the Starbucks project.

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app.

Starbucks has provided 3 JSON files:

  • portfolio.json: offer ids and metadata about each offer (duration, type, etc.)
  • profile.json: demographic data for each customer
  • transcript.json: records for transactions, offers received, offers viewed, and offers completed
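As a rough sketch, the three files listed above could be loaded with pandas as shown here. The `data` directory, the `load_datasets` helper name, and the one-record-per-line layout (`lines=True`) are my assumptions, so adjust them to match your copy of the files.

```python
from pathlib import Path

import pandas as pd


def load_datasets(data_dir="data"):
    """Read the three Starbucks files into DataFrames.

    Assumes each file holds one JSON record per line (lines=True);
    change the read_json arguments if your files are laid out differently.
    """
    data_dir = Path(data_dir)
    portfolio = pd.read_json(data_dir / "portfolio.json", orient="records", lines=True)
    profile = pd.read_json(data_dir / "profile.json", orient="records", lines=True)
    transcript = pd.read_json(data_dir / "transcript.json", orient="records", lines=True)
    return portfolio, profile, transcript


# Example usage (requires the files to exist in ./data):
# portfolio, profile, transcript = load_datasets("data")
```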

For this analysis I follow the CRISP-DM process, which has 6 stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

The first stage is Business Understanding.

Business Understanding

Looking at the Starbucks data, these are the questions that came to my mind:

Q.1 What offers are popular among all of the users?
Q.2 Which group of people is more likely to use the offer?
Q.3 Which types of offer are used without the customer even noticing that the offer was received?

Many other questions could be asked, and everyone will approach the data differently. Framing these questions is the important first stage of CRISP-DM, known as Business Understanding.

Data Understanding

In this step, we analyzed the data by looking at each dataset individually (profile, portfolio and transcript).

  • Portfolio data: Starbucks provides 10 unique offers across 3 offer types: 4 each for BOGO and discount, and the remaining 2 informational.
  • Demographic data: there are missing values and, in one column, invalid values. The count of age 118 is 2175, and we saw the same count for null values of income and gender. In the Udacity program, we learned that if every column of a row is either invalid (age, as the histogram shows) or missing (gender and income), the row should be dropped as part of handling missing data. We handle this in the Data Preprocessing step.
  • Transcript data: the number of unique persons in the transcript is the same as the number of unique profiles, so every transaction appears to have been made by a profile that exists in our system. The value column of the transcript dataset needs to be expanded, which we also handle in the Data Preprocessing step.
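The checks described above can be sketched on a tiny made-up sample. The rows below are invented for illustration, and the column names (`id`, `age`, `gender`, `income`, `person`) are assumed to match the real files.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; the real profile.json is much larger.
profile = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "age": [118, 45, 118, 62],
    "gender": [None, "F", None, "M"],
    "income": [np.nan, 72000.0, np.nan, 51000.0],
})
transcript = pd.DataFrame({
    "person": ["p1", "p2", "p2", "p3", "p4"],
    "event": ["offer received", "transaction", "offer viewed",
              "offer received", "transaction"],
})

# Rows where age holds the placeholder value 118.
placeholder_age = profile["age"] == 118
# Rows where both gender and income are missing.
missing_both = profile["gender"].isna() & profile["income"].isna()

n_placeholder = int(placeholder_age.sum())
n_missing_both = int(missing_both.sum())
# True when the two conditions flag exactly the same rows.
same_rows = bool((placeholder_age == missing_both).all())

# People who appear in the transcript but have no profile.
unknown_people = set(transcript["person"]) - set(profile["id"])

print(n_placeholder, n_missing_both, same_rows, unknown_people)
```

If `same_rows` is true on the real data, dropping the age 118 rows also removes every row with missing gender and income.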

Data Preprocessing

These are the steps we took to prepare the data for the later stages:

  • Handle missing or invalid data
  • Encode categorical data
  • Expand the value column
  • Format the dataset as needed
  • Merge all 3 datasets and drop unnecessary columns
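Here is a minimal sketch of how these preprocessing steps might look in pandas. The sample rows are made up, and the keys inside the value column (`offer id`, `offer_id`, `amount`, `reward`) are my assumption about how the dictionaries are laid out, so treat this as an illustration and not the exact notebook code.

```python
import pandas as pd

# Made-up sample rows standing in for the three datasets.
profile = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "age": [118, 45, 62],
    "gender": [None, "F", "M"],
    "income": [None, 72000.0, 51000.0],
})
portfolio = pd.DataFrame({
    "id": ["o1", "o2"],
    "offer_type": ["bogo", "discount"],
    "duration": [7, 10],
})
transcript = pd.DataFrame({
    "person": ["p2", "p2", "p3", "p1"],
    "event": ["offer received", "transaction", "offer completed", "offer received"],
    "value": [{"offer id": "o1"}, {"amount": 4.5},
              {"offer_id": "o2", "reward": 2}, {"offer id": "o1"}],
    "time": [0, 6, 12, 0],
})

# 1. Handle missing or invalid data: drop placeholder ages and null rows.
clean_profile = profile[profile["age"] != 118].dropna(subset=["gender", "income"])

# 2. Encode categorical data: one-hot encode gender.
clean_profile = pd.get_dummies(clean_profile, columns=["gender"], prefix="gender")

# 3. Expand the value column: one column per dictionary key.
values = pd.DataFrame(transcript["value"].tolist(), index=transcript.index)
# Assumed quirk: the offer id appears under two spellings, so merge them.
values["offer_id"] = values["offer_id"].fillna(values["offer id"])
values = values.drop(columns="offer id")
expanded = pd.concat([transcript.drop(columns="value"), values], axis=1)

# 4. Merge all 3 datasets and drop the duplicated id columns.
merged = (
    expanded
    .merge(clean_profile, left_on="person", right_on="id", how="inner")
    .drop(columns="id")
    .merge(portfolio, left_on="offer_id", right_on="id", how="left")
    .drop(columns="id")
)
print(merged)
```

The inner merge with the cleaned profile also removes transcript rows that belong to dropped profiles, which keeps the later analysis consistent.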

Data Analysis

These are the questions we analyzed in this section:

Q.1 What offers are popular among all of the users?
Q.2 Which group of people is more likely to use the offer?
Q.3 Which types of offer are used without the customer even noticing that the offer was received?

Q.1 What offers are popular among all of the users?

Analysis: The discount offer is more popular. Not only is its absolute number of ‘offer completed’ events slightly higher than for the BOGO offer, its overall completed/received rate is also about 7% higher. However, the BOGO offer has a much greater chance of being viewed by customers.
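To make the completed/received and viewed/received rates concrete, here is a small sketch of how they could be computed per offer type. The event counts below are invented for the example and are not the real Starbucks numbers.

```python
import pandas as pd

# Invented counts: 4 received / 4 viewed / 2 completed for bogo,
# 4 received / 2 viewed / 3 completed for discount.
events = pd.DataFrame({
    "offer_type": ["bogo"] * 10 + ["discount"] * 9,
    "event": (
        ["offer received"] * 4 + ["offer viewed"] * 4 + ["offer completed"] * 2
        + ["offer received"] * 4 + ["offer viewed"] * 2 + ["offer completed"] * 3
    ),
})

# Count each event per offer type, one column per event.
counts = events.groupby(["offer_type", "event"]).size().unstack(fill_value=0)

# Rates relative to the number of offers received.
rates = pd.DataFrame({
    "view_rate": counts["offer viewed"] / counts["offer received"],
    "completion_rate": counts["offer completed"] / counts["offer received"],
})
print(rates)
```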

Q.2 Which group of people is more likely to use the offer?

Analysis

The peak of offers completed comes slightly before the peak of offers viewed in the first 5 days of the experiment.

The two sync better as time goes by, indicating that the majority of people still used the offer consciously.

Analysis

Comparing the demographic data from profile.json between customers who used our offers before viewing them and the rest of the customers, there is no significant difference. This indicates that all customers are equally likely to use our offers accidentally.

Q.3 Which types of offer are used without the customer even noticing that the offer was received?

Analysis

The design of the offer plays a big role, especially the promotion channels and the duration.

If an offer is promoted through web and email, it has a much greater chance of not being seen. Being used without viewing is also linked to the duration of the offer.

A longer duration increases the chance. The discount offer type also has a greater chance of being used without being seen compared to BOGO.
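One possible way to flag an offer as "used without viewing" is to compare the first view time with the first completion time for each person and offer. The sketch below uses made-up events and assumes each person receives a given offer only once, which is a simplification. The `completed_unseen` column name is mine.

```python
import pandas as pd

# Made-up events; time is just an increasing number here.
events = pd.DataFrame({
    "person": ["p1", "p1", "p1", "p2", "p2", "p2", "p3", "p3", "p4", "p4"],
    "offer_id": ["o1", "o1", "o1", "o1", "o1", "o1", "o2", "o2", "o2", "o2"],
    "event": [
        "offer received", "offer viewed", "offer completed",   # p1: viewed, then completed
        "offer received", "offer completed", "offer viewed",   # p2: completed before viewing
        "offer received", "offer completed",                   # p3: never viewed
        "offer received", "offer viewed",                      # p4: never completed
    ],
    "time": [0, 6, 12, 0, 18, 24, 0, 30, 0, 12],
})

# Earliest time of each event per (person, offer) pair.
first_times = events.pivot_table(
    index=["person", "offer_id"], columns="event", values="time", aggfunc="min"
)

completed = first_times["offer completed"].notna()
# Comparisons with NaN are False, so a missing view counts as "not viewed first".
viewed_first = first_times["offer viewed"] <= first_times["offer completed"]
first_times["completed_unseen"] = completed & ~viewed_first

print(first_times["completed_unseen"])
```

Grouping this flag by channel, duration, or offer type would then give the kind of comparison discussed above.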

Modeling

In this section, I want to identify which users are likely to “waste” an offer, i.e. not use the offer or use it without viewing it.

For this I chose a logistic regression model. I also considered a decision tree, but the dataset is imbalanced and a decision tree would have required more tuning. I used GridSearchCV to tune the “C” parameter of the logistic regression model.
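A minimal sketch of tuning “C” with GridSearchCV is shown here. It runs on synthetic data from `make_classification` because the real feature matrix is not reproduced in this post, and the grid values, the precision scoring, and the 5-fold setting are my assumptions, not the exact settings I used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, mildly imbalanced data standing in for the real features.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.4, 0.6], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate values for the inverse regularization strength C (assumed grid).
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="precision",  # assumed metric; pick whichever matters for you
    cv=5,
)
search.fit(X_train, y_train)

best_c = search.best_params_["C"]
# score() uses the same metric as `scoring`, so this is test-set precision.
test_precision = search.score(X_test, y_test)
print(best_c, test_precision)
```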

Evaluation

These are the evaluation metrics I calculated, with their scores:

precision score:  0.750642546228
Confusion Matrix:
[[3063 1641]
[1500 4841]]
Accuracy: 0.715617926664

The model makes more false positive errors than false negative errors, meaning that it is more accurate at identifying which offers will be wasted than which offers will be used. This is essentially the money-saving approach for a marketing team. To improve the model, I downsampled the majority label to balance the dataset.
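As a quick check, the accuracy and the per-class recall can be recomputed from the first confusion matrix reported above. This sketch assumes the scikit-learn layout (rows are true labels, columns are predicted labels). Which label stands for a wasted offer is not shown in the output, so the classes are simply called 0 and 1 here.

```python
import numpy as np

# First confusion matrix from the evaluation output.
# Assumed layout: rows = true label, columns = predicted label.
cm = np.array([[3063, 1641],
               [1500, 4841]])

tn, fp, fn, tp = cm.ravel()

# Share of all predictions that were correct.
accuracy = (tn + tp) / cm.sum()
# How many of the true class-0 / class-1 samples were found.
recall_class0 = tn / (tn + fp)
recall_class1 = tp / (tp + fn)

print(f"accuracy={accuracy:.4f}")
print(f"recall class 0={recall_class0:.4f}, recall class 1={recall_class1:.4f}")
```

The recomputed accuracy agrees with the reported value, and the gap between the two recalls shows that the model finds one class more reliably than the other.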

These are the final scores after downsampling the majority label and balancing the dataset:

precision score:  0.765467833552
Confusion Matrix:
[[3922 764]
[1426 2417]]
Accuracy: 0.743228983468
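Downsampling the majority label can be sketched in a few lines of pandas. The frame below is a toy example with a hypothetical `label` column, and the random seed is arbitrary. The idea is simply to sample the majority class down to the size of the minority class.

```python
import pandas as pd

# Toy data: label 1 is the majority (7 rows), label 0 the minority (3 rows).
df = pd.DataFrame({
    "feature": range(10),
    "label": [1] * 7 + [0] * 3,
})

counts = df["label"].value_counts()
majority_label = counts.idxmax()

majority = df[df["label"] == majority_label]
minority = df[df["label"] != majority_label]

# Randomly keep only as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=42)

# Combine and shuffle so the labels are not grouped together.
balanced = (
    pd.concat([majority_down, minority])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["label"].value_counts())
```

Downsampling throws away data, so it trades some information for a balanced label distribution.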

Conclusion

Here I conclude:

  • Customers who joined earlier are a lot less likely to use offers.
  • Customers with incomplete profiles are less likely to use offers; it seems they are not interested in using the app.
  • Men are less likely to use offers, but they prefer discount over BOGO, while women used BOGO more.

Suggestions to reduce the number of offers used without being viewed:

  • Promote each offer via at least 3 channels to increase exposure.
  • Eliminate offers that last for 10 days; cap the duration at 7 days.
  • There is a lot of potential in the discount offer. The completion rate is 78% among those who viewed the offer. Therefore, if the company can increase the viewing rate of the discount offer, there is a great chance to incentivize more spending.

Future Improvement:

  • Incorporate the data from informational offers.
  • Improve the model accuracy by fine-tuning the model or trying tree-based models.

Soni Pandey

I am a Node.js Developer and eager to learn new technology. I blog, tweet & read whenever I can.