FitBit user data analysis

Bellabeat, a high-tech manufacturer of health-focused products for women, is asking for an analysis of existing user data collected by their competitor, Fitbit.

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Data Set

This case study was completed as part of the Google Data Analytics certification, for which fitness data from 30 Fitbit users, collected in 2016, is provided. This data is five years old, so I would not consider it relevant. However, for the purpose of this exercise I will use it regardless.

FitBit Fitness Tracker Data via Mobius

Author

[Shane Chambry](https://www.shanechambry.com)

Summary

This data has its limitations, but there are a couple of key insights we may be able to leverage to improve the customer experience.

For our participants, time spent sedentary was negatively correlated with total time spent asleep. We can suggest to them that if they are having trouble sleeping, they could try limiting the amount of time they spend sedentary.
Average steps for a given minute of the day go up sharply if a participant has a regular workout routine. With further analysis or modelling, we may be able to predict when a participant will work out next and suggest optimizations such as pre-workout hydration reminders.
The participants in this data set logged very little of their overall activity. It may be worth reconsidering if this feature is valuable to our customers.

Libraries

knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
library("tidyverse")
library("lubridate")
library("reshape2")
library("hms")
library("GGally")

A look at the data

Read the data:

# For each file, parse the date string into a new dateTime column.
dailyActivity_merged <- read_csv("data/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv") %>%
    mutate(dateTime = mdy(ActivityDate))
# dailyIntensities_merged <- read_csv("data/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv") %>%
#     mutate(dateTime = mdy(ActivityDay))
sleepDay_merged <- read_csv("data/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv") %>%
    mutate(dateTime = mdy_hms(SleepDay))
minuteStepsNarrow_merged <- read_csv("data/Fitabase Data 4.12.16-5.12.16/minuteStepsNarrow_merged.csv") %>%
    mutate(dateTime = mdy_hms(ActivityMinute))

Merge and alter types:

# dailyIntensityAndSleep <- merge(dailyIntensities_merged, sleepDay_merged, by = c("Id", "dateTime"))
# Inner join: keep only the Id/date pairs that have both activity and sleep records.
dailyActivityAndSleep <- merge(dailyActivity_merged, sleepDay_merged, by = c("Id", "dateTime")) %>%
  select(-ActivityDate, -SleepDay) %>%
  mutate(Id = as.character(Id)) # change Id from number to string
# Replace zeros with NA, assuming that the participant did not wear their tracker on that day.
dailyActivityAndSleep$TotalSteps[dailyActivityAndSleep$TotalSteps == 0] <- NA
dailyActivityAndSleep$TotalMinutesAsleep[dailyActivityAndSleep$TotalMinutesAsleep == 0] <- NA
dailyActivity_merged$TotalSteps[dailyActivity_merged$TotalSteps == 0] <- NA

str(dailyActivity_merged)

## spec_tbl_df [940 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  $ dateTime                : Date[1:940], format: "2016-04-12" "2016-04-13" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(sleepDay_merged)

## spec_tbl_df [413 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  $ dateTime          : POSIXct[1:413], format: "2016-04-12" "2016-04-13" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The sleep data has a lot of missing days, so I will only use the combined data for the analysis relating to sleep.

length(unique(dailyActivityAndSleep$Id))

## [1] 24

length(unique(dailyActivity_merged$Id))

## [1] 33

Nine of the 33 participants have no sleep data in the combined set.
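A minimal sketch, using hypothetical miniature tables, of how such a gap arises. In base R, `merge()` performs an inner join unless the lower-case argument `all = TRUE` is given, so participants without any sleep rows drop out of the combined data. The tables `activity` and `sleep` and their values are invented for illustration; only the column names mirror the real data.

```r
# Hypothetical miniature versions of the activity and sleep tables.
activity <- data.frame(
  Id = c("1001", "1001", "1002", "1003"),
  dateTime = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  TotalSteps = c(9000, 11000, 4000, 7000)
)
sleep <- data.frame(
  Id = c("1001", "1003"),
  dateTime = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(400, 380)
)

# Default merge() is an inner join: only Id/date pairs present in both tables survive.
inner <- merge(activity, sleep, by = c("Id", "dateTime"))

# all = TRUE (lower case) keeps every row and fills the gaps with NA.
outer <- merge(activity, sleep, by = c("Id", "dateTime"), all = TRUE)

# Participants with activity data but no sleep data at all.
missing_sleep_ids <- setdiff(unique(activity$Id), unique(sleep$Id))

nrow(inner)
nrow(outer)
missing_sleep_ids
```

On the real data, the same `setdiff()` call applied to `dailyActivity_merged$Id` and `sleepDay_merged$Id` would list the participants who are missing from the sleep file.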

# Count missing values per column (across() replaces the deprecated funs()).
dailyActivity_merged %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

## # A tibble: 1 x 16
##      Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesD~
##   <int>        <int>      <int>         <int>           <int>              <int>
## 1     0            0         77             0               0                  0
## # ... with 10 more variables: VeryActiveDistance <int>,
## #   ModeratelyActiveDistance <int>, LightActiveDistance <int>,
## #   SedentaryActiveDistance <int>, VeryActiveMinutes <int>,
## #   FairlyActiveMinutes <int>, LightlyActiveMinutes <int>,
## #   SedentaryMinutes <int>, Calories <int>, dateTime <int>

A day with TotalSteps equal to zero is assumed to be a day on which the participant did not wear the tracker. These zeros are changed to NA so that they do not distort our sums and averages.
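To illustrate why this matters, here is a small sketch with made-up step counts (the vector `steps` is hypothetical, not taken from the data set). It compares the average when zero-step days are kept with the average when they are treated as missing.

```r
# Hypothetical step counts for one participant; 0 marks a day the tracker was not worn.
steps <- c(12000, 0, 8000, 10000, 0)

# Keeping the zeros drags the average down.
mean_with_zeros <- mean(steps)

# Treat zero-step days as missing instead.
steps_na <- replace(steps, steps == 0, NA)

# mean() returns NA when any value is missing, unless na.rm = TRUE is set.
mean_default <- mean(steps_na)
mean_without_zeros <- mean(steps_na, na.rm = TRUE)

mean_with_zeros
mean_without_zeros
```

Note that once the zeros are NA, every later `mean()` or `sum()` needs `na.rm = TRUE` (or an explicit `na.omit()`), otherwise the result is NA.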

Correlations

First, let’s take the combined daily activity data and see if there are any obvious correlations.

ggcorr(dailyActivityAndSleep, method = c("everything", "pearson"))

This implies a few obvious things:

The more steps you take, the farther you go.
More time spent very active contributes to the distance you travel while very active.
More time spent in bed is more time spent asleep.

There are two other key takeaways:

Time spent sedentary correlates negatively with time in bed and time asleep.
Calories burned correlates most strongly with time spent very active.

Let’s take a closer look.

Sleep Time and Sedentary Time

ggplot(dailyActivityAndSleep, aes(x=SedentaryMinutes, y=TotalMinutesAsleep)) +
  geom_hex()+
  labs(title = "Sleep Time vs Time Spent Being Sedentary", x="Time Spent Being Sedentary Today (Minutes)", y="Time Spent Asleep Today (Minutes)")

More data would be required to confirm the relationship, and a correlation on its own does not show that one causes the other. Still, one thing we could suggest to users who are not getting enough sleep is to spend less time sedentary.
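One way to put a number on a relationship like this is a Pearson correlation test. The sketch below runs `cor.test()` from base R on simulated data, not on the Fitbit data. The sample size, the slope of -0.25 and the noise level are assumptions chosen purely for illustration.

```r
set.seed(42)  # make the simulated data reproducible

n <- 200
sedentary_minutes <- runif(n, min = 400, max = 1200)

# Assumed relationship, for illustration only:
# more sedentary time means slightly less sleep, plus random noise.
minutes_asleep <- 600 - 0.25 * sedentary_minutes + rnorm(n, sd = 60)

result <- cor.test(sedentary_minutes, minutes_asleep, method = "pearson")

result$estimate  # Pearson correlation coefficient
result$conf.int  # 95% confidence interval for the coefficient
result$p.value   # p-value for the null hypothesis of zero correlation
```

With the real data the call would be `cor.test(dailyActivityAndSleep$SedentaryMinutes, dailyActivityAndSleep$TotalMinutesAsleep)`. A confidence interval that excludes zero would support the negative trend, though it would still say nothing about cause and effect.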

Calories and Very Active Minutes

groupedActivitySleep <- dailyActivityAndSleep %>%
  na.omit()
  

ggplot(groupedActivitySleep, aes(x=VeryActiveMinutes, y=Calories)) +
  geom_hex() +
  labs(title = "Calories Burned vs Time Spent Being Very Active per day", x="Time Spent Being Very Active in a Day (Minutes)", y="Calories Burned in a Day (kCal)")

Calories and Steps

groupedActivitySleep <- dailyActivityAndSleep %>%
  na.omit() %>%
  group_by(Id, dateTime)
  

ggplot(groupedActivitySleep, aes(x=TotalSteps, y=Calories, color = Id)) +
  geom_point() +
  labs(title = "Calories Burned vs Total Steps", x="Total Steps Today", y="Calories Burned Today (kCal)")

There seems to be a positive trend between calories burned and both steps and time spent very active. However, since a tracker cannot directly measure calories burned, it is a safe assumption that Calories for a participant is calculated from their steps (along with other calibrations such as height, weight, or age). This measurement therefore cannot be used to support the hypothesis that more steps are correlated with more calories burned, but it is a well-established trend and it is safe to recommend higher activity levels to customers who want to burn more calories.

Steps per minute over the course of an average day.

Humans are creatures of habit. Especially when it comes to working out, we tend to do it at the same time each day as part of our routine. Let’s have a look at average steps participants took throughout the day.

stepsByMinuteAndPerson <- minuteStepsNarrow_merged %>% 
  na.omit() %>%
  mutate(dateTime = mdy_hms(ActivityMinute)) %>%
  mutate(Time = as_hms(dateTime)) %>%
  mutate(Id = as.character(Id)) %>%
#  select(-ActivityMinute) %>% 
  group_by(Id, Time) %>%
  summarise(avgStepsThisMinute = mean(Steps))

ggplot(stepsByMinuteAndPerson, aes(x = Time, y = avgStepsThisMinute, color = Id)) +
  geom_line() +
  facet_wrap(~Id)

Large peaks in average steps per minute indicate a daily routine. Using this approach, we can try to remind users to drink water when their anticipated workout time is approaching, to help ensure an optimal workout. We can also use this opportunity to advertise our water bottle.
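As a rough sketch of how such a reminder could be derived, the code below finds each participant's busiest time of day and schedules a reminder shortly before it. The data frame `steps_by_minute` is a hypothetical stand-in for `stepsByMinuteAndPerson`, with times stored as strings, and the 30-minute lead time is an assumption rather than a tested value.

```r
# Hypothetical per-minute averages for two participants
# (column names mirror stepsByMinuteAndPerson).
steps_by_minute <- data.frame(
  Id = rep(c("A", "B"), each = 4),
  Time = rep(c("06:30:00", "12:00:00", "18:00:00", "21:00:00"), times = 2),
  avgStepsThisMinute = c(5, 12, 95, 3,    # participant A peaks at 18:00
                         80, 10, 15, 2),  # participant B peaks at 06:30
  stringsAsFactors = FALSE
)

# For each participant, keep the row with the highest average step count.
peak_rows <- lapply(split(steps_by_minute, steps_by_minute$Id), function(d) {
  d[which.max(d$avgStepsThisMinute), c("Id", "Time", "avgStepsThisMinute")]
})
peak_by_participant <- do.call(rbind, peak_rows)

# Assumed rule: send a hydration reminder 30 minutes before the peak.
peak_time <- as.POSIXct(peak_by_participant$Time, format = "%H:%M:%S", tz = "UTC")
peak_by_participant$reminderTime <- format(peak_time - 30 * 60, "%H:%M")

peak_by_participant
```

A single maximum is a crude proxy for a routine. A more robust version would smooth the curve first and check that the peak recurs on most days before sending any reminder.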

Logged Activities vs Distance

Here we compare the average logged activities distance and the distance traveled as calculated by the hardware.

TrackedVSLogged <- dailyActivity_merged %>%
  na.omit() %>%
  select(dateTime, LoggedActivitiesDistance, TrackerDistance) %>% 
  group_by(dateTime) %>%
  summarize_all(mean) %>%
  melt(id = "dateTime")
  
ggplot(TrackedVSLogged, aes(x = dateTime, y = value, color = variable)) +
  geom_line()

The average user will not log all of their movements, and their actual distance traveled will be much higher than their logged distance. It could be worth exploring what value the logging feature has to our users. If the user is simply shown their actual distance traveled in a day, they may be more motivated to continue.
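To quantify how little is logged, one could compute the share of days with any logged distance and the ratio of logged to tracked distance per participant. The sketch below does this in base R on a hypothetical data frame `daily`, whose values are invented and whose column names mirror `dailyActivity_merged`.

```r
# Hypothetical daily distances for three participants.
daily <- data.frame(
  Id = c("A", "A", "B", "B", "C", "C"),
  TrackerDistance = c(5.0, 6.0, 4.0, 4.0, 8.0, 2.0),
  LoggedActivitiesDistance = c(0.0, 2.2, 0.0, 0.0, 4.0, 1.0)
)

# Share of days on which anything was logged at all.
share_days_logged <- mean(daily$LoggedActivitiesDistance > 0)

# Per participant: total logged distance divided by total tracked distance.
totals <- aggregate(
  cbind(LoggedActivitiesDistance, TrackerDistance) ~ Id,
  data = daily,
  FUN = sum
)
totals$loggedShare <- totals$LoggedActivitiesDistance / totals$TrackerDistance

share_days_logged
totals
```

A low `loggedShare` for most participants would back up the impression from the line chart that manual logging captures only a small part of actual movement.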

Smart Device Market Analysis