Bellabeat, a high-tech manufacturer of health-focused products for women, has asked for an analysis of existing user data collected by its competitor, Fitbit.
This data has its limitations, but there are a couple of key insights we may be able to leverage to improve the customer experience.
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
library("tidyverse")
library("lubridate")
library("reshape2")
library("hms")
library("GGally")
Read the data:
# for each, date is converted from string to datetime
dailyActivity_merged <- read_csv("data/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")%>%
mutate(dateTime = mdy(ActivityDate))
# dailyIntensities_merged <- read_csv("data/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv") %>%
# mutate(dateTime = mdy(ActivityDay))
sleepDay_merged <- read_csv("data/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv") %>%
mutate(dateTime = mdy_hms(SleepDay))
minuteStepsNarrow_merged <- read_csv("data/Fitabase Data 4.12.16-5.12.16/minuteStepsNarrow_merged.csv") %>%
mutate(dateTime = mdy_hms(ActivityMinute))
Merge and alter types:
# dailyIntensityAndSleep <- merge(dailyIntensities_merged, sleepDay_merged, by=c("Id","dateTime"), All = TRUE)
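# merge() arguments are case-sensitive: `All = TRUE` is not the `all` argument and is silently ignored,
# so this is an inner join that keeps only days with both activity and sleep records.
dailyActivityAndSleep <- merge(dailyActivity_merged, sleepDay_merged, by = c("Id", "dateTime")) %>%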
select(-ActivityDate, -SleepDay) %>%
mutate(Id = as.character(Id)) # change Id from number to string
# Replace zeros in step count and minutes asleep with NA, assuming that the participant did not wear their tracker on those days.
dailyActivityAndSleep$TotalSteps[dailyActivityAndSleep$TotalSteps == 0] <- NA
dailyActivityAndSleep$TotalMinutesAsleep[dailyActivityAndSleep$TotalMinutesAsleep == 0] <- NA
dailyActivity_merged$TotalSteps[dailyActivity_merged$TotalSteps == 0] <- NA
str(dailyActivity_merged)
## spec_tbl_df [940 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## $ dateTime : Date[1:940], format: "2016-04-12" "2016-04-13" ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(sleepDay_merged)
## spec_tbl_df [413 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## $ dateTime : POSIXct[1:413], format: "2016-04-12" "2016-04-13" ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The sleep data has a lot of missing days, so we will only use the combined data for the analysis relating to sleep.
length(unique(dailyActivityAndSleep$Id))
## [1] 24
length(unique(dailyActivity_merged$Id))
## [1] 33
Nine of the 33 participants did not take part in the sleep portion of the study.
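The difference between the two counts follows from how the tables are joined. The sketch below uses two tiny made-up tables (the `activity` and `sleep` data frames and their values are hypothetical, not taken from the Fitbit files) to show how an inner join drops participants without sleep records, and how `setdiff()` can list them.

```r
# Hypothetical miniature versions of the two tables (made-up Ids and values).
activity <- data.frame(Id = c("a", "a", "b", "c"),
                       day = c(1, 2, 1, 1),
                       TotalSteps = c(9000, 7000, 4000, 12000))
sleep <- data.frame(Id = c("a", "b"),
                    day = c(1, 1),
                    TotalMinutesAsleep = c(400, 380))

# Default merge() is an inner join: only Id/day pairs present in both tables.
inner <- merge(activity, sleep, by = c("Id", "day"))

# all = TRUE (lower case) keeps unmatched rows and fills the gaps with NA.
outer <- merge(activity, sleep, by = c("Id", "day"), all = TRUE)

# Participants with activity data but no sleep records at all.
no_sleep <- setdiff(unique(activity$Id), unique(sleep$Id))

print(nrow(inner))
print(nrow(outer))
print(no_sleep)
```

With an inner join, participant `c` disappears from the combined table entirely, which is the same effect that reduces 33 participants to 24 above.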
dailyActivity_merged %>%
summarise(across(everything(), ~ sum(is.na(.x)))) # funs() is deprecated since dplyr 0.8.0
## # A tibble: 1 x 16
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesD~
## <int> <int> <int> <int> <int> <int>
## 1 0 0 77 0 0 0
## # ... with 10 more variables: VeryActiveDistance <int>,
## # ModeratelyActiveDistance <int>, LightActiveDistance <int>,
## # SedentaryActiveDistance <int>, VeryActiveMinutes <int>,
## # FairlyActiveMinutes <int>, LightlyActiveMinutes <int>,
## # SedentaryMinutes <int>, Calories <int>, dateTime <int>
A day with TotalSteps equal to zero is assumed to be a day on which the participant did not wear the tracker. These zeros are changed to NA so that they do not distort our sums and averages.
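To see why the recoding matters, here is a minimal sketch with a made-up vector of step counts (the values are illustrative only). It compares the average with zeros kept, with zeros recoded to NA, and shows that `mean()` needs `na.rm = TRUE` once NAs are present.

```r
# Toy step counts for one hypothetical participant; 0 marks a day the
# tracker is assumed not to have been worn (made-up values).
steps <- c(8000, 0, 12000, 10000)

# Treating zeros as real observations drags the average down.
mean_with_zeros <- mean(steps)

# Recode zeros as NA, then ask mean() to skip missing values.
steps_na <- replace(steps, steps == 0, NA)
mean_without_zeros <- mean(steps_na, na.rm = TRUE)

# Without na.rm = TRUE the result is NA rather than a number.
mean_unhandled <- mean(steps_na)

print(c(with_zeros = mean_with_zeros, without_zeros = mean_without_zeros))
```

Dropping incomplete rows with `na.omit()` before aggregating, as done later in this analysis, is an alternative to passing `na.rm = TRUE`.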
First, let’s take the combined daily activity data and see if there are any obvious correlations.
ggcorr(dailyActivityAndSleep, method = c("everything", "pearson"))
This implies a few obvious things:
There are two other key takeaways:
Let’s take a closer look.
ggplot(dailyActivityAndSleep, aes(x=SedentaryMinutes, y=TotalMinutesAsleep)) +
geom_hex()+
labs(title = "Sleep Time vs Time Spent Being Sedentary", x="Time Spent Being Sedentary Today (Minutes)", y="Time Spent Asleep Today (Minutes)")
More data would be required to confirm the relationship, but one thing we could recommend to users who have poor quality sleep is to spend less time sedentary.
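One way to judge how much confidence the relationship deserves is a correlation test with a confidence interval. The sketch below runs base R's `cor.test()` on a small made-up data frame (`toy` and its values are hypothetical and only borrow the column names used above); with the real data, `dailyActivityAndSleep` would take its place.

```r
# Made-up daily values for illustration only; not taken from the Fitbit data.
toy <- data.frame(
  SedentaryMinutes   = c(600, 650, 700, 750, 800, 850, 900, NA),
  TotalMinutesAsleep = c(480, 470, 450, 455, 420, 400, 390, 410)
)

# Keep only days where both measurements are present.
complete <- toy[complete.cases(toy), ]

# Pearson correlation with a confidence interval and p-value.
result <- cor.test(complete$SedentaryMinutes, complete$TotalMinutesAsleep,
                   method = "pearson")

print(result$estimate)  # correlation coefficient r
print(result$conf.int)  # 95% confidence interval for r
print(result$p.value)
```

A wide confidence interval would support the caution above that more data is needed. Note also that repeated days from the same participant are not independent observations, so a test like this is likely to overstate certainty on the real data.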
groupedActivitySleep <- dailyActivityAndSleep %>%
na.omit()
ggplot(groupedActivitySleep, aes(x=VeryActiveMinutes, y=Calories)) +
geom_hex()+
labs(title = "Calories Burned vs Time Spent Being Very Active per day", x="Time Spent Being Very Active in a Day (Minutes)", y="Calories Burned in a Day (kCal)")
groupedActivitySleep <- dailyActivityAndSleep %>%
na.omit() %>%
group_by(Id, dateTime)
ggplot(groupedActivitySleep, aes(x=TotalSteps, y=Calories, color = Id)) +
geom_point()+
labs(title = "Calories Burned vs Total Steps", x="Total Steps Today", y="Calories Burned Today (kCal)")
There seems to be a positive trend between calories burned and both total steps and time spent very active. However, since the tracker cannot measure calories burned directly, it is a safe assumption that the Calories value for a participant is calculated from their steps (plus calibrations such as height, weight, or age). This measurement therefore cannot be used to independently support the hypothesis that more steps are correlated with more calories burned, but it is a well-established trend and it is safe to recommend higher activity levels to customers who want to burn more calories.
Humans are creatures of habit. Especially when it comes to working out, we tend to do it at the same time each day as part of our routine. Let’s have a look at the average number of steps participants took at each minute of the day.
stepsByMinuteAndPerson <- minuteStepsNarrow_merged %>%
na.omit() %>% # dateTime was already parsed with mdy_hms() when the file was read
mutate(Time = as_hms(dateTime)) %>%
mutate(Id = as.character(Id)) %>%
# select(-ActivityMinute) %>%
group_by(Id, Time) %>%
summarise(avgStepsThisMinute = mean(Steps))
ggplot(stepsByMinuteAndPerson, aes(x = Time, y = avgStepsThisMinute, color = Id)) +
geom_line() +
facet_wrap(~Id)
Large peaks in average steps per minute indicate a daily routine. Using this approach, we can try to remind users to drink water when their anticipated workout time is approaching, to ensure an optimal workout. We can use this opportunity to advertise our water bottle.
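As a rough sketch of how such a reminder time could be derived, the code below picks each participant's busiest time of day from a table shaped like `stepsByMinuteAndPerson`. The `toy_steps` data, the seconds-since-midnight representation of `Time`, and the 30-minute lead are all assumptions made for illustration.

```r
# Hypothetical per-minute averages in the same shape as stepsByMinuteAndPerson
# (Id, Time, avgStepsThisMinute); values are made up. Time is stored here as
# seconds since midnight to keep the example free of package dependencies.
toy_steps <- data.frame(
  Id = rep(c("u1", "u2"), each = 4),
  Time = rep(c(6, 12, 18, 21) * 3600, times = 2),
  avgStepsThisMinute = c(5, 10, 60, 8, 70, 12, 9, 4)
)

# For each participant, keep the row with the highest average step rate.
peak_rows <- lapply(split(toy_steps, toy_steps$Id), function(d) {
  d[which.max(d$avgStepsThisMinute), ]
})
peaks <- do.call(rbind, peak_rows)

# Express the peak as an hour of day, and a reminder time 30 minutes earlier
# (the 30-minute lead is an arbitrary illustrative choice).
peaks$peakHour <- peaks$Time / 3600
peaks$remindAtHour <- (peaks$Time - 30 * 60) / 3600

print(peaks[, c("Id", "peakHour", "remindAtHour")])
```

On real data a single-minute maximum is likely to be noisy, so averaging over a wider window (for example, by hour) before taking the maximum may give a more stable estimate.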
Here we compare the average logged activities distance and the distance traveled as calculated by the hardware.
TrackedVSLogged <- dailyActivity_merged %>%
na.omit() %>%
select(dateTime, LoggedActivitiesDistance, TrackerDistance) %>%
group_by(dateTime) %>%
summarize_all(mean) %>%
melt(id = "dateTime")
ggplot(TrackedVSLogged, aes(x = dateTime, y = value, color = variable)) +
geom_line()
The average user will not log all of their movements, and their actual distance traveled will be much higher than their logged distance. It could be worth exploring what value the logging feature has to our users. If the user is simply shown their actual distance traveled in a day, they may be more motivated to continue.
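To put a number on how little the logging feature is used, one could compute the share of days with any logged distance and the ratio of logged to tracked distance. The sketch below does this on a made-up data frame (`toy_activity` and its values are hypothetical); with the real data, `dailyActivity_merged` would take its place.

```r
# Made-up daily distances for illustration only (same column names as above).
toy_activity <- data.frame(
  LoggedActivitiesDistance = c(0, 0, 2.5, 0, 0),
  TrackerDistance          = c(5.0, 6.2, 7.1, 4.4, 8.0)
)

# Share of days on which the user logged any activity manually.
share_days_logged <- mean(toy_activity$LoggedActivitiesDistance > 0)

# Logged distance as a fraction of the distance measured by the tracker.
logged_ratio <- sum(toy_activity$LoggedActivitiesDistance) /
  sum(toy_activity$TrackerDistance)

print(share_days_logged)
print(round(logged_ratio, 3))
```

Computing the same two figures per participant, rather than overall, would show whether logging is rare for everyone or concentrated in a few engaged users.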