Investigating the relationship between physical activity and mortality through accelerometry data analysis

Quan Bui, Emily Johnson
August 12, 2016

Summary

The effects of physical activity on mortality were studied using accelerometry data from the National Health and Nutrition Examination Survey (NHANES). Exploratory analysis of the activity data revealed significant differences in activity level between living and deceased subjects, with the largest difference occurring during the late morning (10am to 12pm). A logistic regression model was fit that could predict the probability of death up to 9 years post-study from a subject’s age, comorbidities, and activity level. These same predictors remained significant when used to fit a Cox proportional hazards model for time until death. To evaluate how far ahead activity level can predict death, a sensitivity analysis was performed. Average daily activity was found to be predictive up to 6 years post-study, while percentage of daytime activity was predictive only up to 1.5 years post-study. Further logistic regression models were fit to predict death up to 5 years post-study and to find the best functional forms of these predictors. The models were evaluated using receiver operating characteristic (ROC) analysis with 10-fold cross-validation. Daily activity count and variation, as well as the allocation of activity during certain periods of night and day, appeared to have a strong effect on mortality risk. The area under the ROC curve (AUC) improved by 0.0243 (2.43 percentage points) when these activity-related predictors were included in the model. In conclusion, the volume of physical activity, as well as the circadian rhythms of physical activity, appear to have strong effects on risk of death.

Introduction

There are a variety of health characteristics that have a complex effect on survival and mortality. Individuals who are healthy and active are expected to live longer lives. The National Health and Nutrition Examination Survey (NHANES) is a program that studied subjects’ activity levels throughout a 1-week study. NHANES provided health and demographic factors of each subject along with their daily activity data. The analysis utilized NHANES data from 2003-2004 and 2005-2006 cohorts. The following heat map shows the distribution of activity among all age groups that participated in the study.


To analyze the effects of physical activity on mortality, the raw data was cleaned and stored as AllDataHour50. This data set includes activity data recorded as log-transformed activity counts and excludes subjects who failed to record enough daily activity. Each subject also had post-study follow-up sessions to keep track of their survival status.

Exploratory Activity Data Analysis

To begin investigating the relationship between physical activity and mortality, exploratory analysis was performed on AllDataHour50 to uncover patterns and trends that could be further studied. It was initially found that for both the average total log-transformed activity counts and for the percentage of total log-transformed activity counts, there was a visible difference in value between subjects who were alive and those who died. Subjects who were still living had higher average total log activity counts when compared to those who died. When the deceased subjects were grouped by time of mortality, there appeared to be a direct correlation between activity and time of death. Those who were less active seemed to have earlier deaths. Meanwhile, when looking at the percentage of total log activity counts allocated into each hour of the day, subjects who were still living appeared to be more active earlier in the day than those who died. After grouping the deceased subjects by time of mortality, those who had activity allocated later in the day seemed to have earlier deaths. It was posited that this reflected the relationship between circadian rhythms of physical activity and mortality, as already concluded in previous analyses.

Particular interest was taken in thTAC06, the average activity during the time interval of 10am to 12pm. Exploratory analysis showed that this interval had the largest difference in average total log activity count between living and deceased subjects. The plot below shows the density distributions of thTAC06 by mortality status. The distribution for subjects still alive appears to be centered on a higher amount of activity for this interval.

A Wilcoxon rank-sum test confirmed that there was a statistically significant difference in total activity for 10am to 12pm between these two groups. Because the test is rank-based, it indicates a shift in the distribution of activity rather than a difference in means specifically.
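As a rough illustration of the test described above, the following sketch runs a Wilcoxon rank-sum test in base R. The data are simulated stand-ins for thTAC06, not the NHANES values, and the group sizes, means, and spreads are assumptions chosen only for illustration. The sketch also ignores survey weights.

```r
# Simulated stand-in for thTAC06 (10am to 12pm activity); NOT the NHANES data.
# Group sizes, means, and standard deviations are illustrative assumptions.
set.seed(1)
alive    <- rnorm(200, mean = 400, sd = 120)
deceased <- rnorm(60,  mean = 300, sd = 120)

# Two-sided Wilcoxon rank-sum (Mann-Whitney) test for a shift in location
# between the two groups. Unweighted: survey weights are not used here.
res <- wilcox.test(alive, deceased, alternative = "two.sided")
print(res$p.value)
```

A small P-value here would indicate that one group tends to have higher activity than the other, without assuming that activity is normally distributed.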

Logistic Regression Model Building

Logistic regression with survey weighting was used to study whether physical activity could predict the probability of mortality. Initial models were built to predict death within 1 year post-study based on non-activity-related covariates: age, sex, race, BMI, and a variety of comorbidities such as cancer and diabetes. Significant predictors included Age, Cancer, and history of congestive heart failure (CHF). Then, average total log activity counts grouped into two-hour intervals were added to the model one at a time, and analysis of variance tests between successive models were used to test the significance of each added coefficient.
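The step of adding an activity interval and testing it against the previous model can be sketched as a nested-model comparison. This is a minimal example under stated assumptions: the data are simulated, the variable names (age, act_10_12, died) are hypothetical, and an unweighted glm stands in for the survey-weighted models used in the report.

```r
# Simulated data; variable names and effect sizes are illustrative assumptions.
set.seed(5)
n <- 800
age       <- runif(n, 50, 85)
act_10_12 <- rnorm(n)  # hypothetical standardized 10am to 12pm activity
died      <- rbinom(n, 1, plogis(-6 + 0.07 * age - 0.6 * act_10_12))

# Baseline model without activity, then the same model plus one activity term.
base_fit <- glm(died ~ age, family = binomial)
act_fit  <- glm(died ~ age + act_10_12, family = binomial)

# Likelihood-ratio (analysis of deviance) test between the nested models.
lrt   <- anova(base_fit, act_fit, test = "Chisq")
p_lrt <- lrt[2, "Pr(>Chi)"]
print(p_lrt)
```

Repeating this comparison for each two-hour interval gives one P-value per candidate activity term.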

When activity was introduced as a predictor of probability of death during model building, thTAC06, a subject’s average total log activity from 10am to 12pm, was again found to be significant. However, thTAC06 is difficult to interpret and is not standardized, so it was eventually replaced in the model. The final model instead included two other activity-related variables: pDaytime, the percentage of activity occurring during the daytime (defined as 6am to 6pm), and dTAC.norm, the normalized daily total log activity count. The final model was able to predict probability of death up to 9 years post-study.


Call:
svyglm(formula = mortstat ~ Age + CHF + Cancer + pDaytime + dTAC.norm, 
    design = dataSvy, family = quasibinomial())

Survey design:
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~wtmec4yr_adj, 
    data = mort.df, nest = TRUE)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.072752   0.826881  -8.554 6.79e-09 ***
Age          0.088831   0.010173   8.732 4.59e-09 ***
CHF          1.199219   0.193772   6.189 1.80e-06 ***
Cancer       0.366728   0.166562   2.202  0.03713 *  
pDaytime    -0.017853   0.005892  -3.030  0.00562 ** 
dTAC.norm   -0.659153   0.099713  -6.610 6.32e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.9400132)

Number of Fisher Scoring iterations: 6
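The coefficients above are on the log-odds scale. As a reading aid, the sketch below exponentiates them into odds ratios and computes a predicted probability for one hypothetical subject. The subject's covariate values (age 70, no CHF, no cancer, pDaytime of 70, dTAC.norm of 0) are assumptions chosen for illustration and do not come from the data.

```r
# Coefficients copied from the svyglm summary above.
intercept <- -7.072752
coefs <- c(Age = 0.088831, CHF = 1.199219, Cancer = 0.366728,
           pDaytime = -0.017853, dTAC.norm = -0.659153)

# Odds ratio for a one-unit increase in each predictor.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Hypothetical subject (assumed values, for illustration only).
subject <- c(Age = 70, CHF = 0, Cancer = 0, pDaytime = 70, dTAC.norm = 0)

# Linear predictor on the log-odds scale, then inverse logit for probability.
log_odds <- intercept + sum(coefs * subject[names(coefs)])
p_death  <- plogis(log_odds)
print(p_death)
```

An odds ratio below 1, as for pDaytime and dTAC.norm, corresponds to lower odds of death as the predictor increases.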

Survival Analysis

Kaplan-Meier curves were used to relate time until death to several covariates in order to examine their associations with mortality. The analysis focused on sex, age, BMI, and activity level as predictors of time until death over a span of 10 years. Survey weighting was implemented to ensure that the data were representative of the population of interest. The weights included age, sex, and BMI, along with demographic and health factors. The survival analyses yielded similar results with and without the weighting adjustment.

To analyze Age as a predictor, subjects were split into three age groups: 50-69, 70-78, and 79-84 years old. The resulting survival curve showed that within 10 years post-study, only 10% of the first age group had died, compared to 23% of 70-78 year olds and about 50% of the eldest age group. The Kaplan-Meier curve for sex showed that within 10 years post-study, 20% of men had died, compared to about 15% of women. These analyses gave intuitive results: women lived longer than men, and older people had a higher risk of death.

However, the most interesting survival curves were those analyzing BMI and total log activity count. For BMI, the subjects were split into four groups: underweight (BMI below 18.5), normal weight (18.5 to below 25), overweight (25 to below 30), and obese (30 to below 60). The Kaplan-Meier curve showed that underweight subjects had a 30% probability of death within 10 years after participating in the study. Normal, overweight, and obese subjects had only about a 20% probability of death. This analysis suggests that people over 50 are more likely to die if they are underweight. Although this result is open to interpretation, it has been suspected that older people are more likely to be underweight if they have a serious health problem, such as cancer, and would therefore be at higher risk of death.

For activity count, the subjects’ activity during the interval of 10am to 12pm (thTAC06) was analyzed again. The subjects were split into four groups based on their average activity at that time during the week they participated in the study. Subjects with an activity level below 200 had the highest likelihood of mortality, at 30%. Subjects with an activity level from 200 to below 400 had less than a 20% likelihood of mortality, while subjects with high activity levels, from 400 to below 600 or from 600 to below 800, had a likelihood of almost 0%. This suggests a marked difference in mortality based on activity level.

Further analysis was done to see whether this trend in activity count continued at other times of day. Each 2-hour interval was examined, and subjects with less activity were more likely to die within 10 years regardless of the time of day considered. However, the widest spread in results appeared between 12pm and 2pm. Subjects with activity levels below 200 had a very high mortality rate of 50%, while those with activity levels from 200 to below 400 had a rate of only about 20%. Subjects with activity levels of 400 or higher had an almost 0% likelihood of death within 10 years.
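The grouped Kaplan-Meier comparisons described above can be sketched with the survival package. This is an illustrative example on simulated data: the two activity groups, their death rates, and the sample size are assumptions, follow-up is censored at 10 years, and no survey weights are applied.

```r
library(survival)

# Simulated data; group labels and death rates are illustrative assumptions.
set.seed(2)
n <- 300
group <- factor(sample(c("low", "high"), n, replace = TRUE),
                levels = c("low", "high"))
rate <- ifelse(group == "low", 0.05, 0.01)  # assumed deaths per person-year
death_time <- rexp(n, rate)

# Administrative censoring at 10 years of follow-up.
time   <- pmin(death_time, 10)
status <- as.integer(death_time <= 10)

# Kaplan-Meier estimate of survival, stratified by activity group.
fit <- survfit(Surv(time, status) ~ group)

# Estimated 10-year mortality (1 minus survival at year 10) per group.
s10 <- summary(fit, times = 10, extend = TRUE)
mortality_10y <- setNames(1 - s10$surv, as.character(s10$strata))
print(round(mortality_10y, 3))
```

Calling plot(fit) would draw the two survival curves for a visual comparison like the ones discussed in this section.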

After this exploratory survival analysis, a Cox proportional hazards model was fit using the mortality predictors found in the previous logistic regression model building. The predictors remained significant in relation to time until death. The summary output is shown below.

Stratified 1 - level Cluster Sampling design (with replacement)
With (60) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~wtmec4yr_adj, 
    data = mort.df, nest = TRUE)
Call:
svycoxph(formula = Surv(permth_exm/12, mortstat) ~ Age + CHF + 
    Cancer + pDaytime + dTAC.norm, design = dataSvy)

  n= 3398, number of events= 539 

               coef exp(coef)  se(coef)      z Pr(>|z|)    
Age        0.078679  1.081857  0.009624  8.176 3.33e-16 ***
CHF        1.037949  2.823422  0.139950  7.417 1.20e-13 ***
Cancer     0.368583  1.445685  0.135082  2.729  0.00636 ** 
pDaytime  -0.014680  0.985427  0.004808 -3.054  0.00226 ** 
dTAC.norm -0.631552  0.531766  0.083001 -7.609 2.76e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

          exp(coef) exp(-coef) lower .95 upper .95
Age          1.0819     0.9243    1.0616    1.1025
CHF          2.8234     0.3542    2.1461    3.7145
Cancer       1.4457     0.6917    1.1094    1.8839
pDaytime     0.9854     1.0148    0.9762    0.9948
dTAC.norm    0.5318     1.8805    0.4519    0.6257

Concordance= 0.853  (se = 0.017 )
Rsquare= NA   (max possible= NA )
Likelihood ratio test= NA  on 5 df,   p=NA
Wald test            = 315  on 5 df,   p=0
Score (logrank) test = NA  on 5 df,   p=NA
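For reference, the fitted model can be written in the usual Cox form, with coefficients rounded from the output above. The per-decade hazard ratio for Age is a derived figure and does not appear in the original output.

```latex
\[
h(t \mid x) = h_0(t)\,\exp\bigl(0.0787\,\mathrm{Age} + 1.0379\,\mathrm{CHF}
  + 0.3686\,\mathrm{Cancer} - 0.0147\,\mathrm{pDaytime}
  - 0.6316\,\mathrm{dTAC.norm}\bigr)
\]
\[
\mathrm{HR}_{\mathrm{dTAC.norm}} = e^{-0.631552} \approx 0.532,
\qquad
\mathrm{HR}_{\mathrm{Age},\,10\ \mathrm{years}} = e^{10 \times 0.078679} \approx 2.20
\]
```

Holding the other predictors fixed, a one-unit increase in dTAC.norm is therefore associated with roughly a 47% lower hazard of death, while each additional decade of age is associated with slightly more than double the hazard.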

Sensitivity Analysis

While the logistic regression model was able to predict probability of death up to 9 years, the question of interest became whether activity itself could act as a predictor over that span. To answer it, a sensitivity analysis was performed on both dTAC.norm and pDaytime. The plots below illustrate how dTAC.norm and pDaytime act as predictors in the Cox proportional hazards model above.

To perform this analysis, subjects were progressively excluded based on their mortality status at each year. At each year, subjects who had already died were excluded from the sample, the Cox proportional hazards model was refit, and the P-values of both dTAC.norm and pDaytime were obtained. This method of exclusion allowed for a more accurate evaluation of how far ahead these activity-related covariates could predict mortality status. As shown, at an alpha level of .05, dTAC.norm was a significant predictor up until 6 years post-study, while pDaytime was a significant predictor up until 1.5 years post-study. The latter provides further evidence of a relationship between circadian rhythms of physical activity and mortality. Disruption of the circadian cycle of activity could perhaps be an indicator of death in the near future. On the other hand, the actual amount of daily activity could have longer-term effects on the risk of mortality.
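The exclusion procedure described above can be sketched as a loop over yearly cutoffs. This is a minimal sketch under stated assumptions: the data are simulated, a single hypothetical predictor named activity stands in for dTAC.norm and pDaytime, and an unweighted coxph replaces the survey-weighted model.

```r
library(survival)

# Simulated data; the predictor and its effect size are assumptions.
set.seed(3)
n <- 2000
activity   <- rnorm(n)
death_time <- rexp(n, rate = 0.03 * exp(-0.6 * activity))
dat <- data.frame(
  time     = pmin(death_time, 10),          # censor follow-up at 10 years
  status   = as.integer(death_time <= 10),  # 1 = died during follow-up
  activity = activity
)

# For each cutoff, drop subjects who died at or before that year, refit the
# Cox model on the remainder, and record the P-value for activity.
cutoffs <- 0:6
p_values <- sapply(cutoffs, function(cut) {
  kept <- dat[!(dat$status == 1 & dat$time <= cut), ]
  fit  <- coxph(Surv(time, status) ~ activity, data = kept)
  summary(fit)$coefficients["activity", "Pr(>|z|)"]
})
names(p_values) <- paste0("year_", cutoffs)
print(signif(p_values, 3))
```

In this simulation the effect of activity is constant over time, so the P-values stay small at every cutoff. In the report's data, the cutoff at which a P-value first exceeds .05 marks how far ahead the predictor remains informative.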

Receiver Operating Characteristic Analysis

Given the results of the sensitivity analysis, post-study mortality within 5 years became the outcome of interest. Further logistic regression models were built in order to find the best functional forms of the predictors of mortality. To evaluate the performance of each model, receiver operating characteristic (ROC) curves were used with 10-fold cross-validation. Increases in the area under the curve (AUC) of an ROC reflected improvements in the model.
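The evaluation scheme described above can be sketched in base R. This is an illustrative example only: the data are simulated, the variable names (age, activity, died) are hypothetical, the models are unweighted, and the AUC is computed from ranks (the Mann-Whitney formulation) rather than with whichever ROC package the analysis used.

```r
# Simulated data; names and effect sizes are illustrative assumptions.
set.seed(4)
n <- 1000
age      <- runif(n, 50, 85)
activity <- rnorm(n)
died     <- rbinom(n, 1, plogis(-6 + 0.07 * age - 0.7 * activity))
dat <- data.frame(died, age, activity)

# AUC via the rank-sum identity: the probability that a randomly chosen
# death receives a higher predicted risk than a randomly chosen survivor.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Assign each subject to one of 10 folds at random.
folds <- sample(rep(1:10, length.out = n))

# Mean held-out AUC over the 10 folds for a given model formula.
cv_auc <- function(formula) {
  mean(sapply(1:10, function(k) {
    fit  <- glm(formula, family = binomial, data = dat[folds != k, ])
    pred <- predict(fit, newdata = dat[folds == k, ], type = "response")
    auc(pred, dat$died[folds == k])
  }))
}

auc_base <- cv_auc(died ~ age)             # without activity
auc_full <- cv_auc(died ~ age + activity)  # with activity
print(c(base = auc_base, full = auc_full, gain = auc_full - auc_base))
```

Comparing the cross-validated AUC of two nested models in this way mirrors how the gain from adding activity-related predictors is assessed in this section.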

The plots above show the ROC curves for subsequent models and the improvements in the AUC from 0.7459 to 0.8165. Each plot graphs each of the ROC curves for the 10 sub-samples from the cross-validation and gives their mean AUC. The mean ROC curves are colored in red. Model 1 contained just the initial non-activity covariates from prior model building as predictors: Age, CHF, and Cancer. Model 2 contained those three predictors as well as more non-activity covariates that increased the AUC: Male, Black, Overweight (defined as having a BMI greater than or equal to 25), Diabetes, Stroke, MobilityProblem, CurrentDrinker, and CurrentSmoker. As shown, adjusting for sex, race, and other health indicators greatly improved the accuracy of the model in predicting risk of death. In Model 3, dTAC.norm was added to the predictors included in Model 2. Initially, pDaytime was also included, but it was found that the AUC actually increased more when pDaytime was excluded from the model. Perhaps, this is related to the findings from the sensitivity analysis that pDaytime is only predictive up to 1.5 years. Model 4 shown in the last plot includes changes in the functional forms of the established predictors as well as interaction terms that helped improve the AUC. The estimates and P-values for the predictors in Model 4 are shown below.

                                                                       Estimate      Pr(>|t|)
(Intercept)                                                       -2.938833e+00  2.538043e-04
I(Age^4)                                                           6.834117e-08  1.435187e-05
dTAC.norm                                                         -1.164960e+00  2.247742e-03
CHF                                                                8.734952e-01  1.774505e-02
Cancer                                                             7.798401e-01  1.503575e-03
Male                                                               8.856161e-01  2.688632e-03
Black                                                              3.498642e-02  8.831010e-01
Overweight                                                        -5.489583e-01  1.896370e-02
Diabetes                                                           2.753971e-01  5.205859e-01
Stroke                                                             3.806534e-01  1.840417e-01
MobilityProblem                                                    4.422353e-01  4.594059e-02
CurrentDrinker                                                    -7.428786e-01  9.679542e-04
CurrentSmoker                                                      8.777601e-01  6.438495e-04
I((sd11 + sd12 + 1)/(sd21 + sd22 + 1))                             1.055695e-02  6.685341e-02
I((pthTAC06 + pthTAC11)^4)                                         2.702435e-08  9.451167e-01
I(dTAC.norm^2)                                                     2.120389e-01  1.962227e-02
act_sd_norm                                                       -2.435619e+00  8.381128e-04
I(Age^4):dTAC.norm                                                 1.913783e-08  3.290491e-02
CHF:Cancer                                                         2.880242e-01  5.953137e-01
Overweight:Diabetes                                                9.767653e-02  8.353306e-01
I((sd11 + sd12 + 1)/(sd21 + sd22 + 1)):I((pthTAC06 + pthTAC11)^4) -5.360389e-08  1.359451e-01

While the majority of the changes in functional forms and additions of interaction terms involved Age and activity-related predictors, the inclusion of the interaction terms CHF:Cancer and Overweight:Diabetes also improved the model. This suggests that the negative effects of certain comorbidities and health characteristics may increase when they occur together, although neither interaction term is individually significant at the .05 level.

Age and activity seem to have non-linear relationships with risk of death. The variable expression pthTAC06 + pthTAC11 gives the summed percentage of activity a subject has during the hours of 10am to 12pm and 8pm to 10pm. act_sd_norm is a subject’s normalized standard deviation of activity throughout a day. While act_sd_norm may not have strong interpretability, it can be thought of as the amount of variation in activity level a subject has in a day, which could also be read as a measure of the cyclic nature of a subject’s activity. sd11, sd12, sd21, and sd22 give the standard deviation of a subject’s activity across the study period for the hours beginning at 10am, 11am, 8pm, and 9pm, respectively. The inclusion of I((sd11 + sd12 + 1)/(sd21 + sd22 + 1)) suggests that the ratio of daytime to nighttime variation in activity has an effect on mortality, although its P-value (0.067) falls just short of significance at the .05 level. Again, activity during 10am to 12pm appears to have an important relationship with risk of death. I(Age^4):dTAC.norm accounts for the way the effect of daily activity differs across ages, and I((sd11 + sd12 + 1)/(sd21 + sd22 + 1)):I((pthTAC06 + pthTAC11)^4) can be interpreted as the changing effect of the ratio of day-night activity variation depending on the proportion of activity spent in those time intervals. Overall, the inclusion of activity-related predictors increases the AUC by 0.0243.

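It should be noted that several of the predictors in Model 4 are not statistically significant at an alpha level of .05: Black, Diabetes, Stroke, I((sd11 + sd12 + 1)/(sd21 + sd22 + 1)), I((pthTAC06 + pthTAC11)^4), CHF:Cancer, Overweight:Diabetes, and I((sd11 + sd12 + 1)/(sd21 + sd22 + 1)):I((pthTAC06 + pthTAC11)^4).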

Further analysis could be done to evaluate these predictors, especially I((sd11 + sd12 + 1)/(sd21 + sd22 + 1)) (P = 0.067) and I((sd11 + sd12 + 1)/(sd21 + sd22 + 1)):I((pthTAC06 + pthTAC11)^4) (P = 0.136), which have the smallest P-values among the non-significant predictors. These two activity-related predictors appear to have some effect on mortality risk and should be investigated further. Continued analysis could potentially improve the AUC as well.