Conducting mixed effect/trajectory analysis using DataSHIELD • dsHelper

Introduction

Overview

This tutorial will demonstrate how to model trajectories in DataSHIELD using repeated measures data from multiple cohorts.

This tutorial draws on the following papers/tutorials:

Hughes, R., Tilling, K. & Lawlor, D. Combining longitudinal data from different cohorts to examine the life-course trajectory. Preprint available on medrxiv: https://doi.org/10.1101/2020.11.24.20237669
Tilling K, Macdonald-Wallis C, Lawlor DA, Hughes RA, Howe LD. Modelling childhood growth using fractional polynomials and linear splines. Ann Nutr Metab. 2014;65(2-3):129-38. https://doi.org/10.1159/000362695.
Centre for Multilevel Modelling online course. http://www.bristol.ac.uk/cmm/learning/online-course/

You will use simulated data to replicate part of the analysis from Hughes et al. cited above. It would be helpful to read this tutorial in conjuncture with this paper.

Please note that not all the methods used in Hughes et al. are available in DataSHIELD. Currently it is not possible to do 1-stage meta-analysis of mixed effects models. All of these are in the pipeline and this tutorial will be extended when these methods become available.

Do spline models!

Data

We will be working with simulated repeated measures data for four cohorts: (i) Alspac, (ii) BCG, (iii) Born in Bradford & (iv) Probit.

Aims of the analysis

Our research aims for today are:

To model child weight trajectories over time for four cohorts
To model sex differences in weight trajectories
To combine these trajectories by 2-stage meta-analysis

Let’s get started.

Installing and loading packages

First you need to install some packages are used in tutorial. Even if you have previously installed them it is worth doing this again to ensure that you have the latest versions.

If prompted to update packages, select option ‘none’.

First install the package ‘remotes’ which contains the function ‘install_github’.

#install.packages("remotes")
library(remotes)

#install_github("datashield/DSI")
#install_github("datashield/dsBaseClient")
#install_github("lifecycle-project/ds-helper", ref = "dev")
#install.packages("DSMolgenisArmadillo")
#install.packages("tidyverse")

Now we load these packages.

library(dsBaseClient)
library(DSI)
library(DSMolgenisArmadillo)
library(dsHelper)
library(tidyverse)

A couple of things to note:

I use a lot of the tidyverse packages as I find them very efficient and they make the code more readable. If any syntax is unclear just ask!
I also use a number of functions from a package called dsHelper. This was written by me and Sido Haakma to make analysis in DataSHIELD more straightforward. Any problems with these functions just let us know.

Logging in and assigning data

The simulated data is held on a remote server. To access it, you first request a ‘token’ which contains the login details. You then use this to login and assign the data.

url <- "https://armadillo-demo.molgenis.net/"
token <- armadillo.get_token(url)
  
builder <- DSI::newDSLoginBuilder()

builder$append(
server = "alspac",
url = url,
table = "trajectories/data/alspac",
token = token,
driver = "ArmadilloDriver", 
profile = "xenon")

builder$append(
server = "bcg",
url = url,
table = "trajectories/data/bcg",
token = token,
driver = "ArmadilloDriver", 
profile = "xenon")

builder$append(
server = "bib",
url = url,
table = "trajectories/data/bib",
token = token,
driver = "ArmadilloDriver", 
profile = "xenon")

builder$append(
server = "probit",
url = url,
table = "trajectories/data/probit",
token = token,
driver = "ArmadilloDriver", 
profile = "xenon")

logindata <- builder$build()

conns <- DSI::datashield.login(logins = logindata, assign = T, symbol = "data")

We can check that this has worked. You should see that each cohort has one dataframe called “data”, which contains 7 variables.

ds.summary("data")

## $alspac
## $alspac$class
## [1] "data.frame"
## 
## $alspac$`number of rows`
## [1] 129681
## 
## $alspac$`number of columns`
## [1] 7
## 
## $alspac$`variables held`
## [1] "id"                 "age"                "sex"                "ethnicity"          "father_socialclass" "mother_education"  
## [7] "weight"            
## 
## 
## $bcg
## $bcg$class
## [1] "data.frame"
## 
## $bcg$`number of rows`
## [1] 12724
## 
## $bcg$`number of columns`
## [1] 7
## 
## $bcg$`variables held`
## [1] "id"                 "age"                "sex"                "ethnicity"          "father_socialclass" "mother_education"  
## [7] "weight"            
## 
## 
## $bib
## $bib$class
## [1] "data.frame"
## 
## $bib$`number of rows`
## [1] 68864
## 
## $bib$`number of columns`
## [1] 7
## 
## $bib$`variables held`
## [1] "id"                 "age"                "sex"                "ethnicity"          "father_socialclass" "mother_education"  
## [7] "weight"            
## 
## 
## $probit
## $probit$class
## [1] "data.frame"
## 
## $probit$`number of rows`
## [1] 191224
## 
## $probit$`number of columns`
## [1] 7
## 
## $probit$`variables held`
## [1] "id"                 "age"                "sex"                "ethnicity"          "father_socialclass" "mother_education"  
## [7] "weight"

We wont be using all of the variables today, however if you want to come back and use this data to try out more complicated models we’ll keep it online.

Part 1: Visualising the data

Scatter plots of child weight by age

Let’s start off by looking at our repeated measures weight data for each cohort. We can make scatter plots with age on the x-axis and weight on the y-axis.

Note that the values that you see are anonymised - they are a non-disclosive approximation of the original values.

ds.scatterPlot(x = "data$age", y = "data$weight")

plot of chunk scatter

## [1] "Split plot created"

What are you initial thoughts about the data?

Mean observed weight by age

Another way to visualise the weight data is to split child age into yearly intervals, and plot the mean on the y-axis with age on the x-axis (see Figure 2 from Hughes et al.).

First of all we need to calculate the mean values for weight at each yearly age band:

mean_age <- dh.meanByGroup(
    df = "data", 
    outcome = "weight", 
    group_var = "age")

mean_age %>% print(n = Inf)

## # A tibble: 55 × 3
##    cohort   age  mean
##    <chr>  <dbl> <dbl>
##  1 alspac     0  4.64
##  2 alspac     1  8.78
##  3 alspac     2 12.8 
##  4 alspac     3 16.1 
##  5 alspac     4 17.0 
##  6 alspac     5 18.4 
##  7 alspac     6 20.3 
##  8 alspac     7 24.3 
##  9 alspac     8 26.5 
## 10 alspac     9 30.5 
## 11 alspac    10 35.1 
## 12 alspac    11 38.8 
## 13 alspac    12 43.6 
## 14 alspac    13 49.3 
## 15 alspac    14 54.6 
## 16 alspac    15 61.3 
## 17 alspac    16 63.5 
## 18 alspac    17 67.0 
## 19 alspac    18 67.7 
## 20 alspac    19 70.8 
## 21 alspac    20 71.8 
## 22 bcg        0  5.11
## 23 bcg        1  8.88
## 24 bcg        2 13.3 
## 25 bcg        3 15.4 
## 26 bcg        4 16.4 
## 27 bcg        5 18.6 
## 28 bib        0  4.46
## 29 bib        1  8.11
## 30 bib        2 13.3 
## 31 bib        3 16.0 
## 32 bib        4 17.1 
## 33 bib        5 18.7 
## 34 bib        6 22.4 
## 35 bib        7 27.8 
## 36 probit     0  4.98
## 37 probit     1  8.66
## 38 probit     2 14.0 
## 39 probit     3 16.4 
## 40 probit     4 16.8 
## 41 probit     5 17.8 
## 42 probit     6 20.8 
## 43 probit     7 23.6 
## 44 probit     8 28.0 
## 45 probit     9 34.6 
## 46 probit    10 36.1 
## 47 probit    11 39.8 
## 48 probit    12 42.1 
## 49 probit    13 47.1 
## 50 probit    14 54.0 
## 51 probit    15 60.1 
## 52 probit    16 62.4 
## 53 probit    17 63.2 
## 54 probit    18 61.6 
## 55 probit    19 61.0

Now we have the values locally, we can use any r package to plot these. I like to use ggplot:

palette <- c("#264653", "#2a9d8f", "#E9C46A", "#F4A261", "#E76F51")

mean_age.plot <- ggplot() + 
  geom_line(data = mean_age, aes(x = age, y = mean, colour = cohort), linewidth = 0.8) +
  scale_y_continuous(limit = c(0, 80), breaks = seq(0, 80, 20), expand = c(0, 0)) +
  xlab("Child age") +
  ylab("Observed weight (KG)") +
  scale_colour_manual(values = palette)

mean_age.plot

plot of chunk mean-by-age-plot-1

Q: Looking at these two plots, What are your initial thoughts about the pattern of weight change over time?

Describing the exposures

DataSHIELD contains the functions ds.summary and ds.table which can be used to describe continuous and categorical variables. However, the output of these is a little tricky to work with. Instead we can use the function ‘dh.getStats’ to extract descriptives in a more usable format.

stats <- dh.getStats(
    df = "data", 
    vars = "sex")

stats

## $categorical
## # A tibble: 0 × 10
## # ℹ 10 variables: variable <chr>, cohort <chr>, category <chr>, value <dbl>, cohort_n <int>, valid_n <dbl>, missing_n <dbl>, perc_valid <dbl>,
## #   perc_missing <dbl>, perc_total <dbl>
## 
## $continuous
## # A tibble: 5 × 15
##   variable cohort    mean std.dev perc_5 perc_10 perc_25 perc_50 perc_75 perc_90 perc_95 valid_n cohort_n missing_n missing_perc
##   <chr>    <chr>    <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>     <dbl>        <dbl>
## 1 sex      alspac    0.49     0.5      0       0       0       0       1       1       1  129681   129681         0            0
## 2 sex      bcg       0.48     0.5      0       0       0       0       1       1       1   12724    12724         0            0
## 3 sex      bib       0.49     0.5      0       0       0       0       1       1       1   68864    68864         0            0
## 4 sex      probit    0.48     0.5      0       0       0       0       1       1       1  191224   191224         0            0
## 5 sex      combined  0.49     0.5      0       0       0       0       1       1       1  402493   402493         0            0

This gives us a list with two elements corresponding to the continuous and categorical variables. As we didn’t request stats for any continuous variables, the first list is empty.

Some more descriptives that are useful to have

There are other bits of information it is useful to report:

Total number of observations
Total number of participants
Min and max age of participants
Median number of measurements per individual

We can use the function dh.getRmStats to extract this info. Note that the min and max ages are based on the 5th and 95th percentiles as DataSHIELD doesn’t will not return true minimum and maximums due to disclosure issues.

rm_stats <- dh.getRmStats(
    df = "data",
    outcome = "weight",
    id_var = "id", 
    age_var = "age")

rm_stats

## # A tibble: 5 × 8
##   cohort   min_age max_age  n_obs n_participants n_meas_5 n_meas_med n_meas_95
##   <chr>      <dbl>   <dbl>  <dbl>          <dbl>    <dbl>      <dbl>     <dbl>
## 1 alspac         0   15.7  129681          14216     6          9         13  
## 2 bcg            0    4.99  12724            951    12         14         14  
## 3 bib            0    4.98  68864          13445     3          5          8  
## 4 probit         0   15.9  191224          17046     9         11         14  
## 5 combined       0   13.6  402493          45658     6.36       8.67      11.9

Part 2: Fitting a simple linear trajectory

Trajectory models: prep

Before we can get started, we need to make sure that our id variable is an integer. If we don’t, it breaks. First we create a new object which is an integer version of our ID. We then join this back to our main dataframe.

ds.asInteger(
    x.name = "data$id", 
    newobj = "id_int")

ds.dataFrame(
    x = c("data$id", "data$age", "data$sex", "data$ethnicity", 
          "data$father_socialclass", "data$mother_education", "data$weight",
          "id_int"),
    stringsAsFactors = F,
    newobj = "data")

Specifying the model

First we fit a model with weight predicted by four terms: an intercept (1), age, sex and the interaction between age and sex. These terms are our ‘fixed effects’, and comprise the first half of the formula below. They are specified in the same way as you would write a formula in a standard regression model.

In the second half of the formula below we specify our random effects. This is where we provide information about how we believe the data to be clustered.

The variable to the right of the vertical line is our ID variable. This specifies that we want to treat each invidual as a separate cluster, because we believe their measurements will be correlated. This is a plausible hypothesis: for example if my height is a lot higher than average at age 5, it is likely to also be higher than average at age 10. We can describe this as a 2-level clustering, with weight measurements (level 1) clustered within individuals (level 2).

The variables to the left of the vertical line specify which extra parameters we want to estimate for each individual. Or to put it another way, it specifies which parameters we allow to vary for each individual.

If we just specify “1”, then in addition to calculating an overall intercept, we also calculate an intercept for each individual. This provides us with information about the variability between individuals in their starting weight. This is called a random intercept model.

If we specifed “1 + age”, then in addition we would calculate a separate slope for each individual. This would provide us with information about how the gradient of the trajectory varies between individuals. This is called a random slope model.

Here we start with a random intercept model. Understand what you are doing is the difficult bit; fitting the model is straightfoward:

model1.fit <- ds.lmerSLMA(
    dataName = "data",
  formula = "weight ~ 1 + age + sex + age*sex + (1|id_int)", 
  datasources = conns)

model1.fit$output.summary

## $study1
## Linear mixed model fit by REML ['lmerMod']
## Formula: weight ~ 1 + age + sex + age * sex + (1 | id_int)
##    Data: dataDF
## Weights: weights.to.use
##  Offset: offset.to.use
## Control: control.obj
## 
## REML criterion at convergence: 839458.6
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
##     NA     NA     NA     NA     NA 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id_int   (Intercept) 14.77    3.843   
##  Residual             31.68    5.628   
## Number of obs: 129681, groups:  id_int, 14216
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  3.869135   0.056184  68.866
## age          3.534366   0.004078 866.797
## sex          0.336620   0.080317   4.191
## age:sex     -0.200629   0.005789 -34.655
## 
## Correlation of Fixed Effects:
##         (Intr) age    sex   
## age     -0.445              
## sex     -0.700  0.311       
## age:sex  0.313 -0.704 -0.443
## 
## $study2
## Linear mixed model fit by REML ['lmerMod']
## Formula: weight ~ 1 + age + sex + age * sex + (1 | id_int)
##    Data: dataDF
## Weights: weights.to.use
##  Offset: offset.to.use
## Control: control.obj
## 
## REML criterion at convergence: 54369.4
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
##     NA     NA     NA     NA     NA 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id_int   (Intercept) 1.250    1.118   
##  Residual             3.692    1.921   
## Number of obs: 12724, groups:  id_int, 951
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  6.03409    0.06151  98.097
## age          2.85790    0.01411 202.581
## sex         -0.54661    0.09028  -6.055
## age:sex      0.03661    0.02049   1.787
## 
## Correlation of Fixed Effects:
##         (Intr) age    sex   
## age     -0.455              
## sex     -0.681  0.310       
## age:sex  0.313 -0.688 -0.456
## 
## $study3
## Linear mixed model fit by REML ['lmerMod']
## Formula: weight ~ 1 + age + sex + age * sex + (1 | id_int)
##    Data: dataDF
## Weights: weights.to.use
##  Offset: offset.to.use
## Control: control.obj
## 
## REML criterion at convergence: 292805.1
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
##     NA     NA     NA     NA     NA 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id_int   (Intercept) 0.6142   0.7837  
##  Residual             3.6480   1.9100  
## Number of obs: 68864, groups:  id_int, 13445
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  5.089372   0.016491 308.613
## age          3.119718   0.005986 521.136
## sex         -0.394888   0.023685 -16.672
## age:sex      0.004995   0.008556   0.584
## 
## Correlation of Fixed Effects:
##         (Intr) age    sex   
## age     -0.528              
## sex     -0.696  0.368       
## age:sex  0.369 -0.700 -0.529
## 
## $study4
## Linear mixed model fit by REML ['lmerMod']
## Formula: weight ~ 1 + age + sex + age * sex + (1 | id_int)
##    Data: dataDF
## Weights: weights.to.use
##  Offset: offset.to.use
## Control: control.obj
## 
## REML criterion at convergence: 1138299
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
##     NA     NA     NA     NA     NA 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id_int   (Intercept)  3.402   1.844   
##  Residual             20.521   4.530   
## Number of obs: 191224, groups:  id_int, 17046
## 
## Fixed effects:
##              Estimate Std. Error  t value
## (Intercept)  4.689392   0.026376  177.793
## age          3.455656   0.003041 1136.282
## sex          0.013923   0.038018    0.366
## age:sex     -0.240527   0.004382  -54.889
## 
## Correlation of Fixed Effects:
##         (Intr) age    sex   
## age     -0.381              
## sex     -0.694  0.264       
## age:sex  0.264 -0.694 -0.381
## 
## $input.beta.matrix.for.SLMA
##             betas study 1 betas study 2 betas study 3 betas study 4
## (Intercept)     3.8691347    6.03408780   5.089372058    4.68939189
## age             3.5343657    2.85790465   3.119717700    3.45565555
## sex             0.3366204   -0.54660715  -0.394887544    0.01392347
## age:sex        -0.2006292    0.03661021   0.004995405   -0.24052747
## 
## $input.se.matrix.for.SLMA
##             ses study 1 ses study 2 ses study 3 ses study 4
## (Intercept) 0.056183657  0.06151162 0.016491091 0.026375527
## age         0.004077501  0.01410748 0.005986375 0.003041196
## sex         0.080317421  0.09027731 0.023685373 0.038017878
## age:sex     0.005789246  0.02049084 0.008556005 0.004382080

Exploring the output

We can see the coefficients for the fixed and random parts of our model directly from the model object.

Fixed effects coefficients for each cohort are here:

model1.fit$betamatrix.valid

##             betas study 1 betas study 2 betas study 3 betas study 4
## (Intercept)     3.8691347    6.03408780   5.089372058    4.68939189
## age             3.5343657    2.85790465   3.119717700    3.45565555
## sex             0.3366204   -0.54660715  -0.394887544    0.01392347
## age:sex        -0.2006292    0.03661021   0.004995405   -0.24052747

The pooled estimates are here:

model1.fit$SLMA.pooled.ests.matrix

##              pooled.ML      se.ML pooled.REML    se.REML  pooled.FE       se.FE
## (Intercept)  4.9201754 0.38859060   4.9202557 0.44896972  4.9643913 0.013250416
## age          3.2421628 0.13538495   3.2420998 0.15635197  3.4176657 0.002229407
## sex         -0.1477761 0.17066919  -0.1477464 0.19787562 -0.2576757 0.019061879
## age:sex     -0.1008701 0.06092812  -0.1006258 0.07037899 -0.1873971 0.003195114

And the details of the random effects are here:

model1.fit$output.summary$study1$varcor

##  Groups   Name        Std.Dev.
##  id_int   (Intercept) 3.8426  
##  Residual             5.6282

A neater way

I find these objects a little fiddly to work with (you may not). You can also use the dsHelper function to extract the coefficients.

model1.coef <- dh.lmTab(
    model = model1.fit, 
    type = "lmer_slma", 
    coh_names = names(conns), 
    ci_format = "separate", 
    direction = "wide")

Let’s explore there in more detail

Fixed effects

The fixed effects are analogous to your coefficients from a linear regression model.

model1.coef$fixed

## # A tibble: 20 × 9
##    cohort   variable    est    se pvalue  n_obs n_coh lowci uppci
##    <chr>    <chr>     <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 alspac   intercept  3.87  0.06     NA 129681     1  3.76  3.98
##  2 alspac   age        3.53  0        NA 129681     1  3.53  3.54
##  3 alspac   sex        0.34  0.08     NA 129681     1  0.18  0.49
##  4 alspac   age:sex   -0.2   0.01     NA 129681     1 -0.21 -0.19
##  5 bcg      intercept  6.03  0.06     NA  12724     1  5.91  6.15
##  6 bcg      age        2.86  0.01     NA  12724     1  2.83  2.89
##  7 bcg      sex       -0.55  0.09     NA  12724     1 -0.72 -0.37
##  8 bcg      age:sex    0.04  0.02     NA  12724     1  0     0.08
##  9 bib      intercept  5.09  0.02     NA  68864     1  5.06  5.12
## 10 bib      age        3.12  0.01     NA  68864     1  3.11  3.13
## 11 bib      sex       -0.39  0.02     NA  68864     1 -0.44 -0.35
## 12 bib      age:sex    0     0.01     NA  68864     1 -0.01  0.02
## 13 probit   intercept  4.69  0.03     NA 191224     1  4.64  4.74
## 14 probit   age        3.46  0        NA 191224     1  3.45  3.46
## 15 probit   sex        0.01  0.04     NA 191224     1 -0.06  0.09
## 16 probit   age:sex   -0.24  0        NA 191224     1 -0.25 -0.23
## 17 combined intercept  4.92  0.39     NA 402493     4  4.16  5.68
## 18 combined age        3.24  0.14     NA 402493     4  2.98  3.51
## 19 combined sex       -0.15  0.17     NA 402493     4 -0.48  0.19
## 20 combined age:sex   -0.1   0.06     NA 402493     4 -0.22  0.02

If we use dh.lmTab it allows us to manipulate our output more easily. For example you might want to split the fixed effects by cohort:

model1.coef$fixed %>% group_by(cohort) %>%
group_split()

## <list_of<
##   tbl_df<
##     cohort  : character
##     variable: character
##     est     : double
##     se      : double
##     pvalue  : double
##     n_obs   : double
##     n_coh   : double
##     lowci   : double
##     uppci   : double
##   >
## >[5]>
## [[1]]
## # A tibble: 4 × 9
##   cohort variable    est    se pvalue  n_obs n_coh lowci uppci
##   <chr>  <chr>     <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 alspac intercept  3.87  0.06     NA 129681     1  3.76  3.98
## 2 alspac age        3.53  0        NA 129681     1  3.53  3.54
## 3 alspac sex        0.34  0.08     NA 129681     1  0.18  0.49
## 4 alspac age:sex   -0.2   0.01     NA 129681     1 -0.21 -0.19
## 
## [[2]]
## # A tibble: 4 × 9
##   cohort variable    est    se pvalue n_obs n_coh lowci uppci
##   <chr>  <chr>     <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bcg    intercept  6.03  0.06     NA 12724     1  5.91  6.15
## 2 bcg    age        2.86  0.01     NA 12724     1  2.83  2.89
## 3 bcg    sex       -0.55  0.09     NA 12724     1 -0.72 -0.37
## 4 bcg    age:sex    0.04  0.02     NA 12724     1  0     0.08
## 
## [[3]]
## # A tibble: 4 × 9
##   cohort variable    est    se pvalue n_obs n_coh lowci uppci
##   <chr>  <chr>     <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bib    intercept  5.09  0.02     NA 68864     1  5.06  5.12
## 2 bib    age        3.12  0.01     NA 68864     1  3.11  3.13
## 3 bib    sex       -0.39  0.02     NA 68864     1 -0.44 -0.35
## 4 bib    age:sex    0     0.01     NA 68864     1 -0.01  0.02
## 
## [[4]]
## # A tibble: 4 × 9
##   cohort   variable    est    se pvalue  n_obs n_coh lowci uppci
##   <chr>    <chr>     <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 combined intercept  4.92  0.39     NA 402493     4  4.16  5.68
## 2 combined age        3.24  0.14     NA 402493     4  2.98  3.51
## 3 combined sex       -0.15  0.17     NA 402493     4 -0.48  0.19
## 4 combined age:sex   -0.1   0.06     NA 402493     4 -0.22  0.02
## 
## [[5]]
## # A tibble: 4 × 9
##   cohort variable    est    se pvalue  n_obs n_coh lowci uppci
##   <chr>  <chr>     <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 probit intercept  4.69  0.03     NA 191224     1  4.64  4.74
## 2 probit age        3.46  0        NA 191224     1  3.45  3.46
## 3 probit sex        0.01  0.04     NA 191224     1 -0.06  0.09
## 4 probit age:sex   -0.24  0        NA 191224     1 -0.25 -0.23

Or you might want to just select the pooled results:

model1.coef$fixed %>% dplyr::filter(cohort == "combined")

## # A tibble: 4 × 9
##   cohort   variable    est    se pvalue  n_obs n_coh lowci uppci
##   <chr>    <chr>     <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 combined intercept  4.92  0.39     NA 402493     4  4.16  5.68
## 2 combined age        3.24  0.14     NA 402493     4  2.98  3.51
## 3 combined sex       -0.15  0.17     NA 402493     4 -0.48  0.19
## 4 combined age:sex   -0.1   0.06     NA 402493     4 -0.22  0.02

Random effects

When the model if fit, group (level 2) residuals are calculated for each subject. These represent the difference between interecept of each subject and the mean intercept for the sample.

However, there are not returned by DataSHIELD as they could be disclosive. Instead we can view the standard deviation of these residuals:

model1.coef$random

## # A tibble: 8 × 5
##   cohort group    var1      var2  stddev
##   <chr>  <chr>    <chr>     <chr>  <dbl>
## 1 alspac id_int   intercept <NA>    3.84
## 2 alspac Residual <NA>      <NA>    5.63
## 3 bcg    id_int   intercept <NA>    1.12
## 4 bcg    Residual <NA>      <NA>    1.92
## 5 bib    id_int   intercept <NA>    0.78
## 6 bib    Residual <NA>      <NA>    1.91
## 7 probit id_int   intercept <NA>    1.84
## 8 probit Residual <NA>      <NA>    4.53

If we take ALSPAC as an example, we have an overall (mean) intercept of 3.87, and a standard deviation of 3.84. This indicates considerable variability between individual intercepts. If you had specified more than one random effect (e.g. slope) this would also be displayed here. We can also see the standard deviation of the residual error, ie the difference between observed values and predicted values within subject.

Plotting the trajectories

In order to plot trajectories, we first need to get model predicted values of weight. In order to do this, we need to create a dataframe with the values of the fixed effects at which we want to predict values of weight. The names of the columns of this new dataframe need to correspond to the names of the coefficients in the output above.

We want to visualise the trajectories for both males and females between ages 0 and 25. In our reference dataframe, we therefore create values for age between 0 and 25 at 0.1 intervals, repeated for sex == 0 (males) and sex == 1 (females).

new_data_m <- tibble(
    age = seq(0, 25, by = 0.1), 
    "age:sex" = age*0,
    sex = 0)

new_data_f <- tibble(
    age = seq(0, 25, by = 0.11), 
    "age:sex" = age*1,
    sex = 1)

pred_data <- bind_rows(new_data_m, new_data_f)

If we look at this object we have created, it has all the values of the model variables at which we want to predict weight:

pred_data

## # A tibble: 479 × 3
##      age `age:sex`   sex
##    <dbl>     <dbl> <dbl>
##  1   0           0     0
##  2   0.1         0     0
##  3   0.2         0     0
##  4   0.3         0     0
##  5   0.4         0     0
##  6   0.5         0     0
##  7   0.6         0     0
##  8   0.7         0     0
##  9   0.8         0     0
## 10   0.9         0     0
## # ℹ 469 more rows

We can then use this reference data and our model object to get predicted values. Note that we now have four more columns appended to our reference data containing predicted values, standard error of predictions and confidence intervals.

plotdata <- dh.predictLmer(
    model = model1.fit,
    new_data = pred_data,
    coh_names = names(conns))

plotdata

## # A tibble: 2,395 × 9
##    cohort intercept   age `age:sex`   sex predicted     se low_ci upper_ci
##    <chr>      <dbl> <dbl>     <dbl> <dbl>     <dbl>  <dbl>  <dbl>    <dbl>
##  1 alspac         1   0           0     0      3.87 0.0562   3.76     3.98
##  2 alspac         1   0.1         0     0      4.22 0.0560   4.11     4.33
##  3 alspac         1   0.2         0     0      4.58 0.0558   4.47     4.69
##  4 alspac         1   0.3         0     0      4.93 0.0557   4.82     5.04
##  5 alspac         1   0.4         0     0      5.28 0.0555   5.17     5.39
##  6 alspac         1   0.5         0     0      5.64 0.0553   5.53     5.74
##  7 alspac         1   0.6         0     0      5.99 0.0551   5.88     6.10
##  8 alspac         1   0.7         0     0      6.34 0.0550   6.23     6.45
##  9 alspac         1   0.8         0     0      6.69 0.0548   6.59     6.80
## 10 alspac         1   0.9         0     0      7.05 0.0547   6.94     7.15
## # ℹ 2,385 more rows

Nice, we’re getting there. One thing to address is that currently we have predicted values for all cohorts between ages 0 and 25. However many of the cohorts don’t have data between those ages, so we might not want to plot beyond where there is data. We can trim our predicted data to the available ages.

plot_trim <- dh.trimPredData(
    pred = plotdata,
    coh_names = c("alspac", "bcg", "bib", "probit", "combined"),
    min = rep(0, 5), 
    max = c(20, 5, 5, 20, 20))

plot_trim

## # A tibble: 1,343 × 9
##    cohort intercept   age `age:sex`   sex predicted     se low_ci upper_ci
##    <chr>      <dbl> <dbl>     <dbl> <dbl>     <dbl>  <dbl>  <dbl>    <dbl>
##  1 alspac         1   0           0     0      3.87 0.0562   3.76     3.98
##  2 alspac         1   0.1         0     0      4.22 0.0560   4.11     4.33
##  3 alspac         1   0.2         0     0      4.58 0.0558   4.47     4.69
##  4 alspac         1   0.3         0     0      4.93 0.0557   4.82     5.04
##  5 alspac         1   0.4         0     0      5.28 0.0555   5.17     5.39
##  6 alspac         1   0.5         0     0      5.64 0.0553   5.53     5.74
##  7 alspac         1   0.6         0     0      5.99 0.0551   5.88     6.10
##  8 alspac         1   0.7         0     0      6.34 0.0550   6.23     6.45
##  9 alspac         1   0.8         0     0      6.69 0.0548   6.59     6.80
## 10 alspac         1   0.9         0     0      7.05 0.0547   6.94     7.15
## # ℹ 1,333 more rows

Next we set sex as a factor to make the plot clearer:

plot_trim <- plot_trim %>%
mutate(sex = factor(sex, levels = c(0, 1), labels = c("Male", "Female")))

It would also be nice to be able to overlay the predicted trajectory with the observed values. Remember earlier we made a scatter plot using DataSHIELD? We can extract the anonymised values which were used in that plot and store them as a local object. This allows us to build more flexible plots.

Each row of the created object represents the anomyised values of a subject.

observed <- dh.getAnonPlotData(
    df = "data",
    var_1 = "age", 
    var_2 = "weight")

observed

## # A tibble: 804,986 × 3
##    cohort    age weight
##    <chr>   <dbl>  <dbl>
##  1 alspac  0.130   4.93
##  2 alspac  0.604   7.37
##  3 alspac  3.89   18.5 
##  4 alspac  7.64   26.5 
##  5 alspac  8.57   30.4 
##  6 alspac 10.8    39.4 
##  7 alspac 13.0    56.8 
##  8 alspac 17.8    84.4 
##  9 alspac  1.00    9.79
## 10 alspac  1.36   10.1 
## # ℹ 804,976 more rows

Now we are ready to plot the pooled trajectories. We use the function “sample_frac” to only take 1% of the observed values. You can play around with this value, but if you use too many the graph becomes unreadable.

ggplot() + 
  geom_point(
    data = sample_frac(observed, .01) %>% dplyr::filter(cohort == "combined"), 
    aes(x = age, y = weight), 
    alpha = 0.4, 
    size = 0.1) +
  geom_line(
    data = plot_trim %>% dplyr::filter(cohort == "combined"), 
    aes(x = age, y = predicted, colour = sex), size = 0.8) +
  geom_ribbon(
    data = plot_trim %>% dplyr::filter(cohort == "combined" & sex == "Male"), 
    aes(x = age, ymin = low_ci, ymax = upper_ci), alpha = 0.1) +
  geom_ribbon(
    data = plot_trim %>% dplyr::filter(cohort == "combined" & sex == "Female"), 
    aes(x = age, ymin = low_ci, ymax = upper_ci), alpha = 0.1) +
  scale_x_continuous(limit = c(0, 20), breaks = seq(0, 20, 5), expand = c(0, 0)) + 
  scale_y_continuous(limit = c(0, 80), breaks = seq(0, 80, 20), expand = c(0, 0)) +
  xlab("Child age") +
  ylab("Predicted weight (KG)") +
  labs(colour = "Sex")

plot of chunk plot-model-1-pooled

We can also plot the trajectories for each cohort separately:

ggplot() + 
  geom_point(
    data = sample_frac(observed, .01) %>% dplyr::filter(cohort != "combined"), 
    aes(x = age, y = weight), 
    alpha = 0.4, 
    size = 0.1) +
  geom_line(
    data = plot_trim %>% dplyr::filter(cohort != "combined"), 
    aes(x = age, y = predicted, colour = sex), size = 0.8) +
  geom_ribbon(
    data = plot_trim %>% dplyr::filter(cohort != "combined" & sex == "Male"), 
    aes(x = age, ymin = low_ci, ymax = upper_ci), alpha = 0.1) +
  geom_ribbon(
    data = plot_trim %>% dplyr::filter(cohort != "combined" & sex == "Female"), 
    aes(x = age, ymin = low_ci, ymax = upper_ci), alpha = 0.1) +
  facet_wrap(~cohort) +
  scale_x_continuous(limit = c(0, 20), breaks = seq(0, 20, 5), expand = c(0, 0)) + 
  scale_y_continuous(limit = c(0, 80), breaks = seq(0, 80, 20), expand = c(0, 0)) +
  xlab("Child age") +
  ylab("Predicted weight (KG)") +
  labs(colour = "Sex")

plot of chunk plot-model-1-cohort

Q: Looking at the plots, how well do you think our model fits the data?

Part 3: Fitting a non-linear trajectory using a transformation of the age term

The previous model doesn’t look great to me, especially around age 5 where there looks to be a distinct non-linearity in the trajectory which we are not capturing.

One way to address this is to model weight not as a function of age, but as a function of a transformation of age.

Let’s start by creating a new variable which is age^2, and see if this provides a better fit. We create the variable, and join it back in our main data frame.

ds.make(
    toAssign = "data$age^2", 
    newobj = "age_2")

ds.dataFrame(
    x = c("data", "age_2"), 
    newobj = "data")

Now we are going to simply repeat the steps before, but using this transformed variable instead of the original age variable.

First we run the model:

age_2.fit <- ds.lmerSLMA(
    dataName = "data",
  formula = "weight ~ 1 + age_2 + sex + age_2*sex + (1|id_int)", 
  datasources = conns)

Now we create a dataframe of the values of the fixed effects at which we want to predict weight. Note that as we have transformed our age term, we need to add this transformed version of the variable to our reference data.

pred_ref_2a <- tibble(
    age = seq(0, 25, by = 0.1), 
    age_2 = age^2,
    "age_2:sex" = age_2*0,
    sex = 0)

pred_ref_2b <- tibble(
    age = seq(0, 25, by = 0.11), 
    age_2 = age^2,
    "age_2:sex" = age_2*1,
    sex = 1)

pred_ref_2 <- bind_rows(pred_ref_2a, pred_ref_2b)

Again we get the predicted values, trim them at the ages of the observed data and set sex as a factor.

pred_data_2 <- dh.predictLmer(
    model = age_2.fit,
    new_data = pred_ref_2,
    coh_names = names(conns))

pred_data_2_trim <- dh.trimPredData(
    pred = pred_data_2,
    coh_names = c("alspac", "bcg", "bib", "probit", "combined"),
    min = rep(0, 5), 
    max = c(20, 5, 5, 20, 20))

pred_data_2_trim <- pred_data_2_trim %>%
mutate(sex = factor(sex, levels = c(0, 1), labels = c("Male", "Female")))

Let’s plot this against the observed data and see if it fits any better:

ggplot() + 
  geom_point(
    data = sample_frac(observed, .01) %>% dplyr::filter(cohort == "combined"), 
    aes(x = age, y = weight), 
    alpha = 0.4, 
    size = 0.1) +
  geom_line(
    data = pred_data_2_trim %>% dplyr::filter(cohort == "combined"), 
    aes(x = age, y = predicted, colour = sex), size = 0.8) +
  scale_x_continuous(limit = c(0, 20), breaks = seq(0, 20, 5), expand = c(0, 0)) + 
  scale_y_continuous(limit = c(0, 80), breaks = seq(0, 80, 20), expand = c(0, 0)) +
  xlab("Child age") +
  ylab("Predicted weight") +
  labs(colour = "Sex")

plot of chunk plot-model-2

Hmmm. We now have a non-linear trajectory, but it still doesn’t look a great fit. We haven’t looked at any fit statistics or residuals yet, but from the eye it clearly doesn’t fit the data. It also doesn’t seem to capture the sex differences we would expect. We can do better than this.

Part 4: Fractional polynomials

Using one transformation of age allowed one ‘bend’ in the trajectory. However, the problem we saw with the previous model is that natural growth trajectories often have multiple bends.

One way to model this is to use two transformations of age (e.g. age^2 and age^0.5). These are called fractional polynomials (powers which involve fractions). You could use more than 2 transformations: this would give you a better model fit, but with greater risk of overfitting your model. Here we will stick two.

Even with only two transformations of age, there are infinite number of possible combinations. How do we chose which to use?

You may have a good a priori reason to chose a combination of terms. If you don’t, one reasonable approach is to select a certain set of transformations, and fit every combination of those. We can then see which combination of age terms provides the best model fit.

Preparing the data for modelling

Remove values of zero

Transforming values of zero will create infinite values which will break our models. We add a small quantity to our age variable to avoid this.

ds.assign(
    toAssign = "data$age+0.01", 
    newobj = "age", 
  datasources = conns)

dh.dropCols(
    df = "data", 
    vars = "age", 
    type = "remove", 
    new_obj = "data")

ds.dataFrame(
    x = c("data", "age"), 
    newobj = "data")

Create transformations of age term

We could do the transformations manually, but this would be a bit cumbersome. We can insted use the function ds.makeAgePolys to create transformations of age. Here we use the following powers: -2, -1, -0.5, log, 0.5, 2, 3.

dh.makeAgePolys(
    df = "data", 
    age_var = "age", 
    poly_names = c("m_2", "m_1", "m_0_5", "log", "0_5", "2", "3"), 
    poly_form = c("^-2", "^-1", "^-0.5", "log", "^0.5", "^2", "^3"))

We can check this has worked, and see the mean values for these new variables.

poly_summary <- dh.getStats(
    df = "data", 
    vars = c("agem_2", "agem_1", "agem_0_5", "age_log", "age0_5", "age2", "age3"))

poly_summary$continuous %>%
dplyr::select(cohort, variable, mean, std.dev) %>%
print(n = Inf)

## # A tibble: 35 × 4
##    cohort   variable    mean std.dev
##    <chr>    <chr>      <dbl>   <dbl>
##  1 alspac   age_log     2.13    1.3 
##  2 bcg      age_log     1.25    0.67
##  3 bib      age_log     0.98    0.73
##  4 probit   age_log     1.4     1.19
##  5 alspac   age0_5     69.2    86.8 
##  6 bcg      age0_5      6.86    8.19
##  7 bib      age0_5      5.25    8.85
##  8 probit   age0_5     34.4    72.0 
##  9 alspac   age2      889.   1393.  
## 10 bcg      age2       27.0    39.2 
## 11 bib      age2       22.6    46.0 
## 12 probit   age2      447.   1125.  
## 13 alspac   age3        0.78    2.12
## 14 bcg      age3       -0.04    1.69
## 15 bib      age3       -0.92    2.15
## 16 probit   age3       -0.21    2.07
## 17 alspac   agem_0_5    1.48    2.59
## 18 bcg      agem_0_5    1.68    2.45
## 19 bib      agem_0_5    2.95    3.54
## 20 probit   agem_0_5    2       2.65
## 21 alspac   agem_1      8.89   26.7 
## 22 bcg      agem_1      8.84   26.0 
## 23 bib      agem_1     21.2    38.6 
## 24 probit   agem_1     11.0    28.0 
## 25 alspac   agem_2    792.   2688.  
## 26 bcg      agem_2    754.   2628.  
## 27 bib      agem_2   1945.   3942.  
## 28 probit   agem_2    905.   2845.  
## 29 combined age0_5     39.8  1150   
## 30 combined age2      504.      2.18
## 31 combined age3       -0.01    1.23
## 32 combined age_log     1.56   73.8 
## 33 combined agem_0_5    1.99    2.84
## 34 combined agem_1     12.0    30.0 
## 35 combined agem_2   1042.   3037.

Identifying the best fitting model

Ok, so now we have a load of transformations of our age in term in the dataframe “data”. Next we want to fit models for each paired combination of these terms. Again, we could do this by writing out ds.lmerSLMA 28 times, but this brings a fair chance that you make an error in the code.

I’ve written a couple more functions to streamline the process.

Create model formulae

‘dh.makeLmerForm’ creates the model formulae based on the age terms you have created. The argument to “agevars” is a vector of variable names corresponding to the age transformations we created above.

First we try to find the best shape of trajectory, so we create model formulae just with the age terms.

weight_form <- dh.makeLmerForm(
  outcome = "weight", 
  id_var = "id_int", 
  age_vars = c("age", "agem_2", "agem_1", "agem_0_5", "age_log", "age0_5", "age2", "age3"), 
  random = "intercept")

weight_form %>% 
head(10)

## # A tibble: 10 × 2
##    polys           formula                               
##    <chr[1d]>       <chr>                                 
##  1 age,agem_2      weight~1+ age+agem_2 + (1|id_int)     
##  2 age,agem_1      weight~1+ age+agem_1 + (1|id_int)     
##  3 age,agem_0_5    weight~1+ age+agem_0_5 + (1|id_int)   
##  4 age,age_log     weight~1+ age+age_log + (1|id_int)    
##  5 age,age0_5      weight~1+ age+age0_5 + (1|id_int)     
##  6 age,age2        weight~1+ age+age2 + (1|id_int)       
##  7 age,age3        weight~1+ age+age3 + (1|id_int)       
##  8 agem_2,agem_1   weight~1+ agem_2+agem_1 + (1|id_int)  
##  9 agem_2,agem_0_5 weight~1+ agem_2+agem_0_5 + (1|id_int)
## 10 agem_2,age_log  weight~1+ agem_2+age_log + (1|id_int)

This has created a table where each row (a total of 28) contains the formula for a mixed effects model with a different combination of the fractional polynomials.

Fit all combinations of polynomials

The function dh.lmeMultPoly takes as an input to the argument “formulae” a vector of formulae. Here we use the formulae created above. You could also create your own vector of formulae and supply it to the same argument.

If you’ve read Rachel Hughes’ paper, you’ll notice that she first fits models on a discovery sample, and then checks the fit on the remaining sample. I’ve skipped this here for simplicity but it is possible to do this in DS.

This will take a few minutes as we are fitting 28 different models.

poly.fit <- dh.lmeMultPoly(
    df = "data",
    formulae = weight_form$formula, 
  poly_names = weight_form$polys)

## Error in dh.lmeMultPoly(df = "data", formulae = weight_form$formulae, : object 'weight_form' not found

There are some converegence warnings. We can see details here:

poly.fit$convergence

## # A tibble: 28 × 4
##    poly            completed message all_converged
##    <chr[1d]>       <chr>     <chr>   <lgl>        
##  1 age,agem_2      TRUE      <NA>    TRUE         
##  2 age,agem_1      TRUE      <NA>    TRUE         
##  3 age,agem_0_5    TRUE      <NA>    TRUE         
##  4 age,age_log     TRUE      <NA>    TRUE         
##  5 age,age0_5      TRUE      <NA>    TRUE         
##  6 age,age2        TRUE      <NA>    TRUE         
##  7 age,age3        TRUE      <NA>    TRUE         
##  8 agem_2,agem_1   TRUE      <NA>    FALSE        
##  9 agem_2,agem_0_5 TRUE      <NA>    FALSE        
## 10 agem_2,age_log  TRUE      <NA>    TRUE         
## # ℹ 18 more rows

The output contains a table with the negative log likelihood for each model in each study, the average rank of that model across all studies and the summed negative log likelihood. The model with the lowest summed log-likelhood across studies contains two powers: age^0.5 & age^2.

poly.fit$fit

## # A tibble: 28 × 6
##    model          alspac     bcg      bib   probit   sum_log
##    <chr>           <dbl>   <dbl>    <dbl>    <dbl>     <dbl>
##  1 age0_5,age2  -413072. -25053. -136400. -555192. -1129718.
##  2 age,age2     -414380. -24847. -135066. -562074. -1136366.
##  3 age0_5,age3  -418867. -25055. -136595. -557293. -1137810.
##  4 age,age3     -414581. -25326. -138701. -560576. -1139183.
##  5 age,age0_5   -418305. -25019. -135347. -569398. -1148069.
##  6 age,age_log  -419968. -25864. -139284. -570858. -1155973.
##  7 age,agem_0_5 -420377. -26508. -142164. -570741. -1159790.
##  8 age,agem_1   -420403. -26741. -143292. -570683. -1161119.
##  9 age,agem_2   -420409. -26825. -143753. -570701. -1161687.
## 10 age_log,age2 -419517. -27708. -151222. -565902. -1164349.
## # ℹ 18 more rows

In this scenario we are trying to fit the same model to all cohorts. In other scenarioes you might want to fit a different model to different cohorts. You could still use this table as a reference for the best fitting model, for example for alspac:

poly.fit$fit %>%
select(model, alspac) %>%
arrange(desc(alspac))

## # A tibble: 28 × 2
##    model          alspac
##    <chr>           <dbl>
##  1 age0_5,age2  -413072.
##  2 age,age2     -414380.
##  3 age,age3     -414581.
##  4 age,age0_5   -418305.
##  5 age0_5,age3  -418867.
##  6 age_log,age2 -419517.
##  7 age,age_log  -419968.
##  8 age2,age3    -420299.
##  9 age,agem_0_5 -420377.
## 10 age,agem_1   -420403.
## # ℹ 18 more rows

Final model

Ok, so we’ve got an idea what the best fitting combination of age terms will be. Now we can fit our final model, including sex and the interaction with the age terms. We aren’t worrying about other covariates in this example, but of course you could include them.

final.fit <- ds.lmerSLMA(
    dataName = "data",
  formula = "weight ~ 1 + age0_5 + age2 + sex + age0_5*sex + age2*sex + (1|id_int)")

Making plots

Now we can go through the same process as above to make our plot. Again, we need to create our age transformations in the new data to get the predicted values for weight at each age.

pred_ref_final_m <- tibble(
    age = seq(0, 25, by = 0.1), 
    age0_5 = age^0.5,
    age2 = age^2, 
    sex = 0,
    "age0_5:sex" = age0_5*sex, 
  "age2:sex" = age2*sex)

pred_ref_final_f <- tibble(
    age = seq(0, 25, by = 0.1), 
    age0_5 = age^0.5,
    age2 = age^2, 
    sex = 1,
    "age0_5:sex" = age0_5*sex, 
  "age2:sex" = age2*sex)

pred_ref_final <- bind_rows(pred_ref_final_m, pred_ref_final_f)

Again we get the predicted values, trim them at the ages of the observed data and set sex as a factor.

pred_ref_final <- dh.predictLmer(
    model = final.fit,
    new_data = pred_ref_final,
    coh_names = names(conns)
)

pred_ref_final_trim <- dh.trimPredData(
    pred = pred_ref_final,
    coh_names = c("alspac", "bcg", "bib", "probit", "combined"),
    min = rep(0, 5), 
    max = c(20, 5, 5, 20, 20)
)

pred_ref_final_trim <- pred_ref_final_trim %>%
mutate(sex = factor(sex, levels = c(0, 1), labels = c("Male", "Female")))

And now we can make some plots

ggplot() + 
  geom_point(
    data = sample_frac(observed, .01) %>% dplyr::filter(cohort == "combined"), 
    aes(x = age, y = weight), 
    alpha = 0.4, 
    size = 0.1) +
  geom_line(
    data = pred_ref_final_trim %>% dplyr::filter(cohort == "combined"), 
    aes(x = age, y = predicted, colour = sex), size = 0.8) +
  geom_ribbon(
    data = pred_ref_final_trim %>% dplyr::filter(cohort == "combined" & sex == "Male"), 
    aes(x = age, ymin = low_ci, ymax = upper_ci), alpha = 0.1) +
  geom_ribbon(
    data = pred_ref_final_trim %>% dplyr::filter(cohort == "combined" & sex == "Female"), 
    aes(x = age, ymin = low_ci, ymax = upper_ci), alpha = 0.1) +
  scale_x_continuous(limit = c(0, 20), breaks = seq(0, 20, 5), expand = c(0, 0)) + 
  scale_y_continuous(limit = c(0, 80), breaks = seq(0, 80, 20), expand = c(0, 0)) +
  xlab("Child age") +
  ylab("Predicted weight (KG)") +
  labs(colour = "Sex")

plot of chunk plot-model-3-pooled

We can also plot the trajectories by cohort

ggplot() + 
  geom_point(
    data = sample_frac(observed, .01) %>% dplyr::filter(cohort != "combined"), 
    aes(x = age, y = weight), 
    alpha = 0.4, 
    size = 0.1) +
  geom_line(
    data = pred_ref_final_trim %>% dplyr::filter(cohort != "combined"), 
    aes(x = age, y = predicted, colour = sex), size = 0.8) +
  geom_ribbon(
    data = pred_ref_final_trim %>% dplyr::filter(cohort != "combined" & sex == "Male"), 
    aes(x = age, ymin = low_ci, ymax = upper_ci), alpha = 0.1) +
  geom_ribbon(
    data = pred_ref_final_trim %>% dplyr::filter(cohort != "combined" & sex == "Female"), 
    aes(x = age, ymin = low_ci, ymax = upper_ci), alpha = 0.1) +
  facet_wrap(~cohort) +
  scale_x_continuous(limit = c(0, 20), breaks = seq(0, 20, 5), expand = c(0, 0)) + 
  scale_y_continuous(limit = c(0, 80), breaks = seq(0, 80, 20), expand = c(0, 0)) +
  xlab("Child age") +
  ylab("Predicted weight (KG)") +
  labs(colour = "Sex")

plot of chunk plot-model-3-cohort

Great, this looks like a much better fit. We have captured the initial growth, followed by a reduction in growth rate at age 5 followed by an increase into adolescence.

This model still isn’t perfect - looking at the graph what problem can you see?

Checking model fit

The final step is to check how well the model fits at different age points. Here we can approximate Table 2 from Tilling et al. (2014) “Modelling Childhood Growth …”.

First we get mean observed values for weight between different age bands. The ‘intervals’ argument specifies how we want to bin our age variable. Again this takes a few minutes to run.

observed_by_age <- dh.meanByGroup(
    df = "data", 
    outcome = "weight", 
    group_var = "age", 
    intervals = c(0, 1, 1, 2, 3, 5, 6, 10, 11, 15, 16, 18)
)

observed_by_age

## # A tibble: 19 × 6
##    cohort group  mean std.dev nvalid nmissing
##    <chr>  <chr> <dbl>   <dbl>  <dbl>    <dbl>
##  1 alspac 0_1    5.79    2.08  37577    92104
##  2 bcg    0_1    6.12    1.92   4949     7775
##  3 bib    0_1    5.39    1.88  40931    27933
##  4 probit 0_1    5.94    1.99 102981    88243
##  5 alspac 11_15 48.6    11.3   20591   109090
##  6 probit 11_15 41.6     9.89  13109   178115
##  7 alspac 16_18 67.3    13.9    4299   125382
##  8 probit 16_18 62.9    12.5    7962   183262
##  9 alspac 1_2   12.0     1.98  10065   119616
## 10 bcg    1_2   11.1     2.19   1882    10842
## 11 bib    1_2   11.8     2.34   7927    60937
## 12 probit 1_2   11.1     2.39  17930   173294
## 13 alspac 3_5   17.0     2.48  14382   115299
## 14 bcg    3_5   16.6     2.66   3582     9142
## 15 bib    3_5   17.6     2.73  11164    57700
## 16 probit 3_5   16.9     2.51  13378   177846
## 17 alspac 6_10  29.1     7.07  20138   109543
## 18 bib    6_10  25.6     4.33    409    68455
## 19 probit 6_10  23.0     4.10  14980   176244

We rename our column to ‘observed’:

obs <- observed_by_age %>%
dplyr::rename(observed = mean)

Now we get the average predicted values between the same age bands

pred_by_age <- pred_ref_final_trim %>%
mutate(
    group = case_when(
        age > 0 & age <= 1 ~ "0_1", 
        age >= 1 & age <= 2 ~ "1_2", 
        age >= 3 & age <= 5 ~ "3_5", 
        age >= 6 & age <= 10 ~ "6_10", 
        age >= 11 & age <= 15 ~ "11_15", 
        age >= 16 & age <= 18 ~ "16_18")
        ) %>%
dplyr::filter(!is.na(group)) %>%
group_by(group, cohort) %>%
summarise(
    predicted = round(mean(predicted), 2),
    low_ci = round(mean(low_ci), 2),
    upper_ci = round(mean(upper_ci), 2))

pred_by_age

## # A tibble: 24 × 5
## # Groups:   group [6]
##    group cohort   predicted low_ci upper_ci
##    <chr> <chr>        <dbl>  <dbl>    <dbl>
##  1 0_1   alspac        7.33   7.22     7.44
##  2 0_1   bcg           7.57   7.45     7.69
##  3 0_1   bib           7.38   7.35     7.41
##  4 0_1   combined      7.41  NA       NA   
##  5 0_1   probit        7.31   7.26     7.36
##  6 11_15 alspac       48.9   48.8     49.0 
##  7 11_15 combined     42.2   NA       NA   
##  8 11_15 probit       47     46.9     47.1 
##  9 16_18 alspac       68.7   68.6     68.9 
## 10 16_18 combined     56.9   NA       NA   
## # ℹ 14 more rows

Finally we join the observed and predicted values into one table, and calculate the difference between the two. We can also use the confidence intervals to calculate upper and lower bounds of the residual. You can see that model fit isn’t great for PROBIT towards the extremes. For other cohorts, the looks pretty good.

res_tab <- left_join(obs, pred_by_age, by = c("group", "cohort")) %>%
mutate(
    difference = round(observed - predicted, 2),
    lower_res = round(observed - upper_ci, 2),
    higher_res = round(observed - low_ci, 2),
    group = factor(
        group, 
        levels = c("0_1", "1_2", "3_5", "6_10", "11_15", "16_18"),
        ordered = TRUE), 
    limits = paste0(lower_res, " to ", higher_res)) %>%
filter(!is.na(predicted)) %>%
dplyr::select(group, cohort, observed, predicted, difference, limits) %>%
arrange(group) %>%
group_by(cohort) %>%
group_split %>%
set_names(names(conns))

res_tab

## <list_of<
##   tbl_df<
##     group     : ordered<59289>
##     cohort    : character
##     observed  : double
##     predicted : double
##     difference: double
##     limits    : character
##   >
## >[4]>
## $alspac
## # A tibble: 6 × 6
##   group cohort observed predicted difference limits        
##   <ord> <chr>     <dbl>     <dbl>      <dbl> <chr>         
## 1 0_1   alspac     5.79      7.33      -1.54 -1.65 to -1.43
## 2 1_2   alspac    12.0      10.9        1.11 1.01 to 1.22  
## 3 3_5   alspac    17.0      17.4       -0.49 -0.6 to -0.38 
## 4 6_10  alspac    29.1      29.4       -0.25 -0.36 to -0.14
## 5 11_15 alspac    48.6      48.9       -0.24 -0.35 to -0.13
## 6 16_18 alspac    67.3      68.7       -1.46 -1.61 to -1.3 
## 
## $bcg
## # A tibble: 3 × 6
##   group cohort observed predicted difference limits        
##   <ord> <chr>     <dbl>     <dbl>      <dbl> <chr>         
## 1 0_1   bcg        6.12      7.57      -1.45 -1.57 to -1.33
## 2 1_2   bcg       11.1      11.4       -0.29 -0.41 to -0.17
## 3 3_5   bcg       16.6      16.9       -0.27 -0.4 to -0.15 
## 
## $bib
## # A tibble: 3 × 6
##   group cohort observed predicted difference limits        
##   <ord> <chr>     <dbl>     <dbl>      <dbl> <chr>         
## 1 0_1   bib        5.39      7.38      -1.99 -2.02 to -1.96
## 2 1_2   bib       11.8      11.1        0.7  0.67 to 0.74  
## 3 3_5   bib       17.6      17.2        0.41 0.37 to 0.44  
## 
## $probit
## # A tibble: 6 × 6
##   group cohort observed predicted difference limits        
##   <ord> <chr>     <dbl>     <dbl>      <dbl> <chr>         
## 1 0_1   probit     5.94      7.31      -1.37 -1.42 to -1.32
## 2 1_2   probit    11.1      10.7        0.45 0.4 to 0.5    
## 3 3_5   probit    16.9      16.9       -0.02 -0.08 to 0.04 
## 4 6_10  probit    23.0      28.3       -5.32 -5.38 to -5.25
## 5 11_15 probit    41.6      47         -5.38 -5.46 to -5.31
## 6 16_18 probit    62.9      66.1       -3.15 -3.26 to -3.04

Finish by logging out of the demo server

datashield.logout(conns)