Analyzing Case-Cohort Studies

Prentice (1986) proposed a case-cohort design for use in large cohort studies. In this design a likelihood is constructed from a simple or stratified random sample drawn from the full cohort. The likelihood, as described in Technical Details, includes contributions from all members of the sample and from all cases, whether or not they were included in the sample. Parameter estimation in this model proceeds as in the usual partial-likelihood-based proportional hazards model, but it involves a correction to the variance estimates that takes into account failures among those not included in the risk set. PEANUTS can be used to model case-cohort data. As described by Therneau and Li (1999) and Langholz and Jiao (2007), the variance adjustment can be computed as a function of dfbeta statistics. Langholz and Jiao also describe the variance adjustment for designs based on a stratified random sample.

The example in this section uses a case-cohort study constructed by Langholz and Jiao as a simple random sample derived from an earlier version of the TB fluoroscopy data (Boice and Monson 1977).

Since the primary time scale of interest for the baseline rates in these analyses is attained age, but women only enter the study at the time of first exposure, we will use age at entry to the study as an entry-time variable. This is a general feature of the program that can be used in the same manner whether or not one is dealing with case-cohort data.
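The effect of an entry-time variable on risk-set membership can be sketched in a few lines of Python. This is only an illustration, not PEANUTS code: the records are hypothetical, and the boundary convention (entry exclusive, exit inclusive) is an assumption made for the sketch.

```python
# Hypothetical records: (id, inage, xtage). The values are illustrative only.
records = [
    ("a", 20.0, 45.0),
    ("b", 30.0, 60.0),
    ("c", 50.0, 70.0),
]


def at_risk(inage, xtage, age):
    """Return True if a subject is under observation at the given attained age.

    Assumed convention: a subject is at risk for inage < age <= xtage.
    """
    return inage < age <= xtage


def risk_set(records, age):
    """List the ids of all subjects at risk at the given attained age."""
    return [rid for rid, inage, xtage in records if at_risk(inage, xtage, age)]


print(risk_set(records, 40.0))  # subject c has not yet entered
print(risk_set(records, 55.0))  # subject a has already exited
```

With attained age as the time scale, a woman contributes to a risk set only at ages between her entry and exit ages, which is what the entry-time variable accomplishes.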

The data for both of these examples are stored in comma-separated value files, and the two data sets are quite similar. The following table describes the most important variables in these data sets.

Description of the TB Fluoroscopy Data

Name	Description
id	Subject identifier
dose	Total radiation dose in rad
inage	Age at entry
xtage	Age at exit
tragecat	Age at treatment category (< 15, 15-19, 20-29, 30+)
ccoh_ind	Case-cohort indicator (0 sampled non-case; 1 sampled case; 2 unsampled case)

Example 6.17 Fitting Case-Cohort Models

We begin by naming the variables in the input data set, specifying the input format, and defining a pre-input edit that skips records with missing dose (dose < 0). Following this, the data are read from a comma-separated value file called tbfccus.csv. The commands to do this are

USETXT ../exdata/tbfccus.csv @

tran if dose < 0 then delete endif @

INPUT @

Because the transformation makes use of the DELETE function, it must be given before the data are input. The DELETE function drops records from the working data set. Once a record has been deleted in this way, it is not available for analysis in the session (although it is still included in the input data file). As illustrated earlier, the SELECT command can be used to analyze a subset of the working data set. In this example the DELETE function is used to exclude records with missing dose.
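For readers who want to see the same filtering logic outside the program, the following Python sketch mimics a pre-input edit that drops records with a negative (missing) dose while they are being read. The rows are hypothetical and only borrow a few variable names from the table above; this is not how EPICURE itself reads the file.

```python
import csv
import io

# Hypothetical rows; a negative dose is the missing-value code.
raw = io.StringIO(
    "id,dose,inage,xtage\n"
    "1,0,21.5,60.2\n"
    "2,-1,18.0,55.0\n"
    "3,310,25.3,48.9\n"
)

# Keep a record only if its dose is not coded as missing. The dropped record
# never enters the working data, but the input itself is left unchanged.
working = [row for row in csv.DictReader(raw) if float(row["dose"]) >= 0]
kept_ids = [row["id"] for row in working]

print(kept_ids)
```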

Once the data are read, we define indicators for any case (breast) and for cases that were not included in the cohort (outside) using the following transformations

TRAN breast = ccoh_ind > 0 ;

outside = ccoh_ind == 2 ;

The first transformation defines breast equal to 1 for any value of ccoh_ind that is greater than 0 and 0 otherwise. The second defines outside as equal to 1 for cases that were not in the sampled cohort and 0 for all other records. The outside-the-cohort indicator is one of the key variables that must be specified for a case-cohort analysis.
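As a cross-check of the coding, the two indicators can be written out in Python. This is a sketch of the logic described above rather than EPICURE syntax; the function name is made up for the illustration.

```python
def case_indicators(ccoh_ind):
    """Map the case-cohort code to the (breast, outside) indicator pair.

    ccoh_ind: 0 = sampled non-case, 1 = sampled case, 2 = unsampled case.
    """
    breast = int(ccoh_ind > 0)    # any case, sampled or not
    outside = int(ccoh_ind == 2)  # case that was not in the sampled cohort
    return breast, outside


for code in (0, 1, 2):
    print(code, case_indicators(code))
```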

The command

CATEG dose as dcat 0 / 0 / 250 / > @

creates the dose category variable (dcat) used by Langholz and Jiao. This variable is 1 for women with dose 0, 2 for women with doses between 1 and 250 rad, and 3 for women with higher doses.
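The three-level dose coding can be sketched in Python as follows. The function is hypothetical, it assumes the levels are numbered 1, 2, 3 in order of increasing dose, and the treatment of a dose of exactly 250 rad (placed in the middle category here) is an assumption, since the text does not state which side of the cutpoint is inclusive.

```python
def dose_category(dose_rad):
    """Return the dose category (1, 2, or 3) for a dose in rad.

    Assumed boundaries: 0 -> 1, (0, 250] -> 2, above 250 -> 3.
    """
    if dose_rad < 0:
        raise ValueError("negative dose is the missing-value code")
    if dose_rad == 0:
        return 1
    if dose_rad <= 250:
        return 2
    return 3


for dose in (0, 100, 250, 300):
    print(dose, dose_category(dose))
```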

The command

LEVE tragecat @

is used to indicate tragecat is a categorical variable. The program will determine the number of levels (4 in this case).

Once the data have been read and the necessary variables defined, it is necessary to specify the key variables and to describe the nature of the case-cohort sample. As in any survival model for ungrouped data in EPICURE, the key variables in this example are the cases variable and the entry and exit time variables. The commands used to specify them are the same as those used in other PEANUTS modeling:

CASES breast @

ENTRY inage @
TIME xtage @

For a case-cohort analysis, we need to provide additional information on a) the variable used to indicate cases outside the cohort (i.e. unsampled cases) and b) the nature of the sampling. It is also possible to choose between two estimators of the adjusted variance: asymptotic (the default) and robust (see Langholz and Jiao 2007 for details). All of this is specified using the CASECOHORT command.

The first argument to this command is the name of the variable that identifies cases that were not included in the sample. This variable should be coded as 0 for sampled records (cases or non-cases) and 1 for cases that were not included in the sample. In this example it is the variable called outside that was defined above. This is followed by information on the nature of the sample, which can be either the size of the full cohort or the sampling fraction. For this example, we specify the size of the full cohort using the following command:

CASECOHORT outside COHORTSIZE 1741 @

Following this command the program provides a summary of the sample, which for this example is as follows:

Sampling fraction : 0.0568639

Sample size : 99

Unsampled records : 70
Population size : 1741

The summary includes information on the sampling fraction (about 6% in this case), the number of records included in the sampled cohort, and the number of unsampled records. (It is not necessary that all of the unsampled records be cases. If non-cases are included in the unsampled records, i.e. those with a non-zero value of the outside-the-cohort indicator variable, they will not be used in the fitting and have no effect on the variance computations.)
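The figures in this summary are easy to verify by hand. The short Python sketch below recomputes the sampling fraction from the sample size and the cohort size, and the total number of records used in fitting; the variable names are chosen for the illustration only.

```python
sample_size = 99     # records in the sampled cohort
unsampled = 70       # unsampled records (cases outside the sample)
cohort_size = 1741   # value given to COHORTSIZE

# Sampling fraction is the share of the full cohort that was sampled.
sampling_fraction = sample_size / cohort_size

# Records available for fitting: sampled cohort plus unsampled cases.
records_used = sample_size + unsampled

print(f"Sampling fraction: {sampling_fraction:.7f}")
print(f"Records used:      {records_used}")
```

The result agrees with the sampling fraction of about 5.7% reported by the program.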

Model specification proceeds as for other EPICURE models. In this case we will fit a relative risk model with the zero dose group as the baseline and estimates of the log relative risk for the two non-zero dose categories. This model is specified and fit with the command

FIT dcat @

For this example the fit summary table is as follows:

Hazard function regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

breast is used for cases

inage is used for entry (left-truncation) times

xtage is used for survival time

Case-cohort analysis with asymptotic variance (see Langholtz & Jiao 2007 for details)

99 cohort records and 70 unsampled records from population of 1741

5.7% sampling fraction

outside indicates cases outside of cohort

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Log-linear term 0

2 dcat_2................... 0.6572 0.3449 1.905 0.0567

3 dcat_3................... 1.553 0.8329 1.865 0.0622

Records used 169

-2 * log-likelihood 603.801 Free parameters 2

AIC 607.801 Informative risk sets 75

The first part of this output is identical to what one sees for a model fit to the full cohort. This is followed by a short description of the case-cohort sampling that includes an indication of the variance computation method (asymptotic in this case), and provides the name of the outside-the-cohort indicator.

The parameter summary table looks like that for any other hazard function regression model fit using PEANUTS. The only difference is that the standard errors are adjusted standard errors, and the test statistics and P-values are based on these adjusted standard errors. (If the PSAVE command were used to save the model information, the saved standard errors and covariance matrix would also be the values adjusted for the sampling fraction.) It is important to note that the BOUNDS and PROFILE commands do not provide proper confidence intervals since they do not make allowance for the nature of the sampling.
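The relationship between the adjusted standard errors and the reported test statistics and P-values can be illustrated with a short Python sketch. It assumes the test statistic is the estimate divided by its standard error and that the P-value is two-sided and based on the standard normal distribution; this assumption reproduces the values in the table above to the printed precision.

```python
import math


def wald_test(estimate, std_err):
    """Return the Wald test statistic and its two-sided normal P-value."""
    z = estimate / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Estimates and adjusted (asymptotic) standard errors from the summary table.
for name, est, se in (("dcat_2", 0.6572, 0.3449), ("dcat_3", 1.553, 0.8329)):
    z, p = wald_test(est, se)
    print(f"{name}: test stat {z:.3f}, P value {p:.4f}")
```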

The CI command can be used to obtain adjusted Wald-like confidence intervals for all of the model parameters that are based on the adjusted standard errors. For this model the Wald bounds for the two estimated parameters are:

CI @

95% Confidence Bounds

# Name Estimate Std. Err. Lower Upper

-- ---------------------------- ---------- --------- -------- --------

Log-linear term 0

2 dcat_2................... 0.6572 0.3449 -0.01880 1.333

EXP(estimate) 1.929 1.412 0.9814 3.793

3 dcat_3................... 1.553 0.8329 -0.07934 3.186
EXP(estimate) 4.726 2.300 0.9237 24.18

The estimates and 95% bounds for the category-specific relative risks are 1.93 (0.98 to 3.79) and 4.73 (0.92 to 24.2).
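The Wald bounds in this output can be reproduced from the estimates and adjusted standard errors. The following Python sketch assumes the usual construction, estimate plus or minus the normal quantile times the standard error, with the relative-risk bounds obtained by exponentiating the bounds on the log scale. Small differences in the last digit reflect rounding of the printed estimates.

```python
import math
from statistics import NormalDist


def wald_bounds(estimate, std_err, level=0.95):
    """Return (lower, upper) Wald bounds on the scale of the estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_err, estimate + z * std_err


# Estimates and adjusted standard errors from the parameter summary table.
fits = {"dcat_2": (0.6572, 0.3449), "dcat_3": (1.553, 0.8329)}

for name, (est, se) in fits.items():
    low, high = wald_bounds(est, se)
    print(
        f"{name}: log RR {est:.4f} ({low:.4f} to {high:.4f}); "
        f"RR {math.exp(est):.3f} ({math.exp(low):.3f} to {math.exp(high):.3f})"
    )
```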

As noted above, the parameter variance estimates were computed using the asymptotic variance method. We can repeat the computations and obtain the standard errors based on the robust variance method using the following commands:

CASECOHORT ROBUST @
FIT @

These commands produce the following parameter summary table:

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Log-linear term 0

2 dcat_2................... 0.6572 0.3381 1.944 0.0519

3 dcat_3................... 1.553 0.8273 1.877 0.0605

Records used 169

-2 * log-likelihood 603.801 Free parameters 2

AIC 607.801 Informative risk sets 75

The parameter estimates and the maximum likelihood values are the same as for the previous fit, but the standard errors and, as a result, the test statistics and P-values are slightly different. We could use the CI command as before to obtain relative risk estimates and confidence bounds, but for this example we use the LINCOMB command mentioned above. The output is as follows:

LINCOMB 2 @

Estimate Std.Error 95% Wald Bounds

MLE : 0.65720 0.33808 -0.0054253 1.3198

exp(MLE): 1.9294 0.99459 3.7428

LINCOMB 3 @

Estimate Std.Error 95% Wald Bounds

MLE : 1.5531 0.82734 -0.068426 3.1747
exp(MLE): 4.7263 0.93386 23.920

The confidence bounds are slightly different from those obtained using the asymptotic variance adjustment.

The LINCOMB command can also be used to obtain an estimate of, and bounds on, the ratio of the risks for the two non-zero dose categories. The command for this is LINCOMB 3 - 2 @. The output for this is:

Estimate Std.Error 95% Wald Bounds

MLE : 0.89593 0.80803 -0.68777 2.4796

exp(MLE): 2.4496 0.50269 11.937

This indicates that the risks in the high dose category are 2.45 times those in the lower dose category with a 95% confidence bound of (0.50 to 11.9).
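The arithmetic behind this contrast can be sketched in Python. The difference of the two log relative risks and its bounds follow directly from the printed output. The covariance is not shown in the output, so the value computed here is only the covariance implied by the three printed robust standard errors under the standard identity var(b3 - b2) = var(b3) + var(b2) - 2 cov(b2, b3); it is a back-calculation for illustration, not a program result.

```python
import math
from statistics import NormalDist

# Robust-variance estimates and standard errors from the LINCOMB output.
b2, se2 = 0.65720, 0.33808
b3, se3 = 1.5531, 0.82734
se_diff = 0.80803  # printed standard error of (parameter 3 - parameter 2)

# Log of the risk ratio between the high and low non-zero dose categories.
diff = b3 - b2

# Covariance implied by the printed standard errors (derived, not printed).
implied_cov = (se2**2 + se3**2 - se_diff**2) / 2.0

# 95% Wald bounds on the log scale, then exponentiated to the ratio scale.
z = NormalDist().inv_cdf(0.975)
lower, upper = diff - z * se_diff, diff + z * se_diff

print(f"log ratio {diff:.4f} ({lower:.4f} to {upper:.4f})")
print(f"ratio {math.exp(diff):.3f} ({math.exp(lower):.3f} to {math.exp(upper):.2f})")
print(f"implied covariance {implied_cov:.4f}")
```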

It is also possible to fit linear ERR or other generalized risk models to case-cohort data. The commands below fit such a model in which the ERR estimate has units of ERR per Gray (Gy), where 1 Gy is equal to 100 rad. After switching back to the asymptotic variance computation method, the commands compute the dose in Gy, clear the baseline term, specify and fit the ERR model, and compute 95% Wald-like bounds for the ERR parameter estimate.

! Fit an ERR model (with units of ERR/Gy) and obtain Wald bounds

TRAN dgy = dose/100 @

LOGL 0 @

LINE 1 dgy @

fit @

LINCOMB 1 @

As indicated in the output below, the estimated ERR per Gy is 0.57 with a 95% confidence interval of (-0.23 to 1.38).

Output 6.31

Case-cohort analysis with asymptotic variance (see Langholtz & Jiao 2007 for details)

99 cohort records and 70 unsampled records from population of 1741

5.7% sampling fraction

outside indicates cases outside of cohort

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Linear term 1

1 dgy...................... 0.5716 0.4106 1.392 0.164

Records used 169

-2 * log-likelihood 607.321 Free parameters 1

AIC 609.321 Informative risk sets 75

LINCOMB 1 @

Estimate Std.Error 95% Wald Bounds

MLE : 0.57156 0.41058 -0.23316 1.3763
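The unit conversion and the Wald bounds for the linear ERR model can be checked with a short Python sketch. It assumes the linear model form in which the relative risk at dose d (in Gy) is 1 + ERR x d; the helper names and the example dose of 250 rad are chosen for illustration only.

```python
from statistics import NormalDist

RAD_PER_GY = 100.0  # 1 Gy = 100 rad


def dose_in_gy(dose_rad):
    """Convert a dose in rad to Gy, as in the dgy transformation."""
    return dose_rad / RAD_PER_GY


def relative_risk(err_per_gy, dose_gy):
    """Relative risk under a linear ERR model: 1 + ERR per Gy * dose."""
    return 1.0 + err_per_gy * dose_gy


# Estimate and adjusted standard error from the LINCOMB output.
err_per_gy, std_err = 0.57156, 0.41058

z = NormalDist().inv_cdf(0.975)
lower, upper = err_per_gy - z * std_err, err_per_gy + z * std_err
print(f"ERR/Gy {err_per_gy:.2f} ({lower:.2f} to {upper:.2f})")

# Illustrative fitted relative risk at a dose of 250 rad (2.5 Gy).
rr_250 = relative_risk(err_per_gy, dose_in_gy(250.0))
print(f"RR at 250 rad: {rr_250:.2f}")
```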