Stratified models for the odds ratio – matched case control studies

This section contains examples that illustrate the use of GMBO/PECAN to fit stratified models for the odds ratio. The primary focus here is on the analysis of matched case control studies using conditional logisticregression, but one can also fit stratified models using unconditional logistic regression. The latter approach is useful, and is the EPICURE default choice, for strata (case-contral sets) with large (more than 50 by deafult) individuals.

The data used for most of the examples originated in a matched case-control study of the effect of exogenous estrogens on the risk of endometrial cancer among women in a retirement community near Los Angeles (Mack, Pike et al. 1976). The study incldued 63 cases with 4 matched controls per case. Matching was made on the basis of age (within one year), vital status and community of residence at the time of diagnosis of the case, age at entry into the retirement community, and marital status. Women who had a hysterectomy prior to the time at which the case was diagnosed were not considered as controls. The analysis of these data have been discussed in detail by Breslow and Day (Breslow and Day 1980).

The variables in this data set are

setno	Matched set number
cases	Case-control indicator
gall	Gall bladder disease indicator
hyp	Hypertension indicator
ob	Obesity indicator
est	Estrogen usage indicator
cdose	Coded conjugate estrogen dose
dura	Estrogen usage duration in months
non	Non-estrogen drug usage indicator
age	Age at time of case diagnosis

The case-control indicator is coded 1 for cases and 0 for controls. The other indicator variables are coded 1 for yes and 0 for no. Conjugated estrogen dose is coded as 0 for none, 1 for -0.1 -0.299 mg/day, 2 for 0.3 -0.625 mg/day, and 3 for 0.6325 or more mg/day. Data on obesity, estrogen dose, and estrogen usage duration were missing for some women. Missing values for these variables are coded as 9, 9, and 99, respectively. The data are sorted by set number, and within each set the case is the first.

Example 5.11 Matched Case-Control Data with 1:1 Matching

For this example, we will reproduce some of the analyses described in Section 7.3 of Breslow and Day (Breslow and Day 1980). These analyses make use of 63 case-control pairs consisting of the case and first control from each matched set in the data set. Transformations will be used to create a variable that is used to select desired records.

The commands used to read the input data (which are assumed to be stored in a text file suitable for free-format input) and create some additional variables needed for the analyses are

NAMES setno cases gall hyp ob est cdose dura non age @

INPUT ../exdata/leimod.dat@

MISS cdose 9; ob 9; dura 99 @

CONS #cage = -1 @

TRAN IF cases == 1 THEN

#cage = age ;

ENDIF ;

cage = #cage ;

categ = 1 + (cage >= 65) + (cage >= 75) ;

cage70 = cage - 70 ; agegp = (age >= 70) ;

cest = (cdose > 0) ;

caseno = GL(5,1) ; pairs = caseno <= 2 ;

SAVE @

The data are read from a file called LEIMOD.DAT. The MISS command is used to recode missing values for dose, ob, and dura to the EPICURE missing value code. The constant transformation is used to initialize the named constant (#cage) needed for computation of the special age variable (cage) used in many of the Breslow and Day analyses. This variable is defined to be the age of the case in each matched set.

The GL function generates a sequence of regular labels. In this case there are five labels and each label is repeated one time, resulting in the sequence 1, 2, 3, 4, 5, 1 ,2, 3, 4, 5, . Because of the way in which the data are sorted, the value of caseno is 1 for the cases and 2 for the first control for each case. The pairs variable is an indicator used to select records for the initial analyses. The data are saved in a BSF file with the default name, LEIMOD.BSF.

Since the analyses in this example will use only one control per case, we select those cases with pairs equal to 1. This is accomplished with the command

SELECT pairs == 1 @

Although it is not necessary for this example because the default names are used, we explicitly specify the case-control indicator and set-number variables using the commands. The COND command instructs the program to use conditional likelihood.

CASES cases

STRATA setno@

COND @

We continue this example by reproducing some of the results described in Table 7.3 of (Breslow and Day 1980). The first model to be considered is a model with no covariates, that is, a model in which none of the covariates have an effect on the cancer risk. The NULL command is given after the fit to designate this model as the null model for a subsequent likelihood ratio test. This will create named constants called #_nuldv and #_nulnp, which contain the deviance and degrees of freedom, respectively, for this model. We also use constant transformations to save another copy of the deviance for this model in a named constants called #basdv for use in later computations. The commands to fit this simple model and save the deviance are

FIT @

NULL

CONS #basdv = #_dv @

We will now compute score and likelihood ratio tests for an effect of estrogen usage on the risk of endometrial cancer. The risk model used here is the usual log-linear model

After computation of the score test, we fit the model and compute the likelihood ratio test of the null hypothesis that estrogen usage has no effect on the risk of endometrial cancer. In addition, we compute the Wald confidence bounds (this also displays the relative risk estimate). This is done using the following commands:

SCORE + est @

FIT @

LRT

CI @

The SCORE command updates the model and computes the score statistic for the null hypothesis that estrogens have no effect on the risk of endometrial cancer. The FIT command is used to fit the updated model, and CI causes 95 percent Wald bounds to be computed and displayed. The LRT command instructs the program to compute and display the likelihood ratio test. The output produced by these commands is shown in Output 5.11.

The summary for the fitted model differs slightly from that for an unstratifeid model. The model description and parameter summary table are essentially identical. However, the deviance summary is different. In EPICURE the “deviance” for conditional logistic regression models is defined as minus two times the log-likelihood. This is identical to the goodness-of-fit () statistic of Breslow and Day (Breslow and Day 1980). Defining the deviance in this way means that the likelihood ratio statistic for the comparison of nested models can be computed simply as the difference between the deviance values for the two fits with degrees of freedom equal to the difference in the number of free parameters in the model. The PECAN deviance summary also includes the number of free parameters in the model, the number of risk sets used in the analysis, and the number of non-informative risk sets. Non-informative risk sets are those risk sets in which the cases and controls have the same pattern for all covariates in the model. In this model, there are 31 case-control pairs in which both the case and control used or did not use estrogens and, thus, provide no information for the estimation of the odds ratio non-informative.

Output 5.11 Estimation and testing for an estrogen effect on endometrial cancer risk-1:1 matching

FIT @

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Stratification on setno with 63 strata

Conditional likelihood used for 63 strata

Using pairs == 1

cases is used for cases

Records used 126

Deviance 87.3365 Free parameters 0

Informative strata 63 (handled with the conditional likelihood)

NULL

CONS #basdv = #_dv @

SCORE + est @

The score statistic is 21.125 with df = 1 (P < 0.001 )

FIT @

Iter Step Deviance

0 0 87.3365

1 0 64.2288

2 0 62.9341

3 0 62.8875

4 0 62.8874

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Stratification on setno with 63 strata

Conditional likelihood used for 63 strata

Using pairs == 1

cases is used for cases

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Log-linear term 0

1 est...................... 2.269 0.6065 3.741 < 0.001

Records used 126

Deviance 62.8874 Free parameters 1

Informative strata 32 (handled with the conditional likelihood)

Non-Informative strata 31

LRT

LR statistic 24.45 Degrees of freedom 1

P value < 0.001

CI @

95% Confidence Bounds

# Name Estimate Std. Err. Lower Upper

-- ---------------------------- ---------- --------- -------- --------

Log-linear term 0

1 est...................... 2.269 0.6065 1.080 3.457

EXP(estimate) 9.667 1.834 2.945 31.73

The next set of commands are used to fit a set of nested models for the effects of gallbladder disease, hypertension, and the joint effect of having had both of these conditions on endometrial cancer risk. The models to be used in these analyses are all contained in the following model:

Before fitting the gallbladder model, we compute a score statistic for the gallbladder disease effect. This could be done in a similar way for the subsequent models. After the first fit, we use the LRT command to compute the likelihood ratio statistic and then save the information about the current fit for use as the null model in the next test. After the final test, we explicitly compute the two degree-of-freedom likelihood ratio test comparing the full model to the model with no parameters and use the CHISQ command to evaluate the P-value for this test. If we had not given the NULL command after the first fit, the LRT command could have been used to compute this statistic and the associated P-value. However, since the null model was changed, it is necessary to compute the test statistic and its associated P-value explicitly. This computation makes use of the #basdv constant saved earlier. The commands to fit these models and carry out the tests of interest are

NOMODEL

FIT @

SCORE + gall @

FIT @ LRT

NULL

SCORE +hyp @

FIT @

LRT NULL

CONS #dvdif = #basdv - #_dv @

CHISQ 2 #dvdif @

FIT + gall*hyp @ LRT

This series of commands is generally straightforward; however, you should note the use of the NOMODEL command to clear the previous model. This command removes all parameters currently in the model. (This could also be accomplished with the LOGL 0 @ command.) This is done here only because we wish to compute the score test. A portion of the output produced by these commands is shown below.

Output 5.12 Output (partial) for estimation and testing for the effects of gallbladder disease and hypertension on the risk of endometrial cancer - 1:1 matching

NOMODEL

FIT @

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Stratification on setno with 63 strata

Conditional likelihood used for 63 strata

Using pairs == 1

cases is used for cases

Records used 126

Deviance 87.3365 Free parameters 0

Informative strata 63 (handled with the conditional likelihood)

SCORE + gall @

The score statistic is 3.55556 with df = 1 (P = 0.0593)

FIT @

Iter Step Deviance

0 0 87.3365

1 0 83.6698

2 0 83.6536

3 0 83.6536

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Stratification on setno with 63 strata

Conditional likelihood used for 63 strata

Using pairs == 1

cases is used for cases

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Log-linear term 0

1 gall..................... 0.9555 0.5262 1.816 0.0694

Records used 126

Deviance 83.6536 Free parameters 1

Informative strata 18 (handled with the conditional likelihood)

Non-Informative strata 45

LRT

LR statistic 3.683 Degrees of freedom 1

P value 0.0550

NULL

SCORE +hyp @

The score statistic is 0.861336 with df = 1 (P = 0.353)

FIT @

Iter Step Deviance

0 0 83.6536

1 0 82.7887

2 0 82.7878

3 0 82.7878

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Stratification on setno with 63 strata

Conditional likelihood used for 63 strata

Using pairs == 1

cases is used for cases

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Log-linear term 0

1 gall..................... 0.9704 0.5307 1.828 0.0675

2 hyp...................... 0.3481 0.377 0.9233 0.356

Records used 126

Deviance 82.7878 Free parameters 2

Informative strata 39 (handled with the conditional likelihood)

Non-Informative strata 24

LRT

LR statistic 0.8657 Degrees of freedom 1

P value 0.352

NULL

CONS #dvdif = #basdv - #_dv @

CHISQ 2 #dvdif @

Chi square = 4.5487 df = 2 P = 0.102864

FIT + gall*hyp @

Iter Step Deviance

0 0 82.7878

1 0 80.8714

2 0 80.8413

3 0 80.8413

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Stratification on setno with 63 strata

Conditional likelihood used for 63 strata

Using pairs == 1

cases is used for cases

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Log-linear term 0

1 gall..................... 1.517 0.699 2.17 0.03

2 hyp...................... 0.6270 0.4353 1.44 0.15

3 gall * hyp............... -1.548 1.125 -1.377 0.169

Records used 126

Deviance 80.8413 Free parameters 3

Informative strata 39 (handled with the conditional likelihood)

Non-Informative strata 24

LRT

LR statistic 1.947 Degrees of freedom 1

P value 0.163

We conclude this example with a categorical model for the estrogen dose-response function. The model can be written as

where indexes the three conjugated estrogen dose groups for exposed women (group 0 corresponds to unexposed women). To fit this model, we first use the LEVELS command to indicate that cdose is to be treated as a categorical variable. The model is then fit in the usual way. The commands to carry out this analysis are

LEVELS cdose @

FIT cdose @

The output for this model is shown in Output 5.13.

Earlier we noted that information on conjugated estrogen dose was not available for all women in the study and used the MISS command to recode the missing values to the EPICURE missing value internal code. Only those case-control pairs in which both the case and her matched control have nonmissing values for cdose should be used in the analysis. If you look at the deviance summary in the figure, you will see that the number of risk sets is given as 59, which is the number of case-control pairs for which both women have known values for cdose. As a model is fit, PECAN, like all of the EPICURE regression programs, checks the variables involved in the fit for missing values and automatically excludes those records with missing data. For PECAN, this also involves checking whether a risk set can still be included in the analysis. Additional topics related to missing values will be explored further in the next example. The complete set of commands used in this example is shown in Listing 5.3 at the end of this chapter.

Output 5.13 A categorical dose-response model - 1:1 Matching

LEVELS cdose @

cdose has 4 levels from 0 to 3

FIT cdose@

Iter Step Deviance

0 0 81.7914

1 0 63.5681

2 0 62.9863

3 0 62.9796

4 0 62.9796

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Stratification on setno with 59 strata

Conditional likelihood used for 59 strata

Using pairs == 1

cases is used for cases

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Log-linear term 0

2 cdose_1.................. 1.524 0.6181 2.466 0.0137

3 cdose_2.................. 1.266 0.569 2.225 0.0261

4 cdose_3.................. 2.120 0.693 3.059 0.00222

Records used 122 Unused records 4

Deviance 62.9796 Free parameters 3

Informative strata 45 (handled with the conditional likelihood)

Non-Informative strata 14

Example 5.12 Matched Case-Control Data with 1:M Matching

In the previous example, we considered a subset of the endometrial study data in which the data were limited to the first matched control for each case. In this example, we continue to work with the endometrial cancer data, but we use all of the data, that is, four controls per case. Because PECAN automatically detects 1:M or N:M (multiple cases and controls in each stratum) matching and chooses the appropriate estimation algorithm, it is not necessary for the user to do anything special when working with such data. The commands to read the data are identical to those used in the previous example and will not be repeated here. (One could carry out all of the analyses described in this example in the same session as the analyses of the first example by canceling the selection used to restrict the data to 1:1 matching with the NOSELECT@ or SELECT@ command.)

We begin by fitting the null model, which assumes an odds ratio of 1 in each matched set, and then we consider the model with a multiplicative estrogen effect. The deviance for the null model is saved for use in the computation of likelihood ratio tests and P-values. After computation of the score statistic for an estrogen effect, we fit the estrogen effect model and compute the likelihood ratio statistic (using the script file discussed above). The commands to fit these models and carry out these tests are

FIT @ NULL

SCORE +est @

FIT @ LRT

The output from this analysis is shown below.

Output 5.14 Output from fit of null model and estrogen effect model

FIT@

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Stratification on setno with 63 strata

Conditional likelihood used for 63 strata

cases is used for cases

Records used 315

Deviance 202.789 Free parameters 0

Informative strata 63 (handled with the conditional likelihood)

NULL

SCORE +est @

The score statistic is 31.1556 with df = 1 (P < 0.001 )

FIT @

Iter Step Deviance

0 0 202.789

1 0 168.806

2 0 167.462

3 0 167.443

4 0 167.443

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Stratification on setno with 63 strata

Conditional likelihood used for 63 strata

cases is used for cases

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Log-linear term 0

1 est...................... 2.074 0.4208 4.928 < 0.001

Records used 315

Deviance 167.443 Free parameters 1

Informative strata 58 (handled with the conditional likelihood)

Non-Informative strata 5

LRT

LR statistic 35.35 Degrees of freedom 1

P value < 0.001

The format of the parameter summary table is identical to that in the previous example. The point estimate of the log of the relative risk for estrogen exposure, 2.074, is similar to that obtained in the restricted data set of the previous example (2.269), but, because of the use of the additional data, the standard error is about 30 percent less in the current analysis. Also, the number of non-informative risk sets is much smaller in the full data set (5 sets) than it was in the 1:1 matched data (31 sets).

Thus far all of the PECAN examples have made use of the classic linear model for the log of the relative risk. We will now consider an alternative but, in this simple case, equivalent model in which the relative risk is modeled as a linear function of estrogen usage. The model is

The commands to specify and fit this model are

NOMODEL

LINE 1 est @

FIT @

The first command resets the model, which in this case involves removing the parameter associated with est from the log-linear subterm of term 0, and is equivalent to a LOGL 0 @ command. If the previous model contained parameters in several terms or subterms, the NOMODEL command provides a convenient alternative to a series of empty subterm specification commands. The model of interest is specified with the LINE 1 command and fit with the FIT command. The parameter summary table for this fit is shown below.

Output 5.15 Excess relative risk model with matched case-control data

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Stratification on setno with 63 strata

Conditional likelihood used for 63 strata

cases is used for cases

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Linear term 1

1 est...................... 6.955 3.348 2.078 0.0377

Records used 315

Deviance 167.443 Free parameters 1

Informative strata 58 (handled with the conditional likelihood)

Non-Informative strata 5

Because est is a binary indicator variable, the deviance for this fit is identical to the previous model. The parameter estimate, 6.955, is equal to , where 2.074 is the estimate of the log relative risk in the log-linear (multiplicative) risk model. Note that the standard error for the excess relative risk parameter is larger than we would expect based upon the P-value for the likelihood ratio test, P <0.001. (This test is identical to that carried out for the log-linear model.) This is a consequence of the parameterization (called parameter effects curvature in modern, nonlinear regression literature) and indicates that one should not rely on the Wald statistic as a test for the statistical significance of the estrogen effect in this model. It also indicates that Wald-type confidence intervals for this parameterization are misleading. These issues were discussed in detail in the GMBO/PECAN examples. One might obtain a better asymptotic description of the variability in the excess relative risk by estimating the logarithm of the excess relative risk using the model

which could be specified and fit with the commands

LINE 1 est=1 @

LOGL 1 %CON @

FIT @

As discussed above and in the GMBO/PECAN examples, likelihood-based confidence bounds provide a useful parameterization-independent description of the variability in specific parameters (for equivalent models). The BOUNDS command can be used to compute likelihood-based bounds for parameters in all EPICURE regression programs. The 95 percent likelihood bounds for the parameter in this linear model are (2.71, 18.78) while those for the log-linear model are (1.00, 2.93). You can easily check the equivalence of these bounds.

Despite potential asymmetry in the likelihood surface, alternative parameterizations such as those considered here are useful in some problems. For example, if one is interested in dose-response models over a broad range of exposures, it may be that the relative risk is better modeled as a linear rather than log-linear function of dose. It may also be the case that effects of factors that modify the risk are better modeled as effects on the excess relative risk than as multiplicative effects on the relative risk.

We continue this example with an illustration of hypothesis tests in the presence of missing values in some of the covariates of interest. In particular, we will consider testing for a log-linear effect of conjugated estrogen dose, cest, on the risk of endometrial cancer. After clearing the previous linear model, one might use the commands

FIT @

FIT cest @

However, since conjugated estrogen dose is not available for all women, the difference in the deviance for these two models is not a valid likelihood ratio test for the estrogen dose response. The null model will be fit using data on all women in the study (315 women in 63 matched sets), but the dose response model will be based upon the data for the 291 women in 59 matched sets with complete dose information. If the fact that estrogen dose is missing is not related to subsequent cancer development, one can construct a valid likelihood ratio test based on an analysis of the subset of women with dose information. The CHECKMISS command can be used to indicate one or more variables to be checked for missing values prior to fitting each model. Records with missing values for any of the variables in the check list are not included in subsequent analyses. Thus, we can use the following commands to fit the null and full models in this example:

CHECK cest @

FIT @

FIT cest @

The output produced by these commands is summarized in below.

Output 5.16 Likelihood ratio test with listwise deletion of records with missing data

CHECK cest @

8 records rejected (in addition to the SELECT)

FIT @

Iter Step Deviance

0 0 159.223

1 0 159.223

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Using records with no missing values for: cest

Stratification on setno with 59 strata

Conditional likelihood used for 59 strata

cases is used for cases

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Log-linear term 0

1 cest..................... 1.710 0.3541 4.83 < 0.001

Records used 307 Unused records 8

Deviance 159.223 Free parameters 1

Informative strata 56 (handled with the conditional likelihood)

Non-Informative strata 3

FIT cest @

Iter Step Deviance

0 0 188.129

1 0 159.498

2 0 159.224

3 0 159.223

4 0 159.223

Conditional-logistic regression

Additive excess relative risk T0 * (1 + T1 + T2 + ...)

Using records with no missing values for: cest

Stratification on setno with 59 strata

Conditional likelihood used for 59 strata

cases is used for cases

Parameter Summary Table

# Name Estimate Std.Err. Test Stat. P value

-- ---------------------------- ---------- --------- ---------- --------

Log-linear term 0

1 cest..................... 1.710 0.3541 4.83 < 0.001

Records used 307 Unused records 8

Deviance 159.223 Free parameters 1

Informative strata 56 (handled with the conditional likelihood)

Non-Informative strata 3

Based on the large change in the deviance, it is clear that there is a dose response, with the risk changing by a factor of 5.5 per unit of coded estrogen dose.

The commands for this example are shown in Listing 5.4 at the end of the chapter.