The likelihood equations for case-cohort data are taken from the original paper by Prentice (1986), in which he proposed the case-cohort design. From a cohort of size $n$, a subcohort $C$ of size $m$ is obtained by simple random sampling. The failure or censoring time $t_i$, the failure indicator $\delta_i$, and the covariate vector $z_i$ are as defined above. As given by Prentice, the estimates of the relative risk are obtained from the pseudo-likelihood, namely,

$$\tilde{L}(\beta) = \prod_{i=1}^{n} \left[ \frac{r(z_i(t_i);\beta)}{\sum_{k \in \tilde{R}(t_i)} r(z_k(t_i);\beta)} \right]^{\delta_i} \tag{8.4}$$

where $r(\cdot\,;\beta)$ is the relative risk function and $\tilde{R}(t_i)$ is now the set of all members of the risk set who are also in the subcohort sample, together with the case at $t_i$. $\tilde{R}(t_i)$ differs from $R(t_i)$, which was the set of all members of the risk set.
The pseudo-likelihood is the product over all subjects in the cohort; factors in expression 8.4 may differ from one only for cases and members of $C$. This implies that, although only the covariate vectors $z_i$ for cases and subcohort members are directly used in parameter estimation, the full cohort must be followed to determine the failure times $t_i$ and, in particular, the failure indicators $\delta_i$.
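The pseudo-likelihood described above can be sketched in a few lines of Python. This is an illustration only, not the PEANUTS implementation: it assumes an exponential relative risk exp(z·β), a single time-fixed covariate, no late entry, and no tied failure times, and the function and argument names are hypothetical.

```python
import math

def log_pseudo_likelihood(beta, times, events, covariates, in_subcohort):
    """Log pseudo-likelihood for case-cohort data, assuming r = exp(z * beta).

    times[i]        failure or censoring time of subject i
    events[i]       1 if subject i is a case, 0 if censored
    covariates[i]   scalar, time-fixed covariate of subject i
    in_subcohort[i] True if subject i was sampled into the subcohort
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # factors for censored subjects equal one
        # Case-cohort risk set: subcohort members still at risk, plus the case.
        risk_set = [k for k in range(len(times))
                    if times[k] >= t_i and (in_subcohort[k] or k == i)]
        denom = sum(math.exp(covariates[k] * beta) for k in risk_set)
        total += covariates[i] * beta - math.log(denom)
    return total

# Hypothetical data: subject 0 is a case outside the subcohort.
value = log_pseudo_likelihood(
    0.0, [1, 2, 3, 4], [1, 0, 1, 0], [1.0, 0.0, 1.0, 0.0],
    [False, True, True, True])
```

At β = 0 every relative risk equals one, so each case contributes minus the log of the size of its risk set. Covariates of non-cases outside the subcohort never enter the calculation.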
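The maximum pseudo-likelihood estimate, $\hat{\beta}$, for $\beta$ is the value at which the partial derivatives of the log pseudo-likelihood vanish, namely,

$$U(\beta) = \frac{\partial \log \tilde{L}(\beta)}{\partial \beta} = \sum_{i=1}^{n} \delta_i \left[ c_i(t_i) - \frac{\sum_{k \in \tilde{R}(t_i)} b_k(t_i)}{\sum_{k \in \tilde{R}(t_i)} r_k(t_i)} \right] = 0 \tag{8.5}$$

where $r_k(t) = r(z_k(t);\beta)$, $b_k(t) = \partial r_k(t)/\partial\beta$, and $c_k(t) = b_k(t)/r_k(t)$.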
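Using a Taylor series expansion, a consistent estimate of the covariance matrix for $\hat{\beta}$ is given by the so-called “sandwich” estimator $I(\hat{\beta})^{-1} V(\hat{\beta}) I(\hat{\beta})^{-1}$, where $I(\beta)$ is the matrix of mixed partial derivatives of the log pseudo-likelihood and $V(\beta)$ is the covariance matrix of the score vector $U(\beta)$, both evaluated at $\hat{\beta}$.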
In place of the matrix of mixed partial derivatives, PEANUTS uses the more easily computed quantity obtained from expression 3 with the case-cohort risk set $\tilde{R}(t_i)$ replacing the full risk set $R(t_i)$.
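The sandwich form can be sketched numerically as follows. This is a generic illustration in plain Python, not the routine PEANUTS uses: the helper names are hypothetical, and the two input matrices (the information-type matrix and the covariance matrix of the score) are assumed to have been computed elsewhere.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def mat_inv(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def sandwich_covariance(info, score_cov):
    """Return inverse(info) * score_cov * inverse(info)."""
    inv = mat_inv(info)
    return mat_mul(mat_mul(inv, score_cov), inv)

# Hypothetical 2 x 2 inputs for a two-parameter model.
cov = sandwich_covariance([[2.0, 0.0], [0.0, 4.0]], [[2.0, 1.0], [1.0, 4.0]])
```

When the covariance matrix of the score equals the information matrix, the sandwich collapses to the inverse information matrix. The extra covariance terms of the case-cohort design are what make the two differ.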
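The covariance matrix of $U(\beta)$ requires extensive calculations because the score contributions at different failure times are correlated. Writing $U(\beta) = \sum_i U_i$, where $U_i$ is the score contribution of subject $i$, the $(r,s)$ element of $V$ is given by

$$V_{rs} = \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{cov}(U_{ir}, U_{js}) \tag{A.6}$$

Since the pseudo-likelihood conditions on noncensored failure times, we have $U_i = 0$ for all censored failure times, that is, $t_i$ with $\delta_i = 0$.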
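For each noncensored failure time, $t_i$ with $\delta_i = 1$, the $(r,s)$ element of the covariance matrix of $U_i$ is given by the usual expectation of $-\partial^2 \log \tilde{L}_i / \partial\beta_r \partial\beta_s$, where $\tilde{L}_i$ is the $i$th factor of the pseudo-likelihood, namely,

$$\operatorname{cov}(U_{ir}, U_{is}) = \sum_{k \in \tilde{R}(t_i)} p_k\, c_{kr}\, c_{ks} - E_r E_s$$

where

$$p_k = \frac{r_k(t_i)}{\sum_{l \in \tilde{R}(t_i)} r_l(t_i)}$$

and

$$E_r = \sum_{k \in \tilde{R}(t_i)} p_k\, c_{kr}$$

Here $r_k(t_i)$ is the relative risk for subject $k$ at $t_i$ and $c_{kr} = \partial \log r_k(t_i) / \partial\beta_r$.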
Next, let $\Delta_j$ be an indicator that takes the value 1 if the failure at time $t_j$ occurs outside the subcohort and the value 0 otherwise. Suppose $t_i < t_j$ with the $j$th case occurring within the subcohort, $\Delta_j = 0$. Then $\operatorname{cov}(U_i, U_j) = 0$, since $U_i$ is fixed: the risk set at $t_j$, $\tilde{R}(t_j)$, consists only of subcohort members and, conditional on all covariate and censoring histories up to $t_j$, $\tilde{R}(t_i)$ can be fully characterized. With $\Delta_j = 1$, conditioning on covariate and censoring histories and the observed exposures at $t_j$, we cannot know for certain which of the members of $\tilde{R}(t_j)$ is the case and therefore not in the subcohort, and thus we cannot know the composition of $\tilde{R}(t_i)$. This randomness induces the covariance between $U_i$ and $U_j$. Prentice specifies the covariance as


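A similar expression is obtained for $\operatorname{cov}(U_{is}, U_{jr})$. Note that because of differences in the composition of $\tilde{R}(t_i)$ and $\tilde{R}(t_j)$ and possible covariate changes over time, $\operatorname{cov}(U_{ir}, U_{js})$ may not equal $\operatorname{cov}(U_{is}, U_{jr})$.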
Expression 7 can then be rewritten as

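In the case of ties in the failure times of noncensored cases, Prentice suggests modifying the likelihood using the methods given by Peto (1972) and Breslow (1974) for the partial likelihood, namely,

$$\tilde{L}(\beta) = \prod_{i} \frac{\prod_{j \in D_i} r_j(t_i)}{\left[ \sum_{k \in \tilde{R}(t_i)} r_k(t_i) \right]^{d_i}}$$

where the outer product is over the distinct failure times $t_i$, $D_i$ is the set of cases failing at $t_i$, and $d_i$ is the number of ties.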
The subcohort is selected at the start of follow-up. One possible disadvantage of the case-cohort approach occurs when there is substantial censoring; there may be no members of the subcohort to compare with later-occurring cases. Prentice (1986) and others have suggested that this problem can be minimized by sampling one or more additional subcohorts. Because the covariance between scores depends on the sampling, the equations provided above are not valid for all types of repeated subcohort sampling schemes. However, the formulae are valid for the important special cases where the subsequent subcohort is a simple random sample from the remaining cohort; in particular, the formulae are valid when a subsequent subcohort includes all remaining cohort members. Members of a new sample are “entered” through the use of a late entry (or left-truncation) variable with the EPICURE command ENTRY varname @. The estimated baseline survival curve (see below) is valid only if the sampling fraction is the same for all subcohort selections.
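The effect of a late entry variable on risk-set membership can be sketched as follows. This is an illustration rather than EPICURE's implementation: the function and argument names are hypothetical, and it assumes that a subject is at risk on the interval (entry, exit], so that a member of a later subcohort sample enters risk sets only after the time at which that sample was drawn.

```python
def case_cohort_risk_set(t, case_index, entry, exit_, in_subcohort):
    """Indices in the case-cohort risk set at failure time t, with late entry.

    entry[k]        time at which subject k comes under observation
    exit_[k]        failure or censoring time of subject k
    in_subcohort[k] True if subject k belongs to any subcohort sample
    """
    members = []
    for k in range(len(exit_)):
        at_risk = entry[k] < t <= exit_[k]
        # The case always belongs to its own risk set, sampled or not.
        if k == case_index or (in_subcohort[k] and at_risk):
            members.append(k)
    return members

# Hypothetical data: subject 2 is sampled into a second subcohort at time 2.
entry = [0.0, 0.0, 2.0, 2.0]
exit_ = [1.5, 3.0, 5.0, 4.0]
in_subcohort = [True, True, True, False]
early = case_cohort_risk_set(1.5, 0, entry, exit_, in_subcohort)
late = case_cohort_risk_set(4.0, 3, entry, exit_, in_subcohort)
```

In this example the late-entered subject 2 is absent from the risk set at time 1.5 but provides the only comparison for the case at time 4.0, which is the situation additional subcohort sampling is meant to address.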
In his paper, Prentice also suggests a natural estimator for the cumulative baseline hazard function, from which an estimated baseline survival curve can be obtained by exponentiating the negative cumulative hazard. The form of the estimator for the cumulative hazard is the same as for a full cohort, but it is modified by division by the sampling fraction. If a survival plot is requested in EPICURE with the SCURV command, then the sampling fraction must have been provided in a previous CSCHRT fraction @ command.
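One way to read the adjustment described above is sketched below. The sketch assumes that the sum of relative risks over subcohort members at risk is divided by the sampling fraction, which scales it up to an estimate of the full-cohort risk-set sum. It further assumes an exponential relative risk, one time-fixed covariate, and a single subcohort sample. The names are hypothetical and this is not the SCURV algorithm itself.

```python
import math

def baseline_survival(beta, times, events, covariates, in_subcohort, fraction):
    """Return [(t, S0(t))] at each distinct failure time, in increasing order."""
    cum_hazard = 0.0
    curve = []
    failure_times = sorted({t for t, d in zip(times, events) if d})
    for t in failure_times:
        n_cases = sum(1 for u, d in zip(times, events) if d and u == t)
        # Sum of relative risks over subcohort members still at risk at t.
        sub_sum = sum(math.exp(z * beta)
                      for u, z, s in zip(times, covariates, in_subcohort)
                      if s and u >= t)
        if sub_sum == 0.0:
            raise ValueError("no subcohort members at risk at time %r" % (t,))
        # Dividing the subcohort sum by the sampling fraction scales it up
        # to an estimate of the full-cohort risk-set sum.
        cum_hazard += n_cases / (sub_sum / fraction)
        curve.append((t, math.exp(-cum_hazard)))
    return curve

# Hypothetical data with a sampling fraction of one half.
curve = baseline_survival(
    0.0, [1, 2, 3, 4], [1, 0, 1, 0], [0.0] * 4,
    [False, True, True, True], 0.5)
```

The error raised when no subcohort member remains at risk corresponds to the disadvantage of the design under substantial censoring noted earlier.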
The smoothing algorithm for the survival curves is implemented with the SCURV command; adjustments to the curves are made using the subcohort sampling fraction.