This section contains a series of short examples that demonstrate a number of features of DATAB not used in the previous examples in this chapter. The topics considered include weighted mean and first-at-risk (FCOUNT) summary variables; time-dependent transformations and time-dependent category variables; adding expected rates to a table; making event-time tables with a user-defined time scale; and collapsing tables to produce marginal subtables.
Example 8.8 Weighted Mean and FCOUNT Summary Variables
The examples in this chapter have involved event-count and mean summary variables. DATAB has several other summary variable types that are useful in some situations. First, as noted above, mean summaries in event-count tables are unweighted, while those in event-time tables are weighted by the person-years. These default weights can be changed using the WEIGHT subcommand. For risk-updated means, the weight variable must be a previously specified summary variable defined with an EVENT, RSUM, or FCOUNT command. For the less commonly used exit-updated means, the weight variable must be specified and it must be an event-count or other exit-updated sum. For example, in addition to the average dose in each cell, we might want to compute the mean dose for cases. The following set of summary variable definition commands defines an event-count variable named cases as the sum of the brstat variable in each cell, the usual person-year weighted mean dose (dose), and the mean dose for cases in the cell (cdose):
EVENT brstat AS cases @
RMEAN dose @ ! Person-year weighted mean dose
RMEAN dose WEIGHT cases AS cdose @ ! Case weighted dose
It is important to remember that for cells in which there are no cases the value of cdose will be 0.
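The difference between the two means can be sketched outside of EPICURE. The following Python fragment is only an illustration of the arithmetic, not of how DATAB is implemented; the records and the weighted_mean helper are hypothetical, and the zero-weight rule simply mirrors the note above about cells with no cases.

```python
def weighted_mean(values, weights):
    """Weighted mean of values; returns 0.0 when the total weight is zero."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical records falling in one cell: (dose, person-years, case indicator)
records = [(0.5, 10.0, 0), (2.0, 5.0, 1), (1.0, 5.0, 0)]
doses = [r[0] for r in records]
pyr = [r[1] for r in records]
cases = [r[2] for r in records]

dose_pyr = weighted_mean(doses, pyr)   # person-year weighted mean dose
cdose = weighted_mean(doses, cases)    # case-weighted mean dose

# A cell with no cases has zero total weight, so the case-weighted mean is 0.
cdose_no_cases = weighted_mean([0.5, 1.0], [0, 0])
```

In this made-up cell the person-year weighted mean is pulled toward the low-dose record with the most follow-up, while the case-weighted mean reflects only the record that is a case.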
In event-time tables, it can be useful to obtain a count of the number of people entering the study in each time period. Such a summary variable can be obtained using the FCOUNT command. The default name for the summary variable produced by this command is fatrisk. The AS subcommand can be used to give this variable a different name. The following series of commands defines an FCOUNT summary variable and requests the computation of person-year and first-at-risk weighted dose means:
FCOUNT @
EVENT brstat AS cases @
RMEAN dose @
RMEAN dose; WEIGHT fatrisk AS wdose @
Once a table has been created, the sum of fatrisk is equal to the number of people who were ever at risk. If an fatrisk summary variable had been included in the multi-time scale event-time table for the fluoroscopy data produced in the last example of the previous section, the sum of this variable would be 1,630. This number can be obtained using the SUM fatrisk @ command.
In a table that includes the summary variables defined above (after the INPUT command), the commands
MEAN dose ; WEIGHT pyr @
MEAN wdose ; WEIGHT fatrisk @
can be used, after the table has been created, to compute the person-year weighted mean dose and the (unweighted!) mean dose for all study participants.
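To see why weighting the cell-level wdose values by fatrisk gives an unweighted mean over participants, consider the following Python sketch. The cells and doses are hypothetical, and the code only illustrates the arithmetic under the assumption that each person is counted in exactly one cell (the cell in which he or she first comes under observation).

```python
# Hypothetical doses of the people who first come at risk in each cell.
entry_doses_by_cell = [
    [0.5, 1.0, 1.5],  # three people enter in this cell
    [3.0],            # one person enters in this cell
    [],               # nobody enters in this cell
]

# Cell-level summaries: fatrisk is the number entering, wdose their mean dose.
cells = []
for doses in entry_doses_by_cell:
    fatrisk = len(doses)
    wdose = sum(doses) / fatrisk if fatrisk else 0.0
    cells.append({"fatrisk": fatrisk, "wdose": wdose})

# Mean of wdose over cells, weighted by fatrisk.
n_people = sum(c["fatrisk"] for c in cells)
table_mean = sum(c["fatrisk"] * c["wdose"] for c in cells) / n_people

# Plain (unweighted) mean over all individual participants.
all_doses = [d for doses in entry_doses_by_cell for d in doses]
person_mean = sum(all_doses) / len(all_doses)
```

Because each person contributes to fatrisk exactly once, table_mean and person_mean agree, and n_people equals the number of people who were ever at risk.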
Example 8.9 Using the GETRATES Command to Add Expected Rates to an Event-Time Table
In standardized mortality ratio (SMR) or standardized incidence ratio (SIR) studies, it is necessary to include information about expected rates or expected numbers of events for each cell in an event-time table. The GETRATES command can be used to add rates to an existing table. This command reads specially formatted rate files and uses index variables to associate rates in these files with cells in the table. Two index variables, which we will call year and age, reference specific rate tables. Rate tables may also be stratified by sex and another stratification variable, for example, race. A single rate file can contain rates for an arbitrary number of causes for one or two sexes. It is also possible to have a second stratification variable with two levels, for example, white and non-white groups in the U.S. mortality rates. In this case, the program reads two rate files with identical structure (number of sexes, year categories, and age categories). The “rates” for the last cause in a table are assumed to be weights that can be used for combining sexes or strata as requested by the user. EPICURE includes several sets of rate files, including U.S. white and nonwhite cause-specific mortality rates for the period 1925 through 1985 and cancer incidence rates from the Connecticut Tumor Registry. Causes within a rate file are identified by an integer identifier. Details about the rate file format and the specific rate files included with EPICURE are given in EPICURE Rate File Format of the EPICURE Command Summary and in files on the EPICURE distribution disks.
In this example we will continue from previous examples and add breast cancer rates to a modified version of the event-time table of Example 8.7. The (sex and stratum-specific) rate table for a given cause is accessed using indices based on year and age variables in the table. The age summary variable in this table can be used to determine the age index, but the original table does not include a year variable. We will create this variable using a time-dependent transformation defined with the TTRAN command. The code to define this variable is
TTRAN year = CYEAR(time) + CMONTH(time)/12 + 0.5 @
This transformation makes use of two of the EPICURE date functions to compute year and month values from the time dynamic variable created by the CALENDAR command. This transformation statement should appear after the CALENDAR command (since it makes use of the time variable defined in that command) and before the INPUT command. DATAB transformations defined with the TTRAN command are evaluated using current values of all dynamic variables each time a cell is updated. Thus, this transformation computes the approximate fractional year at the midpoint of the current time period. Variables defined with TTRAN transformations can be used to define time-dependent categories (with the TDEP command) or, as in this example, time-dependent summary variables.
The year variable will be used for computation of a time-dependent summary variable with the command
RMEAN year @
This summary variable will contain the average year value for each cell in the table.
The GETRATES command is used after the table has been created, that is, after the INPUT command. To add rates, we must specify the rate table(s) to be used, the index variables in our table that determine which rates are to be used, the cause of interest, and the name of a variable to contain the rates. Rates are only added if all of the necessary parameters are specified, and, as will be seen, the command can be repeated several times to specify these parameters. The GETRATES command is the most complex EPICURE command, and although we will discuss several aspects of its usage here, you should consult the description in the EPICURE Command Summary (GETRATES) before using the command with your data.
We will begin by specifying the rate file to be used and asking for a summary of the cause codes in the table. The rate file to be used here is the Connecticut incidence rate file. We can specify this file directly with the FILE1 subcommand, but a special subcommand, USINC, can be used to set up this file. (The U.S. white and nonwhite rate files are used by default, but, if necessary, these files can also be selected with the USMORT subcommand.) Rate files are expected to be in a subdirectory whose default name is RATES. The RPATH subcommand can be used to indicate that rate files are in a different directory. If RPATH is used, you should supply the complete path. All rate files must be in the same directory.
As noted above, it is necessary to specify the cause code to add rates to the file. However, at this point we do not know the cause codes in our file. The LISTCAUSE subcommand can be used to obtain a complete list of the cause codes in a rate file.
So we begin with the command
GETRATES RPATH ..\exrates\ USINC LISTCAUSE @
which selects the Connecticut rate file and prints a list of the causes and cause codes in this file. The output produced by this command is shown below.
Output 8.11 GETRATES command output with cause code list
GETRATES RPATH ..\exrates\ USINC LISTCAUSE @
Index variables Age: age Year: year
Sex: ?????
Cause code: ????? Save as: ?????
Rate file 1: ..\exrates\ct_inc89.oxr
Disease Codes:
1) ALL MALIGNANT 2) ALL BUCCAL 3) LIP 4) TONGUE
5) SALIVARY 6) GUM,OTH MOUTH 7) OROPHARYNX 8) NASOPHARYNX
9) HYPOPHARYNX 10) ALL DIGESTIVE 11) ESOPHAGUS 12) STOMACH
13) SMALL INTESTIN 14) COLON 15) RECTUM 16) LIVER/GALLBLAD
17) PANCREAS 18) ALL RESPIRATOR 19) NASAL CAVITIES 20) LARYNX
21) LUNG 22) FEMALE BREAST 23) MALE BREAST 24) FEMALE GENITAL
25) CERVIX 26) CORP & UTER NO 27) OVARY, INC TUB 28) OVARY, EXC TUB
29) PROSTATE 30) TESTIS 31) KIDNEY 32) BLADDER
33) MELANOMA 34) EYE 35) BRAIN & CNS 36) THYROID
37) ENDO W/ THYMUS 38) ENDO W/O THYMU 39) BONE 40) CONNECTIVE
41) LYMPHAT & HEMA 42) N-H LYMPHOMA 43) HODGKIN'S 44) MULT. MYELOMA
45) ALL LEUKEMIA 46) ACUTE LYMPH LE 47) CLL 48) LYMPH LEUK NOS
49) ANLL 50) ACU MYELOID LE 51) CHR MYELOID LE 52) MYELOID LEUK N
53) CT WHITE POPS
The output indicates that year will be used to determine the year and age will be used to determine age when accessing the rate table. These are default names and can easily be changed by the user. We also see that the sex-specific rates will be combined. The question marks (?????) in the Cause code and Save as fields indicate that neither of these items has been given. For the command to execute, it is necessary to specify all of the items shown. Since there is only a single rate file for Connecticut, no information about strata is shown and it is not necessary to provide information about stratification.
The cause code list indicates that breast cancer rates are cause number 22 in this table. Note that the last “cause” is population sizes. These sizes are used as weights when rates are combined across sexes or strata.
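As a rough illustration of what using population sizes as weights means, the Python sketch below combines two sex-specific rates into a single rate by a population-weighted average. The numbers and the combine_rates helper are hypothetical, and the exact combination rule used by GETRATES is an assumption here; the Command Summary should be consulted for the details.

```python
def combine_rates(rates, populations):
    """Population-weighted average of stratum-specific rates."""
    total_pop = sum(populations)
    return sum(r * p for r, p in zip(rates, populations)) / total_pop


# Hypothetical sex-specific rates and population sizes for one age/year cell.
male_rate, female_rate = 2.0, 4.0
male_pop, female_pop = 1000.0, 3000.0

combined = combine_rates([male_rate, female_rate], [male_pop, female_pop])
```

With these made-up values the combined rate lies closer to the female rate because the female population is three times as large.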
Our table includes women only, so we need to indicate that the female rates are to be used. We could do this by defining a dummy variable that takes on the value 2 for all records and use this as the sex index variable, but in this example we will use the USE F subcommand to indicate that only female rates are to be used. The following command is used to select female rates and to specify breast cancer rates:
GETRATES CAUSE 22 USE F AS ctbrrt @
The AS ctbrrt subcommand indicates that the expected rates will be stored in a variable called ctbrrt.
The output produced by this command is shown below.
Output 8.12 GETRATES command output after cause specification
GETRATES CAUSE 22 USE F AS ctbrrt@
Index variables Age: age Year: year
Sex: Use only females
Cause code: 22 Save as: ctbrrt
Rate file 1: ..\exrates\ct_inc89.oxr
Note that the options from the first command have been retained. The only option that is not retained from one use to the next is the name of the variable in which the rates are to be saved.
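Conceptually, adding rates amounts to a table lookup in which the age and year values in each cell select an entry of the rate table for the chosen cause and sex. The Python sketch below illustrates that idea only. The category bounds, the rates, and the lookup_rate function are all hypothetical; they are not taken from any EPICURE rate file and do not describe the actual rate file format.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the age and year categories of a rate table.
age_lower = [0, 20, 40, 60]
year_lower = [1950, 1960, 1970]

# Hypothetical rates for one cause and one sex:
# rows are age categories, columns are year categories.
rates = [
    [0.1, 0.1, 0.2],
    [0.3, 0.4, 0.5],
    [1.0, 1.2, 1.4],
    [2.0, 2.3, 2.6],
]


def lookup_rate(age, year):
    """Return the rate for the categories containing the given age and year."""
    a = bisect_right(age_lower, age) - 1
    y = bisect_right(year_lower, year) - 1
    if a < 0 or y < 0:
        raise ValueError("age or year is below the range of the rate table")
    return rates[a][y]


# A cell whose mean age is 45.2 and whose mean year is 1963.7.
cell_rate = lookup_rate(45.2, 1963.7)
```

Here the cell falls in the third age category and the second year category, so the lookup returns the corresponding entry of the hypothetical table.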
We are now ready to do the computations.
To compute the expected number of events in each cell, we need to multiply the women-years by the expected rate in each cell. However, the rates in the rate table are not rates per woman-year; so to compute the expected values, we need to know the rate scaling factor for the rates. This value is read from the rate file header and stored in a named constant called #_rtfact. We can use the command
SHOW #_rtfact @
to print the value of the scale factor. The output following this command (not shown) indicates that #_rtfact equals 1,000. This means that the rates in the table are rates per thousand women-years. Knowing this, we can easily compute the expected number of events in each cell. We could use the command
TRAN expctd = pyr * ctbrrt / 1000 @
but it might be better to use the equivalent transformation,
TRAN expctd = pyr * ctbrrt / #_rtfact @
which uses the scale factor named constant directly.
Once we have computed the expected values, we can use the command
SUM cases expctd @
to compare the numbers of observed and expected cases in this cohort. The result of this command (not shown) indicates that 75 cases were observed and 45.9 were expected. This suggests either that Connecticut and Massachusetts breast cancer rates are quite different or that radiation has some effect on breast cancer incidence in this population.
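The arithmetic behind the expected counts and the observed-to-expected comparison can be sketched in Python as follows. The per-cell values are hypothetical; only the scale factor of 1,000 and the totals of 75 observed and 45.9 expected cases come from the text above.

```python
def expected_cases(pyr, rate, scale):
    """Expected number of events: person-time times rate, divided by the rate scale."""
    return pyr * rate / scale


rate_scale = 1000.0  # value reported for #_rtfact in the text

# Hypothetical cells: (women-years, rate per 1,000 women-years)
cells = [(1200.0, 0.5), (800.0, 1.5)]
expected = [expected_cases(pyr, rate, rate_scale) for pyr, rate in cells]

# Totals quoted in the text for the whole table.
observed_total = 75
expected_total = 45.9
ratio = observed_total / expected_total  # observed-to-expected ratio
```

For the quoted totals the ratio is about 1.63, that is, roughly 63 percent more cases were observed than the Connecticut rates would predict.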
We conclude this example by saving the table in a BSF file called TABTBF.BSF and writing the data to a text file called EXAM8.9.OUT. The commands to do this are
SAVE ; TO tabtbf @
DATA ; TO exam8.9.out
FORMAT (6f7.0,f10.3,2f5.0,3f8.2, f9.2,g12.4,f8.3) HEADER @
This BSF file will be used in the next example. The commands used in this example are included in Listing 8.2 at the end of this chapter.
Example 8.10 Making Marginal Tables
On occasion one might want to collapse a table over one or more dimensions or combine categories on some dimensions. While this can be done directly using a modified set of commands to process the original data, it is usually easier to start from a data set that contains the uncollapsed table. This is especially easy when the initial table is stored in a BSF file created by DATAB at the time the table was created. (It is important that the table be saved when it is created).
For this example we will collapse the table created and saved in the previous example over calendar time. We begin by indicating that we want to use the TABTBF.BSF file and use the SHOW command to obtain a detailed description of the variables in this file. The commands are
USE ./tabtbf
SHOW *
The * option requests detailed information about all variables. The output produced by these commands is shown in Output 8.13.
Output 8.13 Detailed variable list from SHOW * command
SHOW *
aftcat : Categorical with 6 levels (1,6) Virtual
DCAT : Categorical with 7 levels (1,7) Virtual
agecat : Categorical with 6 levels (1,6) Virtual
time : Categorical with 5 levels (1,5) Virtual
AT_RISK : Continuous, Summary (acount)
PYR : Continuous, Summary (sum)
FATRISK : Continuous, Summary (sum)
cases : Continuous, Summary (sum)
aft : Continuous, Summary (mean)
dose : Continuous, Summary (mean)
age : Continuous, Summary (mean)
year : Continuous, Summary (mean)
ctbrrt : Continuous
expctd : Continuous
%ORD : Continuous Sortorder
%SEL : Continuous Selection
16 variables used. At least 500 additional variables can be created.
A brief description of the BSF file follows the USE command. The variable list produced by the SHOW command indicates that the first four variables are the cell index variables. These variables are categorical, and the summary includes information on the number of categories for each of the index variables. Table index variables are always labeled “virtual” and will always be the first variables in the list. The table index variables are followed by the summary variables defined at the time the table was created. With the exception of the %cellno summary, which is used to determine the cell index values, the summary variable type is indicated. The last variables in the list were created after the table had been produced.
The simplest way to collapse this table uses the following two commands:
COLLAPSE OVER time @
SUMMARY @
The COLLAPSE command specifies the index variable over which the table is to be collapsed. Although not done here, the COLLAPSE command could be followed by CATEGORY commands to redefine categories. More than one index variable can follow the OVER subcommand. This COLLAPSE command has the same effect as the command
TABULATE OVER aftcat dcat agecat @
The SUMMARY command indicates which of the original summary variables are to be included in the new table. In this example we will keep all of the summary variables from the original table. It is also possible to supply a list of summary variables for the new table.
The SUMMARY command can only be used for summary variables in the input table. The usual summary variable definition commands must be used to create summary variables from nonsummary variables, like ctbrrt and expctd in this table. For this example we use the commands
RMEAN ctbrrt WEIGHT pyr @
XSUM expctd @
to create the desired summaries. Since the table does not involve time scales, the RSUM command is equivalent to XSUM in this case. The explicit use of pyr as a weight variable for the ctbrrt mean is necessary to obtain the correct rates in the collapsed table.
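The need for the pyr weight can be checked with a small calculation. The Python sketch below collapses two hypothetical cells that differ only in their time category; the values are made up, and the scale factor of 1,000 is the one reported in the previous example.

```python
scale = 1000.0  # rate scale factor reported in the previous example

# Hypothetical cells with the same aftcat, dcat, and agecat but different time.
cells = [
    {"pyr": 100.0, "ctbrrt": 1.0},
    {"pyr": 300.0, "ctbrrt": 2.0},
]
for c in cells:
    c["expctd"] = c["pyr"] * c["ctbrrt"] / scale

# Collapsing: person-years and expected counts are summed over the cells.
pyr_total = sum(c["pyr"] for c in cells)
expctd_total = sum(c["expctd"] for c in cells)

# Rate in the collapsed cell, with and without person-year weights.
rate_weighted = sum(c["pyr"] * c["ctbrrt"] for c in cells) / pyr_total
rate_unweighted = sum(c["ctbrrt"] for c in cells) / len(cells)

# Only the weighted rate reproduces the summed expected count.
check_weighted = pyr_total * rate_weighted / scale
check_unweighted = pyr_total * rate_unweighted / scale
```

With the person-year weights, the collapsed rate multiplied by the collapsed person-years returns the summed expected count; the unweighted mean of the rates does not.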
We are now ready to create the new table. This is done with the command
INPUT @
It is not necessary to give the file name because it was specified in the USE command. Output 8.14 shows the description of the resulting table.
Output 8.14 Description of marginal table
Input from ./tabtbf.bsf
Description of table:
Variable Category Lower Upper
Number Name Number Name Bound Bound
1 aftcat
1 1 [ 1.000 1.000 ]
2 2 [ 2.000 2.000 ]
3 3 [ 3.000 3.000 ]
4 4 [ 4.000 4.000 ]
5 5 [ 5.000 5.000 ]
6 6 [ 6.000 6.000 ]
2 DCAT
1 1 [ 1.000 1.000 ]
2 2 [ 2.000 2.000 ]
3 3 [ 3.000 3.000 ]
4 4 [ 4.000 4.000 ]
5 5 [ 5.000 5.000 ]
6 6 [ 6.000 6.000 ]
7 7 [ 7.000 7.000 ]
3 agecat
1 1 [ 1.000 1.000 ]
2 2 [ 2.000 2.000 ]
3 3 [ 3.000 3.000 ]
4 4 [ 4.000 4.000 ]
5 5 [ 5.000 5.000 ]
6 6 [ 6.000 6.000 ]
Summary Variables
1) %COUNT 2) PYR 3) FATRISK 4) cases 5) aft
6) dose 7) age 8) year 9) ctbrrt 10) expctd
The potential number of cells is 252
At least 500 additional variables can be created.
Records read -- 475 Records used -- 475
This table contains data in 166 of the 252 potential cells
15 variables defined At least 500 additional variables can be created.
Note that the at_risk summary variable has been replaced with a %count summary variable. This was done because at_risk would be meaningless in the collapsed table. Although this table includes several additional summary variables, the number of cells and the basic information in each cell (women-years, case counts, mean dose, and so on) are the same as in the table created in Example 8.5.
Example 8.11 Making an Event-Time Table with Non-Calendar Time Scales
Studies such as clinical trials or animal studies often involve only a single time scale that is not necessarily directly connected to calendar time. In such cases the basic data include a time-on-study variable, a case indicator, and other covariates. DATAB can be used to compute event-time tables for such data. This is done using the DURATION command.
In this example we will use data on survival and white blood count for two groups of leukemia patients (Feigl and Zelen, 1965). The variables in the file are:
Variable    Description
wbc         white blood count
time        time to death (in weeks)
ag          AG status (1 = positive, 0 = negative)
The data can be read without a format specification.
The table we want to make will stratify the person-time by wbc and ag categories. The following commands describe the input data, compute several auxiliary variables, and define the fixed categories in the table:
NAMES wbc time ag @
LEVELS ag 0-1 @
TRAN lwbc = LOG(wbc) ; cases = 1 @
TABULATE OVER ag @
CATEGORY wbc AS wbccat
0 / 2500 / 5000 / 7500 / 10000 / 20000 / 30000 / 50000 / 250000
@
These commands should be quite familiar by now. The only thing that is unusual thus far is the computation of the cases variable. Since all of the patients died, the data set contained no case indicator, but to obtain the case count in the table we will need such a variable.
The next few commands define the time categories, specify the exit-time variable, define the summary variables, and construct the table. The commands are
DURATION tcat
0 / 5 / 10 / 15 / 20 / 25 / 50 / 75 / 100 / 200
@
EXIT time @
FCOUNT @
EVENT cases @
RMEAN wbc @
RMEAN tcat AS time @
RMEAN lwbc @
INPUT ../exdata/fglzln.dat @
The DURATION command specifies time categories. A duration-time table can have only a single time scale. The variable name for this command is used as the name of a dynamic variable in the input data set and as the name of the time period index variable in the table. This name must differ from the name of the exit-time variable. The EXIT command takes a single argument, the name of the variable that contains the exit time. We do not need an entry time in this example because everyone started at time 0. The summary variables include a first-at-risk count, an event count, cell means for current time (computed using the dynamic variable tcat), wbc, and its logarithm. Output 8.15 shows the description of this table.
Output 8.15 Description of a duration table
Input from ../exdata/fglzln.dat
Description of table:
Variable Category Lower Upper
Number Name Number Name Bound Bound
1 ag
1 0 [ 0.000 0.000 ]
2 1 [ 1.000 1.000 ]
2 wbccat
1 0 - 2500 [ 0.000 2500. )
2 2500 - 5000 [ 2500. 5000. )
3 5000 - 7500 [ 5000. 7500. )
4 7500 - 10000 [ 7500. 1.000e+04 )
5 10000 - 20000 [ 1.000e+04 2.000e+04 )
6 20000 - 30000 [ 2.000e+04 3.000e+04 )
7 30000 - 50000 [ 3.000e+04 5.000e+04 )
8 50000 - 250000 [ 5.000e+04 2.500e+05 )
3 tcat
1 0 - 5 [ 0.000 5.000 )
2 5 - 10 [ 5.000 10.00 )
3 10 - 15 [ 10.00 15.00 )
4 15 - 20 [ 15.00 20.00 )
5 20 - 25 [ 20.00 25.00 )
6 25 - 50 [ 25.00 50.00 )
7 50 - 75 [ 50.00 75.00 )
8 75 - 100 [ 75.00 100.0 )
9 100 - 200 [ 100.0 200.0 )
Summary Variables
1) AT_RISK 2) PYR 3) FATRISK 4) cases 5) wbc
6) time 7) lwbc
The potential number of cells is 144
At least 500 additional variables can be created.
Records read -- 33 Records used -- 33
This table contains data in 84 of the 144 potential cells
12 variables defined At least 500 additional variables can be created.
This should be pretty familiar by now. Note that the pyr summary variable does not contain person-years; given the time unit, it contains person-weeks. It is also worth noting that we have “reduced” the data on 33 people to 84 records. Using this table in AMFIT, it is possible to obtain likelihood ratio tests, parameter estimates, and standard errors virtually identical to those reported by Cox and Oakes (1984, pp. 99-101) based on a proportional hazards analysis of these data. We will leave this as an exercise for the interested user. Have fun.
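The reason the table has more records than there are patients is that each patient's follow-up is divided among the tcat intervals he or she passes through. The Python sketch below shows this allocation for a single hypothetical patient. The split_time function is an illustration of the idea, not DATAB's algorithm; the cutpoints are those given in the DURATION command above.

```python
def split_time(exit_time, cutpoints, entry_time=0.0):
    """Time at risk in each interval [cutpoints[i], cutpoints[i+1])."""
    pieces = []
    for lower, upper in zip(cutpoints, cutpoints[1:]):
        overlap = min(exit_time, upper) - max(entry_time, lower)
        pieces.append(max(0.0, overlap))
    return pieces


# Cutpoints from the DURATION command in this example (weeks).
cutpoints = [0, 5, 10, 15, 20, 25, 50, 75, 100, 200]

# A hypothetical patient who dies at 17 weeks.
weeks = split_time(17, cutpoints)
```

This patient contributes 5 weeks to each of the first three intervals and 2 weeks to the 15 - 20 interval, so a single input record adds person-weeks to four cells of the table.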