DATAB is a utility program that transforms data, selects subsets, and produces general summary tables from numeric input data. DATAB uses a subset of the EPICURE commands, supplemented with additional commands for table creation.
DATAB has two modes for input data processing: transformation mode and tabulation mode. In transformation mode there is a one-to-one correspondence between the records in the output data set and those in the input data set. In this mode, the program reads each input record, carries out a set of user defined transformations, and adds records (except those deleted as a result of the transformations) to the output data set.
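The record-by-record flow of transformation mode can be illustrated with a short Python sketch. This is not DATAB syntax: the function names, the variable names (`age`, `age_decade`), and the convention that a transformation returns `None` to delete a record are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, float]  # one input record: variable name -> numeric value


def transformation_mode(
    records: List[Record],
    transform: Callable[[Record], Optional[Record]],
) -> List[Record]:
    """Apply `transform` to each input record in turn.

    Each input record yields at most one output record; a record is
    dropped when the transformation returns None (hypothetical stand-in
    for deleting a record during input).
    """
    output = []
    for record in records:
        result = transform(dict(record))  # work on a copy of the input record
        if result is not None:
            output.append(result)
    return output


def example_transform(record: Record) -> Optional[Record]:
    # Hypothetical rule: a negative age marks a record to be deleted.
    if record["age"] < 0:
        return None
    # Hypothetical derived variable: age in completed decades.
    record["age_decade"] = record["age"] // 10
    return record


rows = [{"id": 1, "age": 34}, {"id": 2, "age": -1}, {"id": 3, "age": 61}]
out = transformation_mode(rows, example_transform)
```

Here `out` holds one record for each input record that was not deleted, each carrying the additional derived variable, while the input records themselves are left unchanged.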
In tabulation mode there is a many-to-one (event-count tables) or many-to-many (event-time tables) relationship between the records in the input data set and those in the output data set. In this mode, input records are read and, following any user defined transformations, are used to update a general multivariate, multi-way summary table in which one or more summary variables are classified according to user defined categories.
In the simplest DATAB tables, each cell contains a single summary measure: a count of the number of records in the input data set that fall into the cell. In more complex tables, several summary statistics may be computed for each cell. An important feature of the DATAB tabulation mode is the computation and cross classification of person-years. These person-year tabulations may involve up to three time scales and time dependent categories.
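As a rough illustration of the simplest kind of table, the following Python sketch counts the input records falling into each cell of a two-way cross-classification. It is not DATAB syntax; the function name and the category variables `sex` and `agecat` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def event_count_table(
    records: List[Dict[str, int]], category_vars: List[str]
) -> Counter:
    """Count the records that fall into each cell of the cross-classification.

    A cell is identified by the tuple of category levels of one record,
    so each record contributes to exactly one cell.
    """
    table: Counter = Counter()
    for record in records:
        cell: Tuple[int, ...] = tuple(record[var] for var in category_vars)
        table[cell] += 1
    return table


records = [
    {"sex": 1, "agecat": 2},
    {"sex": 1, "agecat": 2},
    {"sex": 2, "agecat": 1},
]
table = event_count_table(records, ["sex", "agecat"])
```

Three input records collapse into two cells here, which is the many-to-one relationship between input and output records described for event-count tables.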
Regardless of the input mode, once the data have been processed, the user can compute simple summary statistics for all or part of the output data set, use transformations to create additional variables, and save the data either as a text file or in a specially formatted binary save format (BSF) file for use as input to any of the EPICURE programs (including DATAB).
As with all EPICURE programs, NAMES, FORMAT, USE, RECORDS, READ, and INPUT commands are used to describe the format and location of the input data.
The TRAN command may be used to define transformations either before or after the data have been read. Transformations defined before the INPUT commands are stored for use in editing the data as they are read. In this case, the DELETE function may be used to delete selected records. Transformations defined after the INPUT commands are carried out immediately.
The LEVELS command can be used to specify categorical variables either before or after the INPUT command. If this command is used prior to the INPUT command, it is necessary to define the number of levels and range explicitly.
In either transformation or tabulation mode, once the input data have been processed, some or all of the data in the output data set can be saved with either the DATA command (which produces a text file) or the SAVE command (which produces a BSF file). The TRAN command can be used to create new variables. The SUM, MEAN, PLOT, and HIST commands can be used to compute summary statistics or to make simple plots. The SORT and SELECT commands can be used to sort the data or restrict operations to a user defined subset of the data.
As noted above, DATAB can be used to produce a variety of general summary tables. The creation of these tables requires a description of the cross-classification which defines the cells in the table (category variables) and specification of the summary variables to be computed for each cell in the table.
It is useful to distinguish between several types of category variables: fixed category variable values do not change with time; time scales describe the passage of time relative to various origins such as follow-up time, attained age, or the special case of a calendar time scale; and time dependent category variables vary with time. A table can include time dependent category variables only if it has at least one time scale. Tables which include at least one time scale are event-time tables. At a minimum these tables contain a count of the number of observed events and person-time. DATAB tables which do not involve time are event-count tables.
A number of DATAB commands are related to the definition of cross-classifications. The TABULATE command must be given whenever a table is to be produced. The CATEGORY, DURATION, TIME, CALENDAR, and TDEP commands are used to specify category variables and to define categories for fixed category variables, user-defined time scales, standard time scales, a calendar time scale, and time dependent categories, respectively. The ENTRY and EXIT commands are used to indicate the input variables containing entry and exit dates for person-year tables. Finally, the TTRAN command is used to specify the transformations used to define time dependent covariates. These commands are described in Chapter 10, "Category Definition Commands".
In event-count tables, each record in the input data set can contribute to at most one cell in the table. By default, each cell in an event-count table contains a count of the number of records contributing to the cell.
In event-time tables, one record in the input data set may contribute to several cells. For example, in a cohort study where one of the category variables is follow-up time, one individual may be at risk during several of the follow-up time intervals. In this case there are two default summary variables. The first is a count of the number of records ever at risk in the cell. The second is the total time at risk (for example, the number of person-years) for the cell.
For either table type, the user can specify additional summary statistics to be computed for each cell. These statistics can be (weighted) sums, means, or sums of squared deviations of variables in the input data set. For event-time tables, cell specific summary statistics can be computed for all records ever at risk (for example, average age or exposure) or only for those records for which the exit date falls in the interval covered by the cell (for example, the number of deaths or cases of a disease of interest). For event-time tables, it is also possible to compute the number of people entering the study (first at risk) in the interval covered by each cell.
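The way a single record can contribute to several cells of an event-time table, and the two default summary variables, can be sketched in Python as follows. This is an illustration under stated assumptions, not DATAB's algorithm or syntax: the half-open intervals, the treatment of time outside the cutpoints (simply ignored here), and the names `entry`, `exit`, `sex`, `n_at_risk`, and `pyr` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_time(
    entry: float, exit_: float, cutpoints: List[float]
) -> List[Tuple[int, float]]:
    """Return (interval index, time at risk) pairs for follow-up [entry, exit_).

    Interval i is assumed to cover [cutpoints[i], cutpoints[i + 1]).
    Time outside the range of the cutpoints is ignored in this sketch.
    """
    pieces = []
    for i in range(len(cutpoints) - 1):
        start = max(entry, cutpoints[i])
        stop = min(exit_, cutpoints[i + 1])
        if stop > start:
            pieces.append((i, stop - start))
    return pieces


def event_time_table(
    records: List[Dict[str, float]], cutpoints: List[float]
) -> Dict[Tuple[float, int], Dict[str, float]]:
    """Accumulate the two default summaries for each (sex, interval) cell."""
    table = defaultdict(lambda: {"n_at_risk": 0, "pyr": 0.0})
    for rec in records:
        for interval, time_at_risk in split_time(rec["entry"], rec["exit"], cutpoints):
            cell = (rec["sex"], interval)
            table[cell]["n_at_risk"] += 1        # records ever at risk in the cell
            table[cell]["pyr"] += time_at_risk   # total time at risk in the cell
    return dict(table)


cutpoints = [0, 5, 10, 15]  # follow-up time intervals, in years
records = [
    {"sex": 1, "entry": 0.0, "exit": 7.5},
    {"sex": 1, "entry": 2.0, "exit": 4.0},
    {"sex": 2, "entry": 6.0, "exit": 12.0},
]
table = event_time_table(records, cutpoints)
```

The first record is at risk in two follow-up intervals and so updates two cells, while the second record updates only one.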
The DATAB summary variable definition commands include: XSUM, EVENT, XMEAN, and SSD. When these commands are used for event-time tables, the statistic in a cell is updated only for those records whose exit date falls in the cell. The RSUM and RMEAN commands must be used to indicate that summary statistics are to be computed for all records ever at risk in the cell. The FCOUNT command is used to compute the first at risk count for each cell in a person-year table. The syntax and usage of these commands are discussed in "Summary Variable Definition Commands".
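The distinction between statistics updated only in the exit cell, statistics updated in every cell in which a record is at risk, and the first at risk count can be sketched in Python. The sketch mirrors the behavior described above but is not DATAB syntax or its implementation. The names `exit_sum`, `risk_sum`, `first_count`, `death`, and `dose` are hypothetical, and the exit cell is assumed to be the last interval in which the record has positive time at risk.

```python
from collections import defaultdict
from typing import Dict, List


def intervals_at_risk(entry: float, exit_: float, cutpoints: List[float]) -> List[int]:
    """Indices i of intervals [cutpoints[i], cutpoints[i + 1]) overlapping [entry, exit_)."""
    return [
        i
        for i in range(len(cutpoints) - 1)
        if min(exit_, cutpoints[i + 1]) > max(entry, cutpoints[i])
    ]


def summarize(records: List[Dict[str, float]], cutpoints: List[float]):
    table = defaultdict(lambda: {"exit_sum": 0, "risk_sum": 0.0, "first_count": 0})
    for rec in records:
        cells = intervals_at_risk(rec["entry"], rec["exit"], cutpoints)
        if not cells:
            continue  # record contributes to no cell
        for cell in cells:
            # Updated in every cell in which the record is ever at risk.
            table[cell]["risk_sum"] += rec["dose"]
        # Updated only in the cell containing the exit date (assumed: last cell).
        table[cells[-1]]["exit_sum"] += rec["death"]
        # Counted only in the cell in which the record is first at risk.
        table[cells[0]]["first_count"] += 1
    return table


records = [
    {"entry": 0.0, "exit": 7.5, "death": 1, "dose": 2.0},
    {"entry": 2.0, "exit": 4.0, "death": 0, "dose": 1.0},
    {"entry": 6.0, "exit": 12.0, "death": 1, "dose": 0.5},
]
summary = summarize(records, [0, 5, 10, 15])
```

In this example the first record adds its `dose` to the first two intervals but its `death` indicator only to the second, where its follow-up ends.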
The REJECT command is used to create a file containing a user selected subset of the input variables for input records that were not used to update any cell in the table. It is especially useful in ascertaining that complex tables have been properly created. Chapter 12, "Other DATAB Commands," contains information on the REJECT command.
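The idea of collecting records that update no cell can be sketched in Python as follows. This only illustrates the concept and does not reproduce the REJECT command or its output format; the function name, the variable names, and the rule that a record is rejected when its category value is not among the defined levels are assumptions.

```python
from typing import Dict, List, Set, Tuple


def tabulate_with_rejects(
    records: List[Dict[str, int]],
    category_var: str,
    levels: Set[int],
    keep_vars: List[str],
) -> Tuple[Dict[int, int], List[Dict[str, int]]]:
    """Count records per category level and collect the records not tabulated.

    Only the variables named in `keep_vars` are retained for rejected records.
    """
    counts: Dict[int, int] = {}
    rejects: List[Dict[str, int]] = []
    for rec in records:
        level = rec[category_var]
        if level in levels:
            counts[level] = counts.get(level, 0) + 1
        else:
            # The record updates no cell, so keep selected variables for review.
            rejects.append({var: rec[var] for var in keep_vars})
    return counts, rejects


records = [
    {"id": 1, "agecat": 1, "dose": 3},
    {"id": 2, "agecat": 9, "dose": 0},  # 9 is not a defined level
    {"id": 3, "agecat": 2, "dose": 1},
]
counts, rejects = tabulate_with_rejects(records, "agecat", {1, 2, 3}, ["id", "agecat"])
```

Inspecting `rejects` alongside `counts` shows which input records were left out of the table, which is the kind of check this paragraph describes for complex tables.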