GROUP

Purpose

Produces a data set that contains one record for each distinct covariate pattern in the current model.

Programs

GMBO, AMFIT

Syntax

GROUP @

Arguments and Subcommands

None

Remarks

This command “reduces” the data set to the minimal set of records for the current model, that is, one record for each distinct covariate pattern in that model. Some logistic regression programs, e.g., BMDP and LogXact, routinely carry out analyses on the minimal data set. Some authors, e.g., Hosmer and Lemeshow (1989), suggest this approach on the grounds that it reduces bias in the global deviance or likelihood ratio statistic.

Using the minimal data set does not affect parameter estimates or asymptotic standard errors for a given model. However, because the grouping depends on the model, deviance values for two grouped models should not be compared. In general, there is probably little reason to carry out such grouping except possibly for a “final” model for which one is interested in the global deviance or Pearson chi-square statistic. Grouping makes the biggest difference for models that contain categorical variables with a limited number of categories. If the model contains continuous variables, grouping will have little effect on the deviance.
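The effect of grouping on the deviance can be illustrated with a small Python sketch. It is not the program's own calculation; it uses the standard textbook deviance formulas for Bernoulli (ungrouped) and binomial (grouped) data, and the function names, outcomes, and fitted probabilities are made up for illustration. For a single covariate pattern, the two deviances differ by an amount that does not depend on the fitted probability, which is why the estimates are unchanged while the deviance values are not.

```python
import math

def ungrouped_deviance(outcomes, p):
    # Each 0/1 record is its own cell, so the saturated log likelihood is 0
    # and the deviance is -2 times the Bernoulli log likelihood.
    return -2.0 * sum(math.log(p if y == 1 else 1.0 - p) for y in outcomes)

def grouped_deviance(cases, trials, p):
    # Binomial deviance for one covariate pattern with 'cases' out of 'trials'.
    def term(observed, expected):
        return observed * math.log(observed / expected) if observed > 0 else 0.0
    return 2.0 * (term(cases, trials * p) + term(trials - cases, trials * (1.0 - p)))

# Hypothetical data: four records that share one covariate pattern.
outcomes = [1, 0, 0, 1]
cases, trials = sum(outcomes), len(outcomes)

# Difference between the two deviances at two illustrative fitted probabilities.
diff_a = ungrouped_deviance(outcomes, 0.5) - grouped_deviance(cases, trials, 0.5)
diff_b = ungrouped_deviance(outcomes, 0.4) - grouped_deviance(cases, trials, 0.4)
print(diff_a, diff_b)  # the two differences agree up to rounding
```

Because the difference is constant in the fitted probability, maximizing either likelihood gives the same estimate; only the reported deviance changes.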

This command creates three new variables, %GRP, %GRPN and %GRPC. The first of these contains the group number, which is a sequential number assigned to the distinct covariate patterns. The ordering of these values is determined by the order of the data. The %GRPN variable contains the total number of trials (GMBO) or person years (AMFIT) associated with a covariate pattern. This variable has a positive value for the first occurrence of a covariate pattern and is set equal to zero for all other records with the same covariate pattern. The %GRPC variable contains the total number of cases for a covariate pattern. This variable is set equal to zero for all records other than the record corresponding to the first occurrence of a covariate pattern.
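The bookkeeping described above can be sketched in Python as follows. This is an illustration only, not the program's implementation: the function name, the record layout, and the keys GRP, GRPN, and GRPC (standing in for %GRP, %GRPN, and %GRPC) are assumptions, and the four records are invented.

```python
def group_records(records, covariates, trials_key="n", cases_key="cases"):
    """Return copies of the records with GRP, GRPN and GRPC added."""
    # First pass: number the patterns in data order and accumulate totals.
    totals = {}  # pattern -> [group number, total trials, total cases]
    for rec in records:
        pattern = tuple(rec[c] for c in covariates)
        if pattern not in totals:
            totals[pattern] = [len(totals) + 1, 0, 0]
        totals[pattern][1] += rec[trials_key]
        totals[pattern][2] += rec[cases_key]

    # Second pass: totals go on the first record of each pattern, zero elsewhere.
    seen = set()
    out = []
    for rec in records:
        pattern = tuple(rec[c] for c in covariates)
        grp, n_total, case_total = totals[pattern]
        first = pattern not in seen
        seen.add(pattern)
        new = dict(rec)
        new["GRP"] = grp
        new["GRPN"] = n_total if first else 0
        new["GRPC"] = case_total if first else 0
        out.append(new)
    return out

# Hypothetical ungrouped data: one trial per record.
records = [
    {"sex": 1, "city": 1, "agegrp": 2, "n": 1, "cases": 1},
    {"sex": 2, "city": 1, "agegrp": 2, "n": 1, "cases": 0},
    {"sex": 1, "city": 1, "agegrp": 2, "n": 1, "cases": 0},
    {"sex": 1, "city": 1, "agegrp": 2, "n": 1, "cases": 1},
]
grouped = group_records(records, ["sex", "city", "agegrp"])

# The number of records is unchanged; the reduced data set is the subset
# of records with a positive GRPN.
reduced = [r for r in grouped if r["GRPN"] > 0]
```

Here the first and second records start new patterns, so they carry the totals, while the third and fourth records repeat the first pattern and receive zeros.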

This command must be used after a model has been specified. If you want to fit the model to the grouped data, you must specify the trials (GMBO) or person years (AMFIT) variable to be %GRPN and the cases variable to be %GRPC, and then refit the model. The reduction does not change the number of records in the data set. Rather, for each set of records with the same covariate pattern, the new variables are defined as missing for all records except the first record with this pattern. You can write the reduced data set to disk by selecting records with positive values for %GRPN and using the DATA command to write the (model-dependent) relevant variables for the selected records.

The GROUP command is used in the BRTHWGT.GBO example.

Example

A logistic regression model with three variables, sex, city, and agegrp (age group), is fit to an ungrouped data set. The data are then grouped, and the same model is fit to the grouped data. Finally, the grouped data are written to a BSF file called GROUPED.BSF.

FIT sex city agegrp @

GROUP @

CASES %grpc @

N %grpn @

FIT @

SELECT %GRPN > 0 @

SAVE %GRPN %GRPC sex city agegrp ;

TO grouped @