Beware of Interactions
Parallel trials but not lines
In a previous post I used an example from Chuang-Stein and Tong (1996) to illustrate ambiguities that arise when fitting interactions in a model. The example in question was with a binary outcome and binary covariate. The figure below illustrates a similar situation with a continuous covariate and a continuous outcome using data from Wei and Zhang (2001). Since the fitted lines are not parallel, the treatment effect in the fit depends on the baseline value. If you want to quote a single value, do you want to quote the one corresponding to a baseline value of 0 (leftmost vertical dashed line), or the one that applies to the average baseline value of 11.3 (dashed line further to the right), or some other value?
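The point can be sketched in a few lines of code: once a treatment-by-baseline interaction is in the model, the fitted treatment contrast is a linear function of the baseline value, so any single quoted figure is a choice. The coefficients below are hypothetical, chosen only for illustration; they are not the Wei and Zhang fit.

```python
# Hypothetical coefficients for a model with a treatment-by-baseline
# interaction: effect(x) = b_treat + b_interaction * x
b_treat = 2.0        # made-up coefficient for treatment
b_interaction = 0.5  # made-up treatment-by-baseline coefficient

def treatment_effect(x):
    """Treatment contrast (treated minus control) at baseline value x."""
    return b_treat + b_interaction * x

print(treatment_effect(0.0))   # the effect quoted at baseline 0
print(treatment_effect(11.3))  # the effect quoted at the average baseline
```

The two printed values differ; neither is more "the" treatment effect than the other.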
Surprise packages
To return to the Chuang-Stein and Tong (CST) example, the general message was that what the computer interprets as the main effect depends on the parameterisation, which might vary from package to package. An example was given in which three statisticians would report three very different estimates, standard errors and P-values for a treatment effect depending on the parameterisation. I showed this by parameterising the dummy variable for the covariate (sex) myself. Now, however, I am going to use three packages to illustrate that results can appear to change drastically when applying different packages to the same dataset using the same model.
The power of three
I shall illustrate the analysis of the CST example using R®, SAS® and Genstat®. In each case a binary event (which either occurs or does not occur) is analysed using sex, treatment and their interaction. In each case I have simply copied and pasted the output and circled the P-value for the treatment effect. Just to avert being scolded by the P-value police: this is not because I regard P-values as being particularly important but because it helps to highlight a point.
What R® says
What Genstat® says
What SAS® says
Getting explicit
If you compare the three results you will see that the P-value that R reports is what statistician A quoted in the previous blog. What Genstat® gives you is what statistician B reported, and SAS® gives you what statistician C quoted. (I have cheated slightly with Genstat® in that I have explicitly instructed it in the code to use M, for male, as the reference level for sex. Had I used F, I would have got the same result as R®.)
In case you doubt that the same model is being fitted by the three packages, look at the P-value for the interaction of treatment and sex. In all three cases it is 0.178 to 3 decimal places.
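What the three packages are doing can be made concrete with a small sketch (the counts below are made up, not the CST data). In a logistic model with a treatment-by-sex interaction under reference (dummy) coding, the "treatment" coefficient is simply the log odds ratio within the reference sex, so changing the reference level changes the reported main effect, while a sum-to-zero coding reports the average of the two sex-specific effects.

```python
import math

# Made-up (events, total) counts by (sex, arm) -- for illustration only
counts = {
    ("F", "treat"):   (20, 50), ("F", "control"): (10, 50),
    ("M", "treat"):   (30, 50), ("M", "control"): (25, 50),
}

def log_odds(sex, arm):
    events, n = counts[(sex, arm)]
    return math.log(events / (n - events))

def treat_coefficient(reference_sex):
    """'Main effect' of treatment = log odds ratio within the reference sex."""
    return log_odds(reference_sex, "treat") - log_odds(reference_sex, "control")

print(treat_coefficient("F"))  # what a package taking F as reference reports
print(treat_coefficient("M"))  # what a package taking M as reference reports
# a sum-to-zero (effects) coding would report the average of the two
print((treat_coefficient("F") + treat_coefficient("M")) / 2)
```

Same data, same saturated model, three different "treatment effects": the interaction coefficient (the difference between the two sex-specific log odds ratios) is identical in every parameterisation, which is why all three packages agree on its P-value.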
Isn't this moot?
You could argue that this is all irrelevant: you should not quote a main effect when the interaction is in the model. As I put it in Senn (2000), where I first commented on this example:
It may be argued, of course, that where an interaction is present, the results in individual strata are different, therefore, any overall summary is both arbitrary and misleading and a weighted average is inappropriate. (p544)
However, I then went on to say
There are good reasons, however, for preferring the weighted average. The first is related to a point recognized a long time ago by the Bayesian, Harold Jeffreys (46). It is that we should have more scepticism regarding higher order effects (such as interactions) than we show for main effects. The second is that if the treatment has no effect, it becomes implausible to believe in a treatment-by-factor interaction. The third is the related one that the null hypothesis that the treatment is a placebo carries with it the implication that its effect is identically zero in every stratum. For all these reasons, weighting the estimates as if they were homogenous according to precision is a natural starting point for further exploration. (p544-545)
Note, by the way, that the interaction effect is not significant in this example. However, even if it were, I would be interested in a main-effect summary.
In fact, statisticians commonly fit models in which interactions are implicitly in the model, even though the object is to report a single main effect. For example, a fixed-effects meta-analysis removes the treatment-by-trial interaction from the estimate of the error, since the variance is estimated separately within each trial. The variation of the treatment effect from trial to trial (the treatment-by-trial interaction) can be used to estimate heterogeneity as a second question, but the fixed-effects meta-analysis is of interest in itself, even if other analyses then follow.
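As a sketch of the fixed-effects logic (with hypothetical trial summaries, not real data): the estimate is an inverse-variance weighted average of the within-trial estimates, so only the within-trial variances contribute to its standard error, and between-trial variation never enters.

```python
# Hypothetical per-trial summaries: (treatment-effect estimate, within-trial variance)
trials = [
    (0.30, 0.04),
    (0.50, 0.09),
    (0.10, 0.01),
]

# Weight each trial by the inverse of its within-trial variance
weights = [1.0 / v for _, v in trials]

# Fixed-effect estimate: precision-weighted average of the trial estimates
fixed_effect = sum(w * est for (est, _), w in zip(trials, weights)) / sum(weights)

# Its variance: precisions add across trials; heterogeneity plays no part
fixed_effect_variance = 1.0 / sum(weights)

print(fixed_effect, fixed_effect_variance)
```

Weighting the estimates "as if they were homogeneous according to precision" is exactly this calculation; heterogeneity can then be examined as a separate question.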
In that connection, it might be interesting to look also at the Mantel-Haenszel (MH) analysis of this example. The MH test is one that isolates the 1 degree of freedom from a series of k 2×2 tables corresponding to a 'main effect', leaving the other k−1 to assess heterogeneity. This is what SAS® has to say.
Note that the P-value of 0.009 is much closer to statistician D's solution in the previous blog post than to that of statistician C.
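The 1-degree-of-freedom statistic itself can be sketched directly (the two strata below are made up for illustration, not the CST data): within each 2×2 table one compares the observed count of treated events with its expectation under the null, and pools across strata.

```python
# Each stratum: (a, b, c, d) = (treated events, treated non-events,
#                               control events, control non-events)
# Made-up strata, for illustration only.
strata = [(12, 38, 5, 45), (20, 30, 10, 40)]

num, var = 0.0, 0.0
for a, b, c, d in strata:
    n = a + b + c + d
    num += a - (a + b) * (a + c) / n              # observed minus expected
    var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))

# Mantel-Haenszel chi-square statistic (without continuity correction):
# refer to a chi-square distribution with 1 degree of freedom
cmh = num * num / var
print(cmh)
```

Pooling the observed-minus-expected contributions is what isolates the single 'main effect' degree of freedom; the remaining k−1 degrees of freedom are available to assess heterogeneity.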
My opinion
This is the conclusion I came to in 2000:
Using a null hypothesis that implies that no interaction is present, however, does not commit us to using a statistic that takes no account of interaction. Consider by analogy the common two independent samples t-test. Under the null hypothesis of no difference between groups, a variance estimate treating the two groups as one and using n1 + n2 - 1 degrees of freedom is unbiased for σ². We commonly use instead, however, an estimate pooled from the two groups and using n1 + n2 - 2 degrees of freedom. This does, in fact, lead to a more sensitive test. By the same token, we can choose to fit treatment-by-center (trial) interactions as part of a general strategy for examining the effect of treatment without committing ourselves as to the reality of the interactive effects. (p545)
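The t-test analogy can be checked numerically (with made-up samples): when the groups differ, the one-group variance estimate absorbs the between-group difference, while the pooled estimate removes it.

```python
import statistics

# Made-up samples, for illustration only
g1 = [4.1, 5.0, 6.2, 5.5]
g2 = [5.8, 6.4, 7.0, 6.1]
n1, n2 = len(g1), len(g2)

# Treat the two groups as one: n1 + n2 - 1 degrees of freedom;
# unbiased for sigma^2 only under the null of no group difference
combined = statistics.variance(g1 + g2)

# Pool within-group variances: n1 + n2 - 2 degrees of freedom;
# the between-group difference is removed from the error
pooled = ((n1 - 1) * statistics.variance(g1) +
          (n2 - 1) * statistics.variance(g2)) / (n1 + n2 - 2)

print(combined, pooled)
```

With these samples the combined estimate exceeds the pooled one, because it includes the group difference; that is precisely why the pooled estimate gives the more sensitive test, and why fitting treatment-by-trial interactions can sharpen the error estimate without any commitment to the interactions being real.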
Lessons
References
C. Chuang-Stein and D. M. Tong (1996) The impact of parameterization on the interpretation of the main-effect terms in the presence of an interaction. Drug Information Journal, 421-424.
L. J. Wei and J. Zhang (2001) Analysis of data with imbalance in the baseline outcome variable for randomized clinical trials. Drug Information Journal, 1201-1214.
S. J. Senn (2000) The many modes of meta. Drug Information Journal, 535-549.
J. A. Lewis (1999) Statistical principles for clinical trials (ICH E9): an introductory note on an international guideline. Statistics in Medicine, 1903-1942.