Simple Linear Regression Analysis
Regression analysis is a statistical technique that attempts to explore and model the relationship between two or more variables. For example, an analyst may want to know if there is a relationship between road accidents and the age of the driver. Regression analysis forms an important part of the statistical analysis of the data obtained from designed experiments and is discussed briefly in this chapter. Every experiment analyzed in DOE++ includes regression results for each of the responses. These results, along with the results from the analysis of variance (explained in the One Factor Designs and General Full Factorial Designs chapters), provide information that is useful to identify significant factors in an experiment and explore the nature of the relationship between these factors and the response. Regression analysis forms the basis for all DOE++ calculations related to the sum of squares used in the analysis of variance. The reason for this is explained in Appendix B. Additionally, DOE++ also includes a regression tool to see if two or more variables are related, and to explore the nature of the relationship between them.
This chapter discusses simple linear regression analysis while a subsequent chapter focuses on multiple linear regression analysis.
Simple Linear Regression Analysis
A linear regression model attempts to explain the relationship between two or more variables using a straight line. Consider the data obtained from a chemical process where the yield of the process is thought to be related to the reaction temperature (see the table below).
This data can be entered in DOE++ as shown in the following figure:
A scatter plot can then be obtained as shown in the following figure. In the scatter plot, the yield, $y_i$, is plotted for different temperature values, $x_i$.
It is clear that no line can be found to pass through all points of the plot. Thus no functional relation exists between the two variables $x$ and $Y$. However, the scatter plot does give an indication that a straight line may exist such that all the points on the plot are scattered randomly around this line. A statistical relation is said to exist in this case. The statistical relation between $x$ and $Y$ may be expressed as follows:
$$Y = \beta_0 + \beta_1 x + \epsilon$$
The above equation is the linear regression model that can be used to explain the relation between $x$ and $Y$ that is seen on the scatter plot above. In this model, the mean value of $Y$ (abbreviated as $E(Y)$) is assumed to follow the linear relation:
$$E(Y) = \beta_0 + \beta_1 x$$
The actual values of $Y$ (which are observed as yield from the chemical process from time to time and are random in nature) are assumed to be the sum of the mean value, $E(Y)$, and a random error term, $\epsilon$:
$$Y = E(Y) + \epsilon = \beta_0 + \beta_1 x + \epsilon$$
The regression model here is called a simple linear regression model because there is just one independent variable, $x$, in the model. In regression models, the independent variables are also referred to as regressors or predictor variables. The dependent variable, $Y$, is also referred to as the response. The slope, $\beta_1$, and the intercept, $\beta_0$, of the line are called regression coefficients. The slope, $\beta_1$, can be interpreted as the change in the mean value of $Y$ for a unit change in $x$.
The random error term, $\epsilon$, is assumed to follow the normal distribution with a mean of 0 and variance of $\sigma^2$. Since $Y$ is the sum of this random term and the mean value, $E(Y)$, which is a constant, the variance of $Y$ at any given value of $x$ is also $\sigma^2$. Therefore, at any given value of $x$, say $x_i$, the dependent variable $Y$ follows a normal distribution with a mean of $\beta_0 + \beta_1 x_i$ and a standard deviation of $\sigma$. This is illustrated in the following figure.
Fitted Regression Line
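The true regression line is usually not known. However, the regression line can be estimated by estimating the coefficients $\beta_1$ and $\beta_0$ for an observed data set. The estimates, $\hat{\beta}_1$ and $\hat{\beta}_0$, are calculated using least squares. (For details on least squares estimates, refer to Hahn & Shapiro (1967).) The estimated regression line, obtained using the values of $\hat{\beta}_1$ and $\hat{\beta}_0$, is called the fitted line. The least squares estimates, $\hat{\beta}_1$ and $\hat{\beta}_0$, are obtained using the following equations:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where $\bar{y}$ is the mean of all the observed values and $\bar{x}$ is the mean of all values of the predictor variable at which the observations were taken. $\bar{y}$ is calculated using $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x}$ is calculated using $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Once $\hat{\beta}_1$ and $\hat{\beta}_0$ are known, the fitted regression line can be written as:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
where $\hat{y}$ is the fitted or estimated value based on the fitted regression model. It is an estimate of the mean value, $E(Y)$. The fitted value, $\hat{y}_i$, for a given value of the predictor variable, $x_i$, may be different from the corresponding observed value, $y_i$. The difference between the two values is called the residual, $e_i$:
$$e_i = y_i - \hat{y}_i$$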
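The least squares calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the DOE++ implementation; the function name `fit_line` and the five data points are hypothetical and are not taken from the yield table.

```python
def fit_line(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator of the slope: sum(y_i * x_i) - (sum(y_i) * sum(x_i)) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    # Denominator of the slope: sum of squared deviations of x from its mean
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]                  # fitted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]   # e_i = y_i - fitted_i

print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
print("residuals:", [round(e, 4) for e in residuals])
```

For a least squares fit with an intercept, the residuals always sum to zero (up to rounding), which is a quick sanity check on any implementation.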
Calculation of the Fitted Line Using Least Square Estimates
The least square estimates of the regression coefficients can be obtained for the data in the preceding table as follows:
Knowing $\hat{\beta}_0$ and $\hat{\beta}_1$, the fitted regression line is:
This line is shown in the figure below.
Once the fitted regression line is known, the fitted value of $Y$ corresponding to any observed data point can be calculated. For example, the fitted value corresponding to the 21st observation in the preceding table is:
The observed response at this point is $y_{21}$. Therefore, the residual at this point is:
In DOE++, fitted values and residuals can be calculated. The values are shown in the figure below.
Hypothesis Tests in Simple Linear Regression
The following sections discuss hypothesis tests on the regression coefficients in simple linear regression. These tests can be carried out if it can be assumed that the random error term, $\epsilon$, is normally and independently distributed with a mean of zero and variance of $\sigma^2$.
t Tests
The $t$ tests are used to conduct hypothesis tests on the regression coefficients obtained in simple linear regression. A statistic based on the $t$ distribution is used to test the two-sided hypothesis that the true slope, $\beta_1$, equals some constant value, $\beta_{1,0}$. The statements for the hypothesis test are expressed as:
$$H_0: \beta_1 = \beta_{1,0} \qquad H_1: \beta_1 \neq \beta_{1,0}$$
The test statistic used for this test is:
$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{se(\hat{\beta}_1)}$$
where $\hat{\beta}_1$ is the least squares estimate of $\beta_1$, and $se(\hat{\beta}_1)$ is its standard error. The value of $se(\hat{\beta}_1)$ can be calculated as follows:
$$se(\hat{\beta}_1) = \sqrt{\frac{\sum_{i=1}^{n} e_i^2 / (n-2)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
The test statistic, $T_0$, follows a $t$ distribution with $(n-2)$ degrees of freedom, where $n$ is the total number of observations. The null hypothesis, $H_0$, is not rejected if the calculated value of the test statistic is such that:
$$-t_{\alpha/2,n-2} < T_0 < t_{\alpha/2,n-2}$$
where $t_{\alpha/2,n-2}$ and $-t_{\alpha/2,n-2}$ are the critical values for the two-sided hypothesis. $t_{\alpha/2,n-2}$ is the percentile of the $t$ distribution corresponding to a cumulative probability of $(1-\alpha/2)$ and $\alpha$ is the significance level.
If the value of $\beta_{1,0}$ used is zero, then the hypothesis tests for the significance of regression. In other words, the test indicates if the fitted regression model is of value in explaining variations in the observations or if you are trying to impose a regression model when no true relationship exists between $x$ and $Y$. Failure to reject $H_0: \beta_1 = 0$ implies that no linear relationship exists between $x$ and $Y$. This result may be obtained when the scatter plots of $Y$ against $x$ are as shown in (a) and (b) of the following figure. (a) represents the case where no model exists for the observed data. In this case you would be trying to fit a regression model to noise or random variation. (b) represents the case where the true relationship between $x$ and $Y$ is not linear. (c) and (d) represent the case when $H_0: \beta_1 = 0$ is rejected, implying that a model does exist between $x$ and $Y$. (c) represents the case where the linear model is sufficient, and (d) represents the case where a higher order model may be needed.
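A similar procedure can be used to test the hypothesis on the intercept. The test statistic used in this case is:
$$T_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{se(\hat{\beta}_0)}$$
where $\hat{\beta}_0$ is the least squares estimate of $\beta_0$, and $se(\hat{\beta}_0)$ is its standard error, which is calculated using:
$$se(\hat{\beta}_0) = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n-2}\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]}$$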
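The $t$ statistics for the slope and the intercept can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `coefficient_t_tests` and the data are hypothetical, and the critical value is typed in from a standard $t$ table rather than computed.

```python
import math


def coefficient_t_tests(x, y, beta0_null=0.0, beta1_null=0.0):
    """Return estimates, standard errors and t statistics for both coefficients."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    # Algebraically equivalent form of the least squares slope
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals divided by (n - 2)
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    se_b1 = math.sqrt(mse / s_xx)
    se_b0 = math.sqrt(mse * (1.0 / n + x_bar ** 2 / s_xx))
    return {
        "b0": b0, "b1": b1, "se_b0": se_b0, "se_b1": se_b1,
        "t_b0": (b0 - beta0_null) / se_b0,
        "t_b1": (b1 - beta1_null) / se_b1,
    }


# Hypothetical data, for illustration only (n = 5, so n - 2 = 3)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
res = coefficient_t_tests(x, y)

# Two-sided test at alpha = 0.1: t_{0.05, 3} = 2.353 (standard t table value)
t_crit = 2.353
reject_slope = abs(res["t_b1"]) > t_crit
reject_intercept = abs(res["t_b0"]) > t_crit
print(f"t(slope) = {res['t_b1']:.3f}, reject H0: {reject_slope}")
print(f"t(intercept) = {res['t_b0']:.3f}, reject H0: {reject_intercept}")
```

With these made-up numbers the slope is clearly significant while the intercept is not, which shows that the two tests answer separate questions.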
The test for the significance of regression for the data in the preceding table is illustrated in this example. The test is carried out using the $t$ test on the coefficient $\beta_1$. The hypothesis to be tested is $H_0: \beta_1 = 0$. To calculate the statistic to test $H_0$, the estimate, $\hat{\beta}_1$, and the standard error, $se(\hat{\beta}_1)$, are needed. The value of $\hat{\beta}_1$ was obtained earlier, when the fitted line was calculated. The standard error can be calculated as follows:
Then, the test statistic can be calculated using the following equation:
The $p$ value corresponding to this statistic based on the $t$ distribution with 23 ($n-2 = 25-2 = 23$) degrees of freedom can be obtained as follows:
Assuming that the desired significance level is 0.1, since $p$ value < 0.1, $H_0: \beta_1 = 0$ is rejected, indicating that a relation exists between temperature and yield for the data in the preceding table. Using this result along with the scatter plot, it can be concluded that the relationship between temperature and yield is linear.
In DOE++, information related to the $t$ test is displayed in the Regression Information table as shown in the following figure. In this table the $t$ test for $\beta_1$ is displayed in the row for the term Temperature because $\beta_1$ is the coefficient that represents the variable temperature in the regression model. The columns labeled Standard Error, T Value and P Value represent the standard error, the test statistic for the $t$ test and the $p$ value for the $t$ test, respectively. These values have been calculated for $\beta_1$ in this example. The Coefficient column represents the estimates of the regression coefficients. The Effect column represents values obtained by multiplying the coefficients by a factor of 2. This value is useful in the case of two factor experiments and is explained in Two Level Factorial Experiments. Columns Low Confidence and High Confidence represent the limits of the confidence intervals for the regression coefficients and are explained in Confidence Interval on Regression Coefficients.
Analysis of Variance Approach to Test the Significance of Regression
The analysis of variance (ANOVA) is another method to test for the significance of regression. As the name implies, this approach uses the variance of the observed data to determine if a regression model can be applied to the observed data. The observed variance is partitioned into components that are then used in the test for significance of regression.
Sum of Squares
The total variance (i.e., the variance of all of the observed data) is estimated using the observed data. As mentioned in Statistical Background, the variance of a population can be estimated using the sample variance, which is calculated using the following relationship:
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$$
The quantity in the numerator of the previous equation is called the sum of squares. It is the sum of the squares of the deviations of all the observations, $y_i$, from their mean, $\bar{y}$. In the context of ANOVA this quantity is called the total sum of squares (abbreviated $SS_T$) because it relates to the total variance of the observations. Thus:
$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
The denominator in the relationship of the sample variance is the number of degrees of freedom associated with the sample variance. Therefore, the number of degrees of freedom associated with $SS_T$, $dof(SS_T)$, is $(n-1)$. The sample variance is also referred to as a mean square because it is obtained by dividing the sum of squares by the respective degrees of freedom. Therefore, the total mean square (abbreviated $MS_T$) is:
$$MS_T = \frac{SS_T}{dof(SS_T)} = \frac{SS_T}{n-1}$$
When you attempt to fit a regression model to the observations, you are trying to explain some of the variation of the observations using this model. If the regression model is such that the resulting fitted regression line passes through all of the observations, then you would have a "perfect" model (see (a) of the figure below). In this case the model would explain all of the variability of the observations. Therefore, the model sum of squares (also referred to as the regression sum of squares and abbreviated $SS_R$) equals the total sum of squares; i.e., the model explains all of the observed variance:
$$SS_R = SS_T$$
For the perfect model, the regression sum of squares, $SS_R$, equals the total sum of squares, $SS_T$, because all estimated values, $\hat{y}_i$, will equal the corresponding observations, $y_i$. $SS_R$ can be calculated using a relationship similar to the one for obtaining $SS_T$ by replacing $y_i$ by $\hat{y}_i$ in the relationship of $SS_T$. Therefore:
$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
The number of degrees of freedom associated with $SS_R$, $dof(SS_R)$, is 1.
Based on the preceding discussion of ANOVA, a perfect regression model exists when the fitted regression line passes through all observed points. However, this is not usually the case, as seen in (b) of the following figure.
In both of these plots, a number of points do not follow the fitted regression line. This indicates that a part of the total variability of the observed data still remains unexplained. This portion of the total variability, or the total sum of squares, that is not explained by the model is called the residual sum of squares or the error sum of squares (abbreviated $SS_E$). The deviation for this sum of squares is obtained at each observation in the form of the residuals, $e_i$. The error sum of squares can be obtained as the sum of squares of these deviations:
$$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
The number of degrees of freedom associated with $SS_E$, $dof(SS_E)$, is $(n-2)$. The total variability of the observed data (i.e., total sum of squares, $SS_T$) can be written using the portion of the variability explained by the model, $SS_R$, and the portion unexplained by the model, $SS_E$, as:
$$SS_T = SS_R + SS_E$$
The above equation is also referred to as the analysis of variance identity and can be expanded as follows:
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
As mentioned previously, mean squares are obtained by dividing the sum of squares by the respective degrees of freedom. For example, the error mean square, $MS_E$, can be obtained as:
$$MS_E = \frac{SS_E}{dof(SS_E)} = \frac{SS_E}{n-2}$$
The error mean square is an estimate of the variance, $\sigma^2$, of the random error term, $\epsilon$, and can be written as:
$$\hat{\sigma}^2 = MS_E = \frac{SS_E}{n-2}$$
Similarly, the regression mean square, $MS_R$, can be obtained by dividing the regression sum of squares by the respective degrees of freedom as follows:
$$MS_R = \frac{SS_R}{dof(SS_R)} = \frac{SS_R}{1}$$
To test the hypothesis $H_0: \beta_1 = 0$, the statistic used is based on the $F$ distribution. It can be shown that if the null hypothesis $H_0$ is true, then the statistic:
$$F_0 = \frac{MS_R}{MS_E} = \frac{SS_R/1}{SS_E/(n-2)}$$
follows the $F$ distribution with 1 degree of freedom in the numerator and $(n-2)$ degrees of freedom in the denominator. $H_0$ is rejected if the calculated statistic, $F_0$, is such that:
$$F_0 > f_{\alpha,1,n-2}$$
where $f_{\alpha,1,n-2}$ is the percentile of the $F$ distribution corresponding to a cumulative probability of $(1-\alpha)$ and $\alpha$ is the significance level.
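The partition of the total sum of squares and the resulting $F$ statistic can be sketched as below. The function name `anova_simple_regression` and the data are hypothetical; the sketch only computes the statistic, and the comparison with $f_{\alpha,1,n-2}$ is left to a table or a statistics package.

```python
def anova_simple_regression(x, y):
    """Return the sums of squares, mean squares and F statistic for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]

    ss_t = sum((yi - y_bar) ** 2 for yi in y)                 # total, dof = n - 1
    ss_r = sum((fi - y_bar) ** 2 for fi in fitted)            # regression, dof = 1
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # error, dof = n - 2

    ms_r = ss_r / 1
    ms_e = ss_e / (n - 2)
    return {"ss_t": ss_t, "ss_r": ss_r, "ss_e": ss_e,
            "ms_r": ms_r, "ms_e": ms_e, "f0": ms_r / ms_e}


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
table = anova_simple_regression(x, y)
for name, value in table.items():
    print(f"{name}: {value:.4f}")
```

In simple linear regression the $F$ statistic equals the square of the $t$ statistic for the slope, so the two tests for significance of regression always agree.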
The analysis of variance approach to test the significance of regression can be applied to the yield data in the preceding table. To calculate the statistic, $F_0$, for the test, the sums of squares have to be obtained. The sums of squares can be calculated as shown next. The total sum of squares can be calculated as:
The regression sum of squares can be calculated as:
The error sum of squares can be calculated as:
Knowing the sums of squares, the statistic to test $H_0: \beta_1 = 0$ can be calculated as follows:
The critical value at a significance level of 0.1 is $f_{0.1,1,23}$. Since $F_0 > f_{0.1,1,23}$, $H_0: \beta_1 = 0$ is rejected and it is concluded that $\beta_1$ is not zero. Alternatively, the $p$ value can also be used. The $p$ value corresponding to the test statistic, $F_0$, based on the $F$ distribution with one degree of freedom in the numerator and 23 degrees of freedom in the denominator is:
Assuming that the desired significance level is 0.1, since the $p$ value < 0.1, $H_0: \beta_1 = 0$ is rejected, implying that a relation does exist between temperature and yield for the data in the preceding table. Using this result along with the scatter plot of the above figure, it can be concluded that the relationship that exists between temperature and yield is linear. This result is displayed in the ANOVA table as shown in the following figure. Note that this is the same result that was obtained from the $t$ test in the section t Tests. The ANOVA and Regression Information tables in DOE++ represent two different ways to test for the significance of the regression model. In the case of multiple linear regression models these tables are expanded to allow tests on individual variables used in the model. This is done using the extra sum of squares. Multiple linear regression models and the application of the extra sum of squares in the analysis of these models are discussed in Multiple Linear Regression Analysis.
Confidence Intervals in Simple Linear Regression
A confidence interval is a closed interval that is expected to contain the true value of an unknown quantity with a stated level of confidence. For example, a 90% confidence interval with a lower limit of $A$ and an upper limit of $B$ implies that, if the sampling and estimation were repeated many times, about 90% of the intervals constructed in this way would contain the true value. For a two-sided interval the remaining 10% is split evenly: in 5% of the cases the true value lies below $A$ and in 5% of the cases it lies above $B$. (For details refer to the Life Data Analysis Reference Book.) This section discusses confidence intervals used in simple linear regression analysis.
Confidence Interval on Regression Coefficients
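A $100(1-\alpha)$ percent confidence interval on $\beta_1$ is obtained as follows:
$$\hat{\beta}_1 \pm t_{\alpha/2,n-2} \cdot se(\hat{\beta}_1)$$
Similarly, a $100(1-\alpha)$ percent confidence interval on $\beta_0$ is obtained as:
$$\hat{\beta}_0 \pm t_{\alpha/2,n-2} \cdot se(\hat{\beta}_0)$$
Confidence Interval on Fitted Values
A $100(1-\alpha)$ percent confidence interval on any fitted value, $\hat{y}_i$, is obtained as follows:
$$\hat{y}_i \pm t_{\alpha/2,n-2} \sqrt{\hat{\sigma}^2 \left[\frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]}$$
It can be seen that the width of the confidence interval depends on the value of $x_i$: it is a minimum at $x_i = \bar{x}$ and widens as $|x_i - \bar{x}|$ increases.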
Confidence Interval on New Observations
For the data in the preceding table, assume that a new value of the yield is observed after the regression model is fit to the data. This new observation is independent of the observations used to obtain the regression model. If $x_p$ is the level of the temperature at which the new observation was taken, then the estimate for this new value based on the fitted regression model is:
$$\hat{y}_p = \hat{\beta}_0 + \hat{\beta}_1 x_p$$
If a confidence interval needs to be obtained on $y_p$, then this interval should include both the error from the fitted model and the error associated with future observations. This is because $\hat{y}_p$ represents the estimate for a value of $Y$ that was not used to obtain the regression model. The confidence interval on $y_p$ is referred to as the prediction interval. A $100(1-\alpha)$ percent prediction interval on a new observation is obtained as follows:
$$\hat{y}_p \pm t_{\alpha/2,n-2} \sqrt{\hat{\sigma}^2 \left[1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]}$$
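The difference between the confidence interval on a fitted value and the prediction interval on a new observation can be sketched as follows. The function name `intervals_at`, the data and the evaluation points are hypothetical, and the critical value $t_{0.025,3} = 3.182$ is typed in from a standard $t$ table as an assumption of this sketch.

```python
import math


def intervals_at(x, y, x_new, t_crit):
    """Return (y_hat, confidence_interval, prediction_interval) at x_new."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    # Error mean square, the estimate of sigma^2
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

    y_hat = b0 + b1 * x_new
    leverage = 1.0 / n + (x_new - x_bar) ** 2 / s_xx
    half_ci = t_crit * math.sqrt(mse * leverage)          # mean response
    half_pi = t_crit * math.sqrt(mse * (1.0 + leverage))  # new observation
    return (y_hat,
            (y_hat - half_ci, y_hat + half_ci),
            (y_hat - half_pi, y_hat + half_pi))


# Hypothetical data, for illustration only (n = 5, so n - 2 = 3)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
t_crit = 3.182  # t_{0.025, 3} from a standard t table (95% two-sided)

y_mid, ci_mid, pi_mid = intervals_at(x, y, 3.0, t_crit)   # at the mean of x
y_end, ci_end, pi_end = intervals_at(x, y, 5.0, t_crit)   # away from the mean
print("at x = 3:", y_mid, ci_mid, pi_mid)
print("at x = 5:", y_end, ci_end, pi_end)
```

The prediction interval is always wider than the confidence interval at the same point because of the extra 1 inside the square root, and both widen as the evaluation point moves away from the mean of the predictor.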
To illustrate the calculation of confidence intervals, the 95% confidence interval on the response at a specific temperature value, $x$, is obtained in this example for the data in the preceding table. A 95% prediction interval is also obtained, assuming that a new observation for the yield was made at a temperature of $x_p$.
The fitted value, $\hat{y}$, corresponding to this value of $x$ is:
The 95% confidence interval on the fitted value, $\hat{y}$, is:
The 95% limits on $\hat{y}$ are 199.95 and 205.2, respectively. The estimated value based on the fitted regression model for the new observation at $x_p$ is:
The 95% prediction interval on $\hat{y}_p$ is:
The 95% limits on $\hat{y}_p$ are 189.9 and 207.2, respectively. In DOE++, confidence and prediction intervals can be calculated from the control panel. The prediction interval values calculated in this example are shown in the figure below as Low Prediction Interval and High Prediction Interval, respectively. The columns labeled Mean Predicted and Standard Error represent the values of $\hat{y}_p$ and the standard error used in the calculations.
Measures of Model Adequacy
It is important to analyze the regression model before inferences based on the model are undertaken. The following sections present some techniques that can be used to check the appropriateness of the model for the given data. These techniques help to determine if any of the model assumptions have been violated.
Coefficient of Determination (R2)
The coefficient of determination is a measure of the amount of variability in the data accounted for by the regression model. As mentioned previously, the total variability of the data is measured by the total sum of squares, $SS_T$. The amount of this variability explained by the regression model is the regression sum of squares, $SS_R$. The coefficient of determination is the ratio of the regression sum of squares to the total sum of squares:
$$R^2 = \frac{SS_R}{SS_T}$$
$R^2$ can take on values between 0 and 1 since $0 \leq SS_R \leq SS_T$. For the yield data example, $R^2$ can be calculated as:
Therefore, 98% of the variability in the yield data is explained by the regression model, indicating a very good fit of the model. It may appear that larger values of $R^2$ indicate a better fitting regression model. However, $R^2$ should be used cautiously as this is not always the case. The value of $R^2$ increases as more terms are added to the model, even if the new term does not contribute significantly to the model. Therefore, an increase in the value of $R^2$ cannot be taken as a sign to conclude that the new model is superior to the older model. Adding a new term may make the regression model worse if the error mean square, $MS_E$, for the new model is larger than the $MS_E$ of the older model, even though the new model will show an increased value of $R^2$. In the results obtained from DOE++, $R^2$ is displayed as R-sq under the ANOVA table (as shown in the figure below), which displays the complete analysis sheet for the data in the preceding table.
The other values displayed with $R^2$ are S, R-sq(adj), PRESS and R-sq(pred). These values measure different aspects of the adequacy of the regression model. For example, the value of S is the square root of the error mean square, $MS_E$, and represents the "standard error of the model." A lower value of S indicates a better fitting model. The values of S, R-sq and R-sq(adj) indicate how well the model fits the observed data. The values of PRESS and R-sq(pred) are indicators of how well the regression model predicts new observations. R-sq(adj), PRESS and R-sq(pred) are explained in Multiple Linear Regression Analysis.
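The two fit measures defined above, the coefficient of determination and S, can be sketched as follows. The function name `fit_summary` and the data are hypothetical; the other quantities reported by DOE++ (R-sq(adj), PRESS, R-sq(pred)) are not reproduced here.

```python
import math


def fit_summary(x, y):
    """Return (r_squared, s) for a straight-line least squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]

    ss_t = sum((yi - y_bar) ** 2 for yi in y)
    ss_r = sum((fi - y_bar) ** 2 for fi in fitted)
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

    r_squared = ss_r / ss_t           # share of variability explained
    s = math.sqrt(ss_e / (n - 2))     # square root of the error mean square
    return r_squared, s


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
r_squared, s = fit_summary(x, y)
print(f"R-sq = {r_squared:.4f}, S = {s:.4f}")
```

Note that S is expressed in the units of the response, which often makes it easier to interpret than the unitless ratio.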
Residual Analysis
In the simple linear regression model the true error terms, $\epsilon_i$, are never known. The residuals, $e_i$, may be thought of as the observed error terms that are similar to the true error terms. Since the true error terms, $\epsilon_i$, are assumed to be normally distributed with a mean of zero and a variance of $\sigma^2$, in a good model the observed error terms (i.e., the residuals, $e_i$) should also follow these assumptions. Thus the residuals in the simple linear regression should be normally distributed with a mean of zero and a constant variance of $\sigma^2$. Residuals are usually plotted against the fitted values, $\hat{y}_i$, against the predictor variable values, $x_i$, and against time or run-order sequence, in addition to the normal probability plot. Plots of residuals are used to check for the following:
1. Residuals follow the normal distribution.
2. Residuals have a constant variance.
3. Regression function is linear.
4. A pattern does not exist when residuals are plotted in a time or run-order sequence.
5. There are no outliers.
Examples of residual plots are shown in the following figure. (a) is a satisfactory plot with the residuals falling in a horizontal band with no systematic pattern. Such a plot indicates an appropriate regression model. (b) shows residuals falling in a funnel shape. Such a plot indicates an increase in the variance of the residuals, so the assumption of constant variance is violated here. A transformation on $Y$ may be helpful in this case (see Transformations). If the residuals follow the pattern of (c) or (d), then this is an indication that the linear regression model is not adequate. Addition of higher order terms to the regression model or a transformation on $x$ or $Y$ may be required in such cases. A plot of residuals may also show a pattern as seen in (e), indicating that the residuals increase (or decrease) as the run order sequence or time progresses. This may be due to factors such as operator-learning or instrument-creep and should be investigated further.
Residual plots for the data of the preceding table are shown in the following figures. One of the following figures is the normal probability plot. It can be observed that the residuals follow the normal distribution and the assumption of normality is valid here. In one of the following figures the residuals are plotted against the fitted values, $\hat{y}_i$, and in one of the following figures the residuals are plotted against the run order. Both of these plots show that the 21st observation seems to be an outlier. Further investigations are needed to study the cause of this outlier.
Lack-of-Fit Test
As mentioned in Analysis of Variance Approach, a perfect regression model results in a fitted line that passes exactly through all observed data points. This perfect model will give us a zero error sum of squares ($SS_E = 0$). Thus, no error exists for the perfect model. However, if you record the response values for the same values of $x$ for a second time, in conditions maintained as strictly identical as possible to the first time, observations from the second time will not all fall along the perfect model. The deviations in observations recorded for the second time constitute the "purely" random variation or noise. The sum of squares due to pure error (abbreviated $SS_{PE}$) quantifies these variations. $SS_{PE}$ is calculated by taking repeated observations at some or all values of $x$ and adding up the squares of the deviations at each level of $x$ using the respective repeated observations at that $x$ value.
Assume that there are $n$ levels of $x$ and $m_i$ repeated observations are taken at the $i$th level. Note that in this section $n$ denotes the number of levels of $x$ rather than the total number of observations. The data is collected as shown next:
The sum of squares of the deviations from the mean of the observations at the $i$th level of $x$, $x_i$, can be calculated as:
$$\sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2$$
where $\bar{y}_i$ is the mean of the $m_i$ repeated observations corresponding to $x_i$ ($\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}$). The number of degrees of freedom for these deviations is $(m_i - 1)$ as there are $m_i$ observations at the $i$th level of $x$ but one degree of freedom is lost in calculating the mean, $\bar{y}_i$.
The total sum of square deviations (or $SS_{PE}$) for all levels of $x$ can be obtained by summing the deviations for all $x_i$ as shown next:
$$SS_{PE} = \sum_{i=1}^{n} \sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2$$
The total number of degrees of freedom associated with $SS_{PE}$ is:
$$dof(SS_{PE}) = \sum_{i=1}^{n} (m_i - 1) = \sum_{i=1}^{n} m_i - n$$
If all $m_i = m$ (i.e., $m$ repeated observations are taken at all levels of $x$), then $\sum_{i=1}^{n} m_i = nm$ and the degrees of freedom associated with $SS_{PE}$ are:
$$dof(SS_{PE}) = nm - n$$
The corresponding mean square in this case will be:
$$MS_{PE} = \frac{SS_{PE}}{dof(SS_{PE})} = \frac{SS_{PE}}{nm - n}$$
When repeated observations are used for a perfect regression model, the sum of squares due to pure error, $SS_{PE}$, is also considered as the error sum of squares, $SS_E$. For the case when repeated observations are used with imperfect regression models, there are two components of the error sum of squares, $SS_E$. One portion is the pure error due to the repeated observations. The other portion is the error that represents variation not captured because of the imperfect model. The second portion is termed the sum of squares due to lack-of-fit (abbreviated $SS_{LOF}$) to point to the deficiency in fit due to departure from the perfect-fit model. Thus, for an imperfect regression model:
$$SS_E = SS_{PE} + SS_{LOF}$$
Knowing $SS_E$ and $SS_{PE}$, the previous equation can be used to obtain $SS_{LOF}$:
$$SS_{LOF} = SS_E - SS_{PE}$$
The degrees of freedom associated with $SS_{LOF}$ can be obtained in a similar manner using subtraction. For the case when $m$ repeated observations are taken at all levels of $x$, the number of degrees of freedom associated with $SS_{PE}$ is:
$$dof(SS_{PE}) = nm - n$$
Since there are $nm$ total observations, the number of degrees of freedom associated with $SS_E$ is:
$$dof(SS_E) = nm - 2$$
Therefore, the number of degrees of freedom associated with $SS_{LOF}$ is:
$$dof(SS_{LOF}) = (nm - 2) - (nm - n) = n - 2$$
The corresponding mean square, $MS_{LOF}$, can now be obtained as:
$$MS_{LOF} = \frac{SS_{LOF}}{dof(SS_{LOF})} = \frac{SS_{LOF}}{n - 2}$$
The magnitude of $SS_{LOF}$ or $MS_{LOF}$ will provide an indication of how far the regression model is from the perfect model. An $F$ test exists to examine the lack-of-fit at a particular significance level. The quantity $MS_{LOF}/MS_{PE}$ follows an $F$ distribution with $(n-2)$ degrees of freedom in the numerator and $(nm-n)$ degrees of freedom in the denominator when all $m_i$ equal $m$. The test statistic for the lack-of-fit test is:
$$F_0 = \frac{MS_{LOF}}{MS_{PE}}$$
If the critical value $f_{\alpha,n-2,nm-n}$ is such that:
$$F_0 > f_{\alpha,n-2,nm-n}$$
it will lead to the rejection of the hypothesis that the model adequately fits the data.
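The split of the error sum of squares into pure error and lack-of-fit can be sketched as below. The function name `lack_of_fit` and the data (four levels with two repeated observations each) are hypothetical and are not the yield data; the sketch computes the test statistic only, and the critical value still has to come from an $F$ table or a statistics package.

```python
def lack_of_fit(levels, repeats):
    """levels: the x levels; repeats: one list of repeated y values per level."""
    xs = [x for x, ys in zip(levels, repeats) for _ in ys]
    ys_all = [y for ys in repeats for y in ys]
    n_obs = len(ys_all)
    n_levels = len(levels)

    # Straight-line least squares fit to all observations
    x_bar = sum(xs) / n_obs
    y_bar = sum(ys_all) / n_obs
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys_all)) / s_xx
    b0 = y_bar - b1 * x_bar

    # Error sum of squares about the fitted line
    ss_e = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys_all))
    # Pure error: deviations of repeated observations from their level mean
    ss_pe = sum((y - sum(ys) / len(ys)) ** 2 for ys in repeats for y in ys)
    ss_lof = ss_e - ss_pe

    dof_pe = n_obs - n_levels   # sum of (m_i - 1) over all levels
    dof_lof = n_levels - 2
    ms_pe = ss_pe / dof_pe
    ms_lof = ss_lof / dof_lof
    return {"ss_e": ss_e, "ss_pe": ss_pe, "ss_lof": ss_lof,
            "dof_pe": dof_pe, "dof_lof": dof_lof, "f0": ms_lof / ms_pe}


# Hypothetical data, for illustration only
levels = [1.0, 2.0, 3.0, 4.0]
repeats = [[2.1, 1.9], [4.4, 4.0], [6.1, 5.9], [7.9, 7.7]]
result = lack_of_fit(levels, repeats)
for name, value in result.items():
    print(f"{name}: {value}")
```

Because the degrees of freedom are computed from the actual counts, the same sketch also covers the case where the number of repeated observations differs between levels.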
Assume that a second set of observations is taken for the yield data of the preceding table. The resulting observations are recorded in the following table. To conduct a lack-of-fit test on this data, the statistic $F_0 = MS_{LOF}/MS_{PE}$ can be calculated as shown next.
Calculation of Least Square Estimates
The parameters of the fitted regression model can be obtained as:
Knowing $\hat{\beta}_1$ and $\hat{\beta}_0$, the fitted values, $\hat{y}_i$, can be calculated.
Calculation of the Sum of Squares
Using the fitted values, the sums of squares can be obtained as follows:
The error sum of squares, $SS_E$, can now be split into the sum of squares due to pure error, $SS_{PE}$, and the sum of squares due to lack-of-fit, $SS_{LOF}$. $SS_{PE}$ can be calculated as follows, using the number of levels, $n$, and the number of repeated observations per level, $m$, for this example:
The number of degrees of freedom associated with $SS_{PE}$ is:
The corresponding mean square, $MS_{PE}$, can now be obtained as:
$SS_{LOF}$ can be obtained by subtraction from $SS_E$ as:
Similarly, the number of degrees of freedom associated with $SS_{LOF}$ is:
The lack-of-fit mean square is:
Calculation of the Test Statistic
The test statistic for the lack-of-fit test can now be calculated as:
The critical value for this test is:
Since $F_0$ is less than the critical value, we fail to reject the hypothesis that the model adequately fits the data. The $p$ value for this case is:
Therefore, at a significance level of 0.05 we conclude that the fitted simple linear regression model is adequate for the observed data. The following table presents a summary of the ANOVA calculations for the lack-of-fit test.
Transformations
The linear regression model may not be directly applicable to certain data. Non-linearity may be detected from scatter plots or may be known through the underlying theory of the product or process or from past experience. Transformations on either the predictor variable, $x$, or the response variable, $Y$, may often be sufficient to make the linear regression model appropriate for the transformed data. If it is known that the data follows the logarithmic distribution, then a logarithmic transformation on $Y$ (i.e., $Y^* = \log(Y)$) might be useful. For data following the Poisson distribution, a square root transformation ($Y^* = \sqrt{Y}$) is generally applicable.
Transformations on $Y$ may also be applied based on the type of scatter plot obtained from the data. The following figure shows a few such examples.
For the scatter plot labeled (a), a square root transformation ($Y^* = \sqrt{Y}$) is applicable, while for the plot labeled (b), a logarithmic transformation (i.e., $Y^* = \log(Y)$) may be applied. For the plot labeled (c), the reciprocal transformation ($Y^* = 1/Y$) is applicable. At times it may be helpful to introduce a constant into the transformation of $Y$. For example, if $Y$ is negative and the logarithmic transformation on $Y$ seems applicable, a suitable constant, $k$, may be chosen to make all observed values of $Y + k$ positive. Thus the transformation in this case would be $Y^* = \log(k + Y)$.
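The effect of a transformation on the response can be sketched with a deliberately simple case. The data below are hypothetical and were constructed so that the square root of the response is exactly linear in the predictor; the helper name `r_squared` is also an assumption of this sketch.

```python
import math


def r_squared(x, y):
    """Coefficient of determination of a straight-line least squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    ss_t = sum((yi - y_bar) ** 2 for yi in y)
    ss_e = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_e / ss_t


# Hypothetical data generated as y = (1 + 2x)^2, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 25.0, 49.0, 81.0, 121.0]

r2_raw = r_squared(x, y)                            # straight line on raw y
r2_sqrt = r_squared(x, [math.sqrt(v) for v in y])   # straight line on sqrt(y)
print(f"R-sq on raw response:         {r2_raw:.4f}")
print(f"R-sq on square-root response: {r2_sqrt:.4f}")
```

Real data will not linearize this cleanly, and the coefficient of determination on the transformed scale is not directly comparable with the original scale, so residual plots of the transformed fit should still be checked.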
The Box-Cox method may also be used to automatically identify a suitable power transformation for the data based on the relation:
$$Y^* = Y^{\lambda}$$
Here the parameter $\lambda$ is determined using the given data such that $SS_E$ is minimized (details on this method are presented in One Factor Designs).