Multiple Linear Regression Analysis

From ReliaWiki


This chapter expands on the analysis of simple linear regression models and discusses the analysis of multiple linear regression models. A major portion of the results displayed in DOE++ are explained in this chapter because these results are associated with multiple linear regression. One of the applications of multiple linear regression models is Response Surface Methodology (RSM). RSM is a method used to locate the optimum value of the response and is one of the final stages of experimentation. It is discussed in Response Surface Methods. Towards the end of this chapter, the concept of using indicator variables in regression models is explained. Indicator variables are used to represent qualitative factors in regression models. The concept of using indicator variables is important to gain an understanding of ANOVA models, which are the models used to analyze data obtained from experiments. These models can be thought of as first order multiple linear regression models where all the factors are treated as qualitative factors. ANOVA models are discussed in the One Factor Designs and General Full Factorial Designs chapters.

Multiple Linear Regression Model

A linear regression model that contains more than one predictor variable is called a multiple linear regression model. The following model is a multiple linear regression model with two predictor variables, $x_1$ and $x_2$.

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

The model is linear because it is linear in the parameters $\beta_0$, $\beta_1$ and $\beta_2$. The model describes a plane in the three-dimensional space of $Y$, $x_1$ and $x_2$. The parameter $\beta_0$ is the intercept of this plane. Parameters $\beta_1$ and $\beta_2$ are referred to as partial regression coefficients. Parameter $\beta_1$ represents the change in the mean response corresponding to a unit change in $x_1$ when $x_2$ is held constant. Parameter $\beta_2$ represents the change in the mean response corresponding to a unit change in $x_2$ when $x_1$ is held constant. Consider the following example of a multiple linear regression model with two predictor variables, $x_1$ and $x_2$:



This regression model is a first order multiple linear regression model because the maximum power of the variables in the model is 1. (The regression plane corresponding to this model is shown in the figure below.) Also shown is an observed data point and the corresponding random error, $\epsilon$. The true regression model is usually not known (and therefore the values of the random error terms corresponding to observed data points remain unknown). However, the regression model can be estimated by calculating the parameters of the model for an observed data set. This is explained in Estimating Regression Models Using Least Squares.

One of the following figures shows the contour plot for the regression model of the above equation. The contour plot shows lines of constant mean response values as a function of $x_1$ and $x_2$. The contour lines for the given regression model are straight lines, as seen on the plot. Straight contour lines result for first order regression models with no interaction terms.

A linear regression model may also take the following form:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \epsilon$$

A cross-product term, $x_1 x_2$, is included in the model. This term represents an interaction effect between the two variables $x_1$ and $x_2$. Interaction means that the effect produced by a change in one predictor variable on the response depends on the level of the other predictor variable(s). As an example of a linear regression model with interaction, consider the model whose regression plane and contour plot are shown in the following two figures, respectively.


Regression plane for the model


Contour plot for the model


Now consider the regression model shown next:



This model is also a linear regression model and is referred to as a polynomial regression model. Polynomial regression models contain squared and higher order terms of the predictor variables making the response surface curvilinear. As an example of a polynomial regression model with an interaction term consider the following equation:



This model is a second order model because the maximum power of the terms in the model is two. The regression surface for this model is shown in the following figure. Such regression models are used in RSM to find the optimum value of the response, $Y$ (for details see Response Surface Methods for Optimization). Notice that, although the shape of the regression surface is curvilinear, the regression model is still linear because the model is linear in the parameters. The contour plot for this model is shown in the second of the following two figures.


Regression plane for the model


Contour plot for the model


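All multiple linear regression models can be expressed in the following general form:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon$$

where $k$ denotes the number of terms in the model. For example, a model with squared or cross-product terms can be written in the general form by treating each such term as a separate predictor variable.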
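As a hedged illustration of the general form, the sketch below builds the design matrix for a second order model in two variables by treating the squared and cross-product terms as additional predictor variables. The coefficient values and the 3 x 3 grid of levels are hypothetical, chosen only so that the least squares fit can be checked against known values; they do not come from this chapter.

```python
import numpy as np

# Hypothetical second order model used only for illustration (no random error):
# Y = 1 + 2*x1 + 3*x2 - 0.5*x1^2 - 0.25*x2^2 + 0.75*x1*x2
beta_true = np.array([1.0, 2.0, 3.0, -0.5, -0.25, 0.75])

# Three levels of each variable on a full 3 x 3 grid.
g1, g2 = np.meshgrid([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
x1, x2 = g1.ravel(), g2.ravel()

# Treat each squared and cross-product term as its own predictor variable:
# x3 = x1^2, x4 = x2^2, x5 = x1*x2, so the general form has k = 5 terms.
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
y = X @ beta_true

# The model is linear in the parameters, so ordinary least squares applies
# even though the response surface is curved.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 6))
```

Because the data here are noiseless, the fit should reproduce the assumed coefficients; with real data the estimates would only approximate the true values.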


Estimating Regression Models Using Least Squares

Consider a multiple linear regression model with $k$ predictor variables:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon$$

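Let each of the $k$ predictor variables, $x_1$, $x_2$, ... $x_k$, have $n$ levels. Then $x_{ij}$ represents the $i$th level of the $j$th predictor variable $x_j$. For example, $x_{51}$ represents the fifth level of the first predictor variable $x_1$, while $x_{19}$ represents the first level of the ninth predictor variable, $x_9$. Observations, $y_1$, $y_2$, ... $y_n$, recorded for each of these $n$ levels can be expressed in the following way:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \epsilon_i, \qquad i = 1, 2, \dots, n$$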

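The system of $n$ equations shown previously can be represented in matrix notation as follows:

$$y = X\beta + \epsilon$$

where

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$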


The matrix $X$ is referred to as the design matrix. It contains information about the levels of the predictor variables at which the observations are obtained. The vector $\beta$ contains all the regression coefficients. To obtain the regression model, $\beta$ should be known. $\beta$ is estimated using least square estimates. The following equation is used:

$$\hat{\beta} = (X'X)^{-1}X'y$$

where $'$ represents the transpose of the matrix while $^{-1}$ represents the matrix inverse. Knowing the estimates, $\hat{\beta}$, the multiple linear regression model can now be estimated as:

$$\hat{y} = X\hat{\beta}$$

The estimated regression model is also referred to as the fitted model. The observations, $y_i$, may be different from the fitted values $\hat{y}_i$ obtained from this model. The difference between these two values is the residual, $e_i$. The vector of residuals, $e$, is obtained as:

$$e = y - \hat{y}$$

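The fitted model can also be written as follows, using $\hat{\beta} = (X'X)^{-1}X'y$:

$$\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Hy$$

where $H = X(X'X)^{-1}X'$. The matrix, $H$, is referred to as the hat matrix. It transforms the vector of the observed response values, $y$, to the vector of fitted values, $\hat{y}$.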
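The matrix calculations above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the eight data points are hypothetical (they are not the table used in the example that follows), and the variable names are chosen here for readability.

```python
import numpy as np

# Hypothetical data (not the table from the example): two predictors, eight runs.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([12.1, 11.8, 25.3, 24.1, 38.2, 36.9, 51.0, 50.2])

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimates: beta_hat = (X'X)^{-1} X'y.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Hat matrix, fitted values and residuals.
H = X @ XtX_inv @ X.T
y_hat = H @ y
e = y - y_hat

print("beta_hat:", beta_hat)
print("residuals:", e)
```

Forming the explicit inverse mirrors the formulas in the text; for larger or poorly conditioned problems a solver such as `np.linalg.lstsq` is usually the safer numerical choice.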

Example

An analyst studying a chemical process expects the yield to be affected by the levels of two factors, $x_1$ and $x_2$. Observations recorded for various levels of the two factors are shown in the following table. The analyst wants to fit a first order regression model to the data. Interaction between $x_1$ and $x_2$ is not expected based on knowledge of similar processes. Units of the factor levels and the yield are ignored for the analysis.


Observed yield data for various levels of two factors.


The data of the above table can be entered into DOE++ using the multiple linear regression folio tool as shown in the following figure.


Multiple Regression tool in DOE++ with the data in the table.


A scatter plot for the data is shown next.


Three-dimensional scatter plot for the observed data in the table.


The first order regression model applicable to this data set having two predictor variables is:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

where the dependent variable, $Y$, represents the yield and the predictor variables, $x_1$ and $x_2$, represent the two factors respectively. The $X$ and $y$ matrices for the data can be obtained as:



The least square estimates, $\hat{\beta}$, can now be obtained:

$$\hat{\beta} = (X'X)^{-1}X'y$$

Thus:



and the estimated regression coefficients $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ are the elements of this vector. The fitted regression model is:



The fitted regression model can be viewed in DOE++, as shown next.


Equation of the fitted regression model for the data from the table.


A plot of the fitted regression plane is shown in the following figure.


Fitted regression plane for the data from the table.


The fitted regression model can be used to obtain fitted values, $\hat{y}_i$, corresponding to an observed response value, $y_i$. For example, the fitted value corresponding to the fifth observation is:



The residual corresponding to the observed fifth response value, $y_5$, is:

$$e_5 = y_5 - \hat{y}_5$$

In DOE++, fitted values and residuals are shown in the Diagnostic Information table of the detailed summary of results. The values are shown in the following figure.


Fitted values and residuals for the data in the table.


The fitted regression model can also be used to predict response values. For example, to obtain the response value for a new observation corresponding to 47 units of $x_1$ and 31 units of $x_2$, the value is calculated using:


Properties of the Least Square Estimators for Beta

The least square estimates, $\hat{\beta}_0$, $\hat{\beta}_1$, ... $\hat{\beta}_k$, are unbiased estimators of $\beta_0$, $\beta_1$, ... $\beta_k$, provided that the random error terms, $\epsilon_i$, are normally and independently distributed. The variances of the $\hat{\beta}$s are obtained using the $(X'X)^{-1}$ matrix. The variance-covariance matrix of the estimated regression coefficients is obtained as follows:

$$C = \hat{\sigma}^2 (X'X)^{-1}$$

$C$ is a symmetric matrix whose diagonal elements, $C_{jj}$, represent the variance of the estimated $j$th regression coefficient, $\hat{\beta}_j$. The off-diagonal elements, $C_{ij}$, represent the covariance between the $i$th and $j$th estimated regression coefficients, $\hat{\beta}_i$ and $\hat{\beta}_j$. The value of $\hat{\sigma}^2$ is obtained using the error mean square, $MS_E$. The variance-covariance matrix for the data in the table (see Estimating Regression Models Using Least Squares) can be viewed in DOE++, as shown next.


The variance-covariance matrix for the data in table.


Calculations to obtain the matrix $C$ are given in this example. The positive square root of $C_{jj}$ represents the estimated standard deviation of the $j$th regression coefficient, $\hat{\beta}_j$, and is called the estimated standard error of $\hat{\beta}_j$ (abbreviated $se(\hat{\beta}_j)$).

$$se(\hat{\beta}_j) = \sqrt{C_{jj}}$$
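The variance-covariance matrix and the standard errors can be sketched numerically as follows. The data are hypothetical (they are not the table from the example) and the variable names are illustrative; the error mean square is used as the estimate of the error variance, as described above.

```python
import numpy as np

# Hypothetical data (not the table from the example): two predictors, eight runs.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([12.1, 11.8, 25.3, 24.1, 38.2, 36.9, 51.0, 50.2])
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape                      # p = k + 1 estimated coefficients

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat                # residuals

ms_e = (e @ e) / (n - p)            # error mean square, estimate of sigma^2
C = ms_e * XtX_inv                  # variance-covariance matrix of beta_hat
se = np.sqrt(np.diag(C))            # estimated standard errors

print("C =\n", C)
print("se =", se)
```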



Hypothesis Tests in Multiple Linear Regression

This section discusses hypothesis tests on the regression coefficients in multiple linear regression. As in the case of simple linear regression, these tests can only be carried out if it can be assumed that the random error terms, $\epsilon_i$, are normally and independently distributed with a mean of zero and variance of $\sigma^2$. Three types of hypothesis tests can be carried out for multiple linear regression models:

  1. Test for significance of regression: This test checks the significance of the whole regression model.
  2. $t$ test: This test checks the significance of individual regression coefficients.
  3. $F$ test: This test can be used to simultaneously check the significance of a number of regression coefficients. It can also be used to test individual coefficients.

Test for Significance of Regression

The test for significance of regression in the case of multiple linear regression analysis is carried out using the analysis of variance. The test is used to check if a linear statistical relationship exists between the response variable and at least one of the predictor variables. The statements for the hypotheses are:

$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$$
$$H_1: \beta_j \neq 0 \text{ for at least one } j$$

The test for $H_0$ is carried out using the following statistic:

$$F_0 = \frac{MS_R}{MS_E}$$

where $MS_R$ is the regression mean square and $MS_E$ is the error mean square. If the null hypothesis, $H_0$, is true then the statistic $F_0$ follows the $F$ distribution with $k$ degrees of freedom in the numerator and $n-(k+1)$ degrees of freedom in the denominator. The null hypothesis, $H_0$, is rejected if the calculated statistic, $F_0$, is such that:

$$F_0 > f_{\alpha,\, k,\, n-(k+1)}$$

Calculation of the Statistic $F_0$

To calculate the statistic $F_0$, the mean squares $MS_R$ and $MS_E$ must be known. As explained in Simple Linear Regression Analysis, the mean squares are obtained by dividing the sums of squares by their degrees of freedom. For example, the total mean square, $MS_T$, is obtained as follows:

$$MS_T = \frac{SS_T}{dof(SS_T)}$$

where $SS_T$ is the total sum of squares and $dof(SS_T)$ is the number of degrees of freedom associated with $SS_T$. In multiple linear regression, the following equation is used to calculate $SS_T$:

$$SS_T = y'\left[I - \frac{1}{n}J\right]y$$

where $n$ is the total number of observations, $y$ is the vector of observations (that was defined in Estimating Regression Models Using Least Squares), $I$ is the identity matrix of order $n$ and $J$ represents an $n \times n$ square matrix of ones. The number of degrees of freedom associated with $SS_T$, $dof(SS_T)$, is $n-1$. Knowing $SS_T$ and $dof(SS_T)$, the total mean square, $MS_T$, can be calculated.

The regression mean square, $MS_R$, is obtained by dividing the regression sum of squares, $SS_R$, by the respective degrees of freedom, $dof(SS_R)$, as follows:

$$MS_R = \frac{SS_R}{dof(SS_R)}$$

The regression sum of squares, $SS_R$, is calculated using the following equation:

$$SS_R = y'\left[H - \frac{1}{n}J\right]y$$

where $n$ is the total number of observations, $y$ is the vector of observations, $H$ is the hat matrix and $J$ represents an $n \times n$ square matrix of ones. The number of degrees of freedom associated with $SS_R$, $dof(SS_R)$, is $k$, where $k$ is the number of predictor variables in the model. Knowing $SS_R$ and $dof(SS_R)$, the regression mean square, $MS_R$, can be calculated.

The error mean square, $MS_E$, is obtained by dividing the error sum of squares, $SS_E$, by the respective degrees of freedom, $dof(SS_E)$, as follows:

$$MS_E = \frac{SS_E}{dof(SS_E)}$$

The error sum of squares, $SS_E$, is calculated using the following equation:

$$SS_E = y'(I - H)y$$

where $y$ is the vector of observations, $I$ is the identity matrix of order $n$ and $H$ is the hat matrix. The number of degrees of freedom associated with $SS_E$, $dof(SS_E)$, is $n-(k+1)$, where $n$ is the total number of observations and $k$ is the number of predictor variables in the model. Knowing $SS_E$ and $dof(SS_E)$, the error mean square, $MS_E$, can be calculated. The error mean square is an estimate of the variance, $\sigma^2$, of the random error terms, $\epsilon_i$.
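The sums of squares and the test statistic for significance of regression can be sketched with the matrix formulas above. The data below are hypothetical (not the table from the example), and no critical value is computed; the resulting statistic would be compared with the tabulated value of the F distribution for the chosen significance level.

```python
import numpy as np

# Hypothetical data (not the table from the example): two predictors, eight runs.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([12.1, 11.8, 25.3, 24.1, 38.2, 36.9, 51.0, 50.2])
X = np.column_stack([np.ones_like(x1), x1, x2])

n, p = X.shape                      # p = k + 1 coefficients
k = p - 1                           # number of predictor variables

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
I = np.eye(n)                          # identity matrix of order n
J = np.ones((n, n))                    # n x n matrix of ones

ss_t = y @ (I - J / n) @ y          # total sum of squares, n - 1 dof
ss_r = y @ (H - J / n) @ y          # regression sum of squares, k dof
ss_e = y @ (I - H) @ y              # error sum of squares, n - (k + 1) dof

ms_r = ss_r / k
ms_e = ss_e / (n - (k + 1))
f0 = ms_r / ms_e                    # statistic for significance of regression

print("SS_T, SS_R, SS_E:", ss_t, ss_r, ss_e)
print("F0:", f0)
```

The identity SS_T = SS_R + SS_E provides a quick check on the arithmetic.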


Example

The test for the significance of regression, for the regression model obtained for the data in the table (see Estimating Regression Models Using Least Squares), is illustrated in this example. The null hypothesis for the model is:

$$H_0: \beta_1 = \beta_2 = 0$$

The statistic to test $H_0$ is:

$$F_0 = \frac{MS_R}{MS_E}$$

To calculate $F_0$, first the sums of squares are calculated so that the mean squares can be obtained. Then the mean squares are used to calculate the statistic $F_0$ to carry out the significance test. The regression sum of squares, $SS_R$, can be obtained as:

$$SS_R = y'\left[H - \frac{1}{n}J\right]y$$

The hat matrix, $H$, is calculated as follows using the design matrix $X$ from the previous example:

$$H = X(X'X)^{-1}X'$$

Knowing $y$, $H$ and $J$, the regression sum of squares, $SS_R$, can be calculated:

$$SS_R = y'\left[H - \frac{1}{n}J\right]y = 12816.35$$

The number of degrees of freedom associated with $SS_R$ is $k$, which equals two since there are two predictor variables in the data in the table (see Multiple Linear Regression Analysis). Therefore, the regression mean square is:

$$MS_R = \frac{SS_R}{dof(SS_R)} = \frac{12816.35}{2} = 6408.175$$

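Similarly, to calculate the error mean square, $MS_E$, the error sum of squares, $SS_E$, can be obtained as:

$$SS_E = y'(I - H)y$$

The number of degrees of freedom associated with $SS_E$ is $n-(k+1) = 17-(2+1) = 14$. Therefore, the error mean square, $MS_E$, is:

$$MS_E = \frac{SS_E}{dof(SS_E)} = \frac{SS_E}{14} = 30.24$$

The statistic to test the significance of regression can now be calculated as:

$$F_0 = \frac{MS_R}{MS_E} = \frac{6408.175}{30.24} \approx 211.9$$

The critical value for this test, corresponding to a significance level of 0.1, is:

$$f_{\alpha,\, k,\, n-(k+1)} = f_{0.1,\, 2,\, 14} \approx 2.73$$

Since $F_0 > f_{0.1,\, 2,\, 14}$, $H_0: \beta_1 = \beta_2 = 0$ is rejected and it is concluded that at least one coefficient out of $\beta_1$ and $\beta_2$ is significant. In other words, it is concluded that a regression model exists between yield and either one or both of the factors in the table. The analysis of variance is summarized in the following table.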


ANOVA table for the significance of regression test.

Test on Individual Regression Coefficients (t Test)

The $t$ test is used to check the significance of individual regression coefficients in the multiple linear regression model. Adding a significant variable to a regression model makes the model more effective, while adding an unimportant variable may make the model worse. The hypothesis statements to test the significance of a particular regression coefficient, $\beta_j$, are:

$$H_0: \beta_j = 0$$
$$H_1: \beta_j \neq 0$$

The test statistic for this test is based on the $t$ distribution (and is similar to the one used in the case of simple linear regression models in Simple Linear Regression Analysis):

$$T_0 = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}$$

where the standard error, $se(\hat{\beta}_j)$, is obtained as the positive square root of the $j$th diagonal element of the variance-covariance matrix. The analyst would fail to reject the null hypothesis if the test statistic lies in the acceptance region:

$$-t_{\alpha/2,\, n-(k+1)} < T_0 < t_{\alpha/2,\, n-(k+1)}$$

This test measures the contribution of a variable while the remaining variables are included in the model. For the model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3$, if the test is carried out for $\beta_1$, then the test will check the significance of including the variable $x_1$ in the model that contains $x_2$ and $x_3$ (i.e., the model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3$). Hence the test is also referred to as a partial or marginal test. In DOE++, this test is displayed in the Regression Information table.
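A minimal sketch of the test on individual coefficients is shown below. The data are hypothetical (not the table from the example). The critical value 2.015 is the tabulated value of the t distribution for an upper tail area of 0.05 with 5 degrees of freedom, which corresponds to a significance level of 0.1 for this hypothetical data set only.

```python
import numpy as np

# Hypothetical data (not the table from the example): two predictors, eight runs.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([12.1, 11.8, 25.3, 24.1, 38.2, 36.9, 51.0, 50.2])
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape                      # p = k + 1, so n - p = 5 error dof here

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
ms_e = (e @ e) / (n - p)            # estimate of sigma^2

se = np.sqrt(np.diag(ms_e * XtX_inv))   # standard errors from C = ms_e (X'X)^{-1}
t0 = beta_hat / se                      # one test statistic per coefficient

# Tabulated t value for alpha/2 = 0.05 and 5 degrees of freedom (assumption:
# significance level 0.1; valid only for this hypothetical n and k).
t_crit = 2.015
reject_h0 = np.abs(t0) > t_crit

for j in range(p):
    print(f"beta_{j}: estimate={beta_hat[j]:.4f}, se={se[j]:.4f}, "
          f"T0={t0[j]:.3f}, reject H0={reject_h0[j]}")
```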

Example

The $t$ test to check the significance of the estimated regression coefficients for the data is illustrated in this example. The null hypothesis to test the coefficient $\beta_2$ is:

$$H_0: \beta_2 = 0$$

The null hypothesis to test $\beta_1$ can be obtained in a similar manner. To calculate the test statistic, $T_0$, we need to calculate the standard error. In the example, the value of the error mean square, $MS_E$, was obtained as 30.24. The error mean square is an estimate of the variance, $\sigma^2$.

Therefore:

$$\hat{\sigma}^2 = MS_E = 30.24$$

The variance-covariance matrix of the estimated regression coefficients is:

$$C = \hat{\sigma}^2 (X'X)^{-1}$$

From the diagonal elements of $C$, the estimated standard error for $\hat{\beta}_1$ and $\hat{\beta}_2$ is:



The corresponding test statistics for these coefficients are:



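The critical values for the present $t$ test at a significance of 0.1 are:

$$t_{\alpha/2,\, n-(k+1)} = t_{0.05,\, 14} = 1.761$$
$$-t_{\alpha/2,\, n-(k+1)} = -t_{0.05,\, 14} = -1.761$$

Considering $\hat{\beta}_2$, it can be seen that $T_0$ does not lie in the acceptance region of $-t_{0.05,\, 14} < T_0 < t_{0.05,\, 14}$. The null hypothesis, $H_0: \beta_2 = 0$, is rejected and it is concluded that $\beta_2$ is significant at $\alpha = 0.1$. This conclusion can also be arrived at using the $p$ value, noting that the hypothesis is two-sided. The $p$ value corresponding to the test statistic, $T_0$, based on the $t$ distribution with 14 degrees of freedom is:

$$p \text{ value} = 2 \times \left(1 - P(T \leq |T_0|)\right)$$

Since the $p$ value is less than the significance, $\alpha = 0.1$, it is concluded that $\beta_2$ is significant. The hypothesis test on $\beta_1$ can be carried out in a similar manner.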

As explained in Simple Linear Regression Analysis, in DOE++, the information related to the $t$ test is displayed in the Regression Information table as shown in the figure below.


Regression results for the data.


In this table, the $t$ test for $\beta_2$ is displayed in the row for the term Factor 2 because $\beta_2$ is the coefficient that represents this factor in the regression model. Columns labeled Standard Error, T Value and P Value represent the standard error, the test statistic for the $t$ test and the $p$ value for the $t$ test, respectively. These values have been calculated for $\beta_2$ in this example. The Coefficient column represents the estimate of regression coefficients. These values are calculated as shown in this example. The Effect column represents values obtained by multiplying the coefficients by a factor of 2. This value is useful in the case of two factor experiments and is explained in Two-Level Factorial Experiments. Columns labeled Low Confidence and High Confidence represent the limits of the confidence intervals for the regression coefficients and are explained in Confidence Intervals in Multiple Linear Regression. The Variance Inflation Factor column displays values that give a measure of multicollinearity. This is explained in Multicollinearity.

Test on Subsets of Regression Coefficients (Partial F Test)

This test can be considered to be the general form of the $t$ test mentioned in the previous section. This is because the test simultaneously checks the significance of including many (or even one) regression coefficients in the multiple linear regression model. Adding a variable to a model increases the regression sum of squares, $SS_R$. The test is based on this increase in the regression sum of squares. The increase in the regression sum of squares is called the extra sum of squares. Assume that the vector of the regression coefficients, $\boldsymbol{\beta}$, for the multiple linear regression model, $y = X\boldsymbol{\beta} + \epsilon$, is partitioned into two vectors with the second vector, $\boldsymbol{\beta}_2$, containing the last $r$ regression coefficients, and the first vector, $\boldsymbol{\beta}_1$, containing the first $(k+1-r)$ coefficients as follows:

$$\boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix}$$

with:

$$\boldsymbol{\beta}_1 = [\beta_0, \beta_1, \dots, \beta_{k-r}]' \quad \text{and} \quad \boldsymbol{\beta}_2 = [\beta_{k-r+1}, \beta_{k-r+2}, \dots, \beta_k]'$$

The hypothesis statements to test the significance of adding the regression coefficients in $\boldsymbol{\beta}_2$ to a model containing the regression coefficients in $\boldsymbol{\beta}_1$ may be written as:

$$H_0: \boldsymbol{\beta}_2 = 0$$
$$H_1: \boldsymbol{\beta}_2 \neq 0$$

The test statistic for this test follows the $F$ distribution and can be calculated as follows:

$$F_0 = \frac{SS_R(\boldsymbol{\beta}_2 | \boldsymbol{\beta}_1)/r}{MS_E}$$

where $SS_R(\boldsymbol{\beta}_2 | \boldsymbol{\beta}_1)$ is the increase in the regression sum of squares when the variables corresponding to the coefficients in $\boldsymbol{\beta}_2$ are added to a model already containing $\boldsymbol{\beta}_1$, and $MS_E$ is obtained from the equation given in Simple Linear Regression Analysis. The value of the extra sum of squares is obtained as explained in the next section.

The null hypothesis, $H_0$, is rejected if $F_0 > f_{\alpha,\, r,\, n-(k+1)}$. Rejection of $H_0$ leads to the conclusion that at least one of the variables in $x_{k-r+1}$, $x_{k-r+2}$, ... $x_k$ contributes significantly to the regression model. In DOE++, the results from the partial $F$ test are displayed in the ANOVA table.

ANOVA Table for Extra Sum of Squares in DOE++.

Types of Extra Sum of Squares

The extra sum of squares can be calculated using either the partial (or adjusted) sum of squares or the sequential sum of squares. The type of extra sum of squares used affects the calculation of the test statistic for the partial $F$ test described above. In DOE++, selection for the type of extra sum of squares is available as shown in the figure below. The partial sum of squares is used as the default setting. The reason for this is explained in the following section on the partial sum of squares.


Partial Sum of Squares

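The partial sum of squares for a term is the extra sum of squares when all terms, except the term under consideration, are included in the model. For example, consider the model:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \epsilon$$

The sum of squares of regression of this model is denoted by $SS_R(\beta_0, \beta_1, \beta_2, \beta_{12})$. Assume that we need to know the partial sum of squares for $\beta_2$. The partial sum of squares for $\beta_2$ is the increase in the regression sum of squares when $\beta_2$ is added to the model. This increase is the difference in the regression sum of squares for the full model of the equation given above and the model that includes all terms except $\beta_2$. These terms are $\beta_0$, $\beta_1$ and $\beta_{12}$. The model that contains these terms is:

$$Y = \beta_0 + \beta_1 x_1 + \beta_{12} x_1 x_2 + \epsilon$$

The sum of squares of regression of this model is denoted by $SS_R(\beta_0, \beta_1, \beta_{12})$. The partial sum of squares for $\beta_2$ can be represented as $SS_R(\beta_2 | \beta_0, \beta_1, \beta_{12})$ and is calculated as follows:

$$SS_R(\beta_2 | \beta_0, \beta_1, \beta_{12}) = SS_R(\beta_0, \beta_1, \beta_2, \beta_{12}) - SS_R(\beta_0, \beta_1, \beta_{12})$$

For the present case, $\boldsymbol{\beta}_2 = [\beta_2]'$ and $\boldsymbol{\beta}_1 = [\beta_0, \beta_1, \beta_{12}]'$. It can be noted that for the partial sum of squares $\boldsymbol{\beta}_1$ contains all coefficients other than the coefficient being tested.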

DOE++ has the partial sum of squares as the default selection. This is because the $t$ test is a partial test, i.e., the $t$ test on an individual coefficient is carried out by assuming that all the remaining coefficients are included in the model (similar to the way the partial sum of squares is calculated). The results from the $t$ test are displayed in the Regression Information table. The results from the partial $F$ test are displayed in the ANOVA table. To keep the results in the two tables consistent with each other, the partial sum of squares is used as the default selection for the results displayed in the ANOVA table. The partial sum of squares for all terms of a model may not add up to the regression sum of squares for the full model when the regression coefficients are correlated. If it is preferred that the extra sum of squares for all terms in the model always add up to the regression sum of squares for the full model then the sequential sum of squares should be used.
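The partial sum of squares for a single coefficient can be sketched by fitting the full model and the model without the term in question. The data and the helper `ss_reg` are hypothetical and introduced here only for illustration. For a single coefficient, the resulting statistic equals the square of the corresponding t statistic, which is why the two tables agree.

```python
import numpy as np

def ss_reg(X, y):
    """Regression sum of squares y'[H - (1/n)J]y for the design matrix X."""
    n = len(y)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return y @ (H - np.ones((n, n)) / n) @ y

# Hypothetical data (not the table from the example): two predictors, eight runs.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([12.1, 11.8, 25.3, 24.1, 38.2, 36.9, 51.0, 50.2])

X_full = np.column_stack([np.ones_like(x1), x1, x2])   # beta_0, beta_1, beta_2
X_red = X_full[:, [0, 2]]                              # drop the x1 column
n, p = X_full.shape

# Partial sum of squares for beta_1: SS_R(full) - SS_R(model without x1).
ss_partial = ss_reg(X_full, y) - ss_reg(X_red, y)

# Error mean square of the full model.
XtX_inv = np.linalg.inv(X_full.T @ X_full)
beta_hat = XtX_inv @ X_full.T @ y
e = y - X_full @ beta_hat
ms_e = (e @ e) / (n - p)

f0 = (ss_partial / 1) / ms_e                           # r = 1 coefficient tested
t0 = beta_hat[1] / np.sqrt(ms_e * XtX_inv[1, 1])       # t statistic for beta_1

print("partial SS:", ss_partial, "F0:", f0, "T0^2:", t0**2)
```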

Example

This example illustrates the partial $F$ test using the partial sum of squares. The test is conducted for the coefficient $\beta_1$ corresponding to the predictor variable $x_1$ for the data. The regression model used for this data set in the example is:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

The null hypothesis to test the significance of $\beta_1$ is:

$$H_0: \beta_1 = 0$$

The statistic to test this hypothesis is:

$$F_0 = \frac{SS_R(\beta_1 | \beta_0, \beta_2)/r}{MS_E}$$

where $SS_R(\beta_1 | \beta_0, \beta_2)$ represents the partial sum of squares for $\beta_1$, $r$ represents the number of degrees of freedom for $SS_R(\beta_1 | \beta_0, \beta_2)$ (which is one because there is just one coefficient, $\beta_1$, being tested) and $MS_E$ is the error mean square and has been calculated in the second example as 30.24.

The partial sum of squares for $\beta_1$ is the difference between the regression sum of squares for the full model, $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$, and the regression sum of squares for the model excluding $\beta_1$, $Y = \beta_0 + \beta_2 x_2 + \epsilon$. The regression sum of squares for the full model has been calculated in the second example as 12816.35. Therefore:

$$SS_R(\beta_0, \beta_1, \beta_2) = 12816.35$$

The regression sum of squares for the model $Y = \beta_0 + \beta_2 x_2 + \epsilon$ is obtained as shown next. First the design matrix for this model, $X_{\beta_0, \beta_2}$, is obtained by dropping the second column in the design matrix of the full model, $X$ (the full design matrix, $X$, was obtained in the example). The second column of $X$ corresponds to the coefficient $\beta_1$ which is no longer in the model. Therefore, the design matrix for the model, $Y = \beta_0 + \beta_2 x_2 + \epsilon$, is:



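The hat matrix corresponding to this design matrix is $H_{\beta_0, \beta_2}$. It can be calculated using $H_{\beta_0, \beta_2} = X_{\beta_0, \beta_2}(X_{\beta_0, \beta_2}' X_{\beta_0, \beta_2})^{-1} X_{\beta_0, \beta_2}'$. Once $H_{\beta_0, \beta_2}$ is known, the regression sum of squares for the model $Y = \beta_0 + \beta_2 x_2 + \epsilon$ can be calculated as:

$$SS_R(\beta_0, \beta_2) = y'\left[H_{\beta_0, \beta_2} - \frac{1}{n}J\right]y$$

Therefore, the partial sum of squares for $\beta_1$ is:

$$SS_R(\beta_1 | \beta_0, \beta_2) = SS_R(\beta_0, \beta_1, \beta_2) - SS_R(\beta_0, \beta_2) = 12816.35 - SS_R(\beta_0, \beta_2)$$

Knowing the partial sum of squares, the statistic to test the significance of $\beta_1$ is:

$$F_0 = \frac{SS_R(\beta_1 | \beta_0, \beta_2)/1}{MS_E} = \frac{SS_R(\beta_1 | \beta_0, \beta_2)}{30.24}$$

The $p$ value corresponding to this statistic based on the $F$ distribution with 1 degree of freedom in the numerator and 14 degrees of freedom in the denominator is:

$$p \text{ value} = 1 - P(F \leq F_0)$$

Assuming that the desired significance is 0.1, since $p$ value < 0.1, $H_0: \beta_1 = 0$ is rejected and it can be concluded that $\beta_1$ is significant. The test for $\beta_2$ can be carried out in a similar manner. In the results obtained from DOE++, the calculations for this test are displayed in the ANOVA table as shown in the following figure. Note that the conclusion obtained in this example can also be obtained using the $t$ test as explained in the example in Test on Individual Regression Coefficients (t Test). The ANOVA and Regression Information tables in DOE++ represent two different ways to test for the significance of the variables included in the multiple linear regression model.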

Sequential Sum of Squares

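The sequential sum of squares for a coefficient is the extra sum of squares when coefficients are added to the model in a sequence. For example, consider the model:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \beta_3 x_3 + \beta_{13} x_1 x_3 + \beta_{23} x_2 x_3 + \beta_{123} x_1 x_2 x_3 + \epsilon$$

The sequential sum of squares for $\beta_{13}$ is the increase in the sum of squares when $\beta_{13}$ is added to the model observing the sequence of the equation given above. Therefore this extra sum of squares can be obtained by taking the difference between the regression sum of squares for the model after $\beta_{13}$ was added and the regression sum of squares for the model before $\beta_{13}$ was added to the model. The model after $\beta_{13}$ is added is as follows:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \beta_3 x_3 + \beta_{13} x_1 x_3 + \epsilon$$

This is because to maintain the sequence all coefficients preceding $\beta_{13}$ must be included in the model. These are the coefficients $\beta_0$, $\beta_1$, $\beta_2$, $\beta_{12}$ and $\beta_3$. Similarly the model before $\beta_{13}$ is added must contain all coefficients of the equation given above except $\beta_{13}$. This model can be obtained as follows:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \beta_3 x_3 + \epsilon$$

The sequential sum of squares for $\beta_{13}$ can be calculated as follows:

$$SS_R(\beta_{13} | \beta_0, \beta_1, \beta_2, \beta_{12}, \beta_3) = SS_R(\beta_0, \beta_1, \beta_2, \beta_{12}, \beta_3, \beta_{13}) - SS_R(\beta_0, \beta_1, \beta_2, \beta_{12}, \beta_3)$$

For the present case, $\boldsymbol{\beta}_2 = [\beta_{13}]'$ and $\boldsymbol{\beta}_1 = [\beta_0, \beta_1, \beta_2, \beta_{12}, \beta_3]'$. It can be noted that for the sequential sum of squares $\boldsymbol{\beta}_1$ contains all coefficients preceding the coefficient being tested.

The sequential sums of squares for all terms will add up to the regression sum of squares for the full model, but the sequential sums of squares are order dependent.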
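The order dependence and the additivity of the sequential sum of squares can be sketched as follows. The data and the helpers `ss_reg` and `sequential_ss` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def ss_reg(X, y):
    """Regression sum of squares y'[H - (1/n)J]y for the design matrix X."""
    n = len(y)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return y @ (H - np.ones((n, n)) / n) @ y

def sequential_ss(columns, order, y):
    """Extra sum of squares as each named column is added in the given order."""
    X_cur = np.ones((len(y), 1))          # start from the intercept-only model
    ss_prev = ss_reg(X_cur, y)            # zero: no variables in the model yet
    result = {}
    for name in order:
        X_cur = np.column_stack([X_cur, columns[name]])
        ss_cur = ss_reg(X_cur, y)
        result[name] = ss_cur - ss_prev   # increase due to this term
        ss_prev = ss_cur
    return result

# Hypothetical data (not the table from the example): two predictors, eight runs.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([12.1, 11.8, 25.3, 24.1, 38.2, 36.9, 51.0, 50.2])
columns = {"x1": x1, "x2": x2}

seq_a = sequential_ss(columns, ["x1", "x2"], y)   # x1 entered first
seq_b = sequential_ss(columns, ["x2", "x1"], y)   # x2 entered first
ss_full = ss_reg(np.column_stack([np.ones_like(x1), x1, x2]), y)

print("order x1, x2:", seq_a)
print("order x2, x1:", seq_b)
print("full model SS_R:", ss_full)
```

Both orderings should sum to the regression sum of squares of the full model, while the value attributed to each individual term generally differs between them.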

Example

This example illustrates the partial $F$ test using the sequential sum of squares. The test is conducted for the coefficient $\beta_1$ corresponding to the predictor variable $x_1$ for the data. The regression model used for this data set in the example is:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

The null hypothesis to test the significance of $\beta_1$ is:

$$H_0: \beta_1 = 0$$

The statistic to test this hypothesis is:

$$F_0 = \frac{SS_R(\beta_1 | \beta_0)/r}{MS_E}$$

where $SS_R(\beta_1 | \beta_0)$ represents the sequential sum of squares for $\beta_1$, $r$ represents the number of degrees of freedom for $SS_R(\beta_1 | \beta_0)$ (which is one because there is just one coefficient, $\beta_1$, being tested) and $MS_E$ is the error mean square and has been calculated in the second example as 30.24.

The sequential sum of squares for $\beta_1$ is the difference between the regression sum of squares for the model after adding $\beta_1$, $Y = \beta_0 + \beta_1 x_1 + \epsilon$, and the regression sum of squares for the model before adding $\beta_1$, $Y = \beta_0 + \epsilon$. The regression sum of squares for the model $Y = \beta_0 + \beta_1 x_1 + \epsilon$ is obtained as shown next. First the design matrix for this model, $X_{\beta_0, \beta_1}$, is obtained by dropping the third column in the design matrix for the full model, $X$ (the full design matrix, $X$, was obtained in the example). The third column of $X$ corresponds to coefficient $\beta_2$ which is no longer used in the present model. Therefore, the design matrix for the model, $Y = \beta_0 + \beta_1 x_1 + \epsilon$, is:



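The hat matrix corresponding to this design matrix is $H_{\beta_0, \beta_1}$. It can be calculated using $H_{\beta_0, \beta_1} = X_{\beta_0, \beta_1}(X_{\beta_0, \beta_1}' X_{\beta_0, \beta_1})^{-1} X_{\beta_0, \beta_1}'$. Once $H_{\beta_0, \beta_1}$ is known, the regression sum of squares for the model $Y = \beta_0 + \beta_1 x_1 + \epsilon$ can be calculated as:

$$SS_R(\beta_0, \beta_1) = y'\left[H_{\beta_0, \beta_1} - \frac{1}{n}J\right]y$$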

Sequential sum of squares for the data.


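The regression sum of squares for the model $Y = \beta_0 + \epsilon$ is equal to zero since this model does not contain any variables. Therefore:

$$SS_R(\beta_0) = 0$$

The sequential sum of squares for $\beta_1$ is:

$$SS_R(\beta_1 | \beta_0) = SS_R(\beta_0, \beta_1) - SS_R(\beta_0) = SS_R(\beta_0, \beta_1)$$

Knowing the sequential sum of squares, the statistic to test the significance of $\beta_1$ is:

$$F_0 = \frac{SS_R(\beta_1 | \beta_0)/1}{MS_E} = \frac{SS_R(\beta_1 | \beta_0)}{30.24}$$

The $p$ value corresponding to this statistic based on the $F$ distribution with 1 degree of freedom in the numerator and 14 degrees of freedom in the denominator is:

$$p \text{ value} = 1 - P(F \leq F_0)$$

Assuming that the desired significance is 0.1, since $p$ value < 0.1, $H_0: \beta_1 = 0$ is rejected and it can be concluded that $\beta_1$ is significant. The test for $\beta_2$ can be carried out in a similar manner. This result is shown in the following figure.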

Confidence Intervals in Multiple Linear Regression

The calculation of confidence intervals for multiple linear regression models is similar to that for simple linear regression models, explained in Simple Linear Regression Analysis.

Confidence Interval on Regression Coefficients

A $100(1-\alpha)$ percent confidence interval on the regression coefficient, $\beta_j$, is obtained as follows:

$$\hat{\beta}_j \pm t_{\alpha/2,\, n-(k+1)} \sqrt{C_{jj}}$$

The confidence intervals on the regression coefficients are displayed in the Regression Information table under the Low Confidence and High Confidence columns as shown in the following figure.


Confidence interval for the fitted value corresponding to the fifth observation.


Confidence Interval on Fitted Values

A $100(1-\alpha)$ percent confidence interval on any fitted value, $\hat{y}_i$, is given by:

$$\hat{y}_i \pm t_{\alpha/2,\, n-(k+1)} \sqrt{\hat{\sigma}^2\, x_i'(X'X)^{-1}x_i}$$

where:

$$x_i = [1, x_{i1}, x_{i2}, \dots, x_{ik}]'$$

In the above example, the fitted value corresponding to the fifth observation, $\hat{y}_5$, was calculated. The 90% confidence interval on this value can be obtained as shown in the figure below. The values of 47.3 and 29.9 used in the figure are the values of the predictor variables corresponding to the fifth observation in the table.


Confidence Interval on New Observations

As explained in Simple Linear Regression Analysis, the confidence interval on a new observation is also referred to as the prediction interval. The prediction interval takes into account both the error from the fitted model and the error associated with future observations. A $100(1-\alpha)$ percent confidence interval on a new observation, $\hat{y}_p$, is obtained as follows:

$$\hat{y}_p \pm t_{\alpha/2,\, n-(k+1)} \sqrt{\hat{\sigma}^2 \left(1 + x_p'(X'X)^{-1}x_p\right)}$$

where:

$$x_p = [1, x_{p1}, x_{p2}, \dots, x_{pk}]'$$

$x_{p1}$, ..., $x_{pk}$ are the levels of the predictor variables at which the new observation, $\hat{y}_p$, needs to be obtained.
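The three intervals of this section can be sketched together. The data and the new point `x_p` are hypothetical, and the critical value 2.015 is the tabulated t value for an upper tail area of 0.05 with 5 degrees of freedom, so the intervals below are 90% intervals for this hypothetical data set only.

```python
import numpy as np

# Hypothetical data (not the table from the example): two predictors, eight runs.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([12.1, 11.8, 25.3, 24.1, 38.2, 36.9, 51.0, 50.2])
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape                              # n - p = 5 error degrees of freedom

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
ms_e = (e @ e) / (n - p)                    # estimate of sigma^2
t_crit = 2.015                              # tabulated t for 0.05 tail area, 5 dof

# Interval on each regression coefficient: beta_hat_j +/- t * sqrt(C_jj).
half_beta = t_crit * np.sqrt(np.diag(ms_e * XtX_inv))

# Interval on the fitted value for the fifth observation.
x_i = X[4]
y_hat_i = x_i @ beta_hat
half_ci = t_crit * np.sqrt(ms_e * (x_i @ XtX_inv @ x_i))

# Prediction interval at a new point chosen inside the region covered by the data.
x_p = np.array([1.0, 4.5, 4.0])
y_hat_p = x_p @ beta_hat
half_pi = t_crit * np.sqrt(ms_e * (1.0 + x_p @ XtX_inv @ x_p))

print("coefficients:", beta_hat, "+/-", half_beta)
print("fitted value:", y_hat_i, "+/-", half_ci)
print("new observation:", y_hat_p, "+/-", half_pi)
```

The extra 1 under the square root is what makes the prediction interval wider than a confidence interval on the mean response at the same point.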


In multiple linear regression, prediction intervals should only be obtained at the levels of the predictor variables where the regression model applies. In the case of multiple linear regression it is easy to miss this. Having values lying within the range of the predictor variables does not necessarily mean that the new observation lies in the region to which the model is applicable. For example, consider the next figure where the shaded area shows the region to which a two variable regression model is applicable. The point corresponding to the $p$th level of the first predictor variable, $x_1$, and the $p$th level of the second predictor variable, $x_2$, does not lie in the shaded area, although both of these levels are within the range of the first and second predictor variables respectively. In this case, the regression model is not applicable at this point.


Predicted values and region of model application in multiple linear regression.

Measures of Model Adequacy

As in the case of simple linear regression, analysis of a fitted multiple linear regression model is important before inferences based on the model are undertaken. This section presents some techniques that can be used to check the appropriateness of the multiple linear regression model.

Coefficient of Multiple Determination, $R^2$

The coefficient of multiple determination is similar to the coefficient of determination used in the case of simple linear regression. It is defined as:

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$$

$R^2$ indicates the amount of total variability explained by the regression model. The positive square root of $R^2$ is called the multiple correlation coefficient and measures the linear association between $Y$ and the predictor variables, $x_1$, $x_2$, ... $x_k$.

The value of $R^2$ increases as more terms are added to the model, even if the new term does not contribute significantly to the model. An increase in the value of $R^2$ cannot be taken as a sign to conclude that the new model is superior to the older model. A better statistic to use is the adjusted $R^2$ statistic defined as follows:

$$R^2_{adj} = 1 - \frac{MS_E}{MS_T} = 1 - \frac{SS_E/(n-(k+1))}{SS_T/(n-1)}$$

The adjusted $R^2$ only increases when significant terms are added to the model. Addition of unimportant terms may lead to a decrease in the value of $R^2_{adj}$.

In DOE++, $R^2$ and $R^2_{adj}$ values are displayed as R-sq and R-sq(adj), respectively. Other values displayed along with these values are S, PRESS and R-sq(pred). As explained in Simple Linear Regression Analysis, the value of S is the square root of the error mean square, $MS_E$, and represents the "standard error of the model."

PRESS is an abbreviation for prediction error sum of squares. It is the error sum of squares calculated using the PRESS residuals, $e_{(i)}$, in place of the residuals, $e_i$, in the equation for the error sum of squares. The PRESS residual, $e_{(i)}$, for a particular observation, $y_i$, is obtained by fitting the regression model to the remaining $n-1$ observations. Then the value for a new observation, $\hat{y}_p$, corresponding to the observation in question, $y_i$, is obtained based on the new regression model. The difference between $y_i$ and $\hat{y}_p$ gives $e_{(i)}$. The PRESS residual, $e_{(i)}$, can also be obtained using $h_{ii}$, the $i$th diagonal element of the hat matrix, $H$, as follows:

$$e_{(i)} = \frac{e_i}{1-h_{ii}}$$

R-sq(pred), also referred to as prediction $R^2$, is obtained using PRESS as shown next:

$$R^2_{pred} = 1 - \frac{PRESS}{SS_T}$$

The values of R-sq, R-sq(adj) and S are indicators of how well the regression model fits the observed data. The values of PRESS and R-sq(pred) are indicators of how well the regression model predicts new observations. For example, higher values of PRESS or lower values of R-sq(pred) indicate a model that predicts poorly. The figure below shows these values for the data. The values indicate that the regression model fits the data well and also predicts well.
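The quantities discussed in this section can be computed directly from the hat matrix. The following is a minimal sketch in Python with NumPy, not the DOE++ implementation. The function name `fit_statistics` and the data are hypothetical: the data are synthetic and are not the data set used in this chapter.

```python
import numpy as np

def fit_statistics(X, y):
    """Return R-sq, R-sq(adj), S, PRESS and R-sq(pred) for a least squares fit.

    X is the n x (k+1) design matrix (first column of ones); y has length n.
    """
    n, p = X.shape                          # p = k + 1 parameters
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
    h = np.diag(H)                          # diagonal elements h_ii
    e = y - H @ y                           # ordinary residuals e_i
    ss_e = float(e @ e)
    ss_t = float(np.sum((y - y.mean()) ** 2))
    press = float(np.sum((e / (1.0 - h)) ** 2))   # sum of squared PRESS residuals
    return {
        "R-sq": 1.0 - ss_e / ss_t,
        "R-sq(adj)": 1.0 - (ss_e / (n - p)) / (ss_t / (n - 1)),
        "S": float(np.sqrt(ss_e / (n - p))),
        "PRESS": press,
        "R-sq(pred)": 1.0 - press / ss_t,
    }

# Synthetic data for illustration only (assumed values, two predictors).
rng = np.random.default_rng(0)
x1 = rng.uniform(30, 50, size=17)
x2 = rng.uniform(25, 45, size=17)
y = 10 + 2.0 * x1 + 1.5 * x2 + rng.normal(0, 3, size=17)
X = np.column_stack([np.ones(17), x1, x2])

stats = fit_statistics(X, y)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Because each PRESS residual is at least as large in magnitude as the corresponding ordinary residual, R-sq(pred) computed this way is never larger than R-sq.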

Coefficient of multiple determination and related results for the data.

Residual Analysis

Plots of residuals, $e_i$, similar to the ones discussed in Simple Linear Regression Analysis for simple linear regression, are used to check the adequacy of a fitted multiple linear regression model. The residuals are expected to be normally distributed with a mean of zero and a constant variance of $\sigma^2$. In addition, they should not show any patterns or trends when plotted against any variable or in a time or run-order sequence. Residual plots may also be obtained using standardized and studentized residuals. Standardized residuals, $d_i$, are obtained using the following equation:

$$d_i = \frac{e_i}{\sqrt{\hat{\sigma}^2}} = \frac{e_i}{\sqrt{MS_E}}$$

Standardized residuals are scaled so that the standard deviation of the residuals is approximately equal to one. This helps to identify possible outliers or unusual observations. However, standardized residuals may understate the true residual magnitude, hence studentized residuals, $r_i$, are used in their place. Studentized residuals are calculated as follows:

$$r_i = \frac{e_i}{\sqrt{\hat{\sigma}^2(1-h_{ii})}} = \frac{e_i}{\sqrt{MS_E(1-h_{ii})}}$$

where $h_{ii}$ is the $i$th diagonal element of the hat matrix, $H$. External studentized (or the studentized deleted) residuals may also be used. These residuals are based on the PRESS residuals mentioned in Coefficient of Multiple Determination, $R^2$. The reason for using the external studentized residuals is that if the $i$th observation is an outlier, it may influence the fitted model. In this case, the residual $e_i$ will be small and may not disclose that the $i$th observation is an outlier. The external studentized residual for the $i$th observation, $t_i$, is obtained as follows:

$$t_i = e_i\left[\frac{n-k-2}{SS_E(1-h_{ii})-e_i^2}\right]^{0.5}$$

Residual values for the data are shown in the figure below. Standardized residual plots for the data are shown in the next two figures. DOE++ compares the residual values to the critical values on the $t$ distribution for studentized and external studentized residuals.


Residual values for the data.


Residual probability plot for the data.


For other residuals the normal distribution is used. For example, for the data, the critical values on the $t$ distribution at a significance of 0.1 are $t_{0.05,\,n-(k+1)}$ and $-t_{0.05,\,n-(k+1)}$ (as calculated in the example, Test on Individual Regression Coefficients (t Test)). The studentized residual values corresponding to the 3rd and 17th observations lie outside the critical values. Therefore, the 3rd and 17th observations are outliers. This can also be seen on the residual plots in the next two figures.

Residual versus fitted values plot for the data.


Residual versus run order plot for the data.
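The three scaled residuals described in this section differ only in how the ordinary residual is divided. The sketch below, in Python with NumPy, computes all of them from one fit. The function name `residual_diagnostics` and the data are hypothetical: the data are synthetic, with one response deliberately shifted, and are not the data set shown in the figures.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Return ordinary, standardized, studentized and external studentized residuals.

    X is the n x (k+1) design matrix (first column of ones); y has length n.
    """
    n, p = X.shape                          # p = k + 1
    k = p - 1
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
    h = np.diag(H)                          # diagonal elements h_ii
    e = y - H @ y                           # ordinary residuals e_i
    ss_e = float(e @ e)
    ms_e = ss_e / (n - p)
    d = e / np.sqrt(ms_e)                   # standardized residuals d_i
    r = e / np.sqrt(ms_e * (1.0 - h))       # studentized residuals r_i
    # external studentized residuals t_i
    t = e * np.sqrt((n - k - 2) / (ss_e * (1.0 - h) - e ** 2))
    return e, d, r, t

# Synthetic data for illustration only (assumed values).
rng = np.random.default_rng(1)
n = 17
x1 = rng.uniform(30, 50, size=n)
x2 = rng.uniform(25, 45, size=n)
y = 10 + 2.0 * x1 + 1.5 * x2 + rng.normal(0, 2, size=n)
y[2] += 20.0                                # shift the 3rd response to mimic an outlier
X = np.column_stack([np.ones(n), x1, x2])

e, d, r, t = residual_diagnostics(X, y)
for i in range(n):
    print(f"{i + 1:2d}  e={e[i]:8.3f}  d={d[i]:7.3f}  r={r[i]:7.3f}  t={t[i]:7.3f}")
```

The external studentized residual obtained from the closed-form expression equals the value that would be obtained by refitting the model without the observation in question, so no refitting loop is needed.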

Outlying x Observations

Residuals help to identify outlying $y$ observations. Outlying $x$ observations can be detected using leverage. Leverage values are the diagonal elements of the hat matrix, $h_{ii}$. The $h_{ii}$ values always lie between 0 and 1. Values of $h_{ii}$ greater than $2(k+1)/n$ are considered to be indicators of outlying $x$ observations.
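As an illustration of the leverage rule, the sketch below computes the diagonal of the hat matrix and flags the rows that exceed $2(k+1)/n$. It is written in Python with NumPy. The function name `outlying_x` and the design are hypothetical: nine points on a small grid plus one point placed far from the rest.

```python
import numpy as np

def outlying_x(X):
    """Return the indices whose leverage exceeds 2(k+1)/n, and all leverages.

    X is the n x (k+1) design matrix (first column of ones).
    """
    n, p = X.shape                          # p = k + 1
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
    h = np.diag(H)                          # leverage values h_ii
    return np.flatnonzero(h > 2.0 * p / n), h

# Hypothetical design: a 3 x 3 grid and one remote point at (10, 10).
x1 = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 10], dtype=float)
x2 = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 10], dtype=float)
X = np.column_stack([np.ones(x1.size), x1, x2])

flagged, h = outlying_x(X)
print("threshold:", 2.0 * X.shape[1] / X.shape[0])
print("leverages:", np.round(h, 4))
print("flagged rows (0-based):", flagged)
```

Note that the check uses only the design matrix. The response values play no part in deciding whether an observation is outlying in $x$.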

Influential Observations Detection

Once an outlier is identified, it is important to determine if the outlier has a significant effect on the regression model. One measure to detect influential observations is Cook's distance measure, which is computed as follows:

$$D_i = \frac{r_i^2}{(k+1)}\left[\frac{h_{ii}}{(1-h_{ii})}\right]$$

To use Cook's distance measure, the $D_i$ values are compared to percentile values on the $F$ distribution with $(k+1,\, n-(k+1))$ degrees of freedom. If the percentile value is less than 10 or 20 percent, then the $i$th case has little influence on the fitted values. However, if the percentile value is close to 50 percent or greater, the $i$th case is influential, and fitted values with and without the $i$th case will differ substantially.


Example

Cook's distance measure can be calculated as shown next. The distance measure is calculated for the first observation of the data. The remaining values along with the leverage values are shown in the figure below (displaying Leverage and Cook's distance measure for the data).


Leverage and Cook's distance measure for the data.


The studentized residual corresponding to the first observation, $r_1$, is:



Cook's distance measure for the first observation can now be calculated as:



The 50th percentile value for $F_{k+1,\,n-(k+1)}$ is 0.83. Since all $D_i$ values are less than this value, there are no influential observations.
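The calculation in this example can be repeated for every observation at once. The following is a minimal sketch in Python with NumPy that uses the studentized residuals and the leverage values. The function name `cooks_distance` and the data are hypothetical: the data are synthetic and are not the data used in this example.

```python
import numpy as np

def cooks_distance(X, y):
    """Return Cook's distance D_i and leverage h_ii for every observation.

    X is the n x (k+1) design matrix (first column of ones); y has length n.
    """
    n, p = X.shape                          # p = k + 1
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
    h = np.diag(H)
    e = y - H @ y
    ms_e = float(e @ e) / (n - p)
    r = e / np.sqrt(ms_e * (1.0 - h))       # studentized residuals r_i
    D = (r ** 2 / p) * (h / (1.0 - h))      # Cook's distance measure
    return D, h

# Synthetic data for illustration only (assumed values).
x1 = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 10], dtype=float)
x2 = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 10], dtype=float)
rng = np.random.default_rng(4)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 0.5, size=x1.size)
X = np.column_stack([np.ones(x1.size), x1, x2])

D, h = cooks_distance(X, y)
for i in range(len(y)):
    print(f"{i + 1:2d}  h={h[i]:.4f}  D={D[i]:.4f}")
```

To turn each $D_i$ into a percentile, the cumulative distribution function of the $F$ distribution is needed. Assuming SciPy is available, `scipy.stats.f.cdf(D, p, n - p)` with `p = k + 1` gives these percentiles. It is left out of the sketch to keep the dependencies to NumPy.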

Lack-of-Fit Test

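The lack-of-fit test for simple linear regression discussed in Simple Linear Regression Analysis may also be applied to multiple linear regression to check the appropriateness of the fitted response surface and see if a higher order model is required. Data for $m$ replicates may be collected as follows for all $n$ levels of the predictor variables:

$$\begin{array}{ll}
y_{11}, y_{12}, \ldots, y_{1m} & m \text{ repeated observations at the first level} \\
y_{21}, y_{22}, \ldots, y_{2m} & m \text{ repeated observations at the second level} \\
\vdots & \\
y_{i1}, y_{i2}, \ldots, y_{im} & m \text{ repeated observations at the } i\text{th level} \\
\vdots & \\
y_{n1}, y_{n2}, \ldots, y_{nm} & m \text{ repeated observations at the } n\text{th level}
\end{array}$$

The sum of squares due to pure error, $SS_{PE}$, can be obtained as discussed in Simple Linear Regression Analysis as:

$$SS_{PE} = \sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij}-\bar{y}_i)^2$$

The number of degrees of freedom associated with $SS_{PE}$ is:

$$dof(SS_{PE}) = nm - n$$

Knowing $SS_{PE}$, the sum of squares due to lack-of-fit, $SS_{LOF}$, can be obtained as:

$$SS_{LOF} = SS_E - SS_{PE}$$

The number of degrees of freedom associated with $SS_{LOF}$ is:

$$dof(SS_{LOF}) = dof(SS_E) - dof(SS_{PE}) = n - (k+1)$$

The test statistic for the lack-of-fit test is:

$$F_0 = \frac{SS_{LOF}/dof(SS_{LOF})}{SS_{PE}/dof(SS_{PE})} = \frac{MS_{LOF}}{MS_{PE}}$$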

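The lack-of-fit calculation can be sketched as follows in Python with NumPy, assuming the same number of replicates at every level of the predictor variables. The function name `lack_of_fit` and the data are hypothetical. The synthetic responses are generated with a quadratic term that the fitted first order model omits, so that some lack of fit is present.

```python
import numpy as np

def lack_of_fit(X_levels, Y):
    """Lack-of-fit sums of squares and test statistic.

    X_levels is the n x (k+1) design matrix with one row per level
    (first column of ones); Y is n x m, with row i holding the m
    replicate responses observed at level i.
    """
    n, m = Y.shape
    p = X_levels.shape[1]                   # p = k + 1
    X = np.repeat(X_levels, m, axis=0)      # each level row repeated m times
    y = Y.reshape(-1)                       # replicates of level 1, then level 2, ...
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    ss_e = float(np.sum((y - X @ beta) ** 2))
    ss_pe = float(np.sum((Y - Y.mean(axis=1, keepdims=True)) ** 2))
    ss_lof = ss_e - ss_pe
    dof_pe = n * m - n
    dof_lof = n - p
    f0 = (ss_lof / dof_lof) / (ss_pe / dof_pe)
    return {"beta": beta, "SS_E": ss_e, "SS_PE": ss_pe, "SS_LOF": ss_lof,
            "dof_PE": dof_pe, "dof_LOF": dof_lof, "F0": f0}

# Synthetic data for illustration only: 6 levels, 3 replicates each.
x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5], dtype=float)
X_levels = np.column_stack([np.ones(6), x1, x2])
rng = np.random.default_rng(2)
mean_response = 5 + 2 * x1 + x2 + 0.8 * x1 ** 2      # curvature not in the model
Y = mean_response[:, None] + rng.normal(0, 0.5, size=(6, 3))

result = lack_of_fit(X_levels, Y)
for name in ("SS_E", "SS_PE", "SS_LOF", "dof_PE", "dof_LOF", "F0"):
    print(name, result[name])
```

The resulting statistic would then be compared with the $F$ distribution having the two degrees of freedom returned by the function. A large value suggests that a higher order model should be considered.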
Other Topics in Multiple Linear Regression

Polynomial Regression Models

Polynomial regression models are used when the response is curvilinear. The equation shown next presents a second order polynomial regression model with one predictor variable:

$$Y = \beta_0 + \beta_1 x_1 + \beta_{11} x_1^2 + \epsilon$$

Usually, coded values are used in these models. Values of the variables are coded in two steps: centering, which expresses the levels of the variable as deviations from the mean value of the variable, and scaling, which divides the deviations obtained by half of the range of the variable.

$$\text{coded value} = \frac{\text{actual value} - \text{mean}}{\text{half of range}}$$

The reason for using coded predictor variables is that $x_1$ and $x_1^2$ are often highly correlated and, if uncoded values are used, there may be computational difficulties while calculating the $(X'X)^{-1}$ matrix to obtain the estimates, $\hat{\beta}$, of the regression coefficients using the equation for the distribution given in Statistics Background on DOE.
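The effect of coding on the correlation between a predictor and its square can be seen in a short numerical sketch, written in Python with NumPy. The function name `code_values` and the five equally spaced levels are hypothetical values chosen for illustration.

```python
import numpy as np

def code_values(x):
    """Center on the mean, then scale by half of the range."""
    x = np.asarray(x, dtype=float)
    half_range = (x.max() - x.min()) / 2.0
    return (x - x.mean()) / half_range

# Hypothetical uncoded levels of a single predictor variable.
x = np.array([100.0, 110.0, 120.0, 130.0, 140.0])
coded = code_values(x)

# Correlation between the linear and quadratic terms, before and after coding.
corr_uncoded = np.corrcoef(x, x ** 2)[0, 1]
corr_coded = np.corrcoef(coded, coded ** 2)[0, 1]

print("coded levels:", coded)
print("corr(x, x^2) uncoded:", corr_uncoded)
print("corr(x, x^2) coded:  ", corr_coded)
```

For levels that are symmetric about their mean, as in this sketch, the coded linear and quadratic terms are uncorrelated. For asymmetric levels the correlation is reduced but generally not removed.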

Qualitative Factors

The multiple linear regression model also supports the use of qualitative factors. For example, gender may need to be included as a factor in a regression model. One of the ways to include qualitative factors in a regression model is to employ indicator variables. Indicator variables take on values of 0 or 1. For example, an indicator variable, $x_1$, may be used with a value of 1 to indicate female and a value of 0 to indicate male.

$$x_1 = \begin{cases} 1 & \text{Female} \\ 0 & \text{Male} \end{cases}$$

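In general ($n-1$) indicator variables are required to represent a qualitative factor with $n$ levels. As an example, a qualitative factor representing three types of machines may be represented as follows using two indicator variables:

$$\begin{array}{lll}
x_1 = 1, & x_2 = 0 & \text{Machine Type I} \\
x_1 = 0, & x_2 = 1 & \text{Machine Type II} \\
x_1 = 0, & x_2 = 0 & \text{Machine Type III}
\end{array}$$

An alternative coding scheme for this example is to use a value of -1 for all indicator variables when representing the last level of the factor:

$$\begin{array}{lll}
x_1 = 1, & x_2 = 0 & \text{Machine Type I} \\
x_1 = 0, & x_2 = 1 & \text{Machine Type II} \\
x_1 = -1, & x_2 = -1 & \text{Machine Type III}
\end{array}$$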

Indicator variables are also referred to as dummy variables or binary variables.
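Both coding schemes for a factor with three machine types can be generated with a small helper. The sketch below is in Python with NumPy. The function name `indicator_columns`, its `last_level_code` parameter, and the level labels "I", "II" and "III" are hypothetical names used only for illustration.

```python
import numpy as np

def indicator_columns(values, levels, last_level_code=0):
    """Encode a qualitative factor with len(levels) levels in len(levels) - 1 columns.

    Level j (for j < len(levels) - 1) sets column j to 1 and the others to 0.
    The last level is coded as last_level_code in every column: 0 gives the
    0/1 scheme and -1 gives the alternative scheme.
    """
    n_levels = len(levels)
    out = np.zeros((len(values), n_levels - 1))
    for row, value in enumerate(values):
        j = levels.index(value)             # raises ValueError for an unknown level
        if j < n_levels - 1:
            out[row, j] = 1.0
        else:
            out[row, :] = last_level_code
    return out

levels = ["I", "II", "III"]                 # three machine types
machines = ["I", "II", "III", "I", "III"]   # hypothetical observations

zero_one = indicator_columns(machines, levels)
minus_one = indicator_columns(machines, levels, last_level_code=-1)
print(zero_one)
print(minus_one)
```

The returned columns can be appended to the design matrix next to the column of ones and the quantitative predictors.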

Example

Consider data from two types of reactors of a chemical process, shown in the table below, where the yield values are recorded for various levels of factor $x_1$. Assuming there are no interactions between the reactor type and $x_1$, a regression model can be fitted to this data as shown next.

Since the reactor type is a qualitative factor with two levels, it can be represented by using one indicator variable. Let $x_2$ be the indicator variable representing the reactor type, with 0 representing the first type of reactor and 1 representing the second type of reactor.



Yield data from the two types of reactors for a chemical process.


Data entry in DOE++ for this example is shown in the figure after the table below. The regression model for this data is:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

The $X$ and $y$ matrices for the given data are:


Data from the table above as entered in DOE++.


The estimated regression coefficients for the model can be obtained as:

$$\hat{\beta} = (X'X)^{-1}X'y$$

Therefore, the fitted regression model is:



Note that since $x_2$ represents a qualitative predictor variable, the fitted regression model cannot be plotted simultaneously against $x_1$ and $x_2$ in a two-dimensional space (because the resulting surface plot will be meaningless for the dimension in $x_2$). To illustrate this, a scatter plot of the data against $x_2$ is shown in the following figure.


Scatter plot of the observed yield values against $x_2$ (reactor type)


It can be noted that, in the case of qualitative factors, the nature of the relationship between the response (yield) and the qualitative factor (reactor type) cannot be categorized as linear, or quadratic, or cubic, etc. The only conclusion that can be drawn for these factors is whether they contribute significantly to the regression model. This can be done by employing the partial $F$ test discussed in Multiple Linear Regression Analysis (using the extra sum of squares of the indicator variables representing these factors). The results of the test for the present example are shown in the ANOVA table. The results show that $x_2$ (reactor type) contributes significantly to the fitted regression model.


DOE++ results for the data.
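The steps of this example can be sketched as follows: fit the model with and without the indicator variable, and form the partial $F$ statistic from the extra sum of squares. The sketch is in Python with NumPy. The helper name `fit_sse` and the yield values are hypothetical: the values are synthetic and are not the data in the table, so the numbers printed will not match the DOE++ results.

```python
import numpy as np

def fit_sse(X, y):
    """Least squares fit; returns the error sum of squares and the coefficients."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    res = y - X @ beta
    return float(res @ res), beta

# Synthetic yield data for illustration only: six levels of x1 in each reactor.
x1_levels = np.array([100.0, 110.0, 120.0, 130.0, 140.0, 150.0])
x1 = np.tile(x1_levels, 2)                       # same levels for both reactors
x2 = np.repeat([0.0, 1.0], x1_levels.size)       # 0 = first reactor, 1 = second
rng = np.random.default_rng(3)
y = 50 + 0.5 * x1 + 8.0 * x2 + rng.normal(0, 1.5, size=x1.size)

ones = np.ones(x1.size)
X_full = np.column_stack([ones, x1, x2])         # Y = b0 + b1*x1 + b2*x2 + error
X_reduced = np.column_stack([ones, x1])          # model without the indicator

sse_full, beta = fit_sse(X_full, y)
sse_reduced, _ = fit_sse(X_reduced, y)

n, p = X_full.shape
ms_e = sse_full / (n - p)
extra_ss = sse_reduced - sse_full                # extra sum of squares due to x2
f0 = (extra_ss / 1.0) / ms_e                     # one indicator variable tested

print("estimated coefficients:", beta)
print("extra sum of squares for x2:", extra_ss)
print("partial F statistic:", f0)
```

With no interaction term, the coefficient of the indicator variable is the vertical shift between two parallel lines, one for each reactor type. Because only one indicator variable is tested here, the partial $F$ statistic equals the square of the $t$ statistic for that coefficient.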


Multicollinearity

At times the predictor variables included in a multiple linear regression model may be found to be dependent on each other. Multicollinearity is said to exist in a multiple regression model with strong dependencies between the predictor variables. Multicollinearity affects the regression coefficients and the extra sum of squares of the predictor variables. In a model with multicollinearity, the estimate of the regression coefficient of a predictor variable depends on what other predictor variables are included in the model. The dependence may even lead to a change in the sign of the regression coefficient. In such models, an estimated regression coefficient may not be found to be significant individually (when using the $t$ test on the individual coefficient or looking at the $p$ value) even though a statistical relation is found to exist between the response variable and the set of the predictor variables (when using the $F$ test for the set of predictor variables). Therefore, you should be careful while looking at individual predictor variables in models that have multicollinearity. Care should also be taken while looking at the extra sum of squares for a predictor variable that is correlated with other variables. This is because in models with multicollinearity the extra sum of squares is not unique and depends on the other predictor variables included in the model.


Multicollinearity can be detected using the variance inflation factor (abbreviated $VIF$). $VIF$ for a coefficient $\beta_j$ is defined as:

$$VIF = \frac{1}{1-R_j^2}$$

where $R_j^2$ is the coefficient of multiple determination resulting from regressing the $j$th predictor variable, $x_j$, on the remaining $k-1$ predictor variables. Mean values of $VIF$ considerably greater than 1 indicate multicollinearity problems. A few methods of dealing with multicollinearity include increasing the number of observations in a way designed to break up dependencies among predictor variables, combining the linearly dependent predictor variables into one variable, eliminating variables from the model that are unimportant, or using coded variables.

Example

Variance inflation factors can be obtained for the data below.

Observed yield data for various levels of two factors.

To calculate the variance inflation factor for $x_1$, $R_1^2$ has to be calculated. $R_1^2$ is the coefficient of determination for the model when $x_1$ is regressed on the remaining variables. In the case of this example there is just one remaining variable, which is $x_2$. If a regression model is fit to the data, taking $x_1$ as the response variable and $x_2$ as the predictor variable, then the design matrix, $X_{R_1}$, and the vector of observations, $y_{R_1}$, are:



The regression sum of squares for this model can be obtained as:

$$SS_R = y_{R_1}'\left[H_{R_1} - \left(\frac{1}{n}\right)J\right]y_{R_1}$$

where $H_{R_1}$ is the hat matrix (and is calculated using $H_{R_1} = X_{R_1}(X_{R_1}'X_{R_1})^{-1}X_{R_1}'$) and $J$ is the matrix of ones. The total sum of squares for the model can be calculated as:

$$SS_T = y_{R_1}'\left[I - \left(\frac{1}{n}\right)J\right]y_{R_1}$$

where $I$ is the identity matrix. Therefore:

$$R_1^2 = \frac{SS_R}{SS_T}$$

Then the variance inflation factor for $x_1$ is:

$$VIF_1 = \frac{1}{1-R_1^2}$$

The variance inflation factor for $x_2$, $VIF_2$, can be obtained in a similar manner. In DOE++, the variance inflation factors are displayed in the VIF column of the Regression Information table as shown in the following figure. Since the values of the variance inflation factors obtained are considerably greater than 1, multicollinearity is an issue for the data.


Variance inflation factors for the data.
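The procedure used in this example, regressing each predictor variable on the remaining ones and converting the resulting coefficient of determination into a variance inflation factor, can be sketched as follows in Python with NumPy. The function name `variance_inflation_factors` and the predictor values are hypothetical: the values are synthetic and are not the data in the table above.

```python
import numpy as np

def variance_inflation_factors(P):
    """Return the VIF of each column of P.

    P is the n x k matrix of predictor variables (no column of ones).
    Each column in turn is treated as the response and regressed on the
    remaining columns plus an intercept.
    """
    n, k = P.shape
    J = np.ones((n, n))                     # matrix of ones
    I = np.eye(n)                           # identity matrix
    vifs = []
    for j in range(k):
        y_r = P[:, j]                                            # response: x_j
        X_r = np.column_stack([np.ones(n), np.delete(P, j, axis=1)])
        H_r = X_r @ np.linalg.inv(X_r.T @ X_r) @ X_r.T           # hat matrix
        ss_r = y_r @ (H_r - J / n) @ y_r                         # regression SS
        ss_t = y_r @ (I - J / n) @ y_r                           # total SS
        r2_j = ss_r / ss_t
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)

# Synthetic, strongly related predictor values for illustration only.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.5, 6.0, 4.5, 8.0, 7.0])
P = np.column_stack([x1, x2])

vif = variance_inflation_factors(P)
print("VIF:", vif)
```

With only two predictor variables, as here, both variance inflation factors are equal, because each is determined by the same correlation between $x_1$ and $x_2$.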