IES 612/STA 4-573/STA 4-576

Spring 2005

 

Week 03 – IES612-lecture-week03.doc

 

Checking Model Assumptions (OL 13.4) – an initial visit

 

RECALL:  Basic Model

 

Yi = b0 + b1Xi + ei    [“simple linear regression”]

 

ei ~ indep. N(0, s2)    [s2 denotes the common error variance]

 

Definition:

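[Def 1] (Raw) Residuals = observed response – predicted response

or

ei = Yi – Ŷi,  where Ŷi is the fitted (predicted) value at Xi

[Def 2] (Standardized Residuals)  commonly defined as ei / sqrt(MS(Resid))

[Def 3] (Studentized Residuals)  commonly defined as ei / sqrt(MS(Resid) × (1 – hi)), where hi is the leverage of observation i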

 

Each numbered assumption below is followed by its diagnostics (how do you check the assumption?) and by possible remediation.

1. E(ei) = 0 –> E(Yi) = b0 + b1Xi –> line is a reasonable model for describing mean change as a function of x

   Diagnostic:
   D1.1:  Plot ei vs. Ŷi
   D1.2:  Plot ei vs. xi
          [check to see if pattern exists]
   D1.3:  Plot Yi vs. xi and superimpose plot of Ŷi vs. xi.
   D1.4:  Large R2/signif. slope

   Remediation:
   Curvature?  Polynomial regression model or nonlinear regression model
   Smooth regression?  LOWESS
   Transformation?  Log/square root

2. V(ei) = s2 –> V(Yi) = s2 –> constant variance –> scatter about the line is the same regardless of the value of x

   Diagnostic:
   D2.1:  Plot ei vs. Ŷi
          [check to see if you have a constant band about zero]

   Remediation:
   Weighted Least Squares?
   Transformation

3. ei ~ Normal

   Diagnostic:
   D3.1:  Normal probability plot of ei [see if linear]
   D3.2:  Histogram of residuals [bell-shaped?]

   Remediation:
   Transformation?
   Generalized Linear Models (e.g. logistic/probit regression for dichotomous responses;  Poisson regression for count responses)

4. ei independent

   Diagnostic:
   D4.1:  Generally examining the design can suggest if this is true
   D4.2:  Durbin-Watson test

   Remediation:
   Correlated regression models? Time series/spatial methods

5.* no important omitted variables {relates to pt. 1}

   Diagnostic:
   D5.1:  Plot ei vs. omitted variables [see if pattern]

   Remediation:
   Add omitted variable to a model (multiple regression)

6.* no points exerting undue influence

   Diagnostic:
   D6.1:  Look at statistics that quantify influence (e.g. DFBETAS, DFFITS, etc.)
   D6.2:  Look for extreme X values (break in stemplots of X)

   Remediation:
   Smooth model-robust fitting procedure (e.g. Least Absolute Value regression)

7.* no extreme outliers impacting inference

   Diagnostic:
   D7.1:  Large residual (e.g. |standardized/studentized residual| > 3, or > 2?)
   D7.2:  Break in stemplot of residuals

   Remediation:
   Check to see if data sheet correct – fix?  Don’t simply omit.  Report analysis both including/excluding point?
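Diagnostic D4.2 can be computed by hand from residuals listed in time order. A minimal Python sketch (the notes themselves use SAS, where the DW option on the PROC REG MODEL statement requests this statistic). The function name and both residual series are made up for illustration. Values near 2 are consistent with uncorrelated errors, values near 0 suggest positive serial correlation, and values near 4 suggest negative serial correlation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of (e_t - e_{t-1})^2 divided by sum of e_t^2."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residual series in time order (illustration only).
drifting = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]      # slow drift: positive autocorrelation
alternating = [2.0, -2.0, 2.0, -2.0, 2.0, -2.0, 2.0]   # sign flips: negative autocorrelation

print("drifting   :", round(durbin_watson(drifting), 3))
print("alternating:", round(durbin_watson(alternating), 3))
```

The statistic only makes sense when the observations have a natural ordering (e.g. the years in the manatee example below).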

 

Example:  Manatee Deaths predicted from Number of Boats Registered

options ls=75;

data example1;
  input year nboats manatees;
  cards;
77    447   13
78    460   21
79    481   24
80    498   16
81    513   24
82    512   20
83    526   15
84    559   34
85    585   33
86    614   33
87    645   39
88    675   43
89    711   50
90    719   47
;

ODS RTF;
*file='D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\linreg-output.rtf';

proc reg data=example1;
title 'Number of Manatees killed regressed on the number of boats registered in Florida';
  model manatees = nboats / p r cli clm;   * predicted values, residual analysis, interval limits;
  plot manatees*nboats p.*nboats / overlay;   * data with fitted line superimposed (D1.3);
  plot r.*nboats r.*p.;   * residuals vs x and yhat;
  plot r.*nqq.;           * normal qqplot;
run;
quit;

ODS RTF CLOSE;
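As a cross-check on the PROC REG run, the fit and the case diagnostics can be reproduced by hand. The following is a Python sketch, not part of the SAS program: the variable names are chosen here, and the formulas used are the standard simple-linear-regression ones (leverage hi = 1/n + (xi – x̄)²/Sxx, studentized residual ei / sqrt(MS(Resid)(1 – hi)), Cook's D with 2 estimated coefficients). The Cook's D values should agree with the Output Statistics table in these notes, e.g. about 0.244 for observation 7.

```python
import math

# Manatee data from the SAS data step above.
nboats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719]
manatees = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47]

n = len(nboats)
xbar = sum(nboats) / n
ybar = sum(manatees) / n
sxx = sum((x - xbar) ** 2 for x in nboats)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(nboats, manatees))

b1 = sxy / sxx              # least squares slope
b0 = ybar - b1 * xbar       # least squares intercept

fitted = [b0 + b1 * x for x in nboats]
resid = [y - f for y, f in zip(manatees, fitted)]          # raw residuals (Def 1)
mse = sum(e ** 2 for e in resid) / (n - 2)                 # MS(Resid), n - 2 df
leverage = [1 / n + (x - xbar) ** 2 / sxx for x in nboats]

# Studentized residuals: e_i / sqrt(MS(Resid) * (1 - h_i))
student = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, leverage)]
# Cook's D with p = 2 estimated coefficients
cooks_d = [r ** 2 / 2 * h / (1 - h) for r, h in zip(student, leverage)]

print(f"b0 = {b0:.3f}, b1 = {b1:.4f}, root MSE = {math.sqrt(mse):.3f}")
print("obs  student  cooks_d")
for i, (r, d) in enumerate(zip(student, cooks_d), start=1):
    print(f"{i:3d} {r:8.3f} {d:8.3f}")
```

Observation 7 (1983) has the largest studentized residual in absolute value and the largest Cook's D, which is the point to check against the data sheet before deciding anything else.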

 

Residuals plot – model adequate? Constant variance?

* now in Excel

 

* now in Excel

 

 

 

Studentized Residuals – outliers?

Output Statistics

Obs      -2-1 0 1 2       Cook's D
  1    |      |      |      0.017
  2    |      |**    |      0.178
  3    |      |**    |      0.149
  4    |    **|      |      0.091
  5    |      |      |      0.006
  6    |     *|      |      0.021
  7    |  ****|      |      0.244
  8    |      |**    |      0.073
  9    |      |      |      0.005
 10    |     *|      |      0.015
 11    |      |      |      0.000
 12    |      |      |      0.000
 13    |      |*     |      0.091
 14    |      |      |      0.027

 

 

Normal errors? - Normal quantile-quantile plot
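The coordinates of a normal quantile-quantile plot are easy to build by hand: sort the residuals and pair the i-th smallest with a standard normal quantile. A Python sketch follows; the plotting positions (i – 3/8)/(n + 1/4) are Blom's, one common choice and an assumption here, and the residuals are hypothetical values used only to show the mechanics.

```python
from statistics import NormalDist


def normal_scores(n):
    """Standard normal quantiles at Blom's plotting positions (i - 3/8)/(n + 1/4)."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def qq_pairs(residuals):
    """Pair each sorted residual with its normal score (x = score, y = residual)."""
    return list(zip(normal_scores(len(residuals)), sorted(residuals)))


# Hypothetical residuals (illustration only).
example_resid = [-4.1, 2.3, 0.4, -1.2, 5.0, -0.6, 1.1, -2.7]

for score, e in qq_pairs(example_resid):
    print(f"{score:7.3f} {e:6.1f}")
```

If the errors are normal, the printed pairs fall close to a straight line; systematic curvature in the tails is the pattern to look for.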

 

 

 

 

Multiple Regression (OL Chapter 12)

 

* More than one predictor variable

 

Example:  Lung function in miners exposed to coal dust

 

Example:  Polynomial regression

e.g. a quadratic, Yi = b0 + b1Xi + b2Xi² + ei,  or  a higher-order polynomial in Xi

 

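Example:  Indicator variables – e.g. different lines in different groups

Yi = b0 + b1 Igroup2 + b2 Xi + b3 (Igroup2 × Xi) + ei

where  Igroup2 = 1 (group 2) and Igroup2 = 0 (group 1)

Group 1:  E(Yi) = b0 + b2 Xi

Group 2:  E(Yi) = (b0 + b1) + (b2 + b3) Xi

So,

GROUP 2 INTERCEPT differs from GROUP 1 intercept by b1

GROUP 2 SLOPE differs from GROUP 1 slope by b3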
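The link between the indicator-variable coefficients and the two group lines can be seen numerically. In the Python sketch below the data are hypothetical and the helper names are chosen here. It relies on a standard fact: when the model has both an intercept shift and a slope shift, the least squares coefficients coincide with those from fitting each group separately (the pooled MS(Resid) is a different matter).

```python
def fit_line(x, y):
    """Simple least squares fit; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return ybar - slope * xbar, slope


# Hypothetical data for two groups (illustration only).
x1, y1 = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]   # group 1 (Igroup2 = 0)
x2, y2 = [1, 2, 3, 4], [5.0, 5.9, 7.1, 8.0]   # group 2 (Igroup2 = 1)

int1, slope1 = fit_line(x1, y1)
int2, slope2 = fit_line(x2, y2)

# Coefficients of the single indicator-variable model
b0, b2 = int1, slope1     # group 1 intercept and slope
b1 = int2 - int1          # intercept difference, group 2 minus group 1
b3 = slope2 - slope1      # slope difference, group 2 minus group 1


def mean_response(x, igroup2):
    """E(Y) under the model with an indicator and an indicator-by-X term."""
    return b0 + b1 * igroup2 + b2 * x + b3 * igroup2 * x


print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}, b3 = {b3:.2f}")
print(mean_response(3, 0), mean_response(3, 1))
```

A test of b3 = 0 then asks whether the two groups share a common slope, and a test of b1 = b3 = 0 asks whether one line serves both groups.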

 

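GENERAL FORM:

Yi = b0 + b1Xi1 + b2Xi2 + … + bkXik + ei

Comments:

1.         “LINEAR” model because the regression coefficients enter the model in a linear way – compare, e.g., Yi = b0 + b1Xi + b2Xi² + ei (still linear in the b’s) with Yi = b0 exp(b1Xi) + ei (nonlinear in b1)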

 

So, how does a multiple regression model  (MR) differ from simple linear regression (SLR)?

i.          SLR is the equation of a LINE;  MR is the equation of a (hyper-)PLANE

ii.          b0 is the mean response when X=0 in SLR while b0 is the mean response when ALL X’s=0 in MR

iii.         2 regression coefficients in SLR;  k+1 regression coefficients in MR

iv.         interpretation of coefficients?  Partial coefficients in MR

v.         Model scope (space covered by the Xs)

 

Estimating regression coefficients

Least squares – minimize  Σ (Yi – b0 – b1Xi1 – … – bkXik)²  over b0, b1, …, bk

Estimate of s2:  MS(Resid) = SS(Resid)/(n-k-1)

 

F Test of any relationship between Y and set of predictor variables

 

H0: b1 = b2 = …=bk = 0

 

Ha: at least one of bi ≠ 0

 

TS:  Fobs = [SS(Reg)/k] / [SS(Resid)/(n-k-1)]= MS(Reg)/MS(Resid)

 

RR:  Reject H0 if Fobs > Fa, k, n-k-1

 

Conclusions:  Reject/Fail-to-reject H0?

Where  SS(Reg) = Σ(Ŷi – Ȳ)²  and  SS(Resid) = Σ(Yi – Ŷi)²
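The overall F statistic is plain arithmetic on the ANOVA table. The Python sketch below uses the sums of squares printed for the model lifewom = liter loggnp (MODEL3 in the SAS output later in these notes) and should reproduce the printed F Value of 151.54 and R-Square of 0.8038. The variable names are chosen here; the critical value Fa, k, n-k-1 or the p-value still has to come from tables or from SAS.

```python
# Values copied from the Analysis of Variance table for
# lifewom = liter loggnp (MODEL3 in the SAS output of these notes).
ss_reg = 5678.11397
ss_resid = 1386.40551
n, k = 77, 2                       # observations used, number of predictors

ms_reg = ss_reg / k
ms_resid = ss_resid / (n - k - 1)  # this is also the estimate of s2
f_obs = ms_reg / ms_resid
r_square = ss_reg / (ss_reg + ss_resid)

print(f"MS(Reg) = {ms_reg:.5f}   MS(Resid) = {ms_resid:.5f}")
print(f"Fobs = {f_obs:.2f} on {k} and {n - k - 1} df")
print(f"R-Square = {r_square:.4f}")
```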

(Partial) Test of bj

H0: bj = 0

Ha: bj ≠ 0                     Ha: bj < 0                  Ha: bj > 0

TS:  tobs = b̂j / SE(b̂j)

where  SE(b̂j) = sqrt( MS(Resid) / [ Σ(Xij – X̄j)² (1 – R2) ] )

==> R2 here is the % of variation in one pred. variable (Xj) accounted for by all of the other predictors

==> VIF = [1/(1 – R2)] is a diagnostic of collinearity (max > 10 concern – Neter et al.)

RR:  Reject H0 if

|tobs| > ta/2, n-k-1           tobs < -ta, n-k-1           tobs > ta, n-k-1

Conclusions:  Reject/Fail-to-reject H0?

P-value:

2 × P(tn-k-1 > |tobs|)         P(tn-k-1 < tobs)            P(tn-k-1 > tobs)
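The partial t statistics and the VIF can be checked against the SAS listing with a few lines of arithmetic. This Python sketch copies the estimates, standard errors and tolerances printed for lifewom = liter loggnp (the TOL VIF output later in these notes); the dictionary layout and names are chosen here. Tolerance is 1 – R2 for the predictor regressed on the other predictors, so VIF = 1/tolerance.

```python
# (estimate, standard error, tolerance) copied from the Parameter Estimates
# table for lifewom = liter loggnp in the SAS output of these notes.
estimates = {
    "liter": (0.20117, 0.02678, 0.58823),
    "loggnp": (8.86394, 1.22709, 0.58823),
}

results = {}
for name, (b_hat, se, tolerance) in estimates.items():
    t_obs = b_hat / se        # partial t statistic for H0: bj = 0
    r2_j = 1 - tolerance      # R2 of predictor j on the other predictors
    vif = 1 / (1 - r2_j)      # variance inflation factor
    results[name] = (t_obs, r2_j, vif)
    print(f"{name:7s} tobs = {t_obs:5.2f}   R2 = {r2_j:.5f}   VIF = {vif:.4f}")
```

Both VIFs are about 1.7, well below the "max > 10" guideline, so collinearity between literacy and log GNP is not a concern for this two-predictor model.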

 

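Testing a subset of the predictors

H0: bg+1 = bg+2 = … = bk = 0 [implies only need the first “g” of the “k” predictor variables, i.e. “g+1” of the “k+1” regression coefficients]

Ha: not H0

TS:  Fobs = { [SS(Reg, complete) – SS(Reg, reduced)] / (k-g) } / { SS(Resid, complete) / (n-k-1) }

RR:  Reject H0 if Fobs > Fa, k-g, n-k-1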
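A worked case of the complete-versus-reduced comparison, sketched in Python with numbers copied from the SAS output later in these notes: does LOGGNP add anything once LITER is in the model? The complete model is lifewom = liter loggnp and the reduced model is lifewom = liter; both were fit to the same 77 observations, which the comparison requires. The variable names are chosen here. With a single coefficient under test (k – g = 1) the F statistic should equal the square of the partial t statistic for loggnp, up to rounding in the printed values.

```python
# Sums of squares copied from the PROC REG output in these notes.
ss_reg_complete = 5678.11397     # lifewom = liter loggnp
ss_reg_reduced = 4700.51263      # lifewom = liter
ss_resid_complete = 1386.40551
n, k, g = 77, 2, 1               # obs used, predictors in complete / reduced model

numerator = (ss_reg_complete - ss_reg_reduced) / (k - g)
denominator = ss_resid_complete / (n - k - 1)
f_obs = numerator / denominator

# Partial t statistic for loggnp in the complete model (estimate / std error)
t_obs = 8.86394 / 1.22709

print(f"Fobs = {f_obs:.2f} on {k - g} and {n - k - 1} df")
print(f"tobs^2 = {t_obs ** 2:.2f}")
```

In PROC REG a TEST statement placed after the MODEL statement (e.g. test loggnp = 0;) requests the same kind of comparison.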

 

Example:  Life Expectancy across different countries

 

data country;
  title 'country data analysis';
  infile "\\Casnov5\MST\MSTLab\Baileraj\country.data";  * reads a data file;
  input  name $ area popnsize pcturban lang $ liter lifemen
         lifewom pcGNP;
  logarea = log10(area);
  logpopn = log10(popnsize);
  loggnp  = log10(pcGNP);
  ienglish = (lang="English");    * indicator variable, 1 if English and 0 otherwise;
  drop area popnsize pcgnp;
run;

proc print data=country;
run;

/*
  to generate a scatterplot matrix
  Solutions > Analysis > Interactive Data Analysis
     - open data set WORK > COUNTRY
       - select columns (CTRL and click column labels)
       - Analyze > Scatter Plot (YX)

  to generate regression fit via this interactive data analysis
  Analyze > Fit
*/

ods html;

proc reg data=country;
  title 'predicting life expectancy of women in different countries';
  model lifewom = loggnp;
  output out=new1 p=yhat r=resid;
run;

proc plot data=new1 hpercent=50 vpercent=75;
  title 'residual plots for LIFEWOM = LOGGNP model';
  plot resid*(yhat liter);    * residuals vs yhat and vs the omitted variable LITER (D5.1);
run;

proc reg data=country;
  title 'LITER and LOGGNP as predictors of Life expectancy of women';
  model lifewom = liter;
  model lifewom = loggnp;
  model lifewom = liter loggnp;
run;

proc reg data=country;
  title 'LIFEWOM predicted from PCTURBAN LITER LOGAREA LOGPOPN LOGGNP';
  model lifewom = pcturban liter logarea logpopn loggnp;
  plot r.*p. nqq.*r.;
run;

proc reg data=country;
  title 'LITER and LOGGNP as predictors of Life expectancy of women';
  model lifewom = liter loggnp / tol vif collinoint;    * collinearity diagnostics;
  output out=new p=yhat r=resid;
run;

proc univariate data=new plot;
  id name;
  var resid;
run;

proc plot data=new hpercent=50 vpercent=50;
  plot resid*yhat=ienglish resid*liter=ienglish resid*loggnp=ienglish;
run;

ods html close;

 

predicting life expectancy of women in different countries

The REG Procedure
Model: MODEL1
Dependent Variable: lifewom

Number of Observations Read                   79
Number of Observations Used                   78
Number of Observations with Missing Values     1

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        4793.33759     4793.33759     148.93    <.0001
Error              76        2446.11113       32.18567
Corrected Total    77        7239.44872

Root MSE            5.67324    R-Square    0.6621
Dependent Mean     64.85897    Adj R-Sq    0.6577
Coeff Var           8.74704

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              19.42550           3.77797       5.14      <.0001
loggnp        1              14.83433           1.21557      12.20      <.0001
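Every summary line in this listing follows from the ANOVA table. The Python sketch below (variable names chosen here) rebuilds Root MSE, R-Square, Adj R-Sq, Coeff Var and the F Value from the printed sums of squares and degrees of freedom, and evaluates the fitted line at loggnp = 3, i.e. a per capita GNP of 1000 in the original units.

```python
import math

# Values copied from the PROC REG listing for lifewom = loggnp.
ss_model, ss_error = 4793.33759, 2446.11113
df_model, df_error = 1, 76
dependent_mean = 64.85897
b0_hat, b1_hat = 19.42550, 14.83433

ss_total = ss_model + ss_error
df_total = df_model + df_error
ms_error = ss_error / df_error

root_mse = math.sqrt(ms_error)
r_square = ss_model / ss_total
adj_r_square = 1 - ms_error / (ss_total / df_total)
coeff_var = 100 * root_mse / dependent_mean
f_value = (ss_model / df_model) / ms_error

# Predicted life expectancy at loggnp = 3; because the predictor is log10(pcGNP),
# the slope is the estimated change per tenfold increase in per capita GNP.
yhat_at_3 = b0_hat + b1_hat * 3

print(f"Root MSE = {root_mse:.5f}   R-Square = {r_square:.4f}   Adj R-Sq = {adj_r_square:.4f}")
print(f"Coeff Var = {coeff_var:.5f}   F Value = {f_value:.2f}")
print(f"yhat at loggnp = 3: {yhat_at_3:.2f}")
```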

 


residual plots for LIFEWOM = LOGGNP model

Plot of resid*yhat.  A=1, B=2, etc.        resid*liter. A=1, B=2, etc.

   15 +       A A                         15 +                    A    A
      |                                      |
      |                                      |
      |                                      |
      |        A                             |                      A
   10 +           A                       10 +                        A
      |                                      |
      |            B                         |                        AA
      |         A  A                         |                       AA
      |          BB                          |                  A  B  A
    5 +          C B                       5 +                  A AA AA
R     |        A  A B                  R     |            A       A  AA
e     |      A      A AA               e     |                    AA A  A
s     |          B A A A               s     |               AA      A  B
i     |       A    AA  C A             i     |                      AD  A
d   0 +      BA     AA                 d   0 +       A A    AA        A
u     |          AB    A A             u     |           A        AA    B
a     |       A         A A            a     |              A       A   A
l     |     A AABAA      A             l     |       AAA AA  A   A      A
      |       AB       A                     |       AA    A           A
   -5 +     B A  A        A               -5 +     B  A         A       A
      |         A                            |                 A
      |                  A                   |                  A
      |          AAB                         |         A  A     B
      |          A  A                        |             A  A
  -10 +                                  -10 +
      |       A  AA                          |     A           B
      |                                      |
      |                                      |
      |                                      |
  -15 +                                  -15 +
      -+--------+--------+--------+          -+------------+------------+-
      40       60       80      100           0           50           100

        Predicted Value of lifewom                       liter

NOTE: 1 obs had missing values.        NOTE: 2 obs had missing values.


LITER and LOGGNP as predictors of Life expectancy of women

The REG Procedure
Model: MODEL1
Dependent Variable: lifewom

Number of Observations Read                   79
Number of Observations Used                   77
Number of Observations with Missing Values     2

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        4700.51263     4700.51263     149.13    <.0001
Error              75        2364.00685       31.52009
Corrected Total    76        7064.51948

Root MSE            5.61428    R-Square    0.6654
Dependent Mean     64.68831    Adj R-Sq    0.6609
Coeff Var           8.67896

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              41.85909           1.97590      21.18      <.0001
liter         1               0.32529           0.02664      12.21      <.0001

 


LITER and LOGGNP as predictors of Life expectancy of women

The REG Procedure
Model: MODEL2
Dependent Variable: lifewom

Number of Observations Read                   79
Number of Observations Used                   77
Number of Observations with Missing Values     2

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        4620.58250     4620.58250     141.80    <.0001
Error              75        2443.93698       32.58583
Corrected Total    76        7064.51948

Root MSE            5.70840    R-Square    0.6541
Dependent Mean     64.68831    Adj R-Sq    0.6494
Coeff Var           8.82447

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              19.57316           3.84413       5.09      <.0001
loggnp        1              14.77981           1.24118      11.91      <.0001


LITER and LOGGNP as predictors of Life expectancy of women

The REG Procedure
Model: MODEL3
Dependent Variable: lifewom

Number of Observations Read                   79
Number of Observations Used                   77
Number of Observations with Missing Values     2

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        5678.11397     2839.05698     151.54    <.0001
Error              74        1386.40551       18.73521
Corrected Total    76        7064.51948

Root MSE            4.32842    R-Square    0.8038
Dependent Mean     64.68831    Adj R-Sq    0.7984
Coeff Var           6.69119

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              23.51270           2.96162       7.94      <.0001
liter         1               0.20117           0.02678       7.51      <.0001
loggnp        1               8.86394           1.22709       7.22      <.0001


LIFEWOM predicted from PCTURBAN LITER LOGAREA LOGPOPN LOGGNP

The REG Procedure
Model: MODEL1
Dependent Variable: lifewom

Number of Observations Read                   79
Number of Observations Used                   67
Number of Observations with Missing Values    12

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5        4473.89310      894.77862      43.96    <.0001
Error              61        1241.74869       20.35654
Corrected Total    66        5715.64179

Root MSE            4.51182    R-Square    0.7827
Dependent Mean     64.77612    Adj R-Sq    0.7649
Coeff Var           6.96525

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              27.79999           4.53708       6.13      <.0001
pcturban      1               0.02241           0.03757       0.60      0.5530
liter         1               0.19211           0.03180       6.04      <.0001
logarea       1              -0.41442           0.93342      -0.44      0.6586
logpopn       1              -0.26259           1.06069      -0.25      0.8053
loggnp        1               7.73888           1.81985       4.25      <.0001

 


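The REG Procedure

[Two plots from this PROC REG step, requested by "plot r.*p. nqq.*r.;": residual vs. predicted value, and NQQ vs. RESIDUAL. The plots are not reproduced here.]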


LITER and LOGGNP as predictors of Life expectancy of women

The REG Procedure
Model: MODEL1
Dependent Variable: lifewom

Number of Observations Read                   79
Number of Observations Used                   77
Number of Observations with Missing Values     2

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        5678.11397     2839.05698     151.54    <.0001
Error              74        1386.40551       18.73521
Corrected Total    76        7064.51948

Root MSE            4.32842    R-Square    0.8038
Dependent Mean     64.68831    Adj R-Sq    0.7984
Coeff Var           6.69119

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Tolerance    Variance Inflation
Intercept     1              23.51270           2.96162       7.94      <.0001            .                     0
liter         1               0.20117           0.02678       7.51      <.0001      0.58823               1.70001
loggnp        1               8.86394           1.22709       7.22      <.0001      0.58823               1.70001

Collinearity Diagnostics (intercept adjusted)

                                             Proportion of Variation
Number    Eigenvalue    Condition Index      liter        loggnp
1            1.64169            1.00000      0.17915      0.17915
2            0.35831            2.14051      0.82085      0.82085
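With only two predictors, the collinearity diagnostics can be tied back to a single number, the correlation between LITER and LOGGNP. The Python sketch below assumes, as the "intercept adjusted" label and the eigenvalues summing to 2 suggest, that the eigenvalues are those of the 2×2 correlation matrix of the predictors, which are 1 + |r| and 1 – |r|. It recovers |r|, checks that 1 – r² matches the printed tolerance, and recomputes the condition index as the square root of the ratio of the largest eigenvalue to the smallest. Variable names are chosen here.

```python
import math

# Values copied from the SAS listing above.
eigenvalues = [1.64169, 0.35831]    # Collinearity Diagnostics (intercept adjusted)
tolerance = 0.58823                 # Parameter Estimates, liter and loggnp

# For two predictors the correlation matrix [[1, r], [r, 1]]
# has eigenvalues 1 + |r| and 1 - |r|.
abs_r = eigenvalues[0] - 1
tolerance_from_r = 1 - abs_r ** 2
condition_index = math.sqrt(eigenvalues[0] / eigenvalues[1])

print(f"|r| between liter and loggnp = {abs_r:.5f}")
print(f"1 - r^2 = {tolerance_from_r:.5f}  (printed tolerance {tolerance})")
print(f"condition index = {condition_index:.5f}")
```

A correlation of about 0.64 between the predictors explains why each coefficient shrinks when the other predictor enters the model, while the condition index of about 2.1 indicates nothing close to a near-singular design.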


LITER and LOGGNP as predictors of Life expectancy of women

The UNIVARIATE Procedure
Variable: resid (Residual)

Moments

N                          77    Sum Weights                 77
Mean                        0    Sum Observations             0
Std Deviation      4.27108625    Variance            18.2421778
Skewness           -0.5505371    Kurtosis            0.31026345
Uncorrected SS     1386.40551    Corrected SS        1386.40551
Coeff Variation             .    Std Error Mean      0.48673545
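The Std Deviation printed by PROC UNIVARIATE is not the Root MSE from PROC REG, although both come from the same sum of squared residuals. UNIVARIATE treats the residuals as an ordinary sample and divides by n – 1, whereas the regression estimate of s divides by the error degrees of freedom n – k – 1. A short Python check using the values printed in these notes (variable names chosen here):

```python
import math

ss_resid = 1386.40551     # Uncorrected SS of the residuals = SS(Resid) from PROC REG
n, k = 77, 2              # observations used, predictors (liter, loggnp)

sd_univariate = math.sqrt(ss_resid / (n - 1))      # divisor n - 1 = 76
root_mse = math.sqrt(ss_resid / (n - k - 1))       # divisor n - k - 1 = 74
std_error_mean = sd_univariate / math.sqrt(n)

print(f"Std Deviation (UNIVARIATE) = {sd_univariate:.8f}")
print(f"Root MSE (REG)             = {root_mse:.5f}")
print(f"Std Error Mean             = {std_error_mean:.8f}")
```

The residual mean is exactly 0 because the model includes an intercept, which is also why the Uncorrected SS and Corrected SS agree.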

 

Basic Statistical Measures

      Location                     Variability
Mean       0.000000     Std Deviation             4.27109
Median     0.920584     Variance                 18.24218
Mode       .            Range                    21.17584
                        Interquartile Range       5.71274

Quantiles (Definition 5)

Quantile         Estimate
100% Max        10.316175
99%             10.316175
95%              5.708294
90%              5.088762
75% Q3           3.100406
50% Median       0.920584
25% Q1          -2.612333
10%             -6.243490
5%              -7.757859
1%             -10.859669
0% Min         -10.859669

Extreme Observations

----------Lowest----------        ----------Highest---------
    Value    name       Obs           Value    name       Obs
-10.85967    swazilan    47         5.62592    jamaica      4
-10.81760    cameroon    33         5.70829    kenya       37
-10.62208    ethiopia    35         5.97978    elsalvad     9
 -7.75786    turkey      31         7.25135    guyana      21
 -7.68269    vanuatu     66        10.31617    china       56

Missing Values

                                 Percent Of
Missing Value    Count    All Obs    Missing Obs
.                    2       2.53         100.00

 

             Stem Leaf                     #             Boxplot
               10 3                        1                |
                9                                           |
                8                                           |
                7 3                        1                |
                6 0                        1                |
                5 11367                    5                |
                4 134                      3                |
                3 1123378899              10             +-----+
                2 0133457                  7             |     |
                1 112335578                9             |     |
                0 33699                    5             *--+--*
               -0 866311                   6             |     |
               -1 65332                    5             |     |
               -2 97763210                 8             +-----+
               -3 97440                    5                |
               -4 3                        1                |
               -5 53                       2                |
               -6 2                        1                |
               -7 8733                     4                |
               -8                                           |
               -9                                           |
              -10 986                      3                |
                  ----+----+----+----+


LITER and LOGGNP as predictors of Life expectancy of women

The UNIVARIATE Procedure
Variable: resid (Residual)

                              Normal Probability Plot
           10.5+                                                 *+
               |                                               ++
               |                                            +++
               |                                          ++ *
               |                                        ++
               |                                     +******
               |                                   +**
            3.5+                               *****
               |                             ***+
               |                          ****
               |                        **++
               |                      ***
               |                    ***
               |                 ****
           -3.5+               ***
               |              *+
               |           ++*
               |         ++ *
               |       +****
               |    +++
               |  ++
          -10.5++*   * *
                +----+----+----+----+----+----+----+----+----+----+
                    -2        -1         0        +1        +2


LITER and LOGGNP as predictors of Life expectancy of women

   Plot of resid*yhat=ienglish.           Plot of resid*liter=ienglish.

      |                                      |
   10 +             0                     10 +                    0
      |                 1                    |                         1
      |          0    0001                   |            0  00 0   0 1
R     |      0     00 0001 0  1        R     |       0   0   0  0 00 100
e     |      000  1 0   0 100 0  0     e     |     0 010 10        0 01 0
s   0 +     00   1       0 0 001 0     s   0 +     0  0         0   000 0
i     |         00 0 0  0 0 00 1 0     i     |         1  0 00    0 000 0
d     |      0   0  0 0 0        0     d     |     0       0      00    0
u     |               0        1       u     |                   0   0 1
a     |             00 00              a     |                000
l -10 +          0    0                l -10 +                 00
      |                                      |
      |                                      |
      |                                      |
      |                                      |
  -20 +                                  -20 +
      -+------------+------------+-          -+------------+------------+-
      40           60           80            0           50           100

        Predicted Value of lifewom                       liter

NOTE: 2 missing.  17 hidden.           NOTE: 2 missing.  21 hidden.

  Plot of resid*loggnp=ienglish.

      |
   10 +    0
      |      1
      |     0  01 0 0
R     |   0    0 00    1
e     |    00 01 1 0   0  0
s   0 +  0    10 0 0  010 0
i     |  00  0 0 0000   1 0
d     |   000   0  0      0
u     |    0  0        1
a     |      0  000
l -10 + 0       0
      |
      |
      |
      |
  -20 +
      -+--------+--------+--------+
       2        3        4        5

                  loggnp

NOTE: 2 missing.  23 hidden.