IES 612/STA 4-573/STA 4-576

Spring 2005

 

Week 04 – IES612-lecture-week04.doc

 

F Test of any relationship between Y and the set of predictor variables

 

H0: b1 = b2 = …=bk = 0

 

Ha: at least one of bi ≠ 0

 

TS:  Fobs = [SS(Reg)/k] / [SS(Resid)/(n-k-1)]= MS(Reg)/MS(Resid)

 

RR:  Reject H0 if Fobs > Fα, k, n-k-1  [the upper-α critical value of the F distribution with k and n-k-1 df]

 

Example:  Life Expectancy across different countries – any association?

 

 

Source              DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                2        5678.11397     2839.05698     151.54    <.0001
Error               74        1386.40551       18.73521
Corrected Total     76        7064.51948

 

 

H0: b1=b2=0 [LIFE EXPECTANCY is not related to either LITERACY or LOGGNP]

 

H1: Either b1≠0 or b2≠0 or BOTH (b1≠0 AND b2≠0)

 

TS:  Fobs=151.54

 

P-value<0.0001

 

Conclusion:  Reject H0 and conclude LIFE EXPECTANCY is related to either LITERACY or LOGGNP or both.
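
As a quick arithmetic check, the mean squares and the F statistic in the ANOVA table for this example can be recomputed from the sums of squares and degrees of freedom. The sketch below uses Python purely as a calculator; the variable names are illustrative and are not part of the SAS analysis.

```python
# Recompute the overall F test from the ANOVA table (values copied from the SAS output).
ss_reg, df_reg = 5678.11397, 2        # Model row: SS(Reg) and k
ss_resid, df_resid = 1386.40551, 74   # Error row: SS(Resid) and n-k-1

ms_reg = ss_reg / df_reg              # MS(Reg)   = SS(Reg)/k
ms_resid = ss_resid / df_resid        # MS(Resid) = SS(Resid)/(n-k-1)
f_obs = ms_reg / ms_resid             # Fobs      = MS(Reg)/MS(Resid)

n = df_reg + df_resid + 1             # sample size implied by the df: k + (n-k-1) + 1

print(f"MS(Reg)   = {ms_reg:.5f}")
print(f"MS(Resid) = {ms_resid:.5f}")
print(f"Fobs      = {f_obs:.2f}")
print(f"n         = {n}")
```

The ratio agrees with the F Value of 151.54 in the table, and the degrees of freedom imply that n = 77 countries were used in this fit.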

 

(Partial) Test of bj

 

H0: bj = 0

 

Ha: bj ≠ 0                     Ha: bj <0          Ha: bj >0

 

TS:  tobs = (estimate of bj) / (standard error of the estimate)

 

RR:  Reject H0 if

|tobs| > tα/2, n-k-1          tobs < -tα, n-k-1     tobs > tα, n-k-1

 

Conclusions:  Reject/Fail-to-reject H0?

 

P-value:

2·P(tn-k-1 > |tobs|)            P(tn-k-1 < tobs)     P(tn-k-1 > tobs)

 

Example:  Life Expectancy across different countries – testing single reg. parameters

 

 

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              23.51270           2.96162       7.94      <.0001
liter         1               0.20117           0.02678       7.51      <.0001
loggnp        1               8.86394           1.22709       7.22      <.0001

 

H0: b2=0 [the prediction of LIFE EXPECTANCY is NOT improved by adding LOGGNP to a model already containing LITERACY]

 

H1: b2≠0 [LOGGNP is needed in addition to LITERACY for predicting LIFE EXPECTANCY]

 

TS:  tobs=7.22

 

P-value<0.0001

 

Conclusion:  Reject H0 and conclude that LOGGNP is a significant variable for modeling LIFE EXPECTANCY that adds to a model already containing LITERACY.
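
Each t Value in the Parameter Estimates table is the parameter estimate divided by its standard error. The short sketch below redoes that division for the three rows of the table; Python is used only as a calculator and the dictionary layout is an illustrative choice.

```python
# (estimate, standard error) pairs copied from the Parameter Estimates table.
estimates = {
    "Intercept": (23.51270, 2.96162),
    "liter":     (0.20117, 0.02678),
    "loggnp":    (8.86394, 1.22709),
}

# tobs = estimate / standard error, one value per row of the table
t_values = {name: est / se for name, (est, se) in estimates.items()}

for name, t in t_values.items():
    print(f"{name:<10s} tobs = {t:.2f}")
```

The three ratios round to the tabled t Values of 7.94, 7.51 and 7.22; each is compared with a t distribution on n-k-1 = 74 df.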

 

Testing a subset of the predictors [General Linear Test]

 

H0: bg+1 = bg+2 = … = bk = 0 [implies only “g” of the “k” predictor variables are needed]

 

Ha: not H0 [more than the REDUCED model is needed]

 

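TS:  Fobs = {[SS(Resid, Reduced) - SS(Resid, Full)]/(k-g)} / [SS(Resid, Full)/(n-k-1)]

 

RR:  Reject H0 if Fobs > Fα, k-g, n-k-1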

 

Example:  Life Expectancy across different countries – all 5 variables needed?

 

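“Complete”/“Full” model -> LIFE EXPECTANCY = b0 + b1 LITERACY + b2 LOGGNP + b3 LogAREA + b4 LogPOPN + b5 PCTURBAN + error

 

“Reduced” model -> LIFE EXPECTANCY = b0 + b1 LITERACY + b2 LOGGNP + error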

 

H0: b3=b4=b5=0 [LogAREA, LogPOPN and PCTURBAN do not add to a model already containing LITERACY and LOGGNP]

 

H1: at least one of (b3, b4, b5)≠0

 

TS:  Fobs=0.30 [see SAS output below]

 

P-value=0.8223

 

Conclusion:  Fail to Reject H0 and conclude that LogAREA, LogPOPN and PCTURBAN do not appear to significantly improve a LIFE EXPECTANCY model that already contains LITERACY and LOGGNP as predictor variables.

 

/* SAS code for testing a subset of parameters in a model */
data country;
 title 'country data analysis';
 infile "\\Casnov5\MST\MSTLab\Baileraj\country.data";  * reads a data file;
 input  name $ area popnsize pcturban lang $ liter lifemen
         lifewom pcGNP;
 logarea = log10(area);      * base-10 logs of area, population size and per-capita GNP;
 logpopn = log10(popnsize);
 loggnp  = log10(pcGNP);
 drop area popnsize pcgnp;   * keep only the log-transformed versions;

proc reg;
  title LIFEWOM predicted from PCTURBAN LITER LOGAREA LOGPOPN LOGGNP;
  model lifewom = pcturban liter logarea logpopn loggnp;   * full model with all 5 predictors;
  test pcturban=logarea=logpopn=0;  ****** for testing subset;
  run;

 

                   LIFEWOM predicted from PCTURBAN LITER LOGAREA LOGPOPN LOGGNP

                                         The REG Procedure

                                           Model: MODEL1

                                   Dependent Variable: lifewom

 

                      Number of Observations Read                         79

                      Number of Observations Used                         67

                      Number of Observations with Missing Values          12

 

COMMENT:  Some observations were missing values on one or more of the predictor variables.  SAS deletes records that are not complete on ALL variables in the model.  You will see that the regression model with only LITER and LOGGNP as predictors has a different number of observations (n = 77 rather than 67).

 

                                       Analysis of Variance

 

                                              Sum of           Mean

          Source                   DF        Squares         Square    F Value    Pr > F

          Model                     5     4473.89310      894.77862      43.96    <.0001

          Error                    61     1241.74869       20.35654

          Corrected Total          66     5715.64179

 

                       Root MSE              4.51182    R-Square     0.7827

                       Dependent Mean       64.77612    Adj R-Sq     0.7649

                       Coeff Var             6.96525
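
The summary statistics printed under the ANOVA table follow directly from its entries. As a sketch (Python used only as a calculator, variable names illustrative), R-Square, Adj R-Sq, Root MSE and Coeff Var can be recomputed as follows, taking n = 67 observations used and k = 5 predictors from the output above.

```python
import math

# Values copied from the ANOVA table for the 5-predictor model.
ss_model, ss_total = 4473.89310, 5715.64179
mse = 20.35654                 # Mean Square for Error
n, k = 67, 5                   # observations used, number of predictors
dep_mean = 64.77612            # Dependent Mean (mean of lifewom)

r_square = ss_model / ss_total                               # SS(Reg)/SS(Total)
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)    # penalizes extra predictors
root_mse = math.sqrt(mse)                                    # residual standard deviation
coeff_var = 100 * root_mse / dep_mean                        # Root MSE as a % of the mean

print(f"R-Square  = {r_square:.4f}")
print(f"Adj R-Sq  = {adj_r_square:.4f}")
print(f"Root MSE  = {root_mse:.5f}")
print(f"Coeff Var = {coeff_var:.5f}")
```

These reproduce the printed values (0.7827, 0.7649, 4.51182 and 6.96525) up to rounding of the inputs.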

 

                                       Parameter Estimates

                                    Parameter       Standard

               Variable     DF       Estimate          Error    t Value    Pr > |t|

               Intercept     1       27.79999        4.53708       6.13      <.0001

               pcturban      1        0.02241        0.03757       0.60      0.5530

               liter         1        0.19211        0.03180       6.04      <.0001

               logarea       1       -0.41442        0.93342      -0.44      0.6586

               logpopn       1       -0.26259        1.06069      -0.25      0.8053

               loggnp        1        7.73888        1.81985       4.25      <.0001

 

COMMENT:  The partial (single parameter) tests also cast doubt on whether PCTURBAN, LOGAREA and LOGPOPN add to the model; however, these don’t test ALL of these variables simultaneously.

 

                                           Model: MODEL1

                           Test 1 Results for Dependent Variable lifewom

 

                                                     Mean

                     Source             DF         Square    F Value    Pr > F

                     Numerator           3        6.19064       0.30    0.8223

                     Denominator        61       20.35654
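
The Test 1 results can be tied back to the general linear test statistic. In the sketch below (Python as a calculator, names illustrative), the numerator mean square times its df gives the drop in residual sum of squares when the three variables are added, and the denominator is the full-model MS(Resid). Note that the reduced-model residual SS must come from the same 67 complete cases, so the value 1386.40551 from the earlier 77-observation fit cannot be plugged in directly.

```python
# Values copied from the "Test 1 Results" table.
ms_num, df_num = 6.19064, 3      # Numerator:   mean square and k-g
ms_den, df_den = 20.35654, 61    # Denominator: full-model MS(Resid) and n-k-1

f_obs = ms_num / ms_den          # Fobs for H0: b3 = b4 = b5 = 0

# Implied sums of squares on the 67 complete cases
ss_drop = ms_num * df_num        # SS(Resid, Reduced) - SS(Resid, Full)
ss_resid_full = ms_den * df_den  # SS(Resid, Full); should match the ANOVA Error SS
ss_resid_reduced = ss_resid_full + ss_drop

print(f"Fobs               = {f_obs:.2f}")
print(f"SS drop            = {ss_drop:.5f}")
print(f"SS(Resid, Full)    = {ss_resid_full:.5f}")
print(f"SS(Resid, Reduced) = {ss_resid_reduced:.5f}")
```

The ratio rounds to the reported F Value of 0.30, and the reconstructed SS(Resid, Full) matches the Error sum of squares (1241.74869) in the full-model ANOVA table up to rounding.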

 

How stable is the model fit?  Are the predictor variables highly correlated?

 

Collinearity refers to the predictor variables being highly correlated with one another, i.e. do the variables provide redundant information?  This can be assessed in several ways:

1.  Does a scatterplot (or pairwise r) of Xi vs. Xj suggest high correlation?

2.  Is the R2 large when Xi is predicted from the other predictors X1,…,Xi-1,Xi+1,…,Xk?

    Or is Tolerance = 1-R2 small?

    Or is VIF = 1/(1-R2) large? (say >10)

3.  Are there small eigenvalues/large condition numbers (properties of a matrix defined from the collection of predictor variables)?

 

proc reg data=country;

  title LITER and LOGGNP as predictors of Life expectancy of women;

  model lifewom = liter loggnp/ tol vif collinoint;                          

run;

 

LITER and LOGGNP as predictors of Life expectancy of women

 

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Tolerance    Variance Inflation
Intercept     1              23.51270           2.96162       7.94      <.0001            .                     0
liter         1               0.20117           0.02678       7.51      <.0001      0.58823               1.70001
loggnp        1               8.86394           1.22709       7.22      <.0001      0.58823               1.70001

 

Collinearity Diagnostics (intercept adjusted)

                                           Proportion of Variation
Number    Eigenvalue    Condition Index       liter       loggnp
     1       1.64169            1.00000     0.17915      0.17915
     2       0.35831            2.14051     0.82085      0.82085
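
The collinearity measures in this output are linked by simple formulas, which the sketch below checks numerically (Python as a calculator, names illustrative). It assumes, as is the case for intercept-adjusted diagnostics with exactly two predictors, that the eigenvalues are those of the 2x2 correlation matrix of LITER and LOGGNP, namely 1 + |r| and 1 - |r|.

```python
import math

tolerance = 0.58823                   # Tolerance = 1 - R2 (same for both predictors here)
vif = 1 / tolerance                   # VIF = 1/(1 - R2)
r_square_x = 1 - tolerance            # R2 from regressing one predictor on the other
r_xx = math.sqrt(r_square_x)          # |pairwise correlation|; valid only for two predictors

eigenvalues = [1.64169, 0.35831]      # from the Collinearity Diagnostics table
# Condition index = sqrt(largest eigenvalue / this eigenvalue)
condition_index = [math.sqrt(max(eigenvalues) / ev) for ev in eigenvalues]

print(f"VIF              = {vif:.4f}")
print(f"|r(liter,loggnp)| = {r_xx:.4f}")
print(f"1 + |r|, 1 - |r| = {1 + r_xx:.4f}, {1 - r_xx:.4f}")
print("Condition index  =", [round(ci, 4) for ci in condition_index])
```

A VIF of about 1.7 is far below the rough cutoff of 10 and the largest condition index is only about 2.14, so LITER and LOGGNP are moderately correlated (|r| of roughly 0.64) but not worryingly collinear.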

 

How about points that exhibit high influence on the fit of the model?

 

INFLUENCE diagnostics measure the impact of a particular data point on the fit of a model.  These diagnostics can look at how an estimated coefficient (DFBETAs) or the prediction equation (DFFITs) changes with the inclusion or exclusion of a data point.  These measures are usually standardized first.  Potential concern: 

|DFBETAS| larger than 1 (small n) or 2/sqrt(n) (large n)

|DFFITS| larger than 1 (small n) or 2*sqrt(k/n) (large n)

Other measures (e.g. Cook’s D) can also be used

 

proc reg data=country;

  title LITER and LOGGNP as predictors of Life expectancy of women;

  model lifewom = liter loggnp/ influence;

run;

 

(SAS output edited)


The REG Procedure

Dependent Variable: lifewom

Output Statistics

                                                                      ----------- DFBETAS -----------
Obs    Residual    RStudent    Hat Diag H    Cov Ratio     DFFITS     Intercept      liter     loggnp
  1      3.8364      0.9008        0.0343       1.0435     0.1699        0.0172     0.1318    -0.0659
  2      3.9221      0.9118        0.0146       1.0217     0.1108        0.0508     0.0192    -0.0360
  3      1.1223      0.2631        0.0410       1.0831     0.0544        0.0296    -0.0324    -0.0031
  4      5.6259      1.3304        0.0356       1.0052     0.2556        0.0842     0.2024    -0.1478
...
 75     -0.5827     -0.1381        0.0629       1.1107    -0.0358        0.0284     0.0012    -0.0252
 76     -4.2903     -1.0290        0.0714       1.0743    -0.2852        0.2332     0.0245    -0.2127
 77      1.2981      0.3039        0.0383       1.0791     0.0607       -0.0427    -0.0067     0.0418
 78     -0.7665     -0.1847        0.0924       1.1461    -0.0589        0.0435     0.0345    -0.0546
 79     -7.3187     -1.7272        0.0159       0.9387    -0.2196        0.0469     0.0615    -0.0941

 

Sum of Residuals                            0
Sum of Squared Residuals           1386.40551
Predicted Residual SS (PRESS)      1494.11641
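
To illustrate how the influence cutoffs are applied, the sketch below computes the large-n thresholds and screens only the nine observations shown in the edited output. It is an illustration, not part of the SAS analysis: it assumes n = 77 observations used and k = 2 predictors, applies the large-n versions of the cutoffs, and the tuple layout is an arbitrary choice.

```python
import math

n, k = 77, 2                          # observations used; predictors (liter, loggnp)
dfbetas_cut = 2 / math.sqrt(n)        # large-n cutoff for |DFBETAS|
dffits_cut = 2 * math.sqrt(k / n)     # large-n cutoff for |DFFITS|

# (obs, DFFITS, DFBETAS Intercept, DFBETAS liter, DFBETAS loggnp) from the edited output
rows = [
    (1,   0.1699,  0.0172,  0.1318, -0.0659),
    (2,   0.1108,  0.0508,  0.0192, -0.0360),
    (3,   0.0544,  0.0296, -0.0324, -0.0031),
    (4,   0.2556,  0.0842,  0.2024, -0.1478),
    (75, -0.0358,  0.0284,  0.0012, -0.0252),
    (76, -0.2852,  0.2332,  0.0245, -0.2127),
    (77,  0.0607, -0.0427, -0.0067,  0.0418),
    (78, -0.0589,  0.0435,  0.0345, -0.0546),
    (79, -0.2196,  0.0469,  0.0615, -0.0941),
]

# Observations whose DFFITS, or any DFBETAS, exceeds the cutoff in absolute value
flagged_dffits = [obs for obs, dffits, *_ in rows if abs(dffits) > dffits_cut]
flagged_dfbetas = [obs for obs, _, *dfb in rows if any(abs(b) > dfbetas_cut for b in dfb)]

print(f"|DFBETAS| cutoff = {dfbetas_cut:.4f}, |DFFITS| cutoff = {dffits_cut:.4f}")
print("Flagged by DFFITS: ", flagged_dffits)
print("Flagged by DFBETAS:", flagged_dfbetas)
```

Among the rows shown, only observation 76 is flagged, and only barely (its intercept DFBETAS of 0.2332 is just above the cutoff of about 0.228); none exceeds the DFFITS cutoff of about 0.322.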

 

 

How do you select variables that should be included in a model?

 

1.  Your KNOWLEDGE of the area.  Survey the literature. 

 

2.  Avoid including redundant predictors (examine scatterplot of predictors?  Avoid including 2 Xs with really high correlation).

 

3.  Automatic variable selection methods

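i.    All possible regressions (SAS PROC RSQUARE; in current SAS releases, PROC REG with the SELECTION=RSQUARE option)

ii.    Automatic selection (backward elimination, forward selection, stepwise; SAS PROC STEPWISE, or in current releases PROC REG with SELECTION=BACKWARD, FORWARD or STEPWISE)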

 

4.  Model averaging (newer idea that is gathering steam)