IES 612/STA 4-573/STA 4-576
Spring 2005
Week 04 – IES612-lecture-week04.doc
F Test of any relationship between Y and set of predictor variables
H0: b1 = b2 = … = bk = 0
Ha: at least one bi ≠ 0
TS: Fobs = [SS(Reg)/k] / [SS(Resid)/(n-k-1)] = MS(Reg)/MS(Resid)
RR: Reject H0 if Fobs > F(α; k, n-k-1), the upper-α critical value of the F distribution with k and n-k-1 degrees of freedom
Example: Life Expectancy across different countries – any association?

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 2 | 5678.11397 | 2839.05698 | 151.54 | <.0001 |
| Error | 74 | 1386.40551 | 18.73521 | | |
| Corrected Total | 76 | 7064.51948 | | | |

H0: b1=b2=0 [LIFE EXPECTANCY is not related to either LITERACY or LOGGNP]
H1: Either b1≠0 or b2≠0 or BOTH (b1≠0 AND b2≠0)
TS: Fobs=151.54
P-value<0.0001
Conclusion: Reject H0 and conclude LIFE EXPECTANCY is related to either LITERACY or LOGGNP or both.
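The F statistic in this example can be reproduced by hand from the sums of squares. The following is a minimal Python sketch (the course itself uses SAS; Python is used here only to check the arithmetic). The value n = 77 is inferred from the Corrected Total DF of 76, and the variable names are illustrative.

```python
# Values copied from the ANOVA table in the example above.
ss_reg = 5678.11397     # SS(Reg), the "Model" sum of squares
ss_resid = 1386.40551   # SS(Resid), the "Error" sum of squares
k = 2                   # number of predictors (LITERACY, LOGGNP)
n = 77                  # inferred: Corrected Total DF = n - 1 = 76

ms_reg = ss_reg / k                 # MS(Reg)
ms_resid = ss_resid / (n - k - 1)   # MS(Resid)
f_obs = ms_reg / ms_resid           # Fobs = MS(Reg)/MS(Resid)

print(f"MS(Reg)   = {ms_reg:.5f}")
print(f"MS(Resid) = {ms_resid:.5f}")
print(f"Fobs      = {f_obs:.2f}")
```

The printed values should agree with the Mean Square and F Value columns of the table up to rounding.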
(Partial) Test of bj
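H0: bj = 0
TS: tobs = (estimate of bj) / (standard error of the estimate of bj)
RR and P-value, by alternative hypothesis (t(n-k-1) denotes a t random variable with n-k-1 degrees of freedom):
Ha: bj ≠ 0: Reject H0 if |tobs| > t(α/2; n-k-1); P-value = 2 P(t(n-k-1) > |tobs|)
Ha: bj < 0: Reject H0 if tobs < -t(α; n-k-1); P-value = P(t(n-k-1) < tobs)
Ha: bj > 0: Reject H0 if tobs > t(α; n-k-1); P-value = P(t(n-k-1) > tobs)
Conclusions: Reject/Fail-to-reject H0?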
Example: Life Expectancy across different countries – testing single reg. parameters

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 23.51270 | 2.96162 | 7.94 | <.0001 |
| liter | 1 | 0.20117 | 0.02678 | 7.51 | <.0001 |
| loggnp | 1 | 8.86394 | 1.22709 | 7.22 | <.0001 |

H0: b2=0 [the prediction of LIFE EXPECTANCY is NOT improved by adding LOGGNP to a model already containing LITERACY]
H1: b2≠0 [LOGGNP is needed in addition to LITERACY for predicting LIFE EXPECTANCY]
TS: tobs=7.22
P-value<0.0001
Conclusion: Reject H0 and conclude that LOGGNP is a significant variable for modeling LIFE EXPECTANCY that adds to a model already containing LITERACY.
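Each t Value in the Parameter Estimates output is the parameter estimate divided by its standard error. The following is a small Python sketch (Python is used here only as a calculator; the course software is SAS) that recomputes the three ratios from the printed estimates. The dictionary layout and names are illustrative.

```python
# (estimate, standard error) pairs copied from the Parameter Estimates output.
estimates = {
    "Intercept": (23.51270, 2.96162),
    "liter": (0.20117, 0.02678),
    "loggnp": (8.86394, 1.22709),
}

# tobs = estimate / standard error, one ratio per regression parameter.
t_values = {name: est / se for name, (est, se) in estimates.items()}

for name, t_obs in t_values.items():
    print(f"{name:10s} tobs = {t_obs:.2f}")
```

The ratio for loggnp should reproduce the tobs = 7.22 used in the test of b2 above.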
Testing a subset of the predictors [General Linear Test]
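H0: bg+1 = bg+2 = … = bk = 0 [implies only the first g of the k predictor variables are needed, i.e. g+1 of the k+1 regression parameters counting the intercept]
Ha: not H0 [more than the REDUCED model is needed]
TS: Fobs = {[SS(Reg, Full) - SS(Reg, Reduced)]/(k-g)} / [SS(Resid, Full)/(n-k-1)]
RR: Reject H0 if Fobs > F(α; k-g, n-k-1)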
Example: Life Expectancy across different countries – all 5 variables needed?
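"Complete"/"Full" model -> LIFE EXPECTANCY = b0 + b1 LITERACY + b2 LOGGNP + b3 LogAREA + b4 LogPOPN + b5 PCTURBAN + error
"Reduced" model -> LIFE EXPECTANCY = b0 + b1 LITERACY + b2 LOGGNP + error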
H0: b3=b4=b5=0 [LogAREA, LogPOPN and PCTURBAN do not add to a model already containing LITERACY and LOGGNP]
H1: at least one of (b3, b4, b5)≠0
TS: Fobs=0.30 [see SAS output below]
P-value=0.8223
Conclusion: Fail to Reject H0 and conclude that LogAREA, LogPOPN and PCTURBAN do not appear to significantly improve a LIFE EXPECTANCY model that already contains LITERACY and LOGGNP as predictor variables.
/* SAS code for testing a subset of parameters in a model */
data country;
  title 'country data analysis';
  infile "\\Casnov5\MST\MSTLab\Baileraj\country.data";  * reads a data file;
  input name $ area popnsize pcturban lang $ liter lifemen
        lifewom pcGNP;
  logarea = log10(area);
  logpopn = log10(popnsize);
  loggnp = log10(pcGNP);
  drop area popnsize pcgnp;
proc reg;
  title 'LIFEWOM predicted from PCTURBAN LITER LOGAREA LOGPOPN LOGGNP';
  model lifewom = pcturban liter logarea logpopn loggnp;
  test pcturban=logarea=logpopn=0;  * for testing subset;
run;
LIFEWOM predicted from PCTURBAN LITER LOGAREA LOGPOPN LOGGNP
The REG Procedure
Model: MODEL1
Dependent Variable: lifewom
Number of Observations Read                    79
Number of Observations Used                    67
Number of Observations with Missing Values     12
COMMENT: Some observations were missing values for one or more of the predictor variables. SAS deletes records that are not complete on ALL variables in the model. This is why the regression model with only LITER and LOGGNP as predictors is fit to a different number of observations (its Corrected Total DF is 76, so n = 77, versus 67 here).
Analysis of Variance
                           Sum of          Mean
Source             DF     Squares        Square    F Value    Pr > F
Model               5  4473.89310     894.77862      43.96    <.0001
Error              61  1241.74869      20.35654
Corrected Total    66  5715.64179

Root MSE           4.51182    R-Square    0.7827
Dependent Mean    64.77612    Adj R-Sq    0.7649
Coeff Var          6.96525

Parameter Estimates
                    Parameter      Standard
Variable     DF      Estimate         Error    t Value    Pr > |t|
Intercept     1      27.79999       4.53708       6.13      <.0001
pcturban      1       0.02241       0.03757       0.60      0.5530
liter         1       0.19211       0.03180       6.04      <.0001
logarea       1      -0.41442       0.93342      -0.44      0.6586
logpopn       1      -0.26259       1.06069      -0.25      0.8053
loggnp        1       7.73888       1.81985       4.25      <.0001
COMMENT: The partial (single-parameter) tests also cast doubt on whether PCTURBAN, LOGAREA and LOGPOPN add to the model; however, these tests do not test ALL of these variables simultaneously.
Model: MODEL1
Test 1 Results for Dependent Variable lifewom
                         Mean
Source         DF      Square    F Value    Pr > F
Numerator       3     6.19064       0.30    0.8223
Denominator    61    20.35654
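The F Value reported by the TEST statement is the numerator mean square divided by the denominator mean square, where the denominator is the MS(Error) of the full model. The following is a brief Python sketch (used only to check the arithmetic; the analysis itself is done in SAS) that recomputes it from the "Test 1 Results" output. The variable names are illustrative.

```python
# Values copied from the "Test 1 Results" output above.
ms_num, df_num = 6.19064, 3     # numerator: extra SS per dropped predictor
ms_den, df_den = 20.35654, 61   # denominator: MS(Error) of the full model

f_obs = ms_num / ms_den

# Extra regression sum of squares explained by the three tested predictors,
# i.e. SS(Reg, Full) - SS(Reg, Reduced) on the 67 complete observations.
ss_extra = ms_num * df_num

print(f"Fobs     = {f_obs:.2f} on ({df_num}, {df_den}) DF")
print(f"Extra SS = {ss_extra:.5f}")
```

An F statistic well below 1, as here, means the three predictors together explain less additional variation per degree of freedom than the residual noise level.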
How stable is the model fit? Are the predictor variables highly correlated?
Collinearity refers to the predictor variables being highly correlated, i.e. do the variables provide redundant information? This can be assessed in several ways:
1. Does a scatterplot (or pairwise r) of Xi vs. Xj suggest high correlation?
2. Is the R² large when Xi is predicted from X1,…,Xi-1,Xi+1,…,Xk?
Or is Tolerance = 1-R² small?
Or is VIF = 1/(1-R²) large? (say >10)
3. Are there small eigenvalues/large condition numbers? (These are properties of a matrix defined from the collection of predictor variables.)
proc reg data=country;
  title 'LITER and LOGGNP as predictors of Life expectancy of women';
  model lifewom = liter loggnp / tol vif collinoint;
run;
Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| | Tolerance | Variance Inflation |
|---|---|---|---|---|---|---|---|
| Intercept | 1 | 23.51270 | 2.96162 | 7.94 | <.0001 | . | 0 |
| liter | 1 | 0.20117 | 0.02678 | 7.51 | <.0001 | 0.58823 | 1.70001 |
| loggnp | 1 | 8.86394 | 1.22709 | 7.22 | <.0001 | 0.58823 | 1.70001 |

Collinearity Diagnostics (intercept adjusted)

| Number | Eigenvalue | Condition Index | Proportion of Variation: liter | Proportion of Variation: loggnp |
|---|---|---|---|---|
| 1 | 1.64169 | 1.00000 | 0.17915 | 0.17915 |
| 2 | 0.35831 | 2.14051 | 0.82085 | 0.82085 |

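The diagnostics in this output are tied together by simple formulas when there are exactly two predictors. The following Python sketch (an arithmetic check only, not part of the SAS analysis) starts from the printed Tolerance and recovers the VIF, the eigenvalues and the condition index. It relies on two facts that hold only in the two-predictor case: the R² from regressing one predictor on the other equals the squared pairwise correlation r², and the eigenvalues of the 2x2 correlation matrix are 1 + |r| and 1 - |r|. The variable names are illustrative.

```python
import math

tolerance = 0.58823          # copied from the Parameter Estimates output

vif = 1 / tolerance          # VIF = 1/(1 - R^2) = 1/Tolerance
r_squared = 1 - tolerance    # R^2 of liter predicted from loggnp (and vice versa)
abs_r = math.sqrt(r_squared) # |pairwise correlation|; valid for two predictors only

# Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]].
eig_large, eig_small = 1 + abs_r, 1 - abs_r

# Condition index of the smallest eigenvalue: sqrt(largest / smallest).
condition_index = math.sqrt(eig_large / eig_small)

print(f"VIF             = {vif:.5f}")
print(f"|r|             = {abs_r:.4f}")
print(f"Eigenvalues     = {eig_large:.5f}, {eig_small:.5f}")
print(f"Condition index = {condition_index:.4f}")
```

A VIF near 1.7 is far below the rule-of-thumb value of 10, so collinearity between LITER and LOGGNP does not appear to be a concern in this model.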
How about points that exhibit high influence on the fit of the model?
INFLUENCE diagnostics measure the impact of a particular data point on the fit of a model. These diagnostics can look at how an estimated coefficient (DFBETAs) or the prediction equation (DFFITs) changes with the inclusion or exclusion of a data point. These measures are usually standardized first. Potential concern:
|DFBETAs| larger than 1 (small n) or 2/sqrt(n) (large n)
|DFFITs| larger than 1 (small n) or 2 sqrt(k/n) (large n)
Other measures (e.g. Cook’s D) can also be used.
proc reg data=country;
  title 'LITER and LOGGNP as predictors of Life expectancy of women';
  model lifewom = liter loggnp / influence;
run;
(SAS output edited)
Dependent Variable: lifewom
Output Statistics

| Obs | Residual | RStudent | Hat Diag H | Cov Ratio | DFFITS | DFBETAS Intercept | DFBETAS liter | DFBETAS loggnp |
|---|---|---|---|---|---|---|---|---|
| 1 | 3.8364 | 0.9008 | 0.0343 | 1.0435 | 0.1699 | 0.0172 | 0.1318 | -0.0659 |
| 2 | 3.9221 | 0.9118 | 0.0146 | 1.0217 | 0.1108 | 0.0508 | 0.0192 | -0.0360 |
| 3 | 1.1223 | 0.2631 | 0.0410 | 1.0831 | 0.0544 | 0.0296 | -0.0324 | -0.0031 |
| 4 | 5.6259 | 1.3304 | 0.0356 | 1.0052 | 0.2556 | 0.0842 | 0.2024 | -0.1478 |
| … | … | … | … | … | … | … | … | … |
| 75 | -0.5827 | -0.1381 | 0.0629 | 1.1107 | -0.0358 | 0.0284 | 0.0012 | -0.0252 |
| 76 | -4.2903 | -1.0290 | 0.0714 | 1.0743 | -0.2852 | 0.2332 | 0.0245 | -0.2127 |
| 77 | 1.2981 | 0.3039 | 0.0383 | 1.0791 | 0.0607 | -0.0427 | -0.0067 | 0.0418 |
| 78 | -0.7665 | -0.1847 | 0.0924 | 1.1461 | -0.0589 | 0.0435 | 0.0345 | -0.0546 |
| 79 | -7.3187 | -1.7272 | 0.0159 | 0.9387 | -0.2196 | 0.0469 | 0.0615 | -0.0941 |

Sum of Residuals: 0
Sum of Squared Residuals: 1386.40551
Predicted Residual SS (PRESS): 1494.11641
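The large-n cutoffs quoted earlier can be applied to the rows displayed in this output. The following Python sketch (an illustration only; the SAS output is edited, so just the displayed observations are checked) assumes n = 77 observations were used, inferred from the Corrected Total DF of 76 for this model, and k = 2 predictors. Whether n = 77 counts as "large" is a judgment call, and the names below are illustrative.

```python
import math

n, k = 77, 2                          # assumed: 77 observations used, 2 predictors
dfbetas_cut = 2 / math.sqrt(n)        # large-n cutoff for |DFBETAS|
dffits_cut = 2 * math.sqrt(k / n)     # large-n cutoff for |DFFITS|

# obs: (DFFITS, (DFBETAS Intercept, DFBETAS liter, DFBETAS loggnp)),
# copied from the displayed rows of the Output Statistics table.
rows = {
    1: (0.1699, (0.0172, 0.1318, -0.0659)),
    2: (0.1108, (0.0508, 0.0192, -0.0360)),
    3: (0.0544, (0.0296, -0.0324, -0.0031)),
    4: (0.2556, (0.0842, 0.2024, -0.1478)),
    75: (-0.0358, (0.0284, 0.0012, -0.0252)),
    76: (-0.2852, (0.2332, 0.0245, -0.2127)),
    77: (0.0607, (-0.0427, -0.0067, 0.0418)),
    78: (-0.0589, (0.0435, 0.0345, -0.0546)),
    79: (-0.2196, (0.0469, 0.0615, -0.0941)),
}

# Flag an observation if any of its diagnostics exceeds the cutoff in absolute value.
flagged_dffits = [obs for obs, (dffits, _) in rows.items()
                  if abs(dffits) > dffits_cut]
flagged_dfbetas = [obs for obs, (_, dfbetas) in rows.items()
                   if any(abs(b) > dfbetas_cut for b in dfbetas)]

print(f"|DFBETAS| cutoff = {dfbetas_cut:.4f}, flagged obs: {flagged_dfbetas}")
print(f"|DFFITS|  cutoff = {dffits_cut:.4f}, flagged obs: {flagged_dffits}")
```

Among the displayed rows, only observation 76 crosses the DFBETAS cutoff, and only marginally (for the intercept). Under the small-n cutoff of 1, no displayed observation would be flagged.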
How do you select variables that should be included in a model?
1. Your KNOWLEDGE of the area. Survey the literature.
2. Avoid including redundant predictors (examine scatterplots of the predictors; avoid including two Xs that are very highly correlated).
3. Automatic variable selection methods
i. All possible regressions (SAS PROC RSQUARE)
ii. Automatic selection (backward elimination, forward selection, stepwise - SAS PROC STEPWISE)
4. Model averaging (a newer idea that is gathering steam)