IES 612/STA 4-573/STA 4-576
Spring 2005
Week 03 IES612-lecture-week03.doc
Checking Model Assumptions (OL 13.4): an initial visit
RECALL: Basic Model
Yi = β0 + β1Xi + εi [simple linear regression]
εi ~ indep. N(0, σ²)
Definition:
[Def 1] (Raw) Residuals = observed response − predicted response
or
$e_i = Y_i - \hat{Y}_i$
[Def 2] (Standardized Residuals) ![]()
[Def 3] (Studentized Residuals) ![]()
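Terminology for scaled residuals varies between textbooks and software. One common convention, offered here as an assumption rather than as the definitions intended in Def 2 and Def 3, scales the raw residual by an estimate of its standard deviation. The leverage symbol $h_{ii}$ is notation introduced for this sketch.

```latex
% e_i: raw residual; s: root mean square error; h_{ii}: leverage of observation i (assumed notation)
\[ e_i = Y_i - \hat{Y}_i \]
% A common "standardized" form: divide by the root mean square error
\[ \frac{e_i}{s}, \qquad s = \sqrt{MS(Resid)} \]
% A common "studentized" form: also adjust for the leverage of the point
\[ \frac{e_i}{s\sqrt{1 - h_{ii}}} \]
```

Under either scaling, values beyond roughly ±2 or ±3 are the ones usually flagged for a closer look.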
Assumptions, diagnostics (how do you check the assumption?), and remediation:

1. Assumption: E(εi) = 0 ==> E(Yi) = β0 + β1Xi ==> a line is a reasonable model for describing mean change as a function of x
   Diagnostics:
   D1.1: Plot ei vs. Ŷi
   D1.2: Plot ei vs. xi [check to see if pattern exists]
   D1.3: Plot Yi vs. xi and superimpose plot of Ŷi
   D1.4: Large R²/signif. slope
   Remediation:
   Curvature? Polynomial regression model or nonlinear regression model
   Smooth regression? LOWESS
   Transformation? Log/square root

2. Assumption: V(εi) = σ² ==> V(Yi) = σ² ==> constant variance ==> scatter about the line is the same regardless of the value of x
   Diagnostics:
   D2.1: Plot ei vs. Ŷi [check to see if you have a constant band about zero]
   Remediation:
   Weighted Least Squares?
   Transformation

3. Assumption: εi ~ Normal
   Diagnostics:
   D3.1: Normal probability plot of ei [see if linear]
   D3.2: Histogram of residuals [bell-shaped?]
   Remediation:
   Transformation?
   Generalized Linear Models (e.g. logistic/probit regression for dichotomous responses; Poisson regression for count responses)

4. Assumption: εi independent
   Diagnostics:
   D4.1: Generally, examining the design can suggest if this is true
   D4.2: Durbin-Watson test
   Remediation:
   Correlated regression models?
   Time series/spatial methods

5.* Assumption: no important omitted variables {relates to pt. 1}
   Diagnostics:
   D5.1: Plot ei vs. omitted variables [see if pattern]
   Remediation:
   Add omitted variable to the model (multiple regression)

6.* Assumption: no points exerting undue influence
   Diagnostics:
   D6.1: Look at statistics that quantify influence (e.g. DFBETAS, DFFITS, etc.)
   D6.2: Look for extreme X values (break in stemplots of X)
   Remediation:
   Smooth model-robust fitting procedure (e.g. Least Absolute Value regression)

7.* Assumption: no extreme outliers impacting inference
   Diagnostics:
   D7.1: Large residual (e.g. |standardized/studentized residual| > 3, or > 2?)
   D7.2: Break in stemplot of residuals
   Remediation:
   Check whether the data sheet is correct and fix it if not. Don't simply omit the point.
   Report the analysis both including and excluding the point?
options ls=75;
data example1;
  input year nboats manatees;
  cards;
77 447 13
78 460 21
79 481 24
80 498 16
81 513 24
82 512 20
83 526 15
84 559 34
85 585 33
86 614 33
87 645 39
88 675 43
89 711 50
90 719 47
;
ODS RTF;
*file='D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\linreg-output.rtf';
proc reg;
  title 'Number of Manatees killed regressed on the number of boats registered in Florida';
  model manatees = nboats / p r cli clm;
  plot manatees*nboats p.*nboats / overlay;
  plot r.*nboats r.*p.; * residuals vs x and yhat;
  plot r.*nqq.; * normal qqplot;
run;
ODS RTF CLOSE;
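The influence and independence diagnostics listed in the assumptions table (DFFITS, DFBETAS, Durbin-Watson) can be requested from the same procedure. The following is a minimal sketch, assuming the data set example1 created by the program above is still in the WORK library; the output data set name diag and the variable names yhat, resid, stud, cooksd, lev and dfits are hypothetical names chosen for illustration.

```sas
/* Sketch: extra diagnostics for the manatee regression.           */
/* Assumes WORK.EXAMPLE1 exists; DIAG and its variable names are   */
/* illustrative choices, not part of the original program.         */
proc reg data=example1;
  /* r = residual analysis, influence = DFFITS/DFBETAS, dw = Durbin-Watson */
  model manatees = nboats / r influence dw;
  /* save fitted values, residuals and influence measures */
  output out=diag p=yhat r=resid student=stud cookd=cooksd h=lev dffits=dfits;
run;
quit;

proc print data=diag;
  var year nboats manatees yhat resid stud cooksd lev dfits;
run;
```

Because the observations are yearly counts, the Durbin-Watson statistic is a natural check here: successive years may well be correlated.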
Residuals plot: model adequate? Constant variance?

* now in Excel


* now in Excel

Studentized Residuals: outliers?

Output Statistics
Obs    -2-1 0 1 2      Cook's D
  1   |      |      |     0.017
  2   |      |**    |     0.178
  3   |      |**    |     0.149
  4   |    **|      |     0.091
  5   |      |      |     0.006
  6   |     *|      |     0.021
  7   |  ****|      |     0.244
  8   |      |**    |     0.073
  9   |      |      |     0.005
 10   |     *|      |     0.015
 11   |      |      |     0.000
 12   |      |      |     0.000
 13   |      |*     |     0.091
 14   |      |      |     0.027
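The Cook's D column combines the size of the studentized residual with the leverage of the point. A sketch of the usual formula follows, where $r_i$ is the studentized residual, $h_{ii}$ is the leverage, and $k+1$ is the number of regression coefficients (here 2); the symbols $r_i$ and $h_{ii}$ are notation assumed for this illustration.

```latex
% Cook's distance for observation i (r_i: studentized residual, h_{ii}: leverage)
\[ D_i = \frac{r_i^2}{k+1} \cdot \frac{h_{ii}}{1 - h_{ii}} \]
```

A commonly quoted rule of thumb treats $D_i > 1$ as a sign of substantial influence. The largest value in the table is 0.244 (observation 7), which is also the observation with the most negative studentized residual.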

Normal errors? - Normal quantile-quantile plot

Multiple Regression (OL Chapter 12)
* More than one predictor variable
Example: Lung function in miners exposed to coal dust
![]()
Example: Polynomial regression
or ![]()
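Example: Indicator variables, e.g. different lines in different groups
$Y_i = \beta_0 + \beta_1 I_{group2,i} + \beta_2 X_i + \beta_3 I_{group2,i} X_i + \varepsilon_i$
where Igroup2 = 1 (group 2) and Igroup2 = 0 (group 1)
Group 1: $E(Y_i) = \beta_0 + \beta_2 X_i$
Group 2: $E(Y_i) = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) X_i$
So,
GROUP 2 INTERCEPT differs from GROUP 1 intercept by β1
GROUP 2 SLOPE differs from GROUP 1 slope by β3
GENERAL FORM:
$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \varepsilon_i$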
Comments:
1. LINEAR model because the regression coefficients enter the model in a linear way (compare with a model in which a coefficient enters nonlinearly)
So, how does a multiple regression model (MR) differ from simple linear regression (SLR)?
i. SLR is the equation of a LINE; MR is the equation of a (hyper-)PLANE
ii. β0 is the mean response when X=0 in SLR while β0 is the mean response when ALL Xs=0 in MR
iii. 2 regression coefficients in SLR; k+1 regression coefficients in MR
iv. Interpretation of coefficients? Partial coefficients in MR: βj is the change in mean response per unit increase in Xj with the other Xs held fixed
v. Model scope (space covered by the Xs)
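To make Comment 1 concrete, here is an illustrative pair of models. They are examples chosen for this sketch, not necessarily the comparison used in the original lecture. The first is curved in X but still linear in the coefficients, so it is a multiple regression model; the second is not.

```latex
% Linear in the coefficients (quadratic in X, but each beta enters linearly)
\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i \]
% Nonlinear in the coefficients (beta_1 sits inside the exponential)
\[ Y_i = \beta_0 \exp(\beta_1 X_i) + \varepsilon_i \]
```

The first model can be fit by ordinary least squares by treating $X$ and $X^2$ as two predictors; the second needs a nonlinear regression method.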
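Estimating regression coefficients
Least squares: minimize $\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \dots - \beta_k X_{ik})^2$
Estimate of σ²:
$\hat{\sigma}^2 = MS(Resid) = SS(Resid)/(n-k-1)$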
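For readers comfortable with matrices, the least squares solution has a compact closed form. This is a sketch in matrix notation that is not used elsewhere in these notes: $X$ is assumed to be the $n \times (k+1)$ matrix whose first column is all ones and whose remaining columns hold the predictors, and $X^{\top}X$ is assumed to be invertible.

```latex
% Least squares estimates, fitted values, and the variance estimate
\[ \hat{\boldsymbol{\beta}} = (X^{\top}X)^{-1}X^{\top}Y \]
\[ \hat{Y} = X\hat{\boldsymbol{\beta}} \]
\[ \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-k-1} \]
```

The divisor $n-k-1$ reflects the $k+1$ coefficients estimated from the data.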

F Test of any relationship between Y and set of predictor variables
H0: β1 = β2 = ... = βk = 0
Ha: at least one βj ≠ 0
TS: Fobs = [SS(Reg)/k] / [SS(Resid)/(n-k-1)] = MS(Reg)/MS(Resid)
RR: Reject H0 if Fobs > F(α; k, n-k-1)
Conclusions
where $SS(Reg) = \sum_i (\hat{Y}_i - \bar{Y})^2$ and $SS(Resid) = \sum_i (Y_i - \hat{Y}_i)^2$

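(Partial) Test of βj
H0: βj = 0
Ha: (1) βj ≠ 0   (2) βj < 0   (3) βj > 0
TS: $t_{obs} = \hat{\beta}_j / s_{\hat{\beta}_j}$
where
$s_{\hat{\beta}_j} = s_{\varepsilon} \sqrt{\dfrac{1}{\sum_i (X_{ij} - \bar{X}_j)^2 \, (1 - R_j^2)}}$ with $s_{\varepsilon} = \sqrt{MS(Resid)}$
==> R²j here is the proportion of the variation in predictor Xj accounted for by all of the other predictors
==> VIF = 1/(1 − R²j) is a diagnostic of collinearity (max VIF > 10 is a concern; Neter et al.)
RR: Reject H0 if
(1) |tobs| > t(α/2; n-k-1)   (2) tobs < -t(α; n-k-1)   (3) tobs > t(α; n-k-1)
Conclusions: Reject/Fail-to-reject H0?
P-value:
(1) 2 P(t(n-k-1) > |tobs|)   (2) P(t(n-k-1) < tobs)   (3) P(t(n-k-1) > tobs)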
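Testing a subset of the predictors
H0: βg+1 = βg+2 = ... = βk = 0 [implies only the first g of the k predictor variables, i.e. g+1 of the k+1 regression coefficients, are needed]
Ha: not H0
TS: $F_{obs} = \dfrac{[SS(Reg, complete) - SS(Reg, reduced)]/(k-g)}{SS(Resid, complete)/(n-k-1)}$
RR: Reject H0 if Fobs > F(α; k-g, n-k-1)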
Example: Life Expectancy across different countries
data country;
  title 'country data analysis';
  infile "\\Casnov5\MST\MSTLab\Baileraj\country.data"; * reads a data file;
  input name $ area popnsize pcturban lang $ liter lifemen lifewom pcGNP;
  logarea = log10(area);
  logpopn = log10(popnsize);
  loggnp = log10(pcGNP);
  ienglish = (lang="English");
  drop area popnsize pcgnp;
proc print;
run;
/*
  to generate a scatterplot matrix
    Solutions > Analysis > Interactive Data Analysis
    - open data set WORK > COUNTRY
    - select columns (CTRL and click column labels)
    - Analyze > Scatter Plot (YX)
  to generate regression fit via this interactive data analysis
    Analyze > Fit
*/
ods html;
proc reg data=country;
  title 'predicting life expectancy of women in different countries';
  model lifewom = loggnp;
  output out=new1 p=yhat r=resid;
run;
proc plot data=new1 hpercent=50 vpercent=75;
  title 'residual plots for LIFEWOM = LOGGNP model';
  plot resid*(yhat liter);
run;
proc reg data=country;
  title 'LITER and LOGGNP as predictors of Life expectancy of women';
  model lifewom = liter;
  model lifewom = loggnp;
  model lifewom = liter loggnp;
run;
proc reg;
  title 'LIFEWOM predicted from PCTURBAN LITER LOGAREA LOGPOPN LOGGNP';
  model lifewom = pcturban liter logarea logpopn loggnp;
  plot r.*p. nqq.*r.;
run;
proc reg data=country;
  title 'LITER and LOGGNP as predictors of Life expectancy of women';
  model lifewom = liter loggnp / tol vif collinoint;
  output out=new p=yhat r=resid;
run;
proc univariate data=new plot;
  id name;
  var resid;
run;
proc plot hpercent=50 vpercent=50;
  plot resid*yhat=ienglish resid*liter=ienglish resid*loggnp=ienglish;
run;
ods html close;

The REG Procedure
Model: MODEL1
Dependent Variable: lifewom

Number of Observations Read                    79
Number of Observations Used                    78
Number of Observations with Missing Values      1

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        4793.33759     4793.33759     148.93    <.0001
Error              76        2446.11113       32.18567
Corrected Total    77        7239.44872

Root MSE           5.67324    R-Square    0.6621
Dependent Mean    64.85897    Adj R-Sq    0.6577
Coeff Var          8.74704

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              19.42550           3.77797       5.14      <.0001
loggnp        1              14.83433           1.21557      12.20      <.0001
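The summary statistics in this output can be reproduced by hand from the Analysis of Variance and Parameter Estimates tables, which is a useful check on how the pieces fit together. The arithmetic below uses only numbers printed in the output for the lifewom = loggnp fit.

```latex
% Overall F statistic: ratio of mean squares
\[ F_{obs} = \frac{MS(Reg)}{MS(Resid)} = \frac{4793.33759}{32.18567} \approx 148.93 \]
% R-square: share of the corrected total sum of squares explained by the model
\[ R^2 = \frac{SS(Reg)}{SS(Total)} = \frac{4793.33759}{7239.44872} \approx 0.6621 \]
% Root MSE: square root of the error mean square
\[ \text{Root MSE} = \sqrt{32.18567} \approx 5.673 \]
% t statistic for the slope, and its square
\[ t_{obs} = \frac{14.83433}{1.21557} \approx 12.20, \qquad t_{obs}^2 \approx 148.9 \approx F_{obs} \]
```

With a single predictor the overall F test and the two-sided t test of the slope are the same test, which is why $t_{obs}^2$ matches $F_{obs}$.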
[Figure: two PROC PLOT scatterplots, resid*yhat and resid*liter (legend: A = 1 obs, B = 2 obs, etc.). Vertical axis: Residual, -15 to 15. Horizontal axes: Predicted Value of lifewom (40 to 100) and liter (0 to 100). NOTE: 1 obs had missing values (resid*yhat); 2 obs had missing values (resid*liter).]
The REG Procedure
Model: MODEL1
Dependent Variable: lifewom

Number of Observations Read                    79
Number of Observations Used                    77
Number of Observations with Missing Values      2

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        4700.51263     4700.51263     149.13    <.0001
Error              75        2364.00685       31.52009
Corrected Total    76        7064.51948

Root MSE           5.61428    R-Square    0.6654
Dependent Mean    64.68831    Adj R-Sq    0.6609
Coeff Var          8.67896

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              41.85909           1.97590      21.18      <.0001
liter         1               0.32529           0.02664      12.21      <.0001
The REG Procedure
Model: MODEL2
Dependent Variable: lifewom

Number of Observations Read                    79
Number of Observations Used                    77
Number of Observations with Missing Values      2

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        4620.58250     4620.58250     141.80    <.0001
Error              75        2443.93698       32.58583
Corrected Total    76        7064.51948

Root MSE           5.70840    R-Square    0.6541
Dependent Mean    64.68831    Adj R-Sq    0.6494
Coeff Var          8.82447

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              19.57316           3.84413       5.09      <.0001
loggnp        1              14.77981           1.24118      11.91      <.0001
The REG Procedure
Model: MODEL3
Dependent Variable: lifewom

Number of Observations Read                    79
Number of Observations Used                    77
Number of Observations with Missing Values      2

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        5678.11397     2839.05698     151.54    <.0001
Error              74        1386.40551       18.73521
Corrected Total    76        7064.51948

Root MSE           4.32842    R-Square    0.8038
Dependent Mean    64.68831    Adj R-Sq    0.7984
Coeff Var          6.69119

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              23.51270           2.96162       7.94      <.0001
liter         1               0.20117           0.02678       7.51      <.0001
loggnp        1               8.86394           1.22709       7.22      <.0001
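MODEL2 (loggnp only) and MODEL3 (liter and loggnp) are fit to the same 77 observations, so they can serve as the reduced and complete models in the subset F test. The worked arithmetic below tests whether liter adds predictive value beyond loggnp (k = 2, g = 1), using only numbers from the two outputs.

```latex
% Overall F test for MODEL3 (k = 2, n = 77)
\[ F_{obs} = \frac{5678.11397/2}{1386.40551/74} = \frac{2839.05698}{18.73521} \approx 151.54 \]
% Subset (partial) F test: does liter add to a model already containing loggnp?
\[ F_{obs} = \frac{[SS(Reg,\ \text{complete}) - SS(Reg,\ \text{reduced})]/(k-g)}{MS(Resid,\ \text{complete})}
           = \frac{(5678.11397 - 4620.58250)/1}{18.73521}
           = \frac{1057.53147}{18.73521} \approx 56.4 \]
% Compare with the squared partial t statistic for liter in MODEL3
\[ t_{obs}^2 = 7.51^2 \approx 56.4 \]
```

When a single coefficient is tested, the subset F statistic equals the square of the partial t statistic, so the two tests agree.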
LIFEWOM predicted from PCTURBAN LITER LOGAREA LOGPOPN LOGGNP
The REG Procedure
Model: MODEL1
Dependent Variable: lifewom

Number of Observations Read                    79
Number of Observations Used                    67
Number of Observations with Missing Values     12

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5        4473.89310      894.77862      43.96    <.0001
Error              61        1241.74869       20.35654
Corrected Total    66        5715.64179

Root MSE           4.51182    R-Square    0.7827
Dependent Mean    64.77612    Adj R-Sq    0.7649
Coeff Var          6.96525

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              27.79999           4.53708       6.13      <.0001
pcturban      1               0.02241           0.03757       0.60      0.5530
liter         1               0.19211           0.03180       6.04      <.0001
logarea       1              -0.41442           0.93342      -0.44      0.6586
logpopn       1              -0.26259           1.06069      -0.25      0.8053
loggnp        1               7.73888           1.81985       4.25      <.0001
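The three large P-values (pcturban, logarea, logpopn) suggest asking whether all three predictors can be dropped at once, which is the subset F test with k = 5 and g = 2. The sums of squares from this fit cannot simply be compared with those of the two-predictor fit, because this model uses 67 observations while the two-predictor model uses 77. A minimal sketch of one way around this is the TEST statement of PROC REG, which carries out the test within a single fit; it assumes the WORK.COUNTRY data set built earlier, and the statement label dropthree is a hypothetical name.

```sas
/* Sketch: simultaneous test that pcturban, logarea and logpopn   */
/* can be dropped from the five-predictor model.                  */
/* Assumes WORK.COUNTRY exists; the label "dropthree" is made up. */
proc reg data=country;
  model lifewom = pcturban liter logarea logpopn loggnp;
  dropthree: test pcturban = 0, logarea = 0, logpopn = 0;
run;
quit;
```

The resulting F statistic has 3 numerator and 61 denominator degrees of freedom, matching k − g and n − k − 1 for this fit.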

The REG Procedure
Model: MODEL1
Dependent Variable: lifewom

Number of Observations Read                    79
Number of Observations Used                    77
Number of Observations with Missing Values      2

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        5678.11397     2839.05698     151.54    <.0001
Error              74        1386.40551       18.73521
Corrected Total    76        7064.51948

Root MSE           4.32842    R-Square    0.8038
Dependent Mean    64.68831    Adj R-Sq    0.7984
Coeff Var          6.69119

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Tolerance    Variance Inflation
Intercept     1              23.51270           2.96162       7.94      <.0001            .                     0
liter         1               0.20117           0.02678       7.51      <.0001      0.58823               1.70001
loggnp        1               8.86394           1.22709       7.22      <.0001      0.58823               1.70001

Collinearity Diagnostics (intercept adjusted)
                                           Proportion of Variation
Number    Eigenvalue    Condition Index    liter      loggnp
     1       1.64169            1.00000    0.17915    0.17915
     2       0.35831            2.14051    0.82085    0.82085
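The Tolerance, Variance Inflation, and Condition Index columns are all simple functions of how strongly the predictors are related to each other. The arithmetic below connects them, using only numbers from this output; $R_j^2$ is the R² from regressing one predictor on the other.

```latex
% Tolerance is 1 - R_j^2, so R_j^2 can be recovered from it
\[ R_j^2 = 1 - \text{Tolerance} = 1 - 0.58823 = 0.41177 \]
% VIF is the reciprocal of the tolerance
\[ VIF = \frac{1}{1 - R_j^2} = \frac{1}{0.58823} \approx 1.70 \]
% With only two predictors, R_j^2 is their squared correlation
\[ |r_{liter,\,loggnp}| = \sqrt{0.41177} \approx 0.64 \]
% Condition index: square root of the ratio of largest to smallest eigenvalue
\[ \sqrt{\frac{1.64169}{0.35831}} \approx 2.14 \]
```

A VIF of about 1.7 is far below the threshold of 10 mentioned earlier, so collinearity between liter and loggnp does not look like a concern for this model.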
The UNIVARIATE Procedure
Variable: resid (Residual)

Moments
N                          77    Sum Weights                 77
Mean                        0    Sum Observations             0
Std Deviation      4.27108625    Variance            18.2421778
Skewness           -0.5505371    Kurtosis            0.31026345
Uncorrected SS     1386.40551    Corrected SS        1386.40551
Coeff Variation             .    Std Error Mean      0.48673545

Basic Statistical Measures
Location                 Variability
Mean      0.000000       Std Deviation            4.27109
Median    0.920584       Variance                18.24218
Mode      .              Range                   21.17584
                         Interquartile Range      5.71274

Quantiles (Definition 5)
Quantile        Estimate
100% Max       10.316175
99%            10.316175
95%             5.708294
90%             5.088762
75% Q3          3.100406
50% Median      0.920584
25% Q1         -2.612333
10%            -6.243490
5%             -7.757859
1%            -10.859669
0% Min        -10.859669

Extreme Observations
----------Lowest-----------      ----------Highest----------
    Value    name       Obs          Value    name       Obs
-10.85967    swazilan    47        5.62592    jamaica      4
-10.81760    cameroon    33        5.70829    kenya       37
-10.62208    ethiopia    35        5.97978    elsalvad     9
 -7.75786    turkey      31        7.25135    guyana      21
 -7.68269    vanuatu     66       10.31617    china       56

Missing Values
Missing Value    Count    Percent of All Obs    Percent of Missing Obs
.                    2                  2.53                    100.00

Stem Leaf            #  Boxplot
  10 3               1     |
   9                       |
   8                       |
   7 3               1     |
   6 0               1     |
   5 11367           5     |
   4 134             3     |
   3 1123378899     10  +-----+
   2 0133457         7  |     |
   1 112335578       9  |     |
   0 33699           5  *--+--*
  -0 866311          6  |     |
  -1 65332           5  |     |
  -2 97763210        8  +-----+
  -3 97440           5     |
  -4 3               1     |
  -5 53              2     |
  -6 2               1     |
  -7 8733            4     |
  -8                       |
  -9                       |
 -10 986             3     |
     ----+----+----+----+
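The standard deviation of the residuals reported by PROC UNIVARIATE (4.271) is slightly smaller than the Root MSE from PROC REG (4.328) even though both are built from the same sum of squares. The difference is only in the divisor, as the arithmetic below shows using numbers from the two outputs.

```latex
% PROC UNIVARIATE divides the residual sum of squares by N - 1
\[ \frac{1386.40551}{77 - 1} \approx 18.2422, \qquad \sqrt{18.2422} \approx 4.271 \]
% PROC REG divides by n - k - 1 (here 77 - 2 - 1 = 74)
\[ \frac{1386.40551}{74} \approx 18.7352, \qquad \sqrt{18.7352} \approx 4.328 \]
% Scaling the most negative raw residual (swazilan) by Root MSE
\[ \frac{-10.85967}{4.32842} \approx -2.51 \]
```

The regression divisor accounts for the three estimated coefficients. On that scale the three lowest residuals (swazilan, cameroon, ethiopia) sit about 2.5 root mean square errors below zero, which is consistent with the mild negative skewness in the moments table.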
[Figure: Normal Probability Plot of resid from PROC UNIVARIATE. Vertical axis: residual, -10.5 to 10.5. Horizontal axis: normal quantiles, -2 to +2.]
[Figure: three PROC PLOT scatterplots, resid*yhat, resid*liter, and resid*loggnp, each using the value of ienglish (0 or 1) as the plotting symbol. Vertical axis: Residual, -20 to 10. Horizontal axes: Predicted Value of lifewom (40 to 80), liter (0 to 100), and loggnp (2 to 5). NOTE: 2 obs missing in each plot; 17, 21, and 23 obs hidden, respectively.]