IES 612/STA 4-573/STA 4-576
Spring 2005
Week 02 – IES612-lecture-week02.doc
UPDATED: 19 Jan. 2005
Using SAS …
Check out www.muohio.edu/quantapps for links to SAS help.
Confidence Interval for β1 → $b_1 \pm t_{\alpha/2,\,n-2}\,SE(b_1)$, where b1 is the estimated slope
Example: Manatee data – 90% CI for the SLOPE
90% CI => α = 0.10 => α/2 = 0.05 => t(0.05, 12) = 1.782
n = 14 => n-2 = 12
SE(b1) = 0.0129
b1 = 0.125 (estimated slope)
0.125 ± (1.782)(0.0129)
0.125 ± 0.023
0.102 < β1 < 0.148
Could use SAS to do this calculation
/*
tconfint.sas - 'tinv' / 'quantile' (v 8/9) function
*/
options ls=80;
data myci;
b1 = 0.12486; * slope estimate;
SE = 0.01290; * Std. Error of b1;
* Tcrit = quantile('T',.95,12); * Pr(T(12) < Tcrit) = 0.95;
Tcrit = tinv(.95,12);
/* Comment: Area LEFT of Tcrit = .95
            Area RIGHT of Tcrit = .05
*/
ME = Tcrit*SE; * margin of error;
LCL = b1 - ME; * lower confidence limit;
UCL = b1 + ME; * upper confidence limit;
proc print;
run;
From PROC PRINT results in the SAS LISTING file:
Obs    b1       SE      Tcrit    ME        LCL      UCL
1    0.12486  0.0129  1.78229  0.022992  0.10187  0.14785
F Test of β1
H0: β1 = 0
Ha: β1 ≠ 0
TS: Fobs = [SS(Reg)/1] / [SS(Resid)/(n-2)]
RR: Reject H0 if Fobs > F(α; 1, n-2)
Conclusions: Reject/Fail-to-reject H0?
Where $SS(Reg) = \sum (\hat{y}_i - \bar{y})^2$ and $SS(Resid) = \sum (y_i - \hat{y}_i)^2$
FROM SAS OUTPUT …
Analysis of Variance
| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| Model | 1 | 1711.97866 | 1711.97866 | 93.61 | <.0001 |
| Error | 12 | 219.44991 | 18.28749 | | |
| Corrected Total | 13 | 1931.42857 | | | |
Fobs = 93.61 with associated P-value < 0.0001
SS(Reg) = 1711.97866 and SS(Resid) = 219.44991
s² = MSE = 18.28749
Thoughts about the ingredients of an ANOVA table.
1. ANOVA = ANalysis Of VAriance
2. “Sum of Squares” represents a partitioning of the TOTAL variation into variability “explained” by a model (the linear regression model here) and the variability NOT explained (residual error)
3. SS(Total) [Corrected Total SS= 1931.43 above] is “partitioned” into the SS(Regression) [Model SS =1711.98 above] and SS(Residual) [Error SS = 219.45].
4. Mean Squares (MS) are defined as SS/(degrees of freedom).
5. A good regression model will have SS(Regression) large relative to SS(Residual), which translates into a large value of Fobs.
6. Alternative interpretation: SS(Residual) = error in predicting response “y” when using the linear regression model. SS(Total) = error in predicting response “y” when using YBAR. SS(Regression) = SS(Total) - SS(Residual) measures how much better the YHAT prediction model is when compared to YBAR. (more to come later)
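The F statistic in the ANOVA table can be rebuilt from the sums of squares alone. The data step below is a minimal sketch of that arithmetic using the Manatee values. The data set name anova_f, the variable names, and the 0.05 significance level are assumptions made for illustration.

```sas
/* Sketch: rebuild the ANOVA F test from the sums of squares */
options ls=80;
data anova_f;
  n       = 14;                      * number of observations;
  SSReg   = 1711.97866;              * Model SS from the ANOVA table;
  SSResid = 219.44991;               * Error SS from the ANOVA table;
  MSReg   = SSReg/1;                 * mean square = SS/df;
  MSE     = SSResid/(n-2);
  Fobs    = MSReg/MSE;               * observed F statistic;
  Fcrit   = finv(0.95, 1, n-2);      * assumed alpha = 0.05 cutoff;
  Pvalue  = 1 - probf(Fobs, 1, n-2); * area to the RIGHT of Fobs;
run;
proc print data=anova_f;
run;
```

Fobs should agree with the F Value of 93.61 in the table, and MSE with the Mean Square for Error.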
Alternatively, T Test of β1
H0: β1 = 0
Ha: β1 ≠ 0 [some assoc.]    Ha: β1 < 0 [negative assoc.]    Ha: β1 > 0 [positive assoc.]
TS: $t_{obs} = b_1 / SE(b_1)$
RR: Reject H0 if
|tobs| > t(α/2, n-2)    tobs < -t(α, n-2)    tobs > t(α, n-2)
Conclusions: Reject/Fail-to-reject H0?
P-value:
2·P(t(n-2) > |tobs|)    P(t(n-2) < tobs)    P(t(n-2) > tobs)
* take a look at the Manatee example from SAS output
Parameter Estimates
| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
| Intercept | 1 | -41.43044 | 7.41222 | -5.59 | 0.0001 |
| nboats | 1 | 0.12486 | 0.01290 | 9.68 | <.0001 |
H0: β1 = 0
Ha: β1 ≠ 0 [some assoc.]
TS: tobs = b1/SE(b1) = 0.12486/0.01290 = 9.68
P-value < 0.0001
Decision/Conclusion: REJECT H0 and conclude that there is a linear relationship between the number of manatees killed and the number of boats registered.
Comments: Always write your conclusions in the words of the problem. Translate the symbol representation back to the real world.
A confidence interval demonstrates the magnitude of the linear effect.
Tests and confidence intervals are related. For example, if a 100(1-α)% confidence interval for a parameter, say β1, does NOT contain 0 (e.g. 0.102 < β1 < 0.148), then you would reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at significance level α.
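The link between the interval and the two-sided test can be checked numerically. The data step below is a sketch using the Manatee slope and standard error. The data set name ci_vs_test, the indicator variables, and the choice α = 0.05 are assumptions made for illustration.

```sas
/* Sketch: a 100(1-alpha) percent CI excludes 0 exactly when
   the two-sided t test rejects H0 at level alpha */
options ls=80;
data ci_vs_test;
  b1    = 0.12486;                 * slope estimate;
  SE    = 0.01290;                 * Std. Error of b1;
  df    = 12;
  alpha = 0.05;                    * assumed significance level;
  Tcrit = tinv(1 - alpha/2, df);   * two-sided critical value;
  LCL   = b1 - Tcrit*SE;
  UCL   = b1 + Tcrit*SE;
  tobs  = b1/SE;
  CI_excludes_0 = (LCL > 0 or UCL < 0);  * 1 = true and 0 = false;
  Test_rejects  = (abs(tobs) > Tcrit);   * 1 = true and 0 = false;
run;
proc print data=ci_vs_test;
run;
```

The two indicator variables always agree, because both conditions reduce to |b1| > Tcrit*SE.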
/*
tPvalue.sas
*/
options ls=80;
data myci;
b1 = 0.12486; * slope estimate;
SE = 0.01290; * Std. Error of b1;
tcalc = 9.68; * t statistic value;
df = 12;
P_lower = probt(tcalc, df); * Pr(T(df) < tcalc);
P_upper = 1-probt(tcalc, df); * Pr(T(df) > tcalc);
P_two_tail = 2*(1-probt(abs(tcalc),df));
* Note: SAS version 9 uses 'CDF' as a generalization of 'probt';
proc print;
run;
From PROC PRINT results in the SAS LISTING file:
Obs    b1       SE      tcalc  df  P_lower   P_upper      P_two_tail
1    0.12486  0.0129  9.68   12  1.00000  .000000254  .000000508
* Hypothesis tests / Confidence intervals for the intercept, b0, are similar.
* Can you select design points to have more precision when estimating the slope?
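One way to approach the design question is through the usual standard error of the least-squares slope. The expression below is the standard textbook form, stated here as an assumption because the notes do not show it. Here s = √MSE.

```latex
SE(b_1) = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

For a fixed s and n, spreading the x values farther from their mean makes the denominator larger and so makes SE(b1) smaller.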
Remedial Measures and Transformations
RECALL: Basic Model
Yi = β0 + β1 Xi + εi [“simple linear regression”]
Y = response variable (dependent variable)
X = predictor variable (independent variable, covariate)
Formal assumptions:
1. relation linear – on average error = 0 [ E(εi) = 0 ] –> E(Yi) = β0 + β1 Xi
2. constant variance – V(εi) = σ² –> V(Yi) = σ²
3. εi independent
4. εi ~ Normal
We will talk more about model adequacy. Now, a few remarks about a special case when the first assumption might be violated.
There may be times when a nonlinear relationship might be modeled by linear regression.
Example: MPH and Vehicle Density on a

What if we plot the Log(MPH) vs. Vehicle Density?

Ref: http://lib.stat.cmu.edu/DASL/Datafiles/transformationdat.html and
B.D. Greenshields and F.M. Weida, Statistics with Applications to Highway Traffic Analysis, Eno Foundation, 1978, 129-131. (DENS, MPH below)
* other common examples– exponential growth and decay
* LOG10 transformations are also commonly used when the range of the response or predictor variables span many orders of magnitude (e.g. per capita gnp, population size, geographic area).
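The exponential growth/decay case shows why a log transformation can bring a nonlinear relationship back into the linear regression framework. The sketch below assumes a multiplicative error and introduces γ0 as a hypothetical baseline constant. Neither appears in the notes.

```latex
Y_i = \gamma_0 \, e^{\beta_1 X_i} \, e^{\varepsilon_i}
\quad\Longrightarrow\quad
\log(Y_i) = \beta_0 + \beta_1 X_i + \varepsilon_i ,
\qquad \beta_0 = \log(\gamma_0)
```

Under these assumptions, regressing log(Y) on X is a simple linear regression, and a one-unit increase in X multiplies the typical value of Y by exp(β1).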
Other Inference in Regression – Average responses or prediction of new observations at a particular value of x
X values in the dataset – x1, …, xn
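Denote new value of X: $x_{n+1}$
Prediction of the mean response (or new response) at this x value: $\hat{y}_{n+1} = b_0 + b_1 x_{n+1}$
SE of this prediction: $SE(\hat{y}_{n+1}) = s \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$, where s = √MSE
Confidence Interval for the Mean Response:
$\hat{y}_{n+1} \pm t_{\alpha/2,\,n-2} \; s \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
Observation: As $x_{n+1}$ gets farther from $\bar{x}$, the SE of the prediction increases (an “extrapolation” penalty)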
<See sketch>
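Prediction Interval for a New Response
Both the uncertainty in the location of the MEAN RESPONSE and the variability of an individual value about that mean must be considered.
$\hat{y}_{n+1} \pm t_{\alpha/2,\,n-2} \; s \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$, where s = √MSE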
Comment:
* SAS Proc GLM options “clm” = mean response CI and “cli” = prediction intervals
From Manatee SAS output
| Obs | Dep Var | Predicted | Std Error (Mean Predict) | 95% CL Mean Lower | 95% CL Mean Upper | 95% CL Predict Lower | 95% CL Predict Upper | Residual | Std Error (Residual) | Student Residual |
| 1 | 13.0000 | 14.3827 | 1.9299 | 10.1779 | 18.5876 | 4.1604 | 24.6050 | -1.3827 | 3.816 | -0.362 |
| 2 | 21.0000 | 16.0059 | 1.7974 | 12.0896 | 19.9222 | 5.8989 | 26.1130 | 4.9941 | 3.880 | 1.287 |
| 3 | 24.0000 | 18.6280 | 1.5976 | 15.1472 | 22.1089 | 8.6816 | 28.5745 | 5.3720 | 3.967 | 1.354 |
| 4 | 16.0000 | 20.7507 | 1.4528 | 17.5853 | 23.9161 | 10.9102 | 30.5911 | -4.7507 | 4.022 | -1.181 |
| 5 | 24.0000 | 22.6236 | 1.3420 | 19.6997 | 25.5475 | 12.8582 | 32.3891 | 1.3764 | 4.060 | 0.339 |
| 6 | 20.0000 | 22.4987 | 1.3488 | 19.5600 | 25.4375 | 12.7288 | 32.2687 | -2.4987 | 4.058 | -0.616 |
| 7 | 15.0000 | 24.2468 | 1.2622 | 21.4968 | 26.9968 | 14.5320 | 33.9616 | -9.2468 | 4.086 | -2.263 |
| 8 | 34.0000 | 28.3672 | 1.1482 | 25.8656 | 30.8689 | 18.7198 | 38.0147 | 5.6328 | 4.119 | 1.367 |
Suppose x(n+1) = 559 (corresponds to the 8th observation)
95% CI for the mean response: 25.87 < E(Y(n+1)) < 30.87
95% prediction interval for a new response: 18.72 < Y(n+1) < 38.01
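Both intervals for the 8th observation can be reproduced by hand from three quantities in the SAS output: the predicted value, the standard error of the mean prediction, and MSE. The data step below is a sketch of that calculation. The data set name interval8 and the variable names are assumptions made for illustration.

```sas
/* Sketch: 95 percent CI for the mean response and 95 percent
   prediction interval at the 8th observation */
options ls=80;
data interval8;
  yhat   = 28.3672;                 * predicted value for obs 8;
  SEmean = 1.1482;                  * Std Error of the mean prediction;
  MSE    = 18.28749;                * from the ANOVA table;
  df     = 12;
  Tcrit  = tinv(0.975, df);         * area LEFT of Tcrit = .975;
  SEpred = sqrt(MSE + SEmean**2);   * adds individual variability;
  CLM_lo = yhat - Tcrit*SEmean;     * mean response CI;
  CLM_hi = yhat + Tcrit*SEmean;
  CLI_lo = yhat - Tcrit*SEpred;     * prediction interval;
  CLI_hi = yhat + Tcrit*SEpred;
run;
proc print data=interval8;
run;
```

The prediction interval is wider because SEpred adds MSE, the variability of an individual response about its mean, to the squared standard error of the mean prediction.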
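Correlation and Coefficient of Determination – Measures of Strength of Association
Slope Estimator: $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
Correlation Coefficient: $r_{YX} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$
So rYX = (Estimated slope) TIMES [SD(X) / SD(Y)] = “rescaled” slope estimate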
Observations:
1. Pearson product-moment correlation (other types of correlation coefficients are defined, e.g. Spearman’s rho)
2. –1 <= rYX <= 1
3. rYX = 0 IMPLIES no LINEAR relationship
4. The correlation coefficient tends to increase in magnitude as the range of X increases
5. A test of population correlation coefficient = 0 is given but not discussed, since it is equivalent to the test of the slope
SKETCH various scatterplots associated with r=0.9, r=0.3, r=0, r=-0.3, r=-0.9
Coefficient of Determination “R-square”: $R^2 = \frac{SS(Total) - SS(Residual)}{SS(Total)} = \frac{SS(Regression)}{SS(Total)}$
“proportionate reduction in prediction error when using YHAT instead of YBAR to predict y”
“proportion of total variability accounted for/explained by the linear regression model”
Comments:
* Coefficient of determination = (rYX)² = (correlation coefficient)² for simple linear regression – NOT for multiple regression!
* When people report a significant correlation coefficient of 0.40 between two variables X and Y, recognize that this means that only 16% (0.4 x 0.4) of the variation in one variable is accounted for by its linear association with the other variable.
* SAS Proc CORR can be used to determine the correlation between variables
Example: Manatee deaths and boats registered
| Root MSE | 4.27639 | R-Square | 0.8864 |
| Dependent Mean | 29.42857 | Adj R-Sq | 0.8769 |
| Coeff Var | 14.53141 | | |
r² = 0.8864, so approx. 89% of the variation in the number of manatees killed is explained by a linear relationship with the number of boats registered.
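The fit statistics can be tied back to the ANOVA table with a few lines of arithmetic. The data step below is a sketch. The data set name rsq and the variable names are assumptions, and the adjusted R-square line uses the usual simple-regression form 1 - (1 - R²)(n-1)/(n-2).

```sas
/* Sketch: R-square, adjusted R-square, and r from the ANOVA sums of squares */
options ls=80;
data rsq;
  n       = 14;
  SSReg   = 1711.97866;                  * Model SS;
  SSTotal = 1931.42857;                  * Corrected Total SS;
  MSE     = 18.28749;                    * Error mean square;
  b1      = 0.12486;                     * slope estimate;
  Rsq     = SSReg/SSTotal;               * proportion of variation explained;
  AdjRsq  = 1 - (1 - Rsq)*(n-1)/(n-2);   * adjusted for df;
  RootMSE = sqrt(MSE);
  r       = sign(b1)*sqrt(Rsq);          * r takes the sign of the slope;
run;
proc print data=rsq;
run;
```

Rsq, AdjRsq and RootMSE should agree with the R-Square, Adj R-Sq and Root MSE entries in the SAS output. The correlation r is positive here because the estimated slope is positive.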