IES 612/STA 4-573/STA 4-576
Spring 2005
Week 1 – IES612-lecture-week1.doc
Info Card
IES 612 or STA 4-573
Spring 2005
1. Name
2. Department/degree
3. Major/concentration/advisor
4. Previous Stat classes?
5. Previous Math classes?
6. Previous Computing classes/experience?
7. What do you hope to learn from this class?
8. Something that will help me get to know you better.
SYLLABUS
Regression (5 weeks)
Experimental Design (5 weeks)
Sampling (2+ weeks)
Math modeling (2+ weeks)
REVIEW (prerequisite material)
* We are moving from DESCRIPTIVE STATISTICS and simple HYPOTHESIS TESTS towards MODELS for describing ASSOCIATION and PREDICTION
* CONCEPTS:
POPULATION = collection of all units of interest
SAMPLE = subset of population selected to represent the population
PARAMETERS = characteristics of the population (μ, σ², ρ, β0)
STATISTICS = characteristics of the sample (x̄, s², r, b0)
Sampling – selecting elements from a population into a sample
Inference – making statements about a population based on information in a sample
* refer to IES612-lecture-week0.doc for more detailed review suggestions
Hypothesis Tests
H0 – null/no-effect hypothesis
Ha (or H1 or HA) – research or alternative hypothesis
Test statistic (TS)
Rejection Region / P-value
Conclusion
Errors? Type I (False Positive); Type II (False Negative)
α = P(Type I error), β = P(Type II error)
Confidence Intervals
(point estimate) +/- (multiple) (std. Error)
* there are other ways to form confidence intervals, but this general form applies in many cases
Association
Categorical data – multiway tables (see OL Ch. 10)
Numeric data – regression data
(x1, y1), (x2, y2) … (xn, yn) or in shorthand, (xi, yi) i = 1, …, n
Example: Manatee deaths due to motorboats in Florida
YEAR   Number Boats (1000s)   Manatees Killed
77     447                    13
78     460                    21
79     481                    24
80     498                    16
81     513                    24
82     512                    20
83     526                    15
84     559                    34
85     585                    33
86     614                    33
87     645                    39
88     675                    43
89     711                    50
90     719                    47
Graphical display? Scatterplot or scatterdiagram

Example: Progesterone level as a function of gestation day in sheep pregnant with singletons
Singleton Gestation Days   Singleton Progesterone
53                         3.8
60                         5
66                         4.5
72                         4.2
73                         5.5
76                         5.8
77                         4.6
78                         5.3
78                         7.2
79                         5.7
80                         6
80                         6.3
81                         4.8
82                         5.6
83                         4.9
84                         4.3
87                         4.9
89                         4.2
98                         3.4
105                        4.8
72                         5.2
72                         5.9
77                         5.7
77                         2.8
82                         6.6
98                         6.1
98                         9.3
104                        7.7
104                        5.3
109                        7.8

Basic Model
Yi = β0 + β1 Xi + εi [“simple linear regression”]
Y = response variable (dependent variable)
X = predictor variable (independent variable, covariate)
Formal assumptions:
1. Relation is linear: on average the error is 0 [ E(εi) = 0 ] → E(Yi) = β0 + β1 Xi
2. Constant variance: V(εi) = σ² → V(Yi) = σ²
3. εi independent
4. εi ~ Normal
Issue of causality: observational versus experimental studies.
Why not y = mx + b? The form above generalizes more easily to more than one predictor variable.
β0 = y-intercept, the value of “Y” at “X = 0”
β1 = slope, how “Y” changes with a unit change in “X”
Which parameter is generally of more interest? Why?
β1, because it contains the information about the relationship between the two variables.
Estimating regression coefficients
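Least squares – choose b0 and b1 (the estimates of the intercept and slope) to minimize $\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$
Solution:
$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$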
Interpretation: Units?
Interpretation: graphical (quadrants defined by the means)
Example (Manatee): b0 = -41.43 and b1 = 0.125
Interpretation:
Intercept: When no boats are registered, the model predicts –41.4 manatee deaths, which is impossible. Notice that x = 0 is well outside the SCOPE of the model (the observed x values run from 447 to 719).
Slope: For each additional 1,000 boats (x increases by 1), predict an increase of 0.125 manatee deaths. Perhaps a better interpretation: for each additional 10,000 boats (x increases by 10), predict about 1.25 additional manatee deaths.
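The manatee estimates can be checked by hand with the usual closed-form least-squares solution (slope = Sxy/Sxx, intercept = ȳ − slope·x̄). The following is a minimal Python sketch rather than part of the course's SAS workflow; the function name `least_squares` and the variable names are illustrative choices, and the data are copied from the manatee table.

```python
# Manatee data: boats registered (1000s) and manatees killed, 1977-1990
nboats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719]
manatees = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47]


def least_squares(x, y):
    """Return (b0, b1), the intercept and slope minimizing the sum of squared residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Sxy: sum of cross-products of deviations from the means
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    # Sxx: sum of squared deviations of x from its mean
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


b0, b1 = least_squares(nboats, manatees)
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}")  # prints: b0 = -41.43, b1 = 0.125
```

Each term of Sxy is positive when a point lies in the upper-right or lower-left quadrant defined by the two means and negative otherwise, which is the graphical interpretation of the slope.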
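How do you deal with the intercept? Reparameterize the model by rescaling (shifting) the X variable.
$Y_i = \beta_0^* + \beta_1 (X_i - \bar{X}) + \varepsilon_i$ [intercept is the average response at the mean X level]
$Y_i = \beta_0^* + \beta_1 (X_i - 447) + \varepsilon_i$ [intercept is the average response at X=447]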
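To see what shifting X does, the sketch below refits the manatee line three ways: with the raw X, with X minus its mean, and with X minus 447 (the smallest observed value). It assumes plain Python and a hypothetical helper `fit_line`; it is an illustration of the reparameterization, not the course's SAS code.

```python
nboats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719]
manatees = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47]


def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
        (xi - xbar) ** 2 for xi in x
    )
    return ybar - slope * xbar, slope


xbar = sum(nboats) / len(nboats)

raw_b0, raw_b1 = fit_line(nboats, manatees)
# Shift X by its mean: the intercept becomes the average response at the mean X
ctr_b0, ctr_b1 = fit_line([x - xbar for x in nboats], manatees)
# Shift X by 447: the intercept becomes the average response at X = 447
min_b0, min_b1 = fit_line([x - 447 for x in nboats], manatees)

print(f"raw X:      intercept {raw_b0:8.2f}, slope {raw_b1:.3f}")
print(f"X - mean:   intercept {ctr_b0:8.2f}, slope {ctr_b1:.3f}")
print(f"X - 447:    intercept {min_b0:8.2f}, slope {min_b1:.3f}")
```

The slope is the same in all three fits; only the intercept changes. The mean-centered intercept equals the mean number of manatees killed (29.43), and the X − 447 intercept equals the fitted value for the first observation (14.38).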
Issues
Leverage = points with high/low values of the predictor variable X (“outliers” in the X direction)
Influential = omitting point causes estimates of the regression coefficients to change dramatically
Outlier = point with a large residual (more to come!)
Estimate of σ²
Recall from your first stat class, $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$ with “n-1” degrees of freedom
Pay a penalty because the mean is unknown and estimated by ȳ
How about in regression?
Mean at any value of “x” is estimated by $\hat{y} = b_0 + b_1 x$
So in regression, we estimate the variance by $s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}$ with “n-2” degrees of freedom (two coefficients, b0 and b1, are estimated)
“mean squared residual”
“mean squared error”
“s” = sample std. dev. around the regression line / std. error of estimate / residual std. dev.
How do we use the estimate of σ²?
1. If ε ~ N, then expect approx. 95% of residuals to be within +/- 2s of 0 (more to come)
2. Used in inference for the regression coefficients
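As a rough check of the variance estimate, the sketch below computes the residuals for the manatee fit, divides their sum of squares by n − 2, and counts how many residuals fall within ±2s of zero. It is a plain-Python illustration with made-up variable names, not the SAS procedure used in class.

```python
import math

nboats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719]
manatees = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47]
n = len(nboats)

# Least-squares fit
xbar, ybar = sum(nboats) / n, sum(manatees) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(nboats, manatees)) / sum(
    (x - xbar) ** 2 for x in nboats
)
b0 = ybar - b1 * xbar

# Residuals: observed minus predicted
residuals = [y - (b0 + b1 * x) for x, y in zip(nboats, manatees)]

sse = sum(e ** 2 for e in residuals)  # sum of squared residuals
mse = sse / (n - 2)                   # estimate of the error variance
s = math.sqrt(mse)                    # residual standard deviation

within_2s = sum(1 for e in residuals if abs(e) <= 2 * s)

print(f"SSE = {sse:.2f}, MSE = {mse:.2f}, s = {s:.2f}")
print(f"{within_2s} of {n} residuals are within +/- 2s of 0")
```

For these data 13 of the 14 residuals lie within ±2s; the exception is 1983 (15 deaths observed, about 24 predicted).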
Using SAS to fit the simple regression model
/*
example sas program that does simple linear
regression
*/
options ls=75;
data example1;
input year nboats manatees;
cards;
77 447 13
78 460 21
79 481 24
80 498 16
81 513 24
82 512 20
83 526 15
84 559 34
85 585 33
86 614 33
87 645 39
88 675 43
89 711 50
90 719 47
;
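/* send the PROC REG output to an RTF file */
ODS RTF
file='D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\linreg-output.rtf';
proc reg data=example1;
title 'Number of Manatees killed regressed on the number of boats registered in Florida';
/* p = predicted values, r = residual analysis, cli/clm = 95% limits for a new observation / for the mean */
model manatees = nboats / p r cli clm;
/* observed data ("o") and fitted values ("+") overlaid on one plot */
plot manatees*nboats="o" p.*nboats="+" / overlay;
/* residuals against the predictor and against the predicted values */
plot r.*nboats r.*p.;
run;
ODS RTF CLOSE;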
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       1711.97866    1711.97866     93.61   <.0001
Error             12        219.44991      18.28749
Corrected Total   13       1931.42857
Root MSE          4.27639   R-Square   0.8864
Dependent Mean   29.42857   Adj R-Sq   0.8769
Coeff Var        14.53141
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            -41.43044          7.41222     -5.59     0.0001
nboats       1              0.12486          0.01290      9.68     <.0001
Output Statistics
      Dep Var  Predicted     Std Error        95% CL Mean     95% CL Predict            Std Error   Student
Obs                Value  Mean Predict                                        Residual   Residual  Residual
  1   13.0000    14.3827        1.9299   10.1779  18.5876    4.1604  24.6050   -1.3827      3.816    -0.362
  2   21.0000    16.0059        1.7974   12.0896  19.9222    5.8989  26.1130    4.9941      3.880     1.287
  3   24.0000    18.6280        1.5976   15.1472  22.1089    8.6816  28.5745    5.3720      3.967     1.354
  4   16.0000    20.7507        1.4528   17.5853  23.9161   10.9102  30.5911   -4.7507      4.022    -1.181
  5   24.0000    22.6236        1.3420   19.6997  25.5475   12.8582  32.3891    1.3764      4.060     0.339
  6   20.0000    22.4987        1.3488   19.5600  25.4375   12.7288  32.2687   -2.4987      4.058    -0.616
  7   15.0000    24.2468        1.2622   21.4968  26.9968   14.5320  33.9616   -9.2468      4.086    -2.263
  8   34.0000    28.3672        1.1482   25.8656  30.8689   18.7198  38.0147    5.6328      4.119     1.367
  9   33.0000    31.6137        1.1650   29.0753  34.1520   21.9566  41.2707    1.3863      4.115     0.337
 10   33.0000    35.2346        1.2909   32.4221  38.0472   25.5019  44.9673   -2.2346      4.077    -0.548
 11   39.0000    39.1054        1.5187   35.7963  42.4144   29.2178  48.9929   -0.1054      3.998   -0.0264
 12   43.0000    42.8512        1.7974   38.9349  46.7675   32.7442  52.9582    0.1488      3.880    0.0383
 13   50.0000    47.3462        2.1762   42.6048  52.0877   36.8917  57.8007    2.6538      3.681     0.721
 14   47.0000    48.3451        2.2647   43.4109  53.2794   37.8018  58.8884   -1.3451      3.628    -0.371
Output Statistics
Obs     -2-1 0 1 2     Cook's D
  1   |      |      |     0.017
  2   |      |**    |     0.178
  3   |      |**    |     0.149
  4   |    **|      |     0.091
  5   |      |      |     0.006
  6   |     *|      |     0.021
  7   |  ****|      |     0.244
  8   |      |**    |     0.073
  9   |      |      |     0.005
 10   |     *|      |     0.015
 11   |      |      |     0.000
 12   |      |      |     0.000
 13   |      |*     |     0.091
 14   |      |      |     0.027
Sum of Residuals                        0
Sum of Squared Residuals        219.44991
Predicted Residual SS (PRESS)   281.76275
Confidence Interval for β1 → $b_1 \pm t_{\alpha/2,\, n-2}\, SE(b_1)$
Example: Manatee data – 90% CI for the SLOPE
90% CI => α = 0.10 => α/2 = 0.05 => t_{.05, 12} = 1.782
n=14 => n-2 = 12
SE(b1) = 0.0129
b1 = 0.125
0.125 ± (1.782)(0.0129)
0.125 ± .023
0.102 < β1 < 0.148
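The interval arithmetic follows the general (point estimate) +/- (multiple)(std. error) form and is easy to reproduce. The sketch below is plain Python with a hypothetical helper `t_interval`; the multiplier 1.782 is the tabled t value quoted in the example, and the estimate and standard error are the rounded values 0.125 and 0.0129.

```python
def t_interval(estimate, std_error, t_multiplier):
    """Return (lower, upper) for estimate +/- t_multiplier * std_error."""
    margin = t_multiplier * std_error
    return estimate - margin, estimate + margin


# 90% CI for the slope: b1 = 0.125, SE(b1) = 0.0129, t with 12 df and 0.05 in the upper tail
lower, upper = t_interval(0.125, 0.0129, 1.782)
print(f"{lower:.3f} < beta1 < {upper:.3f}")  # prints: 0.102 < beta1 < 0.148
```

Because the interval excludes 0, it agrees with rejecting H0: slope = 0 in a two-sided test at the 10% level.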
F Test of β1
H0: β1 = 0
Ha: β1 ≠ 0
TS: Fobs = [SS(Reg)/1] / [SS(Resid)/(n-2)]
RR: Reject H0 if Fobs > F_{α, 1, n-2}
Conclusions
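Where $SS(Reg) = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ and $SS(Resid) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$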

Alternatively, T Test of β1
H0: β1 = 0
Ha: β1 ≠ 0        Ha: β1 < 0        Ha: β1 > 0
TS: $t_{obs} = b_1 / SE(b_1)$
RR: Reject H0 if
|tobs| > t_{α/2, n-2}        tobs < -t_{α, n-2}        tobs > t_{α, n-2}
Conclusions: Reject/Fail-to-reject H0?
P-value:
2 P(t_{n-2} > |tobs|)        P(t_{n-2} < tobs)        P(t_{n-2} > tobs)
* take a look at the Manatee example from SAS output above
* Hypothesis tests / Confidence intervals for the intercept, β0, are similar.
* 
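The F and t statistics for the slope can be recomputed from the numbers in the SAS output. The sketch below is plain Python; the significance level α = 0.05 and the tabled critical values (4.75 for F with 1 and 12 df, 2.179 for t with 12 df) are assumptions made for illustration, since the notes do not fix a level for this test.

```python
# Values copied from the SAS output for the manatee regression
ss_reg = 1711.97866    # model (regression) sum of squares
ss_resid = 219.44991   # error (residual) sum of squares
n = 14
b1 = 0.12486           # slope estimate
se_b1 = 0.01290        # standard error of the slope

# Assumed alpha = 0.05; critical values taken from standard F and t tables
F_CRIT = 4.75          # upper 0.05 point of F with 1 and 12 df
T_CRIT = 2.179         # upper 0.025 point of t with 12 df

f_obs = (ss_reg / 1) / (ss_resid / (n - 2))
t_obs = b1 / se_b1

print(f"F_obs = {f_obs:.2f}, reject H0: {f_obs > F_CRIT}")
print(f"t_obs = {t_obs:.2f}, reject H0 (two-sided): {abs(t_obs) > T_CRIT}")
print(f"t_obs squared = {t_obs ** 2:.2f}")
```

In simple linear regression the two-sided t test and the F test are equivalent: the squared t statistic matches the F statistic up to rounding in the reported estimate and standard error (about 93.7 versus 93.61 here).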
Other Inference in Regression – Average responses or prediction of new observations at a particular value of x
X values in the dataset – x1, …, xn
Denote new value of X: x_{n+1}