IES 612/STA 4-573/STA 4-576

Spring 2005

 

Week 1 – IES612-lecture-week1.doc

 

Info Card

IES 612 or STA 4-573

Spring 2005

1. Name

 

2. Department/degree

 

3. Major/concentration/advisor

 

4. Previous Stat classes?

 

5.  Previous Math classes?

 

6.  Previous Computing classes/experience?

 

7.  What do you hope to learn from this class?

 

8.  Something that will help me get to know you better.

 

 

 

 

SYLLABUS

Regression (5 weeks)

Experimental Design (5 weeks)

Sampling (2+ weeks)

Math modeling (2+ weeks)

 

REVIEW (prerequisite material)

 

* We are moving from DESCRIPTIVE STATISTICS and simple HYPOTHESIS TESTS towards MODELS for describing ASSOCIATION and PREDICTION

 

* CONCEPTS:

POPULATION = collection of all units of interest

 

SAMPLE = subset of population selected to represent the population

 

PARAMETERS = characteristics of the population (μ, σ², ρ, β0)

 

STATISTICS = characteristics of the sample (xbar, s², r, b0)

 

Sampling – selecting elements from a population into a sample

 

Inference – making statements about a population based on information in a sample

 

* refer to IES612-lecture-week0.doc for more detailed review suggestions

 

Hypothesis Tests

H0 – null/no-effect hypothesis

Ha (or H1 or HA) – research or alternative hypothesis

Test statistic (TS)

Rejection Region / P-value

Conclusion

 

Errors?  Type I (False Positive);  Type II (False Negative)

 

α = P(Type I error), β = P(Type II error)

 

Confidence Intervals

 

(point estimate) +/- (multiple) (std. error)

 

* There are other ways to form confidence intervals, but this general form applies in many cases
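
As one familiar instance of this general form, the t interval for a population mean can be written as below. This is a sketch assuming a random sample of size n from a roughly normal population; the symbols are standard textbook notation rather than taken from these notes.

```latex
\bar{y} \;\pm\; t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}
% point estimate: \bar{y}
% multiple:       t_{\alpha/2, n-1}
% std. error:     s / \sqrt{n}
```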

 

Association

 

Categorical data – multiway tables (see OL Ch. 10)

 

Numeric data – regression data

 

(x1, y1), (x2, y2) … (xn, yn) or in shorthand, (xi, yi) i = 1, …, n

 

 

Example:  Manatee deaths due to motorboats in Florida

 

YEAR   Number Boats (1000s)   Manatees Killed
77     447                    13
78     460                    21
79     481                    24
80     498                    16
81     513                    24
82     512                    20
83     526                    15
84     559                    34
85     585                    33
86     614                    33
87     645                    39
88     675                    43
89     711                    50
90     719                    47

 

Graphical display?  Scatterplot (scatter diagram)
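
A scatterplot of the manatee data could be produced along the following lines. This is only a sketch: the dataset name manatee_plot is hypothetical, the axis labels are illustrative, and PROC SGPLOT assumes SAS 9.2 or later.

```sas
/* Sketch only: dataset name manatee_plot is hypothetical;
   PROC SGPLOT assumes SAS 9.2 or later. */
data manatee_plot;
  input year nboats manatees @@;   /* @@ reads several observations per line */
  datalines;
77 447 13  78 460 21  79 481 24  80 498 16  81 513 24
82 512 20  83 526 15  84 559 34  85 585 33  86 614 33
87 645 39  88 675 43  89 711 50  90 719 47
;

proc sgplot data=manatee_plot;
  scatter x=nboats y=manatees;     /* predictor on x-axis, response on y-axis */
  xaxis label="Number of boats registered (1000s)";
  yaxis label="Manatees killed";
run;
```

Putting the predictor on the horizontal axis and the response on the vertical axis matches the roles the two variables play in the regression model.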

 

 

Example:  Progesterone level as a function of gestation day in sheep pregnant with singletons

 

Singleton Gestation Days   Singleton Progesterone
 53                        3.8
 60                        5
 66                        4.5
 72                        4.2
 73                        5.5
 76                        5.8
 77                        4.6
 78                        5.3
 78                        7.2
 79                        5.7
 80                        6
 80                        6.3
 81                        4.8
 82                        5.6
 83                        4.9
 84                        4.3
 87                        4.9
 89                        4.2
 98                        3.4
105                        4.8
 72                        5.2
 72                        5.9
 77                        5.7
 77                        2.8
 82                        6.6
 98                        6.1
 98                        9.3
104                        7.7
104                        5.3
109                        7.8

 

 

Basic Model

 

Yi = β0 + β1Xi + εi    [“simple linear regression”]

 

Y = response variable (dependent variable)

 

X = predictor variable (independent variable, covariate)

 

Formal assumptions:

1. Linear relation: on average the error is 0 [ E(εi) = 0 ], so E(Yi) = β0 + β1Xi

2. Constant variance: V(εi) = σ², so V(Yi) = σ²

3. εi independent

4. εi ~ Normal
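
The four assumptions can be collected into one compact statement. This is a sketch in standard notation; the "iid" shorthand is an addition here and uses \overset from the amsmath package.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \quad i = 1, \dots, n
\;\Longrightarrow\;
Y_i \sim N(\beta_0 + \beta_1 X_i,\; \sigma^2) \ \text{independently}
```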

 

Issue of causality:  observational versus experimental studies.

 

Why not y = mx + b?  Form above can be more easily generalized to more than one predictor variable.

 

β0 = y-intercept, mean value of “Y” at “X=0”

β1 = slope, how the mean of “Y” changes with a unit change in “X”

 

Which parameter is generally of more interest?  Why?

β1, because the slope contains the information about the relationship between the two variables.

 

Estimating regression coefficients

 

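Least squares – choose b0 and b1 to minimize  Q = Σ (Yi – b0 – b1Xi)²

 

Solution:  b1 = Σ(Xi – Xbar)(Yi – Ybar) / Σ(Xi – Xbar)²,   b0 = Ybar – b1 Xbar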
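
The step from the least-squares criterion to its solution can be sketched as follows: set both partial derivatives to zero and solve the two resulting (normal) equations. The symbol Q and the lowercase x, y notation are assumptions made here for illustration.

```latex
\begin{aligned}
Q(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\[4pt]
\frac{\partial Q}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
  \;\Longrightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\[4pt]
\frac{\partial Q}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i\,(y_i - b_0 - b_1 x_i) = 0
  \;\Longrightarrow\;
  b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}
```

As a check against the manatee data, xbar = 7945/14 = 567.5 and ybar = 412/14 = 29.43, so with b1 = 0.12486 the intercept is 29.43 – (0.12486)(567.5) ≈ –41.43.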

 

Interpretation:  Units?

 

Interpretation:  graphical (quadrants defined by the means)

 

Example (Manatee): b0 = -41.43 and b1 = 0.125

 

Interpretation:

Intercept:  When no boats are registered, the model predicts –41.4 manatee deaths ?!?!?  Notice that x=0 is well outside the SCOPE of the model (the observed x values range from 447 to 719).

Slope:  For each additional x=1 (1,000 boats), predict an increase of about 0.125 manatee deaths.  Maybe a better interpretation: for each additional x=10 (10,000 boats), predict about 1.25 additional manatee deaths.

 

How do you deal with the intercept?  Reparameterize the model by shifting (centering) the X variable.

 

Yi = β0* + β1(Xi – Xbar) + εi   [intercept is the average response at the mean X level]

 

Yi = β0* + β1(Xi – 447) + εi   [intercept is the average response at X=447]
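
Both reparameterizations could be fit in SAS roughly as follows. This is a sketch, not the course's own program: it assumes the dataset example1 created in the SAS program later in these notes, and the names xbar, example1c, nboats_mean, nboats_c and nboats_447 are hypothetical.

```sas
/* Sketch only: assumes dataset example1 from the notes' SAS program;
   all new dataset and variable names are hypothetical. */
proc means data=example1 noprint;
  var nboats;
  output out=xbar mean=nboats_mean;    /* sample mean of X */
run;

data example1c;
  if _n_ = 1 then set xbar(keep=nboats_mean);  /* attach the mean to every row */
  set example1;
  nboats_c   = nboats - nboats_mean;   /* X centered at its sample mean */
  nboats_447 = nboats - 447;           /* X shifted so that 0 means 447 */
run;

proc reg data=example1c;
  model manatees = nboats_c;     /* intercept = fitted mean response at mean X */
  model manatees = nboats_447;   /* intercept = fitted mean response at X = 447 */
run;
```

The slope estimate should be unchanged (0.12486) in both fits. With mean-centering the intercept estimate should equal ybar (29.43), and with the shift by 447 it should match the predicted value for the first observation in the SAS output (about 14.38).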

 

 

Issues

 

Leverage = points with high/low values of the predictor variable X (“outliers” in the X direction)

 

Influential = omitting point causes estimates of the regression coefficients to change dramatically

 

Outlier = point with a large residual (more to come!)

 

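Estimate of σ²

 

Recall from your first stat class, s² = Σ(yi – ybar)² / (n – 1), with “n-1” degrees of freedom

Pay penalty b/c mean unknown and estimated by ybar

 

How about in regression?

Mean at any value of “x” is estimated by yhat = b0 + b1x

 

So in regression, we estimate the variance by s² = Σ(yi – yhat_i)² / (n – 2) = SS(Resid) / (n – 2), with “n-2” degrees of freedom (penalty for estimating two coefficients, b0 and b1)

“mean squared residual”

“mean squared error”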

 

“s” = sample std. dev. around the regression line/ std. error of estimate/residual std. dev.

 

How do we use the estimate of σ²?

1.  If ε ~ Normal, then expect approx. 95% of residuals to be within +/- 2s of 0 (more to come)

2.  Used in inference for the regression coefficients
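
For the manatee data, the estimate can be worked out from the Error line of the SAS Analysis of Variance output (sum of squares 219.44991 on 12 degrees of freedom). The arithmetic below is a worked illustration using those reported values.

```latex
s^2 = \frac{SS(\text{Resid})}{n-2} = \frac{219.44991}{14-2} = 18.28749,
\qquad
s = \sqrt{18.28749} = 4.27639,
\qquad
\pm 2s \approx \pm 8.55
```

In the SAS Output Statistics table, 13 of the 14 residuals fall within ±8.55 of 0; only observation 7 (residual –9.2468) falls outside.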

 

Using SAS to fit the simple regression model

 

/*

  example sas program that does simple linear regression

*/

 

options ls=75;

 

data example1;

  input year nboats manatees;

  cards;

77   447  13

78   460  21

79   481  24

80   498  16

81   513  24

82   512  20

83   526  15

84   559  34

85   585  33

86   614  33

87   645  39

88   675  43

89   711  50

90   719  47

;

 

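ODS RTF file='D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\linreg-output.rtf';

 

proc reg data=example1;

title 'Number of Manatees killed regressed on the number of boats registered in Florida';

  /* p = predicted values, r = residual analysis, cli/clm = 95% limits for a new observation / the mean response */

  model manatees = nboats / p r cli clm;

  /* data ("o") and fitted values ("+") on the same plot */

  plot manatees*nboats="o" p.*nboats="+" / overlay;

  /* residuals versus the predictor and versus the predicted values */

  plot r.*nboats r.*p.;

run;

 

ODS RTF CLOSE;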

 

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        1711.97866     1711.97866      93.61    <.0001
Error              12         219.44991       18.28749
Corrected Total    13        1931.42857

 

 

 

 

Root MSE           4.27639    R-Square    0.8864
Dependent Mean    29.42857    Adj R-Sq    0.8769
Coeff Var         14.53141
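
The summary statistics in this part of the output follow directly from the Analysis of Variance table. The arithmetic below is a worked check using the reported values, with n = 14; the formulas are the standard definitions rather than something stated in these notes.

```latex
\begin{aligned}
R^2 &= \frac{SS(\text{Model})}{SS(\text{Corrected Total})}
     = \frac{1711.97866}{1931.42857} = 0.8864 \\[4pt]
\text{Adj } R^2 &= 1 - \frac{n-1}{n-2}\,(1 - R^2)
     = 1 - \frac{13}{12}\,(1 - 0.8864) = 0.8769 \\[4pt]
\text{Coeff Var} &= 100 \times \frac{\text{Root MSE}}{\text{Dependent Mean}}
     = 100 \times \frac{4.27639}{29.42857} = 14.53
\end{aligned}
```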

 

 

 

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             -41.43044           7.41222      -5.59      0.0001
nboats        1               0.12486           0.01290       9.68      <.0001


Output Statistics

      Dep Var  Predicted     Std Error           95% CL Mean        95% CL Predict             Std Error    Student
Obs  manatees      Value  Mean Predict                                                Residual   Residual   Residual
  1   13.0000    14.3827        1.9299    10.1779    18.5876     4.1604    24.6050    -1.3827      3.816     -0.362
  2   21.0000    16.0059        1.7974    12.0896    19.9222     5.8989    26.1130     4.9941      3.880      1.287
  3   24.0000    18.6280        1.5976    15.1472    22.1089     8.6816    28.5745     5.3720      3.967      1.354
  4   16.0000    20.7507        1.4528    17.5853    23.9161    10.9102    30.5911    -4.7507      4.022     -1.181
  5   24.0000    22.6236        1.3420    19.6997    25.5475    12.8582    32.3891     1.3764      4.060      0.339
  6   20.0000    22.4987        1.3488    19.5600    25.4375    12.7288    32.2687    -2.4987      4.058     -0.616
  7   15.0000    24.2468        1.2622    21.4968    26.9968    14.5320    33.9616    -9.2468      4.086     -2.263
  8   34.0000    28.3672        1.1482    25.8656    30.8689    18.7198    38.0147     5.6328      4.119      1.367
  9   33.0000    31.6137        1.1650    29.0753    34.1520    21.9566    41.2707     1.3863      4.115      0.337
 10   33.0000    35.2346        1.2909    32.4221    38.0472    25.5019    44.9673    -2.2346      4.077     -0.548
 11   39.0000    39.1054        1.5187    35.7963    42.4144    29.2178    48.9929    -0.1054      3.998    -0.0264
 12   43.0000    42.8512        1.7974    38.9349    46.7675    32.7442    52.9582     0.1488      3.880     0.0383
 13   50.0000    47.3462        2.1762    42.6048    52.0877    36.8917    57.8007     2.6538      3.681      0.721
 14   47.0000    48.3451        2.2647    43.4109    53.2794    37.8018    58.8884    -1.3451      3.628     -0.371

 

Output Statistics

Obs      -2-1 0 1 2      Cook's D
  1    |      |      |      0.017
  2    |      |**    |      0.178
  3    |      |**    |      0.149
  4    |    **|      |      0.091
  5    |      |      |      0.006
  6    |     *|      |      0.021
  7    |  ****|      |      0.244
  8    |      |**    |      0.073
  9    |      |      |      0.005
 10    |     *|      |      0.015
 11    |      |      |      0.000
 12    |      |      |      0.000
 13    |      |*     |      0.091
 14    |      |      |      0.027

Sum of Residuals                          0
Sum of Squared Residuals          219.44991
Predicted Residual SS (PRESS)     281.76275

 

 

Confidence Interval for β1:  b1 ± t(α/2, n-2) SE(b1)

 

Example:  Manatee data – 90% CI for the SLOPE

 

90% CI => α=0.10 => α/2=0.05 => t.05,12 = 1.782

n=14 => n-2 = 12

 

SE(b1) = 0.0129

b1 = 0.125

 

0.125 ± (1.782)(0.0129)

0.125 ± .023

0.102 < β1  < 0.148
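
The same interval could be obtained in SAS in two ways, sketched below: by asking PROC REG for confidence limits on the coefficients, or by repeating the hand calculation in a DATA step. The sketch assumes the dataset example1 from the SAS program above; the dataset ci_slope and its variable names are hypothetical, and the estimate and standard error are copied from the Parameter Estimates table.

```sas
/* Sketch only: assumes dataset example1 from the SAS program above;
   ci_slope and its variable names are hypothetical. */
proc reg data=example1;
  model manatees = nboats / clb alpha=0.10;  /* 90% limits for the coefficients */
run;

data ci_slope;
  b1 = 0.12486;  se_b1 = 0.01290;  df = 12;  alpha = 0.10;
  tcrit = tinv(1 - alpha/2, df);   /* upper 5% point of t with 12 df */
  lower = b1 - tcrit*se_b1;        /* lower 90% confidence limit */
  upper = b1 + tcrit*se_b1;        /* upper 90% confidence limit */
run;

proc print data=ci_slope;
run;
```

Both routes should agree with the hand calculation, giving limits of roughly 0.102 and 0.148.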

 

F Test of β1

 

H0: β1 = 0

 

Ha: β1 ≠ 0

 

TS:  Fobs = [SS(Reg)/1] / [SS(Resid)/(n-2)]

 

RR:  Reject H0 if Fobs > F(α, 1, n-2)

 

Conclusions

 

Where SS(Reg) = Σ(yhat_i – ybar)² and SS(Resid) = Σ(yi – yhat_i)²
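
Applied to the manatee data, the F test can be worked through with the sums of squares reported in the SAS Analysis of Variance table. The choice α = 0.05 below is an assumption made for illustration.

```latex
F_{obs} = \frac{SS(\text{Reg})/1}{SS(\text{Resid})/(n-2)}
        = \frac{1711.97866/1}{219.44991/12}
        = \frac{1711.97866}{18.28749}
        \approx 93.61
\;>\; F_{0.05,\,1,\,12} \approx 4.75
```

Since Fobs far exceeds the critical value, H0: β1 = 0 is rejected, in agreement with the reported Pr > F of <.0001. With a single predictor the F statistic is the square of the t statistic for the slope: (9.68)² ≈ 93.7, equal to 93.61 up to rounding.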

 

Alternatively, T Test of β1

 

H0: β1 = 0

 

Ha: β1 ≠ 0                Ha: β1 < 0             Ha: β1 > 0

 

TS:  tobs = b1 / SE(b1)

 

RR:  Reject H0 if

|tobs| > t(α/2, n-2)      tobs < –t(α, n-2)      tobs > t(α, n-2)

 

Conclusions:  Reject/Fail-to-reject H0?

 

P-value:

2 P(t(n-2) > |tobs|)      P(t(n-2) < tobs)       P(t(n-2) > tobs)
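
The test statistic, critical value and the three P-values can be computed in a short DATA step. This is a sketch using the slope estimate and standard error from the Parameter Estimates table; the dataset name t_slope, the variable names, and the choice α = 0.05 are assumptions made for illustration.

```sas
/* Sketch only: values copied from the Parameter Estimates table;
   dataset name, variable names and alpha = 0.05 are hypothetical choices. */
data t_slope;
  b1 = 0.12486;  se_b1 = 0.01290;  df = 12;  alpha = 0.05;
  tobs    = b1 / se_b1;                     /* test statistic */
  tcrit   = tinv(1 - alpha/2, df);          /* two-sided critical value */
  p_two   = 2*(1 - probt(abs(tobs), df));   /* Ha: slope not equal to 0 */
  p_lower = probt(tobs, df);                /* Ha: slope < 0 */
  p_upper = 1 - probt(tobs, df);            /* Ha: slope > 0 */
  reject_two = (abs(tobs) > tcrit);         /* 1 = reject H0, 0 = fail to reject */
run;

proc print data=t_slope;
run;
```

Here tobs = 0.12486/0.01290 ≈ 9.68, which matches the t Value reported by PROC REG for nboats.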

 

 

* take a look at the Manatee example from SAS output above

 

* Hypothesis tests / Confidence intervals for the intercept, b0, are similar.

 


 

Other Inference in Regression – Average responses or prediction of new observations at a particular value of x

 

X values in the dataset – x1, …, xn

 

Denote new value of X:  xn+1