IES 612/STA 4-573/STA 4-576

Spring 2005

 

Week 02 – IES612-lecture-week02.doc

 

UPDATED:  19 Jan. 2005

 

Using SAS …

 

Check out www.muohio.edu/quantapps for links to SAS help

 

Confidence Interval for β1:  b1 ± t(α/2, n-2) × SE(b1)

 

Example:  Manatee data – 90% CI for the SLOPE

 

90% CI => α = 0.10 => α/2 = 0.05 => t(0.05, 12) = 1.782

n=14 => n-2 = 12

 

SE(b1) = 0.0129

b1 = 0.125

 

0.125 ± (1.782)(0.0129)

0.125 ± 0.023

0.102 < β1 < 0.148

 

Could use SAS to do this calculation

 

/*

  tconfint.sas – 'tinv'/'quantile' (v 8/9) function

*/

options ls=80;

data myci;

  b1 = 0.12486;                   * slope estimate;

  SE = 0.01290;                   * Std. Error of b1;

 

* Tcrit = quantile('T',.95,12);  * Pr(T(12) < Tcrit) = 0.95;

  Tcrit = tinv(.95,12);    

 

/* Comment:  Area LEFT of Tcrit  = .95

             Area RIGHT of Tcrit = .05 

*/

 

  ME = Tcrit*SE;

  LCL = b1 - ME;

  UCL = b1 + ME;

 

proc print;

  run;

 

from PROC PRINT results in the SAS LISTING file

 

Obs       b1        SE       Tcrit        ME         LCL        UCL

  1     0.12486    0.0129    1.78229    0.022992    0.10187    0.14785

 

F Test of β1

H0: β1 = 0

Ha: β1 ≠ 0

 

TS:  Fobs = [SS(Reg)/1] / [SS(Resid)/(n-2)]

 

RR:  Reject H0 if Fobs > F(α, 1, n-2)

 

Conclusions

 

Where

 

FROM SAS OUTPUT …

 

Analysis of Variance

                             Sum of           Mean
Source             DF       Squares         Square    F Value    Pr > F
Model               1    1711.97866     1711.97866      93.61    <.0001
Error              12     219.44991       18.28749
Corrected Total    13    1931.42857

Fobs = 93.61 with associated P-value <0.0001

 

SS(Reg) = 1711.97866 and SS(Resid) = 219.44991

 

s² = MSE = 18.28749

 

Thoughts about the ingredients of an ANOVA table.

 

1.   ANOVA = ANalysis Of VAriance

 

2.   “Sum of Squares” represents a partitioning of the TOTAL variation into variability “explained” by a model (the linear regression model here) and the variability NOT explained (residual error)

 

3.  SS(Total) [Corrected Total SS= 1931.43 above] is “partitioned” into the SS(Regression) [Model SS =1711.98 above] and SS(Residual) [Error SS = 219.45].

 

4.   Mean Squares (MS) are defined as SS/(degrees of freedom).

 

5.   A good regression model will have SS(Regression) large relative to SS(Residual), which translates into a large value of Fobs.

 

 6.  Alternative interpretation:  SS(Residual) = error in predicting response “y” when using the linear regression model.  SS(Total) = error in predicting response “y” when using YBAR.  SS(Regression) = SS(Total) - SS(Residual) measures how much better the YHAT prediction model is when compared to YBAR.  (more to come later)
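The ingredients above can be recomputed from the two sums of squares.  The following is a minimal sketch in the style of tconfint.sas; the file name, data set name and variable names are illustrative, and alpha = 0.05 is an assumed significance level (the notes do not fix one for this test).

```sas
/*
  anovacheck.sas - rebuild the ANOVA table entries from SS(Reg) and SS(Resid)
  (file, data set and variable names are illustrative)
*/
options ls=80;
data myanova;
  n       = 14;                    * number of observations;
  SSReg   = 1711.97866;            * Model SS from the SAS output;
  SSResid = 219.44991;             * Error SS from the SAS output;
  alpha   = 0.05;                  * assumed significance level;

  SSTotal = SSReg + SSResid;       * partition of the total variation;
  MSReg   = SSReg/1;               * mean square = SS divided by df;
  MSE     = SSResid/(n-2);
  Fobs    = MSReg/MSE;             * observed F statistic;
  Fcrit   = finv(1-alpha, 1, n-2); * reject H0 if Fobs exceeds Fcrit;
  Pvalue  = 1 - probf(Fobs, 1, n-2);
run;

proc print data=myanova;
run;
```

SSTotal, MSE and Fobs should agree with the Corrected Total SS, the Error Mean Square and the F Value in the SAS output, up to rounding.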

 

 

Alternatively, T Test of β1

H0: β1 = 0

Ha: β1 ≠ 0 [some assoc.]           Ha: β1 < 0 [negative assoc.]        Ha: β1 > 0 [positive association]

 

TS:  tobs = b1 / SE(b1)

 

RR:  Reject H0 if

|tobs| > t(α/2, n-2)                                tobs < -t(α, n-2)                                tobs > t(α, n-2)

 

Conclusions:  Reject/Fail-to-reject H0?

 

P-value:

2 P(t(n-2) > |tobs|)                         P(t(n-2) < tobs)                            P(t(n-2) > tobs)

 

* take a look at the Manatee example from SAS output

Parameter Estimates

                   Parameter    Standard
Variable     DF     Estimate       Error    t Value    Pr > |t|
Intercept     1    -41.43044     7.41222      -5.59      0.0001
nboats        1      0.12486     0.01290       9.68      <.0001

 

H0: β1 = 0

Ha: β1 ≠ 0 [some assoc.]

TS:  tobs = 0.12486/0.01290 = 9.68

P-value < 0.0001

Decision/Conclusion:   REJECT H0 and conclude that there is a linear relationship between the number of manatees killed and the number of boats registered in Florida.

 

Comments:       Always write your conclusions in the words of the problem.  Translate the symbol representation back to the real world.

 

A confidence interval demonstrates the magnitude of the linear effect. 

 

Tests and confidence intervals are related.  For example, if a 100(1-α)% confidence interval for a parameter, say β1, does NOT contain 0 (e.g. 0.102 < β1 < 0.148), then you would reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at significance level α.

 

 

/*

  tPvalue.sas

*/

options ls=80;

data myci;

  b1 = 0.12486;                   * slope estimate;

  SE = 0.01290;                   * Std. Error of b1;

  tcalc = 9.68;                 * t statistic value;

  df = 12;

  P_lower = probt(tcalc, df); 

  P_upper = 1-probt(tcalc, df);

  P_two_tail = 2*(1-probt(abs(tcalc),df));

 

* Note:  SAS version 9 uses 'CDF' as a generalization

         of 'probt';

 

proc print;

  run;

 

from PROC PRINT results in the SAS LISTING file

 

Obs       b1        SE      tcalc    df    P_lower       P_upper    P_two_tail

 

  1     0.12486    0.0129     9.68    12    1.00000    .000000254    .000000508

 

 

* Hypothesis tests / Confidence intervals for the intercept, β0, are similar.

 

*

 

*  Can you select design points to have more precision when estimating the slope?
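One way to approach the design-point question, offered as a sketch based on the standard simple linear regression result rather than on anything stated in these notes:  the standard error of the slope estimate depends on how spread out the x values are.  Here s is Root MSE, and S_xx is notation introduced for the sum of squared deviations of the x values.

```latex
\[
\mathrm{SE}(b_1) = \frac{s}{\sqrt{S_{xx}}},
\qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
```

For fixed n and s, placing the design points farther apart increases S_xx and so decreases SE(b1), provided the relationship stays linear over the wider range.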

 

Remedial Measures and Transformations

 

RECALL:  Basic Model

 

Yi = β0 + β1 Xi + εi    [“simple linear regression”]

 

Y = response variable (dependent variable)

 

X = predictor variable (independent variable, covariate)

 

Formal assumptions:

1.  Linear relation – on average the error is 0 [ E(εi) = 0 ] –> E(Yi) = β0 + β1 Xi

2.  Constant variance – V(εi) = σ² –> V(Yi) = σ²

3.  εi independent

4.  εi ~ Normal

 

 

We will talk more about model adequacy.  Now, a few remarks about a special case in which the first assumption might be violated.

 

There may be times when a nonlinear relationship can still be modeled by linear regression, after transforming the response or the predictor.

 

Example:  MPH and Vehicle Density on a Connecticut Highway

 

 

 

What if we plot the Log(MPH) vs. Vehicle Density?

 

 

 

 

Ref: http://lib.stat.cmu.edu/DASL/Datafiles/transformationdat.html and

B.D. Greenshields and F.M. Weida, Statistics with Applications to Highway Traffic Analysis, Eno Foundation, 1978, 129-131. (DENS, MPH below)

 

*    other common examples– exponential growth and decay

 

*    LOG10 transformations are also commonly used when the range of the response or predictor variable spans many orders of magnitude (e.g. per capita GNP, population size, geographic area).
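Why a log transformation can straighten an exponential growth or decay curve, as a short sketch.  The symbols γ0 and γ1 are introduced here for illustration only, and the error term is left out; an error that is multiplicative on the original scale becomes additive after taking logs.

```latex
\[
Y = \gamma_0 \, e^{\gamma_1 X}
\quad\Longrightarrow\quad
\log(Y) = \log(\gamma_0) + \gamma_1 X = \beta_0 + \beta_1 X,
\qquad \beta_0 = \log(\gamma_0),\ \ \beta_1 = \gamma_1
\]
```

So a simple linear regression of log(Y) on X estimates the growth or decay rate γ1 as its slope.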

 

Other Inference in Regression – Average responses or prediction of new observations at a particular value of x

 

X values in the dataset – x1, …, xn

 

Denote new value of X:  xn+1

 

Prediction of the mean response (or new response) at this x value: 

 

SE of this prediction:    

 

Confidence Interval for the Mean Response:

 

Observation:  As xn+1 gets farther from XBAR, the SE of the prediction increases (an “extrapolation” penalty)

 

<See sketch>

 

Prediction Interval for a New Response

 

Both the uncertainty in the location of the MEAN RESPONSE and the variability of an individual value about that mean response must be considered.
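For reference, the standard textbook forms of these quantities for simple linear regression are sketched below.  They are not copied from these notes; s = Root MSE, and S_xx is notation introduced here for the sum of squared deviations of the x values.

```latex
\begin{align*}
\hat{y}_{n+1} &= b_0 + b_1 x_{n+1} \\[4pt]
\mathrm{SE}(\hat{y}_{n+1}) &= s \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}},
  \qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \\[4pt]
\text{CI for the mean response:}\quad
  & \hat{y}_{n+1} \pm t_{\alpha/2,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}} \\[4pt]
\text{PI for a new response:}\quad
  & \hat{y}_{n+1} \pm t_{\alpha/2,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}}
\end{align*}
```

The extra "1" under the square root in the prediction interval is the variability of an individual value about its mean, which is why the prediction interval is always wider than the confidence interval for the mean response.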

 

 

Comment:

*  SAS Proc GLM options “clm” = mean response CI and “cli” = prediction intervals

 

From Manatee SAS output

      Dep Var  Predicted     Std Error                                                    Std Error   Student
Obs  manatees      Value  Mean Predict         95% CL Mean      95% CL Predict  Residual   Residual  Residual
  1   13.0000    14.3827        1.9299   10.1779   18.5876    4.1604   24.6050   -1.3827      3.816    -0.362
  2   21.0000    16.0059        1.7974   12.0896   19.9222    5.8989   26.1130    4.9941      3.880     1.287
  3   24.0000    18.6280        1.5976   15.1472   22.1089    8.6816   28.5745    5.3720      3.967     1.354
  4   16.0000    20.7507        1.4528   17.5853   23.9161   10.9102   30.5911   -4.7507      4.022    -1.181
  5   24.0000    22.6236        1.3420   19.6997   25.5475   12.8582   32.3891    1.3764      4.060     0.339
  6   20.0000    22.4987        1.3488   19.5600   25.4375   12.7288   32.2687   -2.4987      4.058    -0.616
  7   15.0000    24.2468        1.2622   21.4968   26.9968   14.5320   33.9616   -9.2468      4.086    -2.263
  8   34.0000    28.3672        1.1482   25.8656   30.8689   18.7198   38.0147    5.6328      4.119     1.367

 

Suppose xn+1 = 559 (corresponds to the 8th observation)

 

25.87 < E(Yn+1) < 30.87     [95% CI for the mean response]

18.72 < Yn+1 < 38.01     [95% prediction interval for a new response]
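These two intervals can be checked by hand from the listing.  The sketch below follows the pattern of tconfint.sas; the file name, data set name and variable names are illustrative.  It assumes the usual identity SE(new response)² = MSE + SE(mean)², which is the standard result and is not spelled out in these notes.

```sas
/*
  predint.sas - check the 95 percent intervals at x = 559 (observation 8)
  (file, data set and variable names are illustrative)
*/
options ls=80;
data mypred;
  yhat   = 28.3672;                * Predicted Value from the SAS output;
  SEmean = 1.1482;                 * Std Error Mean Predict;
  MSE    = 18.28749;               * Error Mean Square from the ANOVA table;
  df     = 12;

  Tcrit  = tinv(.975, df);         * area .025 in each tail;
  SEpred = sqrt(MSE + SEmean**2);  * SE for predicting a new response;

  CLM_lo = yhat - Tcrit*SEmean;    * CI for the mean response;
  CLM_hi = yhat + Tcrit*SEmean;
  CLI_lo = yhat - Tcrit*SEpred;    * prediction interval for a new response;
  CLI_hi = yhat + Tcrit*SEpred;
run;

proc print data=mypred;
run;
```

The four limits should agree with the 95% CL Mean and 95% CL Predict columns for observation 8 to about three decimal places.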

 

Correlation and Coefficient of Determination – Measures of strength of Association

 

Slope Estimator:        

Correlation Coefficient:          

 

So rYX = (Estimated slope) TIMES [SD(X) / SD(Y)] = “rescaled” slope estimate
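The rescaling can be seen from the usual definitions, sketched here in standard notation.  S_xy, S_xx and S_yy are symbols introduced for this sketch, and s_X and s_Y are the sample standard deviations of X and Y.

```latex
\[
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2
\]
\[
b_1 = \frac{S_{xy}}{S_{xx}},
\qquad
r_{YX} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}
\]
\[
r_{YX} = b_1 \sqrt{\frac{S_{xx}}{S_{yy}}} = b_1 \, \frac{s_X}{s_Y},
\qquad
s_X = \sqrt{\frac{S_{xx}}{n-1}},\ \ s_Y = \sqrt{\frac{S_{yy}}{n-1}}
\]
```

In particular rYX and the estimated slope always have the same sign.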

 

Observations:

1.    Pearson product-moment correlation (other types of correlation coefficients defined – e.g. Spearman’s rho)

2.   –1 <= rYX <= 1

3.   rYX = 0 IMPLIES no LINEAR  relationship

4.   the correlation coefficient tends to increase as the range of X increases

5.   a test of whether the population correlation coefficient = 0 is given but not discussed, since it is equivalent to the test of the slope

 

 

SKETCH various scatterplots associated with r=0.9,  r=0.3,  r=0,  r=-0.3,  r=-0.9

 

Coefficient of Determination “R-square”:

 

 

“proportionate reduction in prediction error when using YHAT instead of YBAR to predict y”

 

“proportion of total variability accounted for/explained by the linear regression model”
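Both descriptions correspond to the usual definition in terms of the ANOVA sums of squares.  The sketch below gives the standard form, followed by the value obtained from the Manatee ANOVA table.

```latex
\[
R^2 = \frac{SS(\mathrm{Regression})}{SS(\mathrm{Total})}
    = 1 - \frac{SS(\mathrm{Residual})}{SS(\mathrm{Total})}
\]
\[
R^2 = \frac{1711.97866}{1931.42857} \approx 0.8864
\]
```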

 

Comments:

*    Coefficient of determination = (rYX)² = (correlation coefficient)² for simple linear regression – NOT for multiple regression!

*    When people report a significant correlation coefficient of 0.40 between two variables X and Y, recognize that this means that only 16% (.4 x .4) of the variation in one variable is accounted for by its linear association with the other variable.

 

*    SAS Proc CORR can be used to determine the correlation between variables

 

Example:  Manatee deaths and boats registered

 

Root MSE           4.27639    R-Square    0.8864
Dependent Mean    29.42857    Adj R-Sq    0.8769
Coeff Var         14.53141

 

 

 

r² = 0.8864, so approx. 89% of the variation in the number of manatees killed is explained by a linear relationship with the number of boats registered.