IES 612/STA 4-573/STA 4-576

Spring 2005

 

Week 05 – IES612-lecture-week05.doc

 

Model-Building Comments

 

* Brain-first, computer-second – pick variables for inclusion in the model that are important based on your knowledge of the system.  This doesn’t exclude the possibility of exploratory analyses; it simply asks for careful consideration of variables to be considered for inclusion in a model. 

 

Hierarchical Principle:  If you have a higher order term, include all lower order constituent terms.  For example,  “X2  should be in a model if “X” (the linear term) is already present.  The interaction term, “X1*X2” should be in the model if “X1” and “X2” are in the model. 

 

Aside:  Interactions can be used to allow for the relationship between the response variable “Y” and an “X” variable, say “X1”, to vary with levels of another predictor variable “X2”.  We will discuss this in more detail when we consider factorial designs in anova models.  SYNERGY = observed response is beyond what would be predicted from the additive effect of the two variables [e.g. lung cancer as a function of smoking and asbestos exposure].  ANTAGONISM = observed response is less than would be predicted from the additive effects of the two variables [e.g. competition for the same binding sites].

 

Principle of Parsimony:  The model should be no more complicated than required to describe the response.

 

Variable selection methods

 

i.          ALL POSSIBLE REGRESSIONS (SAS Proc RSQUARE)

 

Suppose you have 5 possible predictor variables – X1, X2, X3, X4, X5 - then you have 25-1 = 31 possible models

 

#variables present

X1

X2

X3

X4

X5

1

X1

 

 

 

 

 

X2

 

 

 

 

 

X3

 

 

 

 

 

X4

 

 

 

 

 

X5

2

X1

X2

 

 

 

X1

 

X3

 

 

X1

 

 

X4

 

X1

 

 

 

X5

 

X2

X3

 

 

 

X2

 

X4

 

 

X2

 

 

X5

 

 

X3

X4

 

 

 

X3

 

X5

 

 

 

X4

X5

3

X1

X2

X3

 

 

X1

X2

 

X4

 

X1

X2

 

 

X5

X1

 

X3

X4

 

X1

 

X3

 

X5

X1

 

 

X4

X5

 

X2

X3

X4

 

 

X2

X3

 

X5

 

X2

 

X4

X5

 

 

X3

X4

X5

4

X1

X2

X3

X4

 

X1

X2

X3

 

X5

X1

X2

 

X4

X5

X1

 

X3

X4

X5

 

X2

X3

X4

X5

5

X1

X2

X3

X4

X5

 

Which model do you select?

 

1.  Models with smallest incremental increase in R2 or smallest incremental decrease in SSE – look for the “elbow” in these indices.

 

2.  Adjusted R2 = 1 – [(n-1)/(n-p)]*SSE/SST and MSE=SSE/(n-p) adjust for the number of parameters in the model and are frequently used.

 

3.  Cp is an index of model biasedness – best if it is close in value to the number of parameters in the model.

 

4.  PRESS statistic = PREdiction Sum of Square criterion.  Similar to SSE however predict each point from the remaining n-1 observations (i.e. delete ith case and then predict). 

 

5.  After you select a model, you still need to do all of the adequacy checks as we discussed previously.

 

Example:  Country data set – see country_mreg_varsel_27jan2003.doc

 

proc rsquare data=country outest=est mse cp sse;

title2 all possible regression;

  model lifewom = logarea logpopn pcturban liter loggnp;

  run;

 

OR

data country;

title ‘country data analysis’;

 infile "\\Casnov5\MST\MSTLab\Baileraj\country.data";  * reads an data file;

 input  name $ area popnsize pcturban lang $ liter lifemen

         lifewom pcGNP;

 logarea = log10(area);

 logpopn = log10(popnsize);

 loggnp  = log10(pcGNP);

 ienglish = (lang="English");

 drop area popnsize pcgnp;

 

 proc reg data=country;

 model lifewom= logarea logpopn pcturban liter loggnp

            / selection=rsquare;

  plot cp.*np. rsq.*np. mse.*np.

   run;

 

[output – Edited]

 

The REG Procedure

Model: MODEL1

Dependent Variable: lifewom

 

R-Square Selection Method

 

Number of Observations Read                         79

Number of Observations Used                         67

Number of Observations with Missing Values          12

 

Number in

  Model      R-Square    Variables in Model

       1       0.6583    liter

       1       0.6299    loggnp

       1       0.4643    pcturban

       1       0.0161    logpopn

       1       0.0146    logarea

---------------------------------------------------------------

       2       0.7795    liter loggnp

       2       0.7100    pcturban liter

       2       0.6611    logpopn liter

       2       0.6610    logarea liter

       2       0.6449    pcturban loggnp

       2       0.6335    logarea loggnp

       2       0.6330    logpopn loggnp

       2       0.4930    logarea pcturban

       2       0.4843    logpopn pcturban

       2       0.0183    logarea logpopn

---------------------------------------------------------------

       3       0.7813    logarea liter loggnp

       3       0.7810    logpopn liter loggnp

       3       0.7801    pcturban liter loggnp

       3       0.7179    logarea pcturban liter

       3       0.7158    logpopn pcturban liter

       3       0.6616    logarea logpopn liter

       3       0.6523    logarea pcturban loggnp

       3       0.6503    logpopn pcturban loggnp

       3       0.6339    logarea logpopn loggnp

       3       0.4943    logarea logpopn pcturban

---------------------------------------------------------------

       4       0.7825    logarea pcturban liter loggnp

       4       0.7820    logpopn pcturban liter loggnp

       4       0.7815    logarea logpopn liter loggnp

       4       0.7183    logarea logpopn pcturban liter

       4       0.6527    logarea logpopn pcturban loggnp

---------------------------------------------------------------

       5       0.7827    logarea logpopn pcturban liter loggnp

 

 

Plot of R2 vs. number of parameters in the model

Plot of MSE vs. number of parameters in the model

 

 

ii.          STEPWISE and other automatic variable selection methods

 

Backwards elimination:  Start with all predictor variables in a model and eliminate those that are not important (don’t meet a significance level to stay in a model – SLSTAY in SAS).

 

Forward selection:  Start with none of predictor variables in a model and add those that are important (meet a significance level to enter in a model – SLENTER in SAS).

 

Stepwise (Forward) selection:  Start with none of predictor variables in a model and add those that are important (meet a significance level to enter in a model) but also check to see if variables previously added should be allowed to stay.

 

proc stepwise data=country;

title2 automatic variable selection methods;

  model lifewom = pcturban liter logarea logpopn loggnp /

                  forward backward stepwise;

  run;

 

FORWARD RESULTS:

No other variable met the 0.5000 significance level for entry into the model.

                                The STEPWISE Procedure

                                       Model: MODEL1

                               Dependent Variable: lifewom

                                Summary of Forward Selection

 

                Variable     Number     Partial      Model

        Step    Entered      Vars In    R-Square    R-Square     C(p)      F Value    Pr > F

          1     liter            1       0.6583      0.6583     32.9375     125.23    <.0001

          2     loggnp           2       0.1212      0.7795      0.9123      35.17    <.0001

          3     logarea          3       0.0018      0.7813      2.4144       0.51    0.4774

 

BACKWARD RESULTS:

All variables left in the model are significant at the 0.1000 level

                                       Summary of Backward Elimination

                 Variable     Number     Partial      Model

         Step    Removed      Vars In    R-Square    R-Square     C(p)      F Value    Pr > F

           1     logpopn          4       0.0002      0.7825      4.0613       0.06    0.8053

           2     pcturban         3       0.0013      0.7813      2.4144       0.36    0.5515

           3     logarea          2       0.0018      0.7795      0.9123       0.51    0.4774

 

All variables left in the model are significant at the 0.1500 level.

No other variable met the 0.1500 significance level for entry into the model.

 

Alternatively, can also do this in SAS PROC REG by specifying other “selection” methods.

proc reg data=country;

 model lifewom= logarea logpopn pcturban liter loggnp

            / selection=backward;

 

What if the response is not linearly related to the regression parameters?

 

Examples:

1.         Exponential forms - one compartment models+ (e.g. toxicokinetics)

2.         Periodic forms - cyclic responses (e.g. time series)

3.         Saturable forms – constrained growth (e.g. logistic growth models)

 

Strategy?

1.         Transformation – is the model intrinsically linear?  For example, the allometric (power) model, Y = aXb can be linearized, log(Y) = log(a) + b log(X) => log(Y) = a* + b log(X).

 

2.         Fit a nonlinear model Y = g(X1, X2, … Xp; b1, … b p) + e

[SAS Proc NLIN].  Still use least squares to obtain estimates; however, may be tougher to fit.  Starting values are important (may not converge if start with crummy values). 

 

EXAMPLE:   SAS code to fit Michaelis-Menten curve – uses gradients
options nocenter formdlim="-";
 data puromycn;
   input conc vel @@;
   datalines;
   0.02   76     0.02   47     0.06   97     0.06  107   
   0.11  123     0.11  139     0.22  159     0.22  152   
   0.56  191     0.56  201     1.10  207     1.10  200   
  proc nlin method=marquardt;
     parameters  b1  = 195.80271
                 b2  = 0.04841  ;
     model vel = b1*conc/(b2 + conc) ;
     der.b1 = conc/(b2 + conc);
     der.b2 = -b1*conc/((b2 + conc)**2);
     output out=mmout predicted=vel_hat residual=vel_res;
  run;
 
  proc gplot;
     plot vel*conc vel_hat*conc/overlay;
     plot vel_res*vel_hat / vref=0;
     run;

 

The NLIN Procedure

Dependent Variable vel

Method: Marquardt

 

             Iterative Phase

                                   Sum of

 Iter          b1          b2     Squares

    0       195.8      0.0484      1920.6

    1       210.9      0.0614      1207.9

    2       212.5      0.0638      1195.6

    3       212.7      0.0641      1195.5

    4       212.7      0.0641      1195.4

    5       212.7      0.0641      1195.4

 

NOTE: Convergence criterion met.

 

       Estimation Summary

Method                  Marquardt

Iterations                      5

R                        9.865E-6

PPC(b2)                  4.029E-6

RPC(b2)                  0.000042

Object                   1.149E-8

Objective                1195.449

Observations Read              12

Observations Used              12

Observations Missing            0

------------------------------------------------------------------------------------------------

 

The NLIN Procedure

NOTE: An intercept was not specified for this model.

 

                                  Sum of        Mean               Approx

Source                    DF     Squares      Square    F Value    Pr > F

Model                      2      270214      135107    1130.18    <.0001

Error                     10      1195.4       119.5

Uncorrected Total         12      271409

 

                              Approx

Parameter      Estimate    Std Error    Approximate 95% Confidence Limits

b1                212.7       6.9471       197.2       228.2

b2               0.0641      0.00828      0.0457      0.0826

 

  Approximate Correlation Matrix

                b1              b2

b1       1.0000000       0.7650834

b2       0.7650834       1.0000000

 

Plot of observed data with superimposed fit

Plot of residuals versus predicted values.

 

 

 

EXAMPLE: continuous three-segment model to Pond Water Level Data;

 

data dpond;

   input  aquifer pond @@;

   datalines;

  41.54    54.267      43.60    54.227      42.92    54.207

  42.45    54.217      42.61    54.257      43.01    54.347

  43.88    54.267      43.31    54.21       41.87    54.207

  41.52    54.237      41.29    54.217      44.53    54.257

  44.22    54.247      44.03    54.227      43.45    54.217

  42.85    54.227      42.25    54.197      41.64    54.177

  40.63    54.197      42.21    54.257      42.63    54.247

  42.37    54.237      41.92    54.227      41.38    54.227

  41.09    54.227      40.80    54.237      40.64    54.257

  41.54    54.247      41.61    54.237      43.60    54.267

  43.99    54.267      43.78    54.247      43.58    54.267

  43.13    54.227      43.17    54.247      42.75    54.247

  42.37    54.237      41.95    54.207      41.49    54.187

  40.90    54.157      40.38    54.087      39.99    54.007

  39.93    53.987      39.74    53.957      39.41    53.827

  40.20    53.957      40.79    53.987      39.71    53.897

  39.28    53.807      38.51    53.627      38.06    53.487

  36.58    53.527      36.58    53.487      36.58    53.487

  36.56    53.487      36.71    53.487

  36.98    53.487      37.18    53.487      38.15    53.607

;

/*

 E(Y) = b0                            [aquifer <= tau1        ]

      = b0 + b1*(aquifer-tau1)        [tau1 < aquifer <= tau2 ]

      = b0 + b1*(tau2-tau1) + b3*(aquifer-tau2)

                                      [aquifer > tau2         ]

*/

 

  proc nlin data=dpond method=dud;

   parameters  b0 = 53.5

               b1 =  0.1641

               b3 =  0.0114

               tau1 = 38.0

               tau2 = 41.0   ;

   if aquifer <= tau1 then do;              *left segment;     

      model pond = b0;

   end;

 

   else if tau1 < aquifer <= tau2 then do;  * middle segment;

      model pond = b0 + b1*(aquifer-tau1) ;

   end;

 

   else do;                               *right segment;

      model pond = b0 + b1*(tau2-tau1)+b3*(aquifer-tau2);

   end;

 

   output out=pondout predicted=pond_hat residual=pond_res;

 

  run;

 

  proc gplot data=pondout;

    plot pond*aquifer pond_hat*aquifer/overlay;

      plot pond_res*pond_hat / vref=0;

      run;

 

The NLIN Procedure

Dependent Variable pond

Method: Gauss-Newton

 

                               Iterative Phase

                                                                       Sum of

 Iter          b0          b1          b3        tau1        tau2     Squares

    0     53.5000      0.1641      0.0114     38.0000     41.0000      2.3237

    1     53.4927      0.2335      0.0113     37.8476     40.9986      0.1064

    2     53.4927      0.2335      0.0113     37.8929     40.9990      0.1006

NOTE: Convergence criterion met.

------------------------------------------------------------------------------------------------

 

The NLIN Procedure

                                  Sum of        Mean               Approx

Source                    DF     Squares      Square    F Value    Pr > F

Model                      4      4.3020      1.0755     577.07    <.0001

Error                     54      0.1006     0.00186

Corrected Total           58      4.4026

 

                              Approx

Parameter      Estimate    Std Error    Approximate 95% Confidence Limits

b0              53.4927       0.0163     53.4600     53.5254

b1               0.2335       0.0120      0.2094      0.2575

b3               0.0113      0.00760    -0.00397      0.0265

tau1            37.8929       0.1296     37.6329     38.1528

tau2            40.9990       0.1032     40.7921     41.2060

 

                           Approximate Correlation Matrix

                  b0              b1              b3            tau1            tau2

b0         1.0000000      -0.0000000       0.0000000       0.5390895       0.0000000

b1        -0.0000000       1.0000000       0.0000000       0.7630448      -0.6160398

b3         0.0000000       0.0000000       1.0000000       0.0000000      -0.5483407

tau1       0.5390895       0.7630448       0.0000000       1.0000000      -0.3022819

tau2       0.0000000      -0.6160398      -0.5483407      -0.3022819       1.0000000

 

What if the response is not normal?

 

Examples:

1.         Dichotomous response - Survive/Die;  Possess characteristic / don’t [binomial]

2.         Count of events - number of young produced in 3 broods [Poisson, neg. binomial]

3.         Event when first response occurs - number of cycles until conception [geometric]

4.         Non-negative response – survival time [weibull, log-normal]

 

Strategy?

1.         Transformation – can the response be normalized?  For example, the log transformation can help normalize right skewed data, square root transformations are often recommended for count data and arcsin square root / angular transformations are often recommended for dichotomous data.

 

2.         Fit a non-normal response model Y = g(X1, X2, … Xp; b1, … b p) + e

[SAS Proc GENMOD or LOGISTIC or PROBIT or …]

 

A generalized linear model has a RESPONSE Variable with a distribution that is a member of the exponential family of distributions (e.g. Normal, Binomial, Poisson) AND has a LINK function that is linearly related to the predictor variables.  LOGISTIC Regresion and PROBIT regression are both models for dichotomous response (assumes binomial response).

 

/* ============================================================

GliM-logistic-example-04feb03.doc

Example of SAS statements used to compare the  concentration-mortality curves for  fathead minnows exposed to pentachlorobenzene  or dichlorobenzene.

Ref:  Oris, J.T. and Bailer, A.J. (1997) Equivalence of concentration-response distributions in aquatic toxicology: testing and implications for potency estimation. Environmental Toxicology and Chemistry 16: 2204-2209

 

Variables:

chem   = chemical tested where PCBZ is pentachlorobenzene and DCBZ is dichlorobenzene       

 conc  = concentration of compound in micromoles per liter (measured)

 mort  = number of organisms dead after 96 hours exposure to respective chemical        

    n    = number of organisms on test at each concentration  

 

NOTES:

  Create an indicator variable 'iPCBZ' = 0 for DCBZ and =1 for PCBZ.

  Variable 'iPCBZ' is used to generate beta2 (additional intercept parameter PCBZ).        

  Variable 'ic' is used to generate beta3 (additional slope parameter for PCBZ). 

============================================================ */

 

data crcomp; 

    input chem $ conc mort n;   

    iPCBZ = (chem = 'PCBZ');

    ic = iPCBZ*conc;

    obs_resp = mort/n;         

datalines;

   PCBZ                0              13           200

   PCBZ               23.3            35           192

   PCBZ               34              54           203

   PCBZ               40.8            79           164

   PCBZ               50.6           126           162

   PCBZ               75.2           204           207

   DBCZ                0              14           200

   DBCZ              262              55           197

   DBCZ              376              80           151

   DBCZ              650             210           219

;

proc gplot data=crcomp;

  plot obs_resp *conc = chem;

  run;

 

proc logist data=crcomp;

   model mort/n=conc iPCBZ ic;

   test iPCBZ=ic=0;  * same dose-response relationship;

   run;

 

proc genmod data=crcomp;

  class chem;

  model mort/n = conc chem conc*chem/d=binomial link=logit

        type1 lrci;

    run;

 

proc sort data=crcomp; by chem;

proc probit data=crcomp; by chem;

   model mort/n = conc / d=logistic inversecl;

   title1 'LCx estimation by chemical';

   run;

 

 

 

The LOGISTIC Procedure

 

               Model Information

Data Set                       WORK.CRCOMP

Response Variable (Events)     mort

Response Variable (Trials)     n

Model                          binary logit

Optimization Technique         Fisher's scoring

 

Number of Observations Read          10

Number of Observations Used          10

Sum of Frequencies Read            1895

Sum of Frequencies Used            1895

 

          Response Profile

 Ordered     Binary           Total

   Value     Outcome      Frequency

       1     Event              870

       2     Nonevent          1025

 

                    Model Convergence Status

         Convergence criterion (GCONV=1E-8) satisfied.

 

------------------------------------------------------------------------------------------------

 

The LOGISTIC Procedure

 

         Model Fit Statistics

                             Intercept

              Intercept            and

Criterion          Only     Covariates

AIC            2616.336       1603.083

SC             2621.883       1625.271

-2 Log L       2614.336       1595.083

 

        Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square       DF     Pr > ChiSq

Likelihood Ratio      1019.2525        3         <.0001

Score                  824.7519        3         <.0001

Wald                   449.3479        3         <.0001

 

             Analysis of Maximum Likelihood Estimates

                               Standard          Wald

Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq

Intercept     1     -3.1330      0.2339      179.3568        <.0001

conc          1     0.00895    0.000634      199.2060        <.0001

iPCBZ         1     -0.6071      0.3340        3.3046        0.0691

ic            1      0.0840     0.00591      201.8771        <.0001

 

---------------------------------------------------------------------------------------------

           Odds Ratio Estimates

             Point          95% Wald

Effect    Estimate      Confidence Limits

conc         1.009       1.008       1.010

iPCBZ        0.545       0.283       1.049

ic           1.088       1.075       1.100

 

Association of Predicted Probabilities and Observed Responses

Percent Concordant      85.5    Somers' D    0.762

Percent Discordant       9.3    Gamma        0.803

Percent Tied             5.1    Tau-a        0.379

Pairs                 891750    c            0.881

 

      Linear Hypotheses Testing Results

                   Wald

 Label       Chi-Square      DF    Pr > ChiSq

 Test 1        365.9918       2        <.0001  /*----- same intercept and slope? -----*/

 

-------------------------------------------------------------------------------------------------

 

The GENMOD Procedure

            Model Information

Data Set                      WORK.CRCOMP

Distribution                     Binomial

Link Function                       Logit

Response Variable (Events)           mort

Response Variable (Trials)              n

Number of Observations Read          10

Number of Observations Used          10

Number of Events                    870

Number of Trials                   1895

 

    Class Level Information

Class      Levels    Values

chem            2    DBCZ PCBZ

 

           Criteria For Assessing Goodness Of Fit

Criterion                 DF           Value        Value/DF

Deviance                   6         32.5947          5.4324

Scaled Deviance            6         32.5947          5.4324

Pearson Chi-Square         6         36.5932          6.0989

Scaled Pearson X2          6         36.5932          6.0989

Log Likelihood                     -797.5415

 

-------------------------------------------------------------------------------------------------

 

The GENMOD Procedure

Algorithm converged.

                                Analysis Of Parameter Estimates

                                       Standard    Likelihood Ratio 95%       Chi-

Parameter            DF    Estimate       Error      Confidence Limits      Square    Pr > ChiSq

Intercept             1     -3.7401      0.2383     -4.2267     -3.2912     246.23        <.0001

conc                  1      0.0929      0.0059      0.0819      0.1050     250.04        <.0001

chem         DBCZ     1      0.6071      0.3340     -0.0494      1.2631       3.30        0.0691

chem         PCBZ     0      0.0000      0.0000      0.0000      0.0000        .           .

conc*chem    DBCZ     1     -0.0840      0.0059     -0.0961     -0.0729     201.88        <.0001

conc*chem    PCBZ     0      0.0000      0.0000      0.0000      0.0000        .           .

Scale                 0      1.0000      0.0000      1.0000      1.0000

 

NOTE: The scale parameter was held fixed.

 

            LR Statistics For Type 1 Analysis

                                        Chi-

Source         Deviance        DF     Square    Pr > ChiSq

Intercept     1051.8472

conc           770.9489         1     280.90        <.0001

chem           448.7573         1     322.19        <.0001

conc*chem       32.5947         1     416.16        <.0001

 

-------------------------------------------------------------------------------------------------

 

Probit Procedure

chem=DBCZ

 

                  Analysis of Parameter Estimates

                      Standard   95% Confidence     Chi-

Parameter DF Estimate    Error       Limits       Square Pr > ChiSq

 

Intercept  1  -3.1330   0.2339  -3.5915  -2.6745  179.36     <.0001

conc       1   0.0089   0.0006   0.0077   0.0102  199.21     <.0001

 

Probit Model in Terms of Tolerance Distribution

 

        MU         SIGMA

350.247319    111.793148

 

       Estimated Covariance Matrix

        for Tolerance Parameters

 

                     MU             SIGMA

MU           125.437167          9.142389

SIGMA          9.142389         62.737616

 

Probability          conc       95% Fiducial Limits

       0.25        227.43        198.62        252.05

       0.50        350.25        328.40        372.74  /*----- LC50 for DCBZ -----*/

 

-------------------------------------------------------------------------------------------------

 

chem=PCBZ

 

Probit Procedure

                 Analysis of Parameter Estimates

                      Standard   95% Confidence     Chi-

Parameter DF Estimate    Error       Limits       Square Pr > ChiSq

Intercept  1  -3.7401   0.2383  -4.2073  -3.2730  246.23     <.0001

conc       1   0.0929   0.0059   0.0814   0.1044  250.04     <.0001

 

Probit Model in Terms of Tolerance Distribution

        MU         SIGMA

40.2463613    10.7607534

 

       Estimated Covariance Matrix

        for Tolerance Parameters

                     MU             SIGMA

MU             0.742593          0.085862

SIGMA          0.085862          0.463092

 

Probability          conc       95% Fiducial Limits

       0.25        28.424        26.182        30.360

       0.50        40.246        38.575        41.980 /*----- LC50 for PCBZ -----*/

 

What if the response variability is not constant?

 

Example:

1.         Often response variability is observed to increase with mean response.

 

Strategy?

1.         Transformation – can the response variability be stabilized?  For example, the log transformation can help as can often stabilize the variance along with helping to symmetrize the response distribution.

 

2.         Use weighted least squares – [SAS Proc REG with WEIGHT option]

Weighted Least squares – minimize where wi = 1/var(Yi)

[Assumes that weights are known; an approximation if not.]

 

options nocenter formdlim="-";

/* data from Neter et al. ALSM Ch. 10, Table 10.1 */

data nknw10p1;

  input age blood_pressure @@;

  wtage = 1/age;

  wtage2 = 1/(age**2);

  datalines;

  27   73  21   66  22   63   24   75  25   71  23   70

  20   65  20   70  29   79  24   72  25   68  28   67

  26   79  38   91  32   76  33   69  31   66  34   73

  37   78  38   87  33   76  35   79  30   73  31   80

  37   68  39   75  46   89  49  101  40   70  42   72

  43   80  46   83  43   75  44   71  46   80  47   96

  45   92  49   80  48   70  40   90  42   85  55   76

  54   71  57   99  52   86  53   79  56   92  52   85

  50   71  59   90  50   91  52  100  58   80  57  109

;

proc reg data=nknw10p1;

title "Unweighted LS: BP = age";

  model blood_pressure = age;

  plot r.*age;

  output out=bpres r=resid;

run;

 

/*----  Weighted analysis 1  ----*/

proc reg data=nknw10p1;

title "Weighted LS BP = age with WT = 1/age";

title2 "[assumes that V(BP) proportional to age]";

  model blood_pressure = age;

  weight wtage;

run;

 

/*----  Weighted analysis 2  ----*/

proc reg data=nknw10p1;

title "Weighted LS BP = age with WT = 1/age";

title2 "[assumes that SD(BP) proportional to age]";

  model blood_pressure = age;

  weight wtage2;

run;

 

/*----  Weighted analysis 3   ----*/

/* (as suggested in Neter et al.) */

 

data newblood; set bpres;

title "Weighted LS BP = age with WT = 1/(est SD^2)";

title2 "[assumes that SD=b0+b1*age]";

  absres = abs(resid);

 

proc reg data=newblood;

title3 "Regress abs(resid) vs. age to obtain SD function";

  model absres = age;

  output out=sdfunc p=sdfit;

  run;

 

data weights; set sdfunc;

title3 "";

  mywts = 1/(sdfit**2);

 

proc reg data=weights;

  model blood_pressure = age;

  weight mywts;

  output out=newres r=wtres;

run;

 

proc gplot data=newres;

  plot wtres*age;

  run;

 

 

 

Summary of Parameter Estimates and SEs from the OLS and WLS models

 

intercept

SE(intercept)

Slope

SE(Slope)

Unweighted

56.157

3.994

0.580

0.097

WLS (1/age)

56.050

3.279

0.583

0.087

WLS (1/age^2)

55.831

2.781

0.589

0.082

WLS (1/SD^2)

55.566

2.521

0.596

0.079

 

 

What if the responses are correlated?

 

Examples:

1.         Respondents in the same household

2.         Students in the same classroom

3.         Pups in the same litter

4.         Multiple measurements of the same individual [repeated measurements, longitudinal, time series, growth curves]

 

options nocenter formdlim="-";

/* data from Verbeke and Molenberghs 2.5

   age(yrs); dist (center of pituitary to maxillary fissure)

*/

data growth;

  input girl age dist @@;

  datalines;

1 8 21 1 10 20 1 12 21.5 1 14 23

2 8 21 2 10 21.5 2 12 24 2 14 25.5

3 8 20.5 3 10 24.0 3 12 24.5 3 14 26

4 8 23.5 4 10 24.5 4 12 25.0 4 14 26.5

5 8 21.5 5 10 23 4 12 22.5 4 14 23.5

6 8 20 6 10 21 6 12 21 6 14 22.5

7 8 21.5 7 10 22.5 7 12 23.0 7 14 25

8 8 23 8 10 23 8 12 23.5 8 14 24

9 8 20 9 10 21 9 12 22 9 14 21.5

10 8 16.5 10 10 19 10 12 19 10 14 19.5

11 8 24.5 11 10 25 11 12 28 11 14 28

;

proc gplot;

title “Growth data for 11 girls – pituitary to maxillary fissure”;

  plot dist*age=girl;

  run;

 

proc reg data = growth;

title2 "OLS ignoring multiple measurements per girl";

  model dist = age;

  run;

 

proc mixed data=growth;

  title2 "Random intercept for each girl";

  class girl;

  model dist = age / solution;

  random intercept / type=un subject=girl;

  run;

 

proc mixed data=growth;

  title2 "Random intercept and slope for each girl";

  class girl;

  model dist = age / solution;

  random intercept age / type=un subject=girl;

  run;