IES 612/STA 4-573/STA 4-576
Spring 2005
Week 05 – IES612-lecture-week05.doc
Model-Building Comments
* Brain-first, computer-second – pick variables for inclusion in the model that are important based on your knowledge of the system. This doesn’t exclude the possibility of exploratory analyses; it simply asks for careful consideration of variables to be considered for inclusion in a model.
Hierarchical Principle: If you have a higher order term, include all lower order constituent terms. For example, “X2” should be in a model if “X” (the linear term) is already present. The interaction term, “X1*X2” should be in the model if “X1” and “X2” are in the model.
Aside: Interactions can be used to allow for the relationship between the response variable “Y” and an “X” variable, say “X1”, to vary with levels of another predictor variable “X2”. We will discuss this in more detail when we consider factorial designs in anova models. SYNERGY = observed response is beyond what would be predicted from the additive effect of the two variables [e.g. lung cancer as a function of smoking and asbestos exposure]. ANTAGONISM = observed response is less than would be predicted from the additive effects of the two variables [e.g. competition for the same binding sites].
Principle of Parsimony: The model should be no more complicated than required to describe the response.
Variable selection methods
i. ALL
POSSIBLE REGRESSIONS (SAS Proc RSQUARE)
Suppose you have 5 possible predictor variables – X1, X2, X3, X4, X5 - then you have 25-1 = 31 possible models
|
#variables present |
X1 |
X2 |
X3 |
X4 |
X5 |
|
1 |
X1 |
|
|
|
|
|
|
X2 |
|
|
|
|
|
|
|
X3 |
|
|
|
|
|
|
|
X4 |
|
|
|
|
|
|
|
X5 |
|
|
2 |
X1 |
X2 |
|
|
|
|
X1 |
|
X3 |
|
|
|
|
X1 |
|
|
X4 |
|
|
|
X1 |
|
|
|
X5 |
|
|
|
X2 |
X3 |
|
|
|
|
|
X2 |
|
X4 |
|
|
|
|
X2 |
|
|
X5 |
|
|
|
|
X3 |
X4 |
|
|
|
|
|
X3 |
|
X5 |
|
|
|
|
|
X4 |
X5 |
|
|
3 |
X1 |
X2 |
X3 |
|
|
|
X1 |
X2 |
|
X4 |
|
|
|
X1 |
X2 |
|
|
X5 |
|
|
X1 |
|
X3 |
X4 |
|
|
|
X1 |
|
X3 |
|
X5 |
|
|
X1 |
|
|
X4 |
X5 |
|
|
|
X2 |
X3 |
X4 |
|
|
|
|
X2 |
X3 |
|
X5 |
|
|
|
X2 |
|
X4 |
X5 |
|
|
|
|
X3 |
X4 |
X5 |
|
|
4 |
X1 |
X2 |
X3 |
X4 |
|
|
X1 |
X2 |
X3 |
|
X5 |
|
|
X1 |
X2 |
|
X4 |
X5 |
|
|
X1 |
|
X3 |
X4 |
X5 |
|
|
|
X2 |
X3 |
X4 |
X5 |
|
|
5 |
X1 |
X2 |
X3 |
X4 |
X5 |
Which model do you
select?
1. Models with smallest incremental increase in R2 or smallest incremental decrease in SSE – look for the “elbow” in these indices.
2. Adjusted R2 = 1 – [(n-1)/(n-p)]*SSE/SST and MSE=SSE/(n-p) adjust for the number of parameters in the model and are frequently used.
3. Cp is an index of model biasedness – best if it is close in value to the number of parameters in the model.
4. PRESS statistic = PREdiction Sum of Square criterion. Similar to SSE however predict each point from the remaining n-1 observations (i.e. delete ith case and then predict).
5. After you select a model, you still need to do all of the adequacy checks as we discussed previously.
Example: Country data set – see country_mreg_varsel_27jan2003.doc
proc rsquare data=country outest=est mse cp sse;
title2 all possible regression;
model lifewom = logarea
logpopn pcturban liter loggnp;
run;
OR
data
country;
title
‘country data analysis’;
infile "\\Casnov5\MST\MSTLab\Baileraj\country.data"; * reads an data file;
input
name $ area popnsize pcturban lang $ liter lifemen
lifewom pcGNP;
logarea = log10(area);
logpopn = log10(popnsize);
loggnp
= log10(pcGNP);
ienglish = (lang="English");
drop area popnsize pcgnp;
proc reg data=country;
model lifewom= logarea logpopn pcturban liter loggnp
/ selection=rsquare;
plot cp.*np. rsq.*np. mse.*np.
run;
[output – Edited]
The
REG Procedure
Model:
MODEL1
Dependent
Variable: lifewom
R-Square
Selection Method
Number
of Observations Read
79
Number
of Observations Used
67
Number
of Observations with Missing Values
12
Number
in
Model
R-Square Variables in Model
1
0.6583 liter
1
0.6299 loggnp
1
0.4643 pcturban
1
0.0161 logpopn
1
0.0146 logarea
---------------------------------------------------------------
2
0.7795 liter loggnp
2
0.7100 pcturban liter
2
0.6611 logpopn liter
2
0.6610 logarea liter
2
0.6449 pcturban loggnp
2
0.6335 logarea loggnp
2
0.6330 logpopn loggnp
2
0.4930 logarea pcturban
2
0.4843 logpopn pcturban
2
0.0183 logarea logpopn
---------------------------------------------------------------
3
0.7813 logarea liter loggnp
3
0.7810 logpopn liter loggnp
3
0.7801 pcturban liter loggnp
3
0.7179 logarea pcturban liter
3
0.7158 logpopn pcturban liter
3
0.6616 logarea logpopn liter
3
0.6523 logarea pcturban loggnp
3
0.6503 logpopn pcturban loggnp
3
0.6339 logarea logpopn loggnp
3
0.4943 logarea logpopn pcturban
---------------------------------------------------------------
4
0.7825 logarea pcturban liter
loggnp
4
0.7820 logpopn pcturban liter
loggnp
4
0.7815 logarea logpopn liter
loggnp
4
0.7183 logarea logpopn pcturban
liter
4
0.6527 logarea logpopn pcturban
loggnp
---------------------------------------------------------------
5 0.7827 logarea logpopn pcturban liter loggnp
Plot of R2 vs. number of parameters in the model

Plot of MSE vs. number of parameters in the model

ii. STEPWISE
and other automatic variable selection methods
Backwards elimination: Start with all predictor variables in a model and eliminate those that are not important (don’t meet a significance level to stay in a model – SLSTAY in SAS).
Forward selection: Start with none of predictor variables in a model and add those that are important (meet a significance level to enter in a model – SLENTER in SAS).
Stepwise (Forward) selection: Start with none of predictor variables in a model and add those that are important (meet a significance level to enter in a model) but also check to see if variables previously added should be allowed to stay.
proc stepwise data=country;
title2 automatic variable selection methods;
model lifewom = pcturban
liter logarea logpopn loggnp /
forward
backward stepwise;
run;
FORWARD RESULTS:
No other
variable met the 0.5000 significance level for entry into the model.
The STEPWISE
Procedure
Model:
MODEL1
Dependent
Variable: lifewom
Summary of
Forward Selection
Variable Number
Partial Model
Step
Entered Vars In R-Square
R-Square C(p) F Value
Pr > F
1
liter 1 0.6583 0.6583
32.9375 125.23 <.0001
2
loggnp 2 0.1212 0.7795 0.9123 35.17
<.0001
3
logarea 3 0.0018 0.7813 2.4144 0.51
0.4774
BACKWARD RESULTS:
All variables left in the model are significant at the 0.1000 level
Summary
of Backward Elimination
Variable Number
Partial Model
Step
Removed Vars In R-Square
R-Square C(p) F Value
Pr > F
1
logpopn 4
0.0002 0.7825 4.0613 0.06
0.8053
2
pcturban 3 0.0013 0.7813 2.4144 0.36
0.5515
3
logarea 2 0.0018 0.7795 0.9123 0.51
0.4774
All variables
left in the model are significant at the 0.1500 level.
No other
variable met the 0.1500 significance level for entry into the model.
Alternatively, can also do this in SAS
PROC REG by specifying other “selection” methods.
proc reg data=country;
model lifewom= logarea logpopn pcturban liter loggnp
/ selection=backward;
What if the response is not linearly related to the regression parameters?
Examples:
1. Exponential forms - one compartment models+ (e.g. toxicokinetics)
2. Periodic forms - cyclic responses (e.g. time series)
3. Saturable forms – constrained growth (e.g. logistic growth models)
Strategy?
1. Transformation – is the model intrinsically linear? For example, the allometric (power) model, Y = aXb can be linearized, log(Y) = log(a) + b log(X) => log(Y) = a* + b log(X).
2. Fit a nonlinear model Y = g(X1, X2, … Xp; b1, … b p) + e
[SAS Proc NLIN]. Still use least squares to obtain estimates; however, may be tougher to fit. Starting values are important (may not converge if start with crummy values).
EXAMPLE: SAS code to fit Michaelis-Menten curve – uses gradientsoptions nocenter formdlim="-"; data puromycn; input conc vel @@; datalines; 0.02 76 0.02 47 0.06 97 0.06 107 0.11 123 0.11 139 0.22 159 0.22 152 0.56 191 0.56 201 1.10 207 1.10 200 proc nlin method=marquardt; parameters b1 = 195.80271 b2 = 0.04841 ; model vel = b1*conc/(b2 + conc) ; der.b1 = conc/(b2 + conc); der.b2 = -b1*conc/((b2 + conc)**2); output out=mmout predicted=vel_hat residual=vel_res; run; proc gplot; plot vel*conc vel_hat*conc/overlay; plot vel_res*vel_hat / vref=0; run;
The
NLIN Procedure
Dependent
Variable vel
Method:
Marquardt
Iterative Phase
Sum of
Iter
b1 b2 Squares
0
195.8 0.0484 1920.6
1
210.9 0.0614 1207.9
2
212.5 0.0638 1195.6
3
212.7 0.0641 1195.5
4
212.7 0.0641 1195.4
5
212.7 0.0641 1195.4
NOTE:
Convergence criterion met.
Estimation Summary
Method Marquardt
Iterations 5
R 9.865E-6
PPC(b2) 4.029E-6
RPC(b2) 0.000042
Object 1.149E-8
Objective 1195.449
Observations
Read 12
Observations
Used 12
Observations
Missing 0
------------------------------------------------------------------------------------------------
The
NLIN Procedure
NOTE:
An intercept was not specified for this model.
Sum of Mean Approx
Source DF Squares Square
F Value Pr > F
Model 2 270214 135107
1130.18 <.0001
Error 10 1195.4 119.5
Uncorrected
Total 12 271409
Approx
Parameter Estimate Std Error
Approximate 95% Confidence Limits
b1 212.7 6.9471 197.2 228.2
b2 0.0641 0.00828 0.0457 0.0826
Approximate Correlation Matrix
b1 b2
b1 1.0000000 0.7650834
b2 0.7650834 1.0000000
Plot of observed data with superimposed fit

Plot of residuals versus predicted values.

EXAMPLE: continuous three-segment model to Pond Water Level Data;
data dpond;
input aquifer pond @@;
datalines;
41.54 54.267 43.60
54.227 42.92 54.207
42.45 54.217 42.61
54.257 43.01 54.347
43.88 54.267 43.31
54.21 41.87 54.207
41.52 54.237 41.29
54.217 44.53 54.257
44.22 54.247 44.03
54.227 43.45 54.217
42.85 54.227 42.25
54.197 41.64 54.177
40.63 54.197 42.21
54.257 42.63 54.247
42.37 54.237 41.92
54.227 41.38 54.227
41.09 54.227 40.80
54.237 40.64 54.257
41.54 54.247 41.61
54.237 43.60 54.267
43.99 54.267 43.78
54.247 43.58 54.267
43.13 54.227 43.17
54.247 42.75 54.247
42.37 54.237 41.95
54.207 41.49 54.187
40.90 54.157
40.38 54.087 39.99
54.007
39.93 53.987 39.74
53.957 39.41 53.827
40.20 53.957 40.79
53.987 39.71 53.897
39.28 53.807 38.51
53.627 38.06 53.487
36.58 53.527 36.58
53.487 36.58 53.487
36.56 53.487 36.71
53.487
36.98 53.487 37.18
53.487 38.15 53.607
;
/*
E(Y) = b0 [aquifer <=
tau1 ]
= b0 + b1*(aquifer-tau1)
[tau1 < aquifer <= tau2 ]
= b0 + b1*(tau2-tau1) + b3*(aquifer-tau2)
[aquifer
> tau2 ]
*/
proc nlin data=dpond method=dud;
parameters b0 = 53.5
b1 = 0.1641
b3 = 0.0114
tau1 = 38.0
tau2 = 41.0 ;
if aquifer <= tau1 then do; *left segment;
model pond = b0;
end;
else if tau1 < aquifer <= tau2 then do; * middle segment;
model pond = b0 + b1*(aquifer-tau1) ;
end;
else do;
*right segment;
model pond = b0 + b1*(tau2-tau1)+b3*(aquifer-tau2);
end;
output out=pondout predicted=pond_hat residual=pond_res;
run;
proc gplot data=pondout;
plot pond*aquifer pond_hat*aquifer/overlay;
plot
pond_res*pond_hat / vref=0;
run;
The
NLIN Procedure
Dependent
Variable pond
Method:
Gauss-Newton
Iterative Phase
Sum of
Iter
b0 b1 b3 tau1 tau2
Squares
0
53.5000 0.1641 0.0114
38.0000 41.0000 2.3237
1
53.4927 0.2335 0.0113
37.8476 40.9986 0.1064
2
53.4927 0.2335 0.0113
37.8929 40.9990 0.1006
NOTE:
Convergence criterion met.
------------------------------------------------------------------------------------------------
The
NLIN Procedure
Sum of Mean Approx
Source DF Squares Square
F Value Pr > F
Model 4 4.3020 1.0755
577.07 <.0001
Error 54 0.1006
0.00186
Corrected
Total 58 4.4026
Approx
Parameter Estimate Std Error
Approximate 95% Confidence Limits
b0 53.4927 0.0163 53.4600
53.5254
b1 0.2335 0.0120 0.2094 0.2575
b3 0.0113 0.00760
-0.00397 0.0265
tau1 37.8929 0.1296 37.6329
38.1528
tau2 40.9990 0.1032 40.7921
41.2060
Approximate
Correlation Matrix
b0 b1 b3 tau1 tau2
b0 1.0000000 -0.0000000 0.0000000 0.5390895 0.0000000
b1 -0.0000000 1.0000000 0.0000000 0.7630448 -0.6160398
b3 0.0000000 0.0000000 1.0000000 0.0000000 -0.5483407
tau1 0.5390895 0.7630448 0.0000000 1.0000000 -0.3022819
tau2 0.0000000 -0.6160398 -0.5483407 -0.3022819 1.0000000

What if the response is not normal?
Examples:
1. Dichotomous response - Survive/Die; Possess characteristic / don’t [binomial]
2. Count of events - number of young produced in 3 broods [Poisson, neg. binomial]
3. Event when first response occurs - number of cycles until conception [geometric]
4. Non-negative response – survival time [weibull, log-normal]
Strategy?
1. Transformation – can the response be normalized? For example, the log transformation can help normalize right skewed data, square root transformations are often recommended for count data and arcsin square root / angular transformations are often recommended for dichotomous data.
2. Fit a non-normal response model Y = g(X1, X2, … Xp; b1, … b p) + e
[SAS Proc GENMOD or LOGISTIC or PROBIT or …]
A generalized linear model has a RESPONSE Variable with a distribution that is a member of the exponential family of distributions (e.g. Normal, Binomial, Poisson) AND has a LINK function that is linearly related to the predictor variables. LOGISTIC Regresion and PROBIT regression are both models for dichotomous response (assumes binomial response).
Example of SAS
statements used to compare the
concentration-mortality curves for
fathead minnows exposed to pentachlorobenzene or dichlorobenzene.
Ref:
Oris, J.T. and Bailer, A.J. (1997) Equivalence of concentration-response
distributions in aquatic toxicology: testing and implications for potency
estimation. Environmental Toxicology and
Chemistry 16: 2204-2209
Variables:
chem =
chemical tested where PCBZ is pentachlorobenzene and DCBZ is
dichlorobenzene
conc =
concentration of compound in micromoles per liter (measured)
mort =
number of organisms dead after 96 hours exposure to respective chemical
n = number of organisms on test
at each concentration
NOTES:
Create
an indicator variable 'iPCBZ' = 0 for DCBZ and =1 for PCBZ.
Variable 'iPCBZ' is used to generate beta2 (additional intercept
parameter PCBZ).
Variable 'ic' is used to generate beta3 (additional slope parameter for
PCBZ).
============================================================ */
data crcomp;
input chem $ conc mort n;
iPCBZ = (chem = 'PCBZ');
ic = iPCBZ*conc;
obs_resp = mort/n;
datalines;
PCBZ 0 13 200
PCBZ 23.3 35 192
PCBZ 34 54 203
PCBZ 40.8 79 164
PCBZ 50.6 126 162
PCBZ 75.2 204 207
DBCZ 0 14 200
DBCZ 262 55 197
DBCZ 376 80 151
DBCZ 650 210 219
;
proc gplot data=crcomp;
plot obs_resp *conc = chem;
run;
proc logist data=crcomp;
model mort/n=conc iPCBZ ic;
test iPCBZ=ic=0; * same dose-response relationship;
run;
proc genmod data=crcomp;
class chem;
model mort/n = conc chem conc*chem/d=binomial
link=logit
type1 lrci;
run;
proc sort data=crcomp; by chem;
proc probit data=crcomp; by chem;
model mort/n = conc / d=logistic inversecl;
title1 'LCx estimation by chemical';
run;

The LOGISTIC Procedure
Model Information
Data
Set WORK.CRCOMP
Response
Variable (Events) mort
Response
Variable (Trials) n
Model binary logit
Optimization
Technique Fisher's scoring
Number
of Observations Read 10
Number
of Observations Used 10
Sum
of Frequencies Read 1895
Sum
of Frequencies Used 1895
Response Profile
Ordered
Binary Total
Value
Outcome Frequency
1
Event 870
2
Nonevent 1025
Model Convergence Status
Convergence criterion (GCONV=1E-8)
satisfied.
------------------------------------------------------------------------------------------------
The
LOGISTIC Procedure
Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 2616.336 1603.083
SC 2621.883 1625.271
-2
Log L 2614.336 1595.083
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF
Pr > ChiSq
Likelihood
Ratio 1019.2525 3 <.0001
Score 824.7519 3 <.0001
Wald 449.3479 3 <.0001
Analysis of Maximum Likelihood
Estimates
Standard Wald
Parameter DF
Estimate Error Chi-Square Pr > ChiSq
Intercept 1
-3.1330 0.2339 179.3568 <.0001
conc 1
0.00895 0.000634 199.2060 <.0001
iPCBZ 1
-0.6071 0.3340 3.3046 0.0691
ic 1 0.0840
0.00591 201.8771 <.0001
---------------------------------------------------------------------------------------------
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
conc 1.009 1.008 1.010
iPCBZ 0.545 0.283 1.049
ic 1.088 1.075 1.100
Association
of Predicted Probabilities and Observed Responses
Percent
Concordant 85.5 Somers' D
0.762
Percent
Discordant 9.3 Gamma
0.803
Percent
Tied 5.1 Tau-a
0.379
Pairs 891750 c
0.881
Linear Hypotheses Testing Results
Wald
Label
Chi-Square DF Pr > ChiSq
Test 1
365.9918 2 <.0001 /*----- same intercept and slope? -----*/
-------------------------------------------------------------------------------------------------
The GENMOD Procedure
Model Information
Data
Set WORK.CRCOMP
Distribution Binomial
Link
Function Logit
Response
Variable (Events) mort
Response
Variable (Trials) n
Number
of Observations Read 10
Number
of Observations Used 10
Number
of Events 870
Number
of Trials 1895
Class Level Information
Class Levels
Values
chem 2
DBCZ PCBZ
Criteria For Assessing Goodness Of
Fit
Criterion DF Value Value/DF
Deviance 6 32.5947 5.4324
Scaled
Deviance 6 32.5947 5.4324
Pearson
Chi-Square 6 36.5932 6.0989
Scaled
Pearson X2 6 36.5932 6.0989
Log
Likelihood -797.5415
-------------------------------------------------------------------------------------------------
The
GENMOD Procedure
Algorithm
converged.
Analysis Of
Parameter Estimates
Standard Likelihood Ratio
95% Chi-
Parameter DF Estimate Error Confidence Limits Square
Pr > ChiSq
Intercept 1 -3.7401 0.2383
-4.2267 -3.2912 246.23 <.0001
conc 1 0.0929 0.0059 0.0819 0.1050
250.04 <.0001
chem DBCZ 1
0.6071 0.3340 -0.0494 1.2631 3.30 0.0691
chem PCBZ 0
0.0000 0.0000 0.0000 0.0000 . .
conc*chem DBCZ
1 -0.0840 0.0059
-0.0961 -0.0729 201.88 <.0001
conc*chem PCBZ
0 0.0000 0.0000 0.0000 0.0000 . .
Scale 0 1.0000 0.0000 1.0000 1.0000
NOTE:
The scale parameter was held fixed.
LR Statistics For Type 1 Analysis
Chi-
Source Deviance DF
Square Pr > ChiSq
Intercept 1051.8472
conc 770.9489 1
280.90 <.0001
chem 448.7573 1
322.19 <.0001
conc*chem 32.5947 1
416.16 <.0001
-------------------------------------------------------------------------------------------------
Probit Procedure
chem=DBCZ
Analysis of Parameter
Estimates
Standard 95% Confidence Chi-
Parameter
DF Estimate Error Limits Square Pr > ChiSq
Intercept 1
-3.1330 0.2339 -3.5915
-2.6745 179.36 <.0001
conc 1
0.0089 0.0006 0.0077
0.0102 199.21 <.0001
Probit
Model in Terms of Tolerance Distribution
MU SIGMA
350.247319 111.793148
Estimated Covariance Matrix
for Tolerance Parameters
MU SIGMA
MU 125.437167 9.142389
SIGMA 9.142389 62.737616
Probability conc 95% Fiducial Limits
0.25 227.43 198.62 252.05
0.50 350.25 328.40 372.74
/*----- LC50 for DCBZ -----*/
-------------------------------------------------------------------------------------------------
chem=PCBZ
Probit
Procedure
Analysis of Parameter
Estimates
Standard 95% Confidence Chi-
Parameter
DF Estimate Error Limits Square Pr > ChiSq
Intercept 1
-3.7401 0.2383 -4.2073
-3.2730 246.23 <.0001
conc 1
0.0929 0.0059 0.0814
0.1044 250.04 <.0001
Probit
Model in Terms of Tolerance Distribution
MU
SIGMA
40.2463613 10.7607534
Estimated Covariance Matrix
for Tolerance Parameters
MU SIGMA
MU 0.742593 0.085862
SIGMA 0.085862 0.463092
Probability conc 95% Fiducial Limits
0.25 28.424 26.182 30.360
0.50 40.246 38.575 41.980 /*----- LC50 for PCBZ -----*/
What if the response variability is not constant?
Example:
1. Often response variability is observed to increase with mean response.
Strategy?
1. Transformation – can the response variability be stabilized? For example, the log transformation can help as can often stabilize the variance along with helping to symmetrize the response distribution.
2. Use weighted least squares – [SAS Proc REG with WEIGHT option]
Weighted Least squares – minimize
where wi = 1/var(Yi)
[Assumes that weights are known; an approximation if not.]
options
nocenter formdlim="-";
/*
data from Neter et al. ALSM Ch. 10, Table 10.1 */
data
nknw10p1;
input age blood_pressure @@;
wtage = 1/age;
wtage2 = 1/(age**2);
datalines;
27
73 21 66
22 63 24
75 25 71
23 70
20
65 20 70
29 79 24
72 25 68
28 67
26 79 38
91 32 76
33 69 31
66 34 73
37
78 38 87
33 76 35
79 30 73
31 80
37
68 39 75
46 89 49
101 40 70
42 72
43
80 46 83
43 75 44
71 46 80
47 96
45
92 49 80
48 70 40
90 42 85
55 76
54
71 57 99
52 86 53
79 56 92
52 85
50
71 59 90
50 91 52
100 58 80
57 109
;
proc reg
data=nknw10p1;
title
"Unweighted LS: BP = age";
model blood_pressure = age;
plot r.*age;
output out=bpres r=resid;
run;
/*---- Weighted analysis 1 ----*/
proc reg
data=nknw10p1;
title
"Weighted LS BP = age with WT = 1/age";
title2
"[assumes that V(BP) proportional to age]";
model blood_pressure = age;
weight wtage;
run;
/*---- Weighted analysis 2 ----*/
proc reg
data=nknw10p1;
title
"Weighted LS BP = age with WT = 1/age";
title2
"[assumes that SD(BP) proportional to age]";
model blood_pressure = age;
weight wtage2;
run;
/*---- Weighted analysis 3 ----*/
/*
(as suggested in Neter et al.) */
data
newblood; set bpres;
title
"Weighted LS BP = age with WT = 1/(est SD^2)";
title2
"[assumes that SD=b0+b1*age]";
absres = abs(resid);
proc reg
data=newblood;
title3
"Regress abs(resid) vs. age to obtain SD function";
model absres = age;
output out=sdfunc p=sdfit;
run;
data
weights; set sdfunc;
title3
"";
mywts = 1/(sdfit**2);
proc reg
data=weights;
model blood_pressure = age;
weight mywts;
output
out=newres r=wtres;
run;
proc gplot
data=newres;
plot wtres*age;
run;

Summary
of Parameter Estimates and SEs from the OLS and WLS models
|
|
intercept |
SE(intercept) |
Slope |
SE(Slope) |
|
Unweighted |
56.157 |
3.994 |
0.580 |
0.097 |
|
WLS (1/age) |
56.050 |
3.279 |
0.583 |
0.087 |
|
WLS (1/age^2) |
55.831 |
2.781 |
0.589 |
0.082 |
|
WLS (1/SD^2) |
55.566 |
2.521 |
0.596 |
0.079 |
What if the responses are correlated?
Examples:
1. Respondents in the same household
2. Students in the same classroom
3. Pups in the same litter
4. Multiple measurements of the same individual [repeated measurements, longitudinal, time series, growth curves]
options
nocenter formdlim="-";
/*
data from Verbeke and Molenberghs 2.5
age(yrs); dist (center of pituitary to
maxillary fissure)
*/
data growth;
input girl age dist @@;
datalines;
1
8 21 1 10 20 1 12 21.5 1 14 23
2
8 21 2 10 21.5 2 12 24 2 14 25.5
3
8 20.5 3 10 24.0 3 12 24.5 3 14 26
4
8 23.5 4 10 24.5 4 12 25.0 4 14 26.5
5
8 21.5 5 10 23 4 12 22.5 4 14 23.5
6
8 20 6 10 21 6 12 21 6 14 22.5
7
8 21.5 7 10 22.5 7 12 23.0 7 14 25
8
8 23 8 10 23 8 12 23.5 8 14 24
9
8 20 9 10 21 9 12 22 9 14 21.5
10
8 16.5 10 10 19 10 12 19 10 14 19.5
11
8 24.5 11 10 25 11 12 28 11 14 28
;
proc
gplot;
title
“Growth data for 11 girls – pituitary to maxillary fissure”;
plot dist*age=girl;
run;
proc reg
data = growth;
title2
"OLS ignoring multiple measurements per girl";
model dist = age;
run;
proc mixed
data=growth;
title2 "Random intercept for each
girl";
class girl;
model dist = age / solution;
random intercept / type=un subject=girl;
run;
proc mixed
data=growth;
title2 "Random intercept and slope for
each girl";
class girl;
model dist = age / solution;
random intercept age / type=un subject=girl;
run;
