What is a model?

* a representation, or general "rule of thumb", of a real-life(?) process
* the representation is usually a simplified version
* an abstraction of reality, but based on subject-matter understanding
    [PURPOSEFUL REPRESENTATION OF REALITY]
    [CARICATURE OF REALITY]
    [George Box: All models are wrong but some are useful ...]
* some predictive value
* often some mathematical/functional form
* + random elements incorporated via statistical ideas
* provides a formalization of the description => replication by others / communication
* limited in some ways: 1) assumptions 2) real data
* evaluate model performance => improve it by testing with data
* there may be more than one model that explains the data => compare performance and select between competing models? synthesize estimates over models?
* models vary between disciplines (both in origin and use)
* complexity, cost effectiveness, time constraints ...
* good models add to general understanding - generalizable
* model development is an iterative process

What do we learn from models?

* models have implications - predictions that have implications for management, policy, social, physical, monetary
* understand relationships between variables

Example: Suppose I asked you what is the probability of 2 boys and 2 girls in a family of 4 kids. How would you answer this question?

Possible outcomes?
    BBBB GBBB BGBB ...

Assumptions?
    independence of outcomes within a family
    P(B) = P(G) = 1/2 *****
    single births (no twins, triplets, quadruplets)

Strategies?

1) enumerate outcomes - use a TREE diagram ...
    recognize the probability of each branch (uses the ideas of equal probability + independence)
    accumulate the probabilities for the outcomes of interest
    6/16 satisfy the condition of interest

2) consider some random variable - Y = number of girls in the family, with possible values {0,1,2,3,4}
    Find P(Y=2)
    Is there a probability model for this? maybe ... binomial
    P(Y=2) = 4!/(2! 2!) * (1/2)^2 * (1 - 1/2)^2 = 6/16 = 0.375
    R ...
    dbinom(2,4,prob=1/2)
    [1] 0.375

3) simulation based ...
    i) simulate a family
    ii) do this lots of times
    iii) determine the proportion of simulated families with 2G,2B
    e.g. 14/36 of the simulated families, compared with the exact answer 3/8

pi.hat <- 14/36
B <- 2*sqrt(pi.hat*(1-pi.hat)/36)   # bound on the error of estimation (2 standard errors)
pi.hat
B
# form CI
pi.hat + c(-1,1)*B

# number of simulated families needed to be within 5% or .5% of the true proportion
B05 <- 4*(pi.hat)*(1-pi.hat)/(.05^2)
B005 <- 4*(pi.hat)*(1-pi.hat)/(.005^2)
B05
# [1] 380.2469
B005
# [1] 38024.69

#
# MC problem HW 2 hints
#
BB <- runif(10)
BB
1-BB
max(BB, 1-BB)    # a single number: the largest value over both vectors
pmax(BB, 1-BB)   # elementwise ("parallel") maximum
pmin(BB, 1-BB)   # elementwise ("parallel") minimum
?pmax
ratio <- pmax(BB, 1-BB)/pmin(BB, 1-BB)
ratio
hist(ratio)

BB <- runif(4000)
ratio <- pmax(BB, 1-BB)/pmin(BB, 1-BB)
hist(ratio)

BB <- runif(400)
ratio <- pmax(BB, 1-BB)/pmin(BB, 1-BB)
hist(ratio)
quantile(ratio)
boxplot(ratio)
boxplot(log10(ratio))

# only break between .2 and .8
BB <- runif(4000, min=0.2, max=0.8)
ratio <- pmax(BB, 1-BB)/pmin(BB, 1-BB)
hist(ratio)
history()

# COMMENTS
# BONUS
t <- 50    # tagged in 1st sample
N <- 200   # population size
n <- 40    # size of the second sample
mypop <- rep( c(0, 1), c(N-t, t))   # 0 = untagged, 1 = tagged
mypop
#   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [149] 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
?sample
sample(mypop, size=n)   # draws without replacement by default
#  [1] 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 1 0 0
# [39] 1 0
s.tagged <- sum(sample(mypop, size=n))
s.tagged
# [1] 10
N.hat <- t*n/s.tagged   # capture-recapture estimate of the population size
N.hat

samp1 <- sample(mypop, size=n)
s.tagged <- sum(samp1)   # count the tagged in samp1 itself rather than in a fresh sample
N.hat <- t*n/s.tagged
N.hat
samp2 <- sample(mypop, size=n)
rbind(samp1, samp2)

# results matrix set up
res.mat <- matrix(0, nrow=2000, ncol=n)
res.mat
dim(res.mat)

# filling the results matrix with
# each row corresponding to a different sample (1)
for (irow in 1:2000) res.mat[irow,] <-
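    sample(mypop, size=n)
res.mat[1:5,]

# figuring out number tagged in each sample (2)
?apply
s.tagged <- apply(res.mat, 1, sum)
length(s.tagged)
summary(s.tagged)

# estimating the population size (3)
Nhat <- t*n/s.tagged
hist(Nhat)
mean(Nhat)
summary(Nhat)
sd(Nhat)
B <- 2*sd(Nhat)/sqrt(2000)   # 2 standard errors of mean(Nhat): divide by sqrt(2000), not 2000
B

# what about inverse sampling - s=10 fixed
# and n varies
res.mat <- matrix(0, nrow=2000, ncol=N)
for (irow in 1:2000) res.mat[irow,] <- sample(mypop, N)   # each row is a random permutation of the population
# permute.pop is never defined in these notes; assume it is one permuted population
permute.pop <- res.mat[1,]
cumsum(permute.pop)
csum <- cumsum(permute.pop)
# 1:200[csum==10] does not work: indexing binds tighter than ':', so it is read as 1:(200[csum==10])
(1:200)[csum==10]
min((1:200)[csum==10])   # number of draws needed to reach the 10th tagged individual

#
# even more homework comments
#
# standard normal density, written two equivalent ways:
#   1/sqrt(2*pi)*exp(-x*x/2)
#   1/sqrt(2*pi)*exp(-x^2/2)
my.norm <- function(x) (1/sqrt(2*pi))*exp(-x*x/2)
?log
pi
acos(-1)   # also gives pi; the R function is acos, not arcos
my.norm(0)
my.norm(1)
my.norm(-1)
dnorm(0)   # built-in density, should match my.norm(0)
dnorm(1)
?rnorm
exp(1)     # R has no built-in constant e; use exp(1)

xx <- runif(10, min=-2, max=2)
yy <- runif(10, min=0, max=0.4)
xx
yy
# first attempt left out the /2 in the exponent:
#   iunder <- as.numeric(yy <= 1/sqrt(2*pi)*exp(-xx^2))
iunder <- as.numeric(yy <= 1/sqrt(2*pi)*exp(-(xx^2)/2))   # 1 if the point (xx, yy) lies under the normal curve
iunder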
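The xx, yy, iunder lines above look like the start of a hit-or-miss Monte Carlo calculation. Below is a minimal sketch, assuming the goal is to estimate the area under the standard normal curve between -2 and 2. The seed, the number of points n.pts, and the names box.area, p.under and area.hat are illustrative choices, not part of the original notes.

```r
# Hit-or-miss Monte Carlo sketch.
# ASSUMPTION: the target is the area under the N(0,1) density on (-2, 2).
set.seed(1)        # assumption: fixed seed so the run is reproducible
n.pts <- 100000    # assumption: number of random points

my.norm <- function(x) (1/sqrt(2*pi))*exp(-x*x/2)

# uniform points in the box [-2, 2] x [0, 0.4]; the density peaks at about 0.399 < 0.4
xx <- runif(n.pts, min=-2, max=2)
yy <- runif(n.pts, min=0, max=0.4)
iunder <- as.numeric(yy <= my.norm(xx))   # 1 if the point falls under the curve

box.area <- (2 - (-2)) * 0.4              # area of the bounding box
p.under  <- mean(iunder)                  # proportion of points under the curve
area.hat <- box.area * p.under            # estimated area under the curve

# bound on the error of estimation, same 2-standard-error idea used earlier
B <- 2 * box.area * sqrt(p.under*(1-p.under)/n.pts)
area.hat + c(-1,1)*B

# exact value for comparison
pnorm(2) - pnorm(-2)
```

The interval should usually cover pnorm(2) - pnorm(-2), and increasing n.pts shrinks B at the usual 1/sqrt(n.pts) rate.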
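Returning to the family example, strategy 3 (simulate a family, repeat many times, find the proportion with 2 girls and 2 boys) can be sketched as follows. This is one possible version under the stated assumptions of independence, P(B) = P(G) = 1/2 and single births. The seed, n.fam, and the object names kids and n.girls are illustrative, not from the original notes.

```r
# Simulation sketch for P(2 girls and 2 boys) in a family of 4 kids.
set.seed(2)      # assumption: fixed seed so the run is reproducible
n.fam <- 10000   # assumption: number of simulated families

# each row is one family; 1 = girl, 0 = boy, independent with probability 1/2
kids <- matrix(rbinom(n.fam*4, size=1, prob=1/2), nrow=n.fam, ncol=4)

n.girls <- apply(kids, 1, sum)   # number of girls in each simulated family
pi.hat <- mean(n.girls == 2)     # proportion of families with 2G,2B

B <- 2*sqrt(pi.hat*(1-pi.hat)/n.fam)   # bound on the error of estimation
pi.hat + c(-1,1)*B

# exact binomial answer for comparison
dbinom(2, 4, prob=1/2)
```

With n.fam = 10000 the bound B comes out near 0.01, which fits the earlier calculation that roughly 38000 families are needed to get within .005.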