STA 402/502 - Statistical Programming

Fall 2009

Course (Section)

STA 402/502 (A)

Meeting Time:

8:00-8:50 MWF (plus other make-up times to be arranged in consultation with students)

Meeting Location

106 Bachelor Hall

Prerequisites:

STA 401/501; STA 671; or permission of the instructor. Willingness to work.

Professor:

Dr. John Bailer

E-mail:

baileraj@muohio.edu

URL:

http://www.users.muohio.edu/baileraj

Office (phone)

292 or 122B Bachelor Hall (529-3538)

FAX: 529-1493

Office Hours

10:00 - 11:30 M W F

(other hours by appointment; don't be shy!)

Purpose of Course:

To introduce the use of computers to process and analyze data. Techniques and strategies for managing, manipulating and analyzing data are discussed. Emphasis is on the use of the SAS system. SAS data steps, including infile, input, merge, set, looping structures and conditional execution (if-then), are presented. SAS mathematical, statistical and data functions are discussed, along with macro construction, extensive matrix manipulation and programming (PROC IML), and graphics procedures. Other quantitative programming environments (e.g. R) are considered for constructing specialized statistical analysis functions and graphical displays. Statistical computing topics, such as random number generation, randomization tests and Monte Carlo simulation, will be used to illustrate these programming ideas.

Course Objectives:

Develop programming and computing skills to address data analysis problems using statistical programming tools.

Texts:

Required (and provided):

Bailer AJ. 2010. Statistical Programming in SAS. SAS Institute. Cary, NC ISBN:  978-1-59994-656-6

Recommended (will discuss in class - wait to purchase)
SAS certification guide: SAS Certification Prep Guide: Base Programming for SAS 9. ISBN-10: 159047922X / ISBN-13: 978-1590479223. Approx. 22 chapters with CD.

Other books that you might like ...

Delwiche LD and Slaughter SJ. 2003. The Little SAS Book: A Primer, 3rd edition. SAS Institute. Cary, NC ISBN 1-59047-333-7

Cody R and Pass R. 1995. SAS Programming by Example. SAS Institute. Cary, NC ISBN 1-55544-681-7

Cody R. 2004. SAS Functions by Example. SAS Institute. Cary, NC ISBN 1-59047-378-7 (lots of great examples; worth browsing to see what you can do with functions in SAS)

 

[BM] Braun WJ and Murdoch DJ  2007. A First Course in Statistical Programming with R. Cambridge University Press. Cambridge, UK  ISBN 978-0-521-69424-7

Belief and Style: 

You learn programming by doing it. Actually, you tend to learn a lot more from failing and fixing code than from getting it right the first time. So, you will get the most out of this class by trying the various code “displays” and the suggested exercises. This class does not follow a simple linear trajectory in which topics are not used until they are fully defined and developed. In programming, you may find that you need functions, procedures, etc. long before you have learned about them in any formal way. Thus, I unapologetically, brashly and frustratingly use ideas that might be formally defined in later discussions if it helps tell a more interesting programming story earlier. In addition, many of the homework problems may require that you dig for additional programming information in order to complete the assignment successfully. Problems will be assigned early in the discussion, and I will then serve as a “consultant” over the following class periods as you work on the assignments.

Other resources:

SAS docs

http://support.sas.com/documentation

SAS www.sas.com
R www.r-project.org

Grading:

Homework and projects will contribute to the final grade. Homework will contribute 80% of the grade, while a mid-term project report and a final project report will together contribute the remaining 20%. Homework will be typed on a computer with appropriate output included and annotated. It is expected that programs will be internally documented with adequate amounts of commenting. Homework hint: start early! Programming projects always take longer than you estimate.

Expectations for independent work:

I expect you to struggle with implementing solutions to homework problems. During this struggle, you will ask me questions, and you will talk to your classmates about sticky programming issues. Talking about programming projects is an opportunity to learn. Helping others with debugging and coding is useful. HOWEVER, copying code and changing variable names is not equivalent to struggling with the work. Don’t cheat. Don’t plagiarize. When you are caught, your academic career could be ruined.

* STA 502 Project: Students enrolled in STA 502 will be required to complete an additional project, and grades will be separately assigned in 402 and 502. This extra project will involve one of the following: 1) an additional simulation study in which the impact of violating (at least one) assumption underlying a statistical inference procedure is investigated; 2) a large-scale data management project; or 3) a description of a statistical method or idea not discussed in class but implemented in SAS (e.g. power and sample size planning, MCMC and Bayesian analyses, incorporating survey sampling weights in an analysis). A written report detailing this project is due Nov. 20. Feel free to discuss possible projects with other faculty or me.

* Homework must be in my mailbox by 4 p.m. on the assigned due date in order to be considered.

Calendar:

MU Calendar

Course Outline (rough guide to 402/502)

# weeks

Tentative topics (associated with estimated # weeks!)

1

BASIC CONCEPTS (Ch. 1)

* Review basic concepts of statistical computing and research data management
* Introduce SAS data sets
* Review the form of SAS Statements and SAS names
* Introduce SAS procedures
* Review the structure of SAS programs
* Describe SAS data libraries and what they can contain
* Show documenting SAS programs using comments
* Illustrate running SAS programs and basic debugging

1-2

Constructing a data set for analysis:  reading, combining, managing and manipulating data sets (Ch. 2)

1.    Temporary versus permanent status of data sets – LIBNAME
2.    Reading data into a SAS data set
2.1  Reading data directly as part of a program – anyone for datalines?
2.2  Reading data sets that were saved as text – infile can be your friend
2.3  Sometimes variables are in particular columns or in particular formats
2.4  Reading comma-separated values – text files with “,” delimiters
2.5  Reading Excel spreadsheets directly
2.6  Reading SPSS data files – the little (SPSS) engine that could
3.     Writing out a file or constructing a simple report
4.     Concatenating data sets/adding observations – SET
5.     Merging data sets/adding variables – MERGE
6.     Database processing with PROC SQL

 

1-2

Using SAS procedures (Ch. 3)

1    SAS system options – options
2    Statements that can modify output of most procedures – TITLE, LABEL, FORMAT
3    Defining your own formats for variable values
4    Selecting or stratifying an analysis by values of a variable – WHERE, BY, SORT
5    Displaying data set contents and properties – PRINT and CONTENTS
6    PROC PRINT for listing the observations in a data set
7    Basic graphical displays
8    Using Scatter plots to display relationships between numeric variables
9    Summarizing categorical variables - FREQ
10  Summarizing numeric variables – UNIVARIATE, MEANS
11  Selecting a simple random sample - SURVEYSELECT
12  Randomly assigning treatments to observations - PLAN

 

1

Complex table construction and output control (i.e., “pretty” output) (Ch. 4)

1.  PROC TABULATE
2. Building from simple specifications:  nitrofen data
3. Enhancing PROC TABULATE output
4.  Output Delivery System (ODS)
4.1 Basic Ideas
4.2 Favorite Destinations – RTF, HTML and even PDF
4.3 What’s produced (i.e. output objects) and how to select them
4.4 Another Destination that Stat Programmers should visit – OUTPUT

1

Basic models in SAS (Ch. 5)

1.  Overview of modeling
2.   Linear Regression models – REG, GLM
2.1  Example 1:  Motorboats and manatees – a look at simple linear regression
2.2 Example 2:  Big brains and big bodies – specifying and fitting a multiple regression model
3.     ANOVA models – GLM for one-way anova
3.1 Example 3:  Rotting meat – package comparisons tested with one-way anova model
4.     ANOVA models – GLM for anova model with two or more factors

1-2

Producing Statistical Graphics (Ch. 6)

1.  Old School (device-based) / New School (template-based) SAS graphics
2.   ODS stat graphics (New School)
3.   Modifying graphics using the statistical graphics editor (New School)
4.   Graphing with style (and templates) (New School)
5.   Statistical Graphics – entering the land of SG* (New School)
                  5.1  Case Study using SG* graphics
6.   Back to (old) school (graphics)
7.   Customizing graphics (Old School)
8.   Why you need to learn about annotate data sets (Old School)
9.   Case study:  comparing distributions of responses (Old School)
10. Descriptive displays of spatial data (Old School)

 

1-2

Formatting, basic DATA step manipulations and programming (Ch. 7)

1.  Internal representations and output displays
2.  Character, numeric and date formats
3.  Recoding and transforming variables in a DATA step
4.  Ordering how tasks are done – precedence of operations
5.  What goes and what stays in a data set – DROP, KEEP, IF, WHERE, OUTPUT
6. Structured thinking about writing programs – pseudo-code and modules
7. CASE STUDY 6.1:  Is the two-sample t-test robust to heterogeneous variances?
8. CASE STUDY 6.2:  Monte Carlo integration to estimate Pr(0<Z<1.645) for Z~N(0,1)
9. CASE STUDY 6.3:  Simple percentile-based bootstrap
10. Throw out your tables of statistical distributions – CDF, PDF, QUANTILE
11. Generating variables using random number generators – RAND
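
The Monte Carlo integration case study listed above can be sketched in a few lines. The sketch below is written in Python purely for illustration (the course itself works in SAS and R), and it is not the textbook's code: the function names, the seed and the number of draws are all assumptions. It averages the standard normal density at uniform draws on (0, 1.645) and scales by the interval width.

```python
import math
import random


def mc_normal_prob(upper=1.645, n_draws=200_000, seed=402):
    """Estimate Pr(0 < Z < upper) for Z ~ N(0, 1) by Monte Carlo integration.

    Hypothetical helper for illustration: draws U ~ Uniform(0, upper),
    averages the N(0, 1) density at those draws, and multiplies by the
    interval width.
    """
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    total = 0.0
    for _ in range(n_draws):
        u = rng.uniform(0.0, upper)
        total += math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return upper * total / n_draws


def exact_normal_prob(upper=1.645):
    """Closed-form value via the error function, used only as a check."""
    return 0.5 * math.erf(upper / math.sqrt(2.0))


if __name__ == "__main__":
    print("Monte Carlo estimate:", mc_normal_prob())
    print("Exact value:         ", exact_normal_prob())
```

In SAS the same idea would typically be a DATA step loop that uses the RAND and PDF functions from the last two items of the outline above.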

 

1-2

Programming in a DATA step (Ch. 8)

1. Storage bins for collections of values - ARRAYS
1.1 Example 1: Defining values in ARRAY variable list directly
1.2 Example 2: Inputting values in ARRAY variable list
1.3 Example 3: Changing missing value codes for numeric variables to “.”
1.4 Example 4: Recoding missing values for all numeric and character variables
1.5 Example 5: Creating multiple observations from a single record
2. Case Study 1: Monte Carlo P-value for test of spatial randomness
3. Remembering variable values across observations – RETAIN
3.1 Example 6: Processing multiple observations for an individual
4. Case Study 2: Randomization test for the equality of two populations
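
As a rough idea of what a randomization test for the equality of two populations involves, here is a minimal sketch. It is written in Python only for illustration (the course implements this in a SAS DATA step), and it makes several assumptions that are not from the course materials: the test statistic is the absolute difference in sample means, the reference distribution is approximated by random reshuffles rather than full enumeration, and the function name and data are made up.

```python
import random


def randomization_test(x, y, n_perm=5000, seed=402):
    """Approximate two-sided randomization p-value for a difference in means.

    Hypothetical helper for illustration: group labels are reshuffled
    n_perm times and the observed statistic is compared with the
    reshuffled ones.
    """
    rng = random.Random(seed)  # fixed seed (arbitrary) for repeatability

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_x]) - mean(pooled[n_x:]))
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Count the observed arrangement itself so the p-value is never 0.
    return (at_least_as_extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Made-up data, for illustration only.
    group_a = [1, 2, 3, 4, 5, 6]
    group_b = [11, 12, 13, 14, 15, 16]
    print("p-value:", randomization_test(group_a, group_b))
```

A small p-value says that few relabelings of the pooled data produce a difference as large as the one observed.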

1

MACRO programming (Ch. 9)

0.   What is a macro and why would you use it?
1.   Motivation for Macros:  numerical integration to determine P(0<Z<1.645)
2.   Macro processing
3.   Macro variables
4.   Conditional execution, looping and macro programs
5.   Debugging macro coding and programming
6.   Saving macros - %include, autocall and stored compiled macros
7.   Functions/routines of potential interest to macro programmers - %index, %length, %eval, symput, symget
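
The numerical integration problem that motivates macros in item 1 can be sketched as follows. This is Python used only for illustration, not SAS macro code and not the textbook's approach: the trapezoid rule, the function names and the default number of intervals are assumptions. The parameter n plays the role a macro variable might play in SAS, letting the same code be rerun with a different number of intervals.

```python
import math


def std_normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def trapezoid(f, a, b, n=1000):
    """Composite trapezoid rule on [a, b] with n equal-width intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))  # endpoints get half weight
    for i in range(1, n):
        total += f(a + i * h)    # interior points get full weight
    return h * total


if __name__ == "__main__":
    # Rerun with different n to see how the approximation settles down.
    for n in (10, 100, 1000):
        print(n, trapezoid(std_normal_pdf, 0.0, 1.645, n))
```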

 

1-2

Programming with matrices and vectors – IML (Ch. 10)

1:   Basic matrix definition + subscripting
2:   Diagonal matrices and stacking matrices
3:   Repeating, Element-wise operations and Matrix Multiplication
4:   Importing SAS data sets into IML and exporting matrices from IML to data set
4.1: Creating matrices from SAS data sets and vice versa
5:  CASE STUDY 1:  Monte Carlo integration to estimate π
6:  CASE STUDY 2:  Bisection root finder
7:  CASE STUDY 3:  Randomization test using matrices imported from PLAN
8:  CASE STUDY 4:  IML module to implement Monte Carlo integration to estimate π
8.1: Storing and loading IML modules
9:   SAS/IML Studio
9.1 CASE STUDY 1:  Dynamic and interactive analysis of the SMSA country data set
9.2: CASE STUDY 2:  Multiple-linked graphics windows
9.3 CASE STUDY 3:  IML matrix manipulations and invocations of SAS/Stat procedures
9.4  CASE STUDY 4:  Calling R library to generate bootstrap confidence intervals for mean MPG
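
The bisection root finder named in Case Study 2 of this chapter follows a simple recipe: keep an interval on which the function changes sign and halve it repeatedly. The sketch below is in Python for illustration only (the course writes this in PROC IML), and the function name, tolerance and iteration cap are assumptions rather than anything taken from the textbook.

```python
def bisect(f, lo, hi, tol=1e-8, max_iter=200):
    """Find a root of f in [lo, hi] by bisection.

    Hypothetical helper for illustration: requires f(lo) and f(hi) to
    have opposite signs (or one of them to be exactly zero).
    """
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        # Stop when the midpoint is a root or the half-width is small enough.
        if f_mid == 0 or (hi - lo) / 2.0 < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid                # sign change is in the left half
        else:
            lo, f_lo = mid, f_mid   # sign change is in the right half
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Example: the positive root of x^2 - 2 is sqrt(2).
    print(bisect(lambda x: x * x - 2.0, 0.0, 2.0))
```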

 

2-5

TOPICS IN STATISTICAL PROGRAMMING (varies)

* Introduction to quantitative programming in R (objects: vectors, lists, matrices, data frames; reading data [scan, read.table, sas.get]; summarizing data sets [mean, var, summary, table]; graphical displays [plot, pairs, coplot]; writing functions)
* Intro., packages & GUI (Rcmdr)
* Data structures
* Basic graphics
* Programming (flow control, functions, etc.)
* Simulation
* Other topics?

FAQs

  1. Where can I run SAS on campus? A: Various libraries and labs may have SAS. The RedHawk cluster (redhawk.hpc.muohio.edu) has SAS. I will request RedHawk accounts for all 402/502 students. Dave Woods of IT Services (woodsdm2@muohio.edu) will lead a class session on running SAS on the cluster.
  2. Can I get SAS on my personal computer? A: Yes, assuming you have a Windows machine or can run Windows on your Mac (via VMware Fusion or Parallels Desktop). You can purchase SAS on disk from the bookstore for $40.
  3. How do I download R? A: Go to www.r-project.org and follow the downloads link from a CRAN Mirror near us (e.g. Statlib at CMU). You can download Linux, MacOS X or Windows precompiled binary distributions of the base system and contributed packages from the mirrors.
  4. Can I get formal certification in SAS? A: Yes. There are different levels of certification (e.g. Base, Advanced, etc.) and students can take these exams for $90 (half price). If you work through two chapters of the certification guide each week, then you will be ready to take the base exam by the end of the semester.
  5. When should I join professional societies? A: Now! You can join ASA for $10 or IBS/ENAR for $27.