Stat680  Exercises  Fall 2015

This page is under construction. More problems will be added in the future.

  1. Practice: starting SAS, writing a short SAS program containing a data step and a PROC PRINT step, reading SAS log.
  2. How many program steps are executed when the program below is processed? Why?
       DATA hw1.listing;
         INFILE jobs;
         INPUT date name $ type : $10.;
       RUN;
       PROC SORT DATA=hw1.listing;  BY name;
       RUN;
       TITLE  'Homework 1';  TITLE2 'Sorted Listing';
       PROC PRINT DATA=hw1.listing;
       RUN;
            
  3. Beware of OBS= and FIRSTOBS= system options: What observations are printed in each PROC PRINT step? Why?
       TITLE 'Beware of OBS= and FIRSTOBS=';
    
       DATA one;
         INPUT AccountNumber amount @@;
         DATALINES;
         5 120  6 153    2 97  4 105   3
         213  8 65     7   117 1 155
       ;
    
       PROC SORT DATA=one OUT=two;    BY AccountNumber;
       RUN;
    
       TITLE2 '--> Two <--';
       TITLE3 '=== Part A ===';
       PROC PRINT DATA=two;   RUN;
    
       OPTIONS OBS=7 FIRSTOBS=3;
    
       TITLE3 '=== Part B ===';
       PROC PRINT DATA=two;   RUN;
    
       PROC SORT DATA=one OUT=three;    BY AccountNumber;
       RUN;
    
       TITLE2 '--> Three <--';
       TITLE3 '=== Part A ===';
       PROC PRINT DATA=three;   RUN;
    
       TITLE3 '=== Part B ===';
       PROC PRINT DATA=three(FIRSTOBS=1);   RUN;
            
  4. What are the lengths and values of x, y, and z in each of the following data sets? Why? (Hint: how does SAS process each step? In addition, the wild card _NUMERIC_ refers to all previously existed numeric variables. The wild card _CHARACTER_ behaves likewise. See class note 03 for the use of RETAIN.)
      DATA d1;
       LENGTH x 3 y $ 2 z 4;
       RETAIN _NUMERIC_ 0 _CHARACTER_ 'xyz';
      DATA d2;
       LENGTH x 3;
       RETAIN _NUMERIC_ 0 _CHARACTER_ 'xyz';
       LENGTH y $ 2 z 4;
      DATA d3;
       RETAIN _NUMERIC_ 0 _CHARACTER_ 'xyz';
       LENGTH x 3 y $ 2 z 4;
            


  5. Read the following raw data (found in ex04.dat which has more rows than those shown below) in an external file into a SAS data set
    ----+----1----+----2----+----3----+----4----+----5----+----6----+
    312445    12/01/4908/23/5604/29/73 Joe von Mises  $41,230
    473941    11/09/5106/21/5903/21/71 John Stevenson  $100,459
    (and more)
          
    where
    Name
    Type
    Start
    Column
    End
    Column
    Q1
    num
    1
    2
    Q2
    num
    3
    4
    Q3
    num
    5
    6
    Date1
    num
    11
    18
    Date2
    num
    19
    26
    Date3
    num
    27
    34
    FirstName
    char
    36
    ends with
    a blank
    LastName
    char

    ends with
    two blanks
    Salary
    num

    followed by
    end-of-record
    marker
    In your INPUT statement, use variable list and informat list.
  6. Create a SAS data set from the raw data file clinical.dat with the following variable descriptions.
    Variable
    Description
    Starting
    Column
    Length
    Format
    Codes
    ID
    Patient ID
    1
    3
    char

    Gender
    Gender
    4
    1
    char
    M, F
    DoB
    Date of Birth
    5
    6
    mmddyy

    DoV
    Date of Visit
    11
    6
    mmddyy

    pdx
    Primary DX Code
    17
    2
    char
    see list
    below
    sdx
    Secondary DX Code
    19
    2
    char
    see list
    below
    hr
    Heart Rate
    21
    3
    num

    sbp
    Systolic Blood Pressure
    24
    3
    num

    dbp
    Diastolic Blood Pressure
    27
    3
    num

    vitamin
    Patient Taking Vitamin?
    30
    1
    char
    1=yes
    0=no
    preg
    Patient Pregnant?
    31
    1
    char
    1=yes
    0=no


    DX Code
    Symptom

    DX Code
    Symptom
    01
    Cold

    05
    Abdominal Pain
    02
    Flu

    06
    Heart Problem
    03
    Routine Physical

    07
    Lung Disorder
    04
    Ear Ache



    Create appropriate permanent formats for selected variables in the data step. Associate selected variables with suitable labels in data step. Write a comprehensive listing of the data with patients sorted according to their ID's.
  7. Using the SAS data set you've created in the previous problem, create a new SAS data set named ages containing the variables: Include in this new data set, in additional to the three age variables, ID, Gender, DoB, DoV, and preg. Prepare a data listing with this new data set. Make sure you use DATE system option so that today's date is printed in the first title line. Also prepare a data listing of female patients only showing ID, DoB, DoV, Age1-Age3, and preg. Use sensible titles/footnotes and variable labels.
    Hint: Use
  8. Redo problem 6 using the following SAS data set to create a SAS control data set, which in turn is used to create format for DX code (hint: use PROC FORMAT with CNTLIN= option):
    DATA codes;
    INFILE DATALINES TRUNCOVER;
    INPUT @1 dx $2. @4 description $16.;
    DATALINES;
    01 Cold
    02 Flu
    03 Routine Physical
    04 Ear Ache
    05 Abdominal Pain
    06 Heart Problem
    07 Lung Disorder
    ;
    Hint: Use FMTLIB option (in PROC FORMAT) to check your answer.
  9. Use the SAS data step in problem 6 with the raw data set replaced by clinical2.dat and create a new data set named clinical2, and two SAS data sets created by the data steps below to create a new SAS data set called study.
    DATA treadmill;
    INPUT ID : $3. minute @@;
    DATALINES;
    159 10 594 12 547 15 315 12 953 13
    ;
    DATA bodyfat;
    INPUT ID : $3. BodyFat @@;
    DATALINES;
    315 14 159 23 647 20 953 24 547 30 575 21
    ;
    Some patients from clinical2 entered two studies, one on treadmill and one on bodyfat. Combining all data sets, keeping only patients entered in both studies, to create study
  10. The following SAS data set read variables Age, SBP, and DBP as character.
    data a;
    INPUT Age $ SBP $ DBP $ @@;
    datalines;
    23 121 70 35 134 72
    49 . 68 30 156 78
    52 132 98 ;
    Create a new SAS data set, called b that read from data set a so that all variables are of correct type, numeric. In this new data set, keep the same variable names: Age, SBP, and DBP. (Hint: Use SET with RENAME data set option, and INPUT function. Also use PROC CONTENTS to check your results.)
  11. The following SAS data set was given:
    DATA many;
    INPUT Subject : $2. record x1-x3 @@;
    DATALINES;
    01 1 3 8 4 02 3 5 6 9
    02 2 3 6 7 01 3 7 9 5
    02 1 2 4 6 03 1 4 4 5
    01 2 5 9 3 03 3 6 7 7
    03 2 5 6 5
    ;
    The SAS data set many has three observations per subject. Read this SAS data set and create a new SAS data set one that has one observation per subject containing the following variables Hint: Use
  12. The following DATA step is incomplete. Answer the questions in parts (a), (b) and (c) below by filling in appropriate SAS statements.
    DATA _NULL_;
    FILE PRINT;
    INPUT ID $ 1-4 answers $ 6-10;
    (SAS statements here)
    DATALINES;
    0123 ABCDE
    2264 XCCBA
    3124 aaBAD
    1972 CBADE
    9275 35AbE
    6211 DDEAC
    5278 A7344
    ;
    1. Suppose the variable answers contains answers to 5 multiple choice problems in which legitimate answers are A, B, C, D, and E (upper case). Insert suitable SAS codes so that proper messages will be printed for each data line that contains invalid data. Be sure to include data line number for invalid data. Replace each of problematic answers with a question mark (?).
    2. Suppose, in the above, the cases for the answers are ignored. That is, a, b, c, d, and e are legitimate answers corresponding to their upper-case counterparts. Redo part (a).
    3. Suppose, in addition to part (b), the five numbers 1, 2, 3, 4, and 5 are also legitimate answers (1 for A, 2 for B, etc). Redo part (a).
    4. For part (c), rewrite the entire data step and create a SAS data set named multiple that contains six variables: ID, q1-q5. Convert all legitimate answers to upper case (A to E) and place missing values for invalid answer codes.
    Hint: Use
  13. Rewrite the following SAS code to make it more efficient:
    DATA simu(DROP=i);
    LENGTH X1 X2 prop1 prop2 4 group $ 1;
    DO i = 1 TO 10000;
    X1 = RAND('uniform');
    X2 = RAND('uniform');
    prop1 = (x1 < 0.5);
    prop2 = (x2 < 0.005);
    IF x1 > 0.9 THEN group='A';
    ELSE IF x1 > 0.7 THEN group='B';
    ELSE IF x1 > 0.4 THEN group='C';
    ELSE group='D';
    OUTPUT;
    END;
    RUN;
  14. The following data step reads SAS data set simu from the previous problem and create a SAS data new. Rewrite it to make it more effieient.
    DATA new;
    SET simu;
    IF prop2=1 THEN X3=X1;
    ELSE X3=x2;
    RUN;
  15. Rewrite the following data step to make it more efficient (the actual raw data file is hypothetical, it does not exist):
    DATA a;
    INFILE 'f:\mydata\raw.dat' PAD;
    INPUT @1 (Q1-Q8) (4.1 +1) @51 class $9.;
    IF class in ('Junior','Senior');
    RUN;
  16. Listed below is an example data set that contains student names and (eight) test scores. The number of students ranges from 15 to 30. Assume that the student names are their first names which can be as long as 12 characters long containing no embedded space.
    Toby 64       80       61       59       31       44       90       86
    Barb 70       86       61       81       20       95       54       83
    Jay  81       73       56       72       26       92       87       40
    Joe  50       70       72       65       32      100       77       62
    Bob  68       94       82       82       71       58       79       71
    Kay  31       68       90       81       44       82       88       83
    Adam 73       81       85       79       61       68      100       71
    Carl 60       95       88       65       42       72      100       86
    Dave 72       67       70       37       45       61       84       82
    Guy  39       81       88       56       35       75      100       64
    Levy 82       94       87       74       79       50       69       89
    Eton 36       65       62       81       78       99       52       83
    Iman 63       77       85       90       60       78       95       39
    Pat  50       93       78       48       70       72       84       86
    Sue  67       76       57       71       59       80       98       56
    Fay  60       38       87       94       61       86       80       66
    Mark 65       81       88       77       34       74       78       36
    Hal  34       90       64       94       63       80       97       67
      
    The passing scores on the tests are 55, 65, 60, 65, 50, 70, 75, and 60, respectively. Write a data step (DATA _NULL_) that produces data listing, adding a column of number of passes for each student on the right and a row of number of passes for each test at the bottom. DO NOT hard code the data step by assuming there are only 18 observations such as that shown above.
  17. Information on each of four drugs was recorded for a patient's coming in for a clinic visit: a code of 0 indicates the drug is not being taken, a code of 1 indicates the drug is being started, a code of 2 indicates that drug is being discontinued. For each visit, a patient ID, visit date, and the status codes of the four drugs are recorded. Write a SAS program using only PROC SORT and a DATA step that will
    1. count the number of patients who were ever on at least one of the four drugs (a patient was on a drug rx_1, say, if he or she was found with a code of 1 or 2 on at least one visit);
    2. count the number of patients who were ever on each and every one of the four drugs;
    3. count the number of patients who had never been on any drug.
    Use PUT statement to write these numbers to a print or log file.
    DATA drug;
     INPUT ID : $2. Visit : MMDDYY8. rx_1-rx_4;
    DATALINES;
    04 01/28/95 1 0 1 1
    03 02/28/95 2 0 2 1
    05 05/20/95 0 0 0 0
    01 03/27/95 1 0 0 1
    03 03/02/95 0 0 0 1
    01 04/04/95 0 1 1 1
    02 04/29/95 0 1 1 1
    02 05/04/95 0 0 1 0
    03 02/26/95 1 0 1 1
    03 03/06/95 0 1 0 1
    04 01/25/95 0 0 1 0
    05 03/01/95 0 0 0 0
    02 04/30/95 0 2 1 2
    01 03/31/95 2 1 0 1
    03 03/07/95 0 1 0 2
    04 02/01/95 1 1 1 1
    05 04/18/95 0 0 0 0
    01 04/09/95 0 2 1 2
    ;
       
  18. The data below are patient's ID, SBP, and DBP. SBP that is greater than 160 (mmHg) and DBP that is greater than 90 (mmHg) are considered to be hypertensive.
    ID     SBP    DBP
    518    132     66
    569    153     78
    647    165     88
    345    129     78
    908    131     76
    771    124     85
    189    120     80
           117     79
    535    201     98
    607    143     85
    832    156     92
    795    122     74  
       
    Use two methods (each in a DATA _NULL_ step) to produce the output below (added ruler is not part of the output):
    1. Use single trailing at (@) in PUT statement along with logical statements to place an asterisk (*) where needed.
    2. Create character variables from SBP and DBP and use SUBSTR pseudo function to place an asterisk at the end of the string if needed.
    ----+----1----+----2----+----3----+----4----+----5----+
    Listing of Systolic and Diastolic Blood Pressure
    Pressure marked with an asterisk (*) are hypertensive
        ID        Systolic            Diastolic
              Blood Pressure      Blood Pressure
    ------------------------------------------------
       518          132                  66
       569          153                  78
       647          165*                 88
       345          129                  78
       908          131                  76
       771          124                  85
       189          120                  80
                    117                  79
       535          201*                 98*
       607          143                  85
       832          156                  92*
       795          122                  74
       
  19. For SAS data miss in example ProcFormat.sas, count the number of missing and non-missing values for numeric variables only. Use PROC MEANS, instead of PROC FORMAT, and PROC FREQ to accomplish this. First create an output data set then use PROC PRINT to print out the results in a listing similar to the following one:
                     Number of Missing and Non-missing Values
    
                   a          a          c          c          d          d
       Total  Non-missing  Missing  Non-missing  Missing  Non-missing  Missing
         5         3          2          3          2          4          1
       
  20. Consider the grades data in Array.sas. There are four homeworks at 8 points each and six quizzes at 8 points each too.
    1. Suppose the teacher decides to drop the lowest scores from the 10 homework-quizzes combined and then re-scale the score. That is, she decides to compute the total score (100% base) of each student by the formula
      score = total of 9 homework-quizzes high scores × 10/9 + final
      Grade the course for the teacher. Email program only.
    2. What if she decides to drop two lowest scores from the homework-quizzes. Email the program only.
  21. The following is a up-to-date record (as of 10/05/15 (05OCT2015)) of patients visiting a pediatric clinic who were diagnosed of a certain disease. (The SAS code is available as pediatric.sas).
    DATA pediatric;
     INFILE DATALINES FIRSTOBS=2 PAD;
     INPUT @1  ID          $3.
           @5  DateOfVisit MMDDYY8.
           @14 Doctor      $5.;
     DATALINES;
    ----+----1----+----2
    003 09/18/14 KLMN
    004 11/12/14 ABCD
    002 10/21/14 ABCD
    003 10/27/14 ABCD
    001 12/05/14 DOE
    004 03/01/15 ABCD
    002 01/13/15 KLMN
    001 01/27/15 ABCD
    004 07/31/15 DOE
    003 05/31/15 DOE
    001 05/11/15 UVWXY
    ;
       
    A variable called Result is to be set according to the following:
    Result =
    Success, if patient returns more than 90 days later for a visit
    Failure, if patient returns within 90 days for a visit
    Unknown, if it's within 90 days to current day (05OCT2015) since the patient's last visit
    The data set blame from this should appears as the following
    
    		      DateOf
    	   ID          Visit    Doctor    Result
    
    	   001    12/05/2014    DOE       Failure
    		  01/27/2015    ABCD      Success
    		  05/11/2015    UVWXY     Success
    
    
    	   002    10/21/2014    ABCD      Failure
    		  01/13/2015    KLMN      Success
    
    
    	   003    09/18/2014    KLMN      Failure
    		  10/27/2014    ABCD      Success
    		  05/31/2015    DOE       Success
    
    
    	   004    11/12/2014    ABCD      Success
    		  03/01/2015    ABCD      Success
    		  07/31/2015    DOE       Unknown
    
       
    Write a SAS code to obtain the SAS data set blame.
  22. This problem will make use of the cars data set from 1983 ASA Data Exposition. The data set was packed using UNIX shar command. To help you on doing this homework, I've unshar'ed it. It results in three files (renamed): There are noticeable data errors in carsNames.txt. However, DO NOT modify it with editor program and take it as-is. Also note that the missing value code used in the data file cars.txt is NA. Again, DO NOT modify it with editor program and take it as-is.
    1. You are to write a SAS program which takes the main data file and reads into a SAS data set. E-mail program only.
    2. Write a SAS program which read car names data file and break each car name into make and model name. Note that the abovementioned data errors cannot be detected by SAS log. However, use PROCs such as FREQ or TABULATE, you may be able to spot the errors. Now, use DATA step to correct data errors. E-mail program and turn in the table with the explanation of erroneous data.
    3. Merge the two SAS data sets created above into a single SAS data set. Now use this SAS data set to produce some exploratory descriptive statistics (of your choice) in two tables (using PROC REPORT). E-mail program and turn in a two-page simple report (WORD file) of your findings (including at-most-one-page SAS output).
  23. Redo Exercise 16 by
    1. first creating a data named tests containing the data
    2. then read the data in PROC IML by
      proc iml;
       reset ls=80 nocenter;
       use tests;
        read all var _NUM_ into A;
        read all var _CHAR_ into student_names;
       rnames = {"Students Passed"};
       rnames = student_names // rnames;
       cnames = ('Test 1':'Test 8') || {'Passed'};      
              
      where rnames and cnames are row names and column names of a matrix, named result, in which first n (=18) rows and first p (=8) columns (i.e., matrix A) contain the p test scores of these n students, column p+1 contains numbers of tests passed by students, row n+1 contains numbers of students passed the tests, and entry (n+1,p+1) contains total number of passes.
    You're to continue the codes in PROC IML to construct and print this matrix result. The printed result should be similar to this:
                                 Test Results
                    Test1 Test2 Test3 Test4 Test5 Test6 Test7 Test8 Passed
    
    Toby               64    80    61    59    31    44    90    86      5
    Barb               70    86    61    81    20    95    54    83      6
    Jay                81    73    56    72    26    92    87    40      5
    Joe                50    70    72    65    32   100    77    62      6
    Bob                68    94    82    82    71    58    79    71      7
    Kay                31    68    90    81    44    82    88    83      6
    Adam               73    81    85    79    61    68   100    71      7
    Carl               60    95    88    65    42    72   100    86      7
    Dave               72    67    70    37    45    61    84    82      5
    Guy                39    81    88    56    35    75   100    64      5
    Levy               82    94    87    74    79    50    69    89      6
    Eton               36    65    62    81    78    99    52    83      6
    Iman               63    77    85    90    60    78    95    39      7
    Pat                50    93    78    48    70    72    84    86      6
    Sue                67    76    57    71    59    80    98    56      6
    Fay                60    38    87    94    61    86    80    66      7
    Mark               65    81    88    77    34    74    78    36      6
    Hal                34    90    64    94    63    80    97    67      7
    Students Passed    12    17    16    14     9    13    15    14    110
          
    Hint: First use SHAPE function to create a matrix of the same size as matrix A in which each row contains the eight passing scores of the tests. Then use comparison, subscript reduction and concatenation operators, and MATTRIB statement to construct matrix result.
  24. Redo Exercise 20 by using PROC IML.
  25. Denote f(x;μ,σ) and F(x;μ,σ), respectively, the p.d.f. and the c.d.f. of N(μ,σ2). The p.d.f. and the c.d.f. of contaminated normal cN(μ,σ2,σc2,p) (a N(μ,σ2) contaminated in spread by a N(μ,σc2) with probability p) are
    (1-p)f(x;μ,σ)+pf(x;μ,σc) and
    (1-p)F(x;μ,σ)+pF(x;μ,σc),
    respectively. Write two IML modules, one for p.d.f. and one for c.d.f., that implement these functions.
    Hint: Use base SAS functions, PDF and CDF, in PROC IML.
  26. See below. For pdf version: see e26.pdf.

    (e26.png here)

  27. See below. For pdf version: see e27.pdf.

    (e27.png here)

  28. The university payroll department is to project employee costs in wages, retirement (benefits), and medical (benefits) through future years based on assumed increases:


    variable

    Current
    Level
    Estimated
    Annual
    Increase
    Wages
    $73,142,768
    3.1%
    Retirement
    $8,006,549
    1.3%
    Medical
     $2,993,170
    9.4%
    Use PROC IML to create a data matrix containing four columns: year and the three variables in the table above. Then create a SAS data set, named future, from this data matrix with named variables: Year, Wages, Retirement, and Medical (use proper case for the names). Initialize each of the three variables in the table to its current value. The SAS data set future should contain the projected costs for the next ten year (that is, from year 2014 to 2023 assuming the current year is 2013). As an example, the cost for wages in the year 2014 will be 1.031 times the wages expense of this year.
  29. Use existing SAS data set ozone0 created by the following DATA step.
    FILENAME oz URL
      'http://www.stat.wmich.edu/wang/680/data';
    DATA ozone0;
    INFILE oz(ozone.txt);
    INPUT x1-x10;
    RUN;
         
    This data set contains pairs of ozone levels of two cities A and B in the past 200 days. Each row contains 5 such pairs. Thus, x1, x3, x5, x7, and x9 are the ozone readings in city A in five consecutive days; x2, x4, x6, x8, and x10 are the ozone readings in city B in five consecutive days.
    1. From this SAS data set, create a new SAS data set, named ozone, with two variables A and B containing ozone levels of the two cities in 200 rows.
    2. Use PROC GPLOT to create quantile-quantile plot in two ways:
      1. use only complete cases (i.e., observations in which ozone levels of both cities are non-missing): plot quantiles (order statistics) of ozone levels of B versus those of ozone levels of A;
      2. for each city, use non-missing ozone readings to calculate 5ith percentiles, i=0,1,...,20. (Note: 0th percentile=MINIMUM, 100th percentile=MAXIMUM).
        Assume the order statistics of non-missing values are
        X(1), X(2), ..., X(n)
        To obtain 100pth percentile, first compute
        g = 0.5 + n×p
        and express g by
        g = k + f, where 0≤f<1.
        That is, k is the integral portion of g, and f is the fractional portion of g.
        The 100pth percentile is then defined by
        f × X(k+1) + (1-f) × X(k)
        Then use PROC GPLOT to plot pairs of percentiles of the two cities. (Hint: Use PROC IML. In IML, use SORT routine to get order statistics and then compute percentiles.)
  30. Write a SAS data step to read the following data. Fields are student name, id, weight, height, gender, and GPA, respectively. Note that variables weight, height, and GPA are numeric, the rest are character.
    Jack 12-3456 192    72 M 3.3
    Gill    23-4567 131 69 F 3.5
    Jay 34-5678       . 70 M 2.9
    Elizabeth 45-6789 115 68 F .
    Joanne 56-7890 121 70 F  3.8
    Peggy 67-8901 135 67   F 3.6
    Washington 78-9012 213 73 M 3.8
    Jim   89-0123   200 70 M 3.4
    
    Hint: available tools: one of
    1. colon informat modifier (:) in INPUT statement
    2. LENGTH statement
    3. INFORMAT statement
    4. ATTRIB statement
  31. A car survey was conducted and the data were recorded using two lines of data for each subject.
    Line
    Variable
    Name
    Type
    Start
    Column
    End
    Column
    Description
    1
    Subject
    Char
    1
    3
    subject ID

    DoB
    MM/DD/YY
    4
    11
    date of birth

    State
    Char
    25
    26
    State where living

    ZipCode
    Char
    40
    44
    zip code

    2
    Subject_
    Char
    1
    3
    subject ID*

    Cars
    Numeric
    4
    4
    number of cars

    Car1
    Char
    11
    20
    make of car 1

    Car2
    Char
    21
    30
    make of car 2

    Car3
    Char
    31
    40
    make of car 3
    Note*: Subject_ is supposed to be identical to Subject. However, there may exist subjects so that they differ. Below are example data:
    ----+----1----+----2----+----3----+----4----+----5
    00109/07/49             PA             19056
    0012      Chrysler  Ford
    00201/31/57             NY             11518
    0021      Honda
    00311/21/40             IN             46403
    0132      Chevy     GMC
    004                     MI             49077
    0043      Toyota    Toyota    Nissan
    (more subject data lines)
         
    Note that, unused fields at the end are truncated. Maximum number of cars per subject is 3. The field Subject_ is used to check for data correctness. Include only Subject in the SAS data set. However, do a raw data check, write a message into SAS log when encountering a raw data error. Also add at least two more subjects (i.e., at least four more raw data lines) to the above data in your program.
    Hint: Tools used may include:
    1. INPUT statement with line pointer (# or /), column pointer, informats
    2. TRUNCOVER option in INFILE statement
    3. DROP or KEEP statement; or DROP= or KEEP= data set option
  32. In the SAS data set below, a value of NA or na was used in place of a missing value. Include the SAS code below that create this data set, but write a new SAS data step that creates a SAS data set new where these values are converted to a SAS character missing value. In addition, your SAS log should not contain 'ERROR' and or 'WARNING'.
    DATA old;
     LENGTH A B C D $ 2;
     INPUT ID $ A $ B $ C $ D $ X Y Z;
    DATALINES;
    01 Y N Y Y 3 9 12
    02 na N Y N 5 2 8
    03 NA na na NA 4 1 7
    ;
          
    Hint: Tools used may include:
    1. SET statement
    2. PROC FORMAT with INVALUE statements
    3. ARRAY statement
    4. UPCASE function
    5. Index DO loop
    6. DIM function (optional)
  33. Use the SAS data set clinical created in Problem 6 (but use the raw data file clinical2.dat instead) and create the following table using PROC TABULATE.
    Simple Statistics Broken Down By Two Variables
    
    ---------------------------------------------------------------------------
    |                                  | Heart Rate |Systolic BP |Diastolic BP|
    |                                  |------------+------------+------------|
    |                                  |Number|Mean |Number|Mean |Number|Mean |
    |----------------------------------+------+-----+------+-----+------+-----|
    |Taking Vitamin? |Gender           |      |     |      |     |      |     |
    |----------------+-----------------|      |     |      |     |      |     |
    |No              |Female           |    17| 88.2|    17|205.1|    17| 86.1|
    |                |-----------------+------+-----+------+-----+------+-----|
    |                |Male             |    12| 94.4|    12|201.8|    12| 92.3|
    |                |-----------------+------+-----+------+-----+------+-----|
    |                |Both             |    29| 90.8|    29|203.7|    29| 88.7|
    |----------------+-----------------+------+-----+------+-----+------+-----|
    |Yes             |Gender           |      |     |      |     |      |     |
    |                |-----------------|      |     |      |     |      |     |
    |                |Female           |     3| 89.7|     3|196.3|     3| 98.3|
    |                |-----------------+------+-----+------+-----+------+-----|
    |                |Male             |     3| 84.7|     3|196.3|     3| 88.0|
    |                |-----------------+------+-----+------+-----+------+-----|
    |                |Both             |     6| 87.2|     6|196.3|     6| 93.2|
    |----------------+-----------------+------+-----+------+-----+------+-----|
    |Both            |Gender           |      |     |      |     |      |     |
    |                |-----------------|      |     |      |     |      |     |
    |                |Female           |    20| 88.5|    20|203.8|    20| 87.9|
    |                |-----------------+------+-----+------+-----+------+-----|
    |                |Male             |    15| 92.5|    15|200.7|    15| 91.5|
    |                |-----------------+------+-----+------+-----+------+-----|
    |                |Both             |    35| 90.2|    35|202.5|    35| 89.4|
    ---------------------------------------------------------------------------
          
    Hint: Tools used may include (in PROC TABULATE):
    1. CLASS and VAR statements
    2. KEYLABEL and LABEL statements
    3. TABLE statement with the options below:
        PRINTMISS and RTSPACE= options
  34. Redo parts (a), (b), and (c) in problem 12, but use only arrays, DO loops, functions RANK and BYTE, logical comparisons.
  35. Redo problem 28 using SAS macro programming with the following assumption:
    Having the data:
    DATA configuration;
     INPUT variable : $10. current_level : DOLLAR11.
           est_incr : PERCENT4.;
    CARDS;
    Wages $73,142,768 3.1%
    Retirement $8,006,549 1.3%
    Medical $2,993,170 9.4%
    ;
    	
    From this data set, create three macro variables: varlist, init, and rate using two methods: 1) PROC SQL macro facility and 2) SAS DATA step macro interfaces. After these, use the following codes to create desired data set (you may modify the codes except for the two ARRAY statements):
    %LET years = 10;
    %LET this_year = 2013;
    
    DATA outlook&years;
     ARRAY item[3] &varlist (&init);
     ARRAY rate[3] _TEMPORARY_ (&rate);
     DO i = 1 TO &years;
      year = &this_year + i;
      DO j = 1 TO 3;
       item[j] = item[j] * (1+rate[j])**i;
      END;
      OUTPUT;
     END;
     DROP i j;
    RUN;
    
    PROC PRINT DATA=outlook&years;
     ID year;
     FORMAT &varlist COMMA14.2;
    RUN;
    	
    The output from the PRINT procedure should look like this:
    year             Wages        Retirement           Medical
    
    2014     75,410,193.81      8,110,634.14      3,274,527.98
    2015     80,158,095.02      8,322,881.32      3,919,072.97
    2016     87,846,281.63      8,651,711.68      5,131,393.45
    2017     99,256,291.46      9,110,449.80      7,350,292.19
    2018    115,624,900.21      9,718,227.16     11,518,375.00
    2019    138,868,377.90     10,501,315.79     19,746,728.07
    2020    171,954,687.65     11,495,022.77     37,035,345.43
    2021    219,524,674.50     12,746,337.07     75,989,742.98
    2022    288,942,435.69     14,317,606.00    170,573,230.08
    2023    392,101,029.21     16,291,642.10    418,874,665.06
    	
  36. Redo the cdf (cumulative distribution function) part of problem 25 by writing a macro with the following parameters:
    1. mu (i.e., the mean μ)
    2. sigma (i.e., the uncontaminated standard deviation σ)
    3. cont_sigma (i.e., the contaminated standard deviation σc)
    4. p (i.e., the probability of contamination)
    5. data=&SYSLAST (input data set that contains a variable of x values to calculate cdf's)
    6. var= (to select a single variable of x values to calculate cdf's)
    7. F=cdf (data set variable name containing cdf's)
    8. out= (output data set, this should contain at least 2 columns, one for x values, one for respective cdf's)
    Run a test on an input data set and show the results.
  37. Use the SAS data created by the SAS codes below (also available for download in admit.sas) and create, in two methods, a new SAS data set Containing 6 observations and 5 columns: one observation per department and variables (in this order) Department, MalesAdmitted, MalesRejected, FemalesAdmitted, and FemalesRejected.
    DATA UCBAdmissions;
     LENGTH Admit Gender $ 8;
     LENGTH Department $ 1;
     DO Department = 'A', 'B', 'C', 'D', 'E', 'F';
      DO Gender = 'Male', 'Female';
       DO Admit = 'Admitted', 'Rejected';
        INPUT Count @@;
        OUTPUT;
       END;
      END;
     END;
    DATALINES;
    512 313 89 19   353 207 17 8
    120 205 202 391    138 279 131 244
    53 138 94 299   22 351 24 317
    ;
    
    /*
    PROC PRINT DATA=UCBAdmissions LABEL SPLIT='*' NOOBS;
     LABEL Count='Number of*Students';
    RUN;
     */
      
    When either new1 or new3 is printed, it should produce output like:
                    Males       Males      Females     Females
    Department    Admitted    Rejected    Admitted    Rejected
    
        A            512         313          89          19
        B            353         207          17           8
        C            120         205         202         391
        D            138         279         131         244
        E             53         138          94         299
        F             22         351          24         317
      
    The data set UCBAdmissions is real and is famous for its use as an example to demonstrate Simpson's Paradox. This is a data collected in the University of California-Berkeley. Gender-bias toward women applicants was suspected when the data was presented.
  38. Use the UCBAdmissions data set (see Exercise #37) and produce a 'flatten' table like
                    Dept   A   B   C   D   E   F
    Admit    Gender                             
    Admitted Male        512 353 120 138  53  22
             Female       89  17 202 131  94  24
    Rejected Male        313 207 205 279 138 351
             Female       19   8 391 244 299 317
    
    Suggested tools: either PROC FREQ or PROC TABULATE
  39. Use the UCBAdmissions data set (see Exercise #37) to produce an overall (university-wide) 2×2 table of Admit×Gender and the 6 individual (department-wide) 2×2 table. Perform Pearson's chisquare test and the Fisher's exact test on each table. Discuss the outcomes. Tool: PROC FREQ.
  40. The SAS data set VS created by TYPvs.sas contains the vital signs records of subjects from a clinical study detailed in the study protocol TYP-001. The data were collected from 3 clinical sites (001, 002, 003) with various numbers of subjects. After subjects entered into the study, their vital signs were taken on the baseline vist, on the day of treatment (approximately within 2 weeks of baseline visit), and at the day of 3 months (approximately) after treatment. Baseline systolic blood pressure (SBP) is normal if 100 ≤ SBP ≤ 140 mmHg; and SBP is elevated if SBP > 140 (not classified for other range). Baseline diastolic blood pressure (DBP) is normal if 60 ≤ DBP ≤ 90 mmHg; and DBP is elevated if DBP > 90 (not classified for other range). For subsequent visits, each vital sign value was monitored to detect if it is of Potential Clinical Significance (PCS). HR is of PCS if HR < 40 or if HR > 110. When baseline SBP is normal, an SBP is of PCS if SBP < 85 or if SBP > 160; when baseline SBP is elevated, an SBP is of PCS if SBP > 20 change from baseline or if SBP > 180. When baseline DBP is normal, a DBP is of PCS if DBP < 45 or if DBP > 100; when baseline DBP is elevated, a DBP is of PCS if DBP > 10 change from baseline or if DBP > 110.
    Use this SAS data and create a new SAS data set, named new_VS so that:
  41. Read the SAS data vs created in problem 40, create a new data set, named new_vs2, which turns the subject-level sas data (a.k.a., wide format) into a long format containing these variables:
    1. usubjid (renamed from patient)
    2. studyid, siteid, subjid (as in problem 40)
    3. vistnum (numeric): 1 = 'Baseline Visit' 2 = 'Treatment Date' 3 = '3 Months Post Treatment'
    4. visdt: dates of the visits
    5. Ady: Analysis Day = (visit date) - (treatment date) if visit date < treatment date, and = (visit date) - (treatment date) + 1 if visit date ≥ treatment date
    6. sbp: Systolic Blood Pressure (at respective visit dates)
    7. dbp: Diastolic Blood Pressure (at respective visit dates)
    8. hr: heart rate (at respective visit dates)
    9. pcssbp (character): Potential Clinical Significance SBP, = " " at baseline (since it's not applicable) = "yes" or "no" at treatment or 3-month post treatment according to the definition as laid out in problem 40.
    10. pcsdbp (character): Potential Clinical Significance DBP similar to pcssbp.
    11. pcshr (character): Potential Clinical Significance hr similar to pcssbp.
    12. sbpbl: SBP baseline values (each subject has a value repeated over all records of that subject)
    13. dbpbl: DBP baseline values
    14. hrbl: HR baseline values
    15. chgsbp: SBP change from baseline (=0 at baseline)
    16. chgdbp: DBP change from baseline (=0 at baseline)
    17. chghr: HR change from baseline (=0 at baseline)
    18. BLFL (character): Baseline Flag. = "Y" only at baseline records, and = " " (blank) for all other records.
    The new SAS data set should be sorted by usubjid and visitnum. You can do this problem either by using PROC TRANSPOSE or by using arrays, output statement in a data step (hint: also use vlabel function).
  42. Use the data, new_vs2, created in problem 41, create a new data set, named advs (Vital Sign Analysis Data Set), that stacks the vital sign variables and contains these variables:
    1. usubjid
    2. studyid
    3. siteid
    4. subjid
    5. vistnum
    6. visdt
    7. Ady
    8. PARAM (length-40 character): the labels of variables SBP, DBP, and HR.
    9. PARAMCD (length-8 character): the variable names of SBP, DBP, and HR.
    10. AVAL: values of the variables SBP, DBP, and HR.
    11. Baseline: variable of baseline values.
    12. CHG: Change from Baseline values.
    13. PCS (character): Potential Clinical Significance.
    14. BLFL: Baseline record flag.
    You can use tools similar to those in problem 41. In addition, functions such as vname can be used.
  43. The SAS data set HEMA created by TYPhema.sas contains the (unsorted) hematology records of subjects from the same clinical study as described in problem 40. The data set is in fairly similar format as the data set new_vs2 in problem 41 with exceptions: the hematology data were collected from various laboratories and consequently are of various units. Create an intermediate data hema1 by converting the laboratory reporting units into Systeme International (SI) units using the table below (SI = [Conversion Factor] × Reporting Unit):
    Hematology Reporting Unit SI Unit Conversion Factor
    RBC (all reporting units) TI/L 1
    HEMOGLOBIN mmol/L or mmolFe/L mmol/L 1
      g% or g/dl mmol/L 0.155
    HEMATOCRIT fraction or L/L L/L 1
      % L/L 0.01
    PLATELETS 10^9/L (109/L) or 10^3/uL (103/μL) GI/L 1
      mm3 (mm3) GI/L 0.001
    WBC 10^9/L (109/L) GI/L 1
      mm3 (mm3) or uL (μL) GI/L 0.001
    The converted data hema1 should contain the hematology data in SI units. Merge this hematology data with the VISDT and ADY variables from new_vs2 and then create a new data set adht (Hematology Analysis Data Set) similar to advs in problem 42. The PCS variable is now defined (according to SI units) at all time points with four possible values "no", "low", "high", or " " (if the measurement is missing) according to the following rules:
    Hematology Low High No
    RBC ≤3.5   >3.5
    HEMOGLOBIN Female≤95; Male≤115 >180 otherwise
    HEMOGLOBIN, change from baseline   >25 ≤25
    HEMATOCRIT Female≤0.32; Male≤0.37 >0.54 otherwise
    HEMATOCRIT, change from baseline   >0.075 ≤0.075
    PLATELETS <100 >550 otherwise
    WBC <3 >20  
    For clarity, create a separate PCS variable named PCSCHG just for the PCS-change from baseline for HEMOGLOBIN and HEMATOCRIT variables. VISDT, VISITNUM, and ADY should also be included. For the data, the final adht data set should have 9×3×5 = 135 observations. That is, for each combination of USUBJID and VISTDT, there are five observations (the five hematology measurements).
  44. Consider the data work.wt created by the SAS codes below:
    data work.wt;
     input patient $ visdt : mmddyy. weight;
     format visdt date9.;
    datalines;
    05 09/09/2003 121
    04 09/13/2003 89 
    02 01/31/2004 181
    03 05/09/2004 149
    04 07/11/2003 .  
    03 12/18/2003 191
    05 10/27/2003 .  
    01 01/23/2004 83 
    05 07/15/2004  89
    01 10/30/2003 69 
    02 08/10/2003 118
    02 10/21/2003 123
    05 01/18/2004 151
    04 02/27/2005 133
    03 11/01/2003 120
    03 02/15/2004 185
    01 09/27/2003 103
    04 12/28/2004 151
    ;
        
    Patients' weights are closely monitored according to the following classes: For each patient, at all the visits, a weight change is classified as significance ("yes") if there is at least a change of 2 classes from last available class. For example, change from 1 to 3, change from 5 to 2, change from 3 to 6, etc., are each considered as significance. If there is no significance change then it is marked as "no". However, if no previously available record of weight class, then mark it as " ". Create a new data set wt2 containing this additional variable, named Signif. (Note: as a consequence, the value of this new variable in the first record of each patient is always " " since there is no previously available weight record for this patient.)
  45. Predict what will turn out in SAS log when you run the following SAS codes:
    data _null_;
      input a;
      put a= _n_=;
      do _n_ = 1 to 3;
        put _n_=;
      end;
      put _n_=;
      datalines;
    1
    2
    3
    ;
        
    Run the SAS codes. Compare your prediction with the results. Explain why.
  46. Redo Exercise 18 with a data _null_ step but with picture formats for printing SBP and DBP to produce the same output.
  47. The HARVEST Trial Data (description in HARVEST Trial in OzDASL contain subject-level information (that is, of wide format). Construct a SAS data set, named HARVEST. Use sensible labels for all the variables. For example, 'Smoking Status at Baseline' for SMOKE and 'Sport Activity at Baseline' for SPORT and 'Age at Baseline' for AGEB. Read all columns as of numeric type. Precede the data step with a PROC FORMAT to define custom formats for the following variables and then associate these formats with the corresponding variables in the data step: A snapshot of the data view (with the 4 variables rearranged to appear the the very beginning) should look like:
  48. Construct, from the HARVEST data set in the previous problem, a new data set that contains the following columns (in long format) by stacking up columns: A snapshot of the data view for the first 2 subjects should look like:
  49. The data set HARVEST2 contains observations having all missing vital signs (SBP, DBP, and HR) in that the visit did not actually take place. Delete these observations and obtain a new data set HARVEST3. It should have 3786 observations.
  50. Using data steps and some procedure steps and create a new data set HARVEST4 from HARVEST3. The new data set is in CDISC format in which the vital signs are stack-upped. In addition to existing variables SUBJECT, TYPE, VISIT, VISITNUM, SMOKE, SPORT, AGE, BMI, ENDPOINT, TIME, and GENDER, add new variables The new data set should have 11,358 observations and is sorted by Subject, Paramcd, Type, Visitnum. A snapshot of the data view (with variable order rearranged) for the first 2 subjects should look like:
  51. Use the data set HARVEST3 and create a summary report table t_12_2_1vs.doc using the format as in the table mockup mt_12_2_1vs.doc. The implementation notes at the bottom of the mockup should not be included in Table 12.2.1.
    Tools used: PROC MEANS with noprint option, OUTPUT statement, WAYS statement, VAR and CLASS statement. The use of data set options WHERE= and DROP= could be useful.
  52. Repeat the previous problem by using the data set HARVEST4 instead.
  53. Use the Depression Data (description in Depression Before and After an Earthquake) and create a summary report of the scores at the 5 time points. The statistics to be included in the report at each time point should be arranged in rows like
         n
         Mean (SD)
         Q1/Median/Q3
         (Min, Max)
        
  54. Read the Turtle Data (data description in Plasma Protein of Fasting Turtles) and create a SAS data set named TURTLES.
  55. Note that the TURTLES data set is a subject level data set. That is, the data set is in wide format. Use this SAS data set and create a new SAS data set named TURTLES2 by stacking up Fed, Fasted10, and Fasted20. Refer to the HARVEST3, the resulted data set should have a similar structure having variables Subject, Sex, Protein (the stack-upped measurements of plasma protein), and Time (having 3 values: Baseline, Fasting 10 Days, and Fasting 20 Days). In addition, create variable BASE which is populated with the respective baseline plasma protein values for the subjects. Also create CHG, Change from Baseline having non-missing values at non-baseline time and defined by Protein − BASE.
  56. Use TURTLES2 data set and create a summary report for plasma protein at the 3 visits for either gender and for overall. At the two non-baseline time, also create summary statistics for CHG.
  57. Use Bassin Anticipatory Timing (data description in Bassin Anticipatory Timing) and repeat the steps similar to the TURTLES data above.
  58. Read Balance Data Set (in long format, and data description in Effect of Surface and Vision on Balance). Create a SAS data set named BALANCE
  59. Use the data in the previous problem and create summary reports of your choice. Tools includes, but not limit to, PROC FREQ, PROC MEANS, PROC UNIVARIATE, and PROC REPORT. You may use ODS to create specialized summary data sets/reports.
  60. Repeat the previous 2 problems, but use Lower Back Pain Data (data description in Runners with Low Back Pain) instead.
  61. Use the data set below and create a new data set that contains only the values of x that each appears only once in the data set rep.
          data rep;
            input x @@;
          datalines;
          26 37  5 26 10 26  2 10  2 17 26  2 50
          ;
         
    The new data set should contain 5, 17, 37, and 50 for variable x.
    Hint: Tools used may include proc freq with an OUT= option for the TABLES statement, and a data set option WHERE=.
  62. A study was conducted by XYZ Pharmaceutical Inc. in which subjects' medical events were logged in event.csv. Each occurrence of an event was recorded (in this order): Use a data step to read this data file into a SAS data set (Hint: use DLM= and DSD options in your INFILE statement) and then sort it by USUBJID and Event_Seq. Name the sorted data set ev_raw. Now, use this data set to create another data set, named event_log, containing the following variables: Hint: Use input and cats functions to create the numeric version of date values. For the character version of date values may include catx, input, cats, upcase, coalescec, ifc, and strip functions, if-else if construct or select-end construct, proc format with invalue statement for the conversion of 3-letter month abbreviation to 2-digit 'numeric' month name, use of SAS built-in monthname format. Note that you do not have to use them all and that you may have other tools of your choice to code the data step.
  63. Repeat the previous problem, but create another data set, in addition to event_log, named ev_error by using the data step statement
    data event_log ev_error;
    The data set event_log is as in the previous problem. However, the data set ev_error will capture every erroneous record of ev_raw in which at least one of the start date and end date is missing. The data set ev_error contains the same variables as ev_raw.
    Hint: In addition to the hint in the previous problem, use output statement under if-else if construct and keep= data set option.
  64. In a study conducted by XYZ Pharmaceutical Inc., subjects' visits are recorded in allvisits.csv. Each subject has 6 visits: Baseline, 2-Week Test, 3-Month Checkup, 6-Month Checkup, 1-Year Checkup, and 2-Year Checkup. In addition to these visits, each subject also made variable uncheduled visit with respect to medical needs. Read the data into a SAS data set, named allvisits. Then create a new SAS data set, named Visits, from this data set in which a new variable, named visitn, contains imputed visit numbers: as is for non-missing visitnum, and visit number such as 2.01, 2.02, 3.01, 3.02, 3.03, etc. according to the visit dates, visitdt.
  65. Two data sets from a clinical study are considered.
    Subject-level data set _sl (_sl.sas7bdat) contains Subject ID (usubjid), Treatment Number (trtn), Treatment (trt), Treatment Start Date (trtstdt), and Treatment End Date (trtendt).
    Adverse event data set allaes (allaes.sas7bdat) contains System Organ Class (SOC), Preferred Term (PT), Subject ID (usubjid), Treatment Number (trtn), Treatment (trt), Treatment Start Date (trtstdt), Treatment End Date (trtendt), AE Start Date (aestdt), and AE End Date (aeendt).
    These two SAS data sets were stored in a folder, say T:/680Fall14/egs/sasdata. Using SAS data set allaes to produce a summary report of the numbers of adverse events by System Organ Class/Preferred Term. An example (aetest.sas) run on a subset data set aetest (aetest.sas7bdat) produced the listing output (aetest.lst) where the report was produced from the data set final (final.sas7bdat). The SAS codes that produced the final data set was masked. Fill in SAS codes in example (aetest.sas) to test out and then apply it to the adverse event data set allaes (allaes.sas7bdat) to get the requested report. Note that the table was arranged in the descending order of the numbers of subjects within SOCs disregarding the treatment they received (variable ord in final data set) and then within each SOC, in descending order of numbers of subjects within PTs disregarding the treatment they received (var ord2 in final data set).
  66. Data set chrs was generated by the SAS codes below (also found in chrs.sas):
    data chrs;
      call streaminit("27OCT2015"d);
      length a $ 30 b $ 50 c $ 40 d group $ 1;
      do case = 1 to 100;
        group = choosec(rand('Table',.3,.3,.3,.1),"A","B","C","D");
        d = byte(ceil(rand("Uniform")*25+65)+32*rand("Bernoulli",0.5));
        a = repeat(d, ceil(rand("Normal",15,2)));
        d = byte(ceil(rand("Uniform")*25+65)+32*rand("Bernoulli",0.5));
        b = repeat(d, ceil(rand("Normal",25,3)));
        d = byte(ceil(rand("Uniform")*25+65)+32*rand("Bernoulli",0.5));
        c = repeat(d, ceil(rand("Normal",20,2.7)));
        output;
      end;
      drop d;
    run;
       
    Create a sas data set named max_len using two methods: The output data set should have 4 variables: group, a, b, and c where a, b, and c are numeric containing the maximum lengths by group. The image below shows the data view of max_len:

    Hint: use length function to count the number of characters in a text string.

2015-11-03