# Stat680  Exercises  Fall 2015

1. Practice: starting SAS, writing a short SAS program containing a data step and a PROC PRINT step, reading SAS log.
2. How many program steps are executed when the program below is processed? Why?
DATA hw1.listing;
INFILE jobs;
INPUT date name \$ type : \$10.;
RUN;
PROC SORT DATA=hw1.listing;  BY name;
RUN;
TITLE  'Homework 1';  TITLE2 'Sorted Listing';
PROC PRINT DATA=hw1.listing;
RUN;

3. Beware of OBS= and FIRSTOBS= system options: What observations are printed in each PROC PRINT step? Why?
TITLE 'Beware of OBS= and FIRSTOBS=';

DATA one;
INPUT AccountNumber amount @@;
DATALINES;
5 120  6 153    2 97  4 105   3
213  8 65     7   117 1 155
;

PROC SORT DATA=one OUT=two;    BY AccountNumber;
RUN;

TITLE2 '--> Two <--';
TITLE3 '=== Part A ===';
PROC PRINT DATA=two;   RUN;

OPTIONS OBS=7 FIRSTOBS=3;

TITLE3 '=== Part B ===';
PROC PRINT DATA=two;   RUN;

PROC SORT DATA=one OUT=three;    BY AccountNumber;
RUN;

TITLE2 '--> Three <--';
TITLE3 '=== Part A ===';
PROC PRINT DATA=three;   RUN;

TITLE3 '=== Part B ===';
PROC PRINT DATA=three(FIRSTOBS=1);   RUN;

4. What are the lengths and values of x, y, and z in each of the following data sets? Why? (Hint: how does SAS process each step? In addition, the wild card _NUMERIC_ refers to all previously existed numeric variables. The wild card _CHARACTER_ behaves likewise. See class note 03 for the use of RETAIN.)
DATA d1;
LENGTH x 3 y \$ 2 z 4;
RETAIN _NUMERIC_ 0 _CHARACTER_ 'xyz';
DATA d2;
LENGTH x 3;
RETAIN _NUMERIC_ 0 _CHARACTER_ 'xyz';
LENGTH y \$ 2 z 4;
DATA d3;
RETAIN _NUMERIC_ 0 _CHARACTER_ 'xyz';
LENGTH x 3 y \$ 2 z 4;

5. Read the following raw data (found in ex04.dat which has more rows than those shown below) in an external file into a SAS data set
----+----1----+----2----+----3----+----4----+----5----+----6----+
312445    12/01/4908/23/5604/29/73 Joe von Mises  \$41,230
473941    11/09/5106/21/5903/21/71 John Stevenson  \$100,459
(and more)

where
Name
Type
Start
Column
End
Column
Q1
num
1
2
Q2
num
3
4
Q3
num
5
6
Date1
num
11
18
Date2
num
19
26
Date3
num
27
34
FirstName
char
36
ends with
a blank
LastName
char

ends with
two blanks
Salary
num

followed by
end-of-record
marker
In your INPUT statement, use variable list and informat list.
6. Create a SAS data set from the raw data file clinical.dat with the following variable descriptions.
Variable
Description
Starting
Column
Length
Format
Codes
ID
Patient ID
1
3
char

Gender
Gender
4
1
char
M, F
DoB
Date of Birth
5
6
mmddyy

DoV
Date of Visit
11
6
mmddyy

pdx
Primary DX Code
17
2
char
see list
below
sdx
Secondary DX Code
19
2
char
see list
below
hr
Heart Rate
21
3
num

sbp
Systolic Blood Pressure
24
3
num

dbp
Diastolic Blood Pressure
27
3
num

vitamin
Patient Taking Vitamin?
30
1
char
1=yes
0=no
preg
Patient Pregnant?
31
1
char
1=yes
0=no

DX Code
Symptom

DX Code
Symptom
01
Cold

05
Abdominal Pain
02
Flu

06
Heart Problem
03
Routine Physical

07
Lung Disorder
04
Ear Ache

Create appropriate permanent formats for selected variables in the data step. Associate selected variables with suitable labels in data step. Write a comprehensive listing of the data with patients sorted according to their ID's.
7. Using the SAS data set you've created in the previous problem, create a new SAS data set named ages containing the variables:
• Age1: age as of Jan 1, 2001 rounded to the nearest year;
• Age2: age at the time of visit, dropping any fractional part of a year;
• Age3: age, rounded to the nearest half year, as of the date the program is run (i.e., today's date)
• Agegrp1n: numeric, its value =. if Age2=., =1 if Age2<30, =2 if 30≤Age2<50, =3 if 50≤Age2<65, =4 if 65≤Age2<80, =5 if Age2≥80.
• Agegrp1: character, its value =" " if Agegrp1n=., ='<30' if Agegrp1n=1, ='30-<50' if Agegrp1n=2, ='50-<65' if Agegrp1n=3, ='65-<80' if Agegrp1=4, ='≥80' if Agegrp1=5.
Include in this new data set, in additional to the three age variables, ID, Gender, DoB, DoV, and preg. Prepare a data listing with this new data set. Make sure you use DATE system option so that today's date is printed in the first title line. Also prepare a data listing of female patients only showing ID, DoB, DoV, Age1-Age3, and preg. Use sensible titles/footnotes and variable labels.
Hint: Use
• functions INT, ROUND, and TODAY
• KEEP statement or KEEP= data set option
• use WHERE statement or WHERE= data set option
• date literal (i.e., date constant)
8. Redo problem 6 using the following SAS data set to create a SAS control data set, which in turn is used to create format for DX code (hint: use PROC FORMAT with CNTLIN= option):
DATA codes;
INFILE DATALINES TRUNCOVER;
INPUT @1 dx \$2. @4 description \$16.;
DATALINES;
01 Cold
02 Flu
03 Routine Physical
04 Ear Ache
05 Abdominal Pain
06 Heart Problem
07 Lung Disorder
;
9. Use the SAS data step in problem 6 with the raw data set replaced by clinical2.dat and create a new data set named clinical2, and two SAS data sets created by the data steps below to create a new SAS data set called study.
INPUT ID : \$3. minute @@;
DATALINES;
159 10 594 12 547 15 315 12 953 13
;
DATA bodyfat;
INPUT ID : \$3. BodyFat @@;
DATALINES;
315 14 159 23 647 20 953 24 547 30 575 21
;
Some patients from clinical2 entered two studies, one on treadmill and one on bodyfat. Combining all data sets, keeping only patients entered in both studies, to create study
10. The following SAS data set read variables Age, SBP, and DBP as character.
data a;
INPUT Age \$ SBP \$ DBP \$ @@;
datalines;
23 121 70 35 134 72
49 . 68 30 156 78
52 132 98 ;
Create a new SAS data set, called b that read from data set a so that all variables are of correct type, numeric. In this new data set, keep the same variable names: Age, SBP, and DBP. (Hint: Use SET with RENAME data set option, and INPUT function. Also use PROC CONTENTS to check your results.)
11. The following SAS data set was given:
DATA many;
INPUT Subject : \$2. record x1-x3 @@;
DATALINES;
01 1 3 8 4 02 3 5 6 9
02 2 3 6 7 01 3 7 9 5
02 1 2 4 6 03 1 4 4 5
01 2 5 9 3 03 3 6 7 7
03 2 5 6 5
;
The SAS data set many has three observations per subject. Read this SAS data set and create a new SAS data set one that has one observation per subject containing the following variables
• Subject
• The differences between the third value (in record 3) of x1, x2, and x3 and the second value for each subject.
• The differences between the third value and the first value for each subject.
• The mean of x1, the mean of x2, and the mean of x3 for each subject. Note: DO NOT use PROC MEANS.
Hint: Use
• LAG, LAGn, DIF, DIFn, and MEAN functions
• SET and BY statements
• LAST.variable temporary variable
12. The following DATA step is incomplete. Answer the questions in parts (a), (b) and (c) below by filling in appropriate SAS statements.
DATA _NULL_;
FILE PRINT;
INPUT ID \$ 1-4 answers \$ 6-10;
(SAS statements here)
DATALINES;
0123 ABCDE
2264 XCCBA
9275 35AbE
6211 DDEAC
5278 A7344
;
1. Suppose the variable answers contains answers to 5 multiple choice problems in which legitimate answers are A, B, C, D, and E (upper case). Insert suitable SAS codes so that proper messages will be printed for each data line that contains invalid data. Be sure to include data line number for invalid data. Replace each of problematic answers with a question mark (?).
2. Suppose, in the above, the cases for the answers are ignored. That is, a, b, c, d, and e are legitimate answers corresponding to their upper-case counterparts. Redo part (a).
3. Suppose, in addition to part (b), the five numbers 1, 2, 3, 4, and 5 are also legitimate answers (1 for A, 2 for B, etc). Redo part (a).
4. For part (c), rewrite the entire data step and create a SAS data set named multiple that contains six variables: ID, q1-q5. Convert all legitimate answers to upper case (A to E) and place missing values for invalid answer codes.
Hint: Use
• character functions VERIFY, UPCASE, TRANSLATE and SUBSTR
• PUT statement
• Automatic variable _N_
13. Rewrite the following SAS code to make it more efficient:
DATA simu(DROP=i);
LENGTH X1 X2 prop1 prop2 4 group \$ 1;
DO i = 1 TO 10000;
X1 = RAND('uniform');
X2 = RAND('uniform');
prop1 = (x1 < 0.5);
prop2 = (x2 < 0.005);
IF x1 > 0.9 THEN group='A';
ELSE IF x1 > 0.7 THEN group='B';
ELSE IF x1 > 0.4 THEN group='C';
ELSE group='D';
OUTPUT;
END;
RUN;
14. The following data step reads SAS data set simu from the previous problem and create a SAS data new. Rewrite it to make it more effieient.
DATA new;
SET simu;
IF prop2=1 THEN X3=X1;
ELSE X3=x2;
RUN;
15. Rewrite the following data step to make it more efficient (the actual raw data file is hypothetical, it does not exist):
DATA a;
INPUT @1 (Q1-Q8) (4.1 +1) @51 class \$9.;
IF class in ('Junior','Senior');
RUN;
16. Listed below is an example data set that contains student names and (eight) test scores. The number of students ranges from 15 to 30. Assume that the student names are their first names which can be as long as 12 characters long containing no embedded space.
Toby 64       80       61       59       31       44       90       86
Barb 70       86       61       81       20       95       54       83
Jay  81       73       56       72       26       92       87       40
Joe  50       70       72       65       32      100       77       62
Bob  68       94       82       82       71       58       79       71
Kay  31       68       90       81       44       82       88       83
Adam 73       81       85       79       61       68      100       71
Carl 60       95       88       65       42       72      100       86
Dave 72       67       70       37       45       61       84       82
Guy  39       81       88       56       35       75      100       64
Levy 82       94       87       74       79       50       69       89
Eton 36       65       62       81       78       99       52       83
Iman 63       77       85       90       60       78       95       39
Pat  50       93       78       48       70       72       84       86
Sue  67       76       57       71       59       80       98       56
Fay  60       38       87       94       61       86       80       66
Mark 65       81       88       77       34       74       78       36
Hal  34       90       64       94       63       80       97       67

The passing scores on the tests are 55, 65, 60, 65, 50, 70, 75, and 60, respectively. Write a data step (DATA _NULL_) that produces data listing, adding a column of number of passes for each student on the right and a row of number of passes for each test at the bottom. DO NOT hard code the data step by assuming there are only 18 observations such as that shown above.
17. Information on each of four drugs was recorded for a patient's coming in for a clinic visit: a code of 0 indicates the drug is not being taken, a code of 1 indicates the drug is being started, a code of 2 indicates that drug is being discontinued. For each visit, a patient ID, visit date, and the status codes of the four drugs are recorded. Write a SAS program using only PROC SORT and a DATA step that will
1. count the number of patients who were ever on at least one of the four drugs (a patient was on a drug rx_1, say, if he or she was found with a code of 1 or 2 on at least one visit);
2. count the number of patients who were ever on each and every one of the four drugs;
3. count the number of patients who had never been on any drug.
Use PUT statement to write these numbers to a print or log file.
DATA drug;
INPUT ID : \$2. Visit : MMDDYY8. rx_1-rx_4;
DATALINES;
04 01/28/95 1 0 1 1
03 02/28/95 2 0 2 1
05 05/20/95 0 0 0 0
01 03/27/95 1 0 0 1
03 03/02/95 0 0 0 1
01 04/04/95 0 1 1 1
02 04/29/95 0 1 1 1
02 05/04/95 0 0 1 0
03 02/26/95 1 0 1 1
03 03/06/95 0 1 0 1
04 01/25/95 0 0 1 0
05 03/01/95 0 0 0 0
02 04/30/95 0 2 1 2
01 03/31/95 2 1 0 1
03 03/07/95 0 1 0 2
04 02/01/95 1 1 1 1
05 04/18/95 0 0 0 0
01 04/09/95 0 2 1 2
;

18. The data below are patient's ID, SBP, and DBP. SBP that is greater than 160 (mmHg) and DBP that is greater than 90 (mmHg) are considered to be hypertensive.
ID     SBP    DBP
518    132     66
569    153     78
647    165     88
345    129     78
908    131     76
771    124     85
189    120     80
117     79
535    201     98
607    143     85
832    156     92
795    122     74

Use two methods (each in a DATA _NULL_ step) to produce the output below (added ruler is not part of the output):
1. Use single trailing at (@) in PUT statement along with logical statements to place an asterisk (*) where needed.
2. Create character variables from SBP and DBP and use SUBSTR pseudo function to place an asterisk at the end of the string if needed.
----+----1----+----2----+----3----+----4----+----5----+
Listing of Systolic and Diastolic Blood Pressure
Pressure marked with an asterisk (*) are hypertensive
ID        Systolic            Diastolic
Blood Pressure      Blood Pressure
------------------------------------------------
518          132                  66
569          153                  78
647          165*                 88
345          129                  78
908          131                  76
771          124                  85
189          120                  80
117                  79
535          201*                 98*
607          143                  85
832          156                  92*
795          122                  74

19. For SAS data miss in example ProcFormat.sas, count the number of missing and non-missing values for numeric variables only. Use PROC MEANS, instead of PROC FORMAT, and PROC FREQ to accomplish this. First create an output data set then use PROC PRINT to print out the results in a listing similar to the following one:
Number of Missing and Non-missing Values

a          a          c          c          d          d
Total  Non-missing  Missing  Non-missing  Missing  Non-missing  Missing
5         3          2          3          2          4          1

20. Consider the grades data in Array.sas. There are four homeworks at 8 points each and six quizzes at 8 points each too.
1. Suppose the teacher decides to drop the lowest scores from the 10 homework-quizzes combined and then re-scale the score. That is, she decides to compute the total score (100% base) of each student by the formula
score = total of 9 homework-quizzes high scores × 10/9 + final
Grade the course for the teacher. Email program only.
2. What if she decides to drop two lowest scores from the homework-quizzes. Email the program only.
21. The following is a up-to-date record (as of 10/05/15 (05OCT2015)) of patients visiting a pediatric clinic who were diagnosed of a certain disease. (The SAS code is available as pediatric.sas).
DATA pediatric;
INPUT @1  ID          \$3.
@5  DateOfVisit MMDDYY8.
@14 Doctor      \$5.;
DATALINES;
----+----1----+----2
003 09/18/14 KLMN
004 11/12/14 ABCD
002 10/21/14 ABCD
003 10/27/14 ABCD
001 12/05/14 DOE
004 03/01/15 ABCD
002 01/13/15 KLMN
001 01/27/15 ABCD
004 07/31/15 DOE
003 05/31/15 DOE
001 05/11/15 UVWXY
;

A variable called Result is to be set according to the following:
Result =
Success, if patient returns more than 90 days later for a visit
Failure, if patient returns within 90 days for a visit
Unknown, if it's within 90 days to current day (05OCT2015) since the patient's last visit
The data set blame from this should appears as the following

DateOf
ID          Visit    Doctor    Result

001    12/05/2014    DOE       Failure
01/27/2015    ABCD      Success
05/11/2015    UVWXY     Success

002    10/21/2014    ABCD      Failure
01/13/2015    KLMN      Success

003    09/18/2014    KLMN      Failure
10/27/2014    ABCD      Success
05/31/2015    DOE       Success

004    11/12/2014    ABCD      Success
03/01/2015    ABCD      Success
07/31/2015    DOE       Unknown

Write a SAS code to obtain the SAS data set blame.
22. This problem will make use of the cars data set from 1983 ASA Data Exposition. The data set was packed using UNIX shar command. To help you on doing this homework, I've unshar'ed it. It results in three files (renamed):
• cars.txt which gives the main data set;
• carsNames.txt which gives the respective car names (in two parts: make, model name);
• the third file was removed as it contains the eight variable names which can be found in the data description.
There are noticeable data errors in carsNames.txt. However, DO NOT modify it with editor program and take it as-is. Also note that the missing value code used in the data file cars.txt is NA. Again, DO NOT modify it with editor program and take it as-is.
1. You are to write a SAS program which takes the main data file and reads into a SAS data set. E-mail program only.
2. Write a SAS program which read car names data file and break each car name into make and model name. Note that the abovementioned data errors cannot be detected by SAS log. However, use PROCs such as FREQ or TABULATE, you may be able to spot the errors. Now, use DATA step to correct data errors. E-mail program and turn in the table with the explanation of erroneous data.
3. Merge the two SAS data sets created above into a single SAS data set. Now use this SAS data set to produce some exploratory descriptive statistics (of your choice) in two tables (using PROC REPORT). E-mail program and turn in a two-page simple report (WORD file) of your findings (including at-most-one-page SAS output).
23. Redo Exercise 16 by
1. first creating a data named tests containing the data
2. then read the data in PROC IML by
proc iml;
reset ls=80 nocenter;
use tests;
read all var _NUM_ into A;
read all var _CHAR_ into student_names;
rnames = {"Students Passed"};
rnames = student_names // rnames;
cnames = ('Test 1':'Test 8') || {'Passed'};

where rnames and cnames are row names and column names of a matrix, named result, in which first n (=18) rows and first p (=8) columns (i.e., matrix A) contain the p test scores of these n students, column p+1 contains numbers of tests passed by students, row n+1 contains numbers of students passed the tests, and entry (n+1,p+1) contains total number of passes.
You're to continue the codes in PROC IML to construct and print this matrix result. The printed result should be similar to this:
Test Results
Test1 Test2 Test3 Test4 Test5 Test6 Test7 Test8 Passed

Toby               64    80    61    59    31    44    90    86      5
Barb               70    86    61    81    20    95    54    83      6
Jay                81    73    56    72    26    92    87    40      5
Joe                50    70    72    65    32   100    77    62      6
Bob                68    94    82    82    71    58    79    71      7
Kay                31    68    90    81    44    82    88    83      6
Adam               73    81    85    79    61    68   100    71      7
Carl               60    95    88    65    42    72   100    86      7
Dave               72    67    70    37    45    61    84    82      5
Guy                39    81    88    56    35    75   100    64      5
Levy               82    94    87    74    79    50    69    89      6
Eton               36    65    62    81    78    99    52    83      6
Iman               63    77    85    90    60    78    95    39      7
Pat                50    93    78    48    70    72    84    86      6
Sue                67    76    57    71    59    80    98    56      6
Fay                60    38    87    94    61    86    80    66      7
Mark               65    81    88    77    34    74    78    36      6
Hal                34    90    64    94    63    80    97    67      7
Students Passed    12    17    16    14     9    13    15    14    110

Hint: First use SHAPE function to create a matrix of the same size as matrix A in which each row contains the eight passing scores of the tests. Then use comparison, subscript reduction and concatenation operators, and MATTRIB statement to construct matrix result.
24. Redo Exercise 20 by using PROC IML.
25. Denote f(x;μ,σ) and F(x;μ,σ), respectively, the p.d.f. and the c.d.f. of N(μ,σ2). The p.d.f. and the c.d.f. of contaminated normal cN(μ,σ2,σc2,p) (a N(μ,σ2) contaminated in spread by a N(μ,σc2) with probability p) are
(1-p)f(x;μ,σ)+pf(x;μ,σc) and
(1-p)F(x;μ,σ)+pF(x;μ,σc),
respectively. Write two IML modules, one for p.d.f. and one for c.d.f., that implement these functions.
Hint: Use base SAS functions, PDF and CDF, in PROC IML.

28. The university payroll department is to project employee costs in wages, retirement (benefits), and medical (benefits) through future years based on assumed increases:

variable

Current
Level
Estimated
Annual
Increase
Wages
\$73,142,768
3.1%
Retirement
\$8,006,549
1.3%
Medical
\$2,993,170
9.4%
Use PROC IML to create a data matrix containing four columns: year and the three variables in the table above. Then create a SAS data set, named future, from this data matrix with named variables: Year, Wages, Retirement, and Medical (use proper case for the names). Initialize each of the three variables in the table to its current value. The SAS data set future should contain the projected costs for the next ten year (that is, from year 2014 to 2023 assuming the current year is 2013). As an example, the cost for wages in the year 2014 will be 1.031 times the wages expense of this year.
29. Use existing SAS data set ozone0 created by the following DATA step.
FILENAME oz URL
'http://www.stat.wmich.edu/wang/680/data';
DATA ozone0;
INFILE oz(ozone.txt);
INPUT x1-x10;
RUN;

This data set contains pairs of ozone levels of two cities A and B in the past 200 days. Each row contains 5 such pairs. Thus, x1, x3, x5, x7, and x9 are the ozone readings in city A in five consecutive days; x2, x4, x6, x8, and x10 are the ozone readings in city B in five consecutive days.
1. From this SAS data set, create a new SAS data set, named ozone, with two variables A and B containing ozone levels of the two cities in 200 rows.
2. Use PROC GPLOT to create quantile-quantile plot in two ways:
1. use only complete cases (i.e., observations in which ozone levels of both cities are non-missing): plot quantiles (order statistics) of ozone levels of B versus those of ozone levels of A;
2. for each city, use non-missing ozone readings to calculate 5ith percentiles, i=0,1,...,20. (Note: 0th percentile=MINIMUM, 100th percentile=MAXIMUM).
Assume the order statistics of non-missing values are
X(1), X(2), ..., X(n)
To obtain 100pth percentile, first compute
g = 0.5 + n×p
and express g by
g = k + f, where 0≤f<1.
That is, k is the integral portion of g, and f is the fractional portion of g.
The 100pth percentile is then defined by
f × X(k+1) + (1-f) × X(k)
Then use PROC GPLOT to plot pairs of percentiles of the two cities. (Hint: Use PROC IML. In IML, use SORT routine to get order statistics and then compute percentiles.)
30. Write a SAS data step to read the following data. Fields are student name, id, weight, height, gender, and GPA, respectively. Note that variables weight, height, and GPA are numeric, the rest are character.
Jack 12-3456 192    72 M 3.3
Gill    23-4567 131 69 F 3.5
Jay 34-5678       . 70 M 2.9
Elizabeth 45-6789 115 68 F .
Joanne 56-7890 121 70 F  3.8
Peggy 67-8901 135 67   F 3.6
Washington 78-9012 213 73 M 3.8
Jim   89-0123   200 70 M 3.4

Hint: available tools: one of
1. colon informat modifier (:) in INPUT statement
2. LENGTH statement
3. INFORMAT statement
4. ATTRIB statement
31. A car survey was conducted and the data were recorded using two lines of data for each subject.
Line
Variable
Name
Type
Start
Column
End
Column
Description
1
Subject
Char
1
3
subject ID

DoB
MM/DD/YY
4
11
date of birth

State
Char
25
26
State where living

ZipCode
Char
40
44
zip code

2
Subject_
Char
1
3
subject ID*

Cars
Numeric
4
4
number of cars

Car1
Char
11
20
make of car 1

Car2
Char
21
30
make of car 2

Car3
Char
31
40
make of car 3
Note*: Subject_ is supposed to be identical to Subject. However, there may exist subjects so that they differ. Below are example data:
----+----1----+----2----+----3----+----4----+----5
00109/07/49             PA             19056
0012      Chrysler  Ford
00201/31/57             NY             11518
0021      Honda
00311/21/40             IN             46403
0132      Chevy     GMC
004                     MI             49077
0043      Toyota    Toyota    Nissan
(more subject data lines)

Note that, unused fields at the end are truncated. Maximum number of cars per subject is 3. The field Subject_ is used to check for data correctness. Include only Subject in the SAS data set. However, do a raw data check, write a message into SAS log when encountering a raw data error. Also add at least two more subjects (i.e., at least four more raw data lines) to the above data in your program.
Hint: Tools used may include:
1. INPUT statement with line pointer (# or /), column pointer, informats
2. TRUNCOVER option in INFILE statement
3. DROP or KEEP statement; or DROP= or KEEP= data set option
32. In the SAS data set below, a value of NA or na was used in place of a missing value. Include the SAS code below that create this data set, but write a new SAS data step that creates a SAS data set new where these values are converted to a SAS character missing value. In addition, your SAS log should not contain 'ERROR' and or 'WARNING'.
DATA old;
LENGTH A B C D \$ 2;
INPUT ID \$ A \$ B \$ C \$ D \$ X Y Z;
DATALINES;
01 Y N Y Y 3 9 12
02 na N Y N 5 2 8
03 NA na na NA 4 1 7
;

Hint: Tools used may include:
1. SET statement
2. PROC FORMAT with INVALUE statements
3. ARRAY statement
4. UPCASE function
5. Index DO loop
6. DIM function (optional)
33. Use the SAS data set clinical created in Problem 6 (but use the raw data file clinical2.dat instead) and create the following table using PROC TABULATE.
Simple Statistics Broken Down By Two Variables

---------------------------------------------------------------------------
|                                  | Heart Rate |Systolic BP |Diastolic BP|
|                                  |------------+------------+------------|
|                                  |Number|Mean |Number|Mean |Number|Mean |
|----------------------------------+------+-----+------+-----+------+-----|
|Taking Vitamin? |Gender           |      |     |      |     |      |     |
|----------------+-----------------|      |     |      |     |      |     |
|No              |Female           |    17| 88.2|    17|205.1|    17| 86.1|
|                |-----------------+------+-----+------+-----+------+-----|
|                |Male             |    12| 94.4|    12|201.8|    12| 92.3|
|                |-----------------+------+-----+------+-----+------+-----|
|                |Both             |    29| 90.8|    29|203.7|    29| 88.7|
|----------------+-----------------+------+-----+------+-----+------+-----|
|Yes             |Gender           |      |     |      |     |      |     |
|                |-----------------|      |     |      |     |      |     |
|                |Female           |     3| 89.7|     3|196.3|     3| 98.3|
|                |-----------------+------+-----+------+-----+------+-----|
|                |Male             |     3| 84.7|     3|196.3|     3| 88.0|
|                |-----------------+------+-----+------+-----+------+-----|
|                |Both             |     6| 87.2|     6|196.3|     6| 93.2|
|----------------+-----------------+------+-----+------+-----+------+-----|
|Both            |Gender           |      |     |      |     |      |     |
|                |-----------------|      |     |      |     |      |     |
|                |Female           |    20| 88.5|    20|203.8|    20| 87.9|
|                |-----------------+------+-----+------+-----+------+-----|
|                |Male             |    15| 92.5|    15|200.7|    15| 91.5|
|                |-----------------+------+-----+------+-----+------+-----|
|                |Both             |    35| 90.2|    35|202.5|    35| 89.4|
---------------------------------------------------------------------------

Hint: Tools used may include (in PROC TABULATE):
1. CLASS and VAR statements
2. KEYLABEL and LABEL statements
3. TABLE statement with the options below:
PRINTMISS and RTSPACE= options
34. Redo parts (a), (b), and (c) in problem 12, but use only arrays, DO loops, functions RANK and BYTE, logical comparisons.
35. Redo problem 28 using SAS macro programming with the following assumption:
Having the data:
DATA configuration;
INPUT variable : \$10. current_level : DOLLAR11.
est_incr : PERCENT4.;
CARDS;
Wages \$73,142,768 3.1%
Retirement \$8,006,549 1.3%
Medical \$2,993,170 9.4%
;

From this data set, create three macro variables: varlist, init, and rate using two methods: 1) PROC SQL macro facility and 2) SAS DATA step macro interfaces. After these, use the following codes to create desired data set (you may modify the codes except for the two ARRAY statements):
%LET years = 10;
%LET this_year = 2013;

DATA outlook&years;
ARRAY item[3] &varlist (&init);
ARRAY rate[3] _TEMPORARY_ (&rate);
DO i = 1 TO &years;
year = &this_year + i;
DO j = 1 TO 3;
item[j] = item[j] * (1+rate[j])**i;
END;
OUTPUT;
END;
DROP i j;
RUN;

PROC PRINT DATA=outlook&years;
ID year;
FORMAT &varlist COMMA14.2;
RUN;

The output from the PRINT procedure should look like this:
year             Wages        Retirement           Medical

2014     75,410,193.81      8,110,634.14      3,274,527.98
2015     80,158,095.02      8,322,881.32      3,919,072.97
2016     87,846,281.63      8,651,711.68      5,131,393.45
2017     99,256,291.46      9,110,449.80      7,350,292.19
2018    115,624,900.21      9,718,227.16     11,518,375.00
2019    138,868,377.90     10,501,315.79     19,746,728.07
2020    171,954,687.65     11,495,022.77     37,035,345.43
2021    219,524,674.50     12,746,337.07     75,989,742.98
2022    288,942,435.69     14,317,606.00    170,573,230.08
2023    392,101,029.21     16,291,642.10    418,874,665.06

36. Redo the cdf (cumulative distribution function) part of problem 25 by writing a macro with the following parameters:
1. mu (i.e., the mean μ)
2. sigma (i.e., the uncontaminated standard deviation σ)
3. cont_sigma (i.e., the contaminated standard deviation σc)
4. p (i.e., the probability of contamination)
5. data=&SYSLAST (input data set that contains a variable of x values to calculate cdf's)
6. var= (to select a single variable of x values to calculate cdf's)
7. F=cdf (data set variable name containing cdf's)
8. out= (output data set, this should contain at least 2 columns, one for x values, one for respective cdf's)
Run a test on an input data set and show the results.
37. Use the SAS data created by the SAS codes below (also available for download in admit.sas) and create, in two methods, a new SAS data set Containing 6 observations and 5 columns: one observation per department and variables (in this order) Department, MalesAdmitted, MalesRejected, FemalesAdmitted, and FemalesRejected.
• Use a single data step that reads the SAS data set UCBAdmissions and produces the required new SAS data set (name it new1). Suggested tools: MOD function, _N_ automatic variable, DROP statement, OUTPUT statement, and SELECT-WHEN construct. This method could be data-dependent. That is, assume full knowledge of the data --- in this case, the column Count goes through 6 cycles of Male-Admitted, Male-Rejected, Female-Admitted, Female-Rejected exactly for the 6 departments.
• First use a data step to create a SAS data set (name it new2) containing 3 columns: Department, Count, and what in which the last column, a character variable, contains four possible values 'MalesAdmitted', 'MalesRejected', 'FemalesAdmitted' and 'FemalesRejected'. Then use this newly created data set to produce the final data set (name it new3) by using PROC TRANSPOSE. This method is NOT data-dependent.
LENGTH Department \$ 1;
DO Department = 'A', 'B', 'C', 'D', 'E', 'F';
DO Gender = 'Male', 'Female';
INPUT Count @@;
OUTPUT;
END;
END;
END;
DATALINES;
512 313 89 19   353 207 17 8
120 205 202 391    138 279 131 244
53 138 94 299   22 351 24 317
;

/*
PROC PRINT DATA=UCBAdmissions LABEL SPLIT='*' NOOBS;
LABEL Count='Number of*Students';
RUN;
*/

When either new1 or new3 is printed, it should produce output like:
Males       Males      Females     Females

A            512         313          89          19
B            353         207          17           8
C            120         205         202         391
D            138         279         131         244
E             53         138          94         299
F             22         351          24         317

The data set UCBAdmissions is real and is famous for its use as an example to demonstrate Simpson's Paradox. This is a data collected in the University of California-Berkeley. Gender-bias toward women applicants was suspected when the data was presented.
38. Use the UCBAdmissions data set (see Exercise #37) and produce a 'flatten' table like
Dept   A   B   C   D   E   F
Admitted Male        512 353 120 138  53  22
Female       89  17 202 131  94  24
Rejected Male        313 207 205 279 138 351
Female       19   8 391 244 299 317

Suggested tools: either PROC FREQ or PROC TABULATE
39. Use the UCBAdmissions data set (see Exercise #37) to produce an overall (university-wide) 2×2 table of Admit×Gender and the 6 individual (department-wide) 2×2 table. Perform Pearson's chisquare test and the Fisher's exact test on each table. Discuss the outcomes. Tool: PROC FREQ.
40. The SAS data set VS created by TYPvs.sas contains the vital signs records of subjects from a clinical study detailed in the study protocol TYP-001. The data were collected from 3 clinical sites (001, 002, 003) with various numbers of subjects. After subjects entered into the study, their vital signs were taken on the baseline vist, on the day of treatment (approximately within 2 weeks of baseline visit), and at the day of 3 months (approximately) after treatment. Baseline systolic blood pressure (SBP) is normal if 100 ≤ SBP ≤ 140 mmHg; and SBP is elevated if SBP > 140 (not classified for other range). Baseline diastolic blood pressure (DBP) is normal if 60 ≤ DBP ≤ 90 mmHg; and DBP is elevated if DBP > 90 (not classified for other range). For subsequent visits, each vital sign value was monitored to detect if it is of Potential Clinical Significance (PCS). HR is of PCS if HR < 40 or if HR > 110. When baseline SBP is normal, an SBP is of PCS if SBP < 85 or if SBP > 160; when baseline SBP is elevated, an SBP is of PCS if SBP > 20 change from baseline or if SBP > 180. When baseline DBP is normal, a DBP is of PCS if DBP < 45 or if DBP > 100; when baseline DBP is elevated, a DBP is of PCS if DBP > 10 change from baseline or if DBP > 110.
Use this SAS data and create a new SAS data set, named new_VS so that:
• In addition to existing variables, create
1. VISDY1: baseline day = visdt1 - visdt2
2. VISDY3: 3-month post treatment visit day = visdt3 - visdt2 + 1 ( treatment date is the day 1 for the patient in study)
3. BSSBP: baseline SBP grouping with 3 possible values 'normal', 'elevated' and " "
4. BSDBP: baseline DBP grouping with 3 possible values 'normal', 'elevated' and " "
5. PCSSBP2: PCS for SBP on day of treatment with 3 possible values of 'yes', 'no', " "
6. PCSDBP2: PCS for DBP on day of treatment with 3 possible values of 'yes', 'no', " "
7. PCSHR2 PCS for HR on day of treatment with 2 possible values of 'yes', 'no'
8. PCSSBP3: PCS for SBP on day of 3-month-post-treatment with 3 possible values of 'yes', 'no', " "
9. PCSDBP3: PCS for DBP on day of 3-month-post-treatment with 3 possible values of 'yes', 'no', " "
10. PCSHR3: PCS for HR on day of 3-month-post-treatment with 2 possible values of 'yes', 'no'
• The variable patient was constructed by combining treatment abbreviation (TYP), STUDYID (001, etc.), SITEID (001, 002, 003), and SUBJID (001, 002, ...) with a '−' separating the parts. Extract from the variable and create STUDYID, SITEID, and SUBJID.
• Rename patient as USUBJID (Unique Subject Identifier).
• Data are sorted by USUBJID.
41. Read the SAS data vs created in problem 40, create a new data set, named new_vs2, which turns the subject-level sas data (a.k.a., wide format) into a long format containing these variables:
1. usubjid (renamed from patient)
2. studyid, siteid, subjid (as in problem 40)
3. vistnum (numeric): 1 = 'Baseline Visit' 2 = 'Treatment Date' 3 = '3 Months Post Treatment'
4. visdt: dates of the visits
5. Ady: Analysis Day = (visit date) - (treatment date) if visit date < treatment date, and = (visit date) - (treatment date) + 1 if visit date ≥ treatment date
6. sbp: Systolic Blood Pressure (at respective visit dates)
7. dbp: Diastolic Blood Pressure (at respective visit dates)
8. hr: heart rate (at respective visit dates)
9. pcssbp (character): Potential Clinical Significance SBP, = " " at baseline (since it's not applicable) = "yes" or "no" at treatment or 3-month post treatment according to the definition as laid out in problem 40.
10. pcsdbp (character): Potential Clinical Significance DBP similar to pcssbp.
11. pcshr (character): Potential Clinical Significance hr similar to pcssbp.
12. sbpbl: SBP baseline values (each subject has a value repeated over all records of that subject)
13. dbpbl: DBP baseline values
14. hrbl: HR baseline values
15. chgsbp: SBP change from baseline (=0 at baseline)
16. chgdbp: DBP change from baseline (=0 at baseline)
17. chghr: HR change from baseline (=0 at baseline)
18. BLFL (character): Baseline Flag. = "Y" only at baseline records, and = " " (blank) for all other records.
The new SAS data set should be sorted by usubjid and visitnum. You can do this problem either by using PROC TRANSPOSE or by using arrays, output statement in a data step (hint: also use vlabel function).
42. Use the data, new_vs2, created in problem 41, create a new data set, named advs (Vital Sign Analysis Data Set), that stacks the vital sign variables and contains these variables:
1. usubjid
2. studyid
3. siteid
4. subjid
5. vistnum
6. visdt
8. PARAM (length-40 character): the labels of variables SBP, DBP, and HR.
9. PARAMCD (length-8 character): the variable names of SBP, DBP, and HR.
10. AVAL: values of the variables SBP, DBP, and HR.
11. Baseline: variable of baseline values.
12. CHG: Change from Baseline values.
13. PCS (character): Potential Clinical Significance.
14. BLFL: Baseline record flag.
You can use tools similar to those in problem 41. In addition, functions such as vname can be used.
43. The SAS data set HEMA created by TYPhema.sas contains the (unsorted) hematology records of subjects from the same clinical study as described in problem 40. The data set is in fairly similar format as the data set new_vs2 in problem 41 with exceptions: the hematology data were collected from various laboratories and consequently are of various units. Create an intermediate data hema1 by converting the laboratory reporting units into Systeme International (SI) units using the table below (SI = [Conversion Factor] × Reporting Unit):
Hematology Reporting Unit SI Unit Conversion Factor
RBC (all reporting units) TI/L 1
HEMOGLOBIN mmol/L or mmolFe/L mmol/L 1
g% or g/dl mmol/L 0.155
HEMATOCRIT fraction or L/L L/L 1
% L/L 0.01
PLATELETS 10^9/L (109/L) or 10^3/uL (103/μL) GI/L 1
mm3 (mm3) GI/L 0.001
WBC 10^9/L (109/L) GI/L 1
mm3 (mm3) or uL (μL) GI/L 0.001
The converted data hema1 should contain the hematology data in SI units. Merge this hematology data with the VISDT and ADY variables from new_vs2 and then create a new data set adht (Hematology Analysis Data Set) similar to advs in problem 42. The PCS variable is now defined (according to SI units) at all time points with four possible values "no", "low", "high", or " " (if the measurement is missing) according to the following rules:
Hematology Low High No
RBC ≤3.5   >3.5
HEMOGLOBIN Female≤95; Male≤115 >180 otherwise
HEMOGLOBIN, change from baseline   >25 ≤25
HEMATOCRIT Female≤0.32; Male≤0.37 >0.54 otherwise
HEMATOCRIT, change from baseline   >0.075 ≤0.075
PLATELETS <100 >550 otherwise
WBC <3 >20
For clarity, create a separate PCS variable named PCSCHG just for the PCS-change from baseline for HEMOGLOBIN and HEMATOCRIT variables. VISDT, VISITNUM, and ADY should also be included. For the data, the final adht data set should have 9×3×5 = 135 observations. That is, for each combination of USUBJID and VISTDT, there are five observations (the five hematology measurements).
44. Consider the data work.wt created by the SAS codes below:
data work.wt;
input patient \$ visdt : mmddyy. weight;
format visdt date9.;
datalines;
05 09/09/2003 121
04 09/13/2003 89
02 01/31/2004 181
03 05/09/2004 149
04 07/11/2003 .
03 12/18/2003 191
05 10/27/2003 .
01 01/23/2004 83
05 07/15/2004  89
01 10/30/2003 69
02 08/10/2003 118
02 10/21/2003 123
05 01/18/2004 151
04 02/27/2005 133
03 11/01/2003 120
03 02/15/2004 185
01 09/27/2003 103
04 12/28/2004 151
;

Patients' weights are closely monitored according to the following classes:
• class 1: 70(lb) or less
• class 2: (70,90]
• class 3: (90,120]
• class 4: (120,150]
• class 5: (150,180]
• class 6: more than 180
For each patient, at all the visits, a weight change is classified as significance ("yes") if there is at least a change of 2 classes from last available class. For example, change from 1 to 3, change from 5 to 2, change from 3 to 6, etc., are each considered as significance. If there is no significance change then it is marked as "no". However, if no previously available record of weight class, then mark it as " ". Create a new data set wt2 containing this additional variable, named Signif. (Note: as a consequence, the value of this new variable in the first record of each patient is always " " since there is no previously available weight record for this patient.)
45. Predict what will turn out in SAS log when you run the following SAS codes:
data _null_;
input a;
put a= _n_=;
do _n_ = 1 to 3;
put _n_=;
end;
put _n_=;
datalines;
1
2
3
;

Run the SAS codes. Compare your prediction with the results. Explain why.
46. Redo Exercise 18 with a data _null_ step but with picture formats for printing SBP and DBP to produce the same output.
47. The HARVEST Trial Data (description in HARVEST Trial in OzDASL contain subject-level information (that is, of wide format). Construct a SAS data set, named HARVEST. Use sensible labels for all the variables. For example, 'Smoking Status at Baseline' for SMOKE and 'Sport Activity at Baseline' for SPORT and 'Age at Baseline' for AGEB. Read all columns as of numeric type. Precede the data step with a PROC FORMAT to define custom formats for the following variables and then associate these formats with the corresponding variables in the data step:
• SMOKE
• SPORT
• ENDPOINT
• MALE
A snapshot of the data view (with the 4 variables rearranged to appear the the very beginning) should look like:
48. Construct, from the HARVEST data set in the previous problem, a new data set that contains the following columns (in long format) by stacking up columns:
• SUBJECT, character type, from 0001 to 1100
• SMOKE, numeric type, with descriptive format (see data description)
• SPORT, numeric type, with descriptive format (see data description)
• SBP, numeric type, by stacking up SBPCB, SBPAB, SBPA3, SBPC5, SBPA5, SBPCE, and SBPAE
• DBP, numeric type, by stacking up DBPCB, DBPAB, DBPA3, DBPC5, DBPA5, DBPCE, and DBPAE
• HR, numeric type, by stacking up HRCB, HRAB, HRA3, HRC5, HRA5, HRCE, and HRAE
• TYPE (examination type), character type, with two possible values 'C' and 'A' for clinical examination or ambulatory (home monitoring) corresponding to proper data chunks; in addition, create descriptive format to associate with the single-character values
• VISIT, character type, with four possible values 'B', '3', '5' and 'E' corresponding to proper data chunks; in addition, create descriptive format to associate with the single-character values
• VISITNUM, numeric type, the numeric version of Visit with 1, 2, 3, and 4 corresponding to Baseline, 3-Month, 5-Year, and Endpoint, respectively
• AGE, numeric type, with each subject's baseline age appearing at each of the 7 data chunks
• BMI, numeric type, with each subject's baseline BMI value appearing at each of the 7 data chunks
• ENDPOINT, numeric type, in addition, create descriptive format to associate with the variable
• TIME, numeric type
• GENDER, numeric type, in addition, create descriptive format to associate with the variable
• The new data set, named HARVEST2 should have 7700 observations. Sort the data set by SUBJECT, TYPE, and VISITNUM. In addition to the above requirement, use sensible labels for all the variables. For example, 'Smoking Status at Baseline' for SMOKE and 'Sport Activity at Baseline' for SPORT and 'Age at Baseline' for AGE, etc.
Note that, for each subject, the values for SMOKE, SPORT, AGE, BMI, ENDPOINT, TIME, and GENDER should be copied to all data chunks.
A snapshot of the data view for the first 2 subjects should look like:
49. The data set HARVEST2 contains observations having all missing vital signs (SBP, DBP, and HR) in that the visit did not actually take place. Delete these observations and obtain a new data set HARVEST3. It should have 3786 observations.
50. Using data steps and some procedure steps and create a new data set HARVEST4 from HARVEST3. The new data set is in CDISC format in which the vital signs are stack-upped. In addition to existing variables SUBJECT, TYPE, VISIT, VISITNUM, SMOKE, SPORT, AGE, BMI, ENDPOINT, TIME, and GENDER, add new variables
• Param: character labels of vital signs--Systolic Blood Pressure, Diastolic Blood Pressure, and Heart Rate
• Paramcd: character names of vital signs--SBP, DBP, and HR
• Aval: corresponding vital sign values (stack-up of SBP, DBP, and HR)
• Base: vital sign baseline value
• Chg: change from baseline = Aval − Base
The new data set should have 11,358 observations and is sorted by Subject, Paramcd, Type, Visitnum. A snapshot of the data view (with variable order rearranged) for the first 2 subjects should look like:
51. Use the data set HARVEST3 and create a summary report table t_12_2_1vs.doc using the format as in the table mockup mt_12_2_1vs.doc. The implementation notes at the bottom of the mockup should not be included in Table 12.2.1.
Tools used: PROC MEANS with noprint option, OUTPUT statement, WAYS statement, VAR and CLASS statement. The use of data set options WHERE= and DROP= could be useful.
52. Repeat the previous problem by using the data set HARVEST4 instead.
53. Use the Depression Data (description in Depression Before and After an Earthquake) and create a summary report of the scores at the 5 time points. The statistics to be included in the report at each time point should be arranged in rows like
n
Mean (SD)
Q1/Median/Q3
(Min, Max)

54. Read the Turtle Data (data description in Plasma Protein of Fasting Turtles) and create a SAS data set named TURTLES.
55. Note that the TURTLES data set is a subject level data set. That is, the data set is in wide format. Use this SAS data set and create a new SAS data set named TURTLES2 by stacking up Fed, Fasted10, and Fasted20. Refer to the HARVEST3, the resulted data set should have a similar structure having variables Subject, Sex, Protein (the stack-upped measurements of plasma protein), and Time (having 3 values: Baseline, Fasting 10 Days, and Fasting 20 Days). In addition, create variable BASE which is populated with the respective baseline plasma protein values for the subjects. Also create CHG, Change from Baseline having non-missing values at non-baseline time and defined by Protein − BASE.
56. Use TURTLES2 data set and create a summary report for plasma protein at the 3 visits for either gender and for overall. At the two non-baseline time, also create summary statistics for CHG.
57. Use Bassin Anticipatory Timing (data description in Bassin Anticipatory Timing) and repeat the steps similar to the TURTLES data above.
58. Read Balance Data Set (in long format, and data description in Effect of Surface and Vision on Balance). Create a SAS data set named BALANCE
59. Use the data in the previous problem and create summary reports of your choice. Tools includes, but not limit to, PROC FREQ, PROC MEANS, PROC UNIVARIATE, and PROC REPORT. You may use ODS to create specialized summary data sets/reports.
60. Repeat the previous 2 problems, but use Lower Back Pain Data (data description in Runners with Low Back Pain) instead.
61. Use the data set below and create a new data set that contains only the values of x that each appears only once in the data set rep.
data rep;
input x @@;
datalines;
26 37  5 26 10 26  2 10  2 17 26  2 50
;

The new data set should contain 5, 17, 37, and 50 for variable x.
Hint: Tools used may include proc freq with an OUT= option for the TABLES statement, and a data set option WHERE=.
62. A study was conducted by XYZ Pharmaceutical Inc. in which subjects' medical events were logged in event.csv. Each occurrence of an event was recorded (in this order):
• USUBJID: length-10 character variable for "Subject ID"
• Event_Seq: numeric variable recorded "Event Sequence". For a subject having 3 events, there are 3 records with Event_Seq recorded as 1, 2, and 3 for the events.
• estd: length-2 character variable recording day event started. "UN", "un", "Un" or " " (missing) indicate missing start day.
• estm: length-3 character variable recording 3-letter abbreviation of "Start Month" in upcase, lowcase, or propcase ("JAN","jan","Jan" for example). "UNK", "unk", "Unk" or " " (missing) indicate missing start month.
• esty: length-4 character variable recording 4-digit start year. Its value could be " " (missing).
• eend: length-2 character variable recording day event ended. "UN", "un", "Un" or " " (missing) indicate missing end day.
• eenm: length-3 character variable recording 3-letter abbreviation of end month in upcase, lowcase, or propcase ("JAN","jan","Jan" for example). "UNK", "unk", "Unk" or " " (missing) indicate missing end month.
• eeny: length-4 character variable recording 4-digit end year. Its value could be " " (missing).
Use a data step to read this data file into a SAS data set (Hint: use DLM= and DSD options in your INFILE statement) and then sort it by USUBJID and Event_Seq. Name the sorted data set ev_raw. Now, use this data set to create another data set, named event_log, containing the following variables:
• USUBJID: length-10 character variable, labeled "Subject ID"
• Event_Seq: numeric variable labeled "Event Sequence"
• EVSTDT: numeric date variable labeled "Event Start Date" constructed out of the variables estd, estm, and esty from data set ev_raw.
• EVENDT: numeric date variable labeled "Event End Date" constructed out of the variables eend, eenm, and eeny from data set ev_raw.
• EVSTDTC: length-10 character variable labeled "Event Start Date (Char)" containing character start date of the form yyyy-mm-dd for non-missing EVSTDT; yyyy-mm if estd is missing and estm and esty are non-missing; yyyy if estm is missing and esty is non-missing; missing (" ") if esty is missing. Example 1: Subject "XYZ-06-010" at Event_Seq=2 has estd="UN", estm="feb" and esty="2013". The value of EVSTDTC is then "2013-02".
Example 2: Subject "XYZ-25-009" at Event_Seq=3 has estd="01", estm="unk" and esty="2013". The value of EVSTDTC is then "2013".
Example 3: Subject "XYZ-11-008" at Event_Seq=1 has estd="19", estm="JAN" and esty="". The value of EVSTDTC is then " " (missing).
• EVENDTC: length-10 character variable labeled "Event End Date (Char)" containing character end date of the form yyyy-mm-dd for non-missing EVENDT; yyyy-mm if eend is missing and eenm and eeny are non-missing; yyyy if eenm is missing and eeny is non-missing.
Hint: Use input and cats functions to create the numeric version of date values. For the character version of date values may include catx, input, cats, upcase, coalescec, ifc, and strip functions, if-else if construct or select-end construct, proc format with invalue statement for the conversion of 3-letter month abbreviation to 2-digit 'numeric' month name, use of SAS built-in monthname format. Note that you do not have to use them all and that you may have other tools of your choice to code the data step.
63. Repeat the previous problem, but create another data set, in addition to event_log, named ev_error by using the data step statement
data event_log ev_error;
The data set event_log is as in the previous problem. However, the data set ev_error will capture every erroneous record of ev_raw in which at least one of the start date and end date is missing. The data set ev_error contains the same variables as ev_raw.
Hint: In addition to the hint in the previous problem, use output statement under if-else if construct and keep= data set option.
64. In a study conducted by XYZ Pharmaceutical Inc., subjects' visits are recorded in allvisits.csv. Each subject has 6 visits: Baseline, 2-Week Test, 3-Month Checkup, 6-Month Checkup, 1-Year Checkup, and 2-Year Checkup. In addition to these visits, each subject also made variable uncheduled visit with respect to medical needs. Read the data into a SAS data set, named allvisits. Then create a new SAS data set, named Visits, from this data set in which a new variable, named visitn, contains imputed visit numbers: as is for non-missing visitnum, and visit number such as 2.01, 2.02, 3.01, 3.02, 3.03, etc. according to the visit dates, visitdt.
65. Two data sets from a clinical study are considered.
Subject-level data set _sl (_sl.sas7bdat) contains Subject ID (usubjid), Treatment Number (trtn), Treatment (trt), Treatment Start Date (trtstdt), and Treatment End Date (trtendt).
Adverse event data set allaes (allaes.sas7bdat) contains System Organ Class (SOC), Preferred Term (PT), Subject ID (usubjid), Treatment Number (trtn), Treatment (trt), Treatment Start Date (trtstdt), Treatment End Date (trtendt), AE Start Date (aestdt), and AE End Date (aeendt).
These two SAS data sets were stored in a folder, say T:/680Fall14/egs/sasdata. Using SAS data set allaes to produce a summary report of the numbers of adverse events by System Organ Class/Preferred Term. An example (aetest.sas) run on a subset data set aetest (aetest.sas7bdat) produced the listing output (aetest.lst) where the report was produced from the data set final (final.sas7bdat). The SAS codes that produced the final data set was masked. Fill in SAS codes in example (aetest.sas) to test out and then apply it to the adverse event data set allaes (allaes.sas7bdat) to get the requested report. Note that the table was arranged in the descending order of the numbers of subjects within SOCs disregarding the treatment they received (variable ord in final data set) and then within each SOC, in descending order of numbers of subjects within PTs disregarding the treatment they received (var ord2 in final data set).
66. Data set chrs was generated by the SAS codes below (also found in chrs.sas):
data chrs;
call streaminit("27OCT2015"d);
length a \$ 30 b \$ 50 c \$ 40 d group \$ 1;
do case = 1 to 100;
group = choosec(rand('Table',.3,.3,.3,.1),"A","B","C","D");
d = byte(ceil(rand("Uniform")*25+65)+32*rand("Bernoulli",0.5));
a = repeat(d, ceil(rand("Normal",15,2)));
d = byte(ceil(rand("Uniform")*25+65)+32*rand("Bernoulli",0.5));
b = repeat(d, ceil(rand("Normal",25,3)));
d = byte(ceil(rand("Uniform")*25+65)+32*rand("Bernoulli",0.5));
c = repeat(d, ceil(rand("Normal",20,2.7)));
output;
end;
drop d;
run;

Create a sas data set named max_len using two methods:
• Method one: use set statement with by group processing (need to sort data appropriately) and with first(.byvar) and last(.byvar).
• Method two: first create an intermediate data set that contains a corresponding character count variable for each of variables a, b, c. For example, suppose the value of a is "aaaaa" then the value of the count variable (named alen) is 5. Then run PROC MEANS on these count variables with OUTPUT statement to create the desired data set.
The output data set should have 4 variables: group, a, b, and c where a, b, and c are numeric containing the maximum lengths by group. The image below shows the data view of max_len:

Hint: use length function to count the number of characters in a text string.

2015-11-03