Introduction

Welcome to Edition 6 of Making SAS Accessible to Everyone. After exploring data structure and summaries in Edition 5, we now shift our focus to data preparation and validation.

These steps are crucial before any statistical analysis, modeling, or reporting can begin. Whether you’re cleaning raw inputs, checking consistency, or creating new variables, SAS offers a comprehensive set of tools to get your dataset analysis-ready.


1. Creating and Filtering Datasets

1.1 Copying a Dataset

Use the DATA and SET statements to create a new table from an existing dataset.

DATA work.new_table;
    SET work.original_table;
RUN;

This is useful for preserving the original data while applying transformations or filtering.

1.2 Filtering Rows

You can filter rows directly during dataset creation using WHERE:

DATA work.filtered_data;
    SET work.original_data;
    WHERE Age >= 18;
RUN;

Alternatively, use IF after reading in each observation:

DATA work.filtered_data;
    SET work.original_data;
    IF Age >= 18;
RUN;

Note: WHERE is processed during input, making it more efficient for large datasets.


2. Keeping or Dropping Variables

To manage which variables appear in your dataset, use the KEEP or DROP options:

DATA work.cleaned_data (DROP=SSN);
    SET work.original_data;
RUN;

DATA work.reduced_data;
    SET work.original_data (KEEP=Name Age Gender);
RUN;

These allow you to reduce dataset size and avoid storing unnecessary information.


3. Formatting Variables

SAS allows you to format how data values are displayed. For example:

DATA work.formatted_data;
    SET work.original_data;
    FORMAT BirthDate DATE9. Salary DOLLAR8.2;
RUN;
  • DATE9.: Displays dates like 01JAN2023
  • DOLLAR8.2: Formats numbers as monetary values (e.g., $1,234.56)

4. Creating New Variables

Arithmetic Computations

Create new numeric variables from existing ones:

DATA work.bmi_data;
    SET work.original_data;
    BMI = Weight / ((Height/100)**2);
RUN;

Character Creation

Use conditional logic and functions to create descriptive labels:

DATA work.labels;
    SET work.original_data;
    LENGTH Category $15;
    IF Score >= 90 THEN Category = "Excellent";
    ELSE IF Score >= 75 THEN Category = "Good";
    ELSE Category = "Needs Review";
RUN;

5. Useful SAS Functions

Aggregation

These handle missing values gracefully:

Total = SUM(score1, score2, score3);
Average = MEAN(score1, score2, score3);

Character Handling

Manipulate strings easily:

FullName = CATS(FirstName, " ", LastName);
Initials = SUBSTR(FirstName,1,1) || SUBSTR(LastName,1,1);

Date Functions

Work with and compute dates:

TodayDate = TODAY();
Age = YRDIF(BirthDate, TODAY(), 'AGE');

6. Validating Data

Conditional Checks

Create flags for unusual or invalid values:

IF Income < 0 THEN Flag_Invalid = 1;
IF NMISS(Height, Weight) > 0 THEN Flag_Missing = 1;

Multi-Condition Validation

Handle complex rules with DO blocks:

IF Age >= 65 THEN DO;
    Category = "Senior";
    Discount = 0.2;
END;
ELSE IF Age < 18 THEN DO;
    Category = "Minor";
    Discount = 0.1;
END;
ELSE DO;
    Category = "Adult";
    Discount = 0;
END;

7. Best Practices

  • Always validate numeric ranges and detect impossible values (e.g., Age < 0).
  • Use NMISS to track missing values across multiple variables.
  • Flag invalid records using conditional logic and document them.
  • Use LABEL and FORMAT to make variables easier to interpret during analysis.

Conclusion

Edition 6 provided essential techniques for preparing and validating data in SAS:

  • Creating and filtering datasets
  • Keeping/dropping and formatting variables
  • Using conditional logic for new columns and flags
  • Employing common SAS functions for transformation and validation

These steps form the foundation for clean, trustworthy data analysis.


What’s Next

In Edition 7, we’ll cover Analyzing and Reporting Data.


Keep learning with 3 D Statistical Learning.

Special thanks to Dr. Dany Djeudeu for inspiring accessible and applied statistical learning in SAS.