Introduction
Welcome to Edition 6 of Making SAS Accessible to Everyone. After exploring data structure and summaries in Edition 5, we now shift our focus to data preparation and validation.
These steps are crucial before any statistical analysis, modeling, or reporting can begin. Whether you’re cleaning raw inputs, checking consistency, or creating new variables, SAS offers a comprehensive set of tools to get your dataset analysis-ready.
1. Creating and Filtering Datasets
1.1 Copying a Dataset
Use the DATA and SET statements to create a new table from an existing dataset.
DATA work.new_table;
SET work.original_table;
RUN;
This is useful for preserving the original data while applying transformations or filtering.
1.2 Filtering Rows
You can filter rows directly during dataset creation using WHERE:
DATA work.filtered_data;
SET work.original_data;
WHERE Age >= 18;
RUN;
Alternatively, use IF after reading in each observation:
DATA work.filtered_data;
SET work.original_data;
IF Age >= 18;
RUN;
Note: WHERE is processed during input, making it more efficient for large datasets.
2. Keeping or Dropping Variables
To manage which variables appear in your dataset, use the KEEP or DROP options:
DATA work.cleaned_data (DROP=SSN);
SET work.original_data;
RUN;
DATA work.reduced_data;
SET work.original_data (KEEP=Name Age Gender);
RUN;
These allow you to reduce dataset size and avoid storing unnecessary information.
3. Formatting Variables
SAS allows you to format how data values are displayed. For example:
DATA work.formatted_data;
SET work.original_data;
FORMAT BirthDate DATE9. Salary DOLLAR8.2;
RUN;
DATE9.: Displays dates like01JAN2023DOLLAR8.2: Formats numbers as monetary values (e.g.,$1,234.56)
4. Creating New Variables
Arithmetic Computations
Create new numeric variables from existing ones:
DATA work.bmi_data;
SET work.original_data;
BMI = Weight / ((Height/100)**2);
RUN;
Character Creation
Use conditional logic and functions to create descriptive labels:
DATA work.labels;
SET work.original_data;
LENGTH Category $15;
IF Score >= 90 THEN Category = "Excellent";
ELSE IF Score >= 75 THEN Category = "Good";
ELSE Category = "Needs Review";
RUN;
5. Useful SAS Functions
Aggregation
These handle missing values gracefully:
Total = SUM(score1, score2, score3);
Average = MEAN(score1, score2, score3);
Character Handling
Manipulate strings easily:
FullName = CATS(FirstName, " ", LastName);
Initials = SUBSTR(FirstName,1,1) || SUBSTR(LastName,1,1);
Date Functions
Work with and compute dates:
TodayDate = TODAY();
Age = YRDIF(BirthDate, TODAY(), 'AGE');
6. Validating Data
Conditional Checks
Create flags for unusual or invalid values:
IF Income < 0 THEN Flag_Invalid = 1;
IF NMISS(Height, Weight) > 0 THEN Flag_Missing = 1;
Multi-Condition Validation
Handle complex rules with DO blocks:
IF Age >= 65 THEN DO;
Category = "Senior";
Discount = 0.2;
END;
ELSE IF Age < 18 THEN DO;
Category = "Minor";
Discount = 0.1;
END;
ELSE DO;
Category = "Adult";
Discount = 0;
END;
7. Best Practices
- Always validate numeric ranges and detect impossible values (e.g., Age < 0).
- Use
NMISSto track missing values across multiple variables. - Flag invalid records using conditional logic and document them.
- Use
LABELandFORMATto make variables easier to interpret during analysis.
Conclusion
Edition 6 provided essential techniques for preparing and validating data in SAS:
- Creating and filtering datasets
- Keeping/dropping and formatting variables
- Using conditional logic for new columns and flags
- Employing common SAS functions for transformation and validation
These steps form the foundation for clean, trustworthy data analysis.
What’s Next
In Edition 7, we’ll cover Analyzing and Reporting Data.
Keep learning with 3 D Statistical Learning.
Special thanks to Dr. Dany Djeudeu for inspiring accessible and applied statistical learning in SAS.
We help businesses and researchers solve complex challenges by providing expert guidance in statistics, machine learning, and tailored education.
Our core services include:
– Statistical Consulting:
Comprehensive consulting tailored to your data-driven needs.
– Training and Coaching:
In-depth instruction in statistics, machine learning, and the use of statistical software such as SAS, R, and Python.
– Reproducible Data Analysis Pipelines:
Development of documented, reproducible workflows using SAS macros and customized R and Python code.
– Interactive Data Visualization and Web Applications:
Creation of dynamic visualizations and web apps with R (Shiny, Plotly), Python (Streamlit, Dash by Plotly), and SAS (SAS Viya, SAS Web Report Studio).
– Automated Reporting and Presentation:
Generation of automated reports and presentations using Markdown and Quarto.
– Scientific Data Analysis:
Advanced analytical support for scientific research projects.