Introduction

This edition marks the beginning of our deep dive into data exploration using SAS.

This edition does not claim to cover data exploration in full, or even a large part of it. In our experience, data exploration can consume up to 90% of the data analytics workload, depending on the goal and the data quality.

Our goal here is simple: to introduce foundational tools and steps to start exploring your data in SAS.

In future intermediate-level tutorials, we’ll go deeper into each of the important SAS procedures for data exploration.


1. Previewing Data with PROC PRINT

PROC PRINT is a basic yet essential tool for examining raw data rows.

PROC PRINT DATA=input-table(OBS=10);
    VAR col1 col2 col3;
RUN;
  • OBS=n: Stops reading after observation n, so only the first n rows are displayed.
  • VAR: Selects and orders variables to display.

Use this as your first look into a dataset.
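
As a concrete illustration of the template above, here is a minimal sketch that assumes the SASHELP.CLASS sample table shipped with SAS is available in your session. The table and the NOOBS option are our additions for illustration, not part of the template.

```sas
/* Preview the first 5 rows of SASHELP.CLASS (sample table, assumed available) */
PROC PRINT DATA=sashelp.class(OBS=5) NOOBS;   /* NOOBS hides the Obs column */
    VAR Name Sex Age;                         /* columns print in this order */
RUN;
```

Leaving out the VAR statement prints every column, which is often fine for a narrow table but hard to read for a wide one.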


2. Generating Descriptive Statistics

2.1 With PROC MEANS

Use PROC MEANS to quickly view summary statistics for numeric variables:

PROC MEANS DATA=input-table;
    VAR var1 var2;
RUN;
  • Shows by default: N, Mean, Std Dev, Minimum, and Maximum.
  • Use VAR to focus on specific variables.
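
To go beyond the default statistics, you can list statistic keywords on the PROC MEANS statement and add a CLASS statement to summarize by group. The sketch below assumes the SASHELP.CARS sample table; the choice of variables and statistics is only an illustration.

```sas
/* Selected statistics for two numeric columns, one block of rows per Origin */
PROC MEANS DATA=sashelp.cars N NMISS MEAN MEDIAN MIN MAX MAXDEC=1;
    CLASS Origin;          /* grouping variable */
    VAR MSRP Horsepower;   /* numeric variables to summarize */
RUN;
```

NMISS is worth requesting early in an exploration because it reveals missing values that the default output does not report directly.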

2.2 With PROC UNIVARIATE

For more detailed insights (e.g., distribution, skewness, extreme values):

PROC UNIVARIATE DATA=input-table;
    VAR var1;
RUN;

This procedure is powerful for uncovering outliers and understanding distributional properties.
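
A possible concrete run, again assuming the SASHELP.CARS sample table: the ID statement labels the extreme observations with a readable identifier, and the HISTOGRAM statement (which assumes ODS Graphics is enabled in your environment) adds a plot with a fitted normal curve.

```sas
/* Detailed distribution report for one numeric variable */
PROC UNIVARIATE DATA=sashelp.cars;
    VAR MSRP;
    ID Model;                   /* label extreme observations by Model */
    HISTOGRAM MSRP / NORMAL;    /* histogram with an overlaid normal fit */
RUN;
```

The "Extreme Observations" table in the output is usually the quickest place to spot candidate outliers.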


3. Exploring Categorical Data with PROC FREQ

For counts and percentages of unique values:

PROC FREQ DATA=input-table;
    TABLES col1 col2;
RUN;

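You can add options after a slash to control how frequency tables are displayed. For example, the following statement cross-tabulates gender by region and suppresses the column percentages (NOCOL) and the overall percentages (NOPERCENT):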

TABLES gender*region / NOCOL NOPERCENT;

Use this to explore categorical distributions and combinations.
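
Here is a short runnable sketch of both a one-way and a two-way table, assuming the SASHELP.CARS sample table. ORDER=FREQ and MISSING are options we add for illustration.

```sas
/* One-way and two-way frequency tables, most frequent levels first */
PROC FREQ DATA=sashelp.cars ORDER=FREQ;
    TABLES Origin / MISSING;                 /* count missing values as a level */
    TABLES Origin*Type / NOCOL NOPERCENT;    /* keep counts and row percentages */
RUN;
```

Sorting by frequency makes dominant categories and rare levels (often typos or coding errors) stand out immediately.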


4. Filtering Observations with WHERE

Use WHERE to focus your exploration on specific subsets of the data:

PROC PRINT DATA=input-table;
    WHERE Age > 40 AND Gender = "M";
RUN;

Operators You Can Use:

  • = or EQ
  • ^= or NE
  • >, <, >=, <= (or GT, LT, GE, LE)
  • IN, NOT IN, BETWEEN ... AND, LIKE

Example with date filter:

WHERE AdmissionDate >= "01JAN2020"d;

Use logical operators (AND, OR, NOT) for complex filters.
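
The operators above can be combined in a single WHERE statement. The following sketch assumes the SASHELP.CARS and SASHELP.STOCKS sample tables; the thresholds and patterns are arbitrary illustrations.

```sas
/* IN, BETWEEN ... AND, and LIKE combined with AND */
PROC PRINT DATA=sashelp.cars(OBS=10);
    WHERE Origin IN ("Asia", "Europe")
      AND MSRP BETWEEN 20000 AND 30000
      AND Model LIKE "%4dr%";        /* % matches any sequence of characters */
    VAR Make Model Origin MSRP;
RUN;

/* Character and date conditions together */
PROC PRINT DATA=sashelp.stocks;
    WHERE Stock = "IBM" AND Date >= "01JAN2005"d;
    VAR Stock Date Close;
RUN;
```

Note that when OBS= and WHERE are used together, the WHERE condition is applied first, so OBS=10 limits the output to the first 10 matching rows.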


5. Dynamic Filtering with Macro Variables

%LET ageLimit = 30;

PROC PRINT DATA=input-table;
    WHERE Age > &ageLimit;
RUN;
  • %LET: Creates macro variables.
  • &macrovar: Substitutes values in code.
  • Use "&charvar" for character variables and "&date"d for dates. The quotes must be double quotes: macro variables are not resolved inside single quotes.

Macros make filtering parameterized and reusable.
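
To show the character and date cases side by side with the numeric one, here is a minimal sketch assuming the SASHELP.STOCKS sample table. The macro variable names stockName, startDate, and minClose are hypothetical names chosen for this example.

```sas
/* Parameters defined once, at the top of the program */
%LET stockName = IBM;
%LET startDate = 01JAN2005;
%LET minClose  = 80;

PROC PRINT DATA=sashelp.stocks;
    WHERE Stock = "&stockName"       /* character: double quotes */
      AND Date >= "&startDate"d      /* date: double quotes plus d suffix */
      AND Close > &minClose;         /* numeric: no quotes */
    VAR Stock Date Close;
RUN;
```

Changing the three %LET lines is enough to rerun the same report for another stock, period, or threshold.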


6. Formatting Variables for Readability

Change how data is displayed (not stored) using formats:

PROC PRINT DATA=input-table;
    FORMAT Salary DOLLAR12.2 BirthDate DATE9.;   /* width 12 leaves room for $, commas, and decimals */
RUN;

Formats help produce clean, human-readable outputs.
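
A runnable variant, assuming the SASHELP.STOCKS sample table: the widths below are our own choices and should be adjusted to the size of your values.

```sas
/* Apply display formats for this report only; stored values are unchanged */
PROC PRINT DATA=sashelp.stocks(OBS=5);
    VAR Stock Date Close Volume;
    FORMAT Close  DOLLAR10.2    /* dollar sign, commas, two decimals */
           Date   DATE9.        /* ddMONyyyy style date */
           Volume COMMA15.;     /* thousands separators, no decimals */
RUN;
```

Because the FORMAT statement sits inside the procedure step, it affects only this output and does not alter the formats saved with the table.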


7. Sorting and Deduplicating with PROC SORT

Sorting:

PROC SORT DATA=input-table OUT=sorted-table;
    BY Age;
RUN;
  • BY: Specifies sort key(s).
  • DESCENDING: Placed before a variable name in the BY statement, sorts that variable in reverse order.

Removing Duplicates:

PROC SORT DATA=input-table OUT=nodups NODUPKEY;
    BY _ALL_;
RUN;
  • NODUPKEY: Removes duplicates based on BY keys.
  • _ALL_: Uses all columns to identify exact duplicates.

Use DUPOUT= to save duplicates to a separate dataset.
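
The sorting and deduplication options can be sketched as follows, assuming the SASHELP.CARS sample table. The output table names in the WORK library are hypothetical. OUT= is needed here because SASHELP tables are read-only, and it is a good habit in general so the source table is left untouched.

```sas
/* Sort by Make ascending, then by MSRP from highest to lowest */
PROC SORT DATA=sashelp.cars OUT=work.cars_sorted;
    BY Make DESCENDING MSRP;
RUN;

/* Keep the first row per Make; send the removed rows to a second table */
PROC SORT DATA=sashelp.cars OUT=work.one_per_make
          NODUPKEY DUPOUT=work.extra_rows;
    BY Make;
RUN;
```

Inspecting the DUPOUT= table before discarding anything is a simple way to confirm that the rows being removed are truly redundant.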


Conclusion

In this edition, we introduced basic yet powerful tools for exploring data in SAS:

  • PROC PRINT to preview data
  • PROC MEANS and PROC UNIVARIATE for numeric summaries
  • PROC FREQ for categorical analysis
  • WHERE statements for filtering
  • Macro variables for dynamic queries
  • FORMAT for readable output
  • PROC SORT for ordering and deduplication

Mastering these techniques will prepare you for more in-depth exploration and visualization in SAS.


Coming Next

In Edition 6, we’ll explore data validation and common strategies to ensure your data is clean, consistent, and analysis-ready.


Stay curious and keep coding with 3 D Statistical Learning.

Special thanks to Dr. Dany Djeudeu for his continuous effort to make statistical tools accessible and intuitive.