Introduction

Welcome to Edition 11 of Making SAS Accessible to Everyone. In this edition, we explore how to summarize data efficiently and accurately using foundational SAS techniques. Data summarization is essential in both exploratory data analysis and reporting, allowing analysts to uncover patterns, identify outliers, and condense large datasets into actionable insights.

In particular, we will focus on how to:

Create accumulating totals using the RETAIN statement.
Generate grouped summaries using BY groups in combination with FIRST. and LAST. indicators.
Apply conditional logic to include or exclude records using subsetting IF statements.
Produce tailored output with multi-layered control flows.

These tools not only streamline your summarization tasks but also reinforce best practices in efficient and readable SAS programming.

1. Creating Accumulating Columns

Often in business and analytics, we want to see running totals, like cumulative revenue, inventory levels, or patient visits. SAS resets variables during each iteration of the DATA step unless told otherwise.

Using `RETAIN` and Sum Statements

DATA sales_total;
    SET sales_data;
    RETAIN TotalSales 0;
    TotalSales + Revenue;
RUN;

Explanation:

RETAIN TotalSales 0; initializes TotalSales only once, not at the start of each loop.
The + operator accumulates values across rows.
This creates a column showing cumulative revenue for each row processed.

Additional Tip:

If the dataset isn’t sorted, the total will span the entire file. If you want grouped totals, use BY groups (see next section).

2. Processing Data in Groups with `BY`

Summarizing by a group (e.g., by Region, Department, or CustomerID) allows you to analyze patterns within subsets of your data.

Step 1: Sorting the Data

Before using BY, sort the dataset:

PROC SORT DATA=sales_data;
    BY Region;
RUN;

Step 2: Using `FIRST.` and `LAST.`

DATA region_summary;
    SET sales_data;
    BY Region;
    RETAIN RegionTotal 0;

    IF FIRST.Region THEN RegionTotal = 0;
    RegionTotal + Revenue;

    IF LAST.Region THEN OUTPUT;
    KEEP Region RegionTotal;
RUN;

Explanation:

FIRST.Region is 1 on the first observation of a new group.
LAST.Region is 1 on the last observation of a group.
We reset the total at the start and output only once per group.

This approach is much more efficient than aggregating with PROC MEANS if you need additional logic or custom columns.

3. Subsetting with Conditional `IF`

Use the subsetting IF to selectively keep rows during processing. This technique offers more granular control than a WHERE clause in many contexts.

Example:

DATA high_revenue;
    SET sales_data;
    IF Revenue > 10000;
RUN;

Use Cases:

Removing outliers or irrelevant cases.
Isolating extreme performers (e.g., top 10% sales reps).
Building training datasets from filtered records.

Comparison with `WHERE`:

WHERE filters rows before entering the DATA step.
Subsetting IF allows you to filter after new variables are calculated.

4. Combining Group Processing and Conditional Output

Advanced workflows often require you to summarize by group and filter simultaneously.

Example: Top Performers by Region

DATA top_performers;
    SET sales_data;
    BY Region;
    IF FIRST.Region AND Revenue > 15000 THEN OUTPUT;
RUN;

Interpretation:

You are extracting the first observation from each region only if revenue exceeds the threshold.
Great for reports that require representative but selective rows.

Another Example: Final Sale Per Client

PROC SORT DATA=transactions;
    BY CustomerID TransactionDate;
RUN;

DATA last_transaction;
    SET transactions;
    BY CustomerID;
    IF LAST.CustomerID;
RUN;

This keeps only the most recent transaction per customer.
Perfect for retention modeling, latest purchase analysis, etc.

5. Best Practices

Always sort your dataset using PROC SORT before BY processing.
Use RETAIN only where necessary to minimize unintended data retention.
Avoid placing OUTPUT too early inside complex conditional blocks—ensure it follows all intended logic.
Test logic using PUTLOG statements to verify FIRST. and LAST. behavior.
Pair KEEP= and DROP= with your summarization logic to generate clean outputs.

Conclusion

In Edition 11, we learned how to:

Use RETAIN to maintain cumulative totals across iterations.
Leverage BY groups and FIRST./LAST. indicators for grouped summaries.
Use subsetting IF to include or exclude specific rows.
Combine group processing with conditional logic to build sophisticated summaries.

These summarization tools are fundamental to most SAS workflows, from dashboards to analytical models.

What’s Next

In Edition 12, we transition to Manipulating Data with Functions.

Keep summarizing smartly with 3 D Statistical Learning.

Special thanks to Dr. Dany Djeudeu for empowering learners with foundational SAS skills and making summarization techniques accessible to all.

3 D Statistical Learning

We help businesses and researchers solve complex challenges by providing expert guidance in statistics, machine learning, and tailored education.

Our core services include:

– Statistical Consulting:
Comprehensive consulting tailored to your data-driven needs.

– Training and Coaching:
In-depth instruction in statistics, machine learning, and the use of statistical software such as SAS, R, and Python.

– Reproducible Data Analysis Pipelines:
Development of documented, reproducible workflows using SAS macros and customized R and Python code.

– Interactive Data Visualization and Web Applications:
Creation of dynamic visualizations and web apps with R (Shiny, Plotly), Python (Streamlit, Dash by Plotly), and SAS (SAS Viya, SAS Web Report Studio).

– Automated Reporting and Presentation:
Generation of automated reports and presentations using Markdown and Quarto.

– Scientific Data Analysis:
Advanced analytical support for scientific research projects.

Making SAS Accessible to Everyone – Edition 11: Summarizing Data in SAS

Introduction

1. Creating Accumulating Columns

Using `RETAIN` and Sum Statements

Explanation:

Additional Tip:

2. Processing Data in Groups with `BY`

Step 1: Sorting the Data

Step 2: Using `FIRST.` and `LAST.`

Explanation:

3. Subsetting with Conditional `IF`

Example:

Use Cases:

Comparison with `WHERE`:

4. Combining Group Processing and Conditional Output

Example: Top Performers by Region

Interpretation:

Another Example: Final Sale Per Client

5. Best Practices

Conclusion

What’s Next

Leave a Reply Cancel reply

Archive

Categories

Recent Posts

Share on Social Media

Follow Our Social Media

Data Visualization

Recent Posts

Recent Comments

Categories

Archives

Recent Post

Recent Comments

Making SAS Accessible to Everyone – Edition 11: Summarizing Data in SAS

Introduction

1. Creating Accumulating Columns

Using RETAIN and Sum Statements

Explanation:

Additional Tip:

2. Processing Data in Groups with BY

Step 1: Sorting the Data

Step 2: Using FIRST. and LAST.

Explanation:

3. Subsetting with Conditional IF

Example:

Use Cases:

Comparison with WHERE:

4. Combining Group Processing and Conditional Output

Example: Top Performers by Region

Interpretation:

Another Example: Final Sale Per Client

5. Best Practices

Conclusion

What’s Next

Leave a Reply Cancel reply

Archive

Categories

Recent Posts

Share on Social Media

Follow Our Social Media

Data Visualization

Recent Posts

Recent Comments

Categories

Archives

Recent Post

Recent Comments

Using `RETAIN` and Sum Statements

2. Processing Data in Groups with `BY`

Step 2: Using `FIRST.` and `LAST.`

3. Subsetting with Conditional `IF`

Comparison with `WHERE`: