Introduction

Welcome to Edition 11 of Making SAS Accessible to Everyone. In this edition, we explore how to summarize data efficiently and accurately using foundational SAS techniques. Data summarization is essential in both exploratory data analysis and reporting, allowing analysts to uncover patterns, identify outliers, and condense large datasets into actionable insights.

In particular, we will focus on how to:

  • Create accumulating totals using the RETAIN statement.
  • Generate grouped summaries using BY groups in combination with FIRST. and LAST. indicators.
  • Apply conditional logic to include or exclude records using subsetting IF statements.
  • Produce tailored output with multi-layered control flows.

These tools not only streamline your summarization tasks but also reinforce best practices in efficient and readable SAS programming.


1. Creating Accumulating Columns

Often in business and analytics, we want to see running totals, like cumulative revenue, inventory levels, or patient visits. SAS resets variables during each iteration of the DATA step unless told otherwise.

Using RETAIN and Sum Statements

DATA sales_total;
    SET sales_data;
    RETAIN TotalSales 0;
    TotalSales + Revenue;
RUN;

Explanation:

  • RETAIN TotalSales 0; initializes TotalSales only once, not at the start of each loop.
  • The + operator accumulates values across rows.
  • This creates a column showing cumulative revenue for each row processed.

Additional Tip:

If the dataset isn’t sorted, the total will span the entire file. If you want grouped totals, use BY groups (see next section).


2. Processing Data in Groups with BY

Summarizing by a group (e.g., by Region, Department, or CustomerID) allows you to analyze patterns within subsets of your data.

Step 1: Sorting the Data

Before using BY, sort the dataset:

PROC SORT DATA=sales_data;
    BY Region;
RUN;

Step 2: Using FIRST. and LAST.

DATA region_summary;
    SET sales_data;
    BY Region;
    RETAIN RegionTotal 0;

    IF FIRST.Region THEN RegionTotal = 0;
    RegionTotal + Revenue;

    IF LAST.Region THEN OUTPUT;
    KEEP Region RegionTotal;
RUN;

Explanation:

  • FIRST.Region is 1 on the first observation of a new group.
  • LAST.Region is 1 on the last observation of a group.
  • We reset the total at the start and output only once per group.

This approach is much more efficient than aggregating with PROC MEANS if you need additional logic or custom columns.


3. Subsetting with Conditional IF

Use the subsetting IF to selectively keep rows during processing. This technique offers more granular control than a WHERE clause in many contexts.

Example:

DATA high_revenue;
    SET sales_data;
    IF Revenue > 10000;
RUN;

Use Cases:

  • Removing outliers or irrelevant cases.
  • Isolating extreme performers (e.g., top 10% sales reps).
  • Building training datasets from filtered records.

Comparison with WHERE:

  • WHERE filters rows before entering the DATA step.
  • Subsetting IF allows you to filter after new variables are calculated.

4. Combining Group Processing and Conditional Output

Advanced workflows often require you to summarize by group and filter simultaneously.

Example: Top Performers by Region

DATA top_performers;
    SET sales_data;
    BY Region;
    IF FIRST.Region AND Revenue > 15000 THEN OUTPUT;
RUN;

Interpretation:

  • You are extracting the first observation from each region only if revenue exceeds the threshold.
  • Great for reports that require representative but selective rows.

Another Example: Final Sale Per Client

PROC SORT DATA=transactions;
    BY CustomerID TransactionDate;
RUN;

DATA last_transaction;
    SET transactions;
    BY CustomerID;
    IF LAST.CustomerID;
RUN;
  • This keeps only the most recent transaction per customer.
  • Perfect for retention modeling, latest purchase analysis, etc.

5. Best Practices

  • Always sort your dataset using PROC SORT before BY processing.
  • Use RETAIN only where necessary to minimize unintended data retention.
  • Avoid placing OUTPUT too early inside complex conditional blocks—ensure it follows all intended logic.
  • Test logic using PUTLOG statements to verify FIRST. and LAST. behavior.
  • Pair KEEP= and DROP= with your summarization logic to generate clean outputs.

Conclusion

In Edition 11, we learned how to:

  • Use RETAIN to maintain cumulative totals across iterations.
  • Leverage BY groups and FIRST./LAST. indicators for grouped summaries.
  • Use subsetting IF to include or exclude specific rows.
  • Combine group processing with conditional logic to build sophisticated summaries.

These summarization tools are fundamental to most SAS workflows, from dashboards to analytical models.


What’s Next

In Edition 12, we transition to Manipulating Data with Functions.


Keep summarizing smartly with 3 D Statistical Learning.

Special thanks to Dr. Dany Djeudeu for empowering learners with foundational SAS skills and making summarization techniques accessible to all.