Introduction
Welcome to Edition 11 of Making SAS Accessible to Everyone. In this edition, we explore how to summarize data efficiently and accurately using foundational SAS techniques. Data summarization is essential in both exploratory data analysis and reporting, allowing analysts to uncover patterns, identify outliers, and condense large datasets into actionable insights.
In particular, we will focus on how to:
- Create accumulating totals using the
RETAINstatement. - Generate grouped summaries using
BYgroups in combination withFIRST.andLAST.indicators. - Apply conditional logic to include or exclude records using subsetting
IFstatements. - Produce tailored output with multi-layered control flows.
These tools not only streamline your summarization tasks but also reinforce best practices in efficient and readable SAS programming.
1. Creating Accumulating Columns
Often in business and analytics, we want to see running totals, like cumulative revenue, inventory levels, or patient visits. SAS resets variables during each iteration of the DATA step unless told otherwise.
Using RETAIN and Sum Statements
DATA sales_total;
SET sales_data;
RETAIN TotalSales 0;
TotalSales + Revenue;
RUN;
Explanation:
RETAIN TotalSales 0;initializesTotalSalesonly once, not at the start of each loop.- The
+operator accumulates values across rows. - This creates a column showing cumulative revenue for each row processed.
Additional Tip:
If the dataset isn’t sorted, the total will span the entire file. If you want grouped totals, use BY groups (see next section).
2. Processing Data in Groups with BY
Summarizing by a group (e.g., by Region, Department, or CustomerID) allows you to analyze patterns within subsets of your data.
Step 1: Sorting the Data
Before using BY, sort the dataset:
PROC SORT DATA=sales_data;
BY Region;
RUN;
Step 2: Using FIRST. and LAST.
DATA region_summary;
SET sales_data;
BY Region;
RETAIN RegionTotal 0;
IF FIRST.Region THEN RegionTotal = 0;
RegionTotal + Revenue;
IF LAST.Region THEN OUTPUT;
KEEP Region RegionTotal;
RUN;
Explanation:
FIRST.Regionis1on the first observation of a new group.LAST.Regionis1on the last observation of a group.- We reset the total at the start and output only once per group.
This approach is much more efficient than aggregating with PROC MEANS if you need additional logic or custom columns.
3. Subsetting with Conditional IF
Use the subsetting IF to selectively keep rows during processing. This technique offers more granular control than a WHERE clause in many contexts.
Example:
DATA high_revenue;
SET sales_data;
IF Revenue > 10000;
RUN;
Use Cases:
- Removing outliers or irrelevant cases.
- Isolating extreme performers (e.g., top 10% sales reps).
- Building training datasets from filtered records.
Comparison with WHERE:
WHEREfilters rows before entering the DATA step.- Subsetting
IFallows you to filter after new variables are calculated.
4. Combining Group Processing and Conditional Output
Advanced workflows often require you to summarize by group and filter simultaneously.
Example: Top Performers by Region
DATA top_performers;
SET sales_data;
BY Region;
IF FIRST.Region AND Revenue > 15000 THEN OUTPUT;
RUN;
Interpretation:
- You are extracting the first observation from each region only if revenue exceeds the threshold.
- Great for reports that require representative but selective rows.
Another Example: Final Sale Per Client
PROC SORT DATA=transactions;
BY CustomerID TransactionDate;
RUN;
DATA last_transaction;
SET transactions;
BY CustomerID;
IF LAST.CustomerID;
RUN;
- This keeps only the most recent transaction per customer.
- Perfect for retention modeling, latest purchase analysis, etc.
5. Best Practices
- Always sort your dataset using
PROC SORTbeforeBYprocessing. - Use
RETAINonly where necessary to minimize unintended data retention. - Avoid placing
OUTPUTtoo early inside complex conditional blocks—ensure it follows all intended logic. - Test logic using
PUTLOGstatements to verifyFIRST.andLAST.behavior. - Pair
KEEP=andDROP=with your summarization logic to generate clean outputs.
Conclusion
In Edition 11, we learned how to:
- Use
RETAINto maintain cumulative totals across iterations. - Leverage
BYgroups andFIRST./LAST.indicators for grouped summaries. - Use subsetting
IFto include or exclude specific rows. - Combine group processing with conditional logic to build sophisticated summaries.
These summarization tools are fundamental to most SAS workflows, from dashboards to analytical models.
What’s Next
In Edition 12, we transition to Manipulating Data with Functions.
Keep summarizing smartly with 3 D Statistical Learning.
Special thanks to Dr. Dany Djeudeu for empowering learners with foundational SAS skills and making summarization techniques accessible to all.
We help businesses and researchers solve complex challenges by providing expert guidance in statistics, machine learning, and tailored education.
Our core services include:
– Statistical Consulting:
Comprehensive consulting tailored to your data-driven needs.
– Training and Coaching:
In-depth instruction in statistics, machine learning, and the use of statistical software such as SAS, R, and Python.
– Reproducible Data Analysis Pipelines:
Development of documented, reproducible workflows using SAS macros and customized R and Python code.
– Interactive Data Visualization and Web Applications:
Creation of dynamic visualizations and web apps with R (Shiny, Plotly), Python (Streamlit, Dash by Plotly), and SAS (SAS Viya, SAS Web Report Studio).
– Automated Reporting and Presentation:
Generation of automated reports and presentations using Markdown and Quarto.
– Scientific Data Analysis:
Advanced analytical support for scientific research projects.