Introduction

Welcome to Edition 10 of Making SAS Accessible to Everyone. After exploring SQL integration in SAS in Edition 9, we now turn our attention back to the DATA step, the core of SAS programming. This edition focuses on controlling DATA step processing for efficient, readable, and accurate code.

Understanding the internal mechanics of the DATA step, especially the distinction between the compilation and execution phases, is essential for mastering SAS. You’ll learn how to manage outputs explicitly, use DROP= and KEEP= efficiently, and debug with PUTLOG. Mastering these tools makes your programs faster, cleaner, and easier to troubleshoot.


1. DATA Step Processing Phases

Compilation Phase

When a DATA step is compiled, SAS does not execute any data-related logic. Instead, it:

  • Scans your code to determine variable names, types (numeric or character), and lengths.
  • Builds the Program Data Vector (PDV), a memory structure that holds one observation at a time.
  • Prepares the data descriptor portion of the output dataset.

Even if the dataset has millions of rows, no data is read during compilation; SAS is only preparing the structure.
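
A minimal sketch of this separation, assuming the sample table sashelp.class that ships with SAS (class_shell is a hypothetical name). The SET statement below can never execute, yet SAS still reads its descriptor at compile time and builds the PDV from it:

DATA class_shell;
    IF 0 THEN SET sashelp.class;  /* never executes, but is compiled */
    STOP;                         /* end the step before any implicit output */
RUN;

PROC CONTENTS DATA=class_shell;   /* lists the variables of the empty dataset */
RUN;

The resulting dataset should contain zero observations but the same variables, types, and lengths as sashelp.class, because all of that was settled during compilation.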

Execution Phase

In this phase, SAS reads and processes data row-by-row:

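  • Reads the next observation from the source dataset.
  • Fills the PDV with its values.
  • Executes the statements in order (e.g., IF, PUT, OUTPUT statements).
  • Writes the resulting observation to the output dataset (implicitly or explicitly).
  • Returns to the top of the step, resets variables created in the step (by assignment or INPUT) to missing, and repeats the process for the next row. Variables read with SET keep their values until the next observation overwrites them, and the step ends when SET finds no more observations.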

Use PUTLOG as a diagnostic tool during execution to inspect values in real time.

PUTLOG _ALL_;         * Logs all variables and their values;
PUTLOG Age= Gender=;  * Logs selected variables;
PUTLOG "Checkpoint";  * Writes a custom message;
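
To watch the iteration cycle described above, the following sketch writes to the log at the top and bottom of each iteration. It assumes the sample table sashelp.class; Ratio is a hypothetical variable created only for illustration.

DATA _NULL_;
    PUTLOG "Top of iteration: " _N_= Age= Ratio=;   /* before any data is read */
    SET sashelp.class (OBS=3);                      /* read at most 3 observations */
    Ratio = Weight / Height;                        /* computed variable */
    PUTLOG "After processing: " _N_= Age= Ratio=;
RUN;

At the top of the second and third iterations, Age should still show the previous observation's value (it comes from SET), while Ratio is back to missing. The first message should also appear a fourth time, because the step only ends when SET tries to read past the last observation.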

2. Implicit vs. Explicit OUTPUT

Implicit Output

By default, SAS automatically writes each PDV record to the output dataset at the end of every iteration. This behavior is called implicit output.

Explicit Output

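Once a DATA step contains an OUTPUT statement, SAS no longer performs the implicit output at the end of each iteration. An observation is written only where OUTPUT executes, which gives you full control: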

DATA result;
    SET patients;
    IF Age >= 65 THEN OUTPUT;   /* Only output these rows */
RUN;
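
Because an observation is written at the moment OUTPUT executes, statement order matters. A small sketch, assuming sashelp.class; height_check and Height_cm are hypothetical names:

DATA height_check;
    SET sashelp.class;
    OUTPUT;                      /* the observation is written here */
    Height_cm = Height * 2.54;   /* computed too late to be written */
RUN;

Height_cm should be missing in every row of height_check: the value is assigned after the row has already been written, and it is reset to missing before the next iteration starts.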

You may also create multiple output datasets:

DATA seniors juniors;
    SET patients;
    IF Age >= 65 THEN OUTPUT seniors;
    ELSE OUTPUT juniors;
RUN;

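This technique is ideal for classification and segmentation. Note that SAS treats a missing numeric value as smaller than any number, so an observation with a missing Age fails the Age >= 65 test and is written to juniors.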
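
One way to keep records with an unknown age out of either group is to test for missing values first. A sketch, assuming the patients dataset from the example above; age_unknown is a hypothetical third dataset:

DATA seniors juniors age_unknown;
    SET patients;
    IF MISSING(Age) THEN OUTPUT age_unknown;   /* route missing ages separately */
    ELSE IF Age >= 65 THEN OUTPUT seniors;
    ELSE OUTPUT juniors;
RUN;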


3. Optimizing Variables with DROP= and KEEP=

Using DROP=

The DROP= option removes variables from the output dataset, although they remain in the PDV during execution:

DATA output (DROP=TempVar);
    SET source;
    TempVar = ... ;   /* Used for calculation */
RUN;

Using KEEP=

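The KEEP= option lists the variables to retain. On the SET statement it limits which variables are read into the PDV; on the DATA statement it limits which variables are written to output: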

DATA result;
    SET rawdata (KEEP=Name Age Gender);
RUN;

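Applied to an input dataset, these options reduce the amount of data read into the PDV; applied to an output dataset, they produce leaner datasets. In both cases they make the intent of the step clearer.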
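
The two options can be combined in one step. The sketch below assumes sashelp.class, where Height is recorded in inches and Weight in pounds; class_bmi and BMI are hypothetical names.

DATA class_bmi (DROP=Height Weight);             /* not written to the output */
    SET sashelp.class (KEEP=Name Height Weight); /* only these enter the PDV */
    BMI = 703 * Weight / Height**2;              /* BMI from pounds and inches */
RUN;

Height and Weight are available for the calculation because DROP= on the DATA statement takes effect only when the observation is written.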


4. Conditional Logic + Output Control

SAS allows combining logic with output and variable control for more robust pipelines:

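DATA diabetes_flags (KEEP=ID RiskLevel);
    SET patients;
    /* Without LENGTH, the first assignment ('High') would fix the
       length at 4 and truncate 'Moderate' to 'Mode' */
    LENGTH RiskLevel $8;
    IF BMI > 30 AND Age > 50 THEN RiskLevel = 'High';
    ELSE IF BMI > 25 THEN RiskLevel = 'Moderate';
    ELSE RiskLevel = 'Low';
RUN;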

Because KEEP= is set on the output dataset, BMI and Age remain available to the conditions, but only ID and RiskLevel are written. The IF / ELSE IF chain assigns exactly one risk level to each observation.
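
Each output dataset can also carry its own KEEP= list. A sketch under the assumption that patients contains ID, Age, and BMI as in the example above; high_risk and risk_summary are hypothetical dataset names:

DATA high_risk (KEEP=ID Age BMI) risk_summary (KEEP=ID RiskLevel);
    SET patients (KEEP=ID Age BMI);
    LENGTH RiskLevel $8;                           /* long enough for 'Moderate' */
    IF BMI > 30 AND Age > 50 THEN RiskLevel = 'High';
    ELSE IF BMI > 25 THEN RiskLevel = 'Moderate';
    ELSE RiskLevel = 'Low';
    OUTPUT risk_summary;                           /* every patient */
    IF RiskLevel = 'High' THEN OUTPUT high_risk;   /* high-risk patients only */
RUN;

Here the classification runs once, and the two datasets differ only in which rows and which variables they receive.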


5. Advanced Debugging with PUTLOG

Use PUTLOG to investigate how your program behaves during execution:

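DATA debug_test;
    SET work.patients;
    PUTLOG "Processing ID=" ID " Age=" Age;
    /* A missing Age also compares as less than 0, so exclude it explicitly.
       An uppercase WARNING: prefix makes the line stand out in the log. */
    IF NOT MISSING(Age) AND Age < 0 THEN
        PUTLOG "WARNING: Negative age detected for ID=" ID;
RUN;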

This is particularly useful for identifying:

  • Incorrect data values
  • Logic bugs
  • Unexpected PDV behavior

Try PUTLOG _ALL_; to see the entire PDV content at each iteration.
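
On large inputs, a PUTLOG on every iteration can flood the log. A common pattern, sketched here with the same work.patients dataset, is to limit the dump to the first few iterations using the automatic variable _N_, and to use DATA _NULL_ so that no dataset is written:

DATA _NULL_;
    SET work.patients;
    IF _N_ <= 5 THEN PUTLOG _ALL_;   /* dump the PDV for the first 5 iterations only */
RUN;

The dump includes the automatic variables _N_ and _ERROR_ alongside the dataset variables.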


Conclusion

In Edition 10, you gained a deeper understanding of the internals of SAS DATA step processing. We covered:

  • The two-phase nature of DATA steps: compilation and execution
  • Managing output behavior using implicit and explicit OUTPUT
  • Optimizing memory and clarity with DROP= and KEEP=
  • Structuring conditional outputs
  • Debugging effectively with PUTLOG

These foundational concepts unlock better performance and more readable code for advanced SAS projects.


What’s Next

In Edition 11, we transition to summarizing data in SAS, building upon the summary techniques introduced in previous editions.


Keep refining your SAS skills with 3 D Statistical Learning.

Special thanks to Dr. Dany Djeudeu for demystifying core programming mechanics for learners around the world.