Introduction
In previous editions, we learned how to access and import data into SAS. Now, in Edition 4, we uncover what happens behind the scenes when SAS processes a DATA step.
Understanding SAS’s internal processing model—specifically, how the Program Data Vector (PDV) works—will help you write cleaner, more efficient code and debug with confidence.
1. The Two-Stage DATA Step Processing
SAS processes DATA steps in two distinct stages: compilation and execution.
a. Compilation Stage
In this stage, SAS prepares to process your data but does not yet read any observations. Here’s what happens:
- SAS sees the DATA statement and sets up a structure for the new dataset.
- It reads the INFILE statement and records the file reference and its options (e.g., record length, delimiter); the file itself is not read until execution.
- It builds a memory buffer called the input buffer to store one record from the file.
- It parses the INPUT and other statements to identify variables.
- Each variable is assigned a name, type (character or numeric), and length (default is 8 bytes).
- SAS constructs the descriptor portion (metadata) of the dataset.
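One way to inspect the descriptor portion built during compilation is PROC CONTENTS. The following is a minimal sketch, not part of the original example: the dataset name compile_demo is hypothetical, and in-stream DATALINES stand in for an external file so the step runs on its own.

```sas
/* Hypothetical dataset; DATALINES replaces the external file so the step is self-contained */
data compile_demo;
    input Gender $ Age Height Weight;
    datalines;
M 50 68 155
;
run;

/* List the descriptor portion: variable names, types, and lengths */
proc contents data=compile_demo;
run;
```

In the variable listing, Gender should appear as a character variable of length 8, and Age, Height, and Weight as numeric variables of length 8.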
The Program Data Vector (PDV)
The PDV is a memory area where one observation is built at a time. Imagine the PDV as a set of labeled boxes, one per variable.
+---------+-----+--------+--------+
| Gender  | Age | Height | Weight |
| Char(8) | Num | Num    | Num    |
+---------+-----+--------+--------+
When you add derived variables, they are also added to the PDV:
+---------+-----+--------+--------+--------+
| Gender  | Age | Height | Weight | BMI    |
| Char(8) | Num | Num    | Num    | Num    |
+---------+-----+--------+--------+--------+
No data has been read yet at this point.
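Because the PDV slots are sized at compile time, a character variable read with simple list input keeps its default length of 8 bytes no matter what the data later contain. The sketch below illustrates this under stated assumptions: the dataset names, the variable Name, and the sample value are all hypothetical.

```sas
/* Hypothetical example: Name receives the default length of 8 at compile time */
data length_demo;
    input Name $;
    datalines;
Alexandria
;
run;

/* Declaring the length before INPUT changes the PDV slot that is created */
data length_fixed;
    length Name $ 12;
    input Name $;
    datalines;
Alexandria
;
run;

proc print data=length_demo;
run;

proc print data=length_fixed;
run;
```

The first listing should show the value truncated to its first 8 characters, while the second keeps the full value, since the LENGTH statement is processed before the INPUT statement during compilation.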
b. Execution Stage
Once compilation is complete, SAS begins executing the DATA step:
1. Initialize: All PDV values are set to missing (blanks for character, periods for numeric).
2. Read: A line from the input file is placed into the input buffer.
3. Input: The values are parsed and transferred to the PDV based on the INPUT statement.
4. Calculate: Computations (e.g., BMI = ...) are evaluated.
5. Output: The observation is written to the dataset.
6. Loop: Return to step 1 until the end of the file.
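The execution cycle can be observed in the SAS log by writing the PDV out at the top and bottom of each iteration. This is a rough sketch rather than a definitive technique: the dataset name pdv_trace and the second data line are made up for illustration, and DATALINES are used so the step needs no external file.

```sas
/* Hypothetical trace: PUT _ALL_ writes every PDV variable to the log */
data pdv_trace;
    put "Top of iteration:    " _all_;    /* before INPUT: values are still missing */
    input Gender $ Age Height Weight;
    BMI = (Weight * 703) / (Height ** 2);
    put "Bottom of iteration: " _all_;    /* after INPUT and the BMI calculation */
    datalines;
M 50 68 155
F 42 64 130
;
run;
```

The log also lists the automatic variables _N_ (the iteration counter) and _ERROR_, which live in the PDV but are not written to the dataset. The "Top of iteration" line appears once more than the number of data lines, because SAS starts a final iteration and stops when INPUT finds no more records.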
3. A Step-by-Step Example
Let’s walk through the following code:
data demographics;
    infile "patients.txt";
    input Gender $ Age Height Weight;
    BMI = (Weight * 703) / (Height ** 2);
run;
Compile Phase Summary:
- Gender: Character, 8 bytes
- Age, Height, Weight: Numeric, 8 bytes each
- BMI: Computed, numeric, 8 bytes
PDV layout after compilation:
+---------+-----+--------+--------+--------+
| Gender  | Age | Height | Weight | BMI    |
| Char(8) | Num | Num    | Num    | Num    |
+---------+-----+--------+--------+--------+
Execution Phase Trace:
Input Line: M 50 68 155
PDV evolves as follows:
1. Initialize:
Gender='', Age=., Height=., Weight=., BMI=.
2. Input Buffer:
'M 50 68 155'
3. Populate:
Gender='M', Age=50, Height=68, Weight=155, BMI=.
4. Calculate:
BMI = (155 * 703) / (68 ** 2) = 108965 / 4624 ≈ 23.57
5. Output:
Observation written to dataset
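The calculation step in the trace can be checked on its own with a small DATA _NULL_ step. This is a minimal sketch that hard-codes the values from the sample input line instead of reading patients.txt.

```sas
/* Check the BMI arithmetic for the sample line "M 50 68 155" */
data _null_;
    Weight = 155;
    Height = 68;
    BMI = (Weight * 703) / (Height ** 2);   /* 108965 / 4624 */
    put BMI= 8.2;                           /* write BMI to the log with two decimals */
run;
```

The log should report a BMI of about 23.57 for this record.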
4. Important Notes
- The input buffer holds one line of raw data at a time.
- The PDV is reused for each observation; at the start of each iteration, SAS resets variables created by INPUT or assignment statements to missing (retained variables keep their values).
- The RUN statement explicitly marks the end of a DATA step. SAS also detects implicit step boundaries (e.g., the next DATA or PROC statement).
- There’s an implied OUTPUT at the end of each iteration unless the step contains an explicit OUTPUT statement.
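The implied OUTPUT is no longer performed once a step contains an explicit OUTPUT statement, which makes the position of that statement matter. The sketch below shows the effect; the dataset name early_output is hypothetical, and DATALINES replace the external file.

```sas
/* Hypothetical example: an explicit OUTPUT placed before the calculation */
data early_output;
    input Gender $ Age Height Weight;
    output;                                  /* writes the PDV while BMI is still missing */
    BMI = (Weight * 703) / (Height ** 2);    /* computed too late to reach the dataset */
    datalines;
M 50 68 155
;
run;

proc print data=early_output;
run;
```

The printed observation should show BMI as missing (.), because the PDV was written out before the assignment ran and no implied OUTPUT follows at the end of the iteration.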
5. Why This Knowledge Matters
Understanding SAS internals helps you:
- Write cleaner code by avoiding redundant computations.
- Debug data mismatches and missing values confidently.
- Avoid unexpected behavior due to implicit loops and data overwriting.
- Design more efficient and predictable programs.
Conclusion
In this edition, we took a look under the hood of the SAS processing engine and explored:
- The compile vs. execution stages of the DATA step
- How SAS constructs and manages the Program Data Vector (PDV)
- A detailed trace of how SAS reads, calculates, and writes observations
This foundation will empower you to write more robust SAS code and prepare for advanced topics in future editions.
In Edition 5, we will cover how to start exploring data.
Stay curious and keep coding with 3 D Statistical Learning.
Special thanks to Dr. Dany Djeudeu for his clarity and expertise in simplifying SAS internals.