DSKY — Volume 10 — The Software and the Executive

How a sliver of memory learned to do everything at once

About This Volume

We have spent nine volumes building a machine: the gimbals and the sextant, the integrated circuits and the woven rope, the verbs and nouns blinking on the panel. But a computer is not its hardware. The hardware is the loom; the software is the cloth. And the most astonishing thing about the Apollo Guidance Computer, the thing that engineers still teach in graduate courses sixty years later, was not its silicon or its memory. It was the way its software thought: the way a few tens of thousands of words of code, running on a processor slower than a pocket calculator, managed to juggle navigation, engine control, thruster firing, and a chatty astronaut at the keyboard, all at once, on hard deadlines, without ever seizing up.

This volume is about that achievement: the real-time operating system at the heart of the AGC, and the people and the discipline that made it possible. It is, quietly, the most ahead-of-its-time part of the entire Apollo program. The rope memory was clever; the priority scheduler was visionary. When people say the AGC was a computer that “saved itself” during the first lunar landing, what they are really praising is the architecture described here.

A note on what comes next: this volume sets up Volume 11, on Margaret Hamilton and the birth of software engineering as a discipline, and Volume 12, on the 1202 alarm that nearly aborted the first Moon landing — and didn’t, precisely because of the ideas you are about to read.

Figure 1 — A flight-configuration Apollo Guidance Computer. Inside this sealed magnesium case ran the asynchronous, priority-driven operating system described in this volume. File:Apollo Guidance Computer (AGC).jpg by Steve Jurvetson. License: CC BY 2.0 (https://creativecommons.org/licenses/by/2.0). Via Wikimedia Commons.

The Problem: Everything, At Once, On Time

Imagine the situation the AGC faced during a powered descent to the lunar surface. In the same handful of seconds the computer must: read the accelerometers and gyroscopes of the inertial measurement unit, hundreds of times a second; integrate those readings to know where the spacecraft is and how fast it is moving; run the guidance equations that decide where the engine should point; command the descent engine’s throttle; fire the reaction-control thrusters to hold attitude; drive the needles on the cockpit displays; and watch the DSKY in case the astronaut keys in a request. Several of these tasks have brutal, non-negotiable deadlines — the autopilot loop must run on schedule or the spacecraft will literally tumble.

Now consider the resources available to do all this. The AGC had a 16-bit word (15 data bits plus a parity bit), 36,864 words of fixed (read-only) core-rope memory, and 2,048 words of erasable memory, what we would now call RAM. That erasable memory, the computer’s entire scratchpad for everything happening at once, held less than four kilobytes of data, comparable to one small emoji image file today. The processor executed on the order of tens of thousands of operations per second. There was no spare capacity, no margin to waste, and absolutely no acceptable failure mode in which the computer simply stopped.
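To put those word counts in modern units, here is a small back-of-the-envelope conversion. It is our own arithmetic rather than anything from the flight documentation, and it assumes 15 data bits per word (the 16th bit being parity) and 8-bit bytes.

```python
# Word counts are the figures quoted in the text; the conversion to
# 8-bit bytes is our own arithmetic, counting data bits only.
DATA_BITS_PER_WORD = 15   # assumption: the 16th bit of each word is parity
FIXED_WORDS = 36_864      # core-rope (read-only) memory
ERASABLE_WORDS = 2_048    # erasable (read-write) memory


def words_to_bytes(words, bits_per_word=DATA_BITS_PER_WORD):
    """Convert a count of AGC words into 8-bit bytes of usable data."""
    return words * bits_per_word / 8


erasable_bytes = words_to_bytes(ERASABLE_WORDS)
fixed_bytes = words_to_bytes(FIXED_WORDS)

print(f"erasable: {erasable_bytes:,.0f} bytes")  # erasable: 3,840 bytes
print(f"fixed:    {fixed_bytes:,.0f} bytes")     # fixed:    69,120 bytes
```

In other words, the entire working scratchpad was under 4 KB and the entire program store under 70 KB.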

A naive solution — write one enormous program that does step one, then step two, then step three, forever in a loop — collapses immediately. The tasks run at different rates. Some are urgent and short; others are long and can tolerate a little delay. An astronaut pressing a key arrives at an unpredictable moment. A fixed sequence cannot accommodate this, and worse, if any single step ran long, everything downstream would miss its deadline and the whole edifice would fall over. What the AGC needed was something that, in the mid-1960s, barely existed even as a research idea: a real-time, multitasking operating system.

The Executive and the Waitlist

The answer came largely from one extraordinary mind: J. Halcombe “Hal” Laning, Jr., of the MIT Instrumentation Laboratory. Laning was already a legend in the small world of early computing — in the early 1950s he had written one of the first algebraic compilers, called George, years before FORTRAN. For the AGC he designed a two-part scheduling system that is the conceptual ancestor of every real-time operating system that followed.

The two parts divided the world by timescale and weight.

The Executive handled the longer, heavier units of work, called jobs. A job was a substantial piece of computation: running the navigation update, processing a DSKY request, performing the guidance calculation. Jobs could take many milliseconds, could be interrupted, and could coexist: the Executive could have several jobs in progress at the same time, suspending one to let another run. To run, a job needed bookkeeping space. The Executive maintained a small pool of core sets, twelve-word blocks of erasable memory, each holding everything needed to track one job: its priority, its entry-point address, a copy of the bank register, status flags, and a pointer to its working storage. There were only a handful of these: seven in the Command Module’s configuration, eight in the Lunar Module’s. A job that used the interpreter (more on that below) also needed a VAC area (“vector accumulator”), a larger scratchpad of about forty-three words for double-precision vector and matrix work. The number of core sets and VAC areas was, in effect, the maximum number of jobs the computer could have in flight at one instant. Run out of either, and you have a problem, a fact that will matter enormously in Volume 12.
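The bookkeeping described above can be sketched as a fixed pool of slots. This is a minimal illustration, not the flight code: the class names, the field layout, and the default pool sizes (7 core sets, 5 VAC areas) are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CoreSet:
    """Bookkeeping for one job (fields follow the list in the text)."""
    priority: int
    entry_point: int
    bank: int
    vac_area: Optional[int] = None  # index of this job's VAC area, if any


class JobPool:
    """Hypothetical model of the Executive's fixed pools."""

    def __init__(self, core_sets: int = 7, vac_areas: int = 5):
        # Pool sizes are illustrative assumptions, not flight values.
        self.core_sets: List[Optional[CoreSet]] = [None] * core_sets
        self.vac_free: List[bool] = [True] * vac_areas

    def start_job(self, priority, entry_point, bank=0, needs_vac=False):
        """Claim a core set (and a VAC area if needed).

        Returns the slot index, or None if either pool is exhausted.
        """
        if None not in self.core_sets:
            return None
        vac = None
        if needs_vac:
            if True not in self.vac_free:
                return None
            vac = self.vac_free.index(True)
            self.vac_free[vac] = False
        slot = self.core_sets.index(None)
        self.core_sets[slot] = CoreSet(priority, entry_point, bank, vac)
        return slot

    def end_job(self, slot: int) -> None:
        """Release a job's core set and any VAC area it held."""
        job = self.core_sets[slot]
        if job is not None and job.vac_area is not None:
            self.vac_free[job.vac_area] = True
        self.core_sets[slot] = None


pool = JobPool(core_sets=2, vac_areas=1)
first = pool.start_job(priority=20, entry_point=0o2000, needs_vac=True)
second = pool.start_job(priority=10, entry_point=0o3000)
third = pool.start_job(priority=5, entry_point=0o4000)  # no core set left
print(first, second, third)  # 0 1 None

pool.end_job(first)
print(pool.start_job(priority=5, entry_point=0o4000))   # 0
```

The point of the sketch is the hard ceiling: once every slot is taken, a new job cannot start until an old one finishes, no matter how important it is.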

The Waitlist handled the opposite kind of work: short, sharply time-critical tasks. A task was a brief routine — a few milliseconds at most — that had to run at a precise future moment. The Waitlist was, in essence, a timer queue: a routine could ask the Waitlist to “call this task in 20 milliseconds,” and the Waitlist, driven by the AGC’s hardware timer interrupts, would fire it at exactly that time. Tasks ran to completion quickly and could not be interrupted by jobs; they sat closer to the hardware. If a task found it needed to do heavy lifting, it handed that off — scheduling a job with the Executive and getting out of the way.
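A timer queue of this kind can be sketched in a few lines. The version below is a simulation under stated assumptions: the `Waitlist` class, its `call_in` and `advance` methods, and the millisecond clock are hypothetical names for illustration, and a software clock stands in for the hardware timer interrupt.

```python
import heapq
import itertools


class Waitlist:
    """Toy timer queue: run each task at its requested future time."""

    def __init__(self):
        self.now_ms = 0
        self._queue = []                  # heap of (due time, order, task)
        self._order = itertools.count()   # tie-breaker for equal due times

    def call_in(self, delay_ms, task):
        """Ask for `task` to be called `delay_ms` from now."""
        due = self.now_ms + delay_ms
        heapq.heappush(self._queue, (due, next(self._order), task))

    def advance(self, elapsed_ms):
        """Simulate the passage of time, running every task that falls due.

        Each task runs to completion before the next one is considered.
        """
        end = self.now_ms + elapsed_ms
        while self._queue and self._queue[0][0] <= end:
            due, _, task = heapq.heappop(self._queue)
            self.now_ms = due
            task(self)
        self.now_ms = end


samples = []


def sample_sensor(waitlist):
    """A short task that records the time and reschedules itself."""
    samples.append(waitlist.now_ms)
    waitlist.call_in(20, sample_sensor)


wl = Waitlist()
wl.call_in(20, sample_sensor)
wl.advance(100)
print(samples)  # [20, 40, 60, 80, 100]
```

A periodic activity falls out naturally: the task simply asks to be called again before it returns.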

This division of labor is elegant precisely because it matches the structure of the problem. Short, deadline-bound things (sample this sensor now) go to the Waitlist. Long, interruptible things (figure out where we are) go to the Executive. Together they let a single, tiny processor behave as if it were doing a dozen things simultaneously.

Figure 2 — The Apollo display and keyboard unit (DSKY). Servicing the astronaut at this keyboard was just one of the many concurrent "jobs" the Executive scheduled against navigation, guidance, and control. File:Apollo display and keyboard unit (DSKY) used on F-8 DFBW DVIDS683588.jpg by NASA/Dennis Taylor. License: Public domain. Via Wikimedia Commons.

Priority Scheduling: The Key Idea

The genius of Laning’s design is not merely that it could run many jobs — it is how it chose which job to run. The Executive did not run jobs in the order they arrived, nor in a fixed rotation. Every job carried a priority number, and the rule was breathtakingly simple: the computer always runs the highest-priority job that is ready to run. When a higher-priority job became ready, it would, at the next opportunity, push aside whatever lower-priority job was running and take the processor for itself. The displaced job was not lost; it simply waited, its state safely held in its core set, until the urgent work was done.
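The rule can be sketched with jobs modeled as Python generators, where each `yield` stands for one of the "next opportunities" at which the scheduler may switch. This is an illustrative model only: the function names are hypothetical, and treating a larger number as more urgent is an assumption of the sketch.

```python
def run(jobs):
    """Always run the highest-priority ready job, one step at a time.

    `jobs` is a list of (priority, name, steps) tuples, where `steps` is a
    generator. A step may yield a new job tuple to make that job ready.
    Assumption: a larger priority number means more urgent.
    """
    trace = []
    ready = list(jobs)
    while ready:
        ready.sort(key=lambda job: job[0], reverse=True)
        priority, name, steps = ready[0]
        try:
            new_job = next(steps)      # run one step of the winning job
        except StopIteration:
            ready.pop(0)               # job finished; release it
            continue
        trace.append(name)
        if new_job is not None:
            ready.append(new_job)      # a job became ready during this step
    return trace


def work(n_steps, spawn=None):
    """A job of `n_steps` steps that may make another job ready in step 1."""
    for i in range(n_steps):
        yield spawn if i == 0 else None


guidance = (20, "guidance", work(2))
display = (5, "display", work(3, spawn=guidance))

trace = run([display])
print(trace)
# ['display', 'guidance', 'guidance', 'display', 'display']
```

The display job is pushed aside as soon as guidance becomes ready, then picks up exactly where it left off once the urgent work is done.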

This is asynchronous, priority-driven multiprogramming, and in the mid-1960s it was close to science fiction for a flight computer. Most machines of the era ran one program at a time, start to finish. The AGC ran a shifting, self-organizing ecology of work in which the most important thing always won.

The truly radical consequence shows up under overload — when there is simply more work than the processor can finish in the time available. A conventional fixed-sequence system, asked to do more than it can, falls behind on everything and eventually misses a critical deadline, with catastrophic results. The AGC did the opposite. Because it always served the highest priority first, an overload meant only that the lowest-priority jobs got squeezed: deferred, and if the overload was severe enough, never run at all. The autopilot kept flying; the navigation kept updating; the display, far down the priority list, might simply freeze for a moment. The system degraded gracefully instead of failing catastrophically. It shed load the way a sensible person triages a flood of work — drop the trivial, protect the vital.
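A deliberately crude model makes the triage visible. It assumes each job has a fixed cost, that a cycle has a fixed time budget, and that work is served strictly in priority order until time runs out; the job names, priorities, and costs are invented for illustration.

```python
def run_cycle(jobs, budget_ms):
    """Serve jobs in priority order until the time budget is exhausted.

    `jobs` is a list of (priority, name, cost_ms). Returns (ran, shed).
    Once one job fails to fit, it and everything below it is deferred.
    """
    ran, shed = [], []
    remaining = budget_ms
    out_of_time = False
    for priority, name, cost_ms in sorted(jobs, key=lambda j: j[0], reverse=True):
        if not out_of_time and cost_ms <= remaining:
            ran.append(name)
            remaining -= cost_ms
        else:
            out_of_time = True
            shed.append(name)
    return ran, shed


normal = [(30, "autopilot", 20), (20, "navigation", 40), (5, "display", 30)]
overloaded = normal + [(25, "spurious work", 35)]

print(run_cycle(normal, 100))
# (['autopilot', 'navigation', 'display'], [])
print(run_cycle(overloaded, 100))
# (['autopilot', 'spurious work', 'navigation'], ['display'])
```

With the extra load, the vital jobs still complete and only the display, at the bottom of the list, goes unserved for that cycle.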

The designers built in an explicit backstop for the worst case. If so many jobs piled up that the Executive ran out of core sets, or so many interpreter jobs that it ran out of VAC areas, the software did not silently corrupt itself or wander off into nonsense. It raised a program alarm — the now-famous 1201 (“no VAC areas”) and 1202 (“no core sets”) — and triggered a software restart. Which brings us to the second pillar of the design.

Restart Protection: A Machine That Heals

Hardware fails. Cosmic rays flip bits; power glitches; a marginal solder joint hiccups; or the software itself, overloaded, decides it must start over. On a spacecraft a quarter-million miles from home, a computer that needed a careful human reboot after every such event would be useless. So the AGC software was built around a remarkable assumption: a restart is a normal event, and the system must survive it almost invisibly.

The mechanism is restart protection, and it rests on a discipline of checkpointing. The flight software did not treat its long jobs as uninterruptible monoliths. Instead, critical work was broken into stages, and at well-chosen points each job registered a restart point — a marker saying, in effect, “if we have to start over, resume me here, with this state.” Important variables were kept in protected, redundant erasable storage so they would survive the upheaval. Transient, half-finished work that could not be trusted after a fault was simply discarded.

When a restart fired — whether from a hardware watchdog, a power transient, or a software alarm like 1202 — the computer did something that still feels almost magical. It wiped the volatile bookkeeping clean, then walked through its tables of restart points and re-established the important jobs from their last safe checkpoint, in priority order, while throwing away the transient debris. Within a fraction of a second the machine was running again, doing the same vital work it had been doing, as if nothing had happened. The navigation state was intact. The autopilot resumed. The astronauts, in the best case, saw only a momentary alarm light.
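The recovery sequence can be sketched as follows. This is a toy model under stated assumptions, not the AGC's actual restart tables: the class, its field names, and the job names are hypothetical, and ordinary Python dictionaries stand in for protected erasable storage.

```python
class FlightSoftware:
    """Toy model of checkpointing and restart recovery."""

    def __init__(self):
        # "Protected" storage: assumed to survive a restart.
        self.restart_table = {}   # job name -> (priority, restart point)
        self.protected = {}       # important variables
        # Volatile bookkeeping: rebuilt from scratch on every restart.
        self.active_jobs = []
        self.scratch = {}

    def checkpoint(self, job, priority, restart_point, **state):
        """Register where `job` should resume if a restart occurs."""
        self.protected.update(state)                         # save state first
        self.restart_table[job] = (priority, restart_point)  # then move the marker

    def restart(self):
        """Discard volatile state and re-establish checkpointed jobs."""
        self.active_jobs.clear()
        self.scratch.clear()
        by_priority = sorted(
            self.restart_table.items(),
            key=lambda item: item[1][0],
            reverse=True,
        )
        for job, (priority, restart_point) in by_priority:
            self.active_jobs.append((priority, job, restart_point))
        return self.active_jobs


sw = FlightSoftware()
sw.checkpoint("navigation", 20, "after_integration", position=(1.0, 2.0, 3.0))
sw.checkpoint("autopilot", 30, "start_of_cycle")
sw.active_jobs.append((5, "display refresh", None))  # never registered a restart point
sw.scratch["half_finished_sum"] = 0.42

print(sw.restart())
# [(30, 'autopilot', 'start_of_cycle'), (20, 'navigation', 'after_integration')]
print(sw.protected)  # {'position': (1.0, 2.0, 3.0)}
print(sw.scratch)    # {}
```

The unregistered display job and the half-finished scratch value vanish, while the two checkpointed jobs come back in priority order with their protected state intact.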

This is fault tolerance of a kind that would not become commonplace in computing for decades. The AGC did not merely detect errors; it was architected to recover from them, continuously, as a routine part of operation. It treated catastrophe as a transient. And it is exactly this property — priority scheduling that sheds low-priority load, plus restart protection that recovers cleanly — that saved the first lunar landing when the 1202 alarms rang out during Apollo 11’s descent. That story is Volume 12. For now, hold onto the principle: the AGC was designed not to be perfect, but to be unstoppable.

Figure 3 — Mission Control at the conclusion of an Apollo lunar landing. On the ground, flight controllers could read the AGC's program alarms; in the air, the computer's restart protection meant such alarms were survivable rather than fatal. File:Mission Control Center at conclusion of Apollo 15 lunar landing mission (S71-43429).jpg by NASA Johnson Space Center. License: Public domain. Via Wikimedia Commons.

The Interpreter, Briefly Revisited

We met the interpreter in Volume 6, and it deserves a paragraph here because it is woven into the scheduler. Navigation is mathematics in three dimensions: vectors, matrices, trigonometry, all in more precision than the AGC’s native 15-bit arithmetic could comfortably express. Coding that directly in the machine’s spartan instruction set would have been agonizing and would have consumed memory the program could not spare. So the Lab built a software interpreter — a virtual machine running atop the real one, offering pseudo-instructions for double- and triple-precision scalar, vector, and matrix operations, including transcendental functions and even a matrix-times-vector instruction. Navigation code written in this interpretive language was perhaps five to ten times slower than native code, but vastly more compact and far less error-prone — a trade the designers gladly made, because the navigation jobs were not the ones on the tightest deadlines. The interpreter is why a job needed that VAC area: it was the scratchpad for the interpreter’s vector arithmetic. Scheduler and interpreter were partners — the Executive decided when the heavy math ran; the interpreter made the heavy math writable at all.
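The idea of a virtual machine with vector-level pseudo-instructions can be shown in miniature. The mnemonics below (`LOADV`, `MATVEC`, `ADDV`, `STOREV`) are invented for this sketch and are not the AGC interpreter's actual instruction set or encoding; a single Python variable plays the role of the vector accumulator.

```python
def interpret(program, memory):
    """Run a list of (pseudo-instruction, operand name) pairs.

    `acc` stands in for the vector accumulator held in a job's VAC area.
    All mnemonics are hypothetical, chosen only for illustration.
    """
    acc = None
    for op, operand in program:
        if op == "LOADV":        # load a 3-vector into the accumulator
            acc = list(memory[operand])
        elif op == "MATVEC":     # accumulator = matrix times accumulator
            matrix = memory[operand]
            acc = [sum(row[c] * acc[c] for c in range(3)) for row in matrix]
        elif op == "ADDV":       # accumulator = accumulator + vector
            acc = [a + b for a, b in zip(acc, memory[operand])]
        elif op == "STOREV":     # store the accumulator back to memory
            memory[operand] = list(acc)
        else:
            raise ValueError(f"unknown pseudo-instruction: {op}")
    return memory


memory = {
    "v": [1, 2, 3],
    "rot_z_90": [[0, -1, 0], [1, 0, 0], [0, 0, 1]],  # 90-degree turn about z
    "offset": [10, 10, 10],
}
program = [
    ("LOADV", "v"),
    ("MATVEC", "rot_z_90"),
    ("ADDV", "offset"),
    ("STOREV", "result"),
]

interpret(program, memory)
print(memory["result"])  # [8, 11, 13]
```

Four compact pseudo-instructions replace what would otherwise be dozens of native multiply, add, and store operations, which is the compactness-for-speed trade described above.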

Building It

A beautiful architecture is one thing; tens of thousands of words of correct, flight-worthy code is another. The flight programs were written in AGC assembly language and assembled by a program called YUL, named, in the Lab’s dry humor, after “Yule,” the looming Christmas deadline that ruled every schedule. YUL ran on a Honeywell computer and was later reimplemented and renamed GAP (the General Assembly Program) as the toolchain modernized. The assembler turned source listings into the bit patterns that would, eventually, be woven into rope.

The flight software came in two great families, each a sprawling program in its own right. Colossus was the software for the Command Module; Luminary was the software for the Lunar Module. (Earlier and intermediate builds carried their own names — Sundance, Sundisk, Comanche, and others — as the code evolved release by release.) Each was a layered structure: the Executive and Waitlist at the core, the interpreter atop them, and above those the actual mission programs — the “major modes” the astronauts selected with the DSKY’s Verb 37. The total size is genuinely hard to pin to a single number, because it depends on which build you count and whether you include erasable data, but a fair characterization is that a mature flight load filled most of the available fixed memory — on the order of tens of thousands of 15-bit words, with the great majority of the roughly 36,000-word rope given over to program. Treat any single figure as approximate; the software grew almost until the day it had to freeze.

And freeze it did — literally. As Volume 5 described, the AGC’s program memory was core rope, hand-woven by skilled workers at a factory, a process that took weeks. Once weaving began for a given mission, the software was frozen: no patches, no last-minute fixes, because the bits were physically threaded through ferrite cores. This imposed a hard deadline of a severity modern programmers can barely imagine. The code had to be finished — tested, verified, signed off — weeks before flight, because after that it was woven into copper and there was no edit button. The pressure of that freeze drove the Lab toward extraordinary discipline in testing and simulation: every release was exercised on digital simulators and hybrid hardware rigs, hunting for the defects that could not be fixed once the rope was wound.

Overseeing the integration of each mission’s program was a role the Lab called the “rope mother” — the engineer responsible for shepherding a particular software load from a sprawl of modules into a single coherent, tested, freezable rope. The job was usually held by a man, but not always: Margaret Hamilton served as rope mother on Luminary, the Lunar Module’s software — a thread we pick up in the next volume.

Why It Mattered

Step back and consider what Laning, Hamilton, and the rest of the MIT Instrumentation Laboratory accomplished. With a processor of trivial speed and a scratchpad of a couple thousand words, they built a real-time operating system with priority-based preemptive scheduling, graceful degradation under overload, and automatic recovery from faults — a combination that commercial and military computing would spend the next two decades catching up to. They did it under a hardware-imposed freeze deadline that forbade the patch-it-later culture every later generation of programmer would lean on. And they did it for a machine on which a missed deadline or an unrecovered fault could kill three people.

The DSKY on the panel, with its glowing verbs and nouns, was the face of all this. Behind that face, the Executive and the Waitlist were quietly running the show — always serving the most important work, always ready to start over and survive. It is one of the great quiet triumphs in the history of engineering, and in the next two volumes we will meet the woman who turned it into a discipline, and the eleven seconds on the way to the Moon when it proved its worth.

Next — Volume 11: Margaret Hamilton and the Birth of Software Engineering.