How NASA built Artemis II’s fault-tolerant computer

The quote from the CMU guy about modern Agile and DevOps approaches challenging architectural discipline is a nice way of saying most of us have completely forgotten how to build deterministic systems. Time-triggered Ethernet with strict frame scheduling feels like it's from a parallel universe compared to how we ship software now.

Does anyone have pointers to some real information about this system? CPUs, RAM, storage, the networking, what OS, what language used for the software, etc etc?

I’d love to know how often one of the FCMs has “failed silent”, and where they were in the route and so on too, but it’s probably a little soon for that.

NASA didn't build this, Lockheed Martin and their subcontractors did. Articles and headlines like this make people think that NASA does a lot more than they actually do. This is like a CEO claiming credit for everything a company does.

I wonder how often problems happen that the redundancy solves. Is radiation actually flipping bits, and at what frequency? Can a solar flare cause all the computers to go haywire?

How big of a challenge are hardware faults and radiation for orbital data centers? It seems like you’d eat a lot of capacity if you need 4x redundancy for everything

Headline needs its how-dectomy reverted to make sense

The ARINC scheduler, RTOS, and redundancy have been used in safety-critical systems for decades; ARINC 653 goes back to the '90s. Most safety-critical microkernels, like INTEGRITY-178B and LynxOS-178B, came with a layer for that.

Their redundancy architecture is interesting. I'd be curious what innovations went into rad-hard fabrication, too. Sandia Secure Processor (aka Score) was a neat example of rad-hard, secure processors.

Their simulation systems might be helpful for others, too. We've seen more interest in that from FoundationDB to TigerBeetle.

Tesla’s Cybertruck uses that in its ethernet as well!

If you look at code as art, where its value is a measure of the effort it takes to make, sure.

You could even say that part of the value of Artemis is that we're remembering how to do some very hard things, including the software side. This is something that you can't fake. In a world where one of the more plausible threats of AI is the atrophy of real human skills -- the goose that lays the golden eggs that trains the models -- this is a software feat where I'd claim you couldn't rely on vibe code, at least not fully.

That alone is worth my tax dollars.

I take the opposite message from that line - out of touch teams working on something so over budget and so overdue, and so bureaucratic, and with such an insanely poor history of success, and they talk as if they have cured cancer.

This is the equivalent of Altavista touting how amazing their custom server racks are when Google just starts up on a rack of naked motherboards and eats their lunch and then the world.

Let's at least wait till the capsule comes back safely before touting how much better they are than "DevOps" teams running websites, apparently a comparison that's somehow relevant here to stoke egos.

Nice “well, actually”. I’m sure Lockheed were building this quad-redundant, radiation-hardened PowerPC that costs millions of dollars and communicates via Time-Triggered Ethernet anyway, whether NASA needed one or not.

Lockheed Martin and their subcontractors did the implementation.

We do not know how much of the high-level architecture of the system has been specified by NASA and how much by Lockheed Martin.

will nobody think of the megacorps!!!

Not a direct answer but probably as good information as you can get: https://static.googleusercontent.com/media/research.google.c...

Basically, yes, radiation does cause bit flips, more often than you might expect (still a rare event in the grand scheme of things, but often enough to matter).

And radiation in space is much “worse” (in quotes because that word glosses over a huge number of different problems, not just intensity).

You don't need 4x redundancy for everything. If no humans are aboard, you have 2x redundancy and immediately reboot if there is a disagreement.

They don't go into it here, but I thought that NASA also used something like 250nm chips in space for radiation resistance. Are there even any radiation-resistant GPUs out there?

(Off-topic:) Great word. Is that the usual word for it? Totally apt, and it should be the standard.

That was a laptop, not one of the Artemis computers.

In that case, our test infrastructure belongs in the Louvre…

If your implication is that stencil art does not take effort then perhaps you may not fully appreciate Banksy. Works like Gaza Kitty or Flower Thrower don’t just appear haphazardly without effort.

You mean like this?

"With limited funds, Google founders Larry Page and Sergey Brin initially deployed this system of inexpensive, interconnected PCs to process many thousands of search requests per second from Google users. This hardware system reflected the Google search algorithm itself, which is based on tolerating multiple computer failures and optimizing around them. This production server was one of about thirty such racks in the first Google data center. Even though many of the installed PCs never worked and were difficult to repair, these racks provided Google with its first large-scale computing system and allowed the company to grow quickly and at minimal cost."

https://blog.codinghorror.com/building-a-computer-the-google...

No, space is just hard.

Everything is bespoke.

You need 10x cost to get every extra '9' in reliability and manned flight needs a lot of nines.

People died on the Apollo missions.

It just costs that much.

One simply does not [“provision” more hardware|(reboot systems)|(redeploy software)] in space.

What would you suggest? Vibe coding a react app that runs on a Mac mini to control trajectory? What happens when that Mac mini gets hit with an SEU or even a SEGR? Guess everyone just dies?

> ...they talk as if they have cured cancer.

I'd chalk that up to the author of the article writing for a relatively nontechnical audience and asking for quotes at that level.

Probably, if it wasn’t already developed for DoD.

This is the equivalent of prompt engineering.

The problem they solved isn't easy. But it's not some insane technical breakthrough either. Literally add redundancy, that's the ask. They didn't invent quantum computing to solve the issue, did they? Why dunk on sprints?

Google then had complete regret not doing this with ECC RAM: https://news.ycombinator.com/item?id=14206811

Please, this is hacker news. Nothing else is hard outside of our generic software jobs, and we could totally solve any other industry in an afternoon.

Yep, spend 100 billion on what should have cost 1/50th that cost, and send people up to the moon with rockets that we are still keeping our fingers crossed won't kill them tomorrow, and we have to congratulate them for dunking on some irrelevant career?

No, of course not! It would be far better to have an openClaw instance running on a Mac Mini. We would only need to vibe code a 15s cron job for assistant prompting...

USER: You are a HELPFUL ASSISTANT. You are a brilliant robot. You are a lunar orbiter flight computer. Your job is to calculate burn times and attitudes for a critical mission to orbit the moon. You never make a mistake. You are an EXPERT at calculating orbital trajectories and have a Jack Parsons level knowledge of rocket fuel and engines. You are a staff level engineer at SpaceX. You are incredible and brilliant and have a Stanley Kubrick level attention to detail. You will be fired if you make a mistake. Many people will DIE if you make any mistakes.

USER: Your job is to calculate the throttle for each of the 24 orientation thrusters of the spacecraft. The thrusters burn a hypergolic monopropellant and can provide up to 0.44kN of thrust with a 2.2 kN/s slew rate and an 8ms minimum burn time. Format your answer as JSON, like so:

```json
{
  "x1": 0.18423,
  "x2": 0.43251,
  "x3": 0.00131,
  ...
}
```

one value for each of the 24 independent monopropellant attitude thrusters on the spacecraft, x1, x2, x3, x4, y1, y2, y3, y4, z1, z2, z3, z4, u1, u2, u3, u4, v1, v2, v3, v4, w1, w2, w3, w4. You may reference the collection of markdown files stored in `/home/user/geoff/stuff/SPACECRAFT_GEOMETRY` to inform your analysis.

USER: Please provide the next 15 seconds of spacecraft thruster data to the USER. A puppy will be killed if you make a mistake so make sure the attitude is really good. ONLY respond in JSON.

All I'm suggesting is to be humble about your mediocre solutions. This is not the only solution and not necessarily that ingenious. Why do you need to bring up vibecoding here? Because people who criticize arrogant NASA engineers are also AI idiots by default?

NOPE, RAD hardened space parts basically froze on mid 2000s tech: https://www.baesystems.com/en-us/product/radiation-hardened-...

Absolutely not, although the latest fabs with rad-tolerant processors are at ~20 nm. There are FDSOI processes in that generation that I assume can be made radiation-tolerant.

It seems not; anti-interference primarily relies on using older manufacturing processes, including for military equipment, and then applying an anti-interference casing or hardware redundancy correction similar to ECC.

Wow. What a hand wave away of the intrinsic challenge of writing fault tolerant distributed systems. It only seems easy because of decades of research and tools built since Google did it, but by no means was it something you could trivially add to a project as you can today.

A great version of this and how ex-DEC engineers saved Google and their choice of ECC RAM - inventing MapReduce and BigTable https://www.youtube.com/watch?v=IK0I4f8Rbis

It got them to where they need to be to then worry about ECC. This is like the dudes who deploy their blog on kubernetes just in case it hits front page of new york times or something.

I mean I can just replace Dropbox with a shell script.

Wild shit to be advising other people to be humble whilst talking directly out of your ass about technology you clearly do not understand and engineers you have no respect for.

Perhaps self-reflect.

Are you interested in sharing more details to make your claim more believable?

> fault tolerant distributed systems

I mean there were mainframes which could be described as that. IBM just fixed it in hardware instead of software, so it's not like it was an unknown field.

That's funny because you could! Dropbox started as a shell script :)

Funny though I would assume HN people would respect how hard real-time stuff and 'hardened' stuff is.

I think GP is referencing this somewhat [in]famous post/comment: https://news.ycombinator.com/item?id=8863#9224

The computer system aboard the current Artemis II lunar space mission is from a different world than the one from the Apollo era. Apollo astronauts navigated to the lunar surface using a computer with a 1-MHz processor and roughly 4 kilobytes of erasable memory, supported by a larger store of fixed “rope” memory. While it was a marvel of 1960s engineering, the Apollo Guidance Computer’s functional scope was focused and not in the control loop for every system. Critical environmental and power controls were managed through manual or electromechanical means, such as switches and relays.

This month’s Artemis II mission, carrying a crew of four around the Moon for the first time in over 50 years, is supported by one of the most fault-tolerant computer systems built for spaceflight. Unlike Apollo, the Orion capsule’s computing architecture manages nearly all of the vessel’s safety-critical functions, from life support to communication routing.

When a mission is 250,000 miles from Earth, failure is unrecoverable. There are no runways for emergency landings and no technicians to swap out a fried motherboard. Every subsystem must be designed to survive cosmic-ray bit flips, radiation-induced latch-ups, and hardware faults without a single second of downtime.

“We still architect to cover for hardware failures,” said Nate Uitenbroek, Software Integration and Verification Lead in NASA’s Orion Program at Johnson Space Center. “Along with physically redundant wires, we have logically redundant network planes. We have redundant flight computers. All this is in place to cover for a hardware failure.”

One of the biggest drivers for this redundancy is the harsh radiation environment of space, where high-energy particles can affect avionics and create ‘wrong answers’ that must be filtered out of the flight solution.

The Power of Eight

To ensure those wrong answers never reach the spacecraft’s thrusters, NASA moved beyond the triple redundancy of traditional systems. Orion utilizes two Vehicle Management Computers, each containing two Flight Control Modules, for a total of four FCMs. But the redundancy goes even deeper: each FCM consists of a self-checking pair of processors.

Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.
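A minimal sketch of the self-checking-pair idea, not Orion's actual implementation: two processors run the same computation, and the module emits a result only if both agree. The function names and the use of `None` to model "silent" are assumptions made for illustration.

```python
from typing import Callable, Optional


def self_checking_pair(
    cpu_a: Callable[[float], float],
    cpu_b: Callable[[float], float],
    sensor_input: float,
) -> Optional[float]:
    """Run the same computation on two processors; stay silent on any mismatch."""
    out_a = cpu_a(sensor_input)
    out_b = cpu_b(sensor_input)
    # Exact agreement is required. None models a "fail silent" module.
    return out_a if out_a == out_b else None


def healthy(x: float) -> float:
    # Hypothetical flight computation.
    return 2.0 * x + 1.0


def upset(x: float) -> float:
    # Hypothetical radiation upset: same computation with a corrupted result.
    return 2.0 * x + 1.0 + 0.5


good = self_checking_pair(healthy, healthy, 3.0)  # both agree, value is emitted
bad = self_checking_pair(healthy, upset, 3.0)     # mismatch, module goes silent
```

The pair cannot tell which processor was wrong, only that they disagree, which is why the safe response is to say nothing rather than guess.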

“A faulty computer will fail silent, rather than transmit the ‘wrong answer,’” Uitenbroek explained. This approach simplifies the complex task of the triplex “voting” mechanism that compares results. Instead of comparing three answers to find a majority, the system uses a priority-ordered source selection algorithm among healthy channels that haven’t failed silent. It picks the output from the first available FCM in the priority list; if that module has gone silent due to a fault, it moves to the second, third, or fourth.
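Priority-ordered source selection can be sketched in a few lines. This is an illustration under assumptions, not flight code: each channel's output is modeled as a float, a silent channel is modeled as `None`, and the name `select_source` is hypothetical.

```python
from typing import Optional, Sequence


def select_source(outputs: Sequence[Optional[float]]) -> Optional[float]:
    """Return the output of the first channel that has not failed silent.

    `outputs` is listed in priority order; None marks a silent channel.
    """
    for value in outputs:
        if value is not None:
            return value
    return None  # every channel in the list is silent


# The first two FCMs have gone silent, so the third one is selected.
selected = select_source([None, None, 0.42, 0.42])
```

Because a faulty channel never emits a value, the selector needs no comparison logic at all, which is the simplification over majority voting that the article describes.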

This level of redundancy is specifically scaled for the rigors of deep space. NASA anticipates transient failures during the Artemis II mission’s transit through the high-radiation Van Allen Belts.

“We can lose three FCMs in 22 seconds and still ride through safely on the last FCM,” said Uitenbroek. A silenced FCM doesn’t become dead weight, however; the system is designed to reset, re-synchronize its state with the operating modules, and re-join the group mid-flight.

Enforcing Determinism

Running multiple independent computers in lockstep is a notorious challenge in computer science, as slight timing drifts or processor variances can cause healthy computers to appear to diverge. NASA solves this through a strictly deterministic architecture.

This architectural discipline is increasingly rare in modern development. Michael Riley, a team lead at Carnegie Mellon’s Software Engineering Institute who previously collaborated with NASA to adapt risk-assessment tools for the Orion mission, noted that while earlier generations worked within strict hardware constraints, modern mission-critical development is different.

“Modern Agile and DevOps approaches prioritize iteration, which can challenge architectural discipline,” Riley explained. “As a result, technical debt accumulates, and maintainability and system resiliency suffer.”

Orion utilizes a time-triggered Ethernet network where time is distributed across the entire system. The flight software operates within “major frames” divided into “minor frames,” managed by an ARINC653-compliant scheduler. This architecture utilizes time and space partitioning to schedule partitions within these frames, ensuring inputs and outputs are perfectly aligned to the network schedule.
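As a rough illustration of the major-frame and minor-frame structure, the sketch below uses a fixed table to decide which partitions run in each minor frame. The frame count and partition names are invented for the example; the article does not give Orion's actual schedule.

```python
from typing import Dict, List

# Assumed value for illustration; the real frame layout is not given in the article.
MINOR_FRAMES_PER_MAJOR = 4

# Hypothetical static schedule: which partitions run in each minor frame.
SCHEDULE: Dict[int, List[str]] = {
    0: ["io_input", "guidance"],
    1: ["navigation"],
    2: ["control"],
    3: ["io_output", "health_monitor"],
}


def partitions_for_tick(tick: int) -> List[str]:
    """Map a global minor-frame counter to the partitions scheduled in it."""
    minor = tick % MINOR_FRAMES_PER_MAJOR
    return SCHEDULE[minor]


def frame_position(tick: int) -> tuple:
    """Return (major frame index, minor frame index) for a global tick."""
    return (tick // MINOR_FRAMES_PER_MAJOR, tick % MINOR_FRAMES_PER_MAJOR)
```

The schedule is a lookup table fixed ahead of time rather than a run-time decision, so every computer that shares the same tick counter runs the same partitions in the same order.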

“This architecture ensures that each FCM sees the same inputs, runs the same application code, and produces the same outputs,” said Uitenbroek. Every second, the drift of any individual FCM is measured and its local clock is recalibrated to the network’s ‘true’ time. If an application fails to meet its strict deadline, the module is automatically silenced, reset, and re-synchronized.
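The once-per-second drift measurement and the deadline check can be sketched as follows. This is a simplified model under stated assumptions: the drift is a constant rate error in parts per million, resynchronization snaps the local clock to network time, and the class and function names are hypothetical.

```python
class LocalClock:
    """A local clock that runs slightly fast or slow relative to network time."""

    def __init__(self, drift_ppm: float) -> None:
        self.rate = 1.0 + drift_ppm * 1e-6  # local seconds per true second
        self.time = 0.0

    def advance(self, true_dt: float) -> None:
        self.time += true_dt * self.rate

    def resync(self, network_time: float) -> float:
        """Measure the drift against network time, then snap to it."""
        error = self.time - network_time
        self.time = network_time
        return error


def meets_deadline(elapsed_ms: float, budget_ms: float) -> bool:
    """A task that overruns its budget would trigger silence, reset, and resync."""
    return elapsed_ms <= budget_ms


clock = LocalClock(drift_ppm=50.0)  # assumed drift, for illustration only
network_time = 0.0
errors = []
for _ in range(3):
    clock.advance(1.0)
    network_time += 1.0
    errors.append(clock.resync(network_time))
```

In this model the measured error stays at about 50 microseconds per second instead of accumulating, because each resynchronization removes the drift built up since the previous one.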

The hardware itself is also reinforced. The system employs triple-modular-redundant memory that self-corrects single-bit errors on every read. Even the network interface cards utilize two lanes of traffic that are constantly compared, ensuring that a bit flip in the communication fabric results in a fail-silent event rather than a corrupted command. The network itself is triple redundant with three separate planes, and all network switches employ self-checking strategies.
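Triple-modular-redundant memory is commonly explained as a bitwise majority vote across three stored copies. The sketch below shows that vote on integers; it is a generic illustration, and the article does not say how Orion's memory implements it.

```python
def tmr_read(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each output bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


stored = 0b10110010
flipped = stored ^ 0b00001000  # one copy suffers a single-bit upset

recovered = tmr_read(stored, flipped, stored)
```

A single flipped bit in any one copy is outvoted by the other two, so the read returns the original value without the software noticing.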

The Ultimate Fallback

While the four-FCM primary system is robust, NASA must still account for common mode failures—software bugs or catastrophic events that could theoretically impact all primary channels simultaneously.

To mitigate this, Orion carries a completely independent Backup Flight Software (BFS) system. This is a prime example of dissimilar redundancy. It is implemented on different hardware, runs a different operating system, and utilizes independently developed, simplified flight software.

“It is intentionally different to ensure that a common mode software failure in the primary flight software isn’t also implemented incorrectly on the backup,” Uitenbroek said. The BFS runs constantly in the background and automatically takes over via source selection if the primary computers fail. If the system finds itself on the BFS, it can complete all dynamic portions of the mission to reach a quiescent phase, at which point the crew can attempt to recover the primary FCMs.

Riley emphasized that while fail-silent logic is critical, it must be paired with active monitoring to avoid catastrophic gaps.

“If a software component fails silently, the failure may go undetected unless monitored by another component or watchdog timer,” he said. For mission assurance, he said, error detection and recovery mechanisms must be explicitly designed and correlated across multiple layers of the codebase to ensure consistent behavior.

Even in a total power loss scenario—called a “dead bus”—Orion is designed to survive. If power is restored, the spacecraft enters a safe mode, in which the vehicle first stabilizes itself and then points its solar arrays at the Sun to recover power. Then, it orients its tail toward the Sun for thermal stability before attempting to re-establish communication with Earth. During such a failure, the crew can also take manual action to configure life support systems or don space suits.
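The safe-mode recovery order described above can be written as a simple ordered sequence. The step names below paraphrase the article, and the enum and helper are hypothetical, not names from the flight software.

```python
from enum import Enum, auto
from typing import Optional


class SafeModeStep(Enum):
    STABILIZE = auto()
    POINT_ARRAYS_AT_SUN = auto()
    TAIL_TO_SUN = auto()
    REESTABLISH_COMMS = auto()


# Enum iteration preserves definition order, which is the recovery order here.
SEQUENCE = list(SafeModeStep)


def next_step(current: SafeModeStep) -> Optional[SafeModeStep]:
    """Return the step that follows `current`, or None once the sequence is complete."""
    idx = SEQUENCE.index(current)
    return SEQUENCE[idx + 1] if idx + 1 < len(SEQUENCE) else None
```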

A Future of Reliability

The changes from Apollo to Artemis represent a massive leap in software complexity. While Apollo’s AGC was a singular achievement, its mechanical fallbacks meant the computer wasn’t the sole arbiter of the crew’s survival. Today, with software managing every thermal valve and power relay, the challenge is ensuring that the software remains synchronized and valid amidst a barrage of cosmic radiation.

To reach this level of confidence, NASA now employs modern verification workflows. This includes full-environment simulations and Monte Carlo stress testing to model worst-case latencies and communication outages. High-performance supercomputers are used for large-scale fault injection, emulating entire flight timelines where catastrophic hardware failures are introduced to see if the software can successfully ‘fail silent’ and recover.
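A toy version of Monte Carlo fault injection is sketched below. It is an assumption-laden model and not NASA's tooling: faults arrive independently per channel per time step with a fixed probability, a faulted channel stays silent for a fixed number of steps before rejoining, and a trial counts as lost if all four channels are silent at once.

```python
import random


def run_trial(rng: random.Random, steps: int, p_fault: float,
              recovery_steps: int, channels: int = 4) -> bool:
    """Return True if at least one channel was healthy at every step."""
    down_for = [0] * channels  # remaining steps until each channel rejoins
    for _ in range(steps):
        for i in range(channels):
            if down_for[i] > 0:
                down_for[i] -= 1           # channel is resetting and resyncing
            elif rng.random() < p_fault:
                down_for[i] = recovery_steps  # injected fault: channel goes silent
        if all(d > 0 for d in down_for):
            return False                   # every channel silent at once
    return True


def estimate_loss_probability(trials: int, steps: int, p_fault: float,
                              recovery_steps: int, seed: int = 0) -> float:
    """Estimate the fraction of trials in which all channels were silent together."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    losses = sum(
        not run_trial(rng, steps, p_fault, recovery_steps) for _ in range(trials)
    )
    return losses / trials
```

Calling `estimate_loss_probability(10_000, 1_000, 0.001, 20)` gives an estimate for one made-up set of parameters; sweeping the fault rate and recovery time is the kind of stress test the article alludes to, at a vastly smaller scale.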

As spaceflight technology has historically seeded commercial advances, Orion’s zero-tolerance architecture offers a preview of a future where mainstream computing—from autonomous vehicles to industrial grids—can achieve the same always-on resilience that’s required for the stars.

Logan Kugler is a technology writer specializing in artificial intelligence based in Tampa, FL, USA. He has been a regular contributor to CACM for 15 years and has written for nearly 100 major publications.
