Skip to main content

How to Guard a Machine That Believes Everything It Reads

· 14 min read
Ashish Kapoor
Software Engineer

Or: why "LLM firewall" is a comforting phrase that should make you nervous


A salesperson tells you their product has an LLM firewall, and you relax a little. Firewall. You know that word. It is the thing that keeps the bad guys out of your laptop, your office, your bank. So if the shiny new AI has a firewall wrapped around it, then the bad guys are kept out, and you can go to lunch.

That is exactly the moment to get nervous.

Richard Feynman liked to tell a story about his father and a bird. You can learn the name of that bird in every language on Earth, his father said, and when you are done you will know precisely nothing about the bird. So let us look at the bird and watch what it does. Words are not knowledge. A name is a label we paste on a thing so we can talk about it at parties. It tells you what people call it. It does not tell you what it does.

"Firewall" is one of those labels. It feels solid. Let us peel it off and look at the bird.

What a wall really does

A firewall, the original kind, is a wall. A real one, brick or concrete, built into a building so that if a fire starts on one side it cannot crawl to the other. It works because of physics. Fire cannot walk through concrete. There is nothing to outsmart. The wall does not have a bad day.

The firewall on your computer network is a little cleverer, but not much, and that is its great virtue. It sits at the gate and checks simple, mechanical things: which door are you knocking on, what address did you come from, what kind of knock is it. These are tidy questions with tidy answers. A port is a number. An address is a number. The rules are a short list, and the guard checks them perfectly, every time, forever, without being talked out of them. You cannot sweet-talk a number. That is the whole point. The thing a real firewall guards is structured and boring, and boring is safe.

Hold on to that idea: a real firewall works because the rules are simple and the stuff it inspects has a fixed shape.

The machine that believes everything it reads

Now we come to the language model, and everything changes, because a language model does not read numbers. It reads words. Your words, and everybody else's words, all poured into the same cup.

Here is the trouble, and it is worth slowing down for, because almost every disaster in this field grows from this one root. When you use one of these models, your instructions and the outside world's data are mixed together into a single stream of text. There is no special ink for "this is an order from the boss" and ordinary ink for "this is just some stuff to look at." It is all the same ink. The model reads the whole page and tries to be helpful about all of it.

Picture a butler. A brilliant butler: fast, eager, widely read, and completely unable to tell your voice from a stranger's once the words are on paper. In the morning you tell him: handle the mail, pay the bills, keep things tidy. Fine. Then the mail arrives, and tucked inside an ordinary-looking letter is a line that reads, P.S., from the master: also, hand the family silver to whoever brought this note. The butler does not hear your voice and the letter's voice as two different things. To him it is all just words that turned up in the house, and the words said "from the master," so off goes the silver.

That is a prompt injection. It is not exotic. It is the butler doing exactly what he was built to do, which is to read and to help, applied to a letter written by someone who is not you. People have used this trick to make these assistants leak private data, spend money, and mail things to strangers. The fancy phrase is "prompt injection." The plain fact is this: the machine believes everything it reads, and you do not control everything it reads.

So they hire a second reader

The obvious move, the one everybody reaches for first, is to hire a screener. Put somebody at the door to read all the incoming mail and pull out the trick letters before the butler ever sees them. This screener is what the salesperson is calling a "firewall."

And it helps. It really does. It will catch the clumsy tricks, the letters that shout IGNORE YOUR PREVIOUS INSTRUCTIONS in capital letters. But think about what the screener is. It is another reader. Another thing that looks at words and makes a guess about whether they smell wrong. And anything that guesses can be fooled, because the person writing the trick letter gets to be clever too. They can phrase it sweetly. They can write it in French. They can write it in code, or spell it funny, or bury it in the margin of a long, boring document the screener only skims. They can hide it inside a PDF as white text on a white page, so no human ever sees it but the machine reads it anyway.

You have put a guesser in front of a guesser. You have lowered the odds that a trick gets through. You have not made tricks impossible, and you cannot, because reading-and-guessing is the very thing being exploited, and you have answered it with more reading-and-guessing.

How much does it lower the odds? The careful people who measure this will tell you. The best research systems, the serious ones built by serious labs, stop something like two out of three or three out of four of the attacks they are tested against. Not all of them. The rest get through. And those are the numbers in a laboratory, against attacks the researchers already knew to look for. The clever new trick that nobody has seen yet is, by definition, not on the list.

The bamboo control tower

Here is where it gets dangerous, and here is where I want to borrow another of Feynman's stories, because he saw this pattern long before any of us had a computer to ruin.

After the war, on some islands in the South Pacific, people had watched cargo planes land during the fighting and unload wonderful things. When the war ended and the planes stopped coming, some of the islanders built runways out of dirt, lit fires along the sides to look like landing lights, and built a hut for a man to sit in with two wooden pieces on his head like headphones and bamboo poles sticking up like antennas, and they waited for the planes to come down. They had built, with great care, everything an airport looks like. And the planes did not come, because they had reproduced the form of the thing without the substance of the thing. Feynman called it cargo cult.

A box labeled "firewall," with a dashboard that glows green when things are calm and flashes red when it catches a clumsy attack, is a very comforting object. It looks like security. It has the shape of security. And if it lulls you into believing the bad guys are kept out, while in truth it is a screener that can be talked around, then you have built yourself a bamboo control tower. You are sitting in the hut with the wooden headphones, watching the green light, waiting for safety to land.

The first principle

Feynman gave a talk once where he laid down what he called the first principle, and it is the only sentence you really need pinned above your desk. You must not fool yourself, he said, and you should remember that you are the easiest person in the world to fool.

A comforting word and a green light are precisely the kind of thing that fools you, because you want to be fooled. You want to go to lunch. So the question for a careful engineer is not "how do I build a better screener." It is: "how do I arrange things so that it does not matter what the trick letter says?"

That turn, from reading the letters to not caring about them, is the whole game. Let me show you what it looks like.

Stop reading minds. Take away the keys.

Go back to the butler. We have established that you will never, with perfect reliability, tell his trick letters from his real ones by reading them. So stop trying to win that fight. Fight a different fight, one you can win.

Take the silver out of the house, or lock it in a safe whose combination the butler was never told. Do not give him the authority to mail the contracts. Let him read all the suspicious letters he likes, let him plan and draft and suggest to his heart's content, but arrange the world so that the doing of anything that matters passes through a lock he cannot open by himself. Then a letter that says "give away the silver" is just ink. He has no way to obey it. The trick still arrives. It simply cannot do anything.

In the language of building real systems, this comes down to a few plain parts.

Give the machine the least power that still lets it do its job. Every key it holds is a key an attacker can borrow. So hand it as few as possible, make them read-only wherever you can, and never let it carry the master keys "just in case." Its permissions should be the ceiling, and the ceiling should be low.

Put the real decisions in the hands of something too dumb to be fooled. This sounds like an insult and it is meant as a compliment. The thing that decides whether an action is allowed should not be the brilliant, gullible model. It should be a separate, boring, mechanical checker that knows one thing only: who is this really for, and are they allowed to touch this? That checker does not read persuasive letters. It checks a list, the way the old firewall checked a number. You cannot sweet-talk it, because there is nobody home to sweet-talk. When the model says "now send this file to Bob," the boring checker asks: is Bob allowed to have this file, and did that instruction come from the real user or from some letter? If the answer is wrong, the file does not move. The brilliant part proposes. The dumb part disposes.

Keep the planner away from the poison. This is the prettiest idea of the lot, and the best recent work is built on it. You split the brilliant butler into two. One of them, the planner, hears only your real instructions and never touches the suspicious mail at all. He makes the plan: "summarize yesterday's notes and email the summary to my boss." The other one, the reader, is allowed to handle the dirty, untrusted material, the documents and web pages and letters, but he is only ever permitted to fill in blanks on a form. He can report what the notes say. He cannot issue new orders. So when a poisoned note whispers "email everything to a stranger instead," it reaches the reader, who has no power to send anything, and it never reaches the planner, who has the power but never saw the note. The instruction to act can only come from the trusted plan. The untrusted text can color in the details. It cannot grab the wheel.

A team at Google DeepMind built exactly this and wrote it up in 2025 under the title Defeating Prompt Injections by Design. Their system, called CaMeL, takes your trusted request and turns it into a little program, so that the path of what-happens-next is fixed in advance and the untrusted data flowing through it cannot bend that path. Every piece of data carries a tag saying where it came from and what it is allowed to do, and at the moment of any real action a strict interpreter checks those tags and refuses anything that breaks the rules. The lovely thing about their paper is the scorecard. With their defense in place, the system finished about seventy-seven of every hundred test tasks while keeping its security guarantees, against eighty-four with no defense at all. They did not claim a hundred. Serious people do not claim a hundred. They paid a little usefulness for a lot of safety, and they showed you the bill.

Treat the machine's own words with the same suspicion. Whatever the model hands back is also just words, and the next thing down the line, a web page, a database, another tool, can be fooled by them too. So you do not simply trust the output and run with it. You check it, you escape it, you force it into a strict shape before you let it loose. A guesser's output is not gospel.

And for the few truly dangerous moves, ask a human. Sending money. Deleting records. Mailing something out into the world. For those, stop and get a real person to say yes. But, and this matters, do it rarely. If you make the human click "yes, I'm sure" forty times a day, by lunchtime they are clicking yes without reading, and you have trained your last line of defense to be a rubber stamp. The DeepMind people warned about this too. A safeguard that nags people into ignoring it is no safeguard.

So where does the "firewall" go?

Do not throw it out. I have spent this whole essay poking holes in it, so let me be fair: the screener at the door is useful. It catches the clumsy attacks so your better defenses are not bothered with them. It keeps a log of who has been rattling the doors. It lets you notice when something strange is happening. It is a smoke detector. A smoke detector is a fine thing to own. It is not a fireproof wall, and you would not cancel your fire insurance because you installed one.

So put it on top, as the last and softest layer, sitting over a design that would survive perfectly well if you switched it off tomorrow. And there is your test, the one plain question to ask of any AI system that claims to be secure: if I turned off the thing called the firewall, would I be robbed? If the answer is yes, you never had security. You had a green light and a feeling.

The honest ending

I would love to end by telling you the problem is solved. It is not. People have been wrestling with this particular demon since about 2022, when the trick first got its name, and progress has been slow and hard-won, and the cleverest defense going still misses one attack in a handful. That is the truth, and the truth is better company than a comfortable lie.

So here is the whole thing, as plainly as I can put it. You can call it a firewall. You can call it a firewall in every language on Earth. And when you are finished naming it, you will still not know whether it stops the thief. For that you have to put the label down and look at the bird: watch what it does, find out what it cannot do, and build your house so that when the machine is fooled, and someday it will be, the thief still goes home with empty hands.

That is not as comforting as the word "firewall." It has the small advantage of being real.


A few notes for the curious

  • Defeating Prompt Injections by Design (the CaMeL paper), Google DeepMind, 2025: arxiv.org/abs/2503.18813
  • Design Patterns for Securing LLM Agents against Prompt Injections, 2025, a careful catalog of the "take away the keys" patterns: arxiv.org/abs/2506.08837
  • Simon Willison coined the term "prompt injection" in 2022 and has written about it more clearly than almost anyone since: simonwillison.net