Why AI Forgets What You Asked It To Do

If, like me, you are now a regular AI chatbot user, then you may have experienced what can best be described as prompt fade.

This is where you start a conversation and at first get results that look to be on track, but as the conversation progresses the model starts to ignore information and guidance that you gave earlier. You might even find yourself furiously typing, “But I already told you to focus on x, and now you have just completely ignored that”. In response you get, “You are absolutely right, I should not have ignored x …” or similar.

This is prompt fade: the point where the model still appears to understand the conversation, but no longer reliably follows the instruction that mattered most.

Why does this happen?

Here are some common explanations …

  • The model’s context window is limited, so your original prompt may have fallen outside that window.
  • Later messages, examples, quotes, web pages, or retrieved text can pull the model in a different direction.
  • A chat may accumulate old preferences, exceptions, corrections, examples, and side topics. The model then has to infer which constraints still matter. 
  • A user starts with one goal, then asks follow-ups that subtly change things. The model adapts to the recent turns and may lose the original framing.
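The first explanation is easy to picture in code. Here is a minimal sketch, assuming a chat system that simply keeps the most recent messages that fit its budget and drops the oldest. Real products may do something smarter, such as summarising or pinning the first instruction. The function name `fit_to_window` and the word-count stand-in for tokens are my own illustration, not any vendor's API.

```python
def fit_to_window(messages, max_tokens):
    """Keep the most recent messages that fit in the budget; drop the oldest."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = len(message.split())  # crude stand-in for a real tokenizer
        if used + cost > max_tokens:
            break  # everything older than this point is no longer visible
        kept.append(message)
        used += cost
    return list(reversed(kept))


conversation = [
    "Focus only on cost when comparing the options.",
    "Here is option A and a long description of it.",
    "Here is option B and a long description of it.",
    "Which option should I pick?",
]

visible = fit_to_window(conversation, max_tokens=25)
print(visible)
```

With a budget of 25 words, the three most recent messages fit and the opening instruction about cost does not, so a model that only sees `visible` never receives the rule it was supposed to follow.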

However, there is also a rather interesting paper that appears to offer yet another explanation.

Let’s dive into that then.

What has been discovered?

The researchers argue within their paper that current transformer-based LLMs have a significant limitation. They lack an explicit equivalent of human executive control of attention. That’s the system that helps us maintain a goal, suppress distractions, and resolve conflict between competing responses.

That matters because following an instruction is not the same as staying locked onto a goal.

A model can understand the instruction, repeat the instruction, and even solve a few examples, yet still drift into the wrong behaviour when the context becomes longer or more conflict-heavy.

That is the connection to prompt fade. The model may begin by doing exactly what you asked, but as the conversation grows, other patterns in the context can start to pull it away from the original task.

What is the impact of this? What does it really tell us?

LLMs will follow the instructions that we give. That, however, is not the same as remaining focused on the actual goal.

That is genuinely significant.

It is the same family of weakness you see when models do the following:

  • answer a different question from the one asked;
  • follow the pattern in the data rather than the stated rule;
  • get distracted by salient but irrelevant text;
  • degrade over long multi-step tasks;
  • “know” the right instruction but fail to apply it consistently;
  • etc…

To explore this, the researchers used a version of the Stroop test.

The what?

The Stroop Test

In 1935 the American psychologist John Ridley Stroop published a now famous paper titled “Studies of Interference in Serial Verbal Reactions”.

Within this paper he describes a psychology test that measures how well your brain can control attention and resist automatic responses. The classic version shows you color words printed in different colored ink. For example, the word RED might be printed in blue ink. You are asked to call out the color of the ink (blue) and ignore the fact that the word spells RED.

The difficulty comes from the fact that reading is automatic. When you see the word RED, your brain wants to read “Red” even if the ink is blue. To answer correctly, you have to suppress the automatic reading response and focus on the color.

The extra time this takes, or the increased error rate that results, is called the Stroop effect.
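To make the setup concrete, here is a small text-only sketch of the test. It assumes four colors and builds only incongruent items, where the word and the ink never match, then scores a list of answers. This is my own illustration and not the procedure used in Stroop's paper or by the researchers, who worked with images. The names `make_trials` and `score` are hypothetical.

```python
import random

COLORS = ["red", "blue", "green", "yellow"]


def make_trials(n, seed=0):
    """Build n incongruent trials: the word never matches its ink color."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        word = rng.choice(COLORS)
        ink = rng.choice([c for c in COLORS if c != word])
        trials.append({"word": word, "ink": ink})
    return trials


def score(trials, answers):
    """Return (accuracy, number of answers that read the word instead)."""
    correct = sum(a == t["ink"] for t, a in zip(trials, answers))
    read_word = sum(a == t["word"] for t, a in zip(trials, answers))
    return correct / len(trials), read_word


trials = make_trials(5)
perfect = [t["ink"] for t in trials]   # always names the ink
reader = [t["word"] for t in trials]   # always reads the word

print(score(trials, perfect))  # (1.0, 0)
print(score(trials, reader))   # (0.0, 5)
```

The second number is the interesting one. It counts how often the automatic response, reading the word, won out over the instructed one.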

The Stroop test probes several things:

  • Selective attention – Can you focus on the relevant information?
  • Inhibitory control – Can you suppress the wrong automatic response?
  • Processing speed – How quickly can you resolve the conflict?
  • Executive function – How well does the brain manage competing demands?

Now we get to the LLM bit of all this.

What happened when they tried this task with LLMs?

The Stroop task is visual and artificial, but the failure mode is familiar. The model is told one rule, then the input repeatedly tempts it toward another. That is not so different from a long chat where the model is told to focus on one criterion, but that then gets buried under examples, exceptions, retrieved text, and recent turns that pull it elsewhere.

When shown Stroop-style images, GPT-4o and Claude 3.5 Sonnet often start well, but performance collapses as the list gets longer. The paper reports GPT-4o dropping from 91% accuracy at 5 incongruent words to 15% at 40, while Claude 3.5 Sonnet drops to 24% at 40. In mixed lists, GPT-4o reportedly gets almost none of the longer incongruent color-naming items right. The paper contrasts this with humans, who usually remain highly accurate on long Stroop tasks even if they slow down.

The important finding is not that “AI failed a children’s brain test.” That would be the wrong lesson.

The real point is that current LLMs can recognise a task, explain the task, perform the task for a few examples, and still fail to maintain the task goal when the input becomes longer or more conflicted.

That distinction matters.

Instruction-following is not the same as robust goal maintenance. A model may understand the sentence “do not read the word, name the color,” but still drift back toward the stronger pattern in the data: reading the word. In ordinary chatbot use, the same kind of failure can appear when a model starts with your instruction clearly in mind, then gradually gets pulled away by later context, examples, retrieved material, or a more statistically tempting interpretation of the task.

That is why this paper feels important. It gives us a simple, psychologically meaningful stress test for a weakness many users already recognise: the model does not always stay locked onto the goal you gave it.

Why does this matter?

You can probably guess, but here it is.

Points of Potential Impact

LLMs may be least reliable precisely when the user gives them a simple rule that conflicts with a dominant pattern in the input.

Here are some specific examples, and remember that these are all things that people may strive to use AI for …

  • reading forms where the visible label conflicts with the required field
  • medical, legal, or financial workflows where the model must ignore irrelevant but tempting context
  • long document analysis where the model must maintain a narrow instruction throughout
  • agentic tasks where “do not do X” must remain active across many steps
  • visual reasoning tasks where text in an image competes with visual features
  • etc…

Can We Fix This?

“Scale it up and just have a bigger context window”, some might suggest.

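The problem there is that a bigger context window is still not executive control. It lets the model see more, but it does nothing to keep the goal active or to suppress distractions.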

What may actually be needed are explicit mechanisms such as the following:

  • maintaining goals over time;
  • detecting conflict between task instructions and default responses;
  • suppressing salient but irrelevant information;
  • checking whether current behaviour is still aligned with the instruction;
  • adapting after errors.
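As a rough illustration of the fourth item in that list, here is a minimal sketch of a wrapper that restates the goal at every step and checks each answer against it before accepting it. This is an assumption about what such a mechanism could look like from the outside, not something the paper proposes or tests. The names `run_with_goal_check`, `act`, and `violates` are hypothetical, and the two toy functions stand in for a real model call and a real verifier.

```python
def run_with_goal_check(goal, steps, act, violates, max_retries=2):
    """Run each step, checking the output against the goal before accepting it."""
    results = []
    for step in steps:
        prompt = f"Goal: {goal}\nStep: {step}"  # the goal is restated every step
        output = act(prompt)
        retries = 0
        while violates(goal, output) and retries < max_retries:
            prompt = (
                f"Goal: {goal}\nStep: {step}\n"
                "Your last answer broke the goal. Try again."
            )
            output = act(prompt)
            retries += 1
        # If retries run out, the last output is kept even if it still violates.
        results.append(output)
    return results


def toy_act(prompt):
    # Stand-in for a model call: it reads the word unless it was just corrected.
    return "blue" if "Try again" in prompt else "red"


def toy_violates(goal, output):
    # Stand-in for a verifier: the only acceptable answer here is the ink color.
    return output != "blue"


answers = run_with_goal_check(
    "name the ink color", ["RED in blue ink"], toy_act, toy_violates
)
print(answers)  # ['blue']
```

The check lives outside the model. The wrapper compensates for drift rather than curing it, and it is only as good as the `violates` check you can write.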

The Bottom Line

The paper is significant because it provides a simple, psychologically meaningful stress test showing that current high-performing LLMs can fail at sustained rule-following under interference.

These current LLMs can recognise the task and perform it locally, but they may not reliably maintain the task goal when the input becomes longer and contains a competing, more statistically dominant interpretation.

This is not evidence that AI has hit a dead end. It is evidence that one of its weaknesses is now easier to see. Current models can be astonishingly capable, but capability is not the same as sustained control. Once we understand where that control breaks down, we can begin to design systems that compensate for it, for example through better multimodal processing, verifier models, scratchpads, tool use, self-checking, or dedicated control layers.

In other words, the paper does not show that AI is useless. It shows why AI sometimes starts by doing exactly what you asked, and then quietly wanders off.

The Paper Itself

The paper was published in PNAS Nexus on June 2, 2026, and is titled “Deficient executive control in transformer attention”.

Neuroscience News also covered it on that same date with an article titled “Stroop Test Exposes Inherent LLM Flaw”. Here they summarised the result bluntly:

...when the task becomes longer and more conflicted, the models tend to fall back toward the dominant response — reading the word — rather than maintaining the instructed goal of naming the color. That is the same broad failure pattern users see when an AI initially follows a prompt, then gradually drifts away from it…
