INoT — Teaching AI to Reflect on Its Own Thoughts
If you’ve worked with large language models (LLMs) or built AI agents that invoke them, you’re familiar with the standard pattern: feed the model a prompt, let it generate output, maybe add chain-of-thought (CoT) prompting or self-critique, maybe iterate. The paper introduces a twist: what if the reasoning loop (draft → critique → revise) were embedded inside the prompt itself, so the LLM executes a “mini-program” of self-reflection rather than relying on external orchestration? That’s the core of the framework the authors call Introspection of Thought (INoT).
The Motivation
Standard agent frameworks build reasoning via external orchestration: you call the model for a draft, then again for a critique, then a third time for a revision. Each extra call adds token cost and latency, since the growing context is re-sent on every round trip. The authors argue for a more programmatic prompt style that keeps the reasoning loop internal to the model.
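For contrast, here is a minimal sketch of that external loop, written against a hypothetical `call_llm` helper (standing in for whichever client you actually use); every refinement pass costs two extra round trips, each resending the accumulated context.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a single LLM API call."""
    raise NotImplementedError  # plug in your client of choice


def draft_critique_revise(task: str, rounds: int = 2) -> str:
    # External orchestration: every step is a separate model invocation.
    answer = call_llm(f"Solve the following task:\n{task}")
    for _ in range(rounds):
        critique = call_llm(
            f"Critique this answer.\nTask: {task}\nAnswer: {answer}"
        )
        answer = call_llm(
            f"Revise the answer using the critique.\n"
            f"Task: {task}\nAnswer: {answer}\nCritique: {critique}"
        )
    return answer
```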
PromptCode – a program‑style prompt
Their proposed solution is a new prompt format called PromptCode: it uses XML-style tags and mixes pseudo-Python with natural language to specify the logic (roles, rounds, critique, adjustment). The idea is that the LLM reads and “executes” this code inside the prompt, rather than relying on you to orchestrate multiple external calls. You can think of it as embedding a small debate framework inside a single LLM invocation.
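As a rough sketch, a PromptCode prompt is just a structured string you interpolate the task into before making one call. The tag names below mirror the example later in this post, not necessarily the paper’s exact schema.

```python
# Sketch: assembling a PromptCode-style prompt as a plain string template.
PROMPT_CODE_TEMPLATE = """<PromptCode>
  <Rule>
     MaxRounds = 5
     Agreement = False
  </Rule>
  <Role>
     Agent_A = DebateAgent(Task)
     Agent_B = DebateAgent(Task)
  </Role>
  <Stage>
     # pseudo-Python the model is asked to follow internally:
     # reason, critique, rebut, adjust until Agreement or MaxRounds
     Output(FinalResult)
  </Stage>
</PromptCode>

Task: {task}
"""


def build_prompt(task: str) -> str:
    return PROMPT_CODE_TEMPLATE.format(task=task)
```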
The INoT Framework
With PromptCode set up, the INoT framework guides the model to generate its reasoning internally:
- Agent A gives a result and its thought.
- Agent B independently gives its own result and thought.
- Then come iterative rounds: A and B exchange arguments, critiques, rebuttals, and adjustments until they agree or reach the maximum number of rounds.
 
Because all of this happens within the prompt and a single model invocation, the token I/O is reduced and the reasoning loop lives inside the LLM.
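Here is a minimal sketch of what that single invocation can look like, assuming an OpenAI-compatible chat API (the `openai` Python client, an API key in the environment, and the hypothetical `build_prompt` helper from the sketch above); none of this is prescribed by the paper.

```python
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()  # reads the API key from the environment


def run_inot(prompt_code_prompt: str, model: str = "gpt-4o-mini") -> str:
    """One call: the draft/critique/revise debate runs inside a single completion."""
    response = client.chat.completions.create(
        model=model,        # any capable instruction-following model
        temperature=0.2,    # low temperature keeps the internal debate from diverging
        messages=[{"role": "user", "content": prompt_code_prompt}],
    )
    return response.choices[0].message.content


# e.g. answer = run_inot(build_prompt("Is 2027 a prime number?"))
```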
Highlights
- ⚡ Efficiency: the authors report an average token-cost reduction of 58.3% compared to baseline reasoning frameworks.
- 🎯 Performance: across six benchmarks spanning math, code, and QA tasks, INoT consistently outperformed baseline methods.
- 🧠 Multimodal support: an ImageAugment module extends the approach to image + text reasoning tasks.
- 💡 Developer-friendly: instead of building complex orchestration across multiple API calls, you craft a richer prompt.
 
Example Prompt
<PromptCode>
  <Rule>
     MaxRounds = 5
     Agreement = False
     Counter = 0
  </Rule>
  <Role>
     Agent_A = DebateAgent(Task)
     Agent_B = DebateAgent(Task)
  </Role>
  <Stage>
     result_A, thought_A = Agent_A.reason()
     result_B, thought_B = Agent_B.reason()
     while (not Agreement and Counter < MaxRounds):
         Counter += 1
         argument_A = Agent_A.reason()
         argument_B = Agent_B.reason()
         critique_A = Agent_A.critique(argument_B)
         critique_B = Agent_B.critique(argument_A)
         rebuttal_A = Agent_A.rebut(critique_B)
         rebuttal_B = Agent_B.rebut(critique_A)
         result_A, thought_A = Agent_A.adjust(rebuttal_B)
         result_B, thought_B = Agent_B.adjust(rebuttal_A)
         Agreement = (result_A == result_B)
      # Both results match when Agreement is True; otherwise fall back to Agent_A's latest result
      FinalResult = result_A
     Output(FinalResult)
  </Stage>
</PromptCode>
This embedded debate logic allows an LLM to simulate self‑reflection and consensus building inside one reasoning session.
Practical Insights
- Define clear agreement criteria (the example above uses exact match: result_A == result_B).
- Limit the maximum number of rounds to control token cost (see the sketch below).
- Use a low sampling temperature to prevent the two agents from diverging.
- Maintain structured XML-style tags for clarity.
- For multimodal tasks, add the ImageAugment module.
- Always perform ablation testing to confirm the embedded debate actually improves results.
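A tiny illustration of the first three points, with knobs you can surface in your own wrapper (the names are mine, not from the paper):

```python
# Illustrative defaults tying the first three insights to concrete settings.
INOT_DEFAULTS = {
    "max_rounds": 5,       # substituted into the <Rule> block as MaxRounds
    "temperature": 0.2,    # low sampling temperature so the two agents do not diverge
    "agreement": "exact",  # clear criterion: result_A == result_B
}
```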
 
Why this matters for developers
If you build LLM‑powered agents—chatbots, coding assistants, or multimodal reasoning systems—this pattern lets you move the control logic inside the prompt. You get leaner, more introspective agents that can self‑reflect and refine without expensive orchestration overhead.

