Agentic Framework

AgentFlow & Flow-GRPO

Agents fail because they can’t learn from their own mistakes mid-task. AgentFlow solves this by training the planner directly inside the multi-turn reasoning loop using Flow-GRPO — a technique that converts sparse, end-of-trajectory rewards into turn-level learning signals. The result: a 7B-parameter agent that outperforms GPT-4o on complex tasks.

Framework Context: 2025–2026

Introduced: AgentFlow was developed by Pan Lu and collaborators at Stanford, with the paper “In-the-Flow Agentic System Optimization” published in October 2025. It addressed the fundamental credit assignment problem in multi-turn agentic systems: when an agent takes 15 steps to complete a task and receives a single reward at the end, how does each individual step learn whether it contributed to success or failure? Previous approaches trained agents offline and froze them at inference time. AgentFlow introduced “in-the-flow” optimization — training the planner within the actual reasoning loop.

Modern LLM Status: AgentFlow was accepted at ICLR 2026 and represents the cutting edge of agentic system training. Its Flow-GRPO algorithm achieved average accuracy gains of 14.9% on search tasks, 14.0% on agentic benchmarks, 14.5% on mathematical reasoning, and 4.1% on scientific tasks — with a 7B-parameter backbone outperforming larger proprietary models. The technique ranked #2 on Hugging Face Daily Papers and was accepted to the NeurIPS 2025 Efficient Reasoning Workshop. AgentFlow points toward a future where agents improve in real time during deployment, not just during offline training.

The Core Insight

Train the Planner While It Plans

Current agentic systems have a fundamental disconnect: the planner is trained offline on static datasets, then frozen at deployment. It cannot learn from its own successes and failures during actual task execution. When a research agent takes a wrong turn on step 3 of a 10-step task, the planner has no way to learn that step 3 was the mistake — it only knows the final outcome was wrong.

AgentFlow bridges this gap by optimizing the planner inside the reasoning loop itself. It coordinates four specialized modules — a planner, an executor, a verifier, and a generator — through evolving shared memory. The planner proposes actions, the executor carries them out, the verifier checks results, and the generator produces outputs. Crucially, Flow-GRPO trains the planner by broadcasting the final trajectory reward to every turn, using group-normalized advantages to determine which planning decisions were good and which were bad.

Think of it like a chess player who can review each move of a lost game and understand exactly which move was the turning point. Instead of just knowing “I lost,” the player understands “move 7 was the critical error because it ignored the center.” Flow-GRPO gives this per-move feedback to the planner at every turn of the agent’s execution.

Four Modules, One Evolving Memory

AgentFlow’s architecture separates concerns into four modules that communicate through shared memory. The Planner decides what to do next based on the goal and current state. The Executor carries out the planned action using available tools. The Verifier checks whether the action succeeded and updates the shared memory with observations. The Generator produces the final output when the task is complete. This separation allows each module to be independently optimized — and Flow-GRPO specifically targets the planner, the most critical module for overall task success.
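The four-module loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual API: all names (SharedMemory, planner, executor, verifier, generator) are placeholders chosen to mirror the prose, and the modules here are trivial stubs where the real system uses LLM-backed components.

```python
from dataclasses import dataclass, field

# Illustrative sketch: four modules that communicate only through shared memory.
@dataclass
class SharedMemory:
    goal: str
    observations: list = field(default_factory=list)  # facts written by the verifier
    done: bool = False

def planner(mem):            # decides the next action from goal + current state
    return {"tool": "web_search", "query": mem.goal}

def executor(action):        # carries out the planned action using a tool
    return f"results for {action['query']}"

def verifier(mem, result):   # checks the result and updates shared memory
    mem.observations.append(result)
    mem.done = len(mem.observations) >= 2   # toy completion criterion

def generator(mem):          # produces the final output from accumulated state
    return " | ".join(mem.observations)

mem = SharedMemory(goal="solid-state batteries")
while not mem.done:
    verifier(mem, executor(planner(mem)))
answer = generator(mem)
```

The key structural point is that the planner never talks to the executor directly: every module reads from and writes to the evolving shared memory, which is what lets each module be optimized independently.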

The AgentFlow Process

Five stages from agent architecture to in-the-flow optimization

Step 1: Define Agentic Modules

Set up the four-module architecture: Planner, Executor, Verifier, and Generator. Define the tools available to the Executor (web search, code execution, calculators, etc.) and the evaluation criteria for the Verifier. Initialize the shared memory structure.

Example

For a research agent: Planner decides which sources to search, Executor runs web searches and reads documents, Verifier checks whether found information is relevant and consistent, Generator synthesizes findings into a report. Shared memory tracks: sources visited, facts gathered, contradictions found.
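Defining the Executor's tools amounts to registering a set of callables the planner can select among. The sketch below is hypothetical: the stub functions stand in for real search and browsing APIs, and the registry-by-name dispatch is one common way to wire this up, not the paper's prescribed interface.

```python
# Hypothetical tool registry for a research agent; real tools would wrap
# actual search/browse APIs rather than returning stub strings.
def web_search(query: str) -> list:
    return [f"result about {query}"]          # stub

def read_url(url: str) -> str:
    return f"contents of {url}"               # stub

TOOLS = {"web_search": web_search, "read_url": read_url}

# The Executor dispatches the Planner's chosen action by tool name:
def execute(tool: str, **kwargs):
    return TOOLS[tool](**kwargs)

res = execute("web_search", query="quantum computing")
```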

Step 2: Execute Multi-Turn Loop

The agent runs its planning-execution-verification cycle on training tasks. At each turn, the Planner proposes an action, the Executor carries it out, and the Verifier updates the shared memory. This continues until the task is complete (or a step limit is reached). All intermediate states are recorded.

Example

Turn 1: Planner decides to search “quantum computing applications 2025.” Executor searches. Verifier confirms 3 relevant results. Turn 2: Planner reads the top result. Turn 3: Planner searches for a specific claim to verify. Turn 4: Verifier flags a contradiction between sources. Turn 5: Planner searches for a third source to resolve it. After 8 turns, Generator produces the final report.
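The rollout loop with a step limit and state recording might look like the following sketch. The names and the trivial stub modules are illustrative; the essential feature is that every (state, action) pair is stored so the later credit-assignment stage has per-turn records to work with.

```python
# Sketch of stage 2: run the plan/execute/verify cycle up to a step limit,
# recording intermediate states for later per-turn credit assignment.
def rollout(planner, executor, verifier, task, max_turns=15):
    memory = {"task": task, "facts": []}
    trajectory = []
    for _ in range(max_turns):
        action = planner(memory)
        trajectory.append((len(memory["facts"]), action))  # snapshot state, action
        observation = executor(action)
        if verifier(memory, observation):                  # verifier updates memory
            break
    return trajectory, memory

# Trivial stubs to show the shape of a run:
plan = lambda mem: f"search #{len(mem['facts']) + 1}"
act = lambda a: f"obs for {a}"
def check(mem, obs):
    mem["facts"].append(obs)
    return len(mem["facts"]) == 3   # "task complete" after 3 facts

traj, mem = rollout(plan, act, check, task="demo")
```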

Step 3: Collect Trajectory Rewards

After the task completes, evaluate the final output against ground truth. This produces a single trajectory-level reward: did the agent succeed? The challenge is that this one signal must somehow inform 8–15 individual planning decisions that happened during execution.

Example

The research report is evaluated: 85% factual accuracy, all claims sourced, contradiction correctly resolved. Trajectory reward: 0.85. But which of the 8 turns contributed most to this success? Was it the initial search strategy? The verification step? The contradiction resolution? Flow-GRPO will determine this.
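A trajectory reward like the 0.85 in this example is a single scalar computed only after the run ends. The scoring function below is made up for illustration (the weights and criteria are assumptions, not the benchmark's actual rubric); what matters is that the whole multi-turn run collapses to one number.

```python
# Illustrative trajectory-level reward: one scalar for the entire run.
# Weights and criteria here are invented for the example.
def trajectory_reward(report):
    score = 0.0
    score += 0.6 * report["factual_accuracy"]
    score += 0.2 * float(report["all_claims_sourced"])
    score += 0.2 * float(report["contradiction_resolved"])
    return score

r = trajectory_reward({"factual_accuracy": 0.75,
                       "all_claims_sourced": True,
                       "contradiction_resolved": True})
# One number for an 8-turn run; spreading it across turns is Flow-GRPO's job.
```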

Step 4: Flow-GRPO Credit Assignment

Flow-GRPO converts the multi-turn optimization problem into a sequence of tractable single-turn updates. It broadcasts the trajectory reward to every turn, then uses group-normalized advantages to differentiate: within a group of trajectories attempting the same task, every turn of an above-average trajectory receives a positive advantage, while every turn of a below-average one receives a negative advantage. This assigns credit without needing per-step labels.

Example

The same task is attempted 4 times with different planning strategies. Two succeed (rewards 0.85 and 0.90), two fail (rewards 0.3 and 0.4). Flow-GRPO compares the planning decisions: in successful runs, the planner verified claims before citing them. In failed runs, it skipped verification. The “verify before cite” planning pattern receives a strong positive advantage signal.
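The broadcast-and-normalize step can be written compactly. This is a simplified sketch of the group-normalized advantage computation (the standard GRPO form: subtract the group mean reward, divide by the group standard deviation, then broadcast each trajectory's advantage to all of its turns); the paper's full objective adds clipping and other machinery not shown here.

```python
import statistics

# Sketch of Flow-GRPO credit assignment: broadcast each trajectory's final
# reward to every turn, normalized against the group attempting the same task.
def flow_grpo_advantages(group_rewards, turns_per_traj):
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0   # guard against zero spread
    advantages = []
    for r, n_turns in zip(group_rewards, turns_per_traj):
        a = (r - mu) / sigma              # group-normalized trajectory advantage
        advantages.append([a] * n_turns)  # broadcast to every turn
    return advantages

# The example from the text: 4 attempts, rewards 0.85, 0.90, 0.3, 0.4.
adv = flow_grpo_advantages([0.85, 0.90, 0.3, 0.4], [8, 7, 5, 6])
# Turns in the successful runs get positive advantage; turns in failed runs, negative.
```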

Step 5: Update Planner Policy

The Planner’s weights are updated using the advantage signals from Flow-GRPO. Successful planning patterns are reinforced; unsuccessful ones are suppressed. Because this happens within the flow of task execution (not offline), the planner continuously improves its decision-making for the specific types of tasks it encounters.

Example

After training on 500 research tasks, the Planner has learned: (1) always verify controversial claims from a second source, (2) search with specific terms before broad ones, (3) when contradictions arise, search for a third authoritative source. These patterns emerged from Flow-GRPO’s credit assignment, not from explicit programming.
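The update itself is a policy-gradient step weighted by the per-turn advantage. The toy below uses a two-action softmax "planner" and the plain REINFORCE-with-advantage gradient, which is a deliberate simplification: the real system updates a 7B-parameter model with a clipped PPO-style objective, and the action names here are invented to echo the example above.

```python
import math

# Toy sketch of stage 5: a two-action softmax planner nudged by the
# group-normalized advantages from credit assignment.
logits = {"verify_first": 0.0, "cite_directly": 0.0}

def probs(lg):
    z = sum(math.exp(v) for v in lg.values())
    return {k: math.exp(v) / z for k, v in lg.items()}

def update(action, advantage, lr=0.5):
    p = probs(logits)
    for a in logits:                                  # grad of log-prob wrt logits
        grad = (1.0 - p[a]) if a == action else -p[a]
        logits[a] += lr * advantage * grad            # reinforce or suppress

# Successful trajectories used "verify_first" (positive advantage);
# failed ones used "cite_directly" (negative advantage).
for _ in range(20):
    update("verify_first", +0.9)
    update("cite_directly", -0.9)

p = probs(logits)   # the planner now strongly prefers verifying before citing
```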

See the Difference

Static agents versus in-the-flow optimized agents

Static Agent

Approach

Agent is trained offline on static datasets. At deployment, the planner uses frozen weights. It follows the same planning strategy regardless of what it observes during execution. When a search returns poor results, it tries the same approach again.

Problems

Cannot learn from execution feedback. Makes the same planning mistakes repeatedly. Performance degrades on tasks that differ from training distribution. No mechanism to improve the planner based on real-world outcomes.

Frozen planner, no in-task learning, repeated mistakes
VS

AgentFlow

Approach

Four coordinated modules with shared evolving memory. Flow-GRPO trains the planner using trajectory rewards broadcast to every turn. Group-normalized advantages identify which planning decisions led to success. The planner continuously improves its strategy.

Result

14.9% accuracy gain on search tasks, 14.0% on agentic benchmarks. A 7B model outperforms GPT-4o on complex multi-step tasks. The planner learns task-specific strategies — like always verifying claims and adapting search queries based on initial results.

In-flow optimization, per-turn credit, continuous improvement


AgentFlow in Action

See how in-the-flow optimization transforms agent performance

Task Setup

Goal: Answer the question “What are the three most significant advances in solid-state battery technology since 2023, and which companies are leading each advance?”

Tools: web_search(query), read_url(url), calculate(expression), verify_claim(claim, source)

Modules: Planner (decides search strategy), Executor (runs tools), Verifier (checks relevance and accuracy), Generator (writes final answer).

Training context: This planner was optimized by Flow-GRPO on 500 similar research questions. It learned three key strategies from trajectory analysis.

Optimized Agent Execution

Turn 1 — Planner (learned strategy: broad then specific): “Start with a broad survey search to identify candidate advances, then deep-dive on each.” Executor searches “solid-state battery breakthroughs 2023 2024 2025.”

Turn 2 — Verifier: Found 6 candidate advances across 4 sources. Three appear in multiple sources: sulfide electrolyte scaling, dry electrode processing, and silicon anode integration. Memory updated with candidates.

Turn 3 — Planner (learned strategy: verify before committing): “Before selecting the top 3, verify each candidate with a dedicated search.” Executor searches each candidate individually.

Turn 4 — Verifier: Silicon anode integration appears overhyped — one source calls it “promising but unproven.” Flags for re-evaluation. Finds a fourth candidate: solid-state manufacturing cost reduction at Toyota.

Turn 5 — Planner (learned strategy: resolve ambiguity with authoritative sources): Searches specifically for peer-reviewed or company-announced results for each candidate. Confirms 3 advances with company attributions.

Turn 6 — Generator: Produces a structured answer with three advances, leading companies, and evidence trail. Each claim links to the verification step that confirmed it.

The Planner’s three learned strategies (broad-then-specific, verify-before-commit, resolve-with-authority) emerged from Flow-GRPO training, not from explicit programming. Always verify AI research summaries against primary sources.

Task Setup

Goal: Implement a rate limiter middleware for a Node.js Express API that supports per-user limits, sliding window algorithm, and Redis-backed storage.

Tools: write_code(file, content), run_tests(path), execute_shell(command), read_file(path)

Training context: Planner trained by Flow-GRPO on 300 coding tasks. Learned pattern: “test incrementally” outperforms “build everything then test.”

Optimized Agent Execution

Turn 1 — Planner (learned: scaffold first): “Create the middleware skeleton with a basic fixed-window counter before adding complexity.” Writes basic middleware.

Turn 2 — Planner (learned: test early): Writes 3 tests for the basic counter: normal request, at limit, over limit. Runs tests. All pass.

Turn 3 — Planner: “Upgrade to sliding window algorithm.” Rewrites the counting logic. Runs existing tests — 2 pass, 1 fails (edge case at window boundary).

Turn 4 — Verifier: Identifies the boundary bug: requests at exactly the window transition are double-counted. Planner adjusts the timestamp comparison from <= to <.

Turn 5 — Planner: “Add Redis backend.” Replaces in-memory store with Redis. Adds 2 more tests for Redis persistence. Runs all 5 tests — all pass.

Turn 6 — Planner (learned: add edge cases last): Adds tests for: Redis connection failure (graceful degradation), concurrent requests (race condition), per-user isolation. Finds race condition bug, fixes with Redis MULTI/EXEC.

Final: 8 tests passing, complete middleware with sliding window, Redis backing, and graceful degradation. The “incremental build-test” pattern caught 2 bugs early that a “build-everything-first” approach would have made much harder to diagnose. AI-generated code should always be reviewed for security vulnerabilities before deployment.
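The boundary bug from Turn 4 is easy to reproduce. The task in this walkthrough is Node.js, but the same logic in Python shows why the inclusive comparison double-counts: with `<=`, a request exactly one full window old is still treated as live at the window transition, so it is counted in both windows. This is a hypothetical reconstruction of the bug, not the agent's actual code.

```python
# A sliding-window limiter keeps recent request timestamps and prunes old ones.
# The `inclusive` flag toggles between the buggy (<=) and fixed (<) comparison.
def allow(timestamps, now, window=60, limit=2, inclusive=False):
    if inclusive:   # buggy: a request exactly `window` old is still counted
        live = [t for t in timestamps if now - t <= window]
    else:           # fixed: strict comparison expires it at the transition
        live = [t for t in timestamps if now - t < window]
    live.append(now)
    return len(live) <= limit

# Requests at t=0 and t=30, then a third exactly at the window boundary t=60:
buggy = allow([0, 30], now=60, inclusive=True)    # t=0 still counted -> denied
fixed = allow([0, 30], now=60, inclusive=False)   # t=0 expired -> allowed
```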

Task Setup

Goal: Determine whether a proposed drug compound (provided as a SMILES string) is likely to cross the blood-brain barrier, using computational chemistry tools and literature search.

Tools: calculate_molecular_properties(smiles), search_literature(query), predict_logP(smiles), compare_to_known_compounds(properties)

Training context: Planner trained by Flow-GRPO on 200 drug property prediction tasks. Learned to cross-validate computational predictions against literature evidence.

Optimized Agent Execution

Turn 1 — Planner (learned: compute then validate): “Calculate molecular properties first to establish a quantitative baseline.” Executor computes: molecular weight = 324 Da, hydrogen bond donors = 2, hydrogen bond acceptors = 4, polar surface area = 65 Å².

Turn 2 — Planner: “Predict logP for lipophilicity assessment.” Executor: predicted logP = 2.8. Memory updated: “All Lipinski properties within BBB-favorable range.”

Turn 3 — Planner (learned: compare to known compounds): “Find structurally similar compounds with known BBB permeability.” Executor finds 3 analogs: 2 are BBB-permeable, 1 is not.

Turn 4 — Verifier: The non-permeable analog has a similar molecular weight but a higher polar surface area (89 Å² vs the query compound’s 65 Å²). Notes this as a distinguishing feature that favors the query compound.

Turn 5 — Planner (learned: search for contradicting evidence): “Search literature for any known efflux transporter interactions with this scaffold.” Finds one paper reporting P-glycoprotein efflux for a related scaffold — a potential concern.

Turn 6 — Generator: Produces a structured assessment: “Likely BBB-permeable based on physicochemical properties (MW 324, logP 2.8, PSA 65) and structural analogy to 2 known permeable compounds. Caveat: related scaffolds show P-gp efflux interaction that could reduce effective permeability. Recommend experimental PAMPA-BBB assay to confirm.”

The learned “search for contradicting evidence” pattern was the key differentiator — untrained agents consistently missed the efflux concern. AI predictions in drug discovery must always be validated experimentally.
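The physicochemical screen from Turns 1–2 can be expressed as a simple rule check. The thresholds below are common rules of thumb for CNS penetration (low molecular weight, few hydrogen bond donors/acceptors, polar surface area under roughly 90 Å², moderate logP); they are assumptions for illustration, not the agent's actual model, and as the walkthrough stresses, no rule of thumb replaces an experimental assay.

```python
# Hypothetical BBB-favorability screen using rule-of-thumb thresholds.
def bbb_favorable(mw, hbd, hba, psa, logp):
    return (mw < 450          # molecular weight in Da
            and hbd <= 3      # hydrogen bond donors
            and hba <= 7      # hydrogen bond acceptors
            and psa < 90      # polar surface area in square Angstroms
            and 1.0 <= logp <= 4.0)   # lipophilicity window

# The query compound's computed properties from the text:
ok = bbb_favorable(mw=324, hbd=2, hba=4, psa=65, logp=2.8)  # -> True
```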

When to Use AgentFlow

Best for training agents that improve through their own execution experience

Perfect For

Long-Horizon Multi-Step Tasks

When agents need 5–20 steps to complete a task and the quality of early decisions critically affects final outcomes.

Sparse Reward Environments

When meaningful feedback only comes at the end of a task (e.g., correct/incorrect answer) and you need to propagate that signal to individual steps.

Multi-Tool Agent Systems

When agents must coordinate web search, code execution, calculation, and verification tools — and learn which tool to use when.

Smaller Models Competing with Larger Ones

When you want a 7B-parameter model to match or exceed GPT-4o performance through optimized planning rather than raw scale.

Skip It When

Single-Turn Tasks

When the task requires one LLM call with no planning or tool use — AgentFlow’s multi-module architecture adds unnecessary complexity.

No Training Data Available

When you cannot generate training trajectories with ground-truth outcomes — Flow-GRPO needs trajectory rewards to optimize the planner.

Simple Tool Use Patterns

When the agent always follows the same tool sequence (search → read → answer) — optimization provides little benefit when planning is trivial.

Use Cases

Where AgentFlow delivers the most value

Autonomous Research

Agents that learn optimal search strategies, source verification patterns, and contradiction resolution through execution experience.

Software Engineering Agents

Agents that learn to scaffold incrementally, test early, and resolve bugs through verified patterns rather than exhaustive debugging.

Scientific Discovery

Agents that learn to cross-validate computational predictions against literature and flag potential concerns that naive agents miss.

Data Pipeline Automation

Agents that learn optimal data cleaning strategies, anomaly detection patterns, and transformation sequences from pipeline execution outcomes.

IT Operations

Agents that learn diagnostic strategies from incident resolution outcomes — which logs to check first, which fixes to try, when to escalate.

Complex Decision Support

Agents that learn to gather evidence, weigh alternatives, and present structured recommendations with confidence calibration from past decision outcomes.

Where AgentFlow Fits

The evolution from static agents to self-improving agentic systems

Single Prompts (Static Responses): one question, one answer
ReAct (Tool-Augmented): reasoning with external actions
Agentic Prompting (Autonomous Agents): goal-directed multi-step execution
AgentFlow (Self-Improving Agents): in-the-flow planner optimization
The Future of Agent Training

AgentFlow represents a paradigm shift from “train offline, deploy frozen” to “train in the flow of execution.” Just as humans improve their problem-solving strategies by reflecting on past successes and failures, AgentFlow agents learn which planning patterns lead to success and which lead to failure. This closes the loop between agent deployment and agent improvement — pointing toward a future where agents continuously get better at the specific tasks they encounter in production.
