A chatbot can produce a response and stop. An autonomous agent needs durable execution, policy enforcement, memory, observability, retries, human approval, and safe tool isolation. GAAP is a self-hosted, extensible platform for building long-running autonomous agents that reason, remember, and safely execute actions across tools and integrations. This piece walks through its design end to end.
01 What GAAP needs to do
At a high level, GAAP must support the full agent lifecycle. A user should be able to send a natural language command such as:
Summarize unread customer emails, identify urgent ones, draft replies, and schedule follow-ups for tomorrow.
The platform should then:
- Understand the user’s intent.
- Convert the request into a multi-step plan.
- Validate every action against policy.
- Retrieve relevant memory and context.
- Execute tools and APIs.
- Ask for human approval when needed.
- Persist execution state.
- Resume after failures.
- Record a replayable audit trail.
The most important design principle is this:
The LLM should not be the runtime.
The LLM can reason, plan, summarize, and classify. But the runtime itself must be deterministic, durable, auditable, and policy-controlled.
02 High-level architecture
GAAP separates the system into clear components: ingress, runtime, planning, policy, memory, audit, and tool execution.
```mermaid
flowchart LR
    U[User / Client] --> C["Channel Layer<br/>Web, CLI, Chat, API, Webhook"]
    C --> IG[Ingress Gateway]
    IG --> AR[Agent Runtime]
    subgraph AR[Agent Runtime]
        FSM[FSM Engine]
        PL[Planner]
        PE[Policy Engine]
        MM[Memory Manager]
        AM[Audit Manager]
    end
    FSM --> PL
    FSM --> PE
    FSM --> MM
    FSM --> AM
    PE -->|Allow / Deny / Require Approval| FSM
    FSM --> TR[Tool Router]
    TR --> TE[Tool Executors]
    subgraph TE[Tool Executors]
        Gmail[Gmail Plugin]
        Calendar[Calendar Plugin]
        Slack[Slack Plugin]
        Files[Filesystem Plugin]
        HTTP[HTTP/API Plugin]
        Custom[Custom Plugins]
    end
    MM --> MEM[(Persistent Memory)]
    AM --> AUDIT[(Audit Log)]
    FSM --> STATE[(Execution State Store)]
    TE --> EXT[External APIs / Systems]
```
The Ingress Gateway accepts commands from different channels such as a web UI, CLI, chat app, webhook, or API client.
The Agent Runtime owns execution. It coordinates the finite state machine, planner, policy engine, memory manager, audit manager, and tool router.
The Planner converts natural language into structured plans. The Policy Engine decides whether actions are allowed, denied, or require approval. The Memory Manager retrieves and writes persistent context. The Audit Manager records task lifecycle events, policy decisions, approvals, and tool executions. The Tool Router dispatches approved actions to isolated tool executors, which perform real-world actions through plugins and APIs.
This separation is critical. Without it, the system becomes a fragile LLM loop where reasoning, permissions, memory, execution, and logging are tangled together.
03 Core entities
A production agent platform should make its core entities explicit.
- Agent: agent_id, state, finite state machine definition, capabilities, policy profile
- Session: session_id, user_id, agent_id, channel, created_at, context_window
- Task: task_id, user_id, goal, plan, status, current_step, created_at, updated_at
- Tool: tool_id, name, actions, schema, risk_level, required_scopes
- MemoryRecord: memory_id, user_id, type, content, embedding, source, created_at
- AuditEvent: audit_id, task_id, agent_id, event_type, actor, action, result, timestamp
These entities give GAAP a stable foundation. Plans become inspectable data. Tool calls become auditable records. Memory becomes structured and queryable. Policies can operate on known fields rather than vague model output.
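As a rough illustration, two of these entities could be written as Python dataclasses. This is a sketch under assumptions, not GAAP's actual data model: the `PlanStep` type, the default `status` value, and the choice to make audit events immutable are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


def _now() -> datetime:
    # Timezone-aware UTC timestamps keep records comparable across hosts.
    return datetime.now(timezone.utc)


@dataclass
class PlanStep:
    # Hypothetical shape of one plan entry; field names follow the JSON examples.
    step_id: str
    tool: str
    args: dict[str, Any]
    risk: str


@dataclass
class Task:
    task_id: str
    user_id: str
    goal: str
    plan: list[PlanStep] = field(default_factory=list)
    status: str = "PLAN_INPUT"  # assumed initial status
    current_step: int = 0
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)


@dataclass(frozen=True)
class AuditEvent:
    # frozen=True rejects attribute assignment after creation,
    # which suits a record that is meant to be append-only.
    audit_id: str
    task_id: str
    agent_id: str
    event_type: str
    actor: str
    action: str
    result: str
    timestamp: datetime = field(default_factory=_now)
```

Because the fields are explicit, a policy check or an audit query can read `task.status` or `event.actor` directly instead of parsing model output.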
04 Modeling execution as a finite state machine
A long-running agent should not be an unstructured loop like this:
think -> act -> observe -> think -> act -> observe
That pattern is difficult to debug, replay, and secure. Instead, GAAP should model execution as an explicit finite state machine.
```mermaid
stateDiagram-v2
    [*] --> PLAN_INPUT
    PLAN_INPUT --> PLAN: Receive natural language command
    PLAN --> VALIDATE: Generate structured plan
    VALIDATE --> WAIT_APPROVAL: Risky action requires approval
    VALIDATE --> EXECUTE: Policy allows action
    VALIDATE --> FAILED: Policy denies action
    WAIT_APPROVAL --> EXECUTE: User approves
    WAIT_APPROVAL --> FAILED: User rejects / timeout
    EXECUTE --> OBSERVE: Tool call completed
    EXECUTE --> RETRY: Tool call failed
    RETRY --> EXECUTE: Retry allowed
    RETRY --> FAILED: Retry limit exceeded
    OBSERVE --> WRITE_MEMORY: Capture result
    WRITE_MEMORY --> RESPOND: Persist memory and audit events
    RESPOND --> COMPLETED
    FAILED --> RESPOND
    COMPLETED --> [*]
```
This design has several advantages. It makes execution understandable: at any moment, the system knows exactly what state a task is in. It improves reliability: if GAAP crashes, it can reload the last durable state and resume. It improves safety: risky actions must pass through explicit validation and approval states. And it improves auditability: every transition can be recorded and replayed.
A simplified task might look like this:
```json
{
  "task_id": "task_123",
  "state": "VALIDATE",
  "user_id": "user_456",
  "plan": [
    {
      "step_id": "step_1",
      "tool": "gmail.search",
      "args": {
        "query": "is:unread"
      },
      "risk": "read_only"
    },
    {
      "step_id": "step_2",
      "tool": "gmail.create_draft",
      "args": {
        "body": "$step_1.summary"
      },
      "risk": "write_requires_approval"
    }
  ],
  "created_at": "2026-05-14T12:00:00Z",
  "updated_at": "2026-05-14T12:00:02Z"
}
```
The planner proposes a plan. The FSM controls execution.
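One way to keep the FSM in control is to hold the allowed transitions as data and reject anything else. The sketch below copies the edges from the state diagram; the `transition` helper and the `IllegalTransition` exception are illustrative names, not part of any existing GAAP API.

```python
from enum import Enum


class State(str, Enum):
    PLAN_INPUT = "PLAN_INPUT"
    PLAN = "PLAN"
    VALIDATE = "VALIDATE"
    WAIT_APPROVAL = "WAIT_APPROVAL"
    EXECUTE = "EXECUTE"
    RETRY = "RETRY"
    OBSERVE = "OBSERVE"
    WRITE_MEMORY = "WRITE_MEMORY"
    RESPOND = "RESPOND"
    FAILED = "FAILED"
    COMPLETED = "COMPLETED"


# Allowed transitions, taken edge by edge from the state diagram.
TRANSITIONS: dict[State, set[State]] = {
    State.PLAN_INPUT: {State.PLAN},
    State.PLAN: {State.VALIDATE},
    State.VALIDATE: {State.WAIT_APPROVAL, State.EXECUTE, State.FAILED},
    State.WAIT_APPROVAL: {State.EXECUTE, State.FAILED},
    State.EXECUTE: {State.OBSERVE, State.RETRY},
    State.RETRY: {State.EXECUTE, State.FAILED},
    State.OBSERVE: {State.WRITE_MEMORY},
    State.WRITE_MEMORY: {State.RESPOND},
    State.RESPOND: {State.COMPLETED},
    State.FAILED: {State.RESPOND},
    State.COMPLETED: set(),
}


class IllegalTransition(Exception):
    """Raised when code or a model proposes a move the FSM does not allow."""


def transition(current: State, target: State) -> State:
    # The planner can suggest anything; only edges listed above are accepted.
    if target not in TRANSITIONS[current]:
        raise IllegalTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```

With this shape, `transition(State.PLAN, State.EXECUTE)` raises, so a plan cannot skip validation even if a model asks for it. A real runtime would presumably persist each accepted transition before acting on it.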
05 End-to-end task flow
The complete execution path looks like this:
```mermaid
sequenceDiagram
    participant User
    participant Gateway as Ingress Gateway
    participant Runtime as Agent Runtime
    participant Planner
    participant Policy as Policy Engine
    participant Memory as Memory Manager
    participant Tools as Tool Router / Executors
    participant Audit as Audit Log
    User->>Gateway: Natural language command
    Gateway->>Runtime: Create task
    Runtime->>Audit: Record TaskCreated
    Runtime->>Memory: Retrieve relevant context
    Memory-->>Runtime: Short-term / episodic / semantic memory
    Runtime->>Planner: Generate structured plan
    Planner-->>Runtime: Multi-step plan
    Runtime->>Audit: Record PlanGenerated
    loop For each plan step
        Runtime->>Policy: Validate tool action
        Policy-->>Runtime: Allow / Deny / Require Approval
        alt Requires approval
            Runtime-->>User: Request approval
            User-->>Runtime: Approve / reject
            Runtime->>Audit: Record ApprovalDecision
        end
        alt Allowed
            Runtime->>Tools: Execute tool call
            Tools-->>Runtime: Tool result
            Runtime->>Audit: Record ToolExecutionResult
            Runtime->>Memory: Write relevant memory
        else Denied
            Runtime->>Audit: Record PolicyDenied
        end
    end
    Runtime-->>Gateway: Final response
    Gateway-->>User: Result
```
Every meaningful action passes through the runtime. The planner does not call tools directly. The user does not bypass policy. Tools do not write memory independently. The audit log receives every important event. That structure keeps GAAP predictable.
06 Planning: fast, structured, and bounded
The planner converts natural language into a structured plan. A bad design would send every request to a large remote LLM and wait indefinitely. That approach cannot reliably meet an interactive latency budget, such as a planning target of under two seconds.
A better design uses tiered planning.
```mermaid
flowchart TD
    A[User Command] --> B[Intent Classifier]
    B --> C{Known workflow?}
    C -->|Yes| D[Use Template Plan]
    C -->|No| E{Cached similar plan?}
    E -->|Yes| F[Reuse / Adapt Cached Plan]
    E -->|No| G[Retrieve Relevant Memory]
    G --> H{Complex task?}
    H -->|Simple| I[Small / Local Model Planner]
    H -->|Complex| J[Large Planner Model]
    D --> K[Validate Plan Schema]
    F --> K
    I --> K
    J --> K
    K --> L{Valid?}
    L -->|Yes| M[Persist Plan]
    L -->|No| N[Ask Clarifying Question or Fail Safely]
```
GAAP should support several planning modes:
- Template planning for the fastest path through known workflows.
- Cached planning to reuse common plans for repeated tasks.
- Small or local model planning for simple requests and low-latency execution.
- Large model planning reserved for complex, ambiguous, or high-value tasks.
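The tiers can be read as a routing function that tries the cheapest option first. The following is a minimal sketch under assumptions: the `TieredPlanner` class and its callable fields are hypothetical, the cache matches normalized text exactly rather than finding similar plans, and the memory retrieval step from the flowchart is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Plan = dict  # a structured plan, as in the JSON examples


@dataclass
class TieredPlanner:
    templates: dict[str, Plan]  # intent name -> template plan
    cache: dict[str, Plan]      # normalized command -> previously generated plan
    classify_intent: Callable[[str], Optional[str]]
    is_complex: Callable[[str], bool]
    small_model: Callable[[str], Plan]
    large_model: Callable[[str], Plan]

    def plan(self, command: str) -> tuple[str, Plan]:
        """Return (tier_used, plan), trying the cheapest tier first."""
        intent = self.classify_intent(command)
        if intent is not None and intent in self.templates:
            return "template", self.templates[intent]

        # Exact match on normalized text; a real cache might use similarity search.
        key = " ".join(command.lower().split())
        if key in self.cache:
            return "cache", self.cache[key]

        if self.is_complex(command):
            tier, model = "large", self.large_model
        else:
            tier, model = "small", self.small_model
        result = model(command)
        self.cache[key] = result
        return tier, result
```

Returning the tier alongside the plan makes it straightforward to record which path produced a plan, which helps when tracking planning latency per tier.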
The planner should produce constrained JSON, not free-form prose.
```json
{
  "goal": "Summarize unread emails and draft replies",
  "steps": [
    {
      "step_id": "step_1",
      "tool": "gmail.search",
      "args": {
        "query": "is:unread"
      },
      "risk": "read_only"
    },
    {
      "step_id": "step_2",
      "tool": "llm.summarize",
      "args": {
        "input": "$step_1.results"
      },
      "risk": "internal"
    },
    {
      "step_id": "step_3",
      "tool": "gmail.create_draft",
      "args": {
        "body": "$step_2.summary"
      },
      "risk": "write_requires_approval"
    }
  ]
}
```
Every generated plan should be validated against a schema before execution. Invalid plans should never reach the tool layer.
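A validator for plans of this shape can be small. The sketch below uses only the standard library and assumes the three risk labels seen in the example plan, a caller-supplied set of registered tool names, and a `$step_id.field` reference syntax. A production system might use a JSON Schema validator instead.

```python
import re
from typing import Any

# Assumption: these are the only risk labels a plan may carry.
ALLOWED_RISKS = {"read_only", "internal", "write_requires_approval"}
# Matches references such as "$step_1.results".
STEP_REF = re.compile(r"^\$(step_\w+)\.\w+$")


def validate_plan(plan: dict[str, Any], known_tools: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the plan passed."""
    errors: list[str] = []
    goal = plan.get("goal")
    if not isinstance(goal, str) or not goal.strip():
        errors.append("goal must be a non-empty string")
    steps = plan.get("steps")
    if not isinstance(steps, list) or not steps:
        return errors + ["steps must be a non-empty list"]

    seen: set[str] = set()  # step ids defined by earlier steps
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            errors.append(f"steps[{i}] must be an object")
            continue
        step_id = step.get("step_id")
        if not isinstance(step_id, str) or step_id in seen:
            errors.append(f"steps[{i}].step_id is missing or duplicated")
        tool = step.get("tool")
        if not isinstance(tool, str) or tool not in known_tools:
            errors.append(f"steps[{i}].tool is not a registered tool")
        risk = step.get("risk")
        if not isinstance(risk, str) or risk not in ALLOWED_RISKS:
            errors.append(f"steps[{i}].risk is not a known risk level")
        args = step.get("args")
        if not isinstance(args, dict):
            errors.append(f"steps[{i}].args must be an object")
            args = {}
        # A step may only reference output from steps that run before it.
        for value in args.values():
            match = STEP_REF.match(value) if isinstance(value, str) else None
            if match and match.group(1) not in seen:
                errors.append(f"steps[{i}] references {match.group(1)} before it runs")
        if isinstance(step_id, str):
            seen.add(step_id)
    return errors
```

Returning every problem at once, rather than raising on the first, gives the runtime enough detail to ask a clarifying question or request a repaired plan.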
07 Durable orchestration and idempotency
Long-running agents need durable orchestration. FSM transitions, tool invocations, memory writes, and audit events should not live only in process memory. If the process crashes, the system must be able to resume safely. GAAP should use an event log plus a durable job queue.
```mermaid
flowchart LR
    Runtime[Agent Runtime] --> TX[Transactional Boundary]
    TX --> EVENT[(Event Log)]
    TX --> STATE[(Task State Store)]
    TX --> OUTBOX[(Tool Outbox)]
    OUTBOX --> Queue[Durable Job Queue]
    Queue --> Worker[Tool Worker]
    Worker --> IDEMP{Idempotency Key Seen?}
    IDEMP -->|Yes| RESULT[(Return Stored Result)]
    IDEMP -->|No| EXEC[Execute Tool]
    EXEC --> SAVE[(Store Tool Result)]
    SAVE --> EVENT
    Worker --> Runtime
```
Instead of mutating hidden in-memory state, GAAP should append events:
```text
TaskCreated
PlanGenerated
StepValidated
ApprovalRequested
ApprovalGranted
ToolExecutionRequested
ToolExecutionStarted
ToolExecutionSucceeded
ToolExecutionFailed
MemoryWriteCommitted
TaskCompleted
```
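Appending events pays off at recovery time, because current state can be rebuilt by folding over the log. Here is a minimal sketch of that replay, assuming each event is a dict with a `type` and an optional `step_id`. Those field names and the status strings are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class TaskView:
    status: str = "unknown"
    completed_steps: list[str] = field(default_factory=list)
    in_flight: set[str] = field(default_factory=set)  # requested but not finished


def replay(events: list[dict]) -> TaskView:
    """Rebuild a task's current view by applying events in log order."""
    view = TaskView()
    for event in events:
        kind = event["type"]
        step_id = event.get("step_id")
        if kind == "TaskCreated":
            view.status = "created"
        elif kind == "PlanGenerated":
            view.status = "planned"
        elif kind == "ApprovalRequested":
            view.status = "waiting_approval"
        elif kind == "ApprovalGranted":
            view.status = "executing"
        elif kind == "ToolExecutionRequested":
            view.status = "executing"
            view.in_flight.add(step_id)
        elif kind == "ToolExecutionSucceeded":
            view.in_flight.discard(step_id)
            view.completed_steps.append(step_id)
        elif kind == "ToolExecutionFailed":
            view.in_flight.discard(step_id)
            view.status = "step_failed"
        elif kind == "TaskCompleted":
            view.status = "completed"
        # Other event types do not change this particular view.
    return view
```

After a crash, any step left in `in_flight` is one whose outcome is unknown, so it is a candidate for a checked re-run rather than a blind retry.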
Tool calls must be idempotent. Sending an email, deleting a file, charging a customer, or creating a calendar event must not happen twice because a worker retried after a timeout. A tool invocation should include an idempotency key:
```json
{
  "tool_call_id": "toolcall_789",
  "task_id": "task_123",
  "step_id": "step_3",
  "tool": "gmail.create_draft",
  "idempotency_key": "task_123:step_3:gmail.create_draft",
  "status": "pending"
}
```
Before executing a tool, the worker checks whether the idempotency key already completed. If yes, it returns the stored result. If no, it executes the action and stores the result. This is what separates a reliable agent platform from a demo.
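The check-then-execute logic can be sketched in a few lines. This version keeps results in an in-memory dict purely for illustration, and the `IdempotentExecutor` name is hypothetical. A real worker would need a durable store, ideally with a unique constraint on the key.

```python
from typing import Any, Callable


class IdempotentExecutor:
    """Runs each side-effecting action at most once per idempotency key."""

    def __init__(self) -> None:
        # Illustration only: a dict does not survive a crash.
        self._results: dict[str, Any] = {}

    def run(self, idempotency_key: str, action: Callable[[], Any]) -> Any:
        if idempotency_key in self._results:
            # Key already completed: return the stored result, skip the side effect.
            return self._results[idempotency_key]
        result = action()
        self._results[idempotency_key] = result
        return result


def make_key(task_id: str, step_id: str, tool: str) -> str:
    # Same format as the example invocation: "task_123:step_3:gmail.create_draft".
    return f"{task_id}:{step_id}:{tool}"
```

One gap remains even with a durable store: if the worker crashes after the tool runs but before the result is saved, a retry would execute again. Passing the same key to external APIs that support idempotency keys is one way to narrow that window.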
08 Policy engine: deny by default
GAAP should use a deny-by-default safety model. No action should be allowed unless a policy explicitly permits it.
```mermaid
flowchart TD
    A[Proposed Tool Action] --> B[Load Tool Manifest]
    B --> C[Load User / Agent Policy]
    C --> D[Evaluate Risk]
    D --> E{Decision}
    E -->|Allow| F[Execute]
    E -->|Require Approval| G[Pause Task and Ask User]
    E -->|Deny| H[Block Action]
    G --> I{User Decision}
    I -->|Approve| F
    I -->|Reject / Timeout| H
    F --> J[Audit Decision and Result]
    H --> J
```
Policies should consider user identity, agent role, tool name, action type, resource scope, arguments, risk level, destination, approval history, time of day, and rate limits.
A read-only email search may be allowed automatically. Creating an email draft may be allowed but audited. Sending an external email may require approval. Deleting files may be denied unless the user explicitly enables that capability.
A policy decision should be structured:
```json
{
  "tool": "gmail.send",
  "decision": "REQUIRE_APPROVAL",
  "reason": "Outbound email to external recipient",
  "approval_required_from": "user"
}
```
The approval prompt should be understandable:
```text
GAAP wants to send an email to alex@example.com.
Subject: Follow-up on contract review
Reason: This is step 4 of your task: “reply to urgent customer emails.”
Approve?
```
The approval itself should be stored as an audit event.
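Deny-by-default comes down to one property: when no rule matches, the answer is DENY. The sketch below shows that property with exact-match rules on the tool name only. The `Rule` type and `evaluate` function are hypothetical, and a real engine would also weigh the user, arguments, destination, and rate limits listed earlier.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    ALLOW = "ALLOW"
    REQUIRE_APPROVAL = "REQUIRE_APPROVAL"
    DENY = "DENY"


@dataclass(frozen=True)
class Rule:
    tool: str  # exact tool name, for example "gmail.send"
    decision: Decision
    reason: str


def evaluate(tool: str, rules: list[Rule]) -> dict:
    """Return a structured decision; anything without a matching rule is denied."""
    for rule in rules:
        if rule.tool == tool:
            return {"tool": tool, "decision": rule.decision.value, "reason": rule.reason}
    return {
        "tool": tool,
        "decision": Decision.DENY.value,
        "reason": "No policy permits this action",
    }


# Example rule set mirroring the cases described in the text.
EXAMPLE_RULES = [
    Rule("gmail.search", Decision.ALLOW, "Read-only email search"),
    Rule("gmail.send", Decision.REQUIRE_APPROVAL, "Outbound email to external recipient"),
]
```

With these rules, `evaluate("filesystem.delete_file", EXAMPLE_RULES)` returns a DENY decision even though nobody wrote a rule about deleting files, which is the point of the model.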
09 Tool execution: plugin-based but isolated
GAAP should support extensible tools through plugins. A plugin might expose actions such as gmail.search, gmail.create_draft, gmail.send, calendar.create_event, slack.send_message, github.create_issue, filesystem.read_file, or http.request.
Each plugin should declare a manifest.
```json
{
  "name": "gmail",
  "actions": [
    {
      "name": "search",
      "risk": "read",
      "required_scopes": ["gmail.readonly"]
    },
    {
      "name": "create_draft",
      "risk": "write",
      "required_scopes": ["gmail.compose"]
    },
    {
      "name": "send",
      "risk": "external_write",
      "required_scopes": ["gmail.send"],
      "requires_approval": true
    }
  ]
}
```
The Tool Router uses the manifest to route calls and enforce permissions. But policy checks are not enough. Tool execution also needs containment.
```mermaid
flowchart LR
    Runtime[Agent Runtime] --> Router[Tool Router]
    Router --> Sandbox[Tool Sandbox]
    subgraph Sandbox[Isolated Execution Environment]
        Proc[Separate Process / Container]
        Timeout[Timeouts]
        Limits[CPU / Memory Limits]
        Network[Egress Policy]
        Secrets[Scoped Secrets]
        Schema[Input / Output Schema Validation]
    end
    Sandbox --> Plugin[Tool Plugin]
    Plugin --> API[External API]
    Sandbox --> Result[Structured Result]
    Result --> Runtime
```
Tool executors should run with separate process or container isolation, per-tool timeouts, CPU and memory limits, network egress controls, scoped secrets, rate limits, and input and output schema validation.
A Gmail plugin should receive only the Gmail scopes it needs. A filesystem plugin should be restricted to allowed directories. An HTTP plugin should not have unrestricted access to internal network addresses. This turns policy from a soft rule into enforceable containment.
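Two of these restrictions, allowed directories and blocked internal addresses, can be sketched as plain guard functions. The function names are hypothetical, and these checks would complement process or container isolation rather than replace it. `Path.is_relative_to` requires Python 3.9 or later.

```python
import ipaddress
from pathlib import Path


def is_path_allowed(requested: str, allowed_roots: list[str]) -> bool:
    """True only if the resolved path sits inside one of the allowed roots."""
    # resolve() collapses ".." segments and follows symlinks, so a request like
    # "<root>/../../etc/passwd" is judged by where it actually points.
    target = Path(requested).resolve()
    return any(target.is_relative_to(Path(root).resolve()) for root in allowed_roots)


def is_egress_allowed(ip: str) -> bool:
    """True only for publicly routable addresses."""
    # is_global is False for private, loopback, and link-local ranges,
    # including 169.254.169.254, which many clouds use for instance metadata.
    return ipaddress.ip_address(ip).is_global
```

The egress check only means something when applied to the address the connection actually uses. Validating a hostname once and then resolving it again at connect time would leave room for DNS-based bypasses.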
10 Persistent memory
Memory is central to GAAP. The agent needs to remember prior work, user preferences, facts, outcomes, and decisions. But memory should not be a single unstructured blob. GAAP should separate memory into different types.
```mermaid
flowchart TD
    A[Memory Manager] --> STM[Short-Term Memory]
    A --> EPI[Episodic Memory]
    A --> SEM[Semantic Memory]
    A --> PREF[Preference Memory]
    STM --> S1[(Conversation Store)]
    EPI --> S2[(Task / Event Store)]
    SEM --> S3[(Vector Store + Metadata DB)]
    PREF --> S4[(User Preference Store)]
    A --> R[Retriever]
    R --> C[Context Builder]
    C --> Planner[Planner / Agent Runtime]
```
| Memory type | Purpose |
|---|---|
| Short-term memory | Current conversation context. |
| Episodic memory | Past tasks, actions, and outcomes. |
| Semantic memory | Durable facts and knowledge. |
| Preference memory | User preferences and behavioral patterns. |
A memory record might look like this:
```json
{
  "memory_id": "mem_123",
  "user_id": "user_456",
  "type": "preference",
  "content": "User prefers concise status updates in the morning.",
  "source": "conversation",
  "created_at": "2026-05-14T12:05:00Z",
  "confidence": 0.84
}
```
Memory writes should also pass through policy. The system should avoid storing secrets, sensitive personal data, or temporary information unless explicitly allowed. Memory needs retention rules, deletion support, and source attribution. Good memory is not just recall. It is governed recall.
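A gate in front of memory writes might look like the sketch below. Everything specific in it is an assumption made for illustration: the single secret-detection pattern, the retention periods per memory type, and the `prepare_memory_write` name. A regular expression is a coarse filter and would miss many kinds of sensitive data.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: a crude pattern for text that looks like a credential assignment.
SECRET_PATTERN = re.compile(r"(?i)\b(password|passwd|api[_-]?key|secret|token)\b\s*[:=]")

# Assumption: illustrative retention periods; None means no automatic expiry.
RETENTION: dict[str, Optional[timedelta]] = {
    "short_term": timedelta(days=1),
    "episodic": timedelta(days=90),
    "semantic": None,
    "preference": None,
}


def prepare_memory_write(record: dict, now: datetime) -> Optional[dict]:
    """Return the record with an expiry attached, or None if it must not be stored."""
    memory_type = record.get("type")
    if memory_type not in RETENTION:
        return None  # unknown memory types are rejected
    if not record.get("source"):
        return None  # source attribution is mandatory
    content = record.get("content", "")
    if not isinstance(content, str) or SECRET_PATTERN.search(content):
        return None  # refuse content that looks like a credential
    ttl = RETENTION[memory_type]
    expires_at = (now + ttl).isoformat() if ttl is not None else None
    return {**record, "expires_at": expires_at}
```

Attaching `expires_at` at write time gives a later cleanup job one field to query, which is one simple way to implement retention and deletion.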
11 API design
GAAP needs APIs that work for both synchronous and long-running tasks. The basic API surface might start like this:
```text
POST /tasks
GET /tasks/{task_id}
GET /tasks/{task_id}/events
POST /tasks/{task_id}/approve
POST /tools/execute
GET /memory/search
POST /memory
GET /audit/events
```
But the APIs need production controls: authentication, authorization, idempotency keys, correlation IDs, streaming progress, pagination, audit access control, and long-running task status.
A better POST /tasks request might look like this:
```json
{
  "command": "Summarize unread customer emails and draft replies.",
  "channel": "web",
  "idempotency_key": "client_req_abc123",
  "correlation_id": "corr_456",
  "execution_mode": "async"
}
```
The response should not pretend the work is complete if it is long-running:
```json
{
  "task_id": "task_123",
  "status": "planning",
  "events_url": "/tasks/task_123/events",
  "approval_url": "/tasks/task_123/approve"
}
```
For progress, GAAP can support server-sent events or WebSockets.
```mermaid
sequenceDiagram
    participant Client
    participant API as GAAP API
    participant Runtime as Agent Runtime
    participant Queue as Durable Queue
    participant Worker as Tool Worker
    Client->>API: POST /tasks with idempotency key
    API->>Runtime: Create task
    Runtime->>Queue: Enqueue planning/execution jobs
    API-->>Client: 202 Accepted + task_id
    Client->>API: GET /tasks/task_123/events
    API-->>Client: Stream task events
    Queue->>Worker: Execute next step
    Worker-->>Runtime: Step result
    Runtime-->>API: Publish event
    API-->>Client: Progress update
```
This gives clients a safe retry model and a way to observe long-running progress.
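The retry-safety half of that claim can be sketched without a web framework. In the snippet below, the `TaskService` class is hypothetical, storage is an in-memory dict, and returning 200 for a repeated request is an assumed convention rather than something the design specifies.

```python
import itertools


class TaskService:
    """Creates tasks so that repeating a POST /tasks request is harmless."""

    def __init__(self) -> None:
        # Keys are scoped per user so two clients cannot collide on the same string.
        self._by_key: dict[tuple[str, str], dict] = {}
        self._ids = itertools.count(1)

    def create_task(self, user_id: str, request: dict) -> tuple[int, dict]:
        """Return (http_status, response_body)."""
        key = (user_id, request["idempotency_key"])
        existing = self._by_key.get(key)
        if existing is not None:
            # Replay of an earlier request: same task, no second execution.
            return 200, existing
        task_id = f"task_{next(self._ids)}"
        task = {
            "task_id": task_id,
            "status": "planning",
            "events_url": f"/tasks/{task_id}/events",
            "approval_url": f"/tasks/{task_id}/approve",
        }
        self._by_key[key] = task
        return 202, task
```

A client that times out waiting for the first response can resend the same body and get the original `task_id` back, instead of starting a second copy of the work.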
12 Auditability and replay
Auditability is not just logging. GAAP should record enough information to explain what happened, why it happened, who approved it, which policy applied, and what result was produced.
Audit events should include task lifecycle events, planner inputs and structured outputs, policy decisions, approval requests and responses, tool invocation metadata, tool results, memory reads and writes, errors, and retries.
A sample audit event:
```json
{
  "audit_id": "audit_001",
  "task_id": "task_123",
  "event_type": "POLICY_DECISION",
  "actor": "policy_engine",
  "action": "gmail.send",
  "decision": "REQUIRE_APPROVAL",
  "reason": "External email send requires approval",
  "timestamp": "2026-05-14T12:06:00Z"
}
```
Replayability requires deterministic records. The system should store structured plans, tool call metadata, state transitions, policy decisions, and final outcomes. For sensitive data, the audit log should avoid storing full secrets or unnecessary payloads. Instead, it can store hashes, references, redacted previews, and metadata.
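Storing a hash in place of a sensitive value could work along these lines. The `redact_payload` helper and the caller-supplied set of sensitive field names are assumptions made for illustration.

```python
import hashlib
import json


def redact_payload(payload: dict, sensitive_fields: set[str]) -> dict:
    """Copy a payload for the audit log, replacing sensitive values with a digest."""
    redacted: dict = {}
    for key, value in payload.items():
        if key in sensitive_fields:
            # Serialize deterministically so the same value always hashes the same.
            raw = json.dumps(value, sort_keys=True).encode("utf-8")
            redacted[key] = {
                "sha256": hashlib.sha256(raw).hexdigest(),
                "length": len(raw),
            }
        else:
            redacted[key] = value
    return redacted
```

The digest lets an auditor confirm that two events carried the same content without the log holding the content itself. For short or guessable values a plain hash can be reversed by trial, so a keyed construction such as HMAC may be the safer choice there.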
13 Handling failures
GAAP must expect failure. LLM calls can time out. Tools can fail. APIs can rate-limit. Users may not approve actions. The process may crash. The network may disappear. A plugin may return malformed output.
The runtime should classify failures and respond accordingly.
```mermaid
flowchart TD
    A[Failure Detected] --> B{Failure Type}
    B -->|Planner Timeout| C[Fallback Planner or Ask Clarifying Question]
    B -->|Policy Denial| D[Stop Step and Explain]
    B -->|Approval Timeout| E[Pause or Cancel Task]
    B -->|Tool Timeout| F[Retry with Backoff]
    B -->|Rate Limit| G[Reschedule]
    B -->|Validation Error| H[Reject Output and Repair]
    B -->|Crash| I[Reload State from Event Log]
    C --> J[Audit Failure]
    D --> J
    E --> J
    F --> J
    G --> J
    H --> J
    I --> J
```
Retries should be bounded. Side-effecting operations should use idempotency keys. Failed tasks should remain inspectable, not disappear.
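Bounded retry with backoff is a small amount of code. The sketch below uses exponential backoff with full jitter. The `RetryableError` type and the default limits are assumptions, and the `sleep` parameter exists so the delay can be swapped out in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Raised by a tool call for failures worth retrying, such as timeouts."""


def run_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` up to max_attempts times, backing off between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Exponential cap (base, 2x base, 4x base, ...) with full jitter,
            # so many workers retrying at once do not synchronize.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Only `RetryableError` triggers a retry. Other exceptions, such as a validation failure, propagate immediately, which matches the idea of classifying failures before deciding how to respond.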
14 Storage design
GAAP needs multiple storage patterns.
```mermaid
flowchart LR
    Runtime[Agent Runtime] --> DB[(Relational DB)]
    Runtime --> Queue[(Durable Queue)]
    Runtime --> Obj[(Object Store)]
    Memory[Memory Manager] --> Vec[(Vector Store)]
    Audit[Audit Manager] --> Log[(Append-Only Event Log)]
    DB --> A[Tasks]
    DB --> B[Sessions]
    DB --> C[Agents]
    DB --> D[Tool Calls]
    DB --> E[Approvals]
    Vec --> F[Semantic Memory]
    Obj --> G[Large Artifacts]
    Log --> H[Replayable Audit Events]
```
A practical self-hosted deployment might use:
- PostgreSQL for tasks, sessions, tool calls, approvals, and metadata.
- pgvector or a dedicated vector database for semantic memory retrieval.
- Redis or another queue for short-lived job dispatch and worker coordination, configured for persistence so that queued jobs survive a restart.
- Object storage for large files, tool outputs, and attachments.
- An append-only event table for audit and replay.
For a single-node deployment, PostgreSQL can carry much of the load initially. As the system grows, queues, vector search, and object storage can be separated.
15 Security model
GAAP needs security at multiple layers: authentication (verify who is making the request), authorization (decide what the user or agent can do), policy enforcement (evaluate every proposed action), secret management (store and inject credentials safely), tool isolation (contain plugins and side effects), audit (record decisions and actions), and human approval (require confirmation for risky operations).
The platform should avoid giving agents broad credentials. Instead, each tool call should receive the minimum required secret for the shortest practical time.
- Bad: The agent process has access to all API keys.
- Better: The tool executor receives a scoped token only for the approved action.
This is especially important for plugins, because extensibility increases attack surface.
16 Extensibility model
GAAP should make it easy to add new channels, tools, memory backends, and models. A plugin system can support tool plugins, channel plugins, memory backend plugins, policy plugins, and model provider plugins.
A tool plugin should define a name, version, actions, input schema, output schema, required scopes, risk classification, timeouts, rate limits, and sandbox requirements.
```json
{
  "name": "calendar",
  "version": "1.0.0",
  "actions": [
    {
      "name": "create_event",
      "input_schema": "schemas/calendar_create_event.json",
      "output_schema": "schemas/calendar_event_result.json",
      "risk": "external_write",
      "required_scopes": ["calendar.write"],
      "requires_approval": true,
      "timeout_ms": 5000
    }
  ]
}
```
The runtime should not need to know the internal implementation of each plugin. It only needs a manifest, schemas, policy metadata, and a safe execution interface.
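Loading a manifest into typed records is one way to give the runtime that interface. The sketch below is illustrative: the `ActionSpec` type, the set of accepted risk labels, the default timeout, and the rule that `external_write` actions require approval unless stated otherwise are all assumptions.

```python
from dataclasses import dataclass

# Assumption: the risk labels used in the manifest examples are the full set.
KNOWN_RISKS = {"read", "write", "external_write"}
DEFAULT_TIMEOUT_MS = 5000  # assumed default when a manifest omits timeout_ms


@dataclass(frozen=True)
class ActionSpec:
    name: str
    risk: str
    required_scopes: tuple[str, ...]
    requires_approval: bool
    timeout_ms: int


def load_manifest(manifest: dict) -> dict[str, ActionSpec]:
    """Return a mapping from qualified tool names like 'calendar.create_event' to specs."""
    plugin = manifest["name"]
    actions: dict[str, ActionSpec] = {}
    for raw in manifest["actions"]:
        risk = raw["risk"]
        if risk not in KNOWN_RISKS:
            raise ValueError(f"{plugin}.{raw['name']}: unknown risk {risk!r}")
        spec = ActionSpec(
            name=raw["name"],
            risk=risk,
            required_scopes=tuple(raw.get("required_scopes", [])),
            # Assumed default: external writes need approval unless the manifest says otherwise.
            requires_approval=raw.get("requires_approval", risk == "external_write"),
            timeout_ms=int(raw.get("timeout_ms", DEFAULT_TIMEOUT_MS)),
        )
        actions[f"{plugin}.{spec.name}"] = spec
    return actions
```

Rejecting an unknown risk label at load time keeps a misconfigured plugin from ever reaching the router, which is cheaper than discovering the problem during a task.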
17 Observability
A long-running agent platform needs more than application logs. GAAP should expose metrics such as planning latency, tool execution latency, policy decision counts, approval wait time, task success rate, task failure rate, retry count, memory retrieval latency, memory write count, LLM token usage, and cost per task.
Structured traces should connect the user request, task ID, plan ID, step ID, tool call ID, audit event ID, and correlation ID. This makes debugging possible when a user asks:
- Why did the agent stop halfway?
- Why did it ask me for approval?
- Which tool caused the failure?
18 Deployment model
GAAP can start as a single-node system while still using production-grade boundaries.
```mermaid
flowchart TD
    subgraph SingleNode[Single-Node GAAP Deployment]
        API[API Server]
        Runtime[Agent Runtime]
        Worker[Tool Worker]
        DB[(PostgreSQL)]
        Queue[(Queue)]
        Sandbox[Tool Sandboxes]
        UI[Web UI]
    end
    UI --> API
    API --> Runtime
    Runtime --> DB
    Runtime --> Queue
    Queue --> Worker
    Worker --> Sandbox
    Sandbox --> External[External APIs]
```
A single-node deployment is simpler to operate, but the architecture should not rely on in-memory state. Even on one machine, GAAP should still use persistent task state, a durable queue, an append-only audit log, idempotency keys, tool isolation, configurable model providers, and plugin manifests. That makes the system easier to scale later without rewriting the core runtime.
19 Final recommended architecture
A strong production-ready GAAP design combines all of the above ideas.
```mermaid
flowchart TD
    User[User / Channel] --> Gateway[Ingress Gateway]
    Gateway --> Auth[AuthN / AuthZ]
    Auth --> Runtime[Durable Agent Runtime]
    Runtime --> FSM[FSM Engine]
    Runtime --> Planner[Planner Service]
    Runtime --> Policy[Policy Engine]
    Runtime --> Memory[Memory Manager]
    Runtime --> Audit[Audit Manager]
    Planner --> ModelRouter[Model Router]
    ModelRouter --> LocalModel[Small / Local Model]
    ModelRouter --> RemoteModel[Remote LLM]
    ModelRouter --> PlanCache[(Plan Cache)]
    Memory --> Vector[(Vector Store)]
    Memory --> MemoryDB[(Memory Metadata DB)]
    FSM --> StateDB[(Task State DB)]
    Audit --> EventLog[(Append-Only Audit Log)]
    FSM --> Outbox[(Tool Outbox)]
    Outbox --> Queue[(Durable Queue)]
    Queue --> ToolWorker[Tool Worker]
    ToolWorker --> ToolPolicy[Runtime Tool Guard]
    ToolPolicy --> Sandbox[Sandboxed Executor]
    Sandbox --> Plugins[Tool Plugins]
    Plugins --> APIs[External APIs]
    Runtime --> Approval[Human Approval Service]
    Approval --> User
```
This architecture addresses the main risks.
| Risk | Design response |
|---|---|
| Slow planning | Template plans, cached plans, local models, model routing. |
| Lost execution state | Durable FSM, event log, task state store. |
| Duplicate side effects | Idempotency keys and tool result records. |
| Unsafe tools | Sandboxes, scoped secrets, egress controls, rate limits. |
| Weak APIs | Auth, authorization, idempotency, progress streaming. |
| Poor auditability | Append-only audit events and replayable state transitions. |
| Unbounded memory | Typed memory, policy-controlled writes, retention rules. |
20 Conclusion
GAAP should not be designed as a thin wrapper around an LLM. It should be designed as an agent operating system. The LLM is important, but it is only one part of the platform. The real value comes from the runtime around it: deterministic execution, durable orchestration, safe tool use, persistent memory, human approval, and auditability.
A well-designed GAAP system has five core properties:
- Deterministic execution through an FSM.
- Durable orchestration through event logs and queues.
- Safe action execution through policy and sandboxing.
- Useful memory through typed persistent records.
- Extensibility through plugins and clear manifests.
That is what makes long-running autonomous agents practical. Not just impressive in a demo. Reliable enough to run continuously.