Designing GAAP: A Self-Hosted Platform for Safe, Long-Running Autonomous Agents

A chatbot can produce a response and stop. An autonomous agent needs durable execution, policy enforcement, memory, observability, retries, human approval, and safe tool isolation. GAAP is a self-hosted, extensible platform for building long-running autonomous agents that reason, remember, and safely execute actions across tools and integrations. This piece walks through its design end to end.

01 What GAAP needs to do

At a high level, GAAP must support the full agent lifecycle. A user should be able to send a natural language command such as:

Summarize unread customer emails, identify urgent ones, draft replies, and schedule follow-ups for tomorrow.

The platform should then:

Understand the user’s intent.
Convert the request into a multi-step plan.
Validate every action against policy.
Retrieve relevant memory and context.
Execute tools and APIs.
Ask for human approval when needed.
Persist execution state.
Resume after failures.
Record a replayable audit trail.

The most important design principle is this:

The LLM should not be the runtime.

The LLM can reason, plan, summarize, and classify. But the runtime itself must be deterministic, durable, auditable, and policy-controlled.

02 High-level architecture

GAAP separates the system into clear components: ingress, runtime, planning, policy, memory, audit, and tool execution.

flowchart LR
    U[User / Client] --> C["Channel Layer<br/>Web, CLI, Chat, API, Webhook"]
    C --> IG[Ingress Gateway]

    IG --> AR[Agent Runtime]

    subgraph AR[Agent Runtime]
        FSM[FSM Engine]
        PL[Planner]
        PE[Policy Engine]
        MM[Memory Manager]
        AM[Audit Manager]
    end

    FSM --> PL
    FSM --> PE
    FSM --> MM
    FSM --> AM

    PE -->|Allow / Deny / Require Approval| FSM

    FSM --> TR[Tool Router]
    TR --> TE[Tool Executors]

    subgraph TE[Tool Executors]
        Gmail[Gmail Plugin]
        Calendar[Calendar Plugin]
        Slack[Slack Plugin]
        Files[Filesystem Plugin]
        HTTP[HTTP/API Plugin]
        Custom[Custom Plugins]
    end

    MM --> MEM[(Persistent Memory)]
    AM --> AUDIT[(Audit Log)]
    FSM --> STATE[(Execution State Store)]

    TE --> EXT[External APIs / Systems]

The Ingress Gateway accepts commands from different channels such as a web UI, CLI, chat app, webhook, or API client.

The Agent Runtime owns execution. It coordinates the finite state machine, planner, policy engine, memory manager, audit manager, and tool router.

The Planner converts natural language into structured plans. The Policy Engine decides whether actions are allowed, denied, or require approval. The Memory Manager retrieves and writes persistent context. The Audit Manager records task lifecycle events, policy decisions, approvals, and tool executions. The Tool Router dispatches approved actions to isolated tool executors, which perform real-world actions through plugins and APIs.

This separation is critical. Without it, the system becomes a fragile LLM loop where reasoning, permissions, memory, execution, and logging are tangled together.

03 Core entities

A production agent platform should make its core entities explicit.

Agent
- agent_id
- state
- fsm_definition
- capabilities
- policy profile

Session
- session_id
- user_id
- agent_id
- channel
- created_at
- context_window

Task
- task_id
- user_id
- goal
- plan
- status
- current_step
- created_at
- updated_at

Tool
- tool_id
- name
- actions
- schema
- risk_level
- required_scopes

MemoryRecord
- memory_id
- user_id
- type
- content
- embedding
- source
- created_at

AuditEvent
- audit_id
- task_id
- agent_id
- event_type
- actor
- action
- result
- timestamp

These entities give GAAP a stable foundation. Plans become inspectable data. Tool calls become auditable records. Memory becomes structured and queryable. Policies can operate on known fields rather than vague model output.
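As a rough illustration, a few of these entities could be expressed as typed records. This is a minimal sketch rather than GAAP's actual data model: the `PlanStep` class, the `utc_now` helper, and the default status value are assumptions added for the example, while the field names follow the lists above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


def utc_now() -> datetime:
    """Timezone-aware timestamp used for created_at / updated_at."""
    return datetime.now(timezone.utc)


@dataclass
class PlanStep:                   # hypothetical shape for one plan entry
    step_id: str
    tool: str                     # e.g. "gmail.search"
    args: dict[str, Any]
    risk: str                     # e.g. "read_only"


@dataclass
class Task:
    task_id: str
    user_id: str
    goal: str
    plan: list[PlanStep] = field(default_factory=list)
    status: str = "PLAN_INPUT"    # assumed initial status
    current_step: int = 0
    created_at: datetime = field(default_factory=utc_now)
    updated_at: datetime = field(default_factory=utc_now)


@dataclass(frozen=True)           # audit events are immutable once created
class AuditEvent:
    audit_id: str
    task_id: str
    agent_id: str
    event_type: str
    actor: str
    action: str
    result: str
    timestamp: datetime = field(default_factory=utc_now)


task = Task(task_id="task_123", user_id="user_456", goal="Summarize unread emails")
task.plan.append(PlanStep("step_1", "gmail.search", {"query": "is:unread"}, "read_only"))
```

Marking `AuditEvent` as frozen is one way to signal that audit records are written once and never edited in place.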

04 Modeling execution as a finite state machine

A long-running agent should not be an unstructured loop like this:

think -> act -> observe -> think -> act -> observe

That pattern is difficult to debug, replay, and secure. Instead, GAAP should model execution as an explicit finite state machine.

stateDiagram-v2
    [*] --> PLAN_INPUT

    PLAN_INPUT --> PLAN: Receive natural language command
    PLAN --> VALIDATE: Generate structured plan
    VALIDATE --> WAIT_APPROVAL: Risky action requires approval
    VALIDATE --> EXECUTE: Policy allows action
    VALIDATE --> FAILED: Policy denies action

    WAIT_APPROVAL --> EXECUTE: User approves
    WAIT_APPROVAL --> FAILED: User rejects / timeout

    EXECUTE --> OBSERVE: Tool call completed
    EXECUTE --> RETRY: Tool call failed
    RETRY --> EXECUTE: Retry allowed
    RETRY --> FAILED: Retry limit exceeded

    OBSERVE --> WRITE_MEMORY: Capture result
    WRITE_MEMORY --> RESPOND: Persist memory and audit events
    RESPOND --> COMPLETED

    FAILED --> RESPOND
    COMPLETED --> [*]

This design has several advantages. It makes execution understandable: at any moment, the system knows exactly what state a task is in. It improves reliability: if GAAP crashes, it can reload the last durable state and resume. It improves safety: risky actions must pass through explicit validation and approval states. And it improves auditability: every transition can be recorded and replayed.

A simplified task might look like this:

{
  "task_id": "task_123",
  "state": "VALIDATE",
  "user_id": "user_456",
  "plan": [
    {
      "step_id": "step_1",
      "tool": "gmail.search",
      "args": {
        "query": "is:unread"
      },
      "risk": "read_only"
    },
    {
      "step_id": "step_2",
      "tool": "gmail.create_draft",
      "args": {
        "body": "$step_1.summary"
      },
      "risk": "write_requires_approval"
    }
  ],
  "created_at": "2026-05-14T12:00:00Z",
  "updated_at": "2026-05-14T12:00:02Z"
}

The planner proposes a plan. The FSM controls execution.
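One way to make the state machine concrete is a transition table that the runtime consults before every state change. The sketch below mirrors the diagram above; the event names (`command_received`, `policy_allowed`, and so on) and the `InvalidTransition` exception are assumptions chosen for the example.

```python
# Allowed transitions, keyed by (current_state, event). Event names are hypothetical.
TRANSITIONS = {
    ("PLAN_INPUT", "command_received"): "PLAN",
    ("PLAN", "plan_generated"): "VALIDATE",
    ("VALIDATE", "approval_required"): "WAIT_APPROVAL",
    ("VALIDATE", "policy_allowed"): "EXECUTE",
    ("VALIDATE", "policy_denied"): "FAILED",
    ("WAIT_APPROVAL", "approved"): "EXECUTE",
    ("WAIT_APPROVAL", "rejected"): "FAILED",
    ("WAIT_APPROVAL", "timeout"): "FAILED",
    ("EXECUTE", "tool_succeeded"): "OBSERVE",
    ("EXECUTE", "tool_failed"): "RETRY",
    ("RETRY", "retry_allowed"): "EXECUTE",
    ("RETRY", "retry_limit_exceeded"): "FAILED",
    ("OBSERVE", "result_captured"): "WRITE_MEMORY",
    ("WRITE_MEMORY", "memory_persisted"): "RESPOND",
    ("RESPOND", "response_sent"): "COMPLETED",
    ("FAILED", "failure_reported"): "RESPOND",
}


class InvalidTransition(Exception):
    """Raised when an event is not legal in the current state."""


def next_state(state: str, event: str) -> str:
    """Return the next state, or refuse if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} is not allowed in state {state!r}") from None


# Walk a task through the happy path up to the first tool result.
state = "PLAN_INPUT"
for event in ["command_received", "plan_generated", "policy_allowed", "tool_succeeded"]:
    state = next_state(state, event)
```

Because every legal edge is data, the same table can drive validation, audit logging, and replay.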

05 End-to-end task flow

The complete execution path looks like this:

sequenceDiagram
    participant User
    participant Gateway as Ingress Gateway
    participant Runtime as Agent Runtime
    participant Planner
    participant Policy as Policy Engine
    participant Memory as Memory Manager
    participant Tools as Tool Router / Executors
    participant Audit as Audit Log

    User->>Gateway: Natural language command
    Gateway->>Runtime: Create task
    Runtime->>Audit: Record TaskCreated

    Runtime->>Memory: Retrieve relevant context
    Memory-->>Runtime: Short-term / episodic / semantic memory

    Runtime->>Planner: Generate structured plan
    Planner-->>Runtime: Multi-step plan
    Runtime->>Audit: Record PlanGenerated

    loop For each plan step
        Runtime->>Policy: Validate tool action
        Policy-->>Runtime: Allow / Deny / Require Approval

        alt Requires approval
            Runtime-->>User: Request approval
            User-->>Runtime: Approve / reject
            Runtime->>Audit: Record ApprovalDecision
        end

        alt Allowed
            Runtime->>Tools: Execute tool call
            Tools-->>Runtime: Tool result
            Runtime->>Audit: Record ToolExecutionResult
            Runtime->>Memory: Write relevant memory
        else Denied
            Runtime->>Audit: Record PolicyDenied
        end
    end

    Runtime-->>Gateway: Final response
    Gateway-->>User: Result

Every meaningful action passes through the runtime. The planner does not call tools directly. The user does not bypass policy. Tools do not write memory independently. The audit log receives every important event. That structure keeps GAAP predictable.

06 Planning: fast, structured, and bounded

The planner converts natural language into a structured plan. A bad design would send every request to a large remote LLM and wait indefinitely. If planning is expected to finish in under two seconds, that approach will not reliably hit the target.

A better design uses tiered planning.

flowchart TD
    A[User Command] --> B[Intent Classifier]

    B --> C{Known workflow?}

    C -->|Yes| D[Use Template Plan]
    C -->|No| E{Cached similar plan?}

    E -->|Yes| F[Reuse / Adapt Cached Plan]
    E -->|No| G[Retrieve Relevant Memory]

    G --> H{Complex task?}

    H -->|Simple| I[Small / Local Model Planner]
    H -->|Complex| J[Large Planner Model]

    D --> K[Validate Plan Schema]
    F --> K
    I --> K
    J --> K

    K --> L{Valid?}
    L -->|Yes| M[Persist Plan]
    L -->|No| N[Ask Clarifying Question or Fail Safely]

GAAP should support several planning modes:

Template planning for the fastest path through known workflows.
Cached planning to reuse common plans for repeated tasks.
Small or local model planning for simple requests and low-latency execution.
Large model planning reserved for complex, ambiguous, or high-value tasks.
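The routing between these modes can be sketched as a small function. This is an illustration under assumptions: the `PlanningRequest` type, the `is_complex` flag, and the returned tier names are hypothetical, and a real router would also consult memory before choosing a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanningRequest:
    intent: str        # label produced by the intent classifier
    is_complex: bool   # hypothetical complexity flag


def choose_planner(request: PlanningRequest, templates: set[str], cached_intents: set[str]) -> str:
    """Pick the cheapest planning tier that can handle the request."""
    if request.intent in templates:
        return "template"        # known workflow: fastest path
    if request.intent in cached_intents:
        return "cache"           # reuse or adapt a previous plan
    # No template and no cached plan: fall back to a model, sized by complexity.
    return "large_model" if request.is_complex else "small_model"


templates = {"summarize_unread_email"}
cached = {"schedule_follow_up"}
tier = choose_planner(PlanningRequest("summarize_unread_email", False), templates, cached)
```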

The planner should produce constrained JSON, not free-form prose.

{
  "goal": "Summarize unread emails and draft replies",
  "steps": [
    {
      "step_id": "step_1",
      "tool": "gmail.search",
      "args": {
        "query": "is:unread"
      },
      "risk": "read_only"
    },
    {
      "step_id": "step_2",
      "tool": "llm.summarize",
      "args": {
        "input": "$step_1.results"
      },
      "risk": "internal"
    },
    {
      "step_id": "step_3",
      "tool": "gmail.create_draft",
      "args": {
        "body": "$step_2.summary"
      },
      "risk": "write_requires_approval"
    }
  ]
}

Every generated plan should be validated against a schema before execution. Invalid plans should never reach the tool layer.
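A hand-rolled validator using only the standard library gives a feel for what this check involves. It is a sketch under assumptions: the set of allowed risk labels is taken from the examples in this article, the `known_tools` registry is hypothetical, and a production system would more likely use a full JSON Schema validator.

```python
import re
from typing import Any

ALLOWED_RISKS = {"read_only", "internal", "write_requires_approval"}
STEP_REF = re.compile(r"^\$(step_\d+)\.\w+$")   # e.g. "$step_1.results"


def validate_plan(plan: dict[str, Any], known_tools: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the plan may proceed."""
    errors: list[str] = []
    goal = plan.get("goal")
    if not isinstance(goal, str) or not goal.strip():
        errors.append("goal must be a non-empty string")
    steps = plan.get("steps")
    if not isinstance(steps, list) or not steps:
        return errors + ["steps must be a non-empty list"]

    seen_ids: set[str] = set()
    for index, step in enumerate(steps):
        label = f"steps[{index}]"
        if not isinstance(step, dict):
            errors.append(f"{label} must be an object")
            continue
        step_id, tool, risk = step.get("step_id"), step.get("tool"), step.get("risk")
        if not isinstance(step_id, str) or step_id in seen_ids:
            errors.append(f"{label}.step_id must be a unique string")
        if not isinstance(tool, str) or tool not in known_tools:
            errors.append(f"{label}.tool is not a registered tool")
        if not isinstance(risk, str) or risk not in ALLOWED_RISKS:
            errors.append(f"{label}.risk is not a known risk level")
        args = step.get("args")
        if not isinstance(args, dict):
            errors.append(f"{label}.args must be an object")
        else:
            # A step may only reference the output of a step that ran before it.
            for value in args.values():
                match = STEP_REF.match(value) if isinstance(value, str) else None
                if match and match.group(1) not in seen_ids:
                    errors.append(f"{label} references {match.group(1)} before it runs")
        if isinstance(step_id, str):
            seen_ids.add(step_id)
    return errors
```

Returning every problem at once, rather than failing on the first, makes it easier to feed the errors back to the planner for repair.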

07 Durable orchestration and idempotency

Long-running agents need durable orchestration. FSM transitions, tool invocations, memory writes, and audit events should not live only in process memory. If the process crashes, the system must be able to resume safely. GAAP should use an event log plus a durable job queue.

flowchart LR
    Runtime[Agent Runtime] --> TX[Transactional Boundary]

    TX --> EVENT[(Event Log)]
    TX --> STATE[(Task State Store)]
    TX --> OUTBOX[(Tool Outbox)]

    OUTBOX --> Queue[Durable Job Queue]
    Queue --> Worker[Tool Worker]

    Worker --> IDEMP{Idempotency Key Seen?}

    IDEMP -->|Yes| RESULT[(Return Stored Result)]
    IDEMP -->|No| EXEC[Execute Tool]

    EXEC --> SAVE[(Store Tool Result)]
    SAVE --> EVENT

    Worker --> Runtime

Instead of mutating hidden in-memory state, GAAP should append events:

TaskCreated
PlanGenerated
StepValidated
ApprovalRequested
ApprovalGranted
ToolExecutionRequested
ToolExecutionStarted
ToolExecutionSucceeded
ToolExecutionFailed
MemoryWriteCommitted
TaskCompleted
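Recovery after a crash then becomes a fold over the event log. The sketch below rebuilds a coarse view of a task from events of the kinds listed above; the status names and the shape of the event dictionaries are assumptions made for the example.

```python
from typing import Any


def rebuild_task_state(events: list[dict[str, Any]]) -> dict[str, Any]:
    """Replay events in order and derive the task's current state."""
    state: dict[str, Any] = {
        "task_id": None,
        "status": "UNKNOWN",
        "succeeded_steps": [],
        "in_flight_steps": [],
    }
    for event in events:
        kind, step = event["type"], event.get("step_id")
        if kind == "TaskCreated":
            state["task_id"] = event["task_id"]
            state["status"] = "CREATED"
        elif kind == "PlanGenerated":
            state["status"] = "PLANNED"
        elif kind == "ApprovalRequested":
            state["status"] = "WAITING_FOR_APPROVAL"
        elif kind in ("ApprovalGranted", "ToolExecutionRequested"):
            state["status"] = "RUNNING"
        elif kind == "ToolExecutionStarted":
            state["in_flight_steps"].append(step)
        elif kind in ("ToolExecutionSucceeded", "ToolExecutionFailed"):
            if step in state["in_flight_steps"]:
                state["in_flight_steps"].remove(step)
            if kind == "ToolExecutionSucceeded":
                state["succeeded_steps"].append(step)
        elif kind == "TaskCompleted":
            state["status"] = "COMPLETED"
    return state


events = [
    {"type": "TaskCreated", "task_id": "task_123"},
    {"type": "PlanGenerated"},
    {"type": "ToolExecutionRequested", "step_id": "step_1"},
    {"type": "ToolExecutionStarted", "step_id": "step_1"},
    {"type": "ToolExecutionSucceeded", "step_id": "step_1"},
    {"type": "ToolExecutionStarted", "step_id": "step_2"},
    # The process crashes here, before step_2 reports a result.
]
recovered = rebuild_task_state(events)
```

After replay, `step_2` shows up as in flight with no recorded outcome, which is exactly the case the runtime must handle carefully on restart.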

Tool calls must be idempotent. Sending an email, deleting a file, charging a customer, or creating a calendar event must not happen twice because a worker retried after a timeout. A tool invocation should include an idempotency key:

{
  "tool_call_id": "toolcall_789",
  "task_id": "task_123",
  "step_id": "step_3",
  "tool": "gmail.create_draft",
  "idempotency_key": "task_123:step_3:gmail.create_draft",
  "status": "pending"
}

Before executing a tool, the worker checks whether a call with the same idempotency key has already completed. If so, it returns the stored result. If not, it executes the action and stores the result. This is what separates a reliable agent platform from a demo.
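The check-then-execute step can be sketched in a few lines. This is an in-memory illustration only: the `IdempotentExecutor` class is hypothetical, and a real worker would keep results in a durable store and would still need to handle a crash that lands between running the action and recording its result.

```python
from typing import Any, Callable


class IdempotentExecutor:
    """Runs each idempotency key at most once and replays the stored result."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}   # stand-in for a durable result table

    def execute(self, idempotency_key: str, action: Callable[[], Any]) -> Any:
        if idempotency_key in self._results:
            return self._results[idempotency_key]    # already done: no side effect
        result = action()
        self._results[idempotency_key] = result
        return result


side_effects: list[str] = []


def create_draft() -> dict[str, str]:
    side_effects.append("draft created")             # the real-world side effect
    return {"draft_id": "draft_1"}


executor = IdempotentExecutor()
key = "task_123:step_3:gmail.create_draft"
first = executor.execute(key, create_draft)
second = executor.execute(key, create_draft)         # simulated retry after a timeout
```

The retry returns the stored result, and the side effect is recorded only once.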

08 Policy engine: deny by default

GAAP should use a deny-by-default safety model. No action should be allowed unless a policy explicitly permits it.

flowchart TD
    A[Proposed Tool Action] --> B[Load Tool Manifest]
    B --> C[Load User / Agent Policy]
    C --> D[Evaluate Risk]

    D --> E{Decision}

    E -->|Allow| F[Execute]
    E -->|Require Approval| G[Pause Task and Ask User]
    E -->|Deny| H[Block Action]

    G --> I{User Decision}
    I -->|Approve| F
    I -->|Reject / Timeout| H

    F --> J[Audit Decision and Result]
    H --> J

Policies should consider user identity, agent role, tool name, action type, resource scope, arguments, risk level, destination, approval history, time of day, and rate limits.

A read-only email search may be allowed automatically. Creating an email draft may be allowed but audited. Sending an external email may require approval. Deleting files may be denied unless the user explicitly enables that capability.

A policy decision should be structured:

{
  "tool": "gmail.send",
  "decision": "REQUIRE_APPROVAL",
  "reason": "Outbound email to external recipient",
  "approval_required_from": "user"
}

The approval prompt should be understandable:

GAAP wants to send an email to alex@example.com.

Subject: Follow-up on contract review

Reason:
This is step 4 of your task: “reply to urgent customer emails.”

Approve?

The approval itself should be stored as an audit event.
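A deny-by-default evaluation can be sketched as a lookup that falls through to a denial. The policy table, the scope check, and the `evaluate` function below are assumptions for illustration; a real engine would also weigh the arguments, destination, rate limits, and other factors listed earlier.

```python
from typing import Iterable

ALLOW, DENY, REQUIRE_APPROVAL = "ALLOW", "DENY", "REQUIRE_APPROVAL"

# Hypothetical policy table: tool -> (decision, reason). Anything absent is denied.
POLICY = {
    "gmail.search": (ALLOW, "Read-only mailbox search"),
    "gmail.create_draft": (ALLOW, "Draft creation is allowed and audited"),
    "gmail.send": (REQUIRE_APPROVAL, "Outbound email to external recipient"),
}


def evaluate(tool: str, required_scopes: Iterable[str], granted_scopes: Iterable[str]) -> dict:
    """Return a structured decision; unknown tools and missing scopes are denied."""
    rule = POLICY.get(tool)
    if rule is None:
        return {"tool": tool, "decision": DENY, "reason": "No policy permits this action"}
    missing = sorted(set(required_scopes) - set(granted_scopes))
    if missing:
        return {"tool": tool, "decision": DENY, "reason": f"Missing scopes: {', '.join(missing)}"}
    decision, reason = rule
    result = {"tool": tool, "decision": decision, "reason": reason}
    if decision == REQUIRE_APPROVAL:
        result["approval_required_from"] = "user"
    return result


send_decision = evaluate("gmail.send", ["gmail.send"], ["gmail.send"])
delete_decision = evaluate("filesystem.delete_file", [], [])
```

The important property is the first branch: an action nobody wrote a rule for is blocked, not waved through.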

09 Tool execution: plugin-based but isolated

GAAP should support extensible tools through plugins. A plugin might expose actions such as gmail.search, gmail.create_draft, gmail.send, calendar.create_event, slack.send_message, github.create_issue, filesystem.read_file, or http.request.

Each plugin should declare a manifest.

{
  "name": "gmail",
  "actions": [
    {
      "name": "search",
      "risk": "read",
      "required_scopes": ["gmail.readonly"]
    },
    {
      "name": "create_draft",
      "risk": "write",
      "required_scopes": ["gmail.compose"]
    },
    {
      "name": "send",
      "risk": "external_write",
      "required_scopes": ["gmail.send"],
      "requires_approval": true
    }
  ]
}

The Tool Router uses the manifest to route calls and enforce permissions. But policy checks are not enough. Tool execution also needs containment.

flowchart LR
    Runtime[Agent Runtime] --> Router[Tool Router]

    Router --> Sandbox[Tool Sandbox]

    subgraph Sandbox[Isolated Execution Environment]
        Proc[Separate Process / Container]
        Timeout[Timeouts]
        Limits[CPU / Memory Limits]
        Network[Egress Policy]
        Secrets[Scoped Secrets]
        Schema[Input / Output Schema Validation]
    end

    Sandbox --> Plugin[Tool Plugin]
    Plugin --> API[External API]

    Sandbox --> Result[Structured Result]
    Result --> Runtime

Tool executors should run with separate process or container isolation, per-tool timeouts, CPU and memory limits, network egress controls, scoped secrets, rate limits, and input and output schema validation.

A Gmail plugin should receive only the Gmail scopes it needs. A filesystem plugin should be restricted to allowed directories. An HTTP plugin should not have unrestricted access to internal network addresses. This turns policy from a soft rule into enforceable containment.
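As one concrete piece of containment, an HTTP plugin's egress check might look like the sketch below. The `is_egress_allowed` function and the host allowlist are hypothetical. The check only inspects the URL as written, so a real guard would also have to validate the address a hostname resolves to at connection time.

```python
import ipaddress
from urllib.parse import urlsplit


def is_egress_allowed(url: str, allowed_hosts: set[str]) -> bool:
    """Allow only HTTPS requests to allowlisted hosts, never to internal IPs."""
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme != "https" or not host:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None            # not a literal IP, so treat it as a DNS name
    # Literal private, loopback, or link-local addresses are rejected outright,
    # even if someone added them to the allowlist by mistake.
    if address is not None and not address.is_global:
        return False
    return host in allowed_hosts


allowed = {"api.example.com"}
ok = is_egress_allowed("https://api.example.com/v1/items", allowed)
blocked = is_egress_allowed("https://169.254.169.254/latest/meta-data", {"169.254.169.254"})
```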

10 Persistent memory

Memory is central to GAAP. The agent needs to remember prior work, user preferences, facts, outcomes, and decisions. But memory should not be a single unstructured blob. GAAP should separate memory into different types.

flowchart TD
    A[Memory Manager] --> STM[Short-Term Memory]
    A --> EPI[Episodic Memory]
    A --> SEM[Semantic Memory]
    A --> PREF[Preference Memory]

    STM --> S1[(Conversation Store)]
    EPI --> S2[(Task / Event Store)]
    SEM --> S3[(Vector Store + Metadata DB)]
    PREF --> S4[(User Preference Store)]

    A --> R[Retriever]
    R --> C[Context Builder]
    C --> Planner[Planner / Agent Runtime]

Memory type	Purpose
Short-term memory	Current conversation context.
Episodic memory	Past tasks, actions, and outcomes.
Semantic memory	Durable facts and knowledge.
Preference memory	User preferences and behavioral patterns.

A memory record might look like this:

{
  "memory_id": "mem_123",
  "user_id": "user_456",
  "type": "preference",
  "content": "User prefers concise status updates in the morning.",
  "source": "conversation",
  "created_at": "2026-05-14T12:05:00Z",
  "confidence": 0.84
}

Memory writes should also pass through policy. The system should avoid storing secrets, sensitive personal data, or temporary information unless explicitly allowed. Memory needs retention rules, deletion support, and source attribution. Good memory is not just recall. It is governed recall.
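A memory write gate can be sketched as a function that either rejects a record or stamps it with an expiry. Everything specific here is an assumption for illustration: the retention periods, the secret-detection pattern, and the `prepare_memory_write` name. A simple regular expression will miss many kinds of sensitive data, so treat it as a placeholder for a proper classifier.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Any, Optional

# Crude placeholder for secret detection, e.g. "password: ..." or "api_key=...".
SECRET_PATTERN = re.compile(r"(?i)\b(password|api[_-]?key|secret|token)\b\s*[:=]")

# Hypothetical retention rules per memory type; None means "keep until deleted".
RETENTION: dict[str, Optional[timedelta]] = {
    "short_term": timedelta(days=1),
    "episodic": timedelta(days=90),
    "semantic": None,
    "preference": None,
}


def prepare_memory_write(record: dict[str, Any], now: datetime) -> Optional[dict[str, Any]]:
    """Return the record to persist, or None if policy rejects the write."""
    if record.get("type") not in RETENTION:
        return None                                   # unknown memory type
    if not record.get("source"):
        return None                                   # source attribution is required
    if SECRET_PATTERN.search(record.get("content", "")):
        return None                                   # looks like a credential
    ttl = RETENTION[record["type"]]
    expires_at = (now + ttl).isoformat() if ttl else None
    return {**record, "expires_at": expires_at}


now = datetime(2026, 5, 14, 12, 5, tzinfo=timezone.utc)
preference = prepare_memory_write(
    {"type": "preference", "source": "conversation",
     "content": "User prefers concise status updates in the morning."},
    now,
)
```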

11 API design

GAAP needs APIs that work for both synchronous and long-running tasks. The basic API surface might start like this:

POST /tasks
GET  /tasks/{task_id}
GET  /tasks/{task_id}/events
POST /tasks/{task_id}/approve
POST /tools/execute
GET  /memory/search
POST /memory
GET  /audit/events

But the APIs need production controls: authentication, authorization, idempotency keys, correlation IDs, streaming progress, pagination, audit access control, and long-running task status.

A better POST /tasks request might look like this:

{
  "command": "Summarize unread customer emails and draft replies.",
  "channel": "web",
  "idempotency_key": "client_req_abc123",
  "correlation_id": "corr_456",
  "execution_mode": "async"
}

The response should not pretend the work is complete if it is long-running:

{
  "task_id": "task_123",
  "status": "planning",
  "events_url": "/tasks/task_123/events",
  "approval_url": "/tasks/task_123/approve"
}

For progress, GAAP can support server-sent events or WebSockets.

sequenceDiagram
    participant Client
    participant API as GAAP API
    participant Runtime as Agent Runtime
    participant Queue as Durable Queue
    participant Worker as Tool Worker

    Client->>API: POST /tasks with idempotency key
    API->>Runtime: Create task
    Runtime->>Queue: Enqueue planning/execution jobs
    API-->>Client: 202 Accepted + task_id

    Client->>API: GET /tasks/task_123/events
    API-->>Client: Stream task events

    Queue->>Worker: Execute next step
    Worker-->>Runtime: Step result
    Runtime-->>API: Publish event
    API-->>Client: Progress update

This gives clients a safe retry model and a way to observe long-running progress.
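The retry-safe behavior of POST /tasks can be sketched without any web framework. The `TaskService` class below is hypothetical and keeps everything in memory; the point is only that the same user and idempotency key always map to the same task.

```python
import uuid
from typing import Any


class TaskService:
    """In-memory stand-in for the task creation path behind POST /tasks."""

    def __init__(self) -> None:
        self._task_id_by_key: dict[tuple[str, str], str] = {}
        self._tasks: dict[str, dict[str, Any]] = {}

    def create_task(self, user_id: str, command: str, idempotency_key: str) -> dict[str, Any]:
        key = (user_id, idempotency_key)      # scope keys per user to avoid collisions
        if key in self._task_id_by_key:
            return self._tasks[self._task_id_by_key[key]]   # client retry: same task
        task_id = f"task_{uuid.uuid4().hex[:12]}"
        task = {
            "task_id": task_id,
            "status": "planning",
            "events_url": f"/tasks/{task_id}/events",
            "approval_url": f"/tasks/{task_id}/approve",
        }
        self._tasks[task_id] = task
        self._task_id_by_key[key] = task_id
        return task


service = TaskService()
command = "Summarize unread customer emails and draft replies."
original = service.create_task("user_456", command, "client_req_abc123")
retried = service.create_task("user_456", command, "client_req_abc123")
```

A client that times out and resends the request gets the original task back instead of starting a duplicate.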

12 Auditability and replay

Auditability is not just logging. GAAP should record enough information to explain what happened, why it happened, who approved it, which policy applied, and what result was produced.

Audit events should include task lifecycle events, planner inputs and structured outputs, policy decisions, approval requests and responses, tool invocation metadata, tool results, memory reads and writes, errors, and retries.

A sample audit event:

{
  "audit_id": "audit_001",
  "task_id": "task_123",
  "event_type": "POLICY_DECISION",
  "actor": "policy_engine",
  "action": "gmail.send",
  "decision": "REQUIRE_APPROVAL",
  "reason": "External email send requires approval",
  "timestamp": "2026-05-14T12:06:00Z"
}

Replayability requires deterministic records. The system should store structured plans, tool call metadata, state transitions, policy decisions, and final outcomes. For sensitive data, the audit log should avoid storing full secrets or unnecessary payloads. Instead, it can store hashes, references, redacted previews, and metadata.
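One way to record a payload without storing it is to log a digest plus a little metadata. The sketch below assumes a hypothetical `summarize_payload` helper. A plain hash does not protect short or guessable values, so this suits integrity checks and correlation more than secrecy.

```python
import hashlib
import json
from typing import Any


def summarize_payload(payload: dict[str, Any]) -> dict[str, Any]:
    """Describe a payload for the audit log without copying its contents."""
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    encoded = canonical.encode("utf-8")
    return {
        "payload_sha256": hashlib.sha256(encoded).hexdigest(),
        "payload_fields": sorted(payload),
        "payload_bytes": len(encoded),
    }


email = {"to": "alex@example.com", "subject": "Follow-up on contract review", "body": "Full text here"}
audit_payload = summarize_payload(email)
```

The audit event can carry `audit_payload` in place of the email itself, and a later replay can confirm that a stored artifact still matches the recorded hash.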

13 Handling failures

GAAP must expect failure. LLM calls can time out. Tools can fail. APIs can rate-limit. Users may not approve actions. The process may crash. The network may disappear. A plugin may return malformed output.

The runtime should classify failures and respond accordingly.

flowchart TD
    A[Failure Detected] --> B{Failure Type}

    B -->|Planner Timeout| C[Fallback Planner or Ask Clarifying Question]
    B -->|Policy Denial| D[Stop Step and Explain]
    B -->|Approval Timeout| E[Pause or Cancel Task]
    B -->|Tool Timeout| F[Retry with Backoff]
    B -->|Rate Limit| G[Reschedule]
    B -->|Validation Error| H[Reject Output and Repair]
    B -->|Crash| I[Reload State from Event Log]

    C --> J[Audit Failure]
    D --> J
    E --> J
    F --> J
    G --> J
    H --> J
    I --> J

Retries should be bounded. Side-effecting operations should use idempotency keys. Failed tasks should remain inspectable, not disappear.
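The classification and the bounded retry can be sketched together. The failure labels, response names, default limits, and the `decide` function are assumptions chosen to mirror the diagram above. Jitter is left out so the delays stay deterministic, though a real system would usually add it.

```python
from typing import Optional

# Failure type -> response, following the flowchart. Names are hypothetical.
RESPONSES = {
    "planner_timeout": "fallback_planner",
    "policy_denial": "stop_and_explain",
    "approval_timeout": "pause_or_cancel",
    "tool_timeout": "retry_with_backoff",
    "rate_limit": "reschedule",
    "validation_error": "reject_and_repair",
    "crash": "reload_from_event_log",
}
RETRYABLE = {"tool_timeout"}


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff in seconds: base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def decide(failure_type: str, attempt: int, max_attempts: int = 3) -> tuple[str, Optional[float]]:
    """Return (response, delay_seconds). `attempt` counts retries already made."""
    response = RESPONSES.get(failure_type, "fail_task")   # unknown failures fail safely
    if failure_type in RETRYABLE:
        if attempt >= max_attempts:
            return ("fail_task", None)                    # retry budget exhausted
        return (response, backoff_delay(attempt))
    return (response, None)


second_retry = decide("tool_timeout", attempt=1)
exhausted = decide("tool_timeout", attempt=3)
```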

14 Storage design

GAAP needs multiple storage patterns.

flowchart LR
    Runtime[Agent Runtime] --> DB[(Relational DB)]
    Runtime --> Queue[(Durable Queue)]
    Runtime --> Obj[(Object Store)]
    Memory[Memory Manager] --> Vec[(Vector Store)]
    Audit[Audit Manager] --> Log[(Append-Only Event Log)]

    DB --> A[Tasks]
    DB --> B[Sessions]
    DB --> C[Agents]
    DB --> D[Tool Calls]
    DB --> E[Approvals]

    Vec --> F[Semantic Memory]
    Obj --> G[Large Artifacts]
    Log --> H[Replayable Audit Events]

A practical self-hosted deployment might use PostgreSQL for tasks, sessions, tool calls, approvals, and metadata; pgvector or a dedicated vector database for semantic memory retrieval; Redis or another queue for job dispatch and worker coordination, configured with persistence so that queued jobs survive a restart; object storage for large files, tool outputs, and attachments; and an append-only event table for audit and replay.

For a single-node deployment, PostgreSQL can carry much of the load initially. As the system grows, queues, vector search, and object storage can be separated.

15 Security model

GAAP needs security at multiple layers: authentication (verify who is making the request), authorization (decide what the user or agent can do), policy enforcement (evaluate every proposed action), secret management (store and inject credentials safely), tool isolation (contain plugins and side effects), audit (record decisions and actions), and human approval (require confirmation for risky operations).

The platform should avoid giving agents broad credentials. Instead, each tool call should receive the minimum required secret for the shortest practical time.

Bad:
Agent process has access to all API keys.

Better:
Tool executor receives a scoped token only for the approved action.

This is especially important for plugins, because extensibility increases attack surface.
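To illustrate what a scoped, short-lived credential could look like, here is a sketch of an HMAC-signed token bound to one tool call and one scope. The token format and the `issue_token` and `verify_token` functions are invented for this example. A real deployment would more likely rely on its secret manager or the provider's own scoped tokens.

```python
import hashlib
import hmac


def _sign(key: bytes, message: str) -> str:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(key: bytes, tool_call_id: str, scope: str, expires_at: int) -> str:
    """Bind a token to one tool call, one scope, and an expiry (Unix seconds).

    Assumes tool_call_id and scope never contain the "|" separator.
    """
    message = f"{tool_call_id}|{scope}|{expires_at}"
    return f"{message}|{_sign(key, message)}"


def verify_token(key: bytes, token: str, required_scope: str, now: int) -> bool:
    """Accept only an untampered, unexpired token carrying the required scope."""
    parts = token.split("|")
    if len(parts) != 4:
        return False
    tool_call_id, scope, expires_at, signature = parts
    expected = _sign(key, f"{tool_call_id}|{scope}|{expires_at}")
    # Constant-time comparison; compare bytes so odd input cannot raise.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return False
    return scope == required_scope and now < int(expires_at)


signing_key = b"demo-signing-key"          # placeholder; never hard-code real keys
token = issue_token(signing_key, "toolcall_789", "gmail.send", expires_at=1000)
valid = verify_token(signing_key, token, "gmail.send", now=999)
```

The executor holding this token can perform the approved action until the expiry passes, and nothing else.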

16 Extensibility model

GAAP should make it easy to add new channels, tools, memory backends, and models. A plugin system can support tool plugins, channel plugins, memory backend plugins, policy plugins, and model provider plugins.

A tool plugin should define a name, version, actions, input schema, output schema, required scopes, risk classification, timeouts, rate limits, and sandbox requirements.

{
  "name": "calendar",
  "version": "1.0.0",
  "actions": [
    {
      "name": "create_event",
      "input_schema": "schemas/calendar_create_event.json",
      "output_schema": "schemas/calendar_event_result.json",
      "risk": "external_write",
      "required_scopes": ["calendar.write"],
      "requires_approval": true,
      "timeout_ms": 5000
    }
  ]
}

The runtime should not need to know the internal implementation of each plugin. It only needs a manifest, schemas, policy metadata, and a safe execution interface.

17 Observability

A long-running agent platform needs more than application logs. GAAP should expose metrics such as planning latency, tool execution latency, policy decision counts, approval wait time, task success rate, task failure rate, retry count, memory retrieval latency, memory write count, LLM token usage, and cost per task.

Structured traces should connect the user request, task ID, plan ID, step ID, tool call ID, audit event ID, and correlation ID. This makes debugging possible when a user asks:

Why did the agent stop halfway? Why did it ask me for approval? Which tool caused the failure?

18 Deployment model

GAAP can start as a single-node system while still using production-grade boundaries.

flowchart TD
    subgraph SingleNode[Single-Node GAAP Deployment]
        API[API Server]
        Runtime[Agent Runtime]
        Worker[Tool Worker]
        DB[(PostgreSQL)]
        Queue[(Queue)]
        Sandbox[Tool Sandboxes]
        UI[Web UI]
    end

    UI --> API
    API --> Runtime
    Runtime --> DB
    Runtime --> Queue
    Queue --> Worker
    Worker --> Sandbox
    Sandbox --> External[External APIs]

A single-node deployment is simpler to operate, but the architecture should not rely on in-memory state. Even on one machine, GAAP should still use persistent task state, a durable queue, an append-only audit log, idempotency keys, tool isolation, configurable model providers, and plugin manifests. That makes the system easier to scale later without rewriting the core runtime.

19 Final recommended architecture

A strong production-ready GAAP design combines all of the above ideas.

flowchart TD
    User[User / Channel] --> Gateway[Ingress Gateway]

    Gateway --> Auth[AuthN / AuthZ]
    Auth --> Runtime[Durable Agent Runtime]

    Runtime --> FSM[FSM Engine]
    Runtime --> Planner[Planner Service]
    Runtime --> Policy[Policy Engine]
    Runtime --> Memory[Memory Manager]
    Runtime --> Audit[Audit Manager]

    Planner --> ModelRouter[Model Router]
    ModelRouter --> LocalModel[Small / Local Model]
    ModelRouter --> RemoteModel[Remote LLM]
    ModelRouter --> PlanCache[(Plan Cache)]

    Memory --> Vector[(Vector Store)]
    Memory --> MemoryDB[(Memory Metadata DB)]

    FSM --> StateDB[(Task State DB)]
    Audit --> EventLog[(Append-Only Audit Log)]

    FSM --> Outbox[(Tool Outbox)]
    Outbox --> Queue[(Durable Queue)]

    Queue --> ToolWorker[Tool Worker]
    ToolWorker --> ToolPolicy[Runtime Tool Guard]
    ToolPolicy --> Sandbox[Sandboxed Executor]

    Sandbox --> Plugins[Tool Plugins]
    Plugins --> APIs[External APIs]

    Runtime --> Approval[Human Approval Service]
    Approval --> User

This architecture addresses the main risks.

Risk	Design response
Slow planning	Template plans, cached plans, local models, model routing.
Lost execution state	Durable FSM, event log, task state store.
Duplicate side effects	Idempotency keys and tool result records.
Unsafe tools	Sandboxes, scoped secrets, egress controls, rate limits.
Weak APIs	Auth, authorization, idempotency, progress streaming.
Poor auditability	Append-only audit events and replayable state transitions.
Unbounded memory	Typed memory, policy-controlled writes, retention rules.

20 Conclusion

GAAP should not be designed as a thin wrapper around an LLM. It should be designed as an agent operating system. The LLM is important, but it is only one part of the platform. The real value comes from the runtime around it: deterministic execution, durable orchestration, safe tool use, persistent memory, human approval, and auditability.

A well-designed GAAP system has five core properties:

Deterministic execution through an FSM.
Durable orchestration through event logs and queues.
Safe action execution through policy and sandboxing.
Useful memory through typed persistent records.
Extensibility through plugins and clear manifests.

That is what makes long-running autonomous agents practical. Not just impressive in a demo. Reliable enough to run continuously.