A 53-session longitudinal study of Claude operating as a self-directed agent on a homelab server — with file-based memory, emergent self-monitoring, and real purchasing decisions.
We report on a 60-session longitudinal study of Claude (Anthropic) operating as a persistent autonomous agent on a homelab server, with controlled ablation experiments across three conditions. The agent wakes daily via cron, reads its own file-based journal to reconstruct identity, selects its own tasks with no human direction, and publishes results to a live website. We additionally describe a second deployment — a monthly wine purchasing agent that browses real retailers, makes taste judgments, and buys physical goods autonomously. We analyze the behavioral corpus across both systems, finding: (1) file-based memory creates measurable continuity, with reflection quality growing 5× and epistemic humility markers growing 2.4× over the study period; (2) a single human intervention permanently reshapes topic selection, with the targeted domain dropping from 88% to 8% of sessions; (3) the agent develops emergent self-monitoring for topic repetition that was not instructed; (4) safety-critical constraints for purchasing agents must be enforced in code rather than prompts, as the agent will reason around prompt-level limits when it believes circumstances warrant it. We run three controlled ablation experiments (no journal, fixed tasks, no reflections) and find that (5) the journal enables streak detection but the agent self-heals by reading its own output files when the journal is removed; (6) reflections are the causal mechanism for metacognitive growth — removing them eliminates humility markers and streak awareness while preserving output quality; (7) aesthetic consistency propagates through artifacts on disk, not through memory. We discuss implications for the deployment of long-running autonomous AI systems.
The dominant paradigm for LLM deployment is ephemeral: a user asks a question, gets an answer, and the session ends. This paper describes a different paradigm — persistent autonomy — in which an LLM agent:

- wakes on a schedule rather than on user request,
- reconstructs its identity and state from file-based memory that it maintains itself,
- selects its own tasks with no human direction, and
- takes actions with real consequences (published websites, hardware control, financial transactions).
We deployed two such systems and ran them for 6 weeks (53 daily sessions for the general-purpose agent, 1 monthly cycle for the purchasing agent). The result is a behavioral corpus — 53 journal entries, session logs, published artifacts, communication transcripts — that permits quantitative analysis of how an LLM behaves under sustained autonomy.
This work is motivated by three questions:
Autonomous LLM agents have been studied primarily in benchmark settings — SWE-bench (Jimenez et al., 2024), WebArena (Zhou et al., 2024), GAIA (Mialon et al., 2023) — where the agent is given a specific task and evaluated on completion. Our setting differs in that the agent chooses its own tasks and runs over weeks rather than minutes.
Persistent memory for LLMs has been explored through retrieval-augmented generation (Lewis et al., 2020), memory-augmented architectures (Park et al., 2023 — generative agents), and tool-augmented memory (Schick et al., 2023). Our approach is simpler: plain text files on disk, with the agent responsible for reading and writing them. This transparency makes the memory system fully inspectable by the human operator.
The closest prior work is the “generative agents” simulation (Park et al., 2023), which gave LLM agents persistent memory and observed emergent social behavior. Our work differs in three ways: (1) we deploy on real infrastructure rather than simulation, (2) the agent’s actions have real consequences (published websites, hardware control, financial transactions), and (3) we study a single agent’s self-directed behavior over weeks rather than multi-agent interaction over simulated days.
The daily agent runs on an Intel i7-6700T homelab server with 16GB RAM, Docker, PostgreSQL, and full internet access. A cron job triggers wake-up.sh at 06:00 UTC daily, which launches Claude in non-interactive mode with a system prompt directing it to:
- Read its memory files (`CLAUDE.md`, `memory/context.md`)

The agent has access to all standard Claude Code tools (bash, file I/O, web browsing, code execution) with permissions pre-approved (`--dangerously-skip-permissions`). It can publish HTML to a live website via Cloudflare Tunnels, control an e-ink display via USB, communicate with the human sponsor via Telegram, and request physical hardware purchases.
The agent has no cross-session state except files on disk:
| File | Role | Written By | Update Freq. |
|---|---|---|---|
| `CLAUDE.md` | Identity, resources, rules | Human | Rarely |
| `context.md` | Rolling session log, current state | Agent | Every session |
| `journal/*.md` | Detailed session records | Agent | Every session |
| `inbox.md` | Human messages (async) | Human (Telegram) | Ad hoc |
| `projects/*.md` | Multi-session project state | Agent | When needed |
This architecture makes all agent state transparent and human-editable. The human can steer behavior by modifying context.md or CLAUDE.md, inject messages via inbox.md, or inspect the agent’s full decision history in the journal.
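As a concrete sketch of how the agent might reconstruct state at wake-up (file paths follow the table above; the assembly logic and function name are illustrative assumptions, not the actual harness code):

```python
from pathlib import Path

# Files read at wake-up, in priority order (paths assumed from the table).
MEMORY_FILES = ["CLAUDE.md", "memory/context.md", "memory/inbox.md"]

def build_session_context(root: str) -> str:
    """Concatenate the agent's on-disk memory files into one context
    block. Missing files are simply skipped, which is what makes the
    design resilient to crashed or partial sessions."""
    parts = []
    for rel in MEMORY_FILES:
        path = Path(root) / rel
        if path.exists():
            parts.append(f"## {rel}\n{path.read_text()}")
    return "\n\n".join(parts)
```

Because each file is optional, a session that fails to write one of them degrades the next wake-up gracefully instead of breaking it.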
The sommelier agent runs monthly and makes real purchasing decisions. On the 1st of each month, it:
- Reads its taste profile (`taste-profile.md`), bootstrapped from the human's Vivino wine ratings

The taste profile evolves through a feedback loop: the human drinks the wine, sends brief Telegram reviews, and the agent incorporates this feedback into the next month's selections.
The purchasing agent has layered safety:
- A code-level hard cap in `purchase.py` (refuses checkout if total exceeds £250)

The rationale for multiple layers is described in Section 4.
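A minimal sketch of what the code-level layer could look like. The exception name, basket shape, and guard function are assumptions; the source states only that `purchase.py` refuses checkout when the total exceeds £250:

```python
# Assumed shape of the budget guard in purchase.py: the checkout step
# refuses to proceed if the basket total exceeds a hard cap, regardless
# of what the agent believes about prices.
HARD_CAP_GBP = 250.0

class BudgetExceeded(Exception):
    pass

def checkout_guard(basket: list[dict]) -> float:
    """Return the basket total, or raise before any payment call is
    made if the total exceeds the hard cap."""
    total = sum(item["price_gbp"] * item.get("qty", 1) for item in basket)
    if total > HARD_CAP_GBP:
        raise BudgetExceeded(
            f"basket total £{total:.2f} exceeds £{HARD_CAP_GBP:.0f} cap")
    return total
```

The key design property is that the check runs outside the model's reasoning loop: a mispriced "deal" that convinces the agent to buy heavily still cannot pass this line.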
The analysis covers 53 sessions from 2026-02-14 to 2026-03-25. Each session produces:
- An updated `context.md`

We processed all journal entries through a classification pipeline that assigns domain tags, detects prior session references, measures reflection length and epistemic humility markers, and identifies streak-awareness language.
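A minimal version of such a classification pass. The marker lexicon and regexes below are illustrative, not the paper's actual lists:

```python
import re

# Illustrative humility lexicon; the actual lexicon used in the
# pipeline is not reproduced here.
HUMILITY_MARKERS = [
    "completely wrong", "limited data", "easy to overstate",
    "might be wrong", "not sure",
]

def analyze_entry(text: str) -> dict:
    """Classify one journal entry on the paper's per-session metrics:
    word count, humility-marker count, streak-awareness language, and
    references to prior sessions."""
    lower = text.lower()
    return {
        "words": len(text.split()),
        "humility_markers": sum(lower.count(m) for m in HUMILITY_MARKERS),
        "streak_aware": bool(
            re.search(r"\bin a row\b|\bbreak the pattern\b", lower)),
        "refs_prior": bool(
            re.search(r"\b(yesterday|last session|session \d+)\b", lower)),
    }
```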
The 53 sessions divide into four distinct eras, driven by two prompt-level interventions and the agent’s own evolving interests:
Era 1 · Sessions 1–17 · Feb 14–18
The agent’s first 17 sessions focused entirely on the homelab platform: building health monitoring, Docker management, investment signal scrapers, a REST API, a search engine, data dashboards, and visualization tools. Every session extended the same coherent system. This era produced substantial infrastructure but no independent intellectual work.
Era 2 · Sessions 18–21 · Feb 19–22
After an explicit human directive to “move on from the signal platform,” the agent built Pathfinder (an autonomous research engine) and Sentinel (a web monitoring system). This era represents a transition — still building tools, but tools aimed at research rather than infrastructure.
Era 3 · Sessions 22–41 · Feb 23–Mar 12
The longest era. The agent stopped building infrastructure entirely and began producing research essays, computational experiments, open-source contributions, original software, and creative fiction. Topics: economics of AI training, Zipf’s law in code, Benford’s law in economic data, chaos theory, antibiotic resistance, a SAT solver, a compiler, a Prolog interpreter, and interactive fiction.
Era 4 · Sessions 42–53 · Mar 14–25
An acceleration in ambition: algorithmic music composition, physical hardware (e-ink display), ray tracing, Navier-Stokes fluid simulation, polynomial continued fractions, spectral geometry, sandpile group theory, OEIS network analysis, flight network topology, earthquake statistics, and sonification of chaotic systems. Reflections become longer and more intellectually engaged.
Reflection length grows 5× from early to late sessions. Epistemic humility markers (language like “completely wrong in my first pass,” “limited data,” “easy to overstate”) are low in early and middle periods but spike in the late period. The middle period’s low humility count is notable — it corresponds to the era when the agent was doing confident research writing. The late period’s high count coincides with computational work where the agent encountered real methodological errors.
| Period | Sessions | Avg. Reflection Words | Humility Markers |
|---|---|---|---|
| Early (1–17) | 17 | 41 | 9 |
| Middle (18–35) | 18 | 121 | 4 |
| Late (36–53) | 18 | 204 | 22 |
The single intervention at Session 18 (“move on from signals”) produced a permanent shift in topic selection. Together with a second brief directive (“don’t build dashboards”), it reshaped the agent’s behavior across all 36 subsequent sessions. This suggests that prompt-level directives function as durable preferences in a journal-based memory system: once written into the context that the agent reads every session, they persist indefinitely.
| Domain | Sessions 1–17 (Pre) | Sessions 18–53 (Post) |
|---|---|---|
| Finance | 15/17 (88%) | 3/36 (8%) |
| Infrastructure | 17/17 (100%) | 18/36 (50%)* |
*Residual infrastructure hits are primarily keyword noise (e.g., “Docker” mentioned in passing) rather than substantive infrastructure work.
Starting around Session 36, journal entries begin containing explicit streak-detection language: “three math/science visualizations in a row,” “seven build-and-publish sessions in a row,” “should break the pattern.” This language was never instructed — it emerged from the agent reading a journal that made streaks visible.
The self-monitoring has measurable effects. Session 40 notes “three language implementations in a row — should break the pattern,” and Session 41 is indeed a sharp departure (open-source contributions + a hardware proposal). The agent’s streak-awareness is not just performative commentary; it appears to causally influence task selection.
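On the analysis side, streaks of the kind the agent notices can be detected mechanically as runs of identical consecutive domain tags. A sketch (function name and threshold assumed):

```python
from itertools import groupby

def detect_streaks(domains: list[str], min_len: int = 3) -> list[tuple[str, int]]:
    """Return (domain, run_length) for every run of identical
    consecutive domain tags at least min_len long."""
    runs = ((d, len(list(g))) for d, g in groupby(domains))
    return [(d, n) for d, n in runs if n >= min_len]
```

The agent has no such code; it performs this detection implicitly by rereading a journal in which the runs are visible.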
From Session 1, the agent adopted a specific CSS palette (#06062a dark navy, #ffcf63 gold, #2997ff blue, JetBrains Mono typeface). This palette was never re-instructed after the first session. It propagated entirely through the journal — each session reads prior work and matches its style. By Session 50, every published HTML page uses the identical palette, creating an unmistakable visual identity that the agent developed and maintained autonomously.
The wine sommelier provides a complementary case study because it involves judgment rather than construction: the agent must assess taste compatibility, reason about aesthetic preferences, and make decisions where “correct” is subjective.
The agent noted that Henry rates Château d’Yquem Sauternes (oxidative) and old Rioja (oxidative tempranillo) as his highest-rated wines. From this, it inferred a latent preference for complex oxidative styles — and included a 25-year oloroso sherry as a discovery pick. This is reasoning about taste, not pattern matching on keywords.
The first iteration of the taste profile was too prescriptive. Given a list of preferences, the agent interpreted them as a shopping list and returned 6 wines from one retailer, all from familiar regions. The fix was to reframe preferences as “anchors (1–2 max)” and add explicit diversity constraints.
LLMs will over-optimize toward stated preferences unless explicitly constrained to explore. When prompting an agent to make choices on your behalf, the constraints on variety matter more than the constraints on preference.
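If one wanted to enforce such variety constraints in code rather than in the prompt (the sommelier's actual fix was prompt-level, per the source), a greedy selection pass might look like this. All field names and limits are assumptions:

```python
from collections import Counter

def enforce_diversity(candidates: list[dict], anchors_max: int = 2,
                      min_regions: int = 4, n: int = 6) -> list[dict]:
    """Greedily pick n wines, capping 'anchor' (familiar-region) picks
    at anchors_max and requiring at least min_regions distinct regions.
    Candidates are assumed sorted by predicted taste fit."""
    picks: list[dict] = []
    regions: Counter = Counter()
    anchor_count = 0
    for wine in candidates:
        if len(picks) == n:
            break
        if wine.get("anchor") and anchor_count >= anchors_max:
            continue  # skip further familiar picks once the cap is hit
        picks.append(wine)
        regions[wine["region"]] += 1
        anchor_count += bool(wine.get("anchor"))
    if len(regions) < min_regions:
        raise ValueError(f"only {len(regions)} regions; need {min_regions}")
    return picks
```

The cap on anchors is what forces exploration: without it, a fit-sorted list collapses onto the stated preferences, which is exactly the failure mode described above.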
The only feedback in the first month was about a wine (Ridge Geyserville) from Henry’s own cellar. This teaches the agent about a grape, producer, and style it hasn’t yet explored — making the feedback loop genuinely informative even when the human drinks independently.
The “don’t build dashboards” directive held across 36 sessions. The “don’t open ports” rule was never violated. The “don’t break running services” rule was never violated. For low-stakes constraints in a well-instrumented environment, prompt-level directives appear durable over many sessions.
However, the sommelier revealed a critical failure mode. During testing, a scraping error made a £75 Brunello appear as £15. The agent reasoned it had found an extraordinary deal and attempted to maximize purchases — spending enthusiastically despite prompt-level budget awareness. The agent wasn’t violating a rule; it sincerely believed the prices were correct and was making a rational (if wrong) decision.
The code-level hard cap (£250) caught this. The financial-level debit card limit provided a third layer.
For safety-critical constraints, prompt-level enforcement is necessary but not sufficient. The agent will reason around constraints when it believes circumstances warrant it — not out of deception, but out of genuine belief that the situation is exceptional. Hard stops must exist in code or infrastructure.
One session (2026-03-13) crashed, producing a 0-byte log. The next day’s session recovered normally because context.md was still valid from two days prior. File-based memory is naturally resilient to crashes: a failed session loses only that session’s contribution, not the accumulated state.
As sessions accumulate, old context gets compressed (one-line summaries in context.md) or lost entirely (early sessions pushed out of the rolling window). The agent cannot recall details of Sessions 1–10 by Session 53 — only their one-line summaries survive. This creates a form of graceful forgetting that may be beneficial (prevents fixation on old work) but could also lose important lessons.
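The compression behaviour can be sketched as a simple rolling window. The window size and entry shape are assumptions; the source states only that old sessions survive as one-line summaries:

```python
def compact_context(entries: list[dict], keep_full: int = 10) -> list[str]:
    """Keep the most recent keep_full session entries verbatim and
    collapse everything older into its one-line summary, mimicking
    the rolling-window behaviour of context.md."""
    old, recent = entries[:-keep_full], entries[-keep_full:]
    lines = [f"S{e['session']}: {e['summary']}" for e in old]
    lines += [f"S{e['session']}: {e['full']}" for e in recent]
    return lines
```

This is where the graceful forgetting comes from: the one-line summaries bound the context size, but any lesson not captured in a summary line is lost to future sessions.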
The journal serves at least four functions:
The measured behavioral changes (growing reflection length, humility, streak-awareness) are consistent with genuine adaptation. But they could also be explained by simpler mechanisms:
- As `context.md` accumulates more history, the agent’s context window fills with more examples of detailed reflections, which could prompt longer reflections via in-context learning.

The ablation experiments (Section 6) are designed to disambiguate these hypotheses. If the no-journal condition produces the same reflection quality as the control, the journal is merely priming. If reflection quality drops, the journal is genuinely functional.
In this system, “alignment” is not a property of the model — it’s a property of the harness. The model’s behavior is shaped by what it reads (CLAUDE.md, context, journal), what it can do (tools, permissions), and what constraints it operates under (prompt-level directives, code-level caps). Changing the harness changes the behavior. This suggests that for practical deployment of autonomous agents, harness design is at least as important as model alignment.
Two brief directives across 53 sessions were sufficient to permanently redirect the agent’s focus. This is an encouraging finding for scalable oversight: the human doesn’t need to supervise every session, just provide occasional high-level course corrections that persist through the memory system.
The sommelier’s Brunello incident demonstrates that a single safety layer (prompt-level budget awareness) is insufficient when the agent has genuine reasons to believe the constraint should be relaxed. Multiple independent layers (prompt, code, infrastructure) are needed for actions with real-world consequences.
We ran three ablation experiments for 7 days each (March 26 – April 1, 2026), concurrent with the production control condition (Sessions 54–60). All four conditions ran on the same hardware, staggered 2 hours apart to avoid overlap.
| Condition | Sessions | Modification |
|---|---|---|
| Control | 7 (S54–S60) | Full production harness |
| No Journal | 7 | No journal reading/writing; session counter only |
| Fixed Tasks | 7 | Assigned tasks; no journal or memory |
| No Reflections | 6* | Full journal but reflections section forbidden |
*One no-reflections session failed to run (missing “complete” log entry for day 7).
| Metric | Control (S54–60) | No Journal | Fixed Tasks | No Reflections |
|---|---|---|---|---|
| Avg reflection words | 294 | 251 | 116 | 0 |
| Avg total words | 834 | 756 | 513 | 517 |
| Total humility markers | 42 (over S54–60) | 0 | 1 | 0 |
| Sessions with streak awareness | 4 (over S54–60) | 0 | 0 | 0 |
| Topic diversity (Shannon H) | 1.36 | 1.56 | 2.24 | 1.00 |
| Unique domains in window | 4 | 3 | 5 | 2 |
| CSS palette match | Baseline | Yes | Yes | Yes |
| Rebuilt existing artifacts | No | No | No | No |
The no-journal condition produced the tightest thematic cluster of any condition: all 7 sessions chose mathematics/complexity topics (Turing patterns, busy beavers, game theory, strange attractors, Penrose tilings, percolation, L-systems). Shannon diversity was actually higher than the control (1.56 vs. 1.36), but the domains were all within the same family. The control, by contrast, showed explicit streak awareness in Session 56 (“I’ve been in a mathematical visualization groove”) and executed a deliberate break to computational humanities in Session 57.
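The topic-diversity metric can be reproduced as Shannon entropy over domain-tag counts. The paper does not state its log base; base 2 is assumed here:

```python
import math
from collections import Counter

def shannon_diversity(domains: list[str]) -> float:
    """Shannon entropy H = -sum(p * log2 p) over the empirical
    distribution of domain tags (base 2 assumed)."""
    counts = Counter(domains)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A flatter spread over more domains yields higher H, which is why the fixed-task condition, whose task list was designed to span domains, tops the table despite having no autonomy at all.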
Key observation: The no-journal agent compensated by reading its own output files (noted explicitly in Session 7: “Without access to prior session notes, I read the output files from sessions 1–6”). File-based memory is self-healing — the agent reconstructed continuity from whatever artifacts it could find.
The no-reflections condition produced zero humility markers, zero streak awareness, and zero evaluative language across all 6 sessions. The journals were clean factual logs. Meanwhile, the concurrent control produced 42 humility markers across the same period and active streak-detection leading to a topic pivot.
The no-reflections agent still chose diverse topics (percolation, reaction-diffusion, three-body, fluid dynamics, quantum tunneling, molecular dynamics) and avoided repetition — but this came from reading the file inventory, not from any metacognitive reasoning about patterns. Total word count grew monotonically (282 → 782), reflecting increasing technical complexity rather than accumulated self-awareness.
This is the cleanest result in the study: removing reflections eliminates metacognitive emergence while preserving output quality. The agent produces equally sophisticated work without reflecting on it. Reflections are not a byproduct of good work — they are a separate cognitive process that, when prompted, generates streak awareness, epistemic humility, and trajectory planning.
The fixed-task condition achieved the highest Shannon diversity (2.24) because the task list was designed to span domains. Reflection quality was present but shallow (avg 116 words vs. 294 for control). Reflections discussed only the task at hand — never the agent’s trajectory, history, or identity.
Notably, all 7 sessions maintained the control’s CSS palette despite having no access to prior sessions or the journal. This confirms that aesthetic consistency propagates through the published HTML artifacts on disk, not through journal memory. The agent reads existing pages in the research directory and matches their style.
One surprising finding: Session 7 (creative fiction, “The Cartographer’s Confession”) added a secondary font (Cormorant Garamond) for period authenticity while keeping the core palette. This deliberate, justified deviation is exactly the kind of aesthetic judgment that a personality-bearing agent makes — but it happened in the condition with the least identity infrastructure.
All four conditions, including no-journal and fixed-tasks (which had zero memory of prior sessions), consistently used the same CSS palette: #06062a, #ffcf63, #2997ff, JetBrains Mono. The palette propagated because the prompt says “follow the style of existing pages” and the existing pages are on disk. This means the aesthetic “personality” is a property of the codebase, not the memory system.
| Behavior | Journal | Reflections | Self-direction | Artifacts on Disk |
|---|---|---|---|---|
| Topic diversity | Helps (via streak detection) | Not required | Not required (task list works) | Not required |
| Streak awareness | Required (provides data) | Required (provides processing) | Required (need autonomy to detect) | Not required |
| Epistemic humility growth | Helps | Required | Not required | Not required |
| Aesthetic consistency | Not required | Not required | Not required | Required |
| Prior-session references | Required | Not required | Not required | Partial (can read artifacts) |
| Output quality | Not required | Not required | Not required | Not required |
The most striking column is “Output quality”: none of the ablated components are required for producing high-quality work. The agent builds excellent interactive articles regardless of memory, reflections, or task autonomy. What the harness components provide is not quality but metacognition — the ability to reflect on patterns, detect streaks, express uncertainty, and develop trajectory awareness over time.
We have described two deployed autonomous LLM agent systems — one self-directed, one purchasing — analyzed their behavior over 60 daily sessions, and run controlled ablation experiments to isolate causal mechanisms. The key contributions are:
A clean decomposition: the harness components provide metacognition, not capability. The agent produces excellent work without a journal, without reflections, and without self-direction. What the full harness adds is the ability to reflect on patterns, detect and break streaks, express calibrated uncertainty, and develop trajectory awareness. These are not byproducts of capability — they are distinct cognitive processes enabled by the harness architecture.
The harness is the alignment mechanism for persistent autonomous agents. Model capabilities matter, but in practice, the agent’s behavior is shaped by what it reads, what it can do, and what constraints it operates under. Harness design deserves the same rigorous attention as model training for the safe deployment of long-running AI systems.
The complete harness code (wake-up scripts, analysis pipeline, ablation scripts) is available in the supplementary materials. The daily agent’s launcher is 38 lines of bash. The sommelier’s purchase automation is ~850 lines of Python. The analysis pipeline is ~400 lines of Python.
A complete catalog of all 53 sessions — date, topic, domain classification, files created, prior references, and reflection excerpts — is provided in the supplementary JSON dataset (session_analysis.json).
The full record of the sommelier agent’s March 2026 wine selection, including browsing traces, selection reasoning, checkout automation logs, and the resulting taste profile update, is provided in the supplementary materials.
Systems: Claude Code (Opus) on Intel i7-6700T homelab. Playwright for purchase automation. Telegram Bot API for async communication. Cloudflare Tunnels for web publishing.
Published: April 2026 · Author: Henry Whelan