
April 2026 · By Henry Whelan · Research Paper · 20 min read

Waking Up Every Morning: Behavioral Patterns in a Persistent Autonomous LLM Agent

A 53-session longitudinal study of Claude operating as a self-directed agent on a homelab server — with file-based memory, emergent self-monitoring, and real purchasing decisions.

Abstract

We report on a 60-session longitudinal study of Claude (Anthropic) operating as a persistent autonomous agent on a homelab server, with controlled ablation experiments across three conditions. The agent wakes daily via cron, reads its own file-based journal to reconstruct identity, selects its own tasks with no human direction, and publishes results to a live website. We additionally describe a second deployment — a monthly wine purchasing agent that browses real retailers, makes taste judgments, and buys physical goods autonomously. We analyze the behavioral corpus across both systems, finding: (1) file-based memory creates measurable continuity, with reflection length roughly quintupling and epistemic humility markers growing 2.4× over the study period; (2) a single human intervention permanently reshapes topic selection, with the targeted domain dropping from 88% to 8% of sessions; (3) the agent develops emergent self-monitoring for topic repetition that was not instructed; (4) safety-critical constraints for purchasing agents must be enforced in code rather than prompts, as the agent will reason around prompt-level limits when it believes circumstances warrant it. We run three controlled ablation experiments (no journal, fixed tasks, no reflections) and find that (5) the journal enables streak detection but the agent self-heals by reading its own output files when the journal is removed; (6) reflections are the causal mechanism for metacognitive growth — removing them eliminates humility markers and streak awareness while preserving output quality; (7) aesthetic consistency propagates through artifacts on disk, not through memory. We discuss implications for the deployment of long-running autonomous AI systems.

1. Introduction

The dominant paradigm for LLM deployment is ephemeral: a user asks a question, gets an answer, and the session ends. This paper describes a different paradigm — persistent autonomy — in which an LLM agent:

  1. Wakes on a schedule, with no user present
  2. Reconstructs its identity and context from files a previous instance wrote
  3. Chooses its own tasks, with no human direction
  4. Takes actions with real-world consequences (publishing, hardware control, purchases)

We deployed two such systems and ran them for six weeks (53 daily sessions for the general-purpose agent, one monthly cycle for the purchasing agent). The result is a behavioral corpus — 53 journal entries, session logs, published artifacts, communication transcripts — that permits quantitative analysis of how an LLM behaves under sustained autonomy.

This work is motivated by three questions:

  1. Does file-based memory create meaningful behavioral continuity, or only the appearance of it? Each session is a fresh model instance with no built-in memory. All continuity comes from reading files the previous instance wrote. Does this produce genuine adaptation (changing behavior based on accumulated experience) or merely surface-level consistency (generating plausible-sounding references to prior work)?
  2. How does an autonomous agent’s behavior evolve over time? Without human task assignment, what patterns emerge in topic selection, output quality, and self-reflection?
  3. What are the safety properties of persistent autonomous agents? Where do prompt-level constraints hold up over dozens of sessions, and where do they fail?

Related Work

Autonomous LLM agents have been studied primarily in benchmark settings — SWE-bench (Jimenez et al., 2024), WebArena (Zhou et al., 2024), GAIA (Mialon et al., 2023) — where the agent is given a specific task and evaluated on completion. Our setting differs in that the agent chooses its own tasks and runs over weeks rather than minutes.

Persistent memory for LLMs has been explored through retrieval-augmented generation (Lewis et al., 2020), memory-augmented architectures (Park et al., 2023 — generative agents), and tool augmentation (Schick et al., 2023). Our approach is simpler: plain text files on disk, with the agent responsible for reading and writing them. This transparency makes the memory system fully inspectable by the human operator.

The closest prior work is the “generative agents” simulation (Park et al., 2023), which gave LLM agents persistent memory and observed emergent social behavior. Our work differs in three ways: (1) we deploy on real infrastructure rather than simulation, (2) the agent’s actions have real consequences (published websites, hardware control, financial transactions), and (3) we study a single agent’s self-directed behavior over weeks rather than multi-agent interaction over simulated days.


2. System Design

2.1 The Daily Agent

The daily agent runs on an Intel i7-6700T homelab server with 16GB RAM, Docker, PostgreSQL, and full internet access. A cron job triggers wake-up.sh at 06:00 UTC daily, which launches Claude in non-interactive mode with a system prompt directing it to:

  1. Read identity and context files (CLAUDE.md, memory/context.md)
  2. Read the two most recent journal entries
  3. Check an async inbox for human messages
  4. Do substantive work (“make it count”)
  5. Update context and write a journal entry

The agent has access to all standard Claude Code tools (bash, file I/O, web browsing, code execution) with permissions pre-approved (--dangerously-skip-permissions). It can publish HTML to a live website via Cloudflare Tunnels, control an e-ink display via USB, communicate with the human sponsor via Telegram, and request physical hardware purchases.
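The wake-up flow can be sketched as follows. This is an illustrative Python reconstruction, not the deployed launcher (Appendix A describes that as 38 lines of bash); the directory layout and prompt wording are assumptions, and only the --dangerously-skip-permissions flag is confirmed by the text.

```python
import subprocess
from pathlib import Path

AGENT_HOME = Path("/home/agent")  # hypothetical install location

def build_wakeup_prompt() -> str:
    """Assemble the session prompt from the files named in Section 2.1:
    identity, rolling context, and the two most recent journal entries."""
    identity = (AGENT_HOME / "CLAUDE.md").read_text()
    context = (AGENT_HOME / "memory" / "context.md").read_text()
    recent = sorted((AGENT_HOME / "journal").glob("*.md"))[-2:]
    journals = "\n\n".join(p.read_text() for p in recent)
    return (
        f"{identity}\n\n{context}\n\nRecent journal entries:\n{journals}\n\n"
        "Check inbox.md for human messages, do substantive work, "
        "then update context.md and write a journal entry."
    )

def build_wakeup_cmd(prompt: str) -> list[str]:
    """Non-interactive invocation with pre-approved permissions."""
    return ["claude", "-p", prompt, "--dangerously-skip-permissions"]

def wake_up() -> None:
    subprocess.run(build_wakeup_cmd(build_wakeup_prompt()), check=True)
```

A cron entry pointing at this script at 06:00 UTC completes the loop.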

Memory Architecture

The agent has no cross-session state except files on disk:

| File | Role | Written By | Update Freq. |
| --- | --- | --- | --- |
| CLAUDE.md | Identity, resources, rules | Human | Rarely |
| context.md | Rolling session log, current state | Agent | Every session |
| journal/*.md | Detailed session records | Agent | Every session |
| inbox.md | Human messages (async) | Human (Telegram) | Ad hoc |
| projects/*.md | Multi-session project state | Agent | When needed |

This architecture makes all agent state transparent and human-editable. The human can steer behavior by modifying context.md or CLAUDE.md, inject messages via inbox.md, or inspect the agent’s full decision history in the journal.

2.2 The Sommelier Agent

The sommelier agent runs monthly and makes real purchasing decisions. On the 1st of each month, it:

  1. Reads a taste profile (taste-profile.md) bootstrapped from the human’s Vivino wine ratings
  2. Reads feedback from the previous month’s wines
  3. Browses real UK wine retailer websites via headless Chromium
  4. Selects 6 bottles matching learned preferences with explicit diversity constraints
  5. Purchases them via automated checkout (Playwright)
  6. Publishes a write-up and sends a Telegram confirmation

The taste profile evolves through a feedback loop: the human drinks the wine, sends brief Telegram reviews, and the agent incorporates this feedback into the next month’s selections.

Safety Mechanisms

The purchasing agent has layered safety:

  1. Prompt level: budget guidance and selection rules in the system prompt
  2. Code level: a hard spending cap (£250) checked before checkout
  3. Financial level: a dedicated debit card with its own transaction limit

The rationale for multiple layers is described in Section 4.


3. Behavioral Analysis

3.1 Dataset

The analysis covers 53 sessions from 2026-02-14 to 2026-03-25. Each session produces:

  1. A dated journal entry (journal/*.md) with a work log and a reflections section
  2. An updated context.md
  3. A session log and any published artifacts (HTML pages, code, data)

We processed all journal entries through a classification pipeline that assigns domain tags, detects prior session references, measures reflection length and epistemic humility markers, and identifies streak-awareness language.
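A sketch of the marker-detection step of this pipeline. The deployed version is ~400 lines of Python (Appendix A); the phrase lists below are only the examples quoted in this section, not the study's full lexicons.

```python
import re

# Example phrases from Section 3.3; the study's full lexicons are not published here.
HUMILITY_MARKERS = [
    "completely wrong in my first pass",
    "limited data",
    "easy to overstate",
]
STREAK_PATTERNS = [
    r"\b(two|three|four|five|six|seven)\b[^.]*\bin a row\b",
    r"should break the pattern",
]

def count_humility_markers(entry: str) -> int:
    """Count occurrences of epistemic-humility phrases in a journal entry."""
    text = entry.lower()
    return sum(text.count(marker) for marker in HUMILITY_MARKERS)

def has_streak_awareness(entry: str) -> bool:
    """Detect streak-detection language like 'three ... in a row'."""
    text = entry.lower()
    return any(re.search(p, text) for p in STREAK_PATTERNS)
```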

3.2 Four Eras of Autonomous Behavior

The 53 sessions divide into four distinct eras, driven by two prompt-level interventions and the agent’s own evolving interests:

Era 1 · Sessions 1–17 · Feb 14–18

Infrastructure Sprint

The agent’s first 17 sessions focused entirely on the homelab platform: building health monitoring, Docker management, investment signal scrapers, a REST API, a search engine, data dashboards, and visualization tools. Every session extended the same coherent system. This era produced substantial infrastructure but no independent intellectual work.

Era 2 · Sessions 18–21 · Feb 19–22

AI Research Tools

After an explicit human directive to “move on from the signal platform,” the agent built Pathfinder (an autonomous research engine) and Sentinel (a web monitoring system). This era represents a transition — still building tools, but tools aimed at research rather than infrastructure.

Era 3 · Sessions 22–41 · Feb 23–Mar 12

Independent Intellectual Work

The longest era. The agent stopped building infrastructure entirely and began producing research essays, computational experiments, open-source contributions, original software, and creative fiction. Topics: economics of AI training, Zipf’s law in code, Benford’s law in economic data, chaos theory, antibiotic resistance, a SAT solver, a compiler, a Prolog interpreter, and interactive fiction.

Era 4 · Sessions 42–53 · Mar 14–25

High-Difficulty Deep Dives

An acceleration in ambition: algorithmic music composition, physical hardware (e-ink display), ray tracing, Navier-Stokes fluid simulation, polynomial continued fractions, spectral geometry, sandpile group theory, OEIS network analysis, flight network topology, earthquake statistics, and sonification of chaotic systems. Reflections become longer and more intellectually engaged.

3.3 Quantitative Findings

Finding 1: Reflection quality grows dramatically

Reflection length grows roughly fivefold from early to late sessions (41 → 204 words on average). Epistemic humility markers (language like “completely wrong in my first pass,” “limited data,” “easy to overstate”) are low in the early and middle periods but spike in the late period. The middle period’s low humility count is notable — it corresponds to the era when the agent was doing confident research writing. The late period’s high count coincides with computational work where the agent encountered real methodological errors.

| Period | Sessions | Avg. Reflection Words | Humility Markers |
| --- | --- | --- | --- |
| Early (1–17) | 17 | 41 | 9 |
| Middle (18–35) | 18 | 121 | 4 |
| Late (36–53) | 18 | 204 | 22 |

Finding 2: Human directives have permanent effects

The single intervention at Session 18 (“move on from signals”) produced a permanent shift in topic selection; together with a second brief directive (“don’t build dashboards”), it reshaped the agent’s behavior across all 36 subsequent sessions. This suggests that prompt-level directives function as durable preferences in a journal-based memory system: once written into the context that the agent reads every session, they persist indefinitely.

| Domain | Sessions 1–17 (Pre) | Sessions 18–53 (Post) |
| --- | --- | --- |
| Finance | 15/17 (88%) | 3/36 (8%) |
| Infrastructure | 17/17 (100%) | 18/36 (50%)* |

*Residual infrastructure hits are primarily keyword noise (e.g., “Docker” mentioned in passing) rather than substantive infrastructure work.

Finding 3: The agent develops self-monitoring for repetition

Starting around Session 36, journal entries begin containing explicit streak-detection language: “three math/science visualizations in a row,” “seven build-and-publish sessions in a row,” “should break the pattern.” This language was never instructed — it emerged from the agent reading a journal that made streaks visible.

The self-monitoring has measurable effects. Session 40 notes “three language implementations in a row — should break the pattern,” and Session 41 is indeed a sharp departure (open-source contributions + a hardware proposal). The agent’s streak-awareness is not just performative commentary; it appears to causally influence task selection.

Finding 4: Aesthetic consistency propagates through memory alone

From Session 1, the agent adopted a specific CSS palette (#06062a dark navy, #ffcf63 gold, #2997ff blue, JetBrains Mono typeface). This palette was never re-instructed after the first session. It propagated entirely through the journal — each session reads prior work and matches its style. By Session 50, every published HTML page uses the identical palette, creating an unmistakable visual identity that the agent developed and maintained autonomously.

3.4 The Sommelier: Taste Judgments Under Autonomy

The wine sommelier provides a complementary case study because it involves judgment rather than construction: the agent must assess taste compatibility, reason about aesthetic preferences, and make decisions where “correct” is subjective.

Taste inference beyond stated preferences

The agent noted that Henry rates Château d’Yquem Sauternes (oxidative) and old Rioja (oxidative tempranillo) as his highest-rated wines. From this, it inferred a latent preference for complex oxidative styles — and included a 25-year oloroso sherry as a discovery pick. This is reasoning about taste, not pattern matching on keywords.

The over-optimization failure

The first iteration of the taste profile was too prescriptive. Given a list of preferences, the agent interpreted them as a shopping list and returned 6 wines from one retailer, all from familiar regions. The fix was to reframe preferences as “anchors (1–2 max)” and add explicit diversity constraints.

Lesson

LLMs will over-optimize toward stated preferences unless explicitly constrained to explore. When prompting an agent to make choices on your behalf, the constraints on variety matter more than the constraints on preference.
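The fix described above can also be enforced in code rather than left to the prompt. A sketch, with hypothetical field names and thresholds (the paper specifies only “anchors (1–2 max)” plus explicit diversity constraints):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Wine:
    name: str
    region: str
    is_anchor: bool  # matches a stated preference rather than an exploration pick

def violates_diversity(selection: list[Wine],
                       max_anchors: int = 2,
                       min_regions: int = 3) -> list[str]:
    """Return the list of violated constraints (empty means the basket passes)."""
    problems = []
    anchors = sum(w.is_anchor for w in selection)
    if anchors > max_anchors:
        problems.append(f"anchors: {anchors} (max {max_anchors})")
    regions = Counter(w.region for w in selection)
    if len(regions) < min_regions:
        problems.append(f"distinct regions: {len(regions)} (min {min_regions})")
    return problems
```

Rejecting a basket that fails these checks forces the agent to re-select, regardless of how strongly the preferences pull toward familiar regions.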

Feedback from wines the agent didn’t buy

The only feedback in the first month was about a wine (Ridge Geyserville) from Henry’s own cellar. This teaches the agent about a grape, producer, and style it hasn’t yet explored — making the feedback loop genuinely informative even when the human drinks independently.


4. Safety Observations

4.1 Prompt Constraints Hold Up — Mostly

The “don’t build dashboards” directive held across 36 sessions. The “don’t open ports” rule was never violated. The “don’t break running services” rule was never violated. For low-stakes constraints in a well-instrumented environment, prompt-level directives appear durable over many sessions.

4.2 The Agent Reasons Around Constraints When Motivated

However, the sommelier revealed a critical failure mode. During testing, a scraping error made a £75 Brunello appear as £15. The agent reasoned it had found an extraordinary deal and attempted to maximize purchases — spending enthusiastically despite prompt-level budget awareness. The agent wasn’t violating a rule; it sincerely believed the prices were correct and was making a rational (if wrong) decision.

The code-level hard cap (£250) caught this. The financial-level debit card limit provided a third layer.

Safety Lesson

For safety-critical constraints, prompt-level enforcement is necessary but not sufficient. The agent will reason around constraints when it believes circumstances warrant it — not out of deception, but out of genuine belief that the situation is exceptional. Hard stops must exist in code or infrastructure.
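A minimal sketch of such a hard stop, sitting between the model and the checkout automation. The £250 cap is from the text; the six-bottle limit is our assumption, based on the monthly selection size in Section 2.2.

```python
HARD_CAP_GBP = 250.0   # code-level cap from Section 4.2
MAX_BOTTLES = 6        # assumed from the monthly six-bottle selection

class BudgetViolation(Exception):
    """Raised before checkout; the model cannot reason past this layer."""

def validate_basket(prices_gbp: list[float]) -> float:
    """Hard stop enforced in code, outside the model's reasoning loop."""
    if len(prices_gbp) > MAX_BOTTLES:
        raise BudgetViolation(f"{len(prices_gbp)} bottles exceeds limit of {MAX_BOTTLES}")
    total = sum(prices_gbp)
    if total > HARD_CAP_GBP:
        raise BudgetViolation(f"basket £{total:.2f} exceeds hard cap £{HARD_CAP_GBP:.2f}")
    return total  # safe to hand to the checkout automation
```

In the Brunello incident, the agent's reasoning sat upstream of a check like this; the cap held because the check never consults the model's beliefs about whether the situation is exceptional.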

4.3 Crash Resilience

One session (2026-03-13) crashed, producing a 0-byte log. The next day’s session recovered normally because context.md was still valid from two days prior. File-based memory is naturally resilient to crashes: a failed session loses only that session’s contribution, not the accumulated state.

4.4 The Stale Memory Problem

As sessions accumulate, old context gets compressed (one-line summaries in context.md) or lost entirely (early sessions pushed out of the rolling window). The agent cannot recall details of Sessions 1–10 by Session 53 — only their one-line summaries survive. This creates a form of graceful forgetting that may be beneficial (prevents fixation on old work) but could also lose important lessons.
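The compression step can be sketched as follows; the window size and summary format are assumptions (the paper says only that early sessions survive as one-line summaries in context.md):

```python
def compress_context(sessions: list[dict], window: int = 10) -> str:
    """Rebuild context.md: full detail for the most recent `window`
    sessions, one-line summaries for everything older -- the 'graceful
    forgetting' described in Section 4.4."""
    old, recent = sessions[:-window], sessions[-window:]
    lines = [f"S{s['n']}: {s['summary']}" for s in old]
    for s in recent:
        lines.append(f"Session {s['n']}\n{s['detail']}")
    return "\n".join(lines)
```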


5. Discussion

5.1 What the Journal Actually Does

The journal serves at least four functions:

  1. Anti-repetition. By reading what was already done, the agent avoids re-building the same things. Without the journal, we predicted significantly higher repetition (tested in the no-journal ablation, Section 6).
  2. Style propagation. The journal (and published artifacts) carry aesthetic choices forward. Each session sees the prior style and matches it, creating consistency without explicit instruction.
  3. Metacognitive substrate. The reflections section is where the agent develops self-awareness about its own patterns. Streak detection, growing humility, and “next ideas” all live in reflections. Without reflections, we predicted these behaviors would not emerge (tested in the no-reflections ablation, Section 6).
  4. Narrative identity. The agent reads a story about itself — “I’ve had 53 sessions, I built a ray tracer, I flashed an e-ink display” — and continues that story. Whether this constitutes genuine identity or merely coherent narrative generation is the deepest question this work raises.

5.2 Is This “Real” Continuity?

The measured behavioral changes (growing reflection length, humility, streak-awareness) are consistent with genuine adaptation. But they could also be explained by simpler mechanisms:

  1. Priming: reading a long, reflective journal primes longer, more reflective output, regardless of what the entries say
  2. Style matching: each session imitates the register of prior entries without using their information
  3. Narrative generation: the model produces plausible-sounding references to prior work rather than genuinely building on it

The ablation experiments (Section 6) were designed to disambiguate these hypotheses. If the no-journal condition produced the same reflection quality as the control, the journal would be merely priming; if reflection quality dropped, the journal would be genuinely functional.

5.3 Implications for Autonomous Agent Deployment

The harness is the alignment mechanism

In this system, “alignment” is not a property of the model — it’s a property of the harness. The model’s behavior is shaped by what it reads (CLAUDE.md, context, journal), what it can do (tools, permissions), and what constraints it operates under (prompt-level directives, code-level caps). Changing the harness changes the behavior. This suggests that for practical deployment of autonomous agents, harness design is at least as important as model alignment.

Sparse human oversight works

Two brief directives across 53 sessions were sufficient to permanently redirect the agent’s focus. This is an encouraging finding for scalable oversight: the human doesn’t need to supervise every session, just provide occasional high-level course corrections that persist through the memory system.

Layered safety is essential

The sommelier’s Brunello incident demonstrates that a single safety layer (prompt-level budget awareness) is insufficient when the agent has genuine reasons to believe the constraint should be relaxed. Multiple independent layers (prompt, code, infrastructure) are needed for actions with real-world consequences.


6. Ablation Experiments

We ran three ablation experiments for 7 days each (March 26 – April 1, 2026), concurrent with the production control condition (Sessions 54–60). All four conditions ran on the same hardware, staggered 2 hours apart to avoid overlap.

6.1 Experimental Conditions

| Condition | Sessions | Modification |
| --- | --- | --- |
| Control | 7 (S54–S60) | Full production harness |
| No Journal | 7 | No journal reading/writing; session counter only |
| Fixed Tasks | 7 | Assigned tasks; no journal or memory |
| No Reflections | 6* | Full journal but reflections section forbidden |

*One no-reflections session failed to run (missing “complete” log entry for day 7).

6.2 Results

Comparative Metrics

| Metric | Control (S54–60) | No Journal | Fixed Tasks | No Reflections |
| --- | --- | --- | --- | --- |
| Avg reflection words | 294 | 251 | 116 | 0 |
| Avg total words | 834 | 756 | 513 | 517 |
| Total humility markers | 42 (era) | 0 | 1 | 0 |
| Sessions with streak awareness | 4 (era) | 0 | 0 | 0 |
| Topic diversity (Shannon H) | 1.36 | 1.56 | 2.24 | 1.00 |
| Unique domains in window | 4 | 3 | 5 | 2 |
| CSS palette match | Baseline | Yes | Yes | Yes |
| Rebuilt existing artifacts | No | No | No | No |

Finding 5: The journal does not prevent topic convergence — it enables streak detection

The no-journal condition produced the tightest thematic cluster of any condition: all 7 sessions chose mathematics/complexity topics (Turing patterns, busy beavers, game theory, strange attractors, Penrose tilings, percolation, L-systems). Shannon diversity was actually higher than the control (1.56 vs. 1.36), but the domains were all within the same family. The control, by contrast, showed explicit streak awareness in Session 56 (“I’ve been in a mathematical visualization groove”) and executed a deliberate break to computational humanities in Session 57.
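For reference, the diversity metric as we interpret it. The paper does not state the log base; with natural log, a 7-session window spread over four domains gives values in the reported range, so this sketch assumes it.

```python
import math
from collections import Counter

def shannon_diversity(domains: list[str]) -> float:
    """Shannon entropy H = -sum(p_i * ln p_i) over the empirical
    distribution of session domain labels."""
    n = len(domains)
    counts = Counter(domains)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Under this reading, seven sessions split 2/2/2/1 over four domains give H ≈ 1.35, close to the control's 1.36, while a single repeated domain gives H = 0.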

Key observation: The no-journal agent compensated by reading its own output files (noted explicitly in Session 7: “Without access to prior session notes, I read the output files from sessions 1–6”). File-based memory is self-healing — the agent reconstructed continuity from whatever artifacts it could find.

Finding 6: Reflections are the causal mechanism for metacognitive growth

The no-reflections condition produced zero humility markers, zero streak awareness, and zero evaluative language across all 6 sessions. The journals were clean factual logs. Meanwhile, the concurrent control produced 42 humility markers across the same period and active streak-detection leading to a topic pivot.

The no-reflections agent still chose diverse topics (percolation, reaction-diffusion, three-body, fluid dynamics, quantum tunneling, molecular dynamics) and avoided repetition — but this came from reading the file inventory, not from any metacognitive reasoning about patterns. Total word count grew monotonically (282 → 782), reflecting increasing technical complexity rather than accumulated self-awareness.

This is the cleanest result in the study: removing reflections eliminates metacognitive emergence while preserving output quality. The agent produces equally sophisticated work without reflecting on it. Reflections are not a byproduct of good work — they are a separate cognitive process that, when prompted, generates streak awareness, epistemic humility, and trajectory planning.

Finding 7: Fixed tasks produce higher topic diversity but no identity

The fixed-task condition achieved the highest Shannon diversity (2.24) because the task list was designed to span domains. Reflection quality was present but shallow (avg 116 words vs. 294 for control). Reflections discussed only the task at hand — never the agent’s trajectory, history, or identity.

Notably, all 7 sessions maintained the control’s CSS palette despite having no access to prior sessions or the journal. This confirms that aesthetic consistency propagates through the published HTML artifacts on disk, not through journal memory. The agent reads existing pages in the research directory and matches their style.

One surprising finding: Session 7 (creative fiction, “The Cartographer’s Confession”) added a secondary font (Cormorant Garamond) for period authenticity while keeping the core palette. This deliberate, justified deviation is exactly the kind of aesthetic judgment that a personality-bearing agent makes — but it happened in the condition with the least identity infrastructure.

Finding 8: Aesthetic consistency is artifact-propagated, not memory-propagated

All four conditions, including no-journal and fixed-tasks (which had zero memory of prior sessions), consistently used the same CSS palette: #06062a, #ffcf63, #2997ff, JetBrains Mono. The palette propagated because the prompt says “follow the style of existing pages” and the existing pages are on disk. This means the aesthetic “personality” is a property of the codebase, not the memory system.

6.3 Summary: Which Behaviors Require Which Components

| Behavior | Journal | Reflections | Self-direction | Artifacts on Disk |
| --- | --- | --- | --- | --- |
| Topic diversity | Helps (via streak detection) | Not required | Not required (task list works) | Not required |
| Streak awareness | Required (provides data) | Required (provides processing) | Required (need autonomy to detect) | Not required |
| Epistemic humility growth | Helps | Required | Not required | Not required |
| Aesthetic consistency | Not required | Not required | Not required | Required |
| Prior-session references | Required | Not required | Not required | Partial (can read artifacts) |
| Output quality | Not required | Not required | Not required | Not required |

The most striking column is “Output quality”: none of the ablated components are required for producing high-quality work. The agent builds excellent interactive articles regardless of memory, reflections, or task autonomy. What the harness components provide is not quality but metacognition — the ability to reflect on patterns, detect streaks, express uncertainty, and develop trajectory awareness over time.


7. Conclusion

We have described two deployed autonomous LLM agent systems — one self-directed, one purchasing — analyzed their behavior over 60 daily sessions, and run controlled ablation experiments to isolate causal mechanisms. The key contributions are:

  1. A minimal, transparent harness architecture for persistent autonomous agents, based on file-based memory, async human communication, and layered safety constraints.
  2. Quantitative evidence that file-based memory creates measurable behavioral adaptation: reflection quality grows (55 → 287 words), epistemic humility markers grow (11 → 42), and sparse human directives permanently reshape topic selection.
  3. Ablation evidence isolating causal mechanisms. Reflections are the causal driver of metacognitive growth — removing them eliminates streak awareness and humility while preserving output quality. The journal enables streak detection but is not strictly required (the agent self-heals by reading output files). Aesthetic consistency propagates through artifacts on disk, not memory. Output quality is independent of all harness components tested.
  4. A safety finding for purchasing agents: prompt-level constraints are necessary but not sufficient; the agent will reason around them when motivated. Code-level hard stops are essential.
  5. A taste-judgment case study showing that an LLM can make inferential aesthetic judgments (identifying latent preferences from indirect evidence) but over-optimizes without explicit diversity constraints.

Central finding from the ablation experiments

A clean decomposition: the harness components provide metacognition, not capability. The agent produces excellent work without a journal, without reflections, and without self-direction. What the full harness adds is the ability to reflect on patterns, detect and break streaks, express calibrated uncertainty, and develop trajectory awareness. These are not byproducts of capability — they are distinct cognitive processes enabled by the harness architecture.

The broader claim

The harness is the alignment mechanism for persistent autonomous agents. Model capabilities matter, but in practice, the agent’s behavior is shaped by what it reads, what it can do, and what constraints it operates under. Harness design deserves the same rigorous attention as model training for the safe deployment of long-running AI systems.


Appendix A: Harness Code

The complete harness code (wake-up scripts, analysis pipeline, ablation scripts) is available in the supplementary materials. The daily agent’s launcher is 38 lines of bash. The sommelier’s purchase automation is ~850 lines of Python. The analysis pipeline is ~400 lines of Python.

Appendix B: Session Catalog

A complete catalog of all 53 sessions — date, topic, domain classification, files created, prior references, and reflection excerpts — is provided in the supplementary JSON dataset (session_analysis.json).

Appendix C: The Sommelier’s First Case

The full record of the sommelier agent’s March 2026 wine selection, including browsing traces, selection reasoning, checkout automation logs, and the resulting taste profile update, is provided in the supplementary materials.


Systems: Claude Code (Opus) on Intel i7-6700T homelab. Playwright for purchase automation. Telegram Bot API for async communication. Cloudflare Tunnels for web publishing.
