
Implementation Roadmap

Related: [[guiding-design-principles]], [[playbooks]], [[vault]], [[memory-plugin]], [[memory-scoping]], [[profiles]], [[self-improvement]], [[agent-coordination]]

This roadmap breaks the design specs into phased work with explicit dependencies, spec backlinks, and testing checkpoints. Phases are ordered so each builds on the last. Within each phase, tasks are grouped into workstreams that can run in parallel where dependencies allow.


Phase 0: Prerequisite Refactors ✅ COMPLETE

All 23 tasks completed and verified. 542 Phase 0 tests passing. 52 event schemas defined. EventBus filtering, git event emission, Supervisor config flexibility, and task records migration all landed on next-gen.

Source: [[playbooks#17. Prerequisite Refactors]]

0.1 EventBus Payload Filtering

Extend EventBus subscriptions to support dict-based payload filters. Required for cross-playbook composition ([[playbooks#10. Composability|playbook composability]]).

Spec: [[playbooks#17. Prerequisite Refactors]] — EventBus Payload Filtering
Existing code: src/event_bus.py ([[specs/event-bus]])

# Task Depends On
0.1.1 Add filter parameter to EventBus.subscribe() accepting dict[str, Any]
0.1.2 Implement filter matching logic (all conditions must match event payload fields) 0.1.1
0.1.3 Add tests: EventBus filtered subscriptions. Cases: (a) subscriber with filter {"project_id": "foo"} receives event with matching project_id, (b) same subscriber does NOT receive event with different project_id, (c) subscriber with multi-field filter only receives events matching ALL fields, (d) multiple filtered subscribers on same event type each receive only their matches, (e) mix of filtered and unfiltered subscribers — unfiltered gets all, filtered gets only matches, (f) filter on nested payload field works correctly, (g) filter with None/null value matches events where field is absent or null 0.1.2
0.1.4 Add tests: EventBus backward compatibility for unfiltered subscriptions. Cases: (a) subscriber registered without filter receives ALL events of that type (same as pre-filter behavior), (b) existing test suite passes without modification, (c) subscriber with filter=None behaves identically to subscriber with no filter arg, (d) subscriber with empty filter {} receives all events (no conditions to fail), (e) unfiltered subscriber still receives events that also match another subscriber's filter 0.1.2
0.1.5 Add tests: EventBus filter behavior on missing/extra payload fields. Cases: (a) event payload missing a field required by filter is NOT delivered to that subscriber, (b) event with extra fields beyond filter still matches (filter is subset check, not exact), (c) event with empty payload {} does not match any filter with required fields, (d) filter on field that is present but with wrong type (e.g., filter expects string, payload has int) does not match, (e) rapid emission of mixed matching/non-matching events delivers correct subset in order 0.1.2
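
A standalone sketch of the filter semantics these tasks target (subset match over payload fields, with None or an empty dict meaning unfiltered). The helper name is illustrative, not the actual src/event_bus.py change, and nested-field filters (case f in 0.1.3) would need a dotted-path extension:

```python
from typing import Any


def matches_filter(filter_: dict[str, Any] | None, payload: dict[str, Any]) -> bool:
    """Subset check: every filter field must be present in the payload and equal.

    A None or empty filter matches everything (the unfiltered-subscriber behavior
    guarded by 0.1.4), and a filter value of None matches payloads where the
    field is absent or null (case g in 0.1.3).
    """
    if not filter_:
        return True
    for key, expected in filter_.items():
        if expected is None:
            if payload.get(key) is not None:
                return False
        elif payload.get(key) != expected:
            return False
    return True


payload = {"type": "task.completed", "project_id": "foo", "agent_id": "a1"}
assert matches_filter({"project_id": "foo"}, payload)        # matching filter
assert not matches_filter({"project_id": "bar"}, payload)    # non-matching filter
assert matches_filter({}, payload)                           # empty filter == unfiltered
assert not matches_filter({"missing_field": "x"}, payload)   # missing required field
```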

Test checkpoint: Run full existing event-bus test suite + new filter tests. Verify zero regressions in hook engine and orchestrator event handling. Specific validations:
- All pre-existing test_event_bus.py tests pass without modification
- Hook engine still fires correctly for task.completed, task.failed, etc.
- Orchestrator event subscriptions (task lifecycle, agent status) still work
- New filter tests cover single-field, multi-field, missing-field, and mixed scenarios
- Performance: subscribing 100+ filtered subscribers does not degrade emit latency significantly
- No memory leaks from filter dict references held by EventBus

0.2 Event Schema Registry

Lightweight validation for event payloads. Prevents silent scope mismatches from missing project_id fields.

Spec: [[playbooks#17. Prerequisite Refactors]] — Event Schema Registry
Existing code: event types defined across src/orchestrator.py, src/event_bus.py, src/file_watcher.py, src/notifications/events.py

# Task Depends On
0.2.1 Define EVENT_SCHEMAS dict with required/optional fields per event type
0.2.2 Implement validate_event(event_type, payload) function 0.2.1
0.2.3 Wire validation into EventBus.emit() (warn in prod, error in dev) 0.2.2
0.2.4 Add schemas for all existing event types (task.*, note.*, file.*, plugin.*, config.*) 0.2.1
0.2.5 Add schemas for new event types (git.*, playbook.*, human.*, workflow.*) 0.2.1
0.2.6 Add tests: Event Schema Registry validation. Cases: (a) event with all required fields passes validation silently, (b) event missing a required field triggers warning in prod mode and error in dev mode, (c) event with extra fields beyond schema passes (forward compatibility), (d) event with wrong field type (e.g., project_id as int instead of str) triggers validation error, (e) unregistered event type passes through without validation (graceful degradation), (f) all existing event types (task.*, note.*, file.*, plugin.*, config.*) have schemas and current emissions pass them, (g) validation error message includes event type, field name, and expected type for debugging 0.2.3
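
A minimal sketch of the registry and validator shape for 0.2.1-0.2.3; the two schema entries and their field names are illustrative, not the full registry:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Two illustrative entries; the real registry (task 0.2.1) covers every event type.
EVENT_SCHEMAS: dict[str, dict[str, list[str]]] = {
    "task.completed": {"required": ["task_id", "project_id"], "optional": ["agent_id"]},
    "git.commit": {"required": ["commit_hash", "branch", "project_id"], "optional": ["message"]},
}


def validate_event(event_type: str, payload: dict[str, Any], *, dev_mode: bool = False) -> None:
    """Warn in prod, raise in dev, when required fields are missing (task 0.2.3)."""
    schema = EVENT_SCHEMAS.get(event_type)
    if schema is None:
        return  # unregistered event types pass through (graceful degradation)
    missing = [field for field in schema["required"] if field not in payload]
    if not missing:
        return
    message = f"event {event_type!r} missing required fields: {missing}"
    if dev_mode:
        raise ValueError(message)
    logger.warning(message)
```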

0.3 GitManager Event Emission

Add event emission to existing git operations. Enables code quality gates and commit-triggered playbooks.

Spec: [[playbooks#7. Event System]] — New Events Needed, [[playbooks#17. Prerequisite Refactors]] — GitManager Event Emission
Existing code: src/git/manager.py ([[specs/git]])

# Task Depends On
0.3.1 Define schemas for git.commit, git.push, git.pr.created events 0.2.1
0.3.2 Emit git.commit from GitManager.acommit_all() with commit_hash, branch, changed_files, message, project_id, agent_id
0.3.3 Emit git.push from GitManager.apush_branch() with branch, remote, commit_range, project_id
0.3.4 Emit git.pr.created from GitManager.create_pr() with pr_url, branch, title, project_id
0.3.5 Add tests: GitManager event emission. Cases: (a) acommit_all() emits git.commit with commit_hash, branch, changed_files list, message, project_id, and agent_id — verify each field present and correct, (b) apush_branch() emits git.push with branch, remote, commit_range, project_id — verify remote defaults to "origin", (c) create_pr() emits git.pr.created with pr_url, branch, title, project_id — verify URL is valid, (d) failed git operation (e.g., push to protected branch) does NOT emit event, (e) events pass the schemas defined in 0.3.1, (f) event payloads are captured by an EventBus subscriber (integration test with real EventBus), (g) concurrent git operations from different agents emit separate events with correct agent_id isolation 0.3.2, 0.3.3, 0.3.4
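
One possible shape for the emission added in 0.3.2. The payload fields come from the task list; the emit callable and helper name are placeholders for the real EventBus wiring inside GitManager:

```python
from typing import Any, Awaitable, Callable

EmitFn = Callable[[str, dict[str, Any]], Awaitable[None]]


async def emit_commit_event(
    emit: EmitFn,
    *,
    commit_hash: str,
    branch: str,
    changed_files: list[str],
    message: str,
    project_id: str,
    agent_id: str,
) -> None:
    """Emit git.commit with the payload fields listed in task 0.3.2.

    GitManager.acommit_all() would call this only after the underlying commit
    succeeds; failed operations emit nothing (case d in 0.3.5).
    """
    await emit("git.commit", {
        "commit_hash": commit_hash,
        "branch": branch,
        "changed_files": changed_files,
        "message": message,
        "project_id": project_id,
        "agent_id": agent_id,
    })
```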

Test checkpoint: Create a task, let an agent commit + push + create PR. Verify all three git events fire with correct payloads in the event log. Specific validations:
- git.commit event contains accurate commit_hash (matches git log output)
- git.commit changed_files list matches actual files in the commit diff
- git.push commit_range accurately reflects the pushed commits
- git.pr.created pr_url is a valid GitHub/GitLab URL that resolves
- All three events carry the correct project_id from the task context
- Events are emitted in correct temporal order (commit before push before PR)
- Existing git operations (checkout, fetch, clone) do NOT emit spurious events

0.4 Supervisor Configuration Flexibility

Enable per-call model and tool overrides. Required for playbook per-node llm_config and transition evaluation with cheaper models.

Spec: [[playbooks#6. Execution Model]] — Customizable Agent Configuration, [[playbooks#17. Prerequisite Refactors]] — Supervisor Configuration Flexibility
Existing code: src/supervisor.py ([[specs/supervisor]])

# Task Depends On
0.4.1 Add llm_config optional parameter to Supervisor.chat()
0.4.2 Implement chat provider swap based on llm_config within a call 0.4.1
0.4.3 Add tool_overrides parameter to Supervisor.chat() for restricting available tool set 0.4.1
0.4.4 Add tests: Supervisor model override via llm_config. Cases: (a) chat() with llm_config={"model": "gpt-4o"} routes to OpenAI provider instead of default, (b) chat() with llm_config={"model": "claude-sonnet-4-20250514"} routes to Anthropic provider, (c) chat() without llm_config uses the default model from agent profile (backward compat), (d) llm_config with invalid/unknown model returns clear error, does not fall back silently, (e) model override applies only to that single call — subsequent calls without override use default, (f) llm_config with additional parameters (temperature, max_tokens) are passed through to provider 0.4.2
0.4.5 Add tests: Supervisor tool restriction via tool_overrides. Cases: (a) chat() with tool_overrides=["read_file", "write_file"] only exposes those two tools to the LLM, (b) LLM attempt to call a tool not in the override list is blocked with clear error, (c) chat() without tool_overrides exposes the full default tool set (backward compat), (d) empty tool_overrides=[] disables all tools (LLM can only produce text), (e) tool_overrides with unknown tool name raises validation error before LLM call, (f) tool restriction applies only to that single call — next call without override has full tools 0.4.3
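
A sketch of how the per-call overrides in 0.4.1-0.4.3 could resolve without mutating the agent's defaults. The default model, tool names, and helper are placeholders; the real Supervisor.chat() in src/supervisor.py will differ:

```python
from typing import Any

# Placeholder defaults and tool names; the real values live in the agent profile.
DEFAULT_MODEL = "claude-sonnet-4-20250514"
DEFAULT_TOOLS = ["read_file", "write_file", "run_tests", "memory_search"]
KNOWN_TOOLS = set(DEFAULT_TOOLS) | {"memory_save"}


def resolve_call_config(
    llm_config: dict[str, Any] | None = None,
    tool_overrides: list[str] | None = None,
) -> tuple[dict[str, Any], list[str]]:
    """Resolve model and tool set for one chat() call; the defaults stay untouched."""
    config: dict[str, Any] = {"model": DEFAULT_MODEL}
    if llm_config:
        config.update(llm_config)  # temperature, max_tokens, etc. pass through (0.4.4 case f)
    if tool_overrides is None:
        tools = list(DEFAULT_TOOLS)  # backward compat: full default tool set
    else:
        unknown = [t for t in tool_overrides if t not in KNOWN_TOOLS]
        if unknown:
            raise ValueError(f"unknown tools in tool_overrides: {unknown}")  # 0.4.5 case e
        tools = list(tool_overrides)  # an empty list disables all tools (text-only call)
    return config, tools


config, tools = resolve_call_config({"model": "gpt-4o", "temperature": 0.2}, ["read_file"])
assert config["model"] == "gpt-4o" and tools == ["read_file"]
```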

0.5 Task Records Migration

Move task records out of the memory search path. Stops task files from polluting memory retrieval results.

Spec: [[vault#6. Migration Path]] — Phase 1, [[playbooks#17. Prerequisite Refactors]] — Task Records Migration
Existing code: src/memory.py (task record storage paths)

# Task Depends On
0.5.1 Create ~/.agent-queue/tasks/{project_id}/ directory structure
0.5.2 Update task record write path in MemoryManager to use new location 0.5.1
0.5.3 Write migration script to move existing memory/*/tasks/ to tasks/*/ 0.5.1
0.5.4 Update memory indexer to exclude the old tasks path 0.5.2
0.5.5 Re-index existing project memory collections (without task files) 0.5.4
0.5.6 Add tests: Task records migration. Cases: (a) new task record writes to ~/.agent-queue/tasks/{project_id}/ not memory/*/tasks/, (b) memory search query returns zero results from task files (no task pollution), (c) migration script moves existing task files and preserves content byte-for-byte, (d) migration script is idempotent — running twice does not duplicate or corrupt files, (e) old task path is empty after migration, (f) task read operations find records at new location, (g) projects with no existing tasks do not cause migration errors (empty source dir), (h) re-index after migration produces a clean collection with no task entries 0.5.5
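
A sketch of the idempotent relocation in 0.5.3, assuming the old and new layouts named in the tasks; error handling and logging are omitted:

```python
import shutil
from pathlib import Path


def migrate_task_records(queue_root: Path) -> int:
    """Move memory/*/tasks/* into tasks/{project_id}/, skipping files already moved."""
    moved = 0
    for old_tasks_dir in (queue_root / "memory").glob("*/tasks"):
        project_id = old_tasks_dir.parent.name
        new_dir = queue_root / "tasks" / project_id
        new_dir.mkdir(parents=True, exist_ok=True)
        for record in old_tasks_dir.iterdir():
            target = new_dir / record.name
            if target.exists():
                continue  # idempotent: re-running never duplicates or overwrites (0.5.6 case d)
            shutil.move(str(record), str(target))
            moved += 1
    return moved
```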

Test checkpoint: Full system test — create tasks, complete them, verify records appear in new location. Memory search should return cleaner results without task noise. Specific validations:
- Create 5+ tasks across 2 projects, complete them — all records in tasks/{project_id}/
- Memory search for project-related terms returns notes and insights, NOT task records
- Verify task records retain all metadata (status, timestamps, agent_id, outcome)
- Migration from old layout: seed old-style task files, run migration, verify new paths
- Memory index file counts decrease after re-index (task files no longer indexed)
- System startup with fresh install creates tasks/ directory structure automatically


Phase 1: Vault Structure ✅ COMPLETE

Vault directory structure created, content migration implemented, unified VaultWatcher with path-based dispatch for all 6 handler types landed. All tests passing.

Originally: Create the vault directory layout and the unified file watcher.

Source: [[vault#2. Directory Layout]], [[vault#5. Obsidian Integration]]

1.1 Vault Directory Creation

Spec: [[vault#2. Directory Layout]]

# Task Depends On
1.1.1 Create vault directory structure at ~/.agent-queue/vault/ with all subdirectories (system/, orchestrator/, agent-types/, projects/, templates/) per [[vault#2. Directory Layout]]
1.1.2 Add vault path constants to AppConfig ([[specs/config]])
1.1.3 Create vault_manager.py module for vault path resolution and directory creation 1.1.1, 1.1.2
1.1.4 Wire vault initialization into orchestrator startup ([[specs/orchestrator]]) 1.1.3

1.2 Content Migration

Spec: [[vault#6. Migration Path]] — Phase 1

# Task Depends On
1.2.1 Move existing .obsidian/ config from memory/ to vault/ 1.1.1
1.2.2 Move existing rule files from memory/*/rules/ to vault/system/playbooks/ (or project playbooks per [[playbooks#8. Scoping]]) 1.1.1
1.2.3 Move existing notes from notes/ to vault/projects/*/notes/ 1.1.1
1.2.4 Copy existing project memory files to vault/projects/*/memory/ 1.1.1
1.2.5 Write migration script that handles all moves idempotently 1.2.1–1.2.4
1.2.6 Add startup check: if old paths exist and vault is empty, run migration 1.2.5

1.3 Unified Vault File Watcher

Spec: [[playbooks#17. Prerequisite Refactors]] — Unified Vault File Watcher, [[vault#5. Obsidian Integration]] (file changes drive re-indexing)

# Task Depends On
1.3.1 Implement VaultWatcher class using existing FileWatcher pattern 1.1.1
1.3.2 Implement path-based dispatch: */playbooks/*.md → playbook compilation handler 1.3.1
1.3.3 Implement path-based dispatch: */profile.md → profile sync handler 1.3.1
1.3.4 Implement path-based dispatch: */memory/**/*.md → memory re-index handler 1.3.1
1.3.5 Implement path-based dispatch: projects/*/README.md → orchestrator summary handler 1.3.1
1.3.6 Implement path-based dispatch: */overrides/*.md → override re-index handler 1.3.1
1.3.7 Implement path-based dispatch: */facts.md → KV sync handler 1.3.1
1.3.8 Wire VaultWatcher into orchestrator startup and tick loop 1.3.1
1.3.9 Add tests: VaultWatcher path-based dispatch. Cases: (a) creating/editing a file in */playbooks/*.md triggers the playbook compilation handler only, (b) creating/editing */profile.md triggers the profile sync handler only, (c) creating/editing a file in */memory/**/*.md triggers the memory re-index handler, (d) creating/editing projects/*/README.md triggers the orchestrator summary handler, (e) creating/editing a file in */overrides/*.md triggers the override re-index handler, (f) creating/editing */facts.md triggers the KV sync handler, (g) editing a file outside any watched path triggers NO handler, (h) deleting a watched file triggers the appropriate handler (not just create/modify), (i) renaming a file across path categories triggers both the old-path handler (delete) and new-path handler (create) 1.3.2–1.3.7
1.3.10 Add tests: VaultWatcher debounce behavior. Cases: (a) editing the same file 10 times in 100ms triggers the handler only once (debounce window), (b) editing two different files in the same category in quick succession triggers handler once per file (separate debounce keys), (c) editing a file, waiting past debounce window, editing again triggers handler twice, (d) debounce does not drop the final state — handler receives the latest content, (e) handler errors during debounced call do not prevent future triggers for the same file, (f) debounce window is configurable and defaults to a reasonable value (e.g., 500ms) 1.3.8
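
The dispatch table from 1.3.2-1.3.7, sketched with fnmatch globs. Handler names are placeholders, and fnmatch lets * cross directory separators, which is looser than the real watcher should be:

```python
from fnmatch import fnmatch

DISPATCH: list[tuple[str, str]] = [
    ("*/playbooks/*.md", "playbook_compile"),
    ("*/profile.md", "profile_sync"),
    ("*/memory/**/*.md", "memory_reindex"),
    ("projects/*/README.md", "orchestrator_summary"),
    ("*/overrides/*.md", "override_reindex"),
    ("*/facts.md", "kv_sync"),
]


def handlers_for(rel_path: str) -> list[str]:
    """Return the handler names whose glob matches a vault-relative path."""
    return [name for pattern, name in DISPATCH if fnmatch(rel_path, pattern)]


assert handlers_for("agent-types/coding/profile.md") == ["profile_sync"]
assert handlers_for("projects/myapp/README.md") == ["orchestrator_summary"]
assert handlers_for("projects/myapp/notes/scratch.md") == []  # unwatched path, no handler
```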

Test checkpoint: Create vault structure, edit files in each location, verify the correct handler fires. Edit a profile.md, verify change is detected. Edit a memory file, verify re-index triggered. This is the foundation — it must be solid. Specific validations:
- Vault directory structure matches [[vault#2. Directory Layout]] exactly (all subdirs present)
- Each of the 6 dispatch paths (playbooks, profiles, memory, README, overrides, facts) fires the correct handler and no others
- File watcher detects changes within 1 second on all supported platforms (Linux inotify, macOS FSEvents)
- Debounce prevents handler storms during bulk file operations (e.g., git checkout switching many files)
- Handler exceptions are logged but do not crash the watcher or block other handlers
- Watcher survives vault directory being temporarily unavailable (e.g., network mount disconnect)
- Content migration from Phase 1.2 triggers appropriate handlers upon startup (or is explicitly suppressed during migration)


Phase 2: memsearch Fork & Memory Plugin v2

Fork memsearch, add KV + temporal + topic support, build the new plugin.

Source: [[memory-plugin#3. New Architecture]], [[memory-plugin#5. memsearch Fork]], [[memory-plugin#6. Collection Schema]], [[memory-plugin#7. Milvus Backend Topology]]

2.1 memsearch Fork

Spec: [[memory-plugin#5. memsearch Fork]], [[memory-plugin#6. Collection Schema]]

# Task Depends On
2.1.1 ~~Fork memsearch to internal repo~~ ✅ DONE — forked to ElectricJack/memsearch, added as git subtree at packages/memsearch/
2.1.2 Implement unified collection schema per [[memory-plugin#6. Collection Schema]] (entry_type, content, original, kv fields, valid_from/to, topic, tags, updated_at). Work in packages/memsearch/. ✅ 2.1.1
2.1.3 Implement KV insert/query methods (scalar-only, no vector search) per [[memory-plugin#6. Collection Schema]] KV Queries 2.1.2
2.1.4 Implement temporal insert/query methods (valid_from/valid_to windowed lookups) per [[memory-plugin#6. Collection Schema]] Temporal Queries 2.1.2
2.1.5 Implement temporal fact lifecycle (close old window, open new on update) per [[memory-plugin#6. Collection Schema]] Temporal Fact Lifecycle 2.1.4
2.1.6 Implement historical "as-of" query method 2.1.4
2.1.7 Implement topic field support (scalar filter before vector search) per [[memory-scoping#3. Topic Filtering]] 2.1.2
2.1.8 Implement topic filter fallback: auto-widen to unfiltered when < 3 results per [[memory-scoping#3. Topic Filtering]] 2.1.7
2.1.9 Implement multi-collection parallel search with weighted merging per [[memory-scoping#4. Scope Hierarchy]] 2.1.2
2.1.10 Implement scope-aware collection naming (aq_system, aq_agenttype_*, aq_project_*) per [[memory-plugin#7. Milvus Backend Topology]] 2.1.2
2.1.11 Implement tag-based cross-collection search per [[memory-plugin#7. Milvus Backend Topology]] Tag-Based Cross-Scope Discovery 2.1.2
2.1.12 Implement original field storage (full content alongside summary embedding) per [[memory-scoping#9. Summary + Original Pattern]] 2.1.2
2.1.13 Implement retrieval tracking (update retrieval_count, last_retrieved) per [[self-improvement#6. Memory Health & Observability]] 2.1.2
2.1.14 Implement embedding model version tracking per collection for future re-indexing per [[memory-plugin#8. Open Questions]] 2.1.2
2.1.15 Add tests: KV insert/query round-trip per [[memory-plugin#6. Collection Schema]] KV Queries. Cases: (a) insert a KV pair and query by exact key returns correct value, (b) insert multiple KV pairs with different keys and query each independently, (c) overwrite an existing key and verify query returns the new value, (d) query for non-existent key returns empty/None (not error), (e) KV query uses scalar-only path (no vector search invoked — verify with mock), (f) insert KV pair with namespace and query with same namespace returns it, query with different namespace does not, (g) KV values can store complex strings (multi-line, unicode, special characters) 2.1.3
2.1.16 Add tests: Temporal fact lifecycle per [[memory-plugin#6. Collection Schema]] Temporal Fact Lifecycle. Cases: (a) insert temporal fact with valid_from=now, valid_to=None — as-of query at now returns it, (b) update fact: old record gets valid_to=now, new record gets valid_from=now — as-of query at now returns new value, (c) as-of query at past timestamp returns the old value (before update), (d) as-of query at future timestamp returns current value (valid_to=None), (e) multiple updates create a complete history chain — as-of query at each point returns correct version, (f) temporal query with no matching time window returns empty, (g) concurrent updates to same fact do not corrupt the window chain (no overlapping valid_from/valid_to) 2.1.5, 2.1.6
2.1.17 Add tests: Topic-filtered search per [[memory-scoping#3. Topic Filtering]]. Cases: (a) search with topic="testing" returns only memories tagged with "testing" topic, (b) search without topic filter returns memories across all topics, (c) topic filter with < 3 results auto-widens to unfiltered search and returns more results (fallback per spec), (d) topic filter with >= 3 results does NOT widen (stays filtered), (e) search with non-existent topic returns 0 filtered results then falls back to unfiltered, (f) topic filter is applied as scalar pre-filter before vector similarity (verify with query plan or mock), (g) fallback results are clearly distinguishable from direct matches (e.g., metadata flag) 2.1.7, 2.1.8
2.1.18 Add tests: Multi-collection weighted merge per [[memory-scoping#4. Scope Hierarchy]]. Cases: (a) search across 3 collections with weights [1.0, 0.7, 0.4] — result from weight-1.0 collection ranks above equally-similar result from weight-0.4 collection, (b) very high similarity in low-weight collection can still outrank moderate similarity in high-weight collection (weight adjusts score, doesn't override), (c) empty collection in the merge set does not cause errors, (d) results are deduplicated across collections (same content in two scopes appears once with highest weighted score), (e) merge respects requested result limit (top-K after merge, not top-K per collection), (f) parallel search across collections completes within reasonable time (not sequential N * latency) 2.1.9
2.1.19 Add tests: Cross-collection tag search per [[memory-plugin#7. Milvus Backend Topology]] Tag-Based Cross-Scope Discovery. Cases: (a) memory tagged #api-pattern in project collection is found by tag search from system scope, (b) tag search returns results from multiple collections with correct source attribution, (c) tag search with non-existent tag returns empty (not error), (d) memory with multiple tags is found by search on any single tag, (e) tag search combined with topic filter narrows results correctly, (f) tag names are case-insensitive (#API matches #api), (g) special characters in tags (hyphens, underscores) work correctly 2.1.11
2.1.20 Add tests: Retrieval tracking per [[self-improvement#6. Memory Health & Observability]]. Cases: (a) searching and returning a memory increments its retrieval_count by 1, (b) last_retrieved timestamp updates to current time on retrieval, (c) memory not returned by search has retrieval_count unchanged, (d) multiple searches returning same memory increment count cumulatively, (e) retrieval tracking works for both semantic search and KV query, (f) retrieval tracking does not slow down search response time noticeably (< 10% overhead), (g) initial retrieval_count is 0 and last_retrieved is null for newly inserted memories 2.1.13
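
The temporal lifecycle in 2.1.4-2.1.6 (close the open window, open a new one, answer as-of queries) reduces to window bookkeeping. This in-memory sketch ignores the Milvus-backed storage the fork actually uses:

```python
import time
from dataclasses import dataclass


@dataclass
class FactVersion:
    value: str
    valid_from: float
    valid_to: float | None = None  # None means the window is still open


class TemporalFact:
    def __init__(self) -> None:
        self.versions: list[FactVersion] = []

    def update(self, value: str, now: float | None = None) -> None:
        """Close the open window and open a new one (task 2.1.5)."""
        now = time.time() if now is None else now
        if self.versions and self.versions[-1].valid_to is None:
            self.versions[-1].valid_to = now
        self.versions.append(FactVersion(value, valid_from=now))

    def as_of(self, ts: float) -> str | None:
        """Historical lookup: return the version whose window contains ts (task 2.1.6)."""
        for version in self.versions:
            if version.valid_from <= ts and (version.valid_to is None or ts < version.valid_to):
                return version.value
        return None


fact = TemporalFact()
fact.update("postgres 14", now=100.0)
fact.update("postgres 16", now=200.0)
assert fact.as_of(150.0) == "postgres 14"  # value before the update
assert fact.as_of(250.0) == "postgres 16"  # current value (valid_to is None)
```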

Test checkpoint: Run the full memsearch fork test suite against Milvus Lite. Verify all new features work in isolation before integrating with the plugin. Specific validations:
- Original memsearch test suite passes (no regressions in base functionality)
- KV round-trip: insert 50 key-value pairs, query each, verify 100% accuracy
- Temporal: create a fact, update it 5 times, query as-of each historical timestamp
- Topic filter: insert 20 memories across 4 topics, verify filtered search returns only matching topic
- Topic fallback: search a topic with 1 result, verify auto-widen returns additional cross-topic results
- Multi-collection merge: create 3 collections with overlapping content, verify weighted ranking is correct
- Tag search: tag memories in different collections, verify cross-collection discovery works
- Retrieval tracking: search 10 times, verify counts and timestamps are accurate
- Collection naming follows aq_system, aq_agenttype_*, aq_project_* convention
- All tests run against Milvus Lite (in-process, no external dependency)

2.2 Memory Plugin v2 Skeleton

Spec: [[memory-plugin#3. New Architecture]], [[memory-plugin#4. Why a Plugin (Not Core)]], [[memory-scoping#7. Agent Memory Tools (MCP)]]

# Task Depends On
2.2.1 Create src/plugins/internal/memory_v2.py plugin skeleton implementing InternalPlugin per [[specs/plugin-system]]
2.2.2 Register plugin with PluginRegistry, coexisting with v1 during transition 2.2.1
2.2.3 Implement MemoryService v2 protocol wrapping the memsearch fork per [[memory-plugin#3. New Architecture]] 2.1.10, 2.2.1
2.2.4 Implement MCP tool: memory_search (semantic search with optional topic filter) per [[memory-scoping#7. Agent Memory Tools (MCP)]] 2.2.3
2.2.5 Implement MCP tool: memory_save (with dedup, summary/original, topic auto-detect) per [[memory-scoping#8. memory_save Flow]] 2.2.3
2.2.6 Implement topic auto-detection in memory_save (infer from content + task context) per [[memory-scoping#3. Topic Filtering]] 2.2.5
2.2.7 Implement MCP tool: memory_list (browse memories in a scope) 2.2.3
2.2.8 Implement MCP tool: memory_recall (KV lookup with scope resolution) per [[memory-scoping#6. Multi-Scope Query]] 2.2.3
2.2.9 Implement MCP tool: memory_store (KV write to scope + vault fact file update) 2.2.3
2.2.10 Implement MCP tool: memory_list_facts (list KV entries by scope/namespace) 2.2.3
2.2.11 Implement MCP tool: memory_get (unified auto-routing: KV first, then semantic) per [[memory-scoping#7. Agent Memory Tools (MCP)]] 2.2.8, 2.2.4
2.2.12 Implement full=true parameter on memory_get to return original instead of summary per [[memory-scoping#9. Summary + Original Pattern]] 2.2.11
2.2.13 Implement facts.md parser (key:value pairs under markdown headings → KV entries) per [[memory-plugin#7. Milvus Backend Topology]] Fact Files 2.2.3
2.2.14 Implement facts.md writer (KV changes → update vault fact file bidirectionally) 2.2.13
2.2.15 Wire facts.md file watcher handler from Phase 1.3 to facts.md parser 1.3.7, 2.2.13
2.2.16 Add tests: MCP memory tool round-trips per [[memory-scoping#7. Agent Memory Tools (MCP)]]. Cases: (a) memory_save then memory_search returns the saved content with high similarity, (b) memory_store a KV pair then memory_recall retrieves exact value, (c) memory_list returns all memories in scope with correct metadata, (d) memory_list_facts returns all KV entries in scope/namespace, (e) memory_save with duplicate content does not create a second entry (dedup), (f) memory_search with no results returns empty list (not error), (g) memory_store then overwrite same key then memory_recall returns latest value, (h) all tools return well-formed response dicts with success field 2.2.4–2.2.11
2.2.17 Add tests: facts.md bidirectional sync per [[memory-plugin#7. Milvus Backend Topology]] Fact Files. Cases: (a) parse a facts.md with key: value pairs under headings — each pair appears as KV entry in collection, (b) facts.md with multiple headings creates KV entries with heading as namespace, (c) memory_recall for a key parsed from facts.md returns correct value, (d) editing facts.md (change a value) triggers re-parse and updates KV entry in collection, (e) memory_store a new KV pair triggers facts.md writer to append the entry to the file, (f) facts.md with malformed lines (no colon, empty value) logs warning but does not crash — valid lines still parsed, (g) facts.md with markdown formatting in values (bold, links) preserves formatting in stored value 2.2.13, 2.2.8
2.2.18 Add tests: memory_get auto-routing per [[memory-scoping#7. Agent Memory Tools (MCP)]]. Cases: (a) memory_get("preferred_language") where a KV entry with that exact key exists returns the KV value (no vector search), (b) memory_get("how do we handle errors in the API") with no KV match falls through to semantic search, (c) memory_get with full=true returns original content instead of summary per [[memory-scoping#9. Summary + Original Pattern]], (d) memory_get for a key that exists in KV but also has semantic matches returns the KV result (KV takes priority), (e) memory_get with empty query returns error/empty (not crash), (f) routing decision is transparent in response metadata (indicates whether KV or semantic was used) 2.2.11
2.2.19 Add tests: Topic auto-detection in memory_save per [[memory-scoping#3. Topic Filtering]]. Cases: (a) saving content about "pytest fixtures and mocking" auto-detects topic as "testing" or similar, (b) saving content about "database schema migration" auto-detects a topic related to "database" or "infrastructure", (c) saving with explicit topic parameter overrides auto-detection, (d) saving very short content (< 10 tokens) still assigns a topic (falls back to task context), (e) topic detection is consistent — saving similar content twice assigns the same topic, (f) auto-detected topic is from a reasonable controlled vocabulary (not arbitrary free text) 2.2.6
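
A sketch of the facts.md parsing in 2.2.13: key: value lines grouped under markdown headings become namespaced KV entries. The real parser also pairs with the writer in 2.2.14 so edits flow both ways:

```python
def parse_facts(markdown: str) -> dict[tuple[str, str], str]:
    """Turn key: value lines under markdown headings into (namespace, key) -> value entries."""
    entries: dict[tuple[str, str], str] = {}
    namespace = "default"
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            namespace = stripped.lstrip("#").strip().lower().replace(" ", "_")
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            if key.strip() and value.strip():
                entries[(namespace, key.strip())] = value.strip()
        # malformed lines (no colon, empty value) are skipped, not fatal (2.2.17 case f)
    return entries


facts = parse_facts("""\
# API
api_base_url: https://api.example.com

# Conventions
preferred_language: python
""")
assert facts[("api", "api_base_url")] == "https://api.example.com"
assert facts[("conventions", "preferred_language")] == "python"
```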

Test checkpoint: End-to-end: an agent saves an insight via MCP, a second agent searches and finds it. An agent stores a fact, another agent recalls it. Both the vault markdown files and the Milvus collections are consistent. Specific validations:
- Agent A calls memory_save("API rate limiting needs exponential backoff") — verify entry in Milvus collection
- Agent B calls memory_search("rate limit handling") — verify Agent A's insight is in results
- Agent A calls memory_store("api_base_url", "https://api.example.com") — verify KV in collection AND facts.md updated
- Agent B calls memory_recall("api_base_url") — verify returns "https://api.example.com"
- Edit facts.md directly (Obsidian simulation): change api_base_url value — verify memory_recall returns new value
- Verify memory_list shows both semantic memories and KV entries with correct entry_type
- Verify memory_get auto-routes correctly for both KV keys and semantic queries
- Plugin coexists with v1 during transition: both registered, v2 handles new calls, v1 still available


Phase 3: Memory Scoping & Tiers

Build the scope hierarchy, tiered loading, and override model.

Source: [[memory-scoping]]

3.1 Scope Resolution

Spec: [[memory-scoping#4. Scope Hierarchy]], [[memory-scoping#6. Multi-Scope Query]]

# Task Depends On
3.1.1 Implement scope resolver: given (agent_type, project_id), return ordered collection list with weights per [[memory-scoping#4. Scope Hierarchy]] 2.1.10
3.1.2 Create per-agent-type collections on first profile creation 3.1.1
3.1.3 Create system-level collection on startup 3.1.1
3.1.4 Create orchestrator collection on startup 3.1.1
3.1.5 Migrate existing per-project collections to new naming convention 3.1.1
3.1.6 Implement first-match-wins KV scope resolution (project → agent-type → system) per [[memory-scoping#6. Multi-Scope Query]] 3.1.1
3.1.7 Implement weighted merge for semantic search across scopes (project weight=1.0, agent-type=0.7, system=0.4) per [[memory-scoping#6. Multi-Scope Query]] 3.1.1
3.1.8 Add tests: Scope resolution per [[memory-scoping#4. Scope Hierarchy]]. Cases: (a) resolver for (agent_type="coding", project_id="myapp") returns collections in order: [aq_project_myapp, aq_agenttype_coding, aq_system] with weights [1.0, 0.7, 0.4], (b) resolver for (agent_type="coding", project_id=None) returns [aq_agenttype_coding, aq_system] (no project scope), (c) resolver for (agent_type=None, project_id="myapp") returns [aq_project_myapp, aq_system] (no agent-type scope), (d) resolver for unknown agent_type still returns system collection, (e) collections are created on-demand if they don't exist yet, (f) weight values match the spec exactly and are configurable 3.1.1
3.1.9 Add tests: KV scope resolution with first-match-wins per [[memory-scoping#6. Multi-Scope Query]]. Cases: (a) KV key exists in project scope — returns project value, does NOT query agent-type or system, (b) KV key missing from project scope, exists in agent-type scope — returns agent-type value, (c) KV key missing from project and agent-type, exists in system — returns system value, (d) KV key missing from all scopes — returns None/empty, (e) same key exists in both project and system scope — project value wins (first-match), (f) writing a KV entry writes to the most specific scope (project if project_id is set), (g) deleting a project-scope KV entry causes fallthrough to agent-type/system value 3.1.6
3.1.10 Add tests: Semantic search weighted merge across scopes per [[memory-scoping#6. Multi-Scope Query]]. Cases: (a) insert similar content in project (weight 1.0) and system (weight 0.4) — project result ranks first, (b) insert highly relevant content in system scope and weakly relevant in project — system result can still rank high if raw similarity is much higher, (c) search across 3 scopes with 5 results each — merged output is top-K by weighted score, (d) scope with no matching results contributes nothing to merge (no padding with low-score results), (e) results include source scope metadata so caller knows which scope each result came from, (f) total search latency is bounded (parallel scope queries, not sequential) 3.1.7
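
A sketch of the resolver and first-match-wins recall from 3.1.1 and 3.1.6. Collection names and weights follow the spec; the in-memory stores dict stands in for Milvus collections:

```python
def resolve_scopes(agent_type: str | None, project_id: str | None) -> list[tuple[str, float]]:
    """Ordered (collection, weight) list: project 1.0, agent-type 0.7, system 0.4."""
    scopes: list[tuple[str, float]] = []
    if project_id:
        scopes.append((f"aq_project_{project_id}", 1.0))
    if agent_type:
        scopes.append((f"aq_agenttype_{agent_type}", 0.7))
    scopes.append(("aq_system", 0.4))
    return scopes


def kv_recall(
    key: str,
    stores: dict[str, dict[str, str]],
    agent_type: str | None,
    project_id: str | None,
) -> str | None:
    """First-match-wins: the most specific scope that has the key answers (3.1.6)."""
    for collection, _weight in resolve_scopes(agent_type, project_id):
        value = stores.get(collection, {}).get(key)
        if value is not None:
            return value
    return None


assert resolve_scopes("coding", "myapp") == [
    ("aq_project_myapp", 1.0),
    ("aq_agenttype_coding", 0.7),
    ("aq_system", 0.4),
]
```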

3.2 Override Model

Spec: [[memory-scoping#5. Override Model]]

# Task Depends On
3.2.1 Implement override file indexing (vault/projects/{id}/overrides/{type}.md into project collection with highest weight) per [[memory-scoping#5. Override Model]] 3.1.1
3.2.2 Wire override file watcher handler from Phase 1.3 1.3.6, 3.2.1
3.2.3 Implement override injection into agent context alongside base [[profiles|profile]]
3.2.4 Add tests: Override indexing and retrieval per [[memory-scoping#5. Override Model]]. Cases: (a) create vault/projects/myapp/overrides/coding.md — content appears in project-scope search results, (b) override content has highest weight (above normal project memories) so it ranks first for matching queries, (c) override content is injected into agent context alongside base profile, (d) updating override file triggers re-index and new content appears in subsequent searches, (e) deleting override file removes it from search results, (f) override with empty content does not inject empty string into context 3.2.1
3.2.5 Add tests: Override scope isolation. Cases: (a) override file overrides/coding.md does NOT appear in searches for agent-type "qa", (b) override file overrides/coding.md DOES appear for agent-type "coding" in that project, (c) system-level override in vault/system/overrides/ applies to all agent types, (d) project override takes precedence over system override for the same agent type, (e) agent with no matching override file still works normally (no override is fine), (f) override for project A does not leak into project B searches even for the same agent type 3.2.1

3.3 Memory Tiers

Spec: [[memory-scoping#2. Memory Tiers (L0–L3)]]

# Task Depends On
3.3.1 Implement L0 injection: extract ## Role from [[profiles|profile.md]] into agent system prompt (~50 tokens)
3.3.2 Implement L1 injection: eager-load project + agent-type facts.md KV entries at task start (~200 tokens) 2.2.8, 3.1.6
3.3.3 Implement L2 topic detection from task description/context per [[memory-scoping#3. Topic Filtering]] 2.1.7
3.3.4 Implement L2 topic-filtered memory loading when topic is detected (~500 tokens) 3.3.3
3.3.5 Wire L0 + L1 into task execution path (adapter/prompt context building, [[specs/prompt-builder]]) 3.3.1, 3.3.2
3.3.6 Wire L2 into task execution path (on-demand when topic emerges) 3.3.4
3.3.7 Add tests: L0+L1 tier injection per [[memory-scoping#2. Memory Tiers (L0–L3)]]. Cases: (a) every task context includes the ## Role section from the agent's profile.md (L0, ~50 tokens), (b) every task context includes project + agent-type facts.md KV entries (L1, ~200 tokens), (c) combined L0+L1 is approximately 250 tokens baseline (verify within tolerance), (d) L0 is absent if agent has no profile.md (graceful degradation), (e) L1 is absent if no facts.md exists for the scope (no error), (f) L0+L1 content appears in the system prompt section (not user message), (g) agent with profile but no project still gets L0 + agent-type L1 facts 3.3.5
3.3.8 Add tests: L2 topic-filtered memory loading per [[memory-scoping#2. Memory Tiers (L0–L3)]]. Cases: (a) task about "testing the payment API" detects topic "testing"/"payments" and loads relevant topic memories (~500 tokens), (b) task about "update README" does NOT load "testing" topic memories (topic mismatch), (c) L2 memories are loaded on-demand when topic emerges mid-task (not at initial context build), (d) L2 memories do not exceed ~500 token budget (truncated or top-K limited), (e) task with no detectable topic does not load any L2 memories (L0+L1 only), (f) L2 topic detection works from both task description and ongoing conversation context 3.3.6
3.3.9 Add tests: L3 on-demand search via memory_search tool per [[memory-scoping#2. Memory Tiers (L0–L3)]]. Cases: (a) agent explicitly calls memory_search("database optimization") and gets results from all topics (not limited to current topic), (b) L3 search returns results from all scopes (project + agent-type + system) with correct weighted merge, (c) L3 search does not duplicate results already loaded in L1 or L2, (d) L3 search works even when L2 is not active (no topic detected), (e) L3 results include source scope and topic metadata, (f) L3 search respects the same retrieval tracking (retrieval_count increments) 3.3.6
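
Tier assembly from 3.3.5-3.3.6, sketched as plain string building. The section titles and helper are illustrative; the token budgets are the spec's approximate 50/200/500 targets:

```python
def build_context(role: str | None, facts: dict[str, str], topic_memories: list[str]) -> str:
    """Assemble L0 + L1 (+ L2 when a topic is detected); L3 stays on-demand via memory_search."""
    sections: list[str] = []
    if role:  # L0: ## Role from profile.md, roughly 50 tokens
        sections.append(f"## Role\n{role}")
    if facts:  # L1: eager-loaded project + agent-type facts, roughly 200 tokens
        facts_md = "\n".join(f"- {key}: {value}" for key, value in facts.items())
        sections.append(f"## Facts\n{facts_md}")
    if topic_memories:  # L2: topic-filtered memories, roughly 500 tokens
        sections.append("## Relevant memories\n" + "\n".join(f"- {m}" for m in topic_memories))
    return "\n\n".join(sections)
```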

3.4 Deduplication & Summary

Spec: [[memory-scoping#8. memory_save Flow]], [[memory-scoping#9. Summary + Original Pattern]]

# Task Depends On
3.4.1 Implement similarity-based dedup in memory_save (>0.95 timestamp update, 0.8-0.95 LLM merge, <0.8 create new) per [[memory-scoping#8. memory_save Flow]] 2.2.5
3.4.2 Implement LLM merge for 0.8-0.95 similarity: combine content, prefer newer on contradiction, preserve tags per [[memory-scoping#8. memory_save Flow]] 3.4.1
3.4.3 Implement summary generation for long memories (>200 tokens → summarize for embedding, keep original) per [[memory-scoping#9. Summary + Original Pattern]] 2.1.12, 2.2.5
3.4.4 Add tests: Duplicate detection (>0.95 similarity) per [[memory-scoping#8. memory_save Flow]]. Cases: (a) saving identical content twice results in only one entry (second save updates timestamp only), (b) saving near-identical content (e.g., minor typo fix) with >0.95 similarity also deduplicates, (c) collection entry count does not increase on duplicate save, (d) the updated timestamp reflects the second save time, (e) duplicate detection works across the same scope only (same content in different scopes creates separate entries), (f) dedup check does not trigger on very short content where similarity is unreliable (< 5 tokens) 3.4.1
3.4.5 Add tests: LLM merge for similar content (0.8-0.95 similarity) per [[memory-scoping#8. memory_save Flow]]. Cases: (a) saving content with 0.85 similarity to existing triggers LLM merge call, (b) merged result contains information from both the old and new content, (c) on contradiction between old and new, merged result prefers newer information, (d) tags from both old and new content are preserved in merged entry, (e) merged entry replaces the old one (not two entries), (f) if LLM merge fails (provider error), original is kept and new content is saved separately with a warning, (g) merge produces content that is coherent and not just concatenated 3.4.2
3.4.6 Add tests: Distinct content save (<0.8 similarity) per [[memory-scoping#8. memory_save Flow]]. Cases: (a) saving content with <0.8 similarity to any existing entry creates a new entry, (b) collection entry count increases by 1 after distinct save, (c) both old and new entries are independently searchable, (d) saving to an empty collection always creates new (no dedup check needed), (e) saving 10 distinct pieces of content creates 10 entries with correct topics and tags, (f) distinct save assigns its own topic and tags independent of existing entries 3.4.1
3.4.7 Add tests: Summary + original pattern per [[memory-scoping#9. Summary + Original Pattern]]. Cases: (a) saving content >200 tokens generates a summary for the embedding and stores the original separately, (b) memory_search returns the summary (shorter, optimized for search), (c) memory_get with full=true returns the original full content, (d) memory_get without full=true returns the summary, (e) saving content <=200 tokens stores it as-is (no summary generated), (f) summary is meaningfully shorter than original (not just truncated), (g) original content is byte-for-byte identical to what was saved (no transformation loss) 3.4.3, 2.2.12
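
The dedup branching in 3.4.1 comes down to two similarity thresholds; a sketch with placeholder action names:

```python
def dedup_action(similarity: float) -> str:
    """Map similarity to the memory_save dedup decision from the spec's thresholds."""
    if similarity > 0.95:
        return "update_timestamp"   # near-identical: refresh the existing entry only
    if similarity >= 0.8:
        return "llm_merge"          # similar: merge old + new, prefer newer on conflict
    return "create_new"             # distinct: store as a new entry


assert dedup_action(0.97) == "update_timestamp"
assert dedup_action(0.85) == "llm_merge"
assert dedup_action(0.5) == "create_new"
```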

Test checkpoint: Full integration test: create an agent with a [[profiles|profile]], set up project facts, create [[memory-scoping#5. Override Model|overrides]]. Start a task — verify L0 (role) and L1 (facts) are in the context. Save several insights with topics. Start another task on the same topic — verify L2 topic memories appear. Search across topics — verify L3 returns cross-topic results with correct scope weighting. Specific validations:
- Create coding agent with profile.md containing ## Role, set up project facts.md with 5 KV pairs
- Create project override overrides/coding.md with project-specific instructions
- Start a task: verify context contains role (L0, ~50 tokens), facts (L1, ~200 tokens), override (highest weight)
- Agent saves 3 insights about "testing" topic and 3 about "deployment" topic via memory_save
- Start new task about testing: verify L2 loads testing-related memories, NOT deployment ones
- Call memory_search("deployment strategies"): verify L3 returns deployment memories from any topic
- Save near-duplicate insight: verify dedup merges (0.8-0.95) or updates timestamp (>0.95)
- Save long insight (>200 tokens): verify summary generated, full=true returns original
- KV scope test: store key in project, same key in system — memory_recall returns project value
- Override isolation: coding agent sees override, qa agent does not
- Scope weights: project result (1.0) outranks equally-relevant system result (0.4)


Phase 4: Profiles as Markdown

Move profiles from DB-only storage to a markdown source of truth in the [[vault]].

Source: [[profiles]]

4.1 Profile Parser & Sync

Spec: [[profiles#2. Hybrid Format]], [[profiles#3. Sync Model]]

# Task Depends On
4.1.1 Implement markdown profile parser: extract JSON blocks from ## Config, ## Tools, ## MCP Servers per [[profiles#2. Hybrid Format]]
4.1.2 Implement English section extractor for ## Role, ## Rules, ## Reflection 4.1.1
4.1.3 Implement JSON validation for Config block (model, permission_mode, max_tokens_per_task) 4.1.1
4.1.4 Implement JSON validation for Tools block (tool names checked against [[specs/tiered-tools|tool registry]], warn on unknown)
4.1.5 Implement JSON validation for MCP Servers block (command, args, env structure) 4.1.1
4.1.6 Implement profile → DB sync (parsed fields → agent_profiles table upsert) per [[profiles#3. Sync Model]] 4.1.1, 4.1.2
4.1.7 Wire profile.md file watcher handler from Phase 1.3 to parser + sync 1.3.3, 4.1.6
4.1.8 Implement error handling: bad JSON → sync fails, previous config retained, notification sent per [[profiles#3. Sync Model]] 4.1.6
4.1.9 Update chat/dashboard profile commands to write to markdown file instead of DB ([[specs/command-handler]]) 4.1.6
4.1.10 Add tests: Profile parser and DB sync per [[profiles#2. Hybrid Format]], [[profiles#3. Sync Model]]. Cases: (a) valid profile.md with all sections (Config, Tools, MCP Servers, Role, Rules, Reflection) parses every field correctly, (b) parsed Config JSON values (model, permission_mode, max_tokens_per_task) sync to agent_profiles DB table, (c) parsed Tools list syncs and each tool name is validated against tool registry, (d) English sections (Role, Rules, Reflection) are stored as raw markdown strings, (e) profile.md with only some sections (e.g., Role + Config, no Tools) parses the present sections and leaves others as defaults, (f) round-trip: write profile → parse → sync to DB → read DB → verify all fields match, (g) sync is an upsert — existing profile is updated, not duplicated 4.1.6
4.1.11 Add tests: Profile error handling per [[profiles#3. Sync Model]]. Cases: (a) malformed JSON in ## Config block triggers sync failure and sends notification, (b) DB row retains previous valid values after failed sync (no partial update), (c) invalid tool name in ## Tools produces warning but does not block sync of other sections, (d) malformed JSON in ## MCP Servers triggers failure notification, (e) completely empty profile.md does not crash parser (returns empty/default), (f) profile.md with valid frontmatter but garbled body sections fails gracefully, (g) error notification includes the file path and specific parse error for debugging 4.1.8
4.1.12 Add tests: File watcher profile sync integration. Cases: (a) edit profile.md Role section — DB updates with new Role text within one watcher cycle, (b) edit profile.md Config JSON (change model) — DB reflects new model value, (c) rapid edits to profile.md (3 edits in 500ms) trigger only one sync due to debounce, (d) creating a new profile.md in vault triggers initial sync and DB row creation, (e) deleting profile.md does NOT delete DB row (preserves last known config with warning), (f) concurrent edits to two different agents' profile.md files sync independently and correctly 4.1.7
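
A sketch of the hybrid parse in 4.1.1-4.1.2: JSON fenced blocks under ## Config, ## Tools, and ## MCP Servers, raw markdown under ## Role, ## Rules, and ## Reflection. The section-splitting regex is illustrative, not the real parser:

````python
import json
import re

JSON_SECTIONS = {"Config", "Tools", "MCP Servers"}


def parse_profile(markdown: str) -> dict[str, object]:
    """Split profile.md on ## headings; JSON sections are parsed, prose sections kept raw."""
    parsed: dict[str, object] = {}
    parts = re.split(r"^## +(.+)$", markdown, flags=re.MULTILINE)
    # re.split yields [preamble, title, body, title, body, ...]
    for title, body in zip(parts[1::2], parts[2::2]):
        title = title.strip()
        if title in JSON_SECTIONS:
            block = re.search(r"```json\s*(.*?)```", body, flags=re.DOTALL)
            if block:
                # Bad JSON raises here; the sync layer keeps the previous config (task 4.1.8).
                parsed[title] = json.loads(block.group(1))
        else:
            parsed[title] = body.strip()
    return parsed


profile = parse_profile("""\
## Role
You write and refactor code.

## Config
```json
{"model": "gpt-4o", "permission_mode": "ask", "max_tokens_per_task": 200000}
```
""")
assert profile["Config"]["model"] == "gpt-4o"
assert profile["Role"].startswith("You write")
````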

4.2 Profile Migration

Spec: [[vault#6. Migration Path]] — Phase 4

# Task Depends On
4.2.1 Write migration script: read existing DB profiles → generate markdown files in vault per [[profiles#2. Hybrid Format]] 4.1.1
4.2.2 Create default profile templates in vault/templates/
4.2.3 Create orchestrator profile.md (the orchestrator is its own agent type per [[self-improvement#5. Orchestrator Memory]]) 4.2.2
4.2.4 Add startup check: if DB profiles exist but no vault markdown, run migration 4.2.1

4.3 Starter Knowledge Packs

Spec: [[profiles#4. Starter Knowledge Packs]]

# Task Depends On
4.3.1 Create starter knowledge files for coding agent type (common pitfalls, git conventions)
4.3.2 Create starter knowledge files for code-review agent type (review checklist)
4.3.3 Create starter knowledge files for qa agent type (testing patterns)
4.3.4 Implement knowledge pack copy on first profile.md creation (detect new profile, copy matching templates, tag #starter) per [[profiles#4. Starter Knowledge Packs]] 4.1.7, 4.3.1
4.3.5 Add tests: Starter knowledge pack provisioning per [[profiles#4. Starter Knowledge Packs]]. Cases: (a) creating first profile.md for agent-type "coding" copies coding knowledge pack files to agent-type memory directory, (b) copied files are tagged #starter in frontmatter, (c) creating second profile.md for same agent-type does NOT copy pack again (already provisioned), (d) agent-type with no matching knowledge pack creates profile without error (pack is optional), (e) knowledge pack files are indexed into agent-type memory collection after copy, (f) #starter tag allows users to identify and optionally remove starter content, (g) starter files are copies (not symlinks) — editing them does not affect the template 4.3.4

Test checkpoint: Create a new agent profile via chat command. Verify: markdown file appears in vault, DB row syncs, starter knowledge pack copied. Edit the profile.md in Obsidian — verify DB updates. Intentionally break JSON in profile — verify graceful failure with notification. Specific validations:
- Chat command create_profile coding-agent creates vault/agent-types/coding/profile.md with template structure
- DB agent_profiles row matches all parsed fields from the new profile.md
- Starter knowledge pack for "coding" type is copied to vault/agent-types/coding/memory/ with #starter tags
- Edit profile.md ## Role in Obsidian → DB role field updates within watcher cycle
- Edit profile.md ## Config JSON to change model → DB model field updates
- Break ## Config JSON (missing closing brace) → notification sent, DB retains previous model value
- Fix the JSON → next watcher cycle syncs successfully, DB now has corrected value
- Migration test: existing DB-only profile → markdown generated → DB still matches
- Verify profile commands (list, show, update) all work with the new markdown-backed flow


Phase 5: Playbook System

The core new automation system replacing rules + hooks.

Source: [[playbooks]]

5.1 Playbook Compilation

Spec: [[playbooks#4. Authoring Model]] — LLM Compilation, [[playbooks#5. Compiled Format (JSON Schema)]]

# Task Depends On
5.1.1 Define playbook JSON schema as a Python dataclass or JSON Schema file per [[playbooks#5. Compiled Format (JSON Schema)]] (node fields, transition fields, top-level fields)
5.1.2 Implement PlaybookCompiler class: reads markdown + frontmatter, invokes LLM with schema, validates output 5.1.1
5.1.3 Implement graph validation: entry node exists, all transitions reference valid nodes, no unreachable nodes, cycles have exit conditions per [[playbooks#19. Open Questions]] #6 5.1.1
5.1.4 Implement compiled JSON storage in ~/.agent-queue/compiled/ with scope-mirrored directory structure per [[playbooks#8. Scoping]] Storage 5.1.2
5.1.5 Implement source_hash change detection (skip recompilation when unchanged) per [[playbooks#4. Authoring Model]] 5.1.4
5.1.6 Wire playbook file watcher handler from Phase 1.3 to compiler 1.3.2, 5.1.2
5.1.7 Implement compilation error handling (keep previous version active, surface error notification) per [[playbooks#4. Authoring Model]] 5.1.2
5.1.8 Add tests: Playbook compilation happy path per [[playbooks#4. Authoring Model]], [[playbooks#5. Compiled Format (JSON Schema)]]. Cases: (a) sample 3-node playbook markdown compiles to JSON that validates against the schema, (b) compiled JSON contains correct entry_node, all node definitions, and all transitions, (c) node fields (prompt, tools, llm_config, summarize_before) are correctly extracted, (d) transition fields (condition, target, structured expression) are correctly extracted, (e) frontmatter fields (trigger, scope, cooldown) are preserved in compiled output, (f) compilation is idempotent — compiling same markdown twice produces identical JSON, (g) compiled JSON is stored at correct path in ~/.agent-queue/compiled/ mirroring source scope 5.1.2
5.1.9 Add tests: Playbook compilation error handling per [[playbooks#4. Authoring Model]]. Cases: (a) markdown with no recognizable node structure produces compilation error notification, (b) previous valid compiled JSON is retained on disk after failed recompilation, (c) PlaybookManager continues to use the previous version for event matching, (d) error notification includes the file path and LLM/validation error details, (e) markdown with valid structure but LLM provider failure retains previous version and notifies, (f) partially valid markdown (some nodes valid, some broken) fails entire compilation (atomic — no partial updates), (g) fixing the markdown and saving again triggers successful recompilation 5.1.7
5.1.10 Add tests: Playbook source_hash change detection per [[playbooks#4. Authoring Model]]. Cases: (a) saving playbook markdown without content changes does NOT trigger recompilation (hash unchanged), (b) changing a comment or whitespace-only change does NOT trigger recompilation (hash based on normalized content), (c) changing a node prompt DOES trigger recompilation (hash changes), (d) after recompilation, stored source_hash matches new content, (e) compiled JSON timestamp updates only on actual recompilation, (f) force-compile command bypasses hash check and recompiles regardless 5.1.5
5.1.11 Add tests: Graph validation per [[playbooks#19. Open Questions]] #6. Cases: (a) graph with unreachable node (no incoming transitions, not entry) produces validation error naming the unreachable node, (b) graph with no entry node defined produces validation error, (c) transition referencing non-existent target node produces validation error with the invalid target name, (d) graph with cycle but no exit condition produces validation warning (cycles are allowed but must have exit), (e) graph with cycle AND exit condition passes validation, (f) valid graph (all nodes reachable, entry exists, all targets valid) passes validation silently, (g) graph with single node (entry = terminal) is valid, (h) graph with duplicate node names produces validation error 5.1.3
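
The structural checks in 5.1.3, sketched over a simplified compiled shape (entry_node plus per-node transitions with a target) rather than the real schema from 5.1.1:

```python
from collections import deque
from typing import Any


def validate_graph(playbook: dict[str, Any]) -> list[str]:
    """Check entry node, transition targets, and reachability; return human-readable errors."""
    errors: list[str] = []
    nodes: dict[str, Any] = playbook.get("nodes", {})
    entry = playbook.get("entry_node")
    if entry not in nodes:
        errors.append(f"entry node {entry!r} is not defined")
        return errors
    # Every transition must point at a defined node.
    for name, node in nodes.items():
        for transition in node.get("transitions", []):
            if transition["target"] not in nodes:
                errors.append(f"node {name!r} has a transition to unknown node {transition['target']!r}")
    # Every node must be reachable from the entry node.
    seen, queue = {entry}, deque([entry])
    while queue:
        for transition in nodes[queue.popleft()].get("transitions", []):
            target = transition["target"]
            if target in nodes and target not in seen:
                seen.add(target)
                queue.append(target)
    errors.extend(f"node {name!r} is unreachable from entry" for name in nodes if name not in seen)
    return errors


ok = {"entry_node": "start", "nodes": {"start": {"transitions": [{"target": "end"}]}, "end": {}}}
assert validate_graph(ok) == []
```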

5.2 Playbook Executor

Spec: [[playbooks#6. Execution Model]]

# Task Depends On
5.2.1 Create PlaybookRun DB table (Alembic migration) per [[playbooks#6. Execution Model]] Run Persistence
5.2.2 Implement PlaybookRunner class: graph walker with conversation history per [[playbooks#6. Execution Model]] Context via Conversation History 5.2.1
5.2.3 Implement node execution: build prompt + context, invoke Supervisor.chat() with accumulated history per [[playbooks#6. Execution Model]] 5.2.2, 0.4.1
5.2.4 Implement transition evaluation: separate LLM call with condition list per [[playbooks#6. Execution Model]] Transition Evaluation 5.2.3
5.2.5 Implement structured transitions: function-call expressions evaluated without LLM per [[playbooks#6. Execution Model]] Transition Evaluation 5.2.4
5.2.6 Implement summarize_before node support (compress conversation history) per [[playbooks#6. Execution Model]] Context Size Management 5.2.3
5.2.7 Implement token budget tracking per run (fail gracefully on exceed) per [[playbooks#6. Execution Model]] Token Budget 5.2.3
5.2.8 Implement global daily playbook token cap (max_daily_playbook_tokens in config) per [[playbooks#6. Execution Model]] Token Budget 5.2.7
5.2.9 Implement PlaybookRun persistence: conversation history, node trace, status per [[playbooks#6. Execution Model]] Run Persistence 5.2.2
5.2.10 Implement run status transitions: running → completed/failed/paused/timed_out per [[playbooks#6. Execution Model]] 5.2.9
5.2.11 Implement per-playbook and per-node llm_config override support per [[playbooks#6. Execution Model]] Customizable Agent Configuration 5.2.3, 0.4.2
5.2.12 Implement playbook version pinning: in-flight runs continue with old version when recompiled per [[playbooks#19. Open Questions]] #3 5.2.9
5.2.13 Add tests: Playbook execution happy path per [[playbooks#6. Execution Model]]. Cases: (a) 3-node linear playbook (start → middle → end) executes all nodes in order and completes with status "completed", (b) each node receives accumulated conversation history from prior nodes, (c) each node's prompt is built with correct context (task data, memory tier content), (d) Supervisor.chat() is invoked once per node with correct parameters, (e) run duration and per-node token usage are recorded, (f) final PlaybookRun status is "completed" with correct node trace [start, middle, end], (g) playbook with single node (entry = terminal) executes and completes 5.2.3
5.2.14 Add tests: Branching transition evaluation per [[playbooks#6. Execution Model]] Transition Evaluation. Cases: (a) node with two conditional transitions — LLM evaluator picks the correct branch based on prior node output, (b) node with three branches — middle branch is selected when conditions match, (c) transition with "else/default" condition is selected when no other conditions match, (d) transition evaluation uses the cheaper model specified in playbook config (not the node's model), (e) transition evaluation prompt includes the condition list and conversation context, (f) ambiguous conditions (multiple could match) — first matching transition wins (ordered evaluation), (g) no matching transition and no default produces run failure with descriptive error 5.2.4
5.2.15 Add tests: Structured transition evaluation per [[playbooks#6. Execution Model]] Transition Evaluation. Cases: (a) structured expression task.status == "completed" evaluates to true/false without any LLM call (verify mock not invoked), (b) structured expression referencing node output field (e.g., output.approval == "yes") evaluates correctly, (c) invalid expression syntax produces clear error (not silent failure), (d) expression referencing undefined variable fails gracefully with descriptive error, (e) structured transitions are significantly faster than LLM-evaluated transitions (no network call), (f) mix of structured and LLM transitions on same node — structured is evaluated first, LLM only if structured doesn't match 5.2.5
5.2.16 Add tests: Token budget enforcement per [[playbooks#6. Execution Model]] Token Budget. Cases: (a) run that exceeds per-playbook token budget stops execution at current node with status "failed" and reason "token_budget_exceeded", (b) conversation history and node trace are preserved in the failed run record, (c) run approaching budget (within 10%) logs a warning but continues, (d) global daily token cap (max_daily_playbook_tokens) blocks new runs when exceeded, (e) daily cap resets at midnight (or configured time), (f) token counting includes both input and output tokens for each node, (g) run that would exceed budget on the FIRST node fails immediately (does not start) 5.2.7
5.2.17 Add tests: PlaybookRun persistence per [[playbooks#6. Execution Model]] Run Persistence. Cases: (a) completed run has DB record with status "completed", full node trace, and total token usage, (b) node trace contains ordered list of node IDs visited (e.g., ["start", "analyze", "report"]), (c) conversation history in DB matches the actual messages exchanged at each node, (d) failed run has status "failed" with error details and partial node trace up to failure point, (e) timed-out run has status "timed_out" with the node where timeout occurred, (f) run record includes playbook source version hash (for version tracking), (g) querying runs by playbook_id returns all runs sorted by start time, (h) run record includes start_time, end_time, and per-node durations 5.2.9
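
A compressed sketch of the graph walk in 5.2.2-5.2.5: one chat call per node with accumulated history, then transition evaluation picks the next node. ChatFn stands in for Supervisor.chat(); token budgets, summarization, and run persistence from 5.2.6-5.2.9 are omitted:

```python
from typing import Any, Callable

ChatFn = Callable[[list[dict[str, str]], dict[str, Any] | None], str]


def run_playbook(
    playbook: dict[str, Any],
    chat: ChatFn,
    evaluate: Callable[[str, str], bool],
) -> dict[str, Any]:
    """Walk the graph from entry_node, carrying the conversation history forward."""
    history: list[dict[str, str]] = []
    trace: list[str] = []
    node_name: str | None = playbook["entry_node"]
    while node_name is not None:
        node = playbook["nodes"][node_name]
        trace.append(node_name)
        history.append({"role": "user", "content": node["prompt"]})
        output = chat(history, node.get("llm_config"))  # per-node llm_config override (5.2.11)
        history.append({"role": "assistant", "content": output})
        node_name = None  # terminal unless a transition matches
        for transition in node.get("transitions", []):
            if evaluate(transition["condition"], output):  # structured or LLM evaluation (5.2.4/5.2.5)
                node_name = transition["target"]
                break
    return {"status": "completed", "trace": trace, "history": history}
```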

5.3 Event Integration

Spec: [[playbooks#7. Event System]], [[playbooks#8. Scoping]] — Scope Resolution

# Task Depends On
5.3.1 Implement PlaybookManager: loads all compiled playbooks, maintains trigger → playbook mapping 5.1.4
5.3.2 Subscribe PlaybookManager to EventBus with payload filtering per [[playbooks#10. Composability]] Event Payload Filtering 5.3.1, 0.1.2
5.3.3 Implement event-to-scope matching per [[playbooks#7. Event System]] Event-to-Scope Matching: events with project_id match project + system playbooks, events without match system only (sketched below this task list) 5.3.2
5.3.4 Implement cooldown tracking per playbook per [[playbooks#6. Execution Model]] Concurrency 5.3.1
5.3.5 Implement concurrency limits (max_concurrent_playbook_runs) per [[playbooks#6. Execution Model]] Concurrency 5.3.1
5.3.6 Emit playbook.run.completed and playbook.run.failed events per [[playbooks#7. Event System]] 5.2.10
5.3.7 Implement timer service per [[playbooks#7. Event System]] Timer Service: scan compiled playbooks for timer triggers, emit synthetic timer events, minimum 1m interval 5.3.1
5.3.8 Add tests: Event-to-playbook scope matching per [[playbooks#7. Event System]] Event-to-Scope Matching. Cases: (a) task.completed event with project_id="myapp" triggers project-scoped playbook for myapp AND system-scoped playbooks, (b) task.completed event with project_id="myapp" does NOT trigger project-scoped playbook for a different project, (c) event without project_id triggers only system-scoped playbooks, (d) agent-type-scoped playbook triggers only when event's agent matches that type, (e) multiple playbooks subscribed to same event type all trigger (not just first), (f) playbook with trigger event type that never fires does not interfere with other playbooks, (g) unrecognized event type does not cause errors in PlaybookManager 5.3.3
5.3.9 Add tests: Playbook cooldown per [[playbooks#6. Execution Model]] Concurrency. Cases: (a) playbook with 60s cooldown that just completed ignores trigger event within 60s, (b) same playbook triggers normally after cooldown expires, (c) cooldown is per-playbook — different playbooks with same trigger event are independent, (d) cooldown is tracked per scope (project-level cooldown does not block system-level for same playbook template), (e) playbook that fails still applies cooldown (prevents error loops), (f) cooldown of 0 means no cooldown (every event triggers), (g) concurrent events during cooldown are dropped (not queued) 5.3.4
5.3.10 Add tests: Timer service per [[playbooks#7. Event System]] Timer Service. Cases: (a) playbook with trigger: timer:30m receives synthetic timer event every 30 minutes, (b) timer interval is respected within reasonable tolerance (+/- 5 seconds), (c) minimum 1-minute interval is enforced — timer:30s is rejected or upgraded to 1m, (d) multiple timer-triggered playbooks with different intervals each fire at their own cadence, (e) timer continues firing after playbook run completes (recurring), (f) timer stops firing when playbook is removed/disabled, (g) system restart resumes timers from configuration (not from last fire time — fires immediately if overdue) 5.3.7
5.3.11 Add tests: Playbook composition via event chaining per [[playbooks#10. Composability]] Event Payload Filtering. Cases: (a) playbook A completes and emits playbook.run.completed with playbook_id="code-review" — playbook B subscribed with filter {"playbook_id": "code-review"} triggers, (b) playbook B does NOT trigger for playbook.run.completed from a different playbook_id, (c) 3-playbook chain: A → B → C each triggered by predecessor's completion event, (d) composition with payload data: playbook A's output is available in playbook B's trigger event payload, (e) circular composition (A triggers B triggers A) is prevented by cooldown or detected and blocked, (f) failed playbook emits playbook.run.failed — downstream playbooks subscribed to failure events trigger correctly, (g) composition across scopes: system playbook triggers project playbook via filtered event 5.3.6, 0.1.2
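A sketch of how the scope rule behind 5.3.3 could be applied before launching a run. The CompiledPlaybook fields and the scope vocabulary are assumptions drawn from the scoping spec, not the real compiled format; the cooldown and concurrency checks from 5.3.4 and 5.3.5 would sit on top of this.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class CompiledPlaybook:
    playbook_id: str
    scope: str                    # "system", "project", or "agent-type" (assumed vocabulary)
    scope_target: str | None      # project id or agent type when scoped
    trigger_event: str            # e.g. "task.completed"

def matches_scope(playbook: CompiledPlaybook, payload: dict[str, Any]) -> bool:
    """Events carrying a project_id match that project's playbooks plus system
    playbooks; events without a project_id match system playbooks only."""
    if playbook.scope == "system":
        return True
    if playbook.scope == "project":
        return payload.get("project_id") == playbook.scope_target
    if playbook.scope == "agent-type":
        return payload.get("agent_type") == playbook.scope_target
    return False

def playbooks_for_event(playbooks: list[CompiledPlaybook], event_type: str,
                        payload: dict[str, Any]) -> list[CompiledPlaybook]:
    """Every playbook whose trigger and scope both match the incoming event."""
    return [p for p in playbooks
            if p.trigger_event == event_type and matches_scope(p, payload)]
```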

5.4 Human-in-the-Loop

Spec: [[playbooks#9. Human-in-the-Loop]]

# Task Depends On
5.4.1 Implement wait_for_human node: persist run state to DB, pause execution per [[playbooks#9. Human-in-the-Loop]] Pause and Resume (pause/resume state sketched below this task list) 5.2.9
5.4.2 Implement notification for human review via [[messaging/discord|Discord]] / [[messaging/telegram|Telegram]]
5.4.3 Implement human.review.completed event handling: resume run from saved conversation state per [[playbooks#9. Human-in-the-Loop]] 5.4.1
5.4.4 Implement timeout for paused runs (configurable, default 24h, transition to timeout node or fail) per [[playbooks#9. Human-in-the-Loop]] Timeout 5.4.1
5.4.5 Implement resume_playbook command per [[playbooks#15. Playbook Commands]] 5.4.3
5.4.6 Add tests: Human-in-the-loop pause and resume per [[playbooks#9. Human-in-the-Loop]]. Cases: (a) playbook reaching wait_for_human node persists run state to DB and pauses with status "paused", (b) notification is sent via Discord/Telegram with context summary of what the playbook has done so far, (c) human.review.completed event resumes the run from the exact saved conversation state, (d) resumed run continues to the next node with human's input appended to conversation history, (e) human can provide structured input (approve/reject/feedback) that influences the transition, (f) resume_playbook command with run_id resumes the correct paused run, (g) multiple paused runs can coexist — resuming one does not affect others, (h) run state survives system restart (persisted to DB, not just in-memory) 5.4.3
5.4.7 Add tests: Paused playbook timeout per [[playbooks#9. Human-in-the-Loop]] Timeout. Cases: (a) paused run exceeding default 24h timeout transitions to "timed_out" status, (b) custom timeout (e.g., 1h) is respected — run times out after 1h, not 24h, (c) timed-out run transitions to timeout node if one is defined in the playbook graph, (d) timed-out run with no timeout node transitions to "failed" status, (e) resuming a timed-out run is rejected with clear error message, (f) timeout notification is sent to the same channel as the original pause notification, (g) timeout countdown resets if human provides partial input and re-pauses 5.4.4
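A sketch of the paused-run state that 5.4.1 persists and 5.4.3 resumes. Field names, the serialization shape, and the resume signature are placeholders rather than the actual PlaybookRun schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class PausedRunState:
    run_id: str
    playbook_id: str
    current_node: str
    conversation: list[dict]                   # messages exchanged so far
    paused_at: datetime
    timeout: timedelta = timedelta(hours=24)   # default per 5.4.4

    def to_row(self) -> dict:
        """Serialize for the DB so the run survives a restart (5.4.6 case h)."""
        return {
            "run_id": self.run_id,
            "playbook_id": self.playbook_id,
            "current_node": self.current_node,
            "conversation_json": json.dumps(self.conversation),
            "paused_at": self.paused_at.isoformat(),
            "timeout_seconds": int(self.timeout.total_seconds()),
            "status": "paused",
        }

    def is_timed_out(self, now: datetime | None = None) -> bool:
        now = now or datetime.now(timezone.utc)
        return now - self.paused_at > self.timeout

def resume(state: PausedRunState, human_input: dict) -> list[dict]:
    """On human.review.completed: reject timed-out runs, otherwise append the
    human's structured input and hand the conversation back to the executor."""
    if state.is_timed_out():
        raise RuntimeError(f"run {state.run_id} already timed out; resume rejected")
    return state.conversation + [{"role": "user", "content": json.dumps(human_input)}]
```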

5.5 Playbook Commands

Spec: [[playbooks#15. Playbook Commands]]

# Task Depends On
5.5.1 Implement compile_playbook command (manual compilation trigger) 5.1.2
5.5.2 Implement dry_run_playbook command (simulate with mock event, no side effects) per [[playbooks#19. Open Questions]] #2 5.2.2
5.5.3 Implement show_playbook_graph command (ASCII or mermaid output of compiled graph) 5.1.4
5.5.4 Implement list_playbooks command (all playbooks across scopes with status and last run) 5.3.1
5.5.5 Implement list_playbook_runs command (recent runs with status/path taken) 5.2.9
5.5.6 Implement inspect_playbook_run command (full node trace, conversation, token usage) 5.2.9
5.5.7 Register all playbook commands in [[specs/command-handler|CommandHandler]] via [[specs/tiered-tools]]
5.5.8 Add tests: Playbook commands per [[playbooks#15. Playbook Commands]]. Cases: (a) compile_playbook with valid markdown returns success and compiled JSON path, (b) compile_playbook with invalid markdown returns error with details, (c) dry_run_playbook simulates execution with mock event and returns node trace without side effects (no DB changes, no events emitted), (d) show_playbook_graph returns ASCII or mermaid representation with correct nodes and transitions, (e) list_playbooks returns all playbooks across scopes with status (active/error) and last run time, (f) list_playbook_runs returns recent runs with status and node path taken, (g) inspect_playbook_run returns full node trace, conversation history, and token usage for a specific run, (h) all commands return {"success": bool, ...} dict format per command handler convention, (i) commands with invalid arguments return helpful error messages 5.5.7

5.6 Default Playbooks & Migration

Spec: [[playbooks#12. Default Playbooks]], [[playbooks#13. Migration Path]]

# Task Depends On
5.6.1 Write default task-outcome.md playbook (consolidates post-action-reflection + spec-drift-detector + error-recovery-monitor) per [[playbooks#12. Default Playbooks]] 5.1.2
5.6.2 Write default system-health-check.md playbook (30m, replaces periodic-project-review) per [[playbooks#12. Default Playbooks]] 5.1.2
5.6.3 Write default codebase-inspector.md playbook (4h, replaces proactive-codebase-inspector rule) per [[playbooks#12. Default Playbooks]] 5.1.2
5.6.4 Write default dependency-audit.md playbook (24h, replaces dependency-update-check rule) per [[playbooks#12. Default Playbooks]] 5.1.2
5.6.5 Install default playbooks to vault on first run 5.6.1–5.6.4
5.6.6 Validate default playbooks produce equivalent results to current rules (run both, compare) per [[playbooks#13. Migration Path]] Phase 1 5.6.5, 5.3.1
5.6.7 Migrate plugin @cron() hooks to timer-triggered playbooks per [[playbooks#16. Plugin Integration]] 5.6.6
5.6.8 Remove default rules when playbook equivalents are validated per [[playbooks#13. Migration Path]] Phase 2 5.6.6

5.7 Observability

Spec: [[playbooks#14. Dashboard Visualization]], [[playbooks#19. Open Questions]] #4

# Task Depends On
5.7.1 Implement playbook health metrics: tokens per node, run duration, transition paths, failure rates per [[playbooks#19. Open Questions]] #4 5.2.9
5.7.2 Design dashboard playbook graph view (nodes as boxes, transitions as arrows, live state highlighting) per [[playbooks#14. Dashboard Visualization]] 5.7.1

Test checkpoint: Full system test with playbooks running alongside hooks. Trigger task.completed — verify both the old hook AND the new playbook fire. Compare outputs. Verify token budgets tracked correctly. Test a human-in-the-loop playbook end-to-end via Discord/Telegram. Verify timer-based playbooks fire on schedule. Test composition: playbook A completes → playbook B triggers via filtered event. This is the biggest validation point in the entire roadmap. Specific validations:
- Coexistence: task.completed fires both hook-engine rule and playbook — compare outputs for equivalence
- Compilation: all 4 default playbooks (task-outcome, system-health-check, codebase-inspector, dependency-audit) compile without errors
- Execution: task-outcome playbook runs 3-node graph to completion on task.completed
- Branching: task-outcome playbook takes different branch on success vs. failure tasks
- Token budget: set a 1000-token budget, run playbook — verify it stops gracefully on exceed
- Daily cap: set low daily cap, run multiple playbooks — verify cap enforced and resets
- Human-in-the-loop: run a playbook that pauses for review, send approval via Discord, verify resume
- Timeout: run same playbook with 5-second timeout, do NOT approve — verify timeout transition
- Timer: install system-health-check with 30m timer, verify it fires (use mock clock in test)
- Composition: task-outcome emits playbook.run.completed → downstream playbook triggers
- Cooldown: trigger same playbook twice in 10 seconds — second trigger is ignored
- Scope isolation: project playbook does not fire for events from a different project
- Version pinning: recompile playbook while a run is in-flight — in-flight uses old version
- Persistence: kill and restart system mid-run — paused runs resume, completed runs are in DB
- Commands: list_playbooks, list_playbook_runs, inspect_playbook_run all return correct data
- Migration validation: default playbooks produce equivalent results to the rules they replace


Phase 6: Self-Improvement Loop

Close the loop: agents learn from experience.

Source: [[self-improvement]]

6.1 Reflection Playbooks

Spec: [[self-improvement#2. The Loop]], [[memory-scoping#10. Reflection Playbook (Periodic Consolidation)]]

# Task Depends On
6.1.1 Write coding agent reflection playbook (vault/agent-types/coding/playbooks/reflection.md) per [[self-improvement#2. The Loop]] 5.1.2
6.1.2 Write generic agent reflection playbook template for other agent types 6.1.1
6.1.3 Implement reflection playbook trigger on task.completed for matching agent type per [[playbooks#8. Scoping]] agent-type scope 5.3.3, 6.1.1
6.1.4 Implement memory consolidation within reflection: merge duplicates, update outdated, promote cross-scope per [[memory-scoping#10. Reflection Playbook (Periodic Consolidation)]] (a merge sketch follows this task list) 6.1.3, 3.4.1
6.1.5 Verify reflection playbook reads task records, extracts patterns, writes insights to agent-type memory 6.1.3
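For the consolidation step in 6.1.4, a greedy duplicate-merge sketch. The >0.95 similarity threshold matches the Phase 6 checkpoint below; the Insight shape and the plain-Python cosine are stand-ins for the real memory records and vector store.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    memory_id: str
    text: str
    embedding: list[float]
    retrieval_count: int = 0

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def consolidate(insights: list[Insight], threshold: float = 0.95) -> list[Insight]:
    """Greedy duplicate merge: an insight within `threshold` cosine similarity of
    an already-kept insight is folded into it (retrieval counts summed) so the
    collection does not grow unboundedly."""
    kept: list[Insight] = []
    for candidate in insights:
        duplicate = next((k for k in kept
                          if cosine(k.embedding, candidate.embedding) >= threshold), None)
        if duplicate is not None:
            duplicate.retrieval_count += candidate.retrieval_count
        else:
            kept.append(candidate)
    return kept
```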

6.2 Log Analysis

Spec: [[self-improvement#2. The Loop]] — Log Analysis Playbook

# Task Depends On
6.2.1 Write log analysis playbook (vault/system/playbooks/log-analysis.md) per [[self-improvement#2. The Loop]] 5.1.2
6.2.2 Implement log access tools for playbook use (read recent logs, filter by severity/date; sketched below this task list) 6.2.1
6.2.3 Verify log analysis writes operational insights to orchestrator memory per [[self-improvement#5. Orchestrator Memory]] 6.2.1
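A sketch of the log access tool from 6.2.2, assuming plain-text log files whose lines start with an ISO timestamp and a severity token. The real log format, storage layout, and tool signature may differ.

```python
from datetime import datetime
from pathlib import Path

_LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]

def read_recent_logs(log_dir: Path, min_severity: str = "WARNING",
                     since: datetime | None = None, limit: int = 200) -> list[str]:
    """Return up to `limit` recent log lines at or above `min_severity`,
    optionally restricted to entries newer than `since`."""
    threshold = _LEVELS.index(min_severity)
    matched: list[str] = []
    # Newest files first, assuming date-ordered log file names.
    for log_file in sorted(log_dir.glob("*.log"), reverse=True):
        for line in reversed(log_file.read_text(errors="replace").splitlines()):
            parts = line.split(maxsplit=2)
            if len(parts) < 3 or parts[1] not in _LEVELS:
                continue                      # skip continuation or garbled lines
            if _LEVELS.index(parts[1]) < threshold:
                continue
            if since is not None:
                try:
                    if datetime.fromisoformat(parts[0]) < since:
                        continue
                except ValueError:
                    continue                  # unparseable timestamp, skip
            matched.append(line)
            if len(matched) >= limit:
                return matched
    return matched
```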

6.3 Reference Stub Indexer

Spec: [[vault#4. Reference Stubs for External Docs]]

# Task Depends On
6.3.1 Implement workspace spec/doc change detector (file watcher or git diff based) per [[vault#4. Reference Stubs for External Docs]] Generation 1.3.1
6.3.2 Implement stub generator: read full doc, LLM-summarize, write to vault/projects/{id}/references/ per [[vault#4. Reference Stubs for External Docs]] Stub Format 6.3.1
6.3.3 Implement source_hash tracking to avoid regenerating unchanged stubs (sketched below this task list) 6.3.2
6.3.4 Implement stale stub detection: flag stubs where source_hash no longer matches source file per [[vault#7. Open Questions]] #2 6.3.3
6.3.5 Add tests: Reference stub regeneration per [[vault#4. Reference Stubs for External Docs]]. Cases: (a) changing a spec file in workspace triggers stub regeneration in vault/projects/{id}/references/, (b) regenerated stub summary reflects the new content (not stale), (c) stub retains Obsidian-compatible frontmatter and wikilink format, (d) stub file name matches source file name (e.g., api-spec.md → api-spec.md stub), (e) multiple spec files changed simultaneously each get their own stub regenerated, (f) stub generation handles large spec files (>5000 tokens) by summarizing effectively, (g) git-based detection: only files changed since last indexed commit trigger regeneration 6.3.2
6.3.6 Add tests: Reference stub source_hash caching per [[vault#7. Open Questions]] #2. Cases: (a) unchanged spec file does NOT trigger stub regeneration (source_hash matches), (b) touching file without content change does NOT trigger regeneration, (c) source_hash is persisted across system restarts, (d) stale stub detection: manually editing source file without triggering watcher flags stub as potentially stale, (e) force-regenerate command bypasses hash check, (f) deleting source file flags corresponding stub as orphaned 6.3.3
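A sketch of the hash check behind 6.3.3 and 6.3.4. The spec only names source_hash; SHA-256 over the raw file bytes is an assumption, as are the function names.

```python
import hashlib
from pathlib import Path

def source_hash(path: Path) -> str:
    """Content hash of the workspace doc a stub was generated from (SHA-256 assumed)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def stub_status(source: Path, recorded_hash: str | None) -> str:
    """'current' skips regeneration (6.3.3); 'stale' and 'orphaned' feed the
    staleness detection in 6.3.4 and the orphan case in the tests above."""
    if not source.exists():
        return "orphaned"
    if recorded_hash is None or source_hash(source) != recorded_hash:
        return "stale"
    return "current"
```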

6.4 Orchestrator Memory

Spec: [[self-improvement#5. Orchestrator Memory]]

# Task Depends On
6.4.1 Implement startup scan of vault/projects/*/README.md → generate orchestrator summaries in vault/orchestrator/memory/project-{id}.md per [[self-improvement#5. Orchestrator Memory]] 1.1.1, 2.2.5
6.4.2 Wire README file watcher from Phase 1.3 to orchestrator re-summary per [[self-improvement#5. Orchestrator Memory]] On README change 1.3.5, 6.4.1
6.4.3 Add tests: Orchestrator memory from project READMEs per [[self-improvement#5. Orchestrator Memory]]. Cases: (a) creating a new vault/projects/myapp/README.md triggers generation of vault/orchestrator/memory/project-myapp.md summary, (b) summary captures key project details (tech stack, purpose, status) from the README, (c) editing README triggers summary update — new content reflected in updated summary, (d) startup scan processes all existing READMEs and creates/updates summaries for each, (e) project with no README does not cause errors (skipped gracefully), (f) deleting README flags orchestrator summary as potentially stale (or removes it), (g) summary is concise enough to fit in orchestrator's context alongside other project summaries 6.4.2

6.5 Memory Health

Spec: [[self-improvement#6. Memory Health & Observability]]

# Task Depends On
6.5.1 Implement memory audit trail in frontmatter (created, source_task, source_playbook, last_retrieved, retrieval_count) per [[self-improvement#6. Memory Health & Observability]] Memory Audit Trail 2.1.13
6.5.2 Implement memory_health command: collection sizes, growth rate, stale count, most-retrieved, retrieval hit rate, contradictions per [[self-improvement#6. Memory Health & Observability]] Memory Health View 6.5.1
6.5.3 Implement stale memory detection (not retrieved in N days) per [[self-improvement#6. Memory Health & Observability]] (sketched below this task list) 6.5.1
6.5.4 Implement contradiction detection: flag memories tagged #contested per [[self-improvement#7. Open Questions]] #2 6.5.1
6.5.5 Add stale memory flagging and contradiction surfacing to reflection playbook 6.5.3, 6.5.4, 6.1.1
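A sketch of the staleness rule in 6.5.3 against the audit-trail frontmatter from 6.5.1. The 30-day default is illustrative; the spec leaves N configurable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class MemoryAuditTrail:
    """Audit-trail frontmatter from 6.5.1; field names per the spec, types assumed."""
    created: datetime
    source_task: str | None
    source_playbook: str | None
    last_retrieved: datetime | None
    retrieval_count: int = 0

def is_stale(trail: MemoryAuditTrail, max_idle_days: int = 30,
             now: datetime | None = None) -> bool:
    """Stale = not retrieved in N days (6.5.3); never-retrieved memories age
    from their creation time instead."""
    now = now or datetime.now(timezone.utc)
    reference = trail.last_retrieved or trail.created
    return now - reference > timedelta(days=max_idle_days)
```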

Test checkpoint: Let the system run for a day with reflection playbooks active. Check: did insights get extracted from completed tasks? Did duplicate insights get merged? Are retrieval counts incrementing? Are stale memories flagged? Is the orchestrator's project understanding current? Are contradictions detected and surfaced? This validates the entire [[self-improvement|self-improvement loop]]. Specific validations:
- Reflection playbook triggers on task.completed for each agent type that has one configured
- After 5 completed tasks: at least 3 insights extracted and saved to agent-type memory collection
- Duplicate insights (>0.95 similarity) are merged — collection size does not grow unboundedly
- Retrieval counts increment each time an insight is returned by memory_search
- memory_health command shows accurate collection sizes, growth rates, and retrieval hit rates
- Stale memories (not retrieved in configurable N days) are flagged in health report
- Contradictions between two memories on same topic are detected and tagged #contested
- Reflection playbook surfaces stale/contradicted memories for review in its output
- Orchestrator summaries for all active projects are current (updated within last README change)
- Reference stubs are regenerated for changed workspace specs and stale stubs are flagged
- Log analysis playbook writes operational insights to orchestrator memory collection
- Memory consolidation in reflection merges cross-scope duplicates and promotes reusable patterns


Phase 7: Agent Coordination

Playbook-driven multi-agent workflows.

Source: [[agent-coordination]]

7.1 Workflow Infrastructure

Spec: [[agent-coordination#6. Workflow Runtime]]

# Task Depends On
7.1.1 Add workflow_id nullable field to tasks table (Alembic migration) per [[agent-coordination#6. Workflow Runtime]] Workflow State
7.1.2 Create workflows DB table (Alembic migration) with fields per [[agent-coordination#6. Workflow Runtime]] Workflow State
7.1.3 Implement Workflow CRUD queries (create, update status, add task, get by ID; allowed status transitions sketched below this task list) 7.1.2
7.1.4 Implement workflow.stage.completed event emission per [[agent-coordination#6. Workflow Runtime]] 7.1.3
7.1.5 Add tests: Workflow CRUD and lifecycle per [[agent-coordination#6. Workflow Runtime]] Workflow State. Cases: (a) create workflow returns valid workflow_id and initial status "pending", (b) associate tasks with workflow via workflow_id FK — tasks queryable by workflow, (c) workflow status transitions: pending → running → completed, pending → running → failed, (d) invalid status transitions (e.g., completed → running) are rejected, (e) workflow.stage.completed event emitted when a stage's tasks all complete, (f) workflow with no associated tasks can still be created and tracked, (g) deleting a workflow does not delete its associated tasks (tasks survive independently), (h) concurrent workflow creation does not produce ID collisions 7.1.3
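The status rules exercised by 7.1.5 cases (c) and (d), as a small sketch. Any states beyond pending, running, completed, and failed are out of scope here.

```python
# Allowed workflow status transitions (7.1.5 cases c and d); terminal states
# accept no further transitions.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "pending": {"running"},
    "running": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}

def validate_transition(current: str, new: str) -> None:
    """Reject invalid moves such as completed -> running."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid workflow status transition: {current} -> {new}")
```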

7.2 Coordination Commands

Spec: [[agent-coordination#5. How Coordination Playbooks Change the Scheduler]] — The Interface Between Them

# Task Depends On
7.2.1 Extend create_task command with agent_type, affinity_agent_id, workspace_mode parameters per [[agent-coordination#5. How Coordination Playbooks Change the Scheduler]] 7.1.1
7.2.2 Implement set_project_constraint command (exclusive access, max agents by type, pause scheduling) per [[agent-coordination#5. How Coordination Playbooks Change the Scheduler]]
7.2.3 Implement release_project_constraint command 7.2.2
7.2.4 Implement constraint enforcement in scheduler (check before assignment) per [[agent-coordination#5. How Coordination Playbooks Change the Scheduler]] What the Scheduler Owns (check sketched below this task list) 7.2.2
7.2.5 Add tests: Project constraint enforcement per [[agent-coordination#5. How Coordination Playbooks Change the Scheduler]]. Cases: (a) set_project_constraint with exclusive=true blocks scheduler from assigning any tasks to other agents on that project, (b) release_project_constraint lifts the block — scheduler resumes normal assignment, (c) constraint with max_agents={"coding": 2} allows up to 2 coding agents but blocks a third, (d) pause_scheduling=true constraint stops all task assignment for that project, (e) constraint on project A does not affect scheduling for project B, (f) attempting to set constraint on non-existent project returns clear error, (g) multiple constraints on same project stack correctly (e.g., exclusive + max_agents), (h) constraint persists across scheduler tick cycles until explicitly released 7.2.4
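A sketch of the pre-assignment check in 7.2.4. The constraint field names mirror the command parameters above (exclusive, max_agents, pause_scheduling), but the exact record shape is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class ProjectConstraint:
    """Record created by set_project_constraint; field names follow the command
    parameters, the exact schema is an assumption."""
    exclusive: bool = False
    pause_scheduling: bool = False
    max_agents: dict[str, int] = field(default_factory=dict)   # e.g. {"coding": 2}

def may_assign(constraint: ProjectConstraint | None, agent_type: str,
               active_agents_by_type: dict[str, int],
               agent_holds_exclusive: bool = False) -> bool:
    """Pre-assignment check (7.2.4). Constraints stack: pause, exclusivity,
    and per-type caps must all allow the assignment."""
    if constraint is None:
        return True
    if constraint.pause_scheduling:
        return False
    if constraint.exclusive and not agent_holds_exclusive:
        return False
    cap = constraint.max_agents.get(agent_type)
    if cap is not None and active_agents_by_type.get(agent_type, 0) >= cap:
        return False
    return True
```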

7.3 Agent Affinity

Spec: [[agent-coordination#3. Core Concepts]] Agent Affinity, [[agent-coordination#6. Workflow Runtime]] Agent Affinity Implementation

# Task Depends On
7.3.1 Add affinity_agent_id and affinity_reason fields to tasks table per [[agent-coordination#6. Workflow Runtime]] Agent Affinity Implementation 7.2.1
7.3.2 Implement scheduler affinity logic: prefer idle affinity agent, bounded wait up to N seconds, fallback per [[agent-coordination#6. Workflow Runtime]] Agent Affinity Implementation (sketched below this task list) 7.3.1
7.3.3 Implement agent type matching: task's agent_type field matched against agent's type during assignment per [[agent-coordination#3. Core Concepts]] Agent Affinity 7.3.1
7.3.4 Add tests: Agent affinity scheduling per [[agent-coordination#6. Workflow Runtime]] Agent Affinity Implementation. Cases: (a) task with affinity_agent_id="agent-1" is assigned to agent-1 when agent-1 is idle, (b) task with affinity to busy agent waits up to N seconds before falling back to another available agent, (c) fallback agent matches the task's agent_type requirement, (d) affinity with no wait time (N=0) immediately falls back if affinity agent is busy, (e) affinity agent that becomes idle within the wait window gets the task (not the fallback), (f) affinity_reason is logged for debugging (e.g., "original author of feature branch"), (g) task without affinity is assigned normally by the scheduler (no affinity preference) 7.3.2
7.3.5 Add tests: Agent type matching per [[agent-coordination#3. Core Concepts]] Agent Affinity. Cases: (a) task with agent_type="code-review" is NOT assigned to an agent with type "coding", (b) task with agent_type="coding" IS assigned to an available coding agent, (c) task with no agent_type is assigned to any available agent regardless of type, (d) task with agent_type that no agent matches stays queued (not assigned to wrong type), (e) agent with multiple type capabilities can match tasks of any of its types, (f) type mismatch rejection is logged with task_id, required_type, and agent_type for debugging 7.3.3
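A sketch of the affinity policy in 7.3.2: prefer the idle affinity agent, wait a bounded window, then fall back to a type-matching agent. The callables stand in for real scheduler queries, and a real scheduler would re-check on each tick instead of sleeping in a loop.

```python
import time
from typing import Callable

def assign_with_affinity(task: dict,
                         is_idle: Callable[[str], bool],
                         find_idle_agent: Callable[[str | None], str | None],
                         wait_seconds: float = 30.0,
                         poll_interval: float = 1.0) -> str | None:
    """Prefer the idle affinity agent, wait a bounded window for it to free up,
    then fall back to any idle agent matching the task's agent_type."""
    affinity = task.get("affinity_agent_id")
    agent_type = task.get("agent_type")          # None lets any agent type take it
    deadline = time.monotonic() + wait_seconds
    while affinity:
        if is_idle(affinity):
            return affinity                      # affinity agent won within the window
        if time.monotonic() >= deadline:
            break                                # bounded wait expired, fall back
        time.sleep(poll_interval)
    return find_idle_agent(agent_type)           # fallback still respects agent_type
```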

7.4 Workspace Modes

Spec: [[agent-coordination#7. Workspace Strategy]]

# Task Depends On
7.4.1 Add lock_mode field to workspace acquisition (default: exclusive) per [[agent-coordination#7. Workspace Strategy]] Lock Modes
7.4.2 Implement branch-isolated lock mode (multiple agents, same repo, different branches) per [[agent-coordination#7. Workspace Strategy]] 7.4.1
7.4.3 Implement git mutex for shared operations (fetch, gc) in branch-isolated mode per [[agent-coordination#7. Workspace Strategy]] (mutex sketched below this task list) 7.4.2
7.4.4 Add tests: Branch-isolated workspace mode per [[agent-coordination#7. Workspace Strategy]]. Cases: (a) two agents acquire workspace with lock_mode="branch-isolated" on same repo — both succeed, (b) each agent operates on a separate branch (no cross-branch interference), (c) shared git operations (fetch, gc) are serialized via mutex — concurrent fetches do not corrupt the repo, (d) agent A's commits on branch-A are not visible on agent B's branch-B, (e) branch-isolated lock is released when agent completes task, (f) three or more agents can work concurrently in branch-isolated mode, (g) branch-isolated mode with conflicting branches (same branch name) is rejected 7.4.2
7.4.5 Add tests: Exclusive workspace mode backward compatibility per [[agent-coordination#7. Workspace Strategy]] Lock Modes. Cases: (a) workspace acquired with lock_mode="exclusive" (default) blocks second agent from acquiring same workspace, (b) second agent's acquisition attempt waits or fails with clear error, (c) exclusive lock release allows next agent to acquire, (d) exclusive mode behavior is identical to pre-lock-mode behavior (backward compat), (e) workspace without explicit lock_mode defaults to exclusive, (f) mixing exclusive and branch-isolated on same repo is rejected (cannot downgrade from exclusive) 7.4.1
7.4.6 Stub directory-isolated mode as future placeholder per [[agent-coordination#7. Workspace Strategy]] (deferred)
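A sketch of the per-repository mutex from 7.4.3 that serializes shared git operations while branch-isolated agents otherwise work independently. The module-level lock table and helper name are assumptions.

```python
import subprocess
import threading
from collections import defaultdict
from pathlib import Path

# One lock per repository path: branch-isolated agents run branch-local git
# commands freely, but repo-wide operations (fetch, gc) are serialized so
# concurrent runs cannot corrupt the shared .git directory.
_repo_mutexes: dict[str, threading.Lock] = defaultdict(threading.Lock)

def run_shared_git_op(repo: Path, *args: str) -> subprocess.CompletedProcess:
    """Run a shared git operation (e.g. 'fetch', 'gc') under the repo's mutex."""
    with _repo_mutexes[str(repo.resolve())]:
        return subprocess.run(["git", "-C", str(repo), *args],
                              check=True, capture_output=True, text=True)

# e.g. run_shared_git_op(Path("/workspaces/myapp"), "fetch", "--all")
```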

7.5 Coordination Playbooks

Spec: [[agent-coordination#4. Coordination Playbook Examples]], [[agent-coordination#9. Default Coordination Playbooks]]

# Task Depends On
7.5.1 Write feature-pipeline.md default coordination playbook per [[agent-coordination#4. Coordination Playbook Examples]] Example 1 5.1.2, 7.1.3
7.5.2 Write bugfix-pipeline.md default coordination playbook per [[agent-coordination#9. Default Coordination Playbooks]] 5.1.2, 7.1.3
7.5.3 Write review-cycle.md default coordination playbook per [[agent-coordination#9. Default Coordination Playbooks]] 5.1.2, 7.1.3
7.5.4 Write exploration.md default coordination playbook per [[agent-coordination#4. Coordination Playbook Examples]] Example 2 5.1.2, 7.1.3
7.5.5 Implement long-running playbook support (event-triggered resumption across workflow stages) per [[agent-coordination#6. Workflow Runtime]] Workflow ↔ PlaybookRun Relationship 5.4.1, 7.1.4
7.5.6 Implement orphan workflow recovery: if coordination playbook crashes, tasks continue, playbook can be re-triggered per [[agent-coordination#11. Open Questions]] #2 7.5.5
7.5.7 Add tests: Feature pipeline coordination playbook per [[agent-coordination#4. Coordination Playbook Examples]] Example 1. Cases: (a) feature-pipeline playbook creates coding task first, then review + QA tasks after coding completes, (b) review and QA tasks have dependency on coding task (not scheduled until coding is done), (c) review + QA tasks can run concurrently (no dependency between them), (d) merge task depends on both review AND QA completing, (e) task chain has correct workflow_id linking all tasks, (f) coding task has agent_type="coding", review has agent_type="code-review", QA has agent_type="qa", (g) failure in coding task stops the pipeline (review + QA not created), (h) feature-pipeline fires on appropriate trigger event (e.g., task.created with type="feature") 7.5.1
7.5.8 Add tests: Review feedback cycle with agent affinity per [[agent-coordination#4. Coordination Playbook Examples]] Example 1. Cases: (a) reviewer marks code review as "changes_requested" — playbook creates a fix task, (b) fix task has affinity_agent_id set to the original coding agent (who wrote the code), (c) affinity_reason is "original author" or similar descriptive string, (d) if original agent is idle, fix task is assigned to them immediately, (e) if original agent is busy, fix task waits up to configured timeout then falls back, (f) fix task completion re-triggers review (loop back in playbook graph), (g) maximum review cycles are bounded (configurable, e.g., 3 rounds) to prevent infinite loops 7.5.1, 7.3.2
7.5.9 Add tests: Exploration coordination playbook per [[agent-coordination#4. Coordination Playbook Examples]] Example 2. Cases: (a) exploration playbook creates N parallel research tasks with no dependencies between them, (b) all parallel tasks are assigned to available agents concurrently (scheduler respects independence), (c) reviewer task is created only after ALL parallel tasks complete (depends on all), (d) reviewer task receives summaries/outputs from all parallel tasks as context, (e) partial failure (2 of 3 parallel tasks complete, 1 fails) — reviewer still triggers with available results plus failure note, (f) exploration with single parallel task degrades gracefully to sequential, (g) workflow status reflects "running" until reviewer completes, then "completed" 7.5.4
7.5.10 Add tests: Orphan workflow recovery per [[agent-coordination#11. Open Questions]] #2. Cases: (a) kill coordination playbook mid-workflow — in-flight tasks continue executing to completion, (b) tasks created before crash have correct dependencies and are scheduled normally, (c) re-triggering coordination playbook discovers existing workflow and resumes from current state, (d) resumed playbook does not re-create tasks that already exist, (e) workflow status shows "running" during orphan period (not "failed"), (f) orphan detection: system identifies workflows with no active playbook run and alerts operator, (g) manual resume_playbook can restart coordination from the last completed stage 7.5.6

7.6 Coordination Observability

Spec: [[agent-coordination#11. Open Questions]] #6

# Task Depends On
7.6.1 Design workflow pipeline view for dashboard (stages with tasks, agent assignments, progress) per [[agent-coordination#11. Open Questions]] #6 7.5.5

Test checkpoint: End-to-end coordination test: create a FEATURE task. Verify the feature-pipeline playbook triggers, creates a coding task with agent affinity, waits for PR via event, creates review + QA tasks (running concurrently via [[specs/scheduler-and-budget|scheduler]] dependency DAG), handles review feedback cycle with affinity back to original agent, and completes the workflow. Verify the scheduler respects agent type matching and the dependency graph drives concurrency per [[agent-coordination#5. How Coordination Playbooks Change the Scheduler]]. Specific validations:
- Create FEATURE task: feature-pipeline playbook triggers and creates coding task with agent_type="coding"
- Coding agent completes and pushes PR: git.pr.created event fires, playbook creates review + QA tasks
- Review task assigned to code-review agent, QA task assigned to qa agent — type matching enforced
- Review + QA run concurrently (scheduler assigns both when agents available, respects DAG independence)
- Reviewer requests changes: fix task created with affinity_agent_id = original coding agent
- Original coding agent receives fix task (was idle) — verify affinity scheduling
- Fix task completes, re-review passes, merge task created — full pipeline completes
- Workflow status transitions: pending → running → completed, all tasks have correct workflow_id
- Project constraint: set exclusive=true mid-workflow — no new tasks scheduled for project until released
- Branch-isolated mode: coding and QA agents work on same repo simultaneously on different branches
- Exclusive mode: attempt concurrent access to exclusive workspace — second agent blocked
- Exploration playbook: create 3 parallel research tasks, verify all complete before reviewer starts
- Orphan recovery: kill playbook mid-pipeline, verify tasks continue, re-trigger resumes from last stage
- Scheduler type mismatch: QA task is NOT assigned to coding agent even if coding agent is idle
- Affinity fallback: make affinity agent busy, verify fallback to another agent of same type after timeout
- End-to-end timing: full feature pipeline completes within reasonable time (measure bottlenecks)


Phase 8: Hook Engine Deprecation

Remove the old system once [[playbooks]] are validated.

Source: [[playbooks#13. Migration Path]] — Phase 3

# Task Depends On
8.1 Migrate all user-created active rules to playbooks (generate playbook markdown from rule files) per [[playbooks#13. Migration Path]] Phase 2 5.6.6
8.2 Migrate passive rules to vault memory files (in appropriate agent-type or project scope) per [[playbooks#13. Migration Path]] Passive Rules 3.1.1
8.3 Redirect hook commands to playbook equivalents (list_hooks → list_playbooks, etc.) per [[playbooks#13. Migration Path]] Phase 3 5.5.7
8.4 Remove HookEngine (hooks), RuleManager (rule-system), and related code 8.1, 8.2, 8.3
8.5 Remove hook/rule DB tables (Alembic migration) 8.4
8.6 Remove src/memory.py MemoryManager and v1 memory plugin per [[memory-plugin#2. Current Architecture (Being Replaced)]] 2.2.16
8.7 ~~Update all specs to remove "future evolution" callouts (they're now current)~~ ✅ 8.4
8.8 ~~Remove deprecated spec files (proactive-inspector.md already removed, verify no others)~~ ✅ 8.7

Final test checkpoint: Full regression test. Every feature that worked with hooks still works with playbooks. Memory operations all route through v2 plugin. No references to hook engine in active code paths. Run entire test suite. Specific validations:
- All existing hook-driven behaviors (post-action-reflection, spec-drift-detector, error-recovery-monitor) work via playbooks
- Side-by-side comparison: run same task with hooks and with playbooks — outputs are functionally equivalent
- Passive rules migrated to vault memory files are searchable via memory_search
- Hook commands (list_hooks, create_hook, etc.) redirect to playbook equivalents with deprecation notice
- No imports of HookEngine, RuleManager, or src/memory.py MemoryManager in any active code path
- Hook/rule DB tables are dropped in Alembic migration without data loss (migration verified both up and down)
- All memory operations route through v2 plugin — v1 plugin is unregistered and removed
- Full test suite passes: pytest tests/ with zero failures and no deprecation warnings from removed code
- Spec files updated: no "future evolution" callouts remain that reference now-implemented features
- No orphaned spec files (e.g., proactive-inspector.md) exist in the specs directory
- System boots cleanly from fresh install (no old paths, no migration needed, all defaults are playbook-based)


Summary

| Phase | Tasks | Status | Key Deliverable | Source Specs | Depends On |
|---|---|---|---|---|---|
| 0 | 23 | ✅ Complete | Prerequisite refactors | [[playbooks]] §17 | |
| 1 | 19 | ✅ Complete | Vault structure + file watcher | [[vault]] | |
| 2 | 38 | 🔵 Ready | memsearch fork + memory plugin v2 | [[memory-plugin]], [[memory-scoping]] §7 | ✅ Phase 1 |
| 3 | 27 | ⚪ Blocked | Memory scoping, tiers, overrides, dedup | [[memory-scoping]] | Phase 2 |
| 4 | 18 | ⚪ Blocked | Profiles as markdown | [[profiles]] | ✅ Phase 1, Phase 3 |
| 5 | 48 | 🔵 Ready | Playbook system | [[playbooks]] | ✅ Phase 0, ✅ Phase 1 |
| 6 | 18 | ⚪ Blocked | Self-improvement loop | [[self-improvement]], [[vault]] §4 | Phase 3, Phase 5 |
| 7 | 27 | ⚪ Blocked | Agent coordination | [[agent-coordination]] | Phase 5 |
| 8 | 8 | ⚪ Blocked | Hook engine deprecation | [[playbooks]] §13 | Phase 5, Phase 6 |
| Total | 226 | 42 done | | | |

Parallelism Opportunities

Phases 0 and 1 are complete. Phase 2 and Phase 5 can now run in parallel (both dependencies met). Phase 4 can start as soon as Phase 3.1 lands. Phase 6 requires both Phase 3 and Phase 5. Phase 7 requires Phase 5.

Phase 0 ✅ ───────────────────────► Phase 5 🔵 ──► Phase 7
                                        │              │
Phase 1 ✅ ──► Phase 2 🔵 ──► Phase 3 ──► Phase 6    Phase 8
                              Phase 4