HyperMesh — Phase 0 & 1 Engineering Plan

Wedge decision (2026-06-08): Lead with the temporal hypergraph database (developer/infra). AI and Virtual demote to clearly-labeled experimental extensions. This plan hardens the asset we actually ship and removes the claims the code can’t back.

Maturity today: core DB beta, everything around it alpha/PoC. Goal of Phases 0–1: make the DB trustworthy and adoptable.

Phase 0 — Stop the bleeding (≈3–5 days)

Cheap fixes with a high risk-reduction payoff. Nothing here requires recompiling the C engine beyond what the existing build already does.

P0.1 — Kill Cypher/DDL injection in the public API 🔴 security

Where: hypermeshdb/_api.py — f-string identifier interpolation at lines 539, 1121, 1140, and the duplicate router copies at 1383, 1588, 1612 (DROP HYPEREDGE TABLE {name}, CREATE INDEX ON {req.table} ({req.column}), DROP INDEX ON {table} ({col})). Pydantic validates types but not identifier safety.

Change:

Add a shared validator in _api.py (or a new _validators.py):

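import re
from fastapi import HTTPException

# Plain identifier: letter or underscore first, 64 characters at most.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")
def safe_ident(v: str, what: str = "identifier") -> str:
    # fullmatch, not match: with match(), a "$" anchor also accepts a trailing newline.
    if not _IDENT.fullmatch(v):
        raise HTTPException(422, f"invalid {what}: {v!r}")
    return v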

Wrap every interpolated name/table/column at the 6 sites with safe_ident(...).
Apply the same on the Pydantic models (CreateTableRequest, the index request models near _api.py:294-308) via a field_validator so it’s rejected at the edge too.

Why both: route-level guards the f-string; model-level gives a clean 422 with a field path.

Acceptance: POST /v1/tables {"name":"E; DELETE FROM api_keys; --"} → 422, no execution. Add a regression test tests/test_api_injection.py covering all 6 sites.
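The cases the regression test needs to cover can be sketched with a standalone, stdlib-only copy of the identifier rule, so it runs without the app. The names is_safe_ident, build_drop_statement, and the HOSTILE list are illustrative assumptions, not existing code; the real test would drive the six routes through the API and assert a 422.

```python
import re

# Standalone copy of the identifier rule (assumption: same rule as safe_ident).
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")


def is_safe_ident(value: str) -> bool:
    """Return True only if the whole string is a plain identifier."""
    return _IDENT.fullmatch(value) is not None


# Hypothetical hostile inputs; each should be rejected at every interpolation site.
HOSTILE = [
    "E; DELETE FROM api_keys; --",  # statement stacking (the acceptance case)
    "t) ; DROP INDEX ON x (y",      # breaks out of the column parentheses
    "name\n",                       # trailing newline, accepted by match() with "$"
    "",                             # empty string
    "1abc",                         # leading digit
    "a" * 65,                       # one character over the 64-character cap
]


def build_drop_statement(name: str) -> str:
    """Interpolate only after validation, mirroring the guarded f-string."""
    if not is_safe_ident(name):
        raise ValueError(f"invalid table name: {name!r}")
    return f"DROP HYPEREDGE TABLE {name}"
```

The trailing-newline case is worth keeping in the list because it is the one an anchored match() quietly lets through.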

Effort: 0.5 day. Risk: low (additive validation).

P0.2 — Auth on by default 🔴 security

Where: _api.py:3915-3938 — create_app(auth_disabled: bool = True, ...) and the HMDB_AUTH_DISABLED env merge.

Change:

Flip the default to auth_disabled: bool = False.
Invert the CLI/env story: keep an explicit opt-out (hmdb serve --no-auth / HMDB_AUTH_DISABLED=1) for local dev, but make serve secure by default.
Update the docstring at _api.py:22-23 and deploy/README.md to match.
In docker-compose*.yml, require HMDB_API_KEY (fail fast if unset when auth is on).
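A minimal sketch of the secure-by-default resolution and the fail-fast key check follows. The helper names (resolve_auth_disabled, require_api_key) and the set of accepted truthy strings are assumptions for illustration; only HMDB_AUTH_DISABLED, HMDB_API_KEY, and --no-auth come from this plan.

```python
import os
from typing import Mapping, Optional

# Assumption: which strings count as "on" for HMDB_AUTH_DISABLED.
_TRUTHY = {"1", "true", "yes", "on"}


def resolve_auth_disabled(cli_no_auth: bool = False,
                          env: Optional[Mapping[str, str]] = None) -> bool:
    """Auth stays enabled unless the caller explicitly opts out."""
    env = os.environ if env is None else env
    raw = env.get("HMDB_AUTH_DISABLED", "")
    return cli_no_auth or raw.strip().lower() in _TRUTHY


def require_api_key(auth_disabled: bool,
                    env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Fail fast at startup if auth is on but no key is configured."""
    env = os.environ if env is None else env
    if auth_disabled:
        return None
    key = env.get("HMDB_API_KEY", "").strip()
    if not key:
        raise SystemExit(
            "HMDB_API_KEY is required when auth is enabled; "
            "set it, or pass --no-auth for local development"
        )
    return key
```

An unset or unrecognized HMDB_AUTH_DISABLED value leaves auth on, so a typo in the variable cannot silently open the server.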

Acceptance: hmdb serve <dir> with no flags rejects unauthenticated /v1/* writes with 401. Existing auth tests (tests/test_phase_a_auth.py) updated for the new default.

Effort: 0.5 day. Risk: medium — breaks any caller relying on auth-off; gate behind a clear release note + the --no-auth escape hatch.

P0.3 — Backup path traversal 🟡 security (admin-only)

Where: _api.py:1168-1170 — os.makedirs(req.backup_dir) / os.path.join(...) with no canonicalization.

Change: Constrain backup_dir to a configured allow-root (e.g. HMDB_BACKUP_ROOT, default <db_dir>/backups). os.path.realpath the join and assert it stays under the root.

Acceptance: {"backup_dir":"/../../etc"} → 422. Effort: 0.25 day. Risk: low.
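The containment check described in the Change above could look like the following. This is a sketch under assumptions: resolve_backup_dir is a hypothetical helper, and the caller is expected to map ValueError to a 422.

```python
import os


def resolve_backup_dir(requested: str, backup_root: str) -> str:
    """Resolve a requested backup directory and require it to stay under the root."""
    root = os.path.realpath(backup_root)
    # os.path.join discards the root when `requested` is absolute; that is fine
    # here because the containment check below still rejects anything outside.
    candidate = os.path.realpath(os.path.join(root, requested))
    # commonpath compares whole path components, so "/data/backups-evil"
    # is not mistaken for a child of "/data/backups".
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError(f"backup_dir escapes the backup root: {requested!r}")
    return candidate
```

Because realpath resolves symlinks before the comparison, a symlink inside the root that points elsewhere is rejected as well.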

P0.4 — Deterministic startup + remove the legacy bridge 🟠 DX/reliability

Where: server/_core/index.ts — findAvailablePort() (191), preferredPort (360-364), startPythonBridge() (129, spawned at 374-377).

Problem: the server silently drifts 3000→3001 when 3000 is busy. This is today’s “services didn’t start” symptom: the user hits :3000 while the server is listening on :3001. The Python bridge on 8765 is dead code (tRPC proxies FastAPI, not the bridge) but is still spawned.

Change:

Make the port deterministic: if PORT (default 3000) is busy, fail loudly with a clear message (Port 3000 in use — set PORT or free it) instead of silently incrementing. Keep auto-increment only behind HYPERMESH_PORT_AUTOFIND=1 for convenience.
Default HYPERMESH_START_BRIDGE to off and log a deprecation line; schedule deletion of server/hypermesh_bridge.py once nothing references 8765 (grep confirms tRPC does not).
Add a one-line readiness log: Ready: app=:<port> api=:<API_PORT>.

Acceptance: Cold start with 3000 free → app on 3000, no bridge process, single readiness line. With 3000 busy → non-zero exit + actionable message. Effort: 0.5 day. Risk: low.

P0.5 — Reconcile the Mayo “genuine win” claim 🔴 integrity (not code)

Where: data/Mayo/.../GENUINE_WIN_FINDINGS.md + any JAMA-facing draft.

Problem: the headline HGNN “win” does not hold on the 5-seed confirmation (genuine_win_lowlabel_confirm.csv); XGBoost ties/beats on the real case-finding metrics. This must be reframed before anything external (esp. JAMA) ships.

Change: Rewrite the finding as the honest, publishable negative/structural result (“deterministic hyperedges make HGNNs redundant with tuned gradient boosting on this cohort”), and commit the Mayo scripts + result CSVs to git (currently on-disk only).

Acceptance: No “statistically significant win” language anywhere external; results reproducible from committed code. Effort: 1 day (writing). Risk: reputational if skipped.

Phase 1 — Make the asset trustworthy & adoptable (≈4–8 weeks)

Ordered by criticality. P1.1–P1.4 are “is it a real database.” P1.5–P1.7 are “can a stranger run it.”

P1.1 — Crash recovery & WAL integrity 🔴 durability (C engine)

Where: hypermesh_core/src/hypermesh.c — hm_open() (218), _compact_impl() (1092), WAL open path (hm_wal_open). Note: tmp-file cleanup already exists (hm_open:219-221), so the atomic-rename skeleton is present; the gaps are validation + fsync ordering.

Change:

WAL integrity on open: add a per-record CRC32 (or magic+length check) so a torn final record (partial write before crash) is detected and truncated to the last valid record, rather than silently mis-read. hm_open currently does no record validation.
fsync ordering in compaction: in _compact_impl, ensure: write *.tmp → fsync(tmp) → rename(tmp, final) → fsync(dir) → only then truncate the WAL. Audit that the WAL truncate cannot precede the durable rename (data-loss window).
Recovery test harness: tests/test_crash_recovery.py — fault-injection that kills the process at each step (mid-WAL-write, mid-compaction, post-rename/pre-truncate) and asserts the DB re-opens with exactly the committed set.
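The torn-record detection can be illustrated in a few lines of Python before it is written in C. The record layout here (little-endian length, CRC32 of the payload, then the payload) and the function names are assumptions for the sketch, not the engine's actual WAL format.

```python
import os
import struct
import zlib

# Assumed record header: payload length (u32), CRC32 of payload (u32).
_HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Frame one payload as header + payload."""
    return _HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def valid_prefix_length(wal: bytes) -> int:
    """Return the byte offset just past the last intact record."""
    offset = 0
    while offset + _HEADER.size <= len(wal):
        length, crc = _HEADER.unpack_from(wal, offset)
        start = offset + _HEADER.size
        end = start + length
        if end > len(wal):
            break  # torn: header made it to disk, payload did not
        if zlib.crc32(wal[start:end]) != crc:
            break  # payload is present but corrupt
        offset = end
    return offset


def truncate_torn_tail(path: str) -> int:
    """On open, cut the WAL back to its last valid record and return the new size."""
    with open(path, "r+b") as f:
        data = f.read()
        keep = valid_prefix_length(data)
        if keep < len(data):
            f.truncate(keep)
            f.flush()
            os.fsync(f.fileno())
    return keep
```

A production format would likely also need a magic number or a rule against zero-length records, since with this layout a zero-filled tail parses as a run of valid empty records.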

Acceptance: kill-at-any-step → reopen with no corruption, no lost committed records. Effort: 1.5–2 wks. Risk: high (C, durability) — gate behind the fault-injection suite.
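The compaction ordering from the Change list, written out step by step. This is a Python illustration of the ordering only; the real change lives in _compact_impl in C, and the function names and the single-blob snapshot are assumptions. The directory fsync relies on POSIX behavior.

```python
import os


def _fsync_dir(path: str) -> None:
    """Flush a directory so a rename inside it survives a crash (POSIX only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def compact(final_path: str, wal_path: str, snapshot: bytes) -> None:
    """Replace the data file with a compacted snapshot, then drop the WAL."""
    tmp_path = final_path + ".tmp"

    # 1. Write the snapshot to a temp file and make its contents durable.
    with open(tmp_path, "wb") as f:
        f.write(snapshot)
        f.flush()
        os.fsync(f.fileno())

    # 2. Atomically swap it into place.
    os.replace(tmp_path, final_path)

    # 3. Make the rename itself durable.
    _fsync_dir(os.path.dirname(os.path.abspath(final_path)))

    # 4. Only now truncate the WAL; before this point it is still the
    #    sole durable copy of the records being compacted.
    with open(wal_path, "r+b") as w:
        w.truncate(0)
        w.flush()
        os.fsync(w.fileno())
```

A crash before step 3 completes leaves the old file plus a full WAL, and a crash after it leaves the new file plus a WAL that replays redundantly, so neither window loses committed records provided replay is idempotent.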

P1.2 — Real transaction atomicity & isolation 🔴 correctness

Where: hypermeshdb/_connection.py — begin/commit/rollback (972-1051), the per-entry replay loop in commit (1006-1020), _tx_buffer (508-509).

Problem: “commit” replays buffered inserts one by one with best-effort tombstone rollback (1021-1033). A crash mid-commit leaves a partial transaction, and readers can see rows that are uncommitted and later reversed (no isolation). This is neither atomic nor isolated.

Change:

Atomic group-commit: add a C-level batch-WAL primitive that writes all N entries under a single WAL frame with one fdatasync, so the group is all-or-nothing on disk (replaces the Python per-entry loop). Expose hm_wal_append_batch(...).
Isolation: stamp each WAL frame with a commit sequence number; range/FMI queries read up to the last committed seq, so in-flight frames are invisible (snapshot read). Minimal MVCC.
Keep the Python transaction() context manager API stable; it now calls the batch primitive.
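To make the all-or-nothing property concrete, here is a sketch of a batch frame that carries a commit sequence number and a single CRC over the whole group. The frame layout and the names encode_batch_frame, replay, and visible_entries are assumptions for illustration; they are not the signature of hm_wal_append_batch.

```python
import struct
import zlib
from typing import List, Tuple

# Assumed frame header: commit seq (u64), entry count (u32), body length (u32).
_FRAME = struct.Struct("<QII")
_LEN = struct.Struct("<I")   # per-entry length prefix
_CRC = struct.Struct("<I")   # one CRC32 over header + body


def encode_batch_frame(seq: int, entries: List[bytes]) -> bytes:
    """Pack N entries into one frame so they commit or vanish together."""
    body = b"".join(_LEN.pack(len(e)) + e for e in entries)
    head = _FRAME.pack(seq, len(entries), len(body))
    return head + body + _CRC.pack(zlib.crc32(head + body))


def replay(wal: bytes) -> List[Tuple[int, List[bytes]]]:
    """Return (seq, entries) for each complete frame; stop at the first torn one."""
    frames, offset = [], 0
    while offset + _FRAME.size <= len(wal):
        seq, count, body_len = _FRAME.unpack_from(wal, offset)
        end = offset + _FRAME.size + body_len + _CRC.size
        if end > len(wal):
            break  # frame was cut short by a crash
        frame = wal[offset:end - _CRC.size]
        (crc,) = _CRC.unpack_from(wal, end - _CRC.size)
        if zlib.crc32(frame) != crc:
            break  # frame is corrupt
        entries, pos = [], _FRAME.size
        for _ in range(count):
            (n,) = _LEN.unpack_from(frame, pos)
            pos += _LEN.size
            entries.append(frame[pos:pos + n])
            pos += n
        frames.append((seq, entries))
        offset = end
    return frames


def visible_entries(frames: List[Tuple[int, List[bytes]]],
                    snapshot_seq: int) -> List[bytes]:
    """Snapshot read: only entries from frames committed at or before snapshot_seq."""
    return [e for seq, batch in frames if seq <= snapshot_seq for e in batch]
```

Because the CRC covers the whole frame, a crash at any byte of the write drops the entire group on replay rather than a prefix of it.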

Acceptance: concurrent reader never observes an uncommitted/rolled-back row; crash mid-commit leaves zero partial effect. Tests in tests/test_transactions.py. Effort: 1.5–2 wks. Risk: high — couples to P1.1; do them together.

P1.3 — Cursor pagination on read endpoints 🟠 scale

Where: _api.py query/hyperedge GETs (/v1/query, /v1/hyperedges), C hm_range_query.

Problem: results return in one shot → OOM on large sets (no limit/offset on range queries).

Change: opaque cursor = (last_bucket_id, last_offset) encoded base64. Add limit (default 1000, max 10k) + cursor params; C range query resumes from the cursor instead of scanning from zero. Response carries next_cursor (null at end).
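A possible encoding for the opaque cursor and the limit handling is sketched below. The 64-bit widths, the URL-safe base64 variant, the helper names, and the choice to clamp an oversized limit rather than reject it are all assumptions.

```python
import base64
import struct
from typing import Optional, Tuple

# Assumed cursor payload: last_bucket_id (u64), last_offset (u64).
_CURSOR = struct.Struct("<QQ")
DEFAULT_LIMIT = 1000
MAX_LIMIT = 10_000


def encode_cursor(last_bucket_id: int, last_offset: int) -> str:
    """Pack the resume position into an opaque, URL-safe token."""
    raw = _CURSOR.pack(last_bucket_id, last_offset)
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return (last_bucket_id, last_offset), or None to start from the beginning."""
    if cursor is None:
        return None
    try:
        raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
        return _CURSOR.unpack(raw)
    except (ValueError, struct.error) as exc:
        # Bad base64, non-ASCII input, or the wrong payload size.
        raise ValueError("invalid cursor") from exc


def clamp_limit(limit: Optional[int]) -> int:
    """Apply the default and cap the page size."""
    if limit is None:
        return DEFAULT_LIMIT
    return max(1, min(limit, MAX_LIMIT))
```

Clients should treat the token as opaque; keeping it a fixed-size struct means a malformed or truncated cursor fails at decode time instead of reaching the C range query.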

Acceptance: 1M-row table paginates in bounded memory; cursor round-trips. Effort: 1 wk. Risk: medium.

P1.4 — Query timeouts & cancellation 🟠 reliability

Where: _api.py query execution; C query loop.

Change: server-side deadline (HMDB_QUERY_TIMEOUT_MS, default 30000, i.e. 30 s). The C scan loop checks a deadline/cancel flag every K rows and either returns a partial result with timed_out=true or raises. This prevents a single pathological full scan from blocking the write lock indefinitely.
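The check-every-K-rows pattern is shown here in Python for clarity; the actual loop is in C. CHECK_EVERY, the function names, and the partial-result return shape are assumptions, and the clock is injectable only so the behavior can be tested without sleeping.

```python
import os
import time
from typing import Callable, Iterable, List, Mapping, Optional, Tuple

CHECK_EVERY = 1024  # K: rows scanned between deadline checks (assumed value)


def timeout_seconds(env: Optional[Mapping[str, str]] = None) -> float:
    """Read HMDB_QUERY_TIMEOUT_MS, defaulting to 30000 ms."""
    env = os.environ if env is None else env
    return int(env.get("HMDB_QUERY_TIMEOUT_MS", "30000")) / 1000.0


def scan_with_deadline(rows: Iterable[int],
                       predicate: Callable[[int], bool],
                       timeout_s: float,
                       clock: Callable[[], float] = time.monotonic
                       ) -> Tuple[List[int], bool]:
    """Scan rows, returning (matches, timed_out)."""
    deadline = clock() + timeout_s
    matches: List[int] = []
    for i, row in enumerate(rows):
        # Reading the clock on every row would slow the hot loop,
        # so only look at it once per CHECK_EVERY rows.
        if i % CHECK_EVERY == 0 and clock() >= deadline:
            return matches, True
        if predicate(row):
            matches.append(row)
    return matches, False
```

The worst-case overshoot past the deadline is the time taken to scan K rows, which is the trade-off to tune when picking K.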

Acceptance: a deliberately huge full-scan aborts at the deadline with 504. Effort: 0.5–1 wk. Risk: medium (touches the C hot loop).

P1.5 — Collapse 3 processes → 1 clean boundary 🟠 architecture

Where: server/_core/index.ts (spawns FastAPI + bridge), server/hypermesh_bridge.py.

Change: Node keeps only the web/tRPC + auth role and talks to one FastAPI over HTTP (HYPERMESH_API_URL). Delete the bridge (after P0.4 deprecation). Add a supervised health-check: if FastAPI is unreachable, tRPC returns a clean 503 with a “backend starting” state instead of opaque 502s. Document the two-process topology (web + db-api) as the supported deployment.

Acceptance: one node + one python process; bridge gone; FastAPI-down → graceful 503. Effort: 1 wk. Risk: medium.

P1.6 — SDK to 1.0 (the front door) 🟢 adoption

Where: sdk-ts/src/ — strongest existing surface (typed, zero-dep).

Change: full type coverage for connectors/analytics/virtual (currently partial); retry + backoff with idempotency awareness; semver 1.0 + CHANGELOG; quickstart README (“install → create table → insert → query in 10 lines”); publish pipeline.

Acceptance: 1.0 published, npm i quickstart runs end-to-end against a local server. Effort: 1 wk. Risk: low.

P1.7 — CI/CD + automated backups 🟠 ops

Where: new .github/workflows/ci.yml; docker-compose*.yml; _cli.py backup (539-619).

Change:

CI: pytest (incl. new injection/crash/txn suites), tsc, vitest, docker build, a dependency/secret scan. Block merge on red.
Backups: a sidecar/cron that runs hmdb backup on a schedule to a configured target (S3/GCS), with restore-verification in CI (backup → restore → integrity check).

Acceptance: PRs run CI green-gated; nightly backup artifact produced and test-restored. Effort: 1 wk. Risk: low.

Sequencing / dependencies

Phase 0 ships first, independently (days).
P1.1 + P1.2 are joined (durability + atomic commit share the WAL changes) — do as one ~3–4 wk workstream, behind the fault-injection suite.
P1.3/P1.4 parallelizable after P1.1 lands (they touch the C query path; coordinate merges).
P1.5/P1.6/P1.7 are independent and can run in parallel by a second person.

Explicit non-goals for Phases 0–1

No Snowflake/Postgres Virtual backend; Virtual stays DuckDB + “experimental.”
No new AI surface; SNN/RAG stay as-is, labeled experimental.
No new Mayo/JAMA ML claims beyond the honest reframe in P0.5.