HyperMesh — Phase 0 & 1 Engineering Plan
Wedge decision (2026-06-08): Lead with the temporal hypergraph database (developer/infra). AI and Virtual are demoted to clearly labeled experimental extensions. This plan hardens the asset we actually ship and removes the claims the code can’t back.
Maturity today: core DB beta, everything around it alpha/PoC. Goal of Phases 0–1: make the DB trustworthy and adoptable.
Phase 0 — Stop the bleeding (≈3–5 days)
Cheap fixes with high risk reduction. Nothing here needs the C engine recompiled beyond the existing build.
P0.1 — Kill Cypher/DDL injection in the public API 🔴 security
Where: hypermeshdb/_api.py, f-string identifier interpolation at lines
539, 1121, 1140, and the duplicate router copies at 1383, 1588, 1612
(DROP HYPEREDGE TABLE {name}, CREATE INDEX ON {req.table} ({req.column}),
DROP INDEX ON {table} ({col})). Pydantic validates types but not identifier safety.
Change:
- Add a shared validator in _api.py (or a new _validators.py):

      import re
      from fastapi import HTTPException

      _IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,63}$")

      def safe_ident(v: str, what: str = "identifier") -> str:
          # fullmatch, not match: with match(), "$" also accepts a trailing newline
          if not _IDENT.fullmatch(v):
              raise HTTPException(422, f"invalid {what}: {v!r}")
          return v

- Wrap every interpolated name/table/column at the 6 sites with safe_ident(...).
- Apply the same on the Pydantic models (CreateTableRequest, the index request models near _api.py:294-308) via a field_validator so it’s rejected at the edge too.
Why both: route-level guards the f-string; model-level gives a clean 422 with a field path.
Acceptance: POST /v1/tables {"name":"E; DELETE FROM api_keys; --"} → 422, no execution.
Add a regression test tests/test_api_injection.py covering all 6 sites.
Effort: 0.5 day. Risk: low (additive validation).
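The regression test for the six sites could be table-driven. The following is a minimal sketch of the input side only, assuming the same character rules as the plan's pattern; `is_safe_ident` and `HOSTILE_IDENTS` are hypothetical names, and the payload list is illustrative rather than exhaustive. It uses only the standard library so it runs without FastAPI.

```python
import re

# Same character rules as the plan's pattern; fullmatch() is used so that a
# trailing newline is rejected as well.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")


def is_safe_ident(value: str) -> bool:
    """Return True only for a bare identifier of 1 to 64 characters."""
    return _IDENT.fullmatch(value) is not None


# Illustrative hostile inputs a regression test could post to each site.
HOSTILE_IDENTS = [
    "E; DELETE FROM api_keys; --",  # statement chaining (the acceptance case)
    "name) (other",                 # breaks out of the column parentheses
    "users\n",                      # trailing newline
    "",                             # empty string
    "a" * 65,                       # one character over the 64-character cap
    "1table",                       # leading digit
]

rejected = [v for v in HOSTILE_IDENTS if not is_safe_ident(v)]
accepted = [v for v in ["events", "_tmp", "Edge_2024"] if is_safe_ident(v)]
```

In tests/test_api_injection.py, each hostile value would be sent to every affected route and the test would assert a 422 response and an unchanged database.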
P0.2 — Auth on by default 🔴 security
Where: _api.py:3915-3938, create_app(auth_disabled: bool = True, ...) and the
HMDB_AUTH_DISABLED env merge.
Change:
- Flip the default to auth_disabled: bool = False.
- Invert the CLI/env story: keep an explicit opt-out (hmdb serve --no-auth / HMDB_AUTH_DISABLED=1) for local dev, but secure-by-default for serve.
- Update the docstring at _api.py:22-23 and deploy/README.md to match.
- In docker-compose*.yml, require HMDB_API_KEY (fail fast if unset when auth is on).
Acceptance: hmdb serve <dir> with no flags rejects unauthenticated /v1/* writes with 401.
Existing auth tests (tests/test_phase_a_auth.py) updated for the new default.
Effort: 0.5 day. Risk: medium — breaks any caller relying on auth-off; gate behind a
clear release note + the --no-auth escape hatch.
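The opt-out and fail-fast rules can be expressed as two small pure functions. This is a sketch under assumptions, not the existing env merge in create_app: `resolve_auth_disabled` and `require_api_key` are hypothetical helpers, and accepting "true"/"yes"/"on" in addition to "1" is an assumption (the plan only names HMDB_AUTH_DISABLED=1).

```python
from typing import Mapping, Optional

# Assumption: values other than "1" that also count as an explicit opt-out.
_TRUTHY = {"1", "true", "yes", "on"}


def resolve_auth_disabled(cli_no_auth: bool, env: Mapping[str, str]) -> bool:
    """Auth stays on unless the caller explicitly opts out via flag or env."""
    if cli_no_auth:
        return True
    return env.get("HMDB_AUTH_DISABLED", "").strip().lower() in _TRUTHY


def require_api_key(auth_disabled: bool, env: Mapping[str, str]) -> Optional[str]:
    """Fail fast at startup when auth is on but no key is configured."""
    if auth_disabled:
        return None
    key = env.get("HMDB_API_KEY", "").strip()
    if not key:
        raise RuntimeError("HMDB_API_KEY must be set when auth is enabled")
    return key
```

At startup these would be called with os.environ. Taking the environment as a parameter keeps the default-on behaviour easy to cover in tests/test_phase_a_auth.py without patching the process environment.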
P0.3 — Backup path traversal 🟡 security (admin-only)
Where: _api.py:1168-1170, os.makedirs(req.backup_dir) / os.path.join(...) with no
canonicalization.
Change: Constrain backup_dir to a configured allow-root (e.g. HMDB_BACKUP_ROOT,
default <db_dir>/backups). os.path.realpath the join and assert it stays under the root.
Acceptance: {"backup_dir":"/../../etc"} → 422. Effort: 0.25 day. Risk: low.
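The containment check can be sketched with the standard library alone. `resolve_backup_dir` is a hypothetical helper; it raises ValueError, which the route would translate into the 422 named in the acceptance criterion.

```python
import os


def resolve_backup_dir(backup_dir: str, root: str) -> str:
    """Resolve backup_dir under root, rejecting anything that escapes it."""
    root_real = os.path.realpath(root)
    # An absolute backup_dir makes os.path.join discard root_real entirely,
    # so the containment check below is what actually enforces the rule.
    candidate = os.path.realpath(os.path.join(root_real, backup_dir))
    # commonpath compares whole path components, so "/data/backups-evil"
    # is not treated as being inside "/data/backups".
    if os.path.commonpath([root_real, candidate]) != root_real:
        raise ValueError(f"backup_dir escapes backup root: {backup_dir!r}")
    return candidate
```

Because realpath also resolves symlinks, a link inside the root that points outside it is rejected too. A symlink created after the check would still be a race, which seems acceptable for an admin-only endpoint but is worth stating in the threat model.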
P0.4 — Deterministic startup + remove the legacy bridge 🟠 DX/reliability
Where: server/_core/index.ts, findAvailablePort() (191), preferredPort
(360-364), startPythonBridge() (129, spawned at 374-377).
Problem: server silently drifts 3000→3001 when 3000 is busy (today’s “services didn’t start” symptom — the human hits :3000, server is on :3001). The Python bridge on 8765 is dead code (tRPC proxies FastAPI, not the bridge) but still spawned.
Change:
- Make the port deterministic: if PORT (default 3000) is busy, fail loudly with a clear message (Port 3000 in use: set PORT or free it) instead of silently incrementing. Keep auto-increment only behind HYPERMESH_PORT_AUTOFIND=1 for convenience.
- Default HYPERMESH_START_BRIDGE to off and log a deprecation line; schedule deletion of server/hypermesh_bridge.py once nothing references 8765 (grep confirms tRPC does not).
- Add a one-line readiness log: Ready: app=:<port> api=:<API_PORT>.
Acceptance: Cold start with 3000 free → app on 3000, no bridge process, single readiness line. With 3000 busy → non-zero exit + actionable message. Effort: 0.5 day. Risk: low.
P0.5 — Reconcile the Mayo “genuine win” claim 🔴 integrity (not code)
Where: data/Mayo/.../GENUINE_WIN_FINDINGS.md + any JAMA-facing draft.
Problem: the headline HGNN “win” does not hold on the 5-seed confirmation
(genuine_win_lowlabel_confirm.csv); XGBoost ties/beats on the real case-finding metrics.
This must be reframed before anything external (esp. JAMA) ships.
Change: Rewrite the finding as the honest, publishable negative/structural result (“deterministic hyperedges make HGNNs redundant with tuned gradient boosting on this cohort”), and commit the Mayo scripts + result CSVs to git (currently on-disk only).
Acceptance: No “statistically significant win” language anywhere external; results reproducible from committed code. Effort: 1 day (writing). Risk: reputational if skipped.
Phase 1 — Make the asset trustworthy & adoptable (≈4–8 weeks)
Ordered by criticality. P1.1–P1.4 are “is it a real database.” P1.5–P1.7 are “can a stranger run it.”
P1.1 — Crash recovery & WAL integrity 🔴 durability (C engine)
Where: hypermesh_core/src/hypermesh.c, hm_open() (218), _compact_impl() (1092),
WAL open path (hm_wal_open). Note: tmp-file cleanup already exists (hm_open:219-221), so the
atomic-rename skeleton is present; the gaps are validation + fsync ordering.
Change:
- WAL integrity on open: add a per-record CRC32 (or magic+length check) so a torn final record (partial write before crash) is detected and truncated to the last valid record, rather than silently mis-read. hm_open currently does no record validation.
- fsync ordering in compaction: in _compact_impl, ensure: write *.tmp → fsync(tmp) → rename(tmp, final) → fsync(dir) → only then truncate the WAL. Audit that the WAL truncate cannot precede the durable rename (data-loss window).
- Recovery test harness: tests/test_crash_recovery.py, a fault-injection suite that kills the process at each step (mid-WAL-write, mid-compaction, post-rename/pre-truncate) and asserts the DB re-opens with exactly the committed set.
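Torn-tail detection can be illustrated independently of the C engine. The sketch below assumes a hypothetical record layout (4-byte little-endian payload length, 4-byte CRC32 of the payload, then the payload); the real WAL format is not described in this plan, and `encode_record` and `scan_valid_prefix` are illustrative names.

```python
import struct
import zlib
from typing import List, Tuple

# Hypothetical header: payload length, then CRC32 of the payload.
_HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Frame one payload with its length and checksum."""
    return _HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def scan_valid_prefix(wal: bytes) -> Tuple[List[bytes], int]:
    """Return (records, valid_length); bytes past valid_length are torn or corrupt."""
    records: List[bytes] = []
    pos = 0
    while pos + _HEADER.size <= len(wal):
        length, crc = _HEADER.unpack_from(wal, pos)
        start = pos + _HEADER.size
        end = start + length
        if end > len(wal):
            break  # declared length runs past end of file: partial write
        payload = wal[start:end]
        if zlib.crc32(payload) != crc:
            break  # checksum mismatch: stop at the last valid record
        records.append(payload)
        pos = end
    return records, pos
```

On open, the file would be truncated to valid_length before any new append. One caveat of this exact layout: the CRC32 of an empty payload is 0, so a zero-filled tail would parse as a run of empty records, which is a reason to add the magic value or a sequence field the plan mentions.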
Acceptance: kill-at-any-step → reopen with no corruption, no lost committed records. Effort: 1.5–2 wks. Risk: high (C, durability) — gate behind the fault-injection suite.
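The compaction ordering above can be written out step by step. This is a POSIX-only Python sketch of the sequence, not the C code in _compact_impl; `durable_replace` and `compact` are hypothetical names, and opening a directory with os.open does not work on Windows.

```python
import os


def durable_replace(final_path: str, data: bytes) -> None:
    """Write data to final_path so that a crash leaves either the old or the new file."""
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())              # 1. tmp contents are on disk
    os.replace(tmp_path, final_path)      # 2. atomic rename over the final name
    dir_fd = os.open(os.path.dirname(os.path.abspath(final_path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)                  # 3. the rename itself is on disk
    finally:
        os.close(dir_fd)


def compact(final_path: str, wal_path: str, snapshot: bytes) -> None:
    """Publish the compacted snapshot, then (and only then) drop the WAL."""
    durable_replace(final_path, snapshot)
    with open(wal_path, "r+b") as wal:    # 4. truncate after the durable rename
        wal.truncate(0)
        wal.flush()
        os.fsync(wal.fileno())
```

A crash between steps 3 and 4 leaves both the new snapshot and the old WAL on disk, so replay has to be idempotent or skip entries the snapshot already contains. That is the post-rename/pre-truncate case the fault-injection harness targets.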
P1.2 — Real transaction atomicity & isolation 🔴 correctness
Where: hypermeshdb/_connection.py, begin/commit/rollback (972-1051), the
per-entry replay loop in commit (1006-1020), _tx_buffer (508-509).
Problem: “commit” replays buffered inserts one-by-one with best-effort tombstone rollback
(1021-1033). A crash mid-commit leaves a partial transaction; readers see uncommitted-then-
reversed rows (no isolation). This is not atomic or isolated.
Change:
- Atomic group-commit: add a C-level batch-WAL primitive that writes all N entries under a single WAL frame with one fdatasync, so the group is all-or-nothing on disk (replaces the Python per-entry loop). Expose hm_wal_append_batch(...).
- Isolation: stamp each WAL frame with a commit sequence number; range/FMI queries read up to the last committed seq, so in-flight frames are invisible (snapshot read). Minimal MVCC.
- Keep the Python transaction() context manager API stable; it now calls the batch primitive.
Acceptance: concurrent reader never observes an uncommitted/rolled-back row; crash mid-commit
leaves zero partial effect. Tests in tests/test_transactions.py. Effort: 1.5–2 wks.
Risk: high — couples to P1.1; do them together.
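The visibility rule behind the snapshot read fits in a few lines. The class below is an in-memory stand-in for the WAL, assuming a single writer that publishes frames in order; `SeqLog` and its methods are hypothetical and say nothing about how hm_wal_append_batch would be implemented.

```python
from typing import List, Tuple


class SeqLog:
    """Frames are appended first and become visible only once published."""

    def __init__(self) -> None:
        self._frames: List[Tuple[int, List[str]]] = []  # (commit_seq, rows)
        self._last_committed = 0

    def append_frame(self, rows: List[str]) -> int:
        """Write all rows as one frame; it stays invisible to readers for now."""
        seq = len(self._frames) + 1
        self._frames.append((seq, list(rows)))
        return seq

    def publish(self, seq: int) -> None:
        """Mark the frame committed; every row in it becomes visible together."""
        if seq != self._last_committed + 1:
            raise ValueError("frames must be published in order")
        self._last_committed = seq

    def snapshot(self) -> int:
        """Sequence number a reader pins at the start of its query."""
        return self._last_committed

    def read(self, snapshot_seq: int) -> List[str]:
        """Return rows from frames committed at or before snapshot_seq."""
        return [row for seq, rows in self._frames if seq <= snapshot_seq for row in rows]
```

A reader that pinned its snapshot before a publish keeps seeing the older state for the rest of its query, and a frame that is never published (crash or rollback) is never observed at all.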
P1.3 — Cursor pagination on read endpoints 🟠 scale
Where: _api.py query/hyperedge GETs (/v1/query, /v1/hyperedges), C hm_range_query.
Problem: results return in one shot → OOM on large sets (no limit/offset on range queries).
Change: opaque cursor = (last_bucket_id, last_offset) encoded base64. Add limit (default
1000, max 10k) + cursor params; C range query resumes from the cursor instead of scanning from
zero. Response carries next_cursor (null at end).
Acceptance: 1M-row table paginates in bounded memory; cursor round-trips. Effort: 1 wk. Risk: medium.
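A sketch of the cursor encoding and limit clamping follows. The JSON-inside-base64 payload, the (0, 0) start position for a missing cursor, and the function names are assumptions; the plan only fixes the tuple contents, the default of 1000, and the 10k cap.

```python
import base64
import json
from typing import Optional, Tuple

DEFAULT_LIMIT = 1000
MAX_LIMIT = 10_000


def encode_cursor(last_bucket_id: int, last_offset: int) -> str:
    """Pack the resume position into an opaque URL-safe token."""
    raw = json.dumps([last_bucket_id, last_offset]).encode("ascii")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: Optional[str]) -> Tuple[int, int]:
    """Return (bucket_id, offset); a missing cursor means start from the beginning."""
    if cursor is None:
        return (0, 0)
    try:
        bucket, offset = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed cursor") from exc
    # Cursors come from clients, so validate before handing them to the C layer.
    if type(bucket) is not int or type(offset) is not int or bucket < 0 or offset < 0:
        raise ValueError("malformed cursor")
    return (bucket, offset)


def clamp_limit(limit: Optional[int]) -> int:
    """Apply the default and keep the page size within 1..MAX_LIMIT."""
    if limit is None:
        return DEFAULT_LIMIT
    return max(1, min(limit, MAX_LIMIT))
```

Encoding only hides the structure from clients; it does not authenticate it. If a forged position could expose rows the caller should not see, signing the token would be a natural extension.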
P1.4 — Query timeouts & cancellation 🟠 reliability
Where: _api.py query execution; C query loop.
Change: server-side deadline (HMDB_QUERY_TIMEOUT_MS, default 30000, i.e. 30 s). C scan loop checks a
deadline/cancel flag every K rows and either returns a partial result with timed_out=true or raises. Prevents
a single pathological full scan from blocking the write lock indefinitely.
Acceptance: a deliberately huge full-scan aborts at the deadline with 504. Effort: 0.5–1 wk. Risk: medium (touches the C hot loop).
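The periodic deadline check can be illustrated in Python, although the real loop is in C. `scan_with_deadline` and `CHECK_EVERY` are hypothetical; the clock is injectable so the behaviour can be tested without sleeping.

```python
import time
from typing import Callable, Iterable, List, Tuple

CHECK_EVERY = 1024  # K: rows scanned between deadline checks (illustrative value)


def scan_with_deadline(
    rows: Iterable[int],
    predicate: Callable[[int], bool],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[int], bool]:
    """Return (matched_rows, timed_out), checking the clock once every K rows."""
    deadline = clock() + timeout_s
    matched: List[int] = []
    for i, row in enumerate(rows, start=1):
        # Reading the clock on every row would dominate a tight scan loop,
        # so the check is amortised over CHECK_EVERY rows.
        if i % CHECK_EVERY == 0 and clock() >= deadline:
            return matched, True
        if predicate(row):
            matched.append(row)
    return matched, False
```

The API layer would then decide whether timed_out maps to a 504 or to a partial response. With this scheme the scan can overrun the deadline by at most the time it takes to process K rows.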
P1.5 — Collapse 3 processes → 1 clean boundary 🟠 architecture
Where: server/_core/index.ts (spawns FastAPI + bridge), server/hypermesh_bridge.py.
Change: Node keeps only the web/tRPC + auth role and talks to one FastAPI over HTTP
(HYPERMESH_API_URL). Delete the bridge (after P0.4 deprecation). Add a supervised health-check:
if FastAPI is unreachable, tRPC returns a clean 503 with a “backend starting” state instead of
opaque 502s. Document the two-process topology (web + db-api) as the supported deployment.
Acceptance: one node + one python process; bridge gone; FastAPI-down → graceful 503.
Effort: 1 wk. Risk: medium.
P1.6 — SDK to 1.0 (the front door) 🟢 adoption
Where: sdk-ts/src/, the strongest existing surface (typed, zero-dep).
Change: full type coverage for connectors/analytics/virtual (currently partial); retry + backoff with idempotency awareness; semver 1.0 + CHANGELOG; quickstart README (“install → create table → insert → query in 10 lines”); publish pipeline.
Acceptance: 1.0 published, npm i quickstart runs end-to-end against a local server.
Effort: 1 wk. Risk: low.
P1.7 — CI/CD + automated backups 🟠 ops
Where: new .github/workflows/ci.yml; docker-compose*.yml; _cli.py backup
(539-619).
Change:
- CI: pytest (incl. new injection/crash/txn suites), tsc, vitest, docker build, a dependency/secret scan. Block merge on red.
- Backups: a sidecar/cron that runs hmdb backup on a schedule to a configured target (S3/GCS), with restore-verification in CI (backup → restore → integrity check).
Acceptance: PRs run CI green-gated; nightly backup artifact produced and test-restored. Effort: 1 wk. Risk: low.
Sequencing / dependencies
- Phase 0 ships first, independently (days).
- P1.1 + P1.2 are joined (durability + atomic commit share the WAL changes) — do as one ~3–4 wk workstream, behind the fault-injection suite.
- P1.3/P1.4 parallelizable after P1.1 lands (they touch the C query path; coordinate merges).
- P1.5/P1.6/P1.7 are independent and can run in parallel by a second person.
Explicit non-goals for Phases 0–1
- No Snowflake/Postgres Virtual backend; Virtual stays DuckDB + “experimental.”
- No new AI surface; SNN/RAG stay as-is, labeled experimental.
- No new Mayo/JAMA ML claims beyond the honest reframe in P0.5.