Data Lakehouse — Goal 1¶
The MESA Lakehouse is the canonical record of metadata history for iRODS
collections in MESA-enabled projects. It is built on the DuckLake lakehouse
pattern: a Postgres catalog plus Parquet data files stored at
<project_root>/.mesa/ducklake/ inside the iRODS project itself, so the
metadata history travels with the data.
Stack¶
| Layer | Technology |
|---|---|
| Query engine | DuckDB |
| Lakehouse format | DuckLake on Apache Iceberg + Parquet |
| Catalog | PostgreSQL (one catalog per project) |
| Storage | Parquet under .mesa/ducklake/ in iRODS |
| Mirror | TACC Corral, OSN Pod |
What it stores¶
For every AVU (attribute / value / unit) change made through a MESA MCP server, the Lakehouse records a versioned snapshot:
| Field | Description |
|---|---|
project_id |
iRODS project identifier |
irods_path |
Path of the collection or data object |
attribute, value, unit |
The AVU triple |
actor |
User or service principal that made the change |
change_type |
add / update / delete |
note |
Optional human-readable annotation |
snapshot_id |
DuckLake snapshot — supports time-travel |
recorded_at |
UTC timestamp |
This means you can answer questions like:
- "What did this collection's metadata look like on 2026-09-01?" —
via
get_avus_as_of. - "Which user added the ENVO:biome tag to this dataset?" — via
mesa_ducklake_search_changes. - "How has my project's metadata evolved over the last quarter?" — via Lakehouse SQL queries.
Public API¶
The Lakehouse is consumed through the Python library
mesa-ducklake, which exposes a single
entry point — DuckLakeClient. All other modules in mesa_ducklake are
internal.
from mesa_ducklake import DuckLakeClient, AvuChange
with DuckLakeClient(postgres_dsn=..., irods_session=...) as client:
project = client.register_project(
irods_path="/iplant/home/alice/myproj",
actor="alice",
zone="iplant",
)
snap = client.record_changes(
project_id=project.project_id,
actor="alice",
changes=[AvuChange(...)],
note="Tagged file.csv with ENVO biome",
)
avus = client.get_avus(project.project_id, irods_path=...)
Performance targets¶
| Metric | Target |
|---|---|
| Sub-second analytical queries | On benchmark dataset, leveraged compute |
| Catalog availability | ≥ 95% business hours during the prototype |
| Daily backup | pg_dump snapshotted to iRODS automatically |
Backup and recovery¶
The Postgres catalog is backed up daily by a pg_dump job that writes the
dump into the project's iRODS collection (.mesa/ducklake/backups/). The
Parquet files live in iRODS already, so the full catalog can be recovered
from iRODS alone — no separate backup system is required.
See the mesa-ducklake operator guide for
recovery procedures.