Data Lakehouse — Goal 1¶

The MESA Lakehouse is the canonical record of metadata history for iRODS collections in MESA-enabled projects. It is built on the DuckLake lakehouse pattern: a Postgres catalog plus Parquet data files stored at <project_root>/.mesa/ducklake/ inside the iRODS project itself, so the metadata history travels with the data.

Stack¶

Layer	Technology
Query engine	DuckDB
Lakehouse format	DuckLake on Apache Iceberg + Parquet
Catalog	PostgreSQL (one catalog per project)
Storage	Parquet under `.mesa/ducklake/` in iRODS
Mirror	TACC Corral, OSN Pod

What it stores¶

For every AVU (attribute / value / unit) change made through a MESA MCP server, the Lakehouse records a versioned snapshot:

Field	Description
`project_id`	iRODS project identifier
`irods_path`	Path of the collection or data object
`attribute`, `value`, `unit`	The AVU triple
`actor`	User or service principal that made the change
`change_type`	add / update / delete
`note`	Optional human-readable annotation
`snapshot_id`	DuckLake snapshot — supports time-travel
`recorded_at`	UTC timestamp

This means you can answer questions like:

"What did this collection's metadata look like on 2026-09-01?" — via get_avus_as_of.
"Which user added the ENVO:biome tag to this dataset?" — via mesa_ducklake_search_changes.
"How has my project's metadata evolved over the last quarter?" — via Lakehouse SQL queries.

Public API¶

The Lakehouse is consumed through the Python library mesa-ducklake, which exposes a single entry point — DuckLakeClient. All other modules in mesa_ducklake are internal.

from mesa_ducklake import DuckLakeClient, AvuChange

with DuckLakeClient(postgres_dsn=..., irods_session=...) as client:
    project = client.register_project(
        irods_path="/iplant/home/alice/myproj",
        actor="alice",
        zone="iplant",
    )
    snap = client.record_changes(
        project_id=project.project_id,
        actor="alice",
        changes=[AvuChange(...)],
        note="Tagged file.csv with ENVO biome",
    )
    avus = client.get_avus(project.project_id, irods_path=...)

Performance targets¶

Metric	Target
Sub-second analytical queries	On benchmark dataset, leveraged compute
Catalog availability	≥ 95% business hours during the prototype
Daily backup	`pg_dump` snapshotted to iRODS automatically

Backup and recovery¶

The Postgres catalog is backed up daily by a pg_dump job that writes the dump into the project's iRODS collection (.mesa/ducklake/backups/). The Parquet files live in iRODS already, so the full catalog can be recovered from iRODS alone — no separate backup system is required.

See the mesa-ducklake operator guide for recovery procedures.