Skip to content

Data Lakehouse — Goal 1

The MESA Lakehouse is the canonical record of metadata history for iRODS collections in MESA-enabled projects. It is built on the DuckLake lakehouse pattern: a Postgres catalog plus Parquet data files stored at <project_root>/.mesa/ducklake/ inside the iRODS project itself, so the metadata history travels with the data.

Stack

Layer Technology
Query engine DuckDB
Lakehouse format DuckLake on Apache Iceberg + Parquet
Catalog PostgreSQL (one catalog per project)
Storage Parquet under .mesa/ducklake/ in iRODS
Mirror TACC Corral, OSN Pod

What it stores

For every AVU (attribute / value / unit) change made through a MESA MCP server, the Lakehouse records a versioned snapshot:

Field Description
project_id iRODS project identifier
irods_path Path of the collection or data object
attribute, value, unit The AVU triple
actor User or service principal that made the change
change_type add / update / delete
note Optional human-readable annotation
snapshot_id DuckLake snapshot — supports time-travel
recorded_at UTC timestamp

This means you can answer questions like:

  • "What did this collection's metadata look like on 2026-09-01?" — via get_avus_as_of.
  • "Which user added the ENVO:biome tag to this dataset?" — via mesa_ducklake_search_changes.
  • "How has my project's metadata evolved over the last quarter?" — via Lakehouse SQL queries.

Public API

The Lakehouse is consumed through the Python library mesa-ducklake, which exposes a single entry point — DuckLakeClient. All other modules in mesa_ducklake are internal.

from mesa_ducklake import DuckLakeClient, AvuChange

with DuckLakeClient(postgres_dsn=..., irods_session=...) as client:
    project = client.register_project(
        irods_path="/iplant/home/alice/myproj",
        actor="alice",
        zone="iplant",
    )
    snap = client.record_changes(
        project_id=project.project_id,
        actor="alice",
        changes=[AvuChange(...)],
        note="Tagged file.csv with ENVO biome",
    )
    avus = client.get_avus(project.project_id, irods_path=...)

Performance targets

Metric Target
Sub-second analytical queries On benchmark dataset, leveraged compute
Catalog availability ≥ 95% business hours during the prototype
Daily backup pg_dump snapshotted to iRODS automatically

Backup and recovery

The Postgres catalog is backed up daily by a pg_dump job that writes the dump into the project's iRODS collection (.mesa/ducklake/backups/). The Parquet files live in iRODS already, so the full catalog can be recovered from iRODS alone — no separate backup system is required.

See the mesa-ducklake operator guide for recovery procedures.