
Building a Universal Lakehouse Catalog - Beyond Iceberg Tables

Published: at 09:00 AM

I'll be recording an episode on this topic for my podcast, so please subscribe to the podcast so you don't miss it (it's also on iTunes and other directories).

Apache Iceberg has done something few projects manage to pull off: it created a standard. Its table format and REST-based catalog interface made it possible for different engines to read, write, and govern the same data without breaking consistency. That's a big deal. For the first time, organizations could mix and match engines while keeping one clean, transactional view of their data.

But this success brings new expectations.

As lakehouse adoption grows, teams want more than just Iceberg tables under one roof. They want to treat all of their datasets (raw Parquet files, streaming logs, external APIs, even other formats like Delta and Hudi) with the same consistency and governance. The problem? Today's Iceberg catalogs don't support that. They're built for Iceberg tables only.

So how do we move beyond that? How do we build a universal lakehouse catalog that works across engines and across formats?

Let’s explore two possible paths and what’s still missing.

Iceberg’s Success: A Case Study in Standardization

To understand where catalogs could go next, it helps to look at what made Iceberg successful in the first place.

Before Iceberg, working with data lakes was messy. You could store files in open formats like Parquet or ORC, but there was no clean way to manage schema changes, version history, or transactional consistency. Each engine had to implement its own logic, or worse, teams had to build brittle pipelines to fill in the gaps.

Iceberg changed that. It introduced:

  1. An open table format spec that brought schema evolution, version history, and transactional consistency to files in the lake.
  2. An open REST catalog interface spec that gave every engine a standard way to discover, read, and commit to those tables.

These two specs, the table format and the REST catalog interface, created a plug-and-play model. Spark, Flink, Trino, Dremio, and others could all speak the same language. As a result, Iceberg became the neutral zone. No vendor lock-in, no hidden contracts.
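
To make the plug-and-play idea concrete, here is a minimal sketch using PyIceberg as one REST-capable client. The endpoint, credential, warehouse, and table names are placeholders; any other engine pointed at the same URI would see the same catalog.

```python
# Minimal sketch: any engine that implements the Iceberg REST Catalog spec can
# point at the same catalog endpoint. Here PyIceberg acts as one such client.
# The URI, credential, warehouse, and table names are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",  # placeholder endpoint
        "credential": "client-id:client-secret",           # placeholder credential
        "warehouse": "analytics",
    },
)

# Spark, Flink, Trino, Dremio, or any other REST-capable engine configured with
# the same URI sees the same namespaces and tables.
table = catalog.load_table("sales.orders")
print(table.schema())
```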

But that neutrality came with a scope: the Iceberg REST Catalog only tracks and governs Iceberg tables. If your dataset isn't an Iceberg table, there is no modern, open, interoperable standard for governing and accessing it. And that's where the limitation begins.

The Problem: No Standards Beyond Iceberg

While Iceberg catalogs are tightly defined for Iceberg tables, some catalogs do allow you to register other types of datasets: raw Parquet, Delta tables, external views, or even API-based data sources.

But there’s a catch.

Each catalog handles this differently. One might use a custom registration API, another might expose a metadata file format, and yet another might treat external sources as virtual tables with limited capabilities. The result is a patchwork of behavior.

This makes interoperability fragile. What works in one engine may not work in another, even if they both support the same table format. Teams are left stitching together workarounds or writing custom integrations just to get basic access across systems.

So what's really missing here? A standard API for non-Iceberg datasets: a common way to register external data, manage its lifecycle, and govern access to it across engines.
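
To make the gap concrete, here is a purely hypothetical sketch of what a standard registration call for a non-Iceberg dataset might look like. The route, payload fields, and dataset properties are invented for illustration; nothing like this exists in the Iceberg REST spec today.

```python
# Hypothetical illustration only: no such endpoint exists in the Iceberg REST
# spec today. It sketches what a standard "register an external dataset" call
# might need to cover: identity, format, location, and governance properties.
import requests

payload = {
    "name": "raw_clickstream",
    "namespace": ["bronze"],
    "format": "parquet",                             # could be delta, hudi, csv, ...
    "location": "s3://example-bucket/clickstream/",  # placeholder location
    "properties": {"owner": "data-platform", "classification": "internal"},
}

resp = requests.post(
    "https://catalog.example.com/api/v1/namespaces/bronze/external-datasets",  # invented route
    json=payload,
    headers={"Authorization": "Bearer <token>"},     # placeholder token
)
resp.raise_for_status()
print(resp.json())
```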

The big question is: where should this standard come from, and what should it look like?

Where Should the Standard Come From?

This brings us to the real crossroads: if we need a standard API for universal lakehouse catalogs, where should it come from?

There are a few possibilities, but no matter how you look at it, there are really only two main paths forward:

  1. An implementation becomes the de facto standard.
    One catalog (open source or commercial) builds enough momentum that its API becomes the standard, similar to how S3 became the API for object storage.

  2. A neutral API spec is created independently.
    This would follow the Iceberg model, where the spec came first, then vendors and engines built around it.

If history teaches us anything, it's that vendor-driven standards can create long-term friction. S3 is a good example: it's ubiquitous, but it's also tightly bound to a single provider's roadmap, leaving everyone who supports the API playing a whack-a-mole catch-up game with an interface they have no control over. That experience shaped how the industry approached table formats: this time, the community came together around Iceberg to avoid that kind of lock-in and vendor catch-up.

So whatever path we take toward universal cataloging, the smart money is on a community standard. The only question is whether that standard comes from an existing implementation, or from a new, vendor-neutral spec that everyone agrees to follow.

Exploring the Implementation-First Path: Apache Polaris and Table Sources

If the path to a universal catalog starts with an implementation, Apache Polaris (incubating) is worth watching closely. Among the open catalog projects, Polaris stands out for two reasons:

  1. It’s built as an open implementation of the Apache Iceberg REST Catalog spec.
  2. It’s actively proposing new features to extend catalog support beyond Iceberg tables.

While Polaris already supports Iceberg tables through the standard REST interface, it’s exploring how to bring non-Iceberg datasets into the same catalog. This includes both structured file-based datasets like Parquet or JSON, and unstructured data like images, PDFs, or videos.

Right now, Polaris includes a feature called Generic Tables, but a more robust proposal called Table Sources is under active discussion.

What Are Table Sources?

Discussion of this proposal on the Dev List

Table Sources are a proposed abstraction that lets Polaris register and govern external data that isn’t already an Iceberg table. Instead of forcing everything into the Iceberg format, Polaris acts as a bridge: mapping object storage locations to queryable tables using metadata services that live outside the catalog itself.

Each Table Source registers the information needed to map an external location to a queryable table: where the data lives, what kind of source it is, and which external service is responsible for producing its metadata.

For example, a Table Source might point at a prefix of Parquet files in object storage, with a Source Service configured to scan that location and publish table metadata for it.

Polaris doesn't scan or interpret these datasets directly. Instead, Source Services (external processes) use the registered configurations to scan file systems, generate table metadata, and push it back to Polaris. This decouples the engine from the source and the catalog from the scanning logic.
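
Because the Table Sources proposal is still under discussion, there is no settled API yet. The sketch below imagines a minimal Source Service: it lists Parquet files under a registered location and pushes simple metadata back to the catalog over an invented endpoint. The route and payload shapes are assumptions, not the actual Polaris API.

```python
# Hypothetical Source Service: the Table Sources API is still a proposal, so the
# catalog endpoint and payload shape below are invented for illustration.
import boto3
import requests

s3 = boto3.client("s3")

def scan_source(bucket: str, prefix: str) -> dict:
    """List data files under a registered location and build minimal table metadata."""
    files = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                files.append({"path": f"s3://{bucket}/{obj['Key']}", "size-bytes": obj["Size"]})
    return {"format": "parquet", "files": files}

# Push the generated metadata back to Polaris (invented route and payload).
metadata = scan_source("example-bucket", "clickstream/")
requests.post(
    "https://polaris.example.com/api/v1/table-sources/raw_clickstream/metadata",
    json=metadata,
    headers={"Authorization": "Bearer <token>"},  # placeholder token
).raise_for_status()
```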

At query time, engines can interact with these registered tables using the same APIs as they would for Iceberg, even though the backing data may not follow Iceberg’s spec.

Why This Matters

If adopted, the Table Source feature could give Polaris a head start as the reference implementation for a broader catalog API. It defines a reusable contract for registering external data, managing its lifecycle, and governing access, all in a way that’s decoupled from specific engines or formats.

But this also raises the bigger question: will other catalogs follow this model? Will engines adopt the same contract for recognizing external data? Or will each system continue to define its own rules?

That tension, between an evolving implementation like Polaris and the desire for an extension to the REST Catalog API standard, sets the stage for what comes next in the catalog story.

The API-First Path: Extending the Iceberg REST Catalog Spec

Now let’s explore the other side of the equation: what if instead of extending a specific implementation, we expanded the Iceberg REST Catalog specification itself?

This approach would focus on defining a neutral contract that any catalog (Polaris, Unity, Glue, or others) could implement to support more than just Iceberg tables. Rather than focusing on what a specific system can do today, it asks: what could a future REST catalog look like if it supported universal datasets by design?

One of the most interesting signs of this potential is already in the spec: the Scan Planning Endpoint.

What Is Scan Planning?

In the typical read path:

  1. The engine requests a table from the catalog.
  2. The catalog responds with the metadata location.
  3. The engine reads the metadata files (manifests, snapshots, etc.) and plans which Parquet files to scan, as sketched below.
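
In code, step 3 is what a client library such as PyIceberg does today: it loads the table and plans the file scan on the engine side. This is a minimal sketch; the catalog configuration, table name, and filter column are placeholders.

```python
# Today's client-side read path: the client (here PyIceberg) loads the table,
# walks manifests and snapshots itself, and decides which Parquet files to scan.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

catalog = load_catalog("lakehouse")          # assumes catalog config is already set up
table = catalog.load_table("sales.orders")   # placeholder table name

scan = table.scan(row_filter=GreaterThanOrEqual("order_date", "2024-01-01"))
for task in scan.plan_files():               # metadata traversal happens on the client
    print(task.file.file_path)
```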

But with the Scan Planning Endpoint, the flow changes:

  1. The engine calls the endpoint directly.
  2. The catalog does the heavy lifting: it traverses the metadata, evaluates filters, and returns a list of data files to scan.

This makes the engine's job much simpler, provided both the catalog and the engine support the endpoint. The engine no longer needs to understand Iceberg's metadata structure. It just gets files to read.
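
Here is a rough sketch of what that looks like from the engine's point of view, using plain HTTP against the spec's scan planning route. The endpoint prefix, token, filter serialization, and response fields are simplified, so treat the exact request and response shapes as assumptions and check the spec for the authoritative definitions.

```python
# Sketch of the catalog-side read path using the REST scan planning route.
# The prefix, token, filter serialization, and response fields are simplified;
# check the Iceberg REST spec for the exact request and response shapes.
import requests

BASE = "https://catalog.example.com/api/catalog/v1/prod"   # placeholder prefix
HEADERS = {"Authorization": "Bearer <token>"}              # placeholder token

# Ask the catalog to plan the scan: it traverses metadata and evaluates filters.
resp = requests.post(
    f"{BASE}/namespaces/sales/tables/orders/plan",
    json={"filter": {"type": "gt-eq", "term": "order_date", "value": "2024-01-01"}},
    headers=HEADERS,
)
resp.raise_for_status()
result = resp.json()

# The engine just reads whatever data files the catalog hands back.
for task in result.get("file-scan-tasks", []):
    print(task)
```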

Why This Matters for Universal Catalogs

By pushing scan planning into the catalog, the spec opens the door to something bigger: if the catalog is the one deciding which files an engine should read, then in principle it could plan scans over datasets that aren't Iceberg tables at all, so long as it can hand back a list of files.

This is a fundamental shift: the catalog becomes the query planner for metadata, not just a metadata store.

But here’s the big catch: this currently only exists on the read side.

There’s no equivalent in the spec today for the write path.

A Hypothetical Write-Side Extension

Imagine this: instead of asking the engine to write metadata files (as is required today), the engine submits a write payload to the catalog, essentially a description of the data files it just wrote and how they should be committed.

The catalog could then take ownership of the rest: generate or update the table metadata and commit the change.

In this model, the catalog fully owns metadata management for both reads and writes. Engines don’t need to understand Iceberg’s internals, or any other format’s internals. They just write and read data and delegate everything else.
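
Below is a purely hypothetical sketch of what such a write payload could look like. No endpoint like this exists in the Iceberg REST spec today; the route, field names, and commit semantics are invented to illustrate the idea of the catalog owning the commit.

```python
# Purely hypothetical: no write-side commit endpoint like this exists in the
# Iceberg REST spec today. The engine hands the catalog a description of the
# data files it wrote; the catalog would own the metadata update and the commit.
import requests

write_payload = {
    "operation": "append",
    "data-files": [
        {
            "path": "s3://example-bucket/sales/orders/part-0001.parquet",  # placeholder
            "record-count": 120000,
            "file-size-in-bytes": 52428800,
        }
    ],
}

resp = requests.post(
    "https://catalog.example.com/api/v1/namespaces/sales/tables/orders/commit-files",  # invented route
    json=write_payload,
    headers={"Authorization": "Bearer <token>"},  # placeholder token
)
resp.raise_for_status()
print("Catalog committed the change:", resp.json())
```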

The Trade-Offs

This model is clean and powerful. It simplifies engine logic and opens the door for catalogs to support any file-based dataset. But it comes at a cost: the catalog now has to do far more work, taking on metadata management, planning, and commit coordination for every engine that connects to it.

In other words, this model places a lot more responsibility on the catalog itself. That’s not necessarily bad, but it changes the design expectations.

Still, if the goal is to build a universal contract for working with datasets across formats, pushing more of that logic into the catalog, via a standardized API that even the major cloud vendors follow, might be the path forward.

Comparing the Two Paths: Implementation vs. API Standard

Both the Table Sources approach and the Scan Planning API model offer ways to move beyond Iceberg-only catalogs. But they take fundamentally different routes. One starts by expanding what a specific catalog can do, and only becomes a standard if that catalog itself becomes the standard. The other extends an API spec that is already an industry standard, albeit one with a narrower scope today (standardizing transactions on Iceberg tables).

Let’s weigh the trade-offs.

1. Flexibility and Expressiveness

In both scenarios, there is still the question of whether a specific engine supports reading a given file format or metadata format. Even so, in either scenario the catalog can be the central listing that governs access to all lakehouse datasets.

2. Governance and Control

3. Ecosystem Alignment

4. Developer Experience

Summary

In practice, both paths have strengths and challenges. A hybrid model could even emerge: catalogs like Polaris could lead the way with working implementations, while the community formalizes an API spec based on what works.

The real question isn't which is "better"; it's which path brings the most durable, portable, and scalable standard to life.