A misconception persists around Data Vault: that it can only be applied to traditional relational databases. As soon as Hadoop, MongoDB, or more broadly NoSQL technologies are mentioned, Data Vault is often considered unsuitable, too relational, or too normalized.
This view is based on a fundamental confusion.
Data Vault is above all a conceptual and logical model, not a fixed physical schema. Like any modeling approach, whether 3NF, star schema, or snowflake, it must be adapted to the chosen storage engine.
A Key Principle: Separating the “What” from the “How”
Data Vault defines:
- what is being modeled: identities, relationships, and historized attributes;
- the logical structure that lets the model absorb change.
It does not dictate:
- the type of index,
- the partitioning strategy,
- the level of denormalization,
- or even the existence of tables in the relational sense.
👉 Treating the Data Vault model and its relational implementation as the same thing confuses two different levels.
Data Vault = Engine-Agnostic Conceptual Modeling
At the conceptual level, Data Vault relies on stable principles:
- Hubs for business identification,
- Links for relationships,
- Satellites for attributes and history.
These concepts exist independently of the physical engine:
- relational: PostgreSQL, Oracle, SQL Server;
- distributed: Hive, Spark, BigQuery;
- document-oriented: MongoDB;
- column-family (wide-column): HBase;
- or cloud-native.
👉 It is the implementation patterns that change, not the principles.
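To make the three concepts concrete without committing to any engine, here is a minimal sketch in plain Python. The class and field names (`hub_key`, `load_ts`, `record_source`), the `hash_key` helper, and the choice of SHA-256 with a `||` delimiter are illustrative assumptions, not a prescribed standard.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def hash_key(*business_keys: str) -> str:
    """Deterministic key derived from one or more business keys (assumed convention)."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Hub:
    """Business identification: one record per business key."""
    hub_key: str
    business_key: str
    load_ts: datetime
    record_source: str


@dataclass(frozen=True)
class Link:
    """Relationship between two or more Hubs."""
    link_key: str
    hub_keys: Tuple[str, ...]
    load_ts: datetime
    record_source: str


@dataclass(frozen=True)
class Satellite:
    """Descriptive attributes of a Hub or Link at a given load time."""
    parent_key: str
    load_ts: datetime
    record_source: str
    attributes: Dict[str, Any]


# Hypothetical sample data: a customer, an order, and the relationship between them.
now = datetime.now(timezone.utc)
customer = Hub(hash_key("C-1001"), "C-1001", now, "crm")
order = Hub(hash_key("O-42"), "O-42", now, "shop")
placed = Link(hash_key("C-1001", "O-42"), (customer.hub_key, order.hub_key), now, "shop")
details = Satellite(customer.hub_key, now, "crm", {"name": "Ada", "city": "Paris"})
```

Nothing in these records refers to a table, a collection, or a file. The same three shapes can be persisted as relational rows, Parquet files, or documents.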
Implementing Data Vault on Hadoop / Big Data
On Hadoop or Spark platforms:
- Hubs, Links, and Satellites are often implemented as distributed tables;
- partitioning, by date, source, or domain, becomes central;
- the cost of joins is controlled through design choices: batch processing, ELT, and intermediate materializations.
Common adaptations include:
- grouping Satellites by domain or change frequency;
- intensive use of time-based partitioning;
- access optimization through columnar formats such as Parquet or ORC.
👉 Data Vault often fits these environments more naturally than traditional relational databases: its insert-only, incremental loading maps well onto append-oriented storage and partitioned files.
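As an illustration of time-based and source-based partitioning, the sketch below routes Satellite rows to Hive-style partition directories (`key=value` path segments). The base path, the Satellite name, and the row fields are hypothetical, and a real pipeline would hand the grouped rows to a Parquet or ORC writer rather than keep them in memory.

```python
from datetime import datetime, timezone


def satellite_partition_path(base, satellite, record_source, load_ts):
    """Build a Hive-style partition directory for one (load date, source) pair."""
    load_date = load_ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{base}/{satellite}/load_date={load_date}/record_source={record_source}"


def group_by_partition(rows, base, satellite):
    """Group Satellite rows by the partition directory they would be appended to."""
    partitions = {}
    for row in rows:
        path = satellite_partition_path(
            base, satellite, row["record_source"], row["load_ts"]
        )
        partitions.setdefault(path, []).append(row)
    return partitions


# Hypothetical Satellite rows; timestamps are timezone-aware on purpose.
rows = [
    {"hub_key": "a1", "record_source": "crm", "city": "Paris",
     "load_ts": datetime(2024, 5, 1, 8, 0, tzinfo=timezone.utc)},
    {"hub_key": "a1", "record_source": "crm", "city": "Lyon",
     "load_ts": datetime(2024, 5, 2, 8, 0, tzinfo=timezone.utc)},
    {"hub_key": "b2", "record_source": "erp", "city": "Nice",
     "load_ts": datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc)},
]

partitions = group_by_partition(rows, "/warehouse/raw_vault", "sat_customer_details")
for path in sorted(partitions):
    print(path, len(partitions[path]))
```

Because each load only adds new partition directories or appends to recent ones, historization costs no rewrites of older data.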
Implementing Data Vault on MongoDB or Document Databases
In a document-oriented NoSQL context:
- a Hub can become a root document;
- its Satellites can be:
  - historized sub-documents;
  - or separate collections depending on data volumes;
- Links can be materialized through explicit references or relationship collections.
Design choices are made based on:
- data volumes,
- access patterns,
- performance constraints.
👉 Just like a star schema implemented on MongoDB, some denormalization is expected, without abandoning the conceptual model.
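The embedded variant can be sketched with plain Python dictionaries standing in for documents. The document layout, the field names, the use of the business key as `_id`, and the `hash_diff` helper are assumptions made for illustration. With a real driver, the append would become an update on the stored document.

```python
import copy
import hashlib
import json
from datetime import datetime, timezone


def hash_diff(attributes):
    """Fingerprint of an attribute set, used to detect real changes."""
    payload = json.dumps(attributes, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_satellite_version(hub_doc, satellite, attributes, record_source, load_ts):
    """Return a copy of hub_doc with a new Satellite version appended.

    Nothing is appended when the attributes match the latest stored version,
    and existing versions are never modified.
    """
    doc = copy.deepcopy(hub_doc)
    history = doc.setdefault("satellites", {}).setdefault(satellite, [])
    new_diff = hash_diff(attributes)
    if history and history[-1]["hash_diff"] == new_diff:
        return doc
    history.append({
        "load_ts": load_ts,
        "record_source": record_source,
        "hash_diff": new_diff,
        "attributes": attributes,
    })
    return doc


# Hub as root document, Satellites as historized sub-documents,
# Links as explicit references to other Hubs (hypothetical data).
hub_doc = {
    "_id": "C-1001",
    "business_key": "C-1001",
    "record_source": "crm",
    "load_ts": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "satellites": {},
    "links": {"orders": ["O-42"]},
}

v1 = append_satellite_version(hub_doc, "details", {"city": "Paris"}, "crm",
                              datetime(2024, 5, 1, tzinfo=timezone.utc))
v2 = append_satellite_version(v1, "details", {"city": "Paris"}, "crm",
                              datetime(2024, 5, 2, tzinfo=timezone.utc))
v3 = append_satellite_version(v2, "details", {"city": "Lyon"}, "crm",
                              datetime(2024, 5, 3, tzinfo=timezone.utc))
```

If the history array grows large, the same version records can move to a separate collection keyed by the Hub identifier. The conceptual model stays the same and only the physical placement changes.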
Exactly Like 3NF or Star Schema Modeling
It is important to remember an often-forgotten truth:
- a 3NF model is never implemented in a completely “pure” form in production;
- a star schema in production is almost always further denormalized, indexed, and partitioned.
Data Vault is no exception.
👉 Any modeling approach is:
- conceptual and logical upstream;
- physical and pragmatic downstream.
This is not a weakness. It is an architectural rule.
What Never Changes, Regardless of the Engine
Whatever the storage technology, a successful Data Vault preserves:
- the separation between identification, relationship, and attributes;
- explicit historization;
- traceability by source;
- the ability to absorb change.
👉 These are the properties that make Data Vault valuable, not the number of SQL tables.
The Real Risk: Adapting the Model… or Abandoning It
Adapting Data Vault to a NoSQL engine is healthy.
However, abandoning its principles in the name of performance or “simplicity” often leads to:
- a loss of traceability;
- data debt that is difficult to resolve;
- models that become impossible to evolve.
👉 The challenge is not “Data Vault or NoSQL”.
👉 The real challenge is: “how can we intelligently implement Data Vault on a given engine?”
Conclusion
Data Vault 2.0 is neither relational nor NoSQL by nature.
It is conceptual, storage-agnostic, and designed for the long term.
Like any serious modeling approach, it requires:
- physical adaptations;
- technical trade-offs;
- local optimizations.
👉 The real question is therefore not: “is Data Vault compatible with Hadoop or MongoDB?”
👉 It is rather: “do we have the maturity to distinguish between conceptual modeling and physical implementation?”
