Warning

🚧 Work in Progress: This page is currently under construction. Content may be incomplete or subject to change. To contribute, see the contribution guide.

Data Architecture


Principles

  1. Single source of truth: BigQuery as the central repository — data must not be replicated in local silos
  2. Medallion architecture: three data maturity layers (raw → stage → gold)
  3. Quality before consumption: mandatory validations in the stage layer
  4. Cataloging: every production dataset must be registered in the Data Catalog
  5. Access governance: access by AD group, no service credential sharing
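Principle 3 (quality before consumption) can be sketched as a small quality gate applied before data is promoted out of the stage layer. This is a minimal illustration, not the team's actual rule set; the field names (`id`, `amount`) and the three checks are assumptions.

```python
# Minimal sketch of a stage-layer quality gate (principle 3).
# Field names and rules are illustrative, not the mandatory validations themselves.

def validate_stage_rows(rows):
    """Keep only rows that have a key, are not duplicated, and are correctly typed."""
    seen = set()
    valid = []
    for row in rows:
        key = row.get("id")
        if key is None:                                       # mandatory key present
            continue
        if key in seen:                                       # deduplicated
            continue
        if not isinstance(row.get("amount"), (int, float)):   # typed
            continue
        seen.add(key)
        valid.append(row)
    return valid
```

In practice these checks would run inside the pipeline between the raw and stage datasets, so that downstream (gold) consumers never see unvalidated rows.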

Data lake layers

| Layer | Standard dataset | Content | Retention | Access |
|-------|------------------|---------|-----------|--------|
| Raw | project.raw_domain | Raw data, no transformation, original format | 90 days | Data engineering only |
| Stage | project.stage_domain | Cleaned, typed, deduplicated data | 1 year | Data engineering |
| Gold | project.gold_domain | Modeled data, ready for consumption | Permanent | Analysts, BI, APIs |
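The retention column maps naturally onto BigQuery's dataset-level expiration defaults, which are expressed in milliseconds (via the google-cloud-bigquery client's Dataset.default_table_expiration_ms). A small sketch of that conversion, assuming the day counts above; gold has no expiration:

```python
# Hedged sketch: translating the retention column into the millisecond
# values BigQuery dataset defaults expect. Day counts come from the
# table above; gold is permanent, so it gets no entry.

RETENTION_DAYS = {"raw": 90, "stage": 365}

def retention_ms(days):
    """Convert a retention window in days to milliseconds."""
    return days * 24 * 60 * 60 * 1000
```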

Dataset naming standard

See Standards > Naming for full conventions.

Summary:

{layer}_{domain}_{subdomain}
e.g.: raw_investments_fundraising
      stage_finance_accounts_payable
      gold_corporate_headcount
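A naming standard like this is easy to enforce mechanically. The sketch below is an illustrative checker only: the layer prefixes mirror the medallion layers, but the lowercase snake_case rule for domain and subdomain is an assumption, not the full convention from Standards > Naming.

```python
import re

# Illustrative validator for the {layer}_{domain}_{subdomain} standard.
# Assumes lowercase snake_case segments and requires at least a domain
# and one subdomain segment after the layer prefix.
DATASET_NAME = re.compile(r"^(raw|stage|gold)_[a-z]+(?:_[a-z]+)+$")

def is_valid_dataset_name(name):
    return bool(DATASET_NAME.fullmatch(name))
```

For example, is_valid_dataset_name("raw_investments_fundraising") is True, while a name with an unknown layer such as "silver_finance_ledger" is rejected.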

Technology stack

| Component | Technology | Use |
|-----------|------------|-----|
| Data lake / DW | BigQuery (GCP) | Storage and analytical queries |
| Orchestration | Airflow — Cloud Composer | Pipeline scheduling and dependencies |
| Ingestion APIs | Cloud Run | REST ingestion for systems without native connectors |
| Low-code integration | N8N | External API and webhook ingestion |
| Transformation | SQL (BigQuery) + dbt (under evaluation) | Stage → gold transformations |
| BI | (fill in — Looker / Power BI / etc.) | Dashboards and reports |
| AI Models | Vertex AI / Cloud Run | Production models |
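The stack names SQL on BigQuery for the stage → gold transformations. As a hedged sketch of what such a transformation might look like, the helper below builds a deduplicating SELECT using BigQuery's QUALIFY clause; the dataset, column, and key names are placeholders, not real pipeline objects.

```python
# Hedged sketch: a stage -> gold transformation expressed as the kind of
# BigQuery SQL the stack describes. All identifiers are placeholders.

def build_gold_query(source, target_columns, key, order_col):
    """Keep the latest row per business key, projecting only modeled columns."""
    cols = ", ".join(target_columns)
    return (
        f"SELECT {cols} FROM {source} "
        f"QUALIFY ROW_NUMBER() OVER ("
        f"PARTITION BY {key} ORDER BY {order_col} DESC) = 1"
    )
```

In a Composer-orchestrated pipeline, a query like this would typically run as one Airflow task per gold table, downstream of the stage-layer validation tasks.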