> 🚧 **Work in Progress:** This page is under construction; content may be incomplete or subject to change. To contribute, see the contribution guide.
# Data Architecture
## Principles
- Single source of truth: BigQuery is the central repository; data must not be replicated into local silos
- Medallion architecture: three data maturity layers (raw → stage → gold)
- Quality before consumption: validations in the stage layer are mandatory
- Cataloging: every production dataset must be registered in the Data Catalog
- Access governance: access is granted via AD groups; service-account credentials must not be shared
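The "quality before consumption" principle can be sketched as a minimal pre-promotion check. The rule names, field names, and thresholds below are illustrative assumptions, not the team's actual validation suite:

```python
# Minimal sketch of stage-layer validations (illustrative only; the real
# rules live in the pipeline code, not in this document).

def validate_stage_rows(rows: list[dict]) -> list[str]:
    """Return a list of violations; an empty list means the batch may be promoted."""
    errors = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Typing check: 'amount' must already be numeric once staged.
        if not isinstance(row.get("amount"), (int, float)):
            errors.append(f"row {i}: 'amount' is not numeric")
        # Deduplication check: the business key must be unique in the batch.
        key = row.get("id")
        if key in seen_keys:
            errors.append(f"row {i}: duplicate id {key!r}")
        seen_keys.add(key)
    return errors
```

A pipeline would run checks like these after loading into stage and block the stage → gold step while the returned list is non-empty.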
## Data lake layers
| Layer | Standard dataset | Content | Retention | Access |
|---|---|---|---|---|
| Raw | project.raw_domain | Raw data, no transformation, original format | 90 days | Data engineering only |
| Stage | project.stage_domain | Cleaned, typed, deduplicated data | 1 year | Data engineering |
| Gold | project.gold_domain | Modeled data, ready for consumption | Permanent | Analysts, BI, APIs |
## Dataset naming standard
See Standards > Naming for full conventions.
Summary:
```
{layer}_{domain}_{subdomain}
```

Examples:

```
raw_investments_fundraising
stage_finance_accounts_payable
gold_corporate_headcount
```
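The naming standard can be checked mechanically. The segment character set (lowercase snake_case) is an assumption here, since the full conventions live in Standards > Naming:

```python
import re

# {layer}_{domain}_{subdomain}: layer is one of the three medallion layers;
# further segments are lowercase snake_case. Subdomain segments are treated
# as optional so layer+domain names (e.g. raw_investments) also pass.
NAME_RE = re.compile(r"^(raw|stage|gold)_[a-z][a-z0-9]*(_[a-z][a-z0-9]*)*$")

def is_valid_dataset_name(name: str) -> bool:
    """True if `name` follows the {layer}_{domain}_{subdomain} standard."""
    return NAME_RE.fullmatch(name) is not None
```

A check like this fits naturally in CI or in the Data Catalog registration step, rejecting datasets before they reach production.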
## Technology stack
| Component | Technology | Use |
|---|---|---|
| Data lake / DW | BigQuery (GCP) | Storage and analytical queries |
| Orchestration | Airflow (Cloud Composer) | Pipeline scheduling and dependencies |
| Ingestion APIs | Cloud Run | REST ingestion for systems without native connectors |
| Low-code integration | N8N | External API and webhook ingestion |
| Transformation | SQL (BigQuery) + dbt (under evaluation) | Stage → gold transformations |
| BI | (fill in — Looker / Power BI / etc.) | Dashboards and reports |
| AI Models | Vertex AI / Cloud Run | Production models |
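The raw → stage → gold flow that the orchestrator schedules can be sketched as a dependency graph. The task names below are hypothetical; the real pipelines are Airflow DAGs on Cloud Composer, and this only illustrates the ordering:

```python
from graphlib import TopologicalSorter

# Hypothetical task chain for one domain: each task depends on the layer
# before it, mirroring the raw -> stage -> gold maturity flow.
graph = {
    "ingest_raw_finance": set(),
    "build_stage_finance": {"ingest_raw_finance"},
    "validate_stage_finance": {"build_stage_finance"},
    "publish_gold_finance": {"validate_stage_finance"},
}

order = list(TopologicalSorter(graph).static_order())
```

In Airflow the same shape would be expressed with task dependencies (`ingest >> build >> validate >> publish`); the topological sort here just makes the ordering explicit.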