> **Warning**
> 🚧 Work in Progress: This page is currently under construction. Content may be incomplete or subject to change. To contribute, see the contribution guide.
# ADR-001: Data Lake on BigQuery (GCP)
| Field | Value |
|---|---|
| Status | ✅ Accepted |
| Date | 2024 |
| Decision makers | CTO, Head of Data |
## Context
Patria needed a centralized platform to consolidate data from multiple systems (operational, financial, investment) and support analytics, reporting, and AI models.
## Decision
Adopt BigQuery (Google Cloud Platform) as the primary data lake and data warehouse platform, with Cloud Composer (Airflow) for pipeline orchestration and Cloud Run for data APIs.
## Rationale
- BigQuery is serverless: no infrastructure management, automatic scalability
- Petabyte-scale query performance with predictable cost
- Native integration with the GCP ecosystem (Dataflow, Vertex AI, Looker)
- The team's prior experience with the platform
- Better cost-benefit ratio vs. alternatives like Snowflake and Databricks for the current scale
## Alternatives considered
| Alternative | Why it was not chosen |
|---|---|
| Snowflake | Higher cost, less GCP integration |
| Databricks | Higher operational complexity for the current stage |
| Azure Synapse | Less mature Azure analytics ecosystem at the time |
## Consequences
- Positive: scalable, serverless platform with a strong analytics ecosystem
- Negative / trade-offs: GCP vendor lock-in; team upskilling required
- Follow-up actions: structure medallion layers (raw → stage → gold), define Airflow DAG standards
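The medallion-layer follow-up can be sketched as a naming-and-promotion convention: each dataset is prefixed with its layer, and data moves to the next layer via a `CREATE OR REPLACE TABLE ... AS SELECT` statement. This is a minimal illustration only; the project ID, dataset names, and table names below are hypothetical placeholders, not the actual Patria setup.

```python
# Illustrative sketch of the medallion-layer convention (raw -> stage -> gold).
# All identifiers here are hypothetical examples.

LAYERS = ("raw", "stage", "gold")

def layered_table(layer: str, domain: str, table: str) -> str:
    """Build a fully qualified BigQuery table ID following the layer-prefix convention."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    # "analytics-project" is a placeholder GCP project ID
    return f"analytics-project.{layer}_{domain}.{table}"

def promote_sql(src: str, dst: str) -> str:
    """Compose the SQL that promotes a table from one layer to the next."""
    return f"CREATE OR REPLACE TABLE `{dst}` AS SELECT * FROM `{src}`;"

src = layered_table("raw", "finance", "transactions")
dst = layered_table("stage", "finance", "transactions")
print(promote_sql(src, dst))
```

In practice each promotion statement would carry real transformation logic (deduplication, typing, business rules) rather than `SELECT *`, and would be wrapped in an Airflow task following the DAG standards mentioned above.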