
ADR-001: Data Lake on BigQuery (GCP)

  • Status: ✅ Accepted
  • Date: 2024
  • Decision makers: CTO, Head of Data

Context

Patria needed a centralized platform to consolidate data from multiple systems (operational, financial, investment) and support analytics, reporting, and AI models.

Decision

Adopt BigQuery (Google Cloud Platform) as the primary data lake and data warehouse platform, with Cloud Composer (Airflow) for pipeline orchestration and Cloud Run for data APIs.
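A minimal sketch of how one such pipeline might be declared in Cloud Composer. It assumes apache-airflow with the Google provider package installed; the DAG id, project, dataset, and table names are illustrative assumptions, not Patria's actual conventions.

```python
# Hypothetical Cloud Composer (Airflow) DAG: moves one source's data
# through the medallion layers (raw -> stage -> gold) via BigQuery SQL jobs.
# Requires apache-airflow plus the Google provider; all names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="finance_daily_load",       # assumed DAG naming standard
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Each hop is a BigQuery query job writing into the next layer's dataset.
    raw_to_stage = BigQueryInsertJobOperator(
        task_id="raw_to_stage",
        configuration={
            "query": {
                "query": "SELECT * FROM `my-project.raw.finance_tx`",
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "stage",
                    "tableId": "finance_tx",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    stage_to_gold = BigQueryInsertJobOperator(
        task_id="stage_to_gold",
        configuration={
            "query": {
                "query": "SELECT * FROM `my-project.stage.finance_tx`",
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "gold",
                    "tableId": "finance_tx",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    raw_to_stage >> stage_to_gold  # gold depends on stage completing
```

A data API on Cloud Run would then read from the gold datasets rather than touching raw or stage directly.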

Rationale

  • BigQuery is serverless: no infrastructure management, automatic scalability
  • Petabyte-scale query performance with predictable cost
  • Native integration with the GCP ecosystem (Dataflow, Vertex AI, Looker)
  • Team's prior experience with the platform
  • Better cost-benefit ratio vs. alternatives like Snowflake and Databricks for the current scale

Alternatives considered

  • Snowflake: higher cost and weaker native GCP integration
  • Databricks: higher operational complexity for the current stage
  • Azure Synapse: the Azure analytics ecosystem was less mature at the time

Consequences

  • Positive: scalable, serverless platform with a strong analytics ecosystem
  • Negative / trade-offs: GCP vendor lock-in; team upskilling required
  • Follow-up actions: structure medallion layers (raw → stage → gold), define Airflow DAG standards
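The medallion follow-up can be made concrete with a table-naming convention. A minimal sketch: the layer names come from this ADR, while the project id, separator, and helper function are assumptions for illustration.

```python
# Sketch of a medallion naming convention for BigQuery table references.
# Layers (raw -> stage -> gold) are from this ADR; the "source__entity"
# pattern and the helper itself are illustrative assumptions.
LAYERS = ("raw", "stage", "gold")

def table_ref(project: str, layer: str, source: str, entity: str) -> str:
    """Build a fully qualified BigQuery table id, e.g. 'proj.raw.erp__orders'."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"{project}.{layer}.{source}__{entity}"
```

For example, `table_ref("patria-dw", "raw", "erp", "orders")` yields `"patria-dw.raw.erp__orders"`; rejecting unknown layers keeps DAG code from silently writing outside the three agreed datasets.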