
ADR-001: Data Lake on BigQuery (GCP)

  • Status: ✅ Accepted
  • Date: 2024
  • Decision makers: CTO, Head of Data

Context

Patria needed a centralized platform to consolidate data from multiple systems (operational, financial, investment) and support analytics, reporting, and AI models.

Decision

Adopt BigQuery (Google Cloud Platform) as the primary data lake and data warehouse platform, with Cloud Composer (Airflow) for pipeline orchestration and Cloud Run for data APIs.
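A minimal sketch of how one such pipeline might be declared in Cloud Composer. It assumes apache-airflow with the Google provider package installed; the DAG id, project, dataset, and table names are illustrative assumptions, not Patria's actual conventions.

```python
# Hypothetical Cloud Composer (Airflow) DAG: moves one source's data
# through the medallion layers (raw -> stage -> gold) via BigQuery SQL jobs.
# Requires apache-airflow plus the Google provider; all names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="finance_daily_load",       # assumed DAG naming standard
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Each hop is a BigQuery query job writing into the next layer's dataset.
    raw_to_stage = BigQueryInsertJobOperator(
        task_id="raw_to_stage",
        configuration={
            "query": {
                "query": "SELECT * FROM `my-project.raw.finance_tx`",
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "stage",
                    "tableId": "finance_tx",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    stage_to_gold = BigQueryInsertJobOperator(
        task_id="stage_to_gold",
        configuration={
            "query": {
                "query": "SELECT * FROM `my-project.stage.finance_tx`",
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "gold",
                    "tableId": "finance_tx",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    raw_to_stage >> stage_to_gold  # gold depends on stage completing
```

A data API on Cloud Run would then read from the gold datasets rather than touching raw or stage directly.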

Rationale

  • BigQuery is serverless: no infrastructure management, automatic scalability
  • Petabyte-scale query performance with predictable cost
  • Native integration with the GCP ecosystem (Dataflow, Vertex AI, Looker)
  • Team's prior experience with the platform
  • Better cost-benefit ratio vs. alternatives like Snowflake and Databricks for the current scale

Alternatives considered

  • Snowflake: higher cost and weaker native GCP integration
  • Databricks: higher operational complexity for the current stage
  • Azure Synapse: the Azure analytics ecosystem was less mature at the time

Consequences

  • Positive: scalable, serverless platform with a strong analytics ecosystem
  • Negative / trade-offs: GCP vendor lock-in; team upskilling required
  • Follow-up actions: structure medallion layers (raw → stage → gold), define Airflow DAG standards
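The medallion follow-up can be made concrete with a table-naming convention. A minimal sketch: the layer names come from this ADR, while the project id, separator, and helper function are assumptions for illustration.

```python
# Sketch of a medallion naming convention for BigQuery table references.
# Layers (raw -> stage -> gold) are from this ADR; the "source__entity"
# pattern and the helper itself are illustrative assumptions.
LAYERS = ("raw", "stage", "gold")

def table_ref(project: str, layer: str, source: str, entity: str) -> str:
    """Build a fully qualified BigQuery table id, e.g. 'proj.raw.erp__orders'."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"{project}.{layer}.{source}__{entity}"
```

For example, `table_ref("patria-dw", "raw", "erp", "orders")` yields `"patria-dw.raw.erp__orders"`; rejecting unknown layers keeps DAG code from silently writing outside the three agreed datasets.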