Data Architecture

Version | Date | Author | Comments

Introduction

This document provides an overview of the Data Architecture at Patria. It establishes the standards, principles, and guidelines for data architecture, governance, and the construction of data flows at Patria Investments. Our goal is to empower the organization to make data-driven decisions, drive innovation, and optimize operational efficiency, ensuring that our data assets are reliable, secure, and accessible.

Document Purpose:

  • To define a robust and scalable data architecture.
  • To standardize data naming, organization, and lifecycle.
  • To provide clear guidelines for the development and operation of data pipelines.
  • To ensure compliance with security and governance policies.

Target Audience: This document is intended for Data Engineers, Data Scientists, Developers, Data Analysts, Project Managers, and all stakeholders who interact with our data and infrastructure.

Benefits of Adoption: Adherence to these standards will result in:

  • Consistency and Reusability: Reduced rework and increased agility in development.
  • Scalability and Performance: Infrastructure prepared for growth and high demand.
  • Data Security and Quality: Protection of sensitive information and assurance of reliable data.
  • Cost Optimization: Efficient use of cloud resources.
  • Agility and Collaboration: Facilitation of teamwork and continuous delivery.

Fundamental Principles of Data Architecture

This section establishes the pillars upon which our data architecture is built, ensuring that all decisions and implementations are aligned with our strategic vision.

Guiding Principles

  • Data as a Product: Treat datasets as products, with clear ownership, documentation, and SLAs.
  • Security by Design: Incorporate security controls at all layers of the architecture, from conception.
  • Cost Optimization (FinOps): Continuously monitor and optimize data infrastructure costs.
  • Scalability and Flexibility: Design systems that can grow and adapt to future needs.
  • Observability: Implement comprehensive monitoring and logging for data flow visibility and performance.
  • Automation First: Prioritize the automation of infrastructure processes and data pipelines (IaC, CI/CD).
  • Data Quality: Establish mechanisms to ensure the integrity, accuracy, and completeness of data at all stages.

Architecture Overview (Conceptual Diagram)

A high-level diagram illustrating the data flow from sources to consumption, highlighting the main layers and components. This diagram will serve as a conceptual map before delving into technical details.

TODO


Environments and Project Organization

The organization and management of projects in Google Cloud follow a structured hierarchy to ensure efficient administration and clear separation of environments. Our setup is organized under the parent organization called patria.com. Within this organization, projects are categorized and managed under distinct environments to facilitate development, testing, and production workflows. These environments are designated as development (dev), homologation (hml), and production (prd).

Environment Structure

Details the purpose and characteristics of each environment:

  • Development (dev): The development environment is used for building and testing new features and functionalities. It allows developers to experiment and iterate on their code without affecting the production environment. All resources and services in this environment are isolated from other environments to ensure stability and security.

  • Homologation (hml): The homologation environment serves as a staging area where features and changes are tested thoroughly before being deployed to production. This environment mimics the production environment as closely as possible to identify and fix any issues that might arise during the final stages of the deployment process.

  • Production (prd): The production environment is where the live applications and services run. This environment is optimized for reliability, performance, and security. It is strictly controlled to minimize disruptions and ensure continuous availability of services to end users.

GCP Project Management

Each environment contains specific projects tailored to the needs and requirements of that stage. For example:

  • pat-datalake-dev for development activities.
  • pat-datalake-hom for homologation and pre-production testing.
  • pat-datalake-prd for live production workloads.

This hierarchical and segmented approach helps in maintaining clear boundaries between different stages of the project lifecycle, thus facilitating smooth transitions from development to production while ensuring that each stage meets its specific operational requirements.

By organizing projects in this manner, we ensure efficient resource allocation, streamlined workflows, and enhanced security and compliance across all stages of our operations within Google Cloud.

Access Management (IAM)

Managing access to these projects is crucial for maintaining security and ensuring that team members have the appropriate permissions to perform their tasks. Google Cloud’s Identity and Access Management (IAM) provides a flexible and powerful way to manage access to resources.

Steps to Grant Access:

  1. Navigate to IAM & Admin:
    1. Open the Google Cloud Console.
    2. Navigate to the IAM & Admin section from the main menu.
  2. Select the Project:
    1. Choose the project for which you want to manage access (e.g., pat-datalake-dev, pat-datalake-hom, pat-datalake-prd).
  3. Add Members:
    1. Click on the “Add” button to add new members.
    2. Enter the email address of the user or service account you want to grant access to.
  4. Assign Roles:
    1. Select the appropriate roles for the user based on their responsibilities. Common roles include:
      1. Viewer: Read-only access.
      2. Editor: Read and write access.
      3. Owner: Full control over all resources.
      4. Custom roles: If specific permissions are needed, custom roles can be created to tailor access precisely.
  5. Save Changes:
    1. Click “Save” to apply the changes.

Example: To grant a developer access to the development environment with editing capabilities:

  • Project: pat-datalake-dev
  • Member: developer@example.com
  • Role: Editor

This process ensures that only authorized users have access to specific resources and can perform their tasks without compromising the security and integrity of the environments.

By following these steps, we maintain a secure and well-managed access control system across all our projects and environments in Google Cloud.
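
The same grant can also be made programmatically. Below is a minimal sketch using the Resource Manager Python client, reusing the example project and member above; treat it as an illustrative alternative to the console steps, not the standard procedure.

from google.cloud import resourcemanager_v3

client = resourcemanager_v3.ProjectsClient()
resource = "projects/pat-datalake-dev"

# Read the current IAM policy, append the binding from the example above, and write it back.
policy = client.get_iam_policy(request={"resource": resource})
policy.bindings.add(role="roles/editor", members=["user:developer@example.com"])
client.set_iam_policy(request={"resource": resource, "policy": policy})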


Core Data Architecture: Layers and Components

This section describes the fundamental structure of our Data Lake and Data Warehouse, based on the Medallion Architecture, and the main technological components that support it.

Medallion Architecture (Bronze, Silver, Gold)

The Medallion Architecture (also known as the “Multi-hop” or “Lakehouse” architecture) is a data design pattern used to organize data in a data lake or lakehouse environment. It structures data into tiered layers (bronze, silver, gold) to progressively refine and enrich raw data into trusted, analytics-ready datasets. This architecture is widely adopted in platforms like Databricks and Delta Lake, but it’s tool-agnostic and can be applied in any modern data stack.

In our data lake architecture, we utilize the Medallion Architecture to organize and manage data through different stages of its lifecycle. This approach helps in maintaining data quality, enabling efficient data processing, and ensuring that data is ready for analytical use. The layers are categorized as Bronze, Silver, and Gold.

Bronze Layer

  • Description: The Bronze layer is the raw data layer where data is ingested and stored in its original format. This layer acts as the landing zone for all incoming data from various sources. Data in the Bronze layer is often unprocessed and may include duplicate records or inconsistent formats.
  • Purpose: The primary purpose of the Bronze layer is to capture and store all incoming data for further processing and transformation.
  • Characteristics: Raw, unprocessed data, may contain duplicates and inconsistencies. Scalable storage (e.g., Parquet in Google Cloud Storage). Acts as a staging area for data ingestion.

Bronze layer

Silver Layer

  • Description: The Silver layer is the intermediate layer where data is cleaned, deduplicated, and transformed. This layer focuses on improving data quality by applying various transformation and standardization processes.
  • Purpose: The purpose of the Silver layer is to transform raw data into a more refined and structured format, making it suitable for further analysis and reporting.
  • Characteristics: Cleaned, deduplicated, and transformed into standardized formats and structures. This enhances data quality through initial transformations and ensures the data is stored in an organized, structured format such as BigQuery tables.

Silver layer

Gold Layer

  • Description: The Gold layer is the final, refined layer where data is fully transformed and aggregated. This layer contains high-quality, ready-to-use data optimized for business intelligence (BI) and advanced analytics.
  • Purpose: To provide business-ready data that can be directly used for reporting, dashboards, and advanced analytics without further transformations.
  • Characteristics: Fully transformed and aggregated data, high quality and consistency. Optimized for fast querying (e.g., BigQuery tables with optimized schema).

Gold layer

Data Flow Example

Practical illustration of data movement between layers, from ingestion (World Bank to GCS) to final consumption in BigQuery.

  1. Bronze Layer:
    1. Data is ingested from various sources, such as World Bank, and stored in its raw format in Google Cloud Storage.
    2. Example: gs://poc-patria-lake-dev/patria/world_bank/raw_data/
  2. Silver Layer:
    1. Data from the Bronze layer is processed to remove duplicates, correct inconsistencies, and standardize formats. The refined data is then stored in BigQuery.
    2. Example: bigquery-public-data.patria.world_bank.silver_data
  3. Gold Layer:
    1. Data from the Silver layer is further transformed and aggregated to create final datasets optimized for BI and reporting. This data is stored in BigQuery and is ready for consumption.
    2. Example: bigquery-public-data.patria.world_bank.gold_data

By adopting the Medallion Architecture, we ensure that our data lake supports scalable, efficient, and high-quality data processing, making it easier to derive valuable insights and support data-driven decision-making across the organization.
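
To make the flow concrete, here is a minimal, illustrative Python sketch of the three hops. Bucket, dataset, and table names are placeholders, and the sample records stand in for a real World Bank extract.

import pandas as pd
from google.cloud import bigquery, storage

# Stand-in for a raw World Bank extract (note the intentional duplicate row).
raw = pd.DataFrame({
    "country": ["BRA", "BRA", "CHL"],
    "indicator": ["gdp", "gdp", "gdp"],
    "year": [2022, 2022, 2022],
    "value": [1.92e12, 1.92e12, 3.01e11],
})

# Bronze: land the raw extract as Parquet in Cloud Storage.
storage.Client().bucket("poc-patria-lake-dev").blob(
    "patria/world_bank/raw_data/2023/12/17/full.parquet"
).upload_from_string(raw.to_parquet(), content_type="application/octet-stream")

# Silver: deduplicate/standardize and load into BigQuery.
bq = bigquery.Client()
silver = raw.drop_duplicates().rename(columns=str.lower)
bq.load_table_from_dataframe(silver, "world_bank_silver.indicators").result()

# Gold: aggregated, analytics-ready table built with SQL on top of silver.
bq.query("""
    CREATE OR REPLACE TABLE world_bank_gold.gdp_by_country AS
    SELECT country, AVG(value) AS avg_gdp
    FROM world_bank_silver.indicators
    WHERE indicator = 'gdp'
    GROUP BY country
""").result()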


Key Data Platform Components

Description of the main services and tools used in our architecture, with their functionalities and roles in the ecosystem.

Data Storage (Google Cloud Storage)

Google Cloud Storage is a managed service for storing structured, semi-structured, and unstructured data; it is where the raw files extracted from source systems are stored. It offers a wide range of features, such as object lifecycle management, continental-scale replication, and a fast, flexible transfer service. It also provides storage classes: depending on the type of data and how frequently it is read, data can be placed in cheaper, less frequently accessed tiers.

  • Standard: best performance, intended for data that is accessed frequently.
  • Nearline: low-cost alternative, recommended for data accessed less than once a month.
  • Coldline: lower-cost alternative, recommended for data accessed less than once every 90 days.
  • Archive: best cost-benefit, intended for data accessed less than once a year.
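
The storage class of existing objects can be managed automatically through lifecycle rules. Below is a minimal sketch using the Cloud Storage Python client; the bucket name and age thresholds are illustrative.

from google.cloud import storage

# Attach lifecycle rules that move objects to cheaper storage classes as they age.
bucket = storage.Client().get_bucket("poc-patria-lake-dev")  # illustrative bucket name
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)   # read roughly monthly at most
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)   # read roughly quarterly at most
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)   # long-term retention
bucket.patch()  # persist the updated lifecycle configuration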

Data Warehouse/Analytics (BigQuery)

BigQuery is the database where the data is stored and queried. It is Google's solution for analyzing huge amounts of data in a scalable way using SQL, allowing terabytes of data to be queried in seconds. Its main advantages are speed, scalability, and cost efficiency, since the cost of each query can be measured. Among its main functionalities are:

  • Cost control.
  • High availability.
  • Security and access control.
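
As an illustration of the cost-control point above, the bytes a single query may scan can be capped with the BigQuery Python client. The project, dataset, and table names below are placeholders.

from google.cloud import bigquery

client = bigquery.Client(project="pat-datalake-prd")  # placeholder project
# Fail the query if it would scan more than 10 GiB, keeping costs predictable.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)

query = """
    SELECT country, AVG(value) AS avg_gdp
    FROM `world_bank_gold.gdp_by_country`  -- placeholder dataset.table
    GROUP BY country
"""
for row in client.query(query, job_config=job_config).result():
    print(row.country, row.avg_gdp)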

Workflow Orchestration (Apache Airflow / Cloud Composer)

Airflow is a tool developed for orchestrating data pipelines and is one of the most widely used frameworks for this purpose, since it elegantly solves several of the main problems found in data pipelines. Since its creation, the tool has been built on a few principles:

  • Dynamicity: pipelines are instantiated through code, specifically Python, which makes their creation simpler and more dynamic.
  • Extensibility: it can be attached to any environment thanks to its componentized design of executors, operators, and external libraries.
  • Scalability: a message queue orchestrates its workers, and its modular architecture allows it to scale to any number of environments.

Google Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow, designed to automate, schedule, and monitor workflows across Google Cloud and hybrid environments.

Key Features & Capabilities:

  • Managed Apache Airflow:
    • Runs Airflow as a managed service, eliminating the need to deploy, scale, or maintain infrastructure.
    • Provides Airflow UI + CLI access for workflow management.
  • Workflow Automation:
    • Schedule and orchestrate ETL pipelines, data processing, and ML workflows.
    • Trigger workflows based on events, time schedules, or external signals.
  • GCP & Multi-Cloud Integration:
    • Native GCP integrations: Works seamlessly with BigQuery, Pub/Sub, Cloud Storage, Dataproc, Dataflow, and more.
    • Supports hybrid/multi-cloud workflows (AWS, Azure, on-prem).
  • Scalability & Reliability:
    • Auto-scales based on workload demands.
    • Runs on Google Kubernetes Engine (GKE) for high availability.
  • Monitoring & Logging:
    • Built-in Cloud Monitoring & Logging for tracking DAG (workflow) execution.
    • Error handling & retries for fault tolerance.

Secret Management (Secret Manager)

Service for managing passwords and credentials for accessing external data sources, acting as a password vault.

Secret Manager holds all passwords and connection data for our data sources. Airflow references Secret Manager as a secrets backend when making connections with external systems. Within this infrastructure, Secret Manager acts as the password vault where the access credentials for all systems and databases are kept.
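
As a minimal sketch, a secret stored this way can also be read directly with the Secret Manager Python client; the project and secret names below are placeholders.

from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
# Placeholder secret following the airflow-connections-* naming convention described later in this document.
name = "projects/pat-datalake-dev/secrets/airflow-connections-oracle_hcm/versions/latest"
password = client.access_secret_version(name=name).payload.data.decode("utf-8")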

Compute and Processing (Compute Engine)

Compute Engine is Google Cloud Platform’s (GCP) Infrastructure-as-a-Service (IaaS) offering, providing secure, high-performance, and scalable virtual machines (VMs) for running cloud workloads.

Key Features of Compute Engine:

  • Customizable Virtual Machines: Create VMs with varying CPU, memory, and storage configurations. Offers predefined machine types (E2, N2, N2D, C2 series) or custom machine configurations with exact resource requirements.
  • Scalability:
    • Autoscaling: Automatically adjusts VM instances based on application demand.
    • Managed Instance Groups: Create groups of identical VMs that can be auto-scaled and managed collectively.
  • Operating System Images: Supports multiple OS options including Linux (Ubuntu, Debian, Red Hat) and Windows Server.
  • Storage Options:
    • Persistent Disks: Durable, highly available block storage.
    • Local SSDs: High-speed temporary storage disks.
  • Networking: Secure and efficient VM communication via Virtual Private Cloud (VPC).
  • Security:
    • IAM and Role-Based Access Control (RBAC) for permission management.
    • SSH keys and service account authentication for secure VM access.
  • Common Use Cases:
    • Web Applications & Backends: Host APIs, web services, and backend applications.
    • Data Science & ML: Optimized instances for data processing and ML model execution.
    • Dev/Test Environments: Rapid provisioning for software development and testing.
    • Containers: Native integration with Kubernetes Engine and Docker support.

Identity and Access Management (IAM)

IAM (Identity and Access Management) is the service in GCP that allows you to manage access to resources on the platform. With IAM, you can define who (users, groups, or service accounts) can do what (grant permissions) on which resources (such as VMs, storage buckets, etc.).

Key concepts:

  • Policies: Sets of permissions applied to resources.
  • Roles: Sets of permissions that can be assigned to identities.
  • Identities: Users, groups, or service accounts that can have access to resources.
  • Principle of least privilege: You can define roles that provide only the permissions necessary to perform a task.

DevOps Tools (Azure DevOps, Azure Pipelines)

Platform for the software development lifecycle, focusing on CI/CD for automating code builds, tests, and deployments.

Azure DevOps is a software development services platform that offers a set of integrated tools for planning, developing, delivering, and managing applications in multiple programming languages. It covers the entire software development lifecycle (SDLC), from writing code to monitoring delivery to production. In this project it is used much like GitHub: it is where the Airflow source code is stored.

Azure Pipelines is the CI/CD (Continuous Integration/Continuous Delivery) service within Azure DevOps, designed to automate the process of building, testing, and deploying code. It supports a wide variety of programming languages, frameworks, and environments, making it ideal for any type of software project.

Azure Pipelines Features:

  • Continuous Integration (CI):
    • Automatic code builds: Whenever changes are pushed to the repository (via push or pull request), Azure Pipelines can be configured to automatically build the code, run tests, and validate if changes are stable.
    • Multi-language support: Supports virtually all popular languages and frameworks including Python, .NET, Java, Node.js, Ruby, Go, and PHP.
    • Build pipelines: CI processes are configured in pipelines that can be defined via YAML or through a graphical interface. These pipelines can be reused and shared across different projects.
  • Continuous Delivery (CD):
    • Deployment automation: Azure Pipelines enables automated application deployments to various environments like Azure, AWS, Google Cloud, Kubernetes, on-premises VMs, or even custom servers.
    • Multi-environment deployment: You can define test, staging, and production environments within a pipeline, ensuring code passes validation stages before being released to production.
    • Gates and approvals: CD pipelines can be configured with automatic gates and manual approvals to promote or block code releases based on tests or reviews.

Data Governance and Quality

Data governance is crucial to ensuring that our data assets are reliable, secure, and compliant with regulations.

Dataplex for Governance

Features and Benefits

Google Cloud Dataplex is an intelligent, unified data governance service offered by Google Cloud Platform (GCP). It helps organizations manage, monitor, and govern their data across diverse environments (like data lakes, data warehouses, and databases) in a centralized and automated way.

Key Features of Dataplex:

  • Unified Data Management:
    • Organizes distributed data across GCS (data lakes) and BigQuery (data warehouses) into a single logical layer called a “Lake”.
    • Supports data zones (Raw, Curated, Analytics) for better structure.
  • Automated Metadata & Cataloging:
    • Automatically tracks metadata (schema, classifications, data lineage).
    • Integrates with Data Catalog for easy data discovery.
  • Governance & Security:
    • Centralized access policies (IAM).
    • Supports data masking and sensitive data classification (e.g., PII, financial data).
  • Data Quality & Processing:
    • Define and monitor data quality rules.
    • Integrates with Dataproc (Spark) and BigQuery for transformations.
  • AI/ML Analytics:
    • Enables advanced analytics using BigQuery ML and Vertex AI on managed data.

Lake Organization in Dataplex

The concept of a lake in Dataplex can be established through the hierarchy of Lake → Zone → Asset. For the overall Patria data project, a lake named lake-prd has been created, encompassing the data from the POC project. Within this lake, there are the bronze, silver, and gold zones. These zones contain the following assets:

  • Bronze: Storage bucket and BigQuery dataset
  • Silver: BigQuery dataset
  • Gold: BigQuery dataset

Access control can be managed at the lake, zone, or asset level.

For new projects, additional lakes can be created within Dataplex. For example: commercial-lake, rh-lake.

Dataplex lake organization

Access Management in Dataplex

When a new account needs to be included, it is first necessary to determine which data this account will be able to access.

Example: consider the account of a Data Analyst who will only query the information in the final data layer, the gold layer.

Steps to Grant Access:

  1. Navigate to Dataplex:
    1. Open the Google Cloud Console.
    2. Navigate to the Dataplex section from the main menu.
  2. Select the option Manage:
    1. Within the Manage option, the Lakes will be shown, just select the desired lake (e.g., pat-datalake-dev, pat-datalake-hom, pat-datalake-prd).
  3. Select the option Lake:
    1. Within the Lake option, the Zones will be shown, just select the desired zone (e.g., bronze, silver, gold).
  4. Within the lake zones, click on the Permissions tab:
    1. In the permissions tab, several accounts and their permissions related to the data in this layer will be displayed.
    2. Click the Grant Access button, and a window will appear:
      1. In Add Principals, add the emails of the desired accounts.
      2. In Assign Roles, select the roles that these accounts will have within the data layer. There are several roles, such as editor and viewer, as well as custom roles such as Dataplex Data Writer.
    3. Once the adjustments have been confirmed, simply click the Save button.

Data Lineage

Dataplex features a data lineage tool that generates a data flow diagram, illustrating the created tables and the queries responsible for generating these tables. This tool can be accessed directly from BigQuery, within each table, using the “Data Lineage” button.

Data lineage

Data Masking

Data masking is a critical process for ensuring the privacy and security of sensitive information within our data warehouse. In Google Cloud Platform (GCP), we use BigQuery’s native capabilities to implement data masking techniques, which allow us to obscure specific data elements to protect sensitive information from unauthorized access while maintaining the usability of the data for analysis and reporting.

Purpose

The primary purpose of data masking is to safeguard sensitive data such as personally identifiable information (PII), financial data, and other confidential information. By masking data, we can:

  • Comply with regulatory requirements and data privacy laws.
  • Protect against data breaches and unauthorized access.
  • Enable safe data sharing and analysis without exposing sensitive information.

Configuration File

The configuration for data masking and other project parameters is managed in the project_params.py file located within the Airflow DAGs directory.

  • Configuration File: project_params.py (see the hypothetical sketch below)
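
As a purely hypothetical illustration of the kind of parameters such a file might centralize (the real project_params.py in the DAGs repository is the source of truth):

# Hypothetical contents -- the actual project_params.py in the DAGs repository is authoritative.
GCP_PROJECTS = {"dev": "pat-datalake-dev", "hml": "pat-datalake-hom", "prd": "pat-datalake-prd"}

# Columns to be masked per table, consumed by the masking step of the pipeline (placeholder names).
MASKED_COLUMNS = {
    "world_bank_silver.employees": ["cpf", "email", "salary"],
}

# Policy tag taxonomy used when applying column-level security in BigQuery (illustrative name).
POLICY_TAG_TAXONOMY = "patgcptxdm01"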

BigQuery

In BigQuery it is possible to create Policy Tags to mask sensitive data from users. First create a new taxonomy; then open the taxonomy to attach masking rules to the tags you created: select the tag and click the button above the list. A modal will open where you can select the masking rule to apply.

BigQuery taxonomy configuration

BigQuery policy tag masking rules

You can apply the rule-bearing tags to the columns of each table in BigQuery. Open the table and select Edit Schema:

Edit schema

Then select the fields/columns to add the masking tag and click on Add Policy Tag:

Add policy tag

Policy tag applied

You can find more information at BigQuery Column-Level Security Documentation.
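
The same column-level tagging can be done programmatically. Below is a minimal sketch with the BigQuery Python client; the table name, column name, and policy tag resource name are placeholders.

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("pat-datalake-prd.world_bank_silver.employees")  # placeholder table

# Placeholder policy tag resource name (taxonomy and tag IDs come from the taxonomy created above).
policy_tag = "projects/pat-datalake-prd/locations/us-east1/taxonomies/1234567890/policyTags/987654321"

new_schema = []
for field in table.schema:
    if field.name == "salary":  # placeholder sensitive column
        field = bigquery.SchemaField(
            field.name,
            field.field_type,
            mode=field.mode,
            description=field.description,
            policy_tags=bigquery.PolicyTagList(names=[policy_tag]),
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # equivalent to saving the change in Edit Schema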


Standards and Conventions

Standardization is essential for consistency, maintenance, and collaboration in our data architecture.

GCP Resource Naming

Patria has rules for naming cloud components: buckets, databases, firewalls, etc. must comply with them. This nomenclature follows a defined structure.

GCP cloud resource names follow this formula:

pat + gcp + xx + yy + nn
  • pat = Acronym of Patria
  • gcp = Acronym of Google Cloud
  • xx = Cloud component acronym
  • yy = Acronym for the cloud resource functionality
  • nn = Ordinal number of the resource within the project

Example: For a bucket in GCP storage, the nomenclature would be: patgcpbust01

Below is the dictionary of acronyms used in the project.

Dictionary of Acronyms:

Acronym | Definition
pat | Patria
gcp | Google Cloud
ce | Compute Engine
ap | Application
bu | Bucket
st | GCP Storage
bq | GCP BigQuery
dl | Datalake
tx | GCP BigQuery Policy Taxonomy
mk | GCP BigQuery Taxonomy Mask
dp | GCP DataPlex
dm | Data Management
zn | Dataplex Zone
as | Dataplex Asset
cc | GCP Cloud Composer
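
As an illustration only, a small helper can assemble names from the formula and the acronym dictionary above; the function name is hypothetical.

def gcp_resource_name(component: str, functionality: str, ordinal: int) -> str:
    """Builds pat + gcp + component acronym + functionality acronym + two-digit ordinal."""
    return f"patgcp{component}{functionality}{ordinal:02d}"

print(gcp_resource_name("bu", "st", 1))  # patgcpbust01, the bucket example above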

Data Naming and Organization Standardization

Cloud Storage

Folder structure for data in GCS:

/[bucket-lake]/[project]/[source-system]/[object]/[year]/[month]/[day]/[full|incremental].parquet

Examples:

poc-patria-lake-dev/patria/oracle_hcm/lista_desligados/2023/12/17/full.parquet
poc-patria-lake-dev/patria/tmf_usa/usa_report/2023/12/15/full.parquet
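
A small, hypothetical helper illustrates how pipelines can build object paths that follow this convention:

from datetime import date

def lake_object_path(bucket: str, project: str, source_system: str, obj: str,
                     load_date: date, load_type: str = "full") -> str:
    """Builds a GCS object path following the [bucket]/[project]/[source]/[object]/[date] convention."""
    return (f"{bucket}/{project}/{source_system}/{obj}/"
            f"{load_date.year}/{load_date.month:02d}/{load_date.day:02d}/{load_type}.parquet")

print(lake_object_path("poc-patria-lake-dev", "patria", "oracle_hcm",
                       "lista_desligados", date(2023, 12, 17)))
# poc-patria-lake-dev/patria/oracle_hcm/lista_desligados/2023/12/17/full.parquet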

BigQuery

General Guidelines:

  • Consistency: Ensure consistency across all datasets, tables, and columns.
  • Readability: Use descriptive names that easily convey the meaning or content.
  • Use Lowercase Letters: Stick to lowercase letters to avoid case sensitivity issues.
  • Underscores for Separation: Use underscores (_) to separate words instead of spaces or hyphens.
  • Descriptive Names: Choose names that clearly describe the content or purpose of the dataset, table, or field.
  • Avoid Keywords: Do not use BigQuery reserved keywords.

Dataset Naming Conventions:

  • Format: Lowercase with underscores (snake_case).
  • Description: Include the name of the subject area and environment if applicable.
  • Example: sales_analytics, customer_data_dev

Table Naming Conventions:

  • Format: Lowercase with underscores (snake_case).
  • Structure: Start with the subject name, followed by descriptions for specific focus if needed.
  • Avoidance: Avoid using tbl_, as it is redundant.
  • Time Partitioned Tables: Consider including time dimensions (e.g., sales_2025).
  • Example: order_transactions, customer_profiles
  • Description: All tables must have a clear description. It will be used in Data Catalog.

Column Naming Conventions:

  • Format: Lowercase with underscores (snake_case).
  • Clarity: Be descriptive but concise.
  • Prefixes/Suffixes: Use them sparingly for types or units if they add clarity.
  • Avoidance: Do not use generic names like data or value.
  • Example: customer_id, transaction_amount_usd
  • Description: All columns must have a clear description. It will be used in Data Catalog.

Abbreviations:

  • Use Sparingly: Only use widely accepted abbreviations.
  • Be Descriptive: Ensure they make the column easily understood.
  • Examples: qty for quantity, amt for amount.

Airflow DAGs Standardization

The following naming conventions have been implemented for creating DAGs.

DAG Name:

  • Bronze + Silver: [client]_[source_system]
  • Gold: [client]_[concept]

Folders in Task Groups:

  • bronze, silver and gold

Files in Task Groups:

  • [client]_[source_system]_bronze
  • [client]_[source_system]_silver
  • [client]_[concept]_gold

Example of Full Path:

.../dags/task_groups/bronze/patria_oracle_hcm_bronze.py
.../dags/task_groups/gold/patria_reports_tmf_gold.py
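
A minimal DAG skeleton (assuming Airflow 2.x) following these conventions for the client patria and the source system oracle_hcm; the callables are placeholders for the real task-group code under dags/task_groups/.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_to_bronze(**_):
    ...  # placeholder: land raw source files in Cloud Storage

def transform_to_silver(**_):
    ...  # placeholder: clean, deduplicate, and load into BigQuery

with DAG(
    dag_id="patria_oracle_hcm",  # [client]_[source_system]
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["bronze", "silver"],
) as dag:
    bronze = PythonOperator(task_id="patria_oracle_hcm_bronze", python_callable=extract_to_bronze)
    silver = PythonOperator(task_id="patria_oracle_hcm_silver", python_callable=transform_to_silver)
    bronze >> silver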

Tags / Labels for FinOps

Standards for tags and labels (resource_name, project_id, cost_center, env, area) for cost control and resource allocation.

The labels of each project component are defined directly in the IaC code, simply by including them in the Terraform configuration.

It was defined by Patria and Builders that the labels would be as follows:

Label | Example
resource_name | patgcpcedc01
project_id | prj-2025-037
cost_center | not_informed
env | prd
area | it
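
Terraform is the canonical place to set these labels. As a purely illustrative alternative sketch, the same labels can also be applied with the Cloud Storage Python client (the bucket name is a placeholder):

from google.cloud import storage

bucket = storage.Client().get_bucket("patgcpbust01")  # placeholder bucket
bucket.labels = {
    "resource_name": "patgcpbust01",
    "project_id": "prj-2025-037",
    "cost_center": "not_informed",
    "env": "prd",
    "area": "it",
}
bucket.patch()  # persist the labels on the bucket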

Data Flow Construction and Operation

This section details the data development lifecycle, from code versioning to deployment and pipeline operation.

Data Development Lifecycle (CI/CD)

Repositories

Use of Azure DevOps as a central hub for code versioning, including the Airflow DAGs repository and the Infrastructure as Code (IaC) repository with Terraform.

Our repositories on Azure DevOps serve as the central hub for managing and versioning our infrastructure and application code. They are critical for maintaining consistency, enabling collaboration, and ensuring seamless integration and deployment processes. Below are the links to the repositories for our Airflow DAGs and Infrastructure as Code (IaC) using Terraform.

Azure DevOps repositories

Airflow DAGs Repository

This repository contains all the DAGs (Directed Acyclic Graphs) used for orchestrating our data workflows in Google Cloud Composer. It includes the Python scripts and configurations necessary to manage our ETL processes.

Airflow DAGs repository structure

Repository Link: Airflow DAGs Repository

Airflow DAGs repository detail

IaC Repository (Terraform)

This repository contains our Terraform scripts used for provisioning and managing cloud infrastructure. It ensures that our infrastructure is defined, deployed, and maintained as code, supporting scalability and repeatability.

  • Repository Link: IaC Repository

IaC repository

CI/CD Pipelines

Automated build, test, and deployment via Azure Pipelines. Description of the workflow (branching, Python tests, PR approval, merge, and deployment to HML/PRD).

Content existing in: CI/CD on Azure DevOps, Airflow Flow, IAC - Terraform

Environment Access and Operation

Detailed instructions on how to access different components in each environment.

Airflow UI Access — DEV/HML

There are two ways to access Apache Airflow: directly via port 22 or via port 8080. Port 22 is the development port, used by developers to transfer files related to new pipelines, while port 8080 is where the Airflow web interface runs.

Airflow VM architecture

For internal security policy reasons, the VM has no external IP, so the Apache Airflow UI must be accessed through an IAP tunnel.

Steps to Access:

  1. Open the terminal.
  2. Instantiate the development project:
    1. Use this command: gcloud config set project pat-datalake-dev
  3. Start the IAP tunnel:
    1. Use this command: gcloud compute start-iap-tunnel vm-airflow-dev 8080 --project=pat-datalake-dev --zone=us-east1-b --local-host-port=localhost:8080
  4. The terminal should show that it is listening on port 8080 via the IAP tunnel; the message Listening on port [8080] will appear in the terminal.
  5. Open your internet browser and enter the address: http://localhost:8080
  6. Once this is done, Airflow will ask for a username and password for this environment.
    1. In development and approval environments, the user used is admin.
    2. Inside the Airflow development virtual machine (vm-airflow-dev), there is a file with some environment variables and among them is the Airflow user and password for dev and qa.
    3. To access or change this password, simply change this file that is located in the path: /home/airflow/config/variables.env

SSH Access (DEV/HML VMs)

To access port 22, the dev VM must be turned on and a specific SSH key must be configured on it: access the VM and add the SSH key of the workstation/engineer who will access this environment.

Steps to Access:

  1. Navigate to Compute Engine:
    1. Open the Google Cloud Console.
    2. Navigate to the Compute Engine section from the main menu.
  2. Select the option VM Instance:
    1. Choose the option VM Instance and a window will appear with the virtual machines.
  3. Select the VM vm-airflow-dev:
    1. Choose the virtual machine vm-airflow-dev and go to details.
  4. Navigate to the SSH Keys option:
    1. Click the Edit button to edit the VM and add the key.
    2. In the SSH Keys option, click the Add item button.
    3. Add the SSH key of the engineer who will access this environment.

SSH key configuration in VM

It is possible to include the SSH keys in the project’s IaC source code, so that the VMs are configured via code with this information.

Simply add the key to the SSH key file within the virtual machines’ IaC code: ssh-keys

SSH keys in IaC repository

With these steps it is possible to connect via port 22 to the VM and exchange files between environments and thus submit new files to the development environment.

It is possible to communicate with the VM using applications like WinSCP.

WinSCP:

  1. In WinSCP it is possible to create a connection by entering the internal IP of the VM and port 22.
  2. In the Advanced option in WinSCP:
    1. In the authentication part, include the engineer’s private SSH key.
    2. In the connection part, include a proxy tunnel to connect:
      1. Paste this command: gcloud compute start-iap-tunnel vm-airflow-dev 22 --listen-on-stdin --project=pat-datalake-dev --zone=us-east1-b

With this it is possible to connect via WinSCP with the virtual machine.

WinSCP connection

Google Composer — PRD

In the production environment, the only way to access Apache Airflow is through Cloud Composer, a scalable environment managed by GCP itself.

Steps to Access:

  1. Navigate to Composer:
    1. Open the Google Cloud Console.
    2. Navigate to the Composer section from the main menu.
  2. Inside the Environments section:
    1. Click on the Airflow option within Airflow Webserver.

Cloud Composer access

BigQuery Access

BigQuery is the scalable database with all data in all layers of the data lake. It can be accessed via a browser directly in the GCP cloud or via a terminal.

Steps to Access:

  1. Navigate to BigQuery:
    1. Open the Google Cloud Console.
    2. Navigate to the BigQuery section from the main menu.
  2. Select the option Studio:
    1. Choose the option Studio and a window will appear with a panel where you can access datasets and tables.

BigQuery Studio

Secret Manager Access

This GCP component is the password vault, where database access credentials can be stored. Credentials for databases, APIs, and systems, such as passwords and tokens, are kept here. Access is via the Google Cloud Console (Security section).

Steps to Access:

  1. Navigate to Security:
    1. Open the Google Cloud Console.
    2. Navigate to the Security section from the main menu.
  2. Select the option Secret Manager:
    1. Choose the option Secret Manager.
  3. To add a new secret:
    1. Click the button Create Secret.
  4. Add a name to the secret:
    1. If the secret is used during the Airflow pipeline, then it should start with the name airflow-connections-[CONNECTION_NAME].
    2. With the secret name defined, simply add credentials such as host, port, password, user, token, etc. to the value field.

Secret Manager
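
The same secret can be created programmatically. Below is a minimal sketch with the Secret Manager Python client, where the connection name and URI are placeholders.

from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
parent = "projects/pat-datalake-dev"

# Create a secret following the airflow-connections-[CONNECTION_NAME] convention (placeholder name).
secret = client.create_secret(
    parent=parent,
    secret_id="airflow-connections-oracle_hcm",
    secret={"replication": {"automatic": {}}},
)

# Store the credential payload, e.g. an Airflow-style connection URI (placeholder values).
client.add_secret_version(
    parent=secret.name,
    payload={"data": b"oracle://user:password@host:1521/service"},
)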


Data Sources and Integrations

This section covers the data sources that feed our Data Lake and specific integrations with other platforms.

Current Data Sources

Example of extraction from the World Bank public API, including indicators (GDP, inflation) and the list of Latin American countries.

Data was extracted from the World Bank's public API with the aim of providing information that may be useful in the business context of Patria Investments. The data obtained from the World Bank refers to GDP and inflation indicators and was collected for the following countries (a minimal extraction sketch follows the list):

  • ARGENTINA
  • BOLIVIA
  • BRAZIL
  • CHILE
  • COLOMBIA
  • COSTA RICA
  • CUBA
  • DOMINICAN REPUBLIC
  • ECUADOR
  • GUATEMALA
  • HONDURAS
  • HAITI
  • MEXICO
  • NICARAGUA
  • PANAMA
  • PERU
  • PARAGUAY
  • EL SALVADOR
  • URUGUAY
  • VENEZUELA
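
Below is a minimal, illustrative extraction sketch. The indicator codes (NY.GDP.MKTP.CD for GDP, FP.CPI.TOTL.ZG for inflation) and the response handling are assumptions to be validated against the World Bank API documentation.

import requests

BASE_URL = "https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"

def fetch_indicator(country: str, indicator: str, year: int) -> list:
    """Fetches one indicator for one country and year from the World Bank public API."""
    response = requests.get(
        BASE_URL.format(country=country, indicator=indicator),
        params={"format": "json", "date": str(year), "per_page": 100},
        timeout=30,
    )
    response.raise_for_status()
    metadata, rows = response.json()  # the API returns a [metadata, data] pair
    return rows or []

print(fetch_indicator("BRA", "NY.GDP.MKTP.CD", 2022))  # Brazil GDP (current US$), assumed indicator code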

Appendices

The appendices will contain detailed reference information that complements the main sections.

Appendix A: Complete Acronym Dictionary

An exhaustive list of all acronyms used in the document and in the architecture.

Content existing in: Architecture > Dictionary of Acronyms

Appendix B: Detailed Resource Naming by Environment

Complete tables with the specific naming of each GCP resource for DEV, HML, and PRD environments.

Content existing in: Architecture > Naming of resources in projects

Appendix C: Detailed Environment Diagrams

Visual diagrams of the architecture for Production (PRD), Homologation (HML), and Development (DEV) environments, offering a clear representation of the infrastructure.

Content existing in: Google Cloud Components > Environment diagram

Appendix D: Reference Links

Collection of links to code repositories (DAGs, IaC), relevant external documentation, and other tools.

Content existing in: Repository on Azure DevOps > Repository Link and Data Masking > Configuration File