At a glance
8
Certifications
6+
Projects built
2
Live deployments
PT
Based in Portugal
Data stack — end to end
pipeline.py — gabriel_alves
01
Ingest
REST APIs
CSV / Batch
Paginated I/O
02
Transform
PySpark
Pandas · NumPy
SQL
03
Orchestrate
Apache Airflow
GitHub Actions
Logging
04
ML-Ready
Feature Eng.
Model Eval.
Prediction Svc
05
Deploy
Render
AWS Cognito
WAF
Featured projects
Data Engineering · ELT
World Bank GDP Pipeline
Complete
End-to-end ELT pipeline on World Bank API data — paginated extraction, S3 Hive-style partitioning, PySpark transforms, Airflow orchestration, and a PostgreSQL star schema.
PySpark
Airflow
AWS S3
PostgreSQL
Python
Docker
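As a rough illustration of the Hive-style partitioning mentioned in this card, the sketch below builds an object key of the form layer/dataset/column=value/.../file. The function name, the layer and dataset names, and the partition columns (country, year) are assumptions made for the example, not the project's actual layout.

```python
from pathlib import PurePosixPath


def partition_key(layer: str, dataset: str, partitions: dict, filename: str) -> str:
    """Build a Hive-style object key: <layer>/<dataset>/<col>=<value>/.../<filename>.

    All names here are hypothetical; the real pipeline may partition differently.
    """
    # Each partition column becomes one "column=value" path segment, in dict order.
    segments = [f"{column}={value}" for column, value in partitions.items()]
    return str(PurePosixPath(layer, dataset, *segments, filename))


if __name__ == "__main__":
    print(partition_key("raw", "gdp", {"country": "PT", "year": 2023}, "part-0000.json"))
```

Keys shaped like this let engines such as Spark prune partitions by column value when reading from the bucket.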
ML Pipeline · Churn
Customer Churn Pipeline
▶ Demo
ML-integrated ETL for churn prediction. Deployed on Render with AWS Cognito authentication and a WAF security layer.
Python
AWS
ML
Render
NLP · Sentiment
NLP Sentiment Pipeline
Live
Text processing pipeline over product reviews driving a Streamlit dashboard with real-time filtering by category and time.
NLP
Streamlit
Neural nets
Analytics · SQL
Workforce SQL Analysis
Complete
Diagnostic analysis of employee data covering compensation equity, diversity metrics, and workforce stability insights.
SQL
DataCamp
Analytics
Main Projects
Data Engineering · ELT
World Bank GDP Pipeline
Complete
Designed and built a scalable end-to-end ELT pipeline for processing World Bank GDP data. Features modular paginated API extraction with retry logic and structured logging, a multi-layer AWS S3 architecture using Hive-style partitioning, and Apache Spark transformations covering cleaning, type casting, null handling, and aggregations. An automated data quality validation suite (null rates, value range checks, duplicate detection, row count drift) gates the pipeline before writes. Processed data is loaded into a PostgreSQL star schema via a JDBC staging layer and an idempotent SQL pipeline that populates dim_country, dim_time, dim_indicator, and fact_gdp. Daily workflows are orchestrated with Apache Airflow DAGs and containerised with Docker.
PySpark
Airflow
AWS S3
PostgreSQL
Python
Docker
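The four data quality checks named in this project (null rates, value range checks, duplicate detection, row count drift) could be sketched in plain Python as follows. This is a minimal illustration under assumed field names (country, year, gdp) and assumed thresholds; the actual suite runs alongside Spark and may differ in every detail.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    null_rate: float      # share of rows with a missing value
    out_of_range: int     # rows whose value is negative
    duplicates: int       # rows repeating an earlier key
    row_drift: float      # relative change in row count vs. the previous run

    def passed(self, max_null_rate: float = 0.05, max_drift: float = 0.20) -> bool:
        # Thresholds are illustrative assumptions, not the project's settings.
        return (
            self.null_rate <= max_null_rate
            and self.out_of_range == 0
            and self.duplicates == 0
            and abs(self.row_drift) <= max_drift
        )


def check_quality(rows, previous_row_count, key_fields=("country", "year"), value_field="gdp"):
    """Compute the four checks over a list of dict rows (hypothetical schema)."""
    total = len(rows)
    values = [row.get(value_field) for row in rows]
    nulls = sum(v is None for v in values)
    out_of_range = sum(v is not None and v < 0 for v in values)
    keys = [tuple(row.get(f) for f in key_fields) for row in rows]
    duplicates = total - len(set(keys))
    null_rate = nulls / total if total else 1.0
    drift = (total - previous_row_count) / previous_row_count if previous_row_count else 0.0
    return QualityReport(null_rate, out_of_range, duplicates, drift)


if __name__ == "__main__":
    sample = [
        {"country": "PT", "year": 2022, "gdp": 255.0},
        {"country": "PT", "year": 2023, "gdp": None},
        {"country": "PT", "year": 2023, "gdp": -1.0},
    ]
    report = check_quality(sample, previous_row_count=3)
    if not report.passed():
        print("Quality gate failed, skipping write:", report)
```

Gating on a single report object keeps the write step simple: the load runs only when `passed()` returns True.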
ML Pipeline · Churn
Customer Churn Pipeline
▶ Demo
End-to-end machine learning pipeline for customer churn prediction, covering data ingestion, preprocessing, feature engineering, model training, and deployment. Built reusable ETL components and integrated model evaluation with profit-based threshold optimization. Deployed on Render with AWS S3, Cognito, and WAF.
Python
ETL
AWS
ML
Render
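Profit-based threshold optimization, as mentioned in this project, generally means choosing the classification cutoff that maximises expected business value rather than accuracy. The sketch below is one simple way to do that; the payoff values (gain per correctly targeted churner, cost per wrongly targeted customer) and the function names are assumptions for illustration only.

```python
def expected_profit(y_true, y_prob, threshold, gain_tp=100.0, cost_fp=20.0):
    """Profit from targeting every customer whose churn probability >= threshold.

    gain_tp and cost_fp are hypothetical payoffs, not figures from the project.
    """
    profit = 0.0
    for actual, prob in zip(y_true, y_prob):
        if prob >= threshold:
            # A targeted true churner yields a gain; a targeted non-churner costs money.
            profit += gain_tp if actual == 1 else -cost_fp
    return profit


def best_threshold(y_true, y_prob, gain_tp=100.0, cost_fp=20.0, steps=101):
    """Grid-search thresholds in [0, 1] and return the most profitable one."""
    candidates = [i / (steps - 1) for i in range(steps)]
    return max(candidates, key=lambda t: expected_profit(y_true, y_prob, t, gain_tp, cost_fp))


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.1]
    t = best_threshold(labels, scores)
    print("threshold:", t, "profit:", expected_profit(labels, scores, t))
```

With asymmetric payoffs like these, the most profitable cutoff often sits well below 0.5, which is why the threshold is tuned on profit rather than left at the default.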
NLP · Sentiment
NLP Sentiment Analysis Pipeline
Live
Text processing pipeline to clean, transform, and analyze large-scale product review data. Neural network for sentiment classification integrated into a Streamlit dashboard with dynamic filtering by product, category, and time. Production-ready workflow combining scalable text processing with ML-driven business insights.
NLP
Streamlit
Neural nets
Python
Other Projects
Competition · DataCamp
Cleaning Data & The Skies
Competition
Cleaned and preprocessed messy real-world flight data to extract business insights and answer key analytical questions.
Python
Data Cleaning
EDA
Analytics · SQL
SQL Workforce Data Analysis
Complete
In-depth SQL analysis delivering intelligence on workforce stability, compensation equity, and diversity from historical employee data.
SQL
Analytics
EDA · Finance
S&P 500 Financial EDA
Complete
Analysis of S&P 500 company distribution across US states, with focus on sector concentration patterns within regions.
Python
Pandas
Matplotlib
Finance
Technical Skills
Data Engineering & ETL
7 skills
Programming & Processing
6 skills
Orchestration & Monitoring
5 skills
Cloud & Deployment
5 skills
ML Integration & Analysis
7 skills
Soft Skills
Critical Thinking & Problem Solving
Analytical Mindset & Data-Driven Thinking
Attention to Detail
Curiosity & Learning Agility
Ability to Present Results & Insights
Persistence & Self-Discipline
About Me
Building reliable, scalable data systems that bridge Engineering and Machine Learning.
I am a Data Engineer with a background in Informatics Engineering, focused on building scalable data pipelines and production-ready data systems.
I have hands-on experience designing ETL workflows, transforming large datasets, and preparing data for machine learning applications. I have built end-to-end pipelines covering data ingestion, transformation, modeling, and deployment.
I recently completed the Data Engineer Professional Certification, where I worked with tools such as Airflow and logging systems and strengthened my understanding of workflow orchestration and pipeline monitoring.
Currently seeking a junior Data Engineer role to contribute to data infrastructure and grow in distributed systems and orchestration.
Quick Info
Education
Bachelor's in Informatics Engineering
ESTG-IPVC · Instituto Politécnico de Viana do Castelo
Location
Portugal
Open to remote · Hybrid (north of Portugal)
Looking for
Junior Data Engineer
Data infrastructure · Distributed systems · Orchestration