GSoC 2026 Projects

Explore the project proposals for Google Summer of Code. Each project offers a unique opportunity to contribute to open source and develop advanced technical skills.

Proposal 1
Size: Medium (175 hours) · Difficulty: Medium

Pipeline for Data Quality Verification and Anomaly Detection in Governmental Data with Automated Reports

Currently, GovHub performs daily ingestions of data from various public sources, organizing it into data layers (bronze, silver, and gold). Although some isolated validations are in place, systematically identifying anomalies, inconsistencies, and data quality issues still depends on manual inspection or ad hoc checks.

The objective of this project is to build an automated data quality verification module for the databases of the core government systems (ComprasNet, SIAFI, TransfereGov, SIAPE, and SIORG), capable of analyzing daily ingestions and generating consolidated periodic reports (weekly or monthly) highlighting statistical anomalies, semantic inconsistencies, and potential structural issues. The focus is not on automatically correcting the data, but on providing visibility, traceability, and issue prioritization for pipeline maintainers.

As the desired final state, the student should deliver a system integrated into the existing GovHub pipeline that applies automated statistical and semantic checks to processed tables and produces structured, versioned, and auditable reports, strengthening data governance and the reliability of published data.
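
To make the kind of check concrete, the sketch below (Python with Pandas, the stack listed under Required Skills) flags columns with a high null rate and a daily row count that deviates sharply from recent history. Table names, thresholds, and the report fields are illustrative assumptions, not the project's final design.

```python
# Minimal sketch of two checks the module could run on an ingested table:
# a per-column null-rate threshold and a z-score test on the daily row count.
# Table names, thresholds, and the report fields are illustrative.
import pandas as pd

def check_null_rate(df: pd.DataFrame, table: str, threshold: float = 0.2) -> list[dict]:
    """Flag columns whose share of nulls exceeds the threshold."""
    issues = []
    for column, rate in df.isna().mean().items():
        if rate > threshold:
            issues.append({
                "table": table,
                "issue": "high_null_rate",
                "detail": f"{column}: {rate:.1%} null",
                "severity": "high" if rate > 0.5 else "medium",
            })
    return issues

def check_row_count(history: pd.Series, today_count: int, table: str) -> list[dict]:
    """Flag today's row count when it deviates strongly from recent history."""
    mean, std = history.mean(), history.std()
    if std > 0 and abs(today_count - mean) > 3 * std:
        return [{
            "table": table,
            "issue": "row_count_anomaly",
            "detail": f"{today_count} rows vs. historical mean {mean:.0f}",
            "severity": "high",
        }]
    return []

# Toy data; in the real pipeline the DataFrame would come from PostgreSQL.
df = pd.DataFrame({"contract_id": [1, 2, None, 4], "amount": [10.0, None, None, 3.5]})
report = check_null_rate(df, "silver.comprasnet_contracts")
report += check_row_count(pd.Series([1000, 1020, 980, 1010]), 120, "silver.comprasnet_contracts")
print(pd.DataFrame(report))
```

In the full module, records like these would be collected across all monitored tables and rendered into the periodic report described below.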

Expected Outcomes

  • Functional data quality verification module integrated into the GovHub ingestion pipeline.
  • Automatic generation of periodic reports (e.g., weekly) listing anomalies, affected tables, type of issue, and severity.
  • Technical documentation of the module and usage examples.

Required Skills

Languages: Python, SQL
Knowledge: Basic statistics, exploratory data analysis, data engineering
Tools: Pandas, PostgreSQL (or equivalent)

Nice to Have Skills

  • Familiarity with dbt
  • Docker
  • Experience with data pipelines and Airflow

Mentors

Mateus de Castro (@mat054), Davi de Aguiar (@davi-aguiar-vieira)

Proposal 2
Size: Large (350 hours) · Difficulty: Hard

RAG for Natural Language Queries over Gold Data

Transforming complex governmental data into clear answers through natural language.

Brazil is a global reference in budgetary transparency, with milestones such as the Fiscal Responsibility Law and the Freedom of Information Law (LAI). However, a paradox remains: despite the vast availability of open data, these datasets often become "unexplored digital archives." The technical barrier is high; to monitor public policies in GovHub, a citizen or researcher must master SQL and understand complex data models (bronze, silver, and gold layers), which restricts active transparency to technical specialists.

This project proposes to democratize access by transforming natural language into structured queries. The goal is not to create a generic chatbot that hallucinates answers, but a precise analytical agent.

This project aims to develop a system based on Retrieval-Augmented Generation (RAG) that enables natural language queries to be executed directly over qualified GovHub data. The system should leverage metadata, table descriptions, and vectorized schemas to feed the LLM, automatically generate and manage SQL queries, and return structured responses.

The desired final state is a tool that abstracts the complexity of SQL queries while maintaining traceability of user questions, generated queries, and the data used—operating not as a generic chatbot, but as an analytical component integrated into the data pipeline and ensuring data governance.
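
As a rough illustration of the generation step, the sketch below builds a schema-aware prompt and asks an LLM for a single SELECT statement using Haystack 2.x (part of the required stack). The model name, prompt wording, and the table metadata passed as `schema` are placeholder assumptions, not the project's actual configuration.

```python
# Minimal sketch of the question-to-SQL step with Haystack 2.x. Assumes an
# OPENAI_API_KEY in the environment; model, prompt, and schema are illustrative.
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

TEMPLATE = """You translate questions about GovHub gold-layer tables into a single
read-only PostgreSQL SELECT statement. Use only these tables and columns:
{{ schema }}

Question: {{ question }}
SQL:"""

pipeline = Pipeline()
pipeline.add_component("prompt", PromptBuilder(template=TEMPLATE))
pipeline.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))
pipeline.connect("prompt.prompt", "llm.prompt")

result = pipeline.run({
    "prompt": {
        "schema": "gold.contracts(id, supplier, amount, signed_at)",  # illustrative schema
        "question": "What was the total contracted amount per supplier in 2025?",
    }
})
print(result["llm"]["replies"][0])  # generated SQL, to be validated before execution
```

In the full pipeline, the schema string would come from a retrieval step over vectorized table descriptions (for example, stored in Qdrant) rather than being passed in by hand.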

Expected Outcomes

  • Functional RAG pipeline (Haystack) with SQL generation, query execution, and final response, operating on the GovHub database and prioritizing the gold layer.
  • Integration with Open WebUI to enable practical user interaction with the model and the RAG+SQL pipeline.
  • Full observability with Langfuse, including traceability of prompts, retrieved documents, generated SQL, execution logs, metrics, and technical documentation of the system.
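
Because the description above requires traceability of user questions, generated queries, and the data used, the sketch below shows one way to validate LLM-generated SQL as read-only and emit a trace record per execution. It uses sqlite3 only to stay self-contained; the real system would target PostgreSQL and ship traces to Langfuse.

```python
# Minimal sketch of a guardrail applied before executing LLM-generated SQL,
# plus a trace record linking question, query, and result size.
import re
import sqlite3
import uuid
from datetime import datetime, timezone

def assert_read_only(sql: str) -> str:
    """Accept a single SELECT statement; reject anything that could modify data."""
    cleaned = sql.strip().rstrip(";")
    if ";" in cleaned:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"(?is)^\s*select\b", cleaned):
        raise ValueError("only SELECT statements are allowed")
    return cleaned

def run_traced_query(conn, question: str, generated_sql: str) -> dict:
    """Execute a validated query and return a trace record for observability."""
    sql = assert_read_only(generated_sql)
    rows = conn.execute(sql).fetchall()
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "sql": sql,
        "row_count": len(rows),
    }

# Toy example: an in-memory table standing in for a gold-layer view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (supplier TEXT, amount REAL)")
conn.execute("INSERT INTO contracts VALUES ('ACME', 1200.0), ('Beta Ltda', 300.0)")
trace = run_traced_query(conn, "Total per supplier?",
                         "SELECT supplier, SUM(amount) FROM contracts GROUP BY supplier;")
print(trace)
```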

Required Skills

Languages: Python, SQL
Frameworks/Stacks: Haystack (pipelines), LLM APIs, Qdrant (Vector Database)
Theoretical Knowledge:
  • Data modeling and SQL queries
  • RAG fundamentals (embeddings, retrieval, ranking)
  • Security and best practices for executing LLM-generated SQL

Nice to Have Skills

  • Familiarity with Open WebUI and application deployment
  • Experience with vector databases / stores (Qdrant)
  • Knowledge of hybrid retrieval (BM25 + embeddings)
  • Docker and observability (tracing, metrics)
  • Experience with Langfuse or similar tools

Mentors

Mateus de Castro (@mat054), Davi de Aguiar (@davi-aguiar-vieira)

Proposal 3
Size: Large (350 hours) · Difficulty: Hard

Data Lakehouse (MinIO + Iceberg) for Versioning (Time Travel) and AI Consumption in GovHub

Currently, GovHub operates on a solid architecture for analytical processing and consumption, with processed data organized into Medallion architecture layers and stored in Postgres. However, as a database oriented toward the current state (a typical OLTP characteristic), Postgres lacks:

  • Native historical versioning (time travel) in a standardized and efficient manner;
  • Traceability and reproducibility of datasets for AI consumption (ML, embeddings, RAG);
  • Cost-efficient scalability for large volumes and frequent reprocessing;
  • Retention governance and auditing at the snapshot and file level.

This project proposes evolving the GovHub architecture toward a versioned Data Lakehouse, built on widely adopted open-source technologies: MinIO (object storage) and Apache Iceberg (table format), with Spark as the processing engine and Trino as the federated query layer. The goal is to enable time travel, schema/partition evolution, and metadata governance, while keeping the team experience close to the current Airflow + dbt workflow, avoiding significant rework.

The scope also includes analyzing and documenting essential operational considerations for the lakehouse, such as retention and maintenance of Iceberg metadata, strategies to handle small files, organization of layers, catalog patterns, and infrastructure patterns for execution on Kubernetes.
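
As a minimal sketch of the core mechanics, the PySpark snippet below configures an Iceberg catalog backed by MinIO, writes a partitioned table, and reads an earlier snapshot (time travel). Catalog and bucket names, credentials, and the table schema are illustrative, and the matching iceberg-spark-runtime and hadoop-aws packages must be on the Spark classpath.

```python
# Minimal sketch: an Iceberg catalog on MinIO, one partitioned table, one append
# (one snapshot), and a read of an earlier snapshot. Names and credentials are
# illustrative.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("govhub-lakehouse-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://govhub-lake/warehouse")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")   # MinIO speaks the S3 API
    .config("spark.hadoop.fs.s3a.access.key", "minio")             # illustrative credentials
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.silver")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.silver.contracts (
        id BIGINT, supplier STRING, amount DOUBLE, ingested_at DATE
    ) USING iceberg PARTITIONED BY (ingested_at)
""")

# Each append creates a new snapshot that can later be queried via time travel.
df = (spark.createDataFrame([(1, "ACME", 1200.0)], ["id", "supplier", "amount"])
      .withColumn("ingested_at", F.current_date()))
df.writeTo("lake.silver.contracts").append()

# Time travel: inspect the snapshot history and read the table as of an older snapshot.
spark.sql("SELECT snapshot_id, committed_at FROM lake.silver.contracts.snapshots").show()
oldest = spark.sql(
    "SELECT snapshot_id FROM lake.silver.contracts.snapshots ORDER BY committed_at LIMIT 1"
).first()["snapshot_id"]
spark.read.option("snapshot-id", oldest).table("lake.silver.contracts").show()
```

Schema and partition evolution are metadata operations in Iceberg (e.g., ALTER TABLE ... ADD COLUMN), so they do not require rewriting existing data.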

Expected Outcomes

  • Functional lakehouse architecture on Kubernetes with MinIO + Iceberg, including catalog configuration and data organization standards.
  • Ingestion/processing pipeline (e.g., using Spark) writing Iceberg tables with versioning and time travel enabled, including a minimal partitioning strategy and schema evolution.
  • Operational federated querying via Trino, enabling analytical consumption of Iceberg tables (and, if necessary, integration with existing sources such as Postgres).
  • Proposed and documented integration with Airflow + dbt, showing how the current workflow adapts to the lakehouse with minimal day-to-day changes.
  • Documented and/or automated strategy for handling small files and compaction (optimize), with batch routines.
  • Complete technical documentation covering architecture, components, decisions, operational costs, retention policies, observability, security, and an operational guide.
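
For the small-file and metadata-retention concerns described above, a batch maintenance routine could call Iceberg's built-in Spark procedures, as sketched below. The table list and the 30-day retention window are illustrative policy choices, and `spark` is the session configured in the previous sketch; in practice this would run as a scheduled Airflow task.

```python
# Minimal sketch of batch maintenance: compact small files, expire old snapshots,
# and clean orphan files using Iceberg's Spark procedures. Policies are illustrative.
from datetime import datetime, timedelta, timezone

TABLES = ["silver.contracts"]                       # would be discovered from the catalog
RETENTION = datetime.now(timezone.utc) - timedelta(days=30)

for table in TABLES:
    # Rewrite many small data files into fewer, larger ones (compaction / "optimize").
    spark.sql(f"CALL lake.system.rewrite_data_files(table => '{table}')")
    # Drop snapshots older than the retention window and remove data files
    # no longer referenced by any retained snapshot.
    spark.sql(
        f"CALL lake.system.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{RETENTION:%Y-%m-%d %H:%M:%S}')"
    )
    # Remove files left behind by failed or aborted writes.
    spark.sql(f"CALL lake.system.remove_orphan_files(table => '{table}')")
```

Scheduled periodically, routines like this keep query performance and storage costs predictable as daily ingestions accumulate.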

Required Skills

Languages: Python, SQL
Frameworks/Stacks: Spark (processing), Trino (query engine), MinIO (S3-compatible storage), Iceberg (table format)
Theoretical Knowledge:
  • Data architectures (lakehouse, bronze/silver/gold layers);
  • Fundamentals of transactional tables in data lakes (snapshots, manifests, catalogs);
  • Data modeling/partitioning and performance (pruning, small files, compaction);
  • Best practices for operating on Kubernetes (storage, deployment, security).

Nice to Have Skills

  • Experience with Airflow and DAG patterns for data pipelines
  • Familiarity with dbt and layered analytical modeling
  • Experience with observability (logs, metrics, tracing)
  • Knowledge of catalogs and governance (e.g., Nessie, Hive Metastore, Glue-like patterns)
  • Experience with object storage and S3 tuning (MinIO policies, lifecycle, performance)

Mentors

Arthur Alves Melo (@arthrok)

Proposal 4
Size: Large (350 hours) · Difficulty: Hard

Pipeline for Extraction, Structuring, and Ingestion of Unstructured Data into a Governmental Data Lakehouse

A significant portion of relevant government information remains in unstructured documents, such as PDFs, presentations, and technical reports. Although these documents are publicly available, their content remains difficult to access for systematic analysis, integration with analytical pipelines, and reuse in data-driven applications.

This project proposes developing a modular, reprocessable pipeline to ingest unstructured documents into GovHub, combining visual layout analysis, region-based OCR, and semantic enrichment. The pipeline should automatically identify the structural elements of documents (titles, paragraphs, tables, and images), extract textual content in a contextualized manner, and transform it into semi-structured and structured data.

Unlike approaches focused on immediate indexing or search, the main focus of this project is the structuring and persistence of this data in a data lakehouse, organized into well-defined layers (raw, parsed, and enriched), enabling governance, versioning, auditing, and reprocessing. The desired final state is a structured data foundation derived from documents, ready to support analyses, reports, and future analytical or AI-based applications within the GovHub ecosystem.
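
As an illustration of the parse step, the sketch below crops layout regions from a page image, runs OCR on each region with pytesseract, and writes normalized records with provenance metadata to the parsed layer. `detect_regions` is a hypothetical placeholder for a real layout-analysis model, and the paths, language pack, and record schema are assumptions.

```python
# Minimal sketch of the parse step: crop detected layout regions, OCR each region,
# and emit normalized records with provenance for the parsed layer.
import json
from datetime import datetime, timezone
from pathlib import Path

import pytesseract
from PIL import Image

def detect_regions(image: Image.Image) -> list[dict]:
    """Hypothetical layout analysis: return labelled bounding boxes for the page.
    A real implementation would call a layout model; here the whole page is
    treated as a single paragraph so the sketch stays runnable."""
    return [{"type": "paragraph", "bbox": (0, 0, image.width, image.height)}]

def parse_page(image_path: str, document_id: str, page: int) -> list[dict]:
    """Produce one normalized record per detected region, with provenance metadata."""
    image = Image.open(image_path)
    records = []
    for order, region in enumerate(detect_regions(image)):
        crop = image.crop(region["bbox"])
        records.append({
            "document_id": document_id,          # traceability back to the raw document
            "page": page,
            "order": order,
            "region_type": region["type"],
            "bbox": list(region["bbox"]),
            "text": pytesseract.image_to_string(crop, lang="por").strip(),  # needs the Portuguese pack
            "parsed_at": datetime.now(timezone.utc).isoformat(),
        })
    return records

# Persist to the parsed layer as JSON Lines; the raw document stays untouched in raw/.
records = parse_page("raw/relatorio_tecnico_p1.png", document_id="relatorio_tecnico", page=1)
out = Path("parsed/relatorio_tecnico/page_0001.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text("\n".join(json.dumps(r, ensure_ascii=False) for r in records))
```

An enrichment step (for example, LLM-based classification or semantic tagging of each region) would then read these records and write to the enriched layer, keeping every stage reprocessable from the raw document.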

Expected Outcomes

  • Functional pipeline for ingesting unstructured documents, including layout analysis, region-based OCR, and generation of normalized intermediate structures.
  • Structured and enriched data persisted in a data lakehouse, organized into layers, with metadata, versioning, and traceability back to the original document.
  • Complete technical documentation of the pipeline, including quality metrics, logs, validation criteria, and guidelines for reprocessing and data model evolution.

Required Skills

Languages: Python
Frameworks/Stacks: Document processing pipelines, OCR libraries, LLM APIs
Theoretical Knowledge:
  • Unstructured data processing
  • OCR concepts and document layout analysis
  • Fundamentals of NLP and semantic enrichment
  • Principles of data engineering and data lakehouse architectures (layers, versioning, and governance)

Nice to Have Skills

  • Experience with document layout analysis
  • Knowledge of analytical data modeling
  • Familiarity with data lakehouse architectures
  • Docker and observability practices (logs, metrics, tracing)
  • Experience with data governance and auditing

Mentors

Mateus de Castro (@mat054), Davi de Aguiar (@davi-aguiar-vieira)