Pipeline for Data Quality Verification and Anomaly Detection in Governmental Data with Automated Reports
Currently, GovHub ingests data daily from various public sources, organizing it into data layers (bronze, silver, and gold). Although some isolated validations are in place, systematic identification of anomalies, inconsistencies, and data quality issues still depends on manual inspection or ad hoc checks.
The objective of this project is to build an automated data quality verification module for the databases of the core government systems (ComprasNet, SIAFI, TransfereGov, SIAPE, and SIORG). The module should analyze daily ingestions and generate consolidated periodic reports (weekly or monthly) highlighting statistical anomalies, semantic inconsistencies, and potential structural issues. The focus is not on automatically correcting the data but on giving pipeline maintainers visibility, traceability, and a way to prioritize issues.
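To make "statistical anomalies" concrete, one simple check the module could apply is flagging days whose ingested row counts deviate sharply from recent history. A minimal sketch in Python, assuming the pipeline can supply per-table daily row counts; the function name and figures below are illustrative, not part of GovHub:

```python
import statistics

def zscore_anomaly(history: list[int], latest: int, threshold: float = 3.0) -> bool:
    """Flag `latest` as anomalous if it deviates more than `threshold`
    standard deviations from the historical daily row counts."""
    if len(history) < 2:
        return False  # not enough history to estimate variability
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean  # any change from a perfectly constant series
    return abs(latest - mean) / stdev > threshold

# Illustrative only: a week of row counts for a hypothetical silver-layer table.
history = [10_412, 10_389, 10_455, 10_401, 10_447, 10_398]
print(zscore_anomaly(history, latest=3_120))  # True: sudden volume drop
```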
As the desired final state, the student should deliver a system integrated into the existing GovHub pipeline that applies automated statistical and semantic checks to processed tables and produces structured, versioned, and auditable reports, supporting data governance and the reliability of published data.
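"Structured, versioned, and auditable" could be as simple as one immutable JSON document per run, stamped with an explicit schema version. A sketch of one possible record format, assuming JSON as the output medium; the field names and paths are hypothetical, not GovHub's actual schema:

```python
import json
import os
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class QualityFinding:
    table: str       # fully qualified table, e.g. "silver.siafi_empenhos" (illustrative)
    check: str       # identifier of the check that fired
    issue_type: str  # "statistical" | "semantic" | "structural"
    severity: str    # "low" | "medium" | "high"
    detail: str      # human-readable description for the maintainer

def write_report(findings: list[QualityFinding], run_date: date) -> str:
    """Serialize the run's findings into a versioned JSON report file."""
    report = {
        "report_version": "1.0",
        "run_date": run_date.isoformat(),
        "findings": [asdict(f) for f in findings],
    }
    os.makedirs("reports", exist_ok=True)
    path = f"reports/quality_{run_date.isoformat()}.json"
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, ensure_ascii=False, indent=2)
    return path
```

Writing a new file per run, rather than updating a single report in place, is what keeps the history auditable.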
Expected Outcomes
- Functional data quality verification module integrated into the GovHub ingestion pipeline (see the orchestration sketch after this list).
- Automatic generation of periodic reports (e.g., weekly) listing anomalies, affected tables, type of issue, and severity.
- Technical documentation of the module and usage examples.
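Given that Airflow appears among the skills below, one natural integration point for the module is a dedicated DAG that runs the checks and emits the periodic report. A minimal sketch assuming Airflow 2.4+ (where `schedule` replaced `schedule_interval`); the DAG id, task id, and callable are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_quality_checks(**context):
    # Placeholder: load the period's tables, apply the registered statistical
    # and semantic checks, and persist the consolidated report.
    ...

with DAG(
    dag_id="govhub_data_quality",  # hypothetical id
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",            # matches the weekly reporting cadence
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_checks_and_report",
        python_callable=run_quality_checks,
    )
```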
Required Skills
Nice to Have Skills
- Familiarity with dbt
- Familiarity with Docker
- Experience with data pipelines and Airflow