Validation and Error Detection in Relational Data using a Hybrid Rule-based System

Daniel Andrew Shane Chayono; Johan Jimmy Carter Tambotoh

doi:10.32520/stmsi.v15i6.6561

Abstract

Relational databases form the backbone of modern information systems. However, data quality issues, such as duplicate records, invalid formats, missing values, and cross-table inconsistencies, can significantly reduce the accuracy of data-driven decision-making. Conventional rule-based validation is effective for detecting structured errors but has limited capability in identifying ambiguous errors, such as typographical variations in entity names. This study proposes a hybrid rule-based system that combines SQL triggers for structured error detection with fuzzy matching using the RapidFuzz Python library to identify semantically similar records across multiple relational database tables. The proposed system was implemented using six primary database tables, six corresponding quarantine tables, and a centralized error_log table. The evaluation was conducted using a synthetic dataset containing 2,775 records distributed across the six tables. The dataset was systematically generated using the generate_dataset.py script, with various intentionally injected data quality issues to enable accurate verification of the detection results. The experimental results show that the proposed system detected 472 data quality issues, with 297 records automatically moved to the quarantine tables. The rule-based component identified 311 errors (65.9%), including format violations, negative values, and referential integrity violations. Meanwhile, the fuzzy matching component detected 127 semantic errors that could not be identified using SQL rules alone, including 112 duplicate customer names, 7 similar product names, and 5 inconsistent product categories. On the experimental dataset, the proposed hybrid approach detected 34.1% more data quality issues than a rule-based validation approach alone. 
These findings demonstrate that integrating rule-based validation with fuzzy matching substantially improves error detection capability in relational databases, particularly for semantic inconsistencies that are difficult to capture using conventional validation rules.
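
The two detection stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes SQLite in place of the unspecified database engine, a hypothetical customers table with a simple email-format rule, and difflib.SequenceMatcher from the Python standard library as a stand-in for RapidFuzz (a RapidFuzz scorer such as rapidfuzz.fuzz.ratio, which scores on a 0 to 100 scale, could take its place). The table names, the sample rows, and the 0.85 threshold are illustrative assumptions.

```python
import sqlite3
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical schema: one primary table, its quarantine table, and a shared error log.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE customers_quarantine (id INTEGER, name TEXT, email TEXT);
CREATE TABLE error_log (table_name TEXT, record_id INTEGER, error_type TEXT);

-- Rule-based stage: a trigger logs and quarantines rows with a malformed email.
CREATE TRIGGER trg_customers_email AFTER INSERT ON customers
WHEN NEW.email NOT LIKE '%_@_%._%'
BEGIN
    INSERT INTO error_log VALUES ('customers', NEW.id, 'invalid_email');
    INSERT INTO customers_quarantine VALUES (NEW.id, NEW.name, NEW.email);
    DELETE FROM customers WHERE id = NEW.id;
END;
"""

# Illustrative rows only; they are not taken from the paper's dataset.
SAMPLE_ROWS = [
    (1, "Jonathan Smith", "jon@example.com"),
    (2, "Jonathon Smith", "jsmith@example.com"),   # likely typo variant of row 1
    (3, "Maria Garcia", "maria-at-example.com"),   # fails the email rule
    (4, "Wei Chen", "wei@example.com"),
]


def build_db():
    """Create an in-memory database with the schema and trigger above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def similar_name_pairs(conn, threshold=0.85):
    """Fuzzy stage: return (id_a, id_b, score) for names scoring at or above threshold."""
    rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    pairs = []
    for (id_a, name_a), (id_b, name_b) in combinations(rows, 2):
        score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 2)))
    return pairs


def run_demo():
    """Insert the sample rows, then collect the results of both stages."""
    conn = build_db()
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", SAMPLE_ROWS)
    errors = conn.execute("SELECT * FROM error_log").fetchall()
    quarantined = [r[0] for r in conn.execute("SELECT id FROM customers_quarantine")]
    pairs = similar_name_pairs(conn)
    conn.close()
    return errors, quarantined, pairs


if __name__ == "__main__":
    errors, quarantined, pairs = run_demo()
    print("rule-based errors:", errors)
    print("quarantined ids:", quarantined)
    print("similar name pairs:", pairs)
```

In this sketch the trigger catches the structurally invalid row at insert time, while the near-duplicate names pass every SQL rule and surface only in the pairwise similarity pass. Comparing all pairs is quadratic in the number of rows, so a larger table would likely need blocking or a candidate-selection step before scoring.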

Keywords

data quality; duplicate detection; fuzzy matching; relational database; SQL trigger

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
