Ensuring Data Quality in Complex Data Engineering Workflows

Chittaranjan Pradhan

Published: May 30, 2024

Abstract

As big data and advanced analytics have driven up the volume, speed, and variety of data generation, ensuring data quality has become one of the foundational challenges in data engineering. Poor data quality undermines reliable conclusions and effective decisions, and it can lead to operational failures. This paper focuses on keeping data intact at every stage, from acquisition and storage through transformation and consumption. It covers critical problems such as data inconsistencies, NULL values, duplication, and schema drift, along with actionable steps to resolve them. The techniques reviewed include data validation, anomaly detection, data profiling, and automated data pipelines with real-time monitoring. The study also explores how AI and ML can help with data quality management tasks such as error correction, anomaly detection, and the prediction of data problems. It further discusses governance frameworks and industry best practices (such as DataOps and MDM) that help set standards for good-quality data. By systematically improving the reliability, accuracy, and consistency of their datasets, businesses can improve their decision-making and overall performance. Correctness, consistency, and integrity of data are becoming more and more critical. Managing data quality has never been easy, and modern organisations face the additional challenge of huge and heterogeneous datasets. The paper addresses key aspects of data quality in complex engineering processes, focusing on data collection, processing, and integration, and it walks through best practices for tackling common data quality issues. It also examines the use of automation and machine learning in high-volume data pipelines, showing how these techniques could enhance data integrity and make quality easier to ensure.
Finally, the paper emphasizes collaboration among data scientists, engineers, and other stakeholders in the organisation. When they work together to make data quality a priority throughout the data lifecycle, more reliable insights and better business outcomes can be realised.

How to Cite

Ensuring Data Quality in Complex Data Engineering Workflows (C. Pradhan, Trans.). (2024). International Journal of Creative Research In Computer Technology and Design, 6(6). https://jrctd.in/index.php/IJRCTD/article/view/98

Issue

Vol. 6 No. 6 (2024): IJCRCTD

Section

Articles

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
