A Comparison of Multiple Imputation Methods for Recovering Missing Data in Hydrological Studies

Fatimah Bibi Hamzah, Firdaus Mohd Hamzah, Siti Fatin Mohd Razali, Hafiza Samad

Abstract


Missing data are a common problem in hydrological studies, so data reconstruction is critical, especially when all available records, including incomplete ones, must be used. Missing data can also affect the results of statistical analysis, and the variability in the data may not be properly estimated. This study therefore compared the performance of three imputation methods in predicting recurrence in streamflow datasets: robust random regression imputation (RRRI), k-nearest neighbours (k-NN), and classification and regression tree (CART). Complete historical daily streamflow data from 2012 to 2014 were used as the training dataset to assess and validate the effectiveness of the imputation methods in addressing missing streamflow data. All three methods, each coupled with multiple linear regression (MLR), were then used to restore streamflow rates in Malaysia's Langat River Basin from 1978 to 2016. The effectiveness of the estimation techniques was evaluated using the Nash-Sutcliffe efficiency coefficient (CE), root-mean-square error (RMSE), and mean absolute percentage error (MAPE). The results showed that RRRI coupled with MLR (RRRI-MLR) had the lowest RMSE and MAPE values, outperforming all other techniques tested for filling missing data in daily streamflow datasets. This indicates that RRRI-MLR is the best of the compared methods for dealing with missing data in streamflow datasets.
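The three evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of CE, RMSE, and MAPE; the exact formulation used in the paper is not given on this page, so the function names and the sample flows below are assumptions made purely for illustration.

```python
import math

def rmse(observed, estimated):
    """Root-mean-square error between observed and estimated flows."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)

def mape(observed, estimated):
    """Mean absolute percentage error in percent; observed values must be non-zero."""
    n = len(observed)
    return 100.0 * sum(abs((o - e) / o) for o, e in zip(observed, estimated)) / n

def nash_sutcliffe(observed, estimated):
    """Nash-Sutcliffe efficiency (CE): 1 minus the ratio of the squared
    residuals to the squared deviations of the observations from their mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread

# Hypothetical daily flows (m^3/s), NOT data from the study.
observed = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 37.0]

print(f"RMSE = {rmse(observed, estimated):.3f}")
print(f"MAPE = {mape(observed, estimated):.3f} %")
print(f"CE   = {nash_sutcliffe(observed, estimated):.3f}")
```

Lower RMSE and MAPE indicate a closer match between imputed and observed values, while CE approaches 1 as the fit improves. That is the sense in which the abstract ranks RRRI-MLR first.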

 

DOI: 10.28991/cej-2021-03091747


Keywords


Missing Data; Streamflow; Robust Regression; CART; k-NN; MLR.






Copyright (c) 2021 Fatimah Bibi Hamzah, Firdaus Mohd Hamzah, Siti Fatin Mohd Razali

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.