A Comparison of Multiple Imputation Methods for Recovering Missing Data in Hydrological Studies

Missing data is a common problem in hydrological studies; therefore, data reconstruction is critical, especially when it is crucial to employ all available resources, even incomplete records. Furthermore, missing data could have an impact on statistical analysis results, and the amount of variability in the data would not be fittingly anticipated. As a result, this study compared the performance of three imputation methods in predicting recurrence in streamflow datasets: robust random regression imputation (RRRI), k-nearest neighbours (k-NN), and classification and regression tree (CART). Furthermore, entire historical daily streamflow data from 2012 to 2014 (as training dataset) were utilised to assess and validate the effectiveness of the imputation methods in addressing missing streamflow data. Following that, all three methods coupled with multiple linear regression (MLR), were used to restore streamflow rates in Malaysia's Langat River Basin from 1978 to 2016. The estimation techniques effectiveness was evaluated using metrics inclusive of the Nash-Sutcliffe efficiency coefficient (CE), root-mean-square error (RMSE), and mean absolute percentage error (MAPE). The results confirmed that RRRI coupled with MLR (RRRI-MLR) had the lowest RMSE and MAPE values, outperforming all other techniques tested for filling missing data in daily streamflow datasets. This indicates that the RRRI-MLR is the best method for dealing with missing data in streamflow datasets.


Introduction
Missing data in hydrological models is a prevalent problem owing to natural disasters, improper operation, and battery drainage, which restrict hydrological analysis [1,2] and has remained unsolved regardless of advancements in missing data imputation techniques over the years [3]. Missing data reconstruction is crucial, especially in an event where all available resources, including partial information, must be used. The lack of particular data can pose severe problems in hydrological studies, resulting in uncertainty and low efficiency of water resource systems [4][5][6].
Even minor data breaches can prohibit the computation of significant summary statistics and hydrological indexes, such as monthly runoff totals or n-day minimum flows, restricting analysis and explanation of historical flow variability [7]. Water development system planning, hydraulic structure design, and water resource management are all hampered by these gaps and breaks [8]. Additionally, extra expenses may be spent if a modelling system or decision support system eventually demands the utilisation of this measured data. As a result of these disadvantages, gaps must be filled, and the handling of missing data should be prioritised in the data preparation procedure.
The most convenient method for dealing with missing data is to delete the entire observations with partial data and analyse the remaining complete data [6]. On the other hand, deleted data may result in discontinuous data, resulting in information loss and skewed conclusions. In recent years, however, various data estimation approaches have been proposed and widely debated in relevant literature to solve this challenge. These approaches range from simple classic statistical methods such as replacing mean, median, or alternative location stations for each missing value to advanced computational techniques.
For example, Hirsch (1979) and Wallis et al. (1991) [9,10] addressed reconstruction methods for daily data utilising data from neighbouring stations. In another study, the missing consecutive streamflow was reconstructed using the k-NN algorithm, and the impact of increasing the value of k on the results was also demonstrated [11]. Recent work by Cheng and Syu (2019) [12] compared the k-NN average (kNN-AVG) algorithm to the backpropagation neural network (BPNN) in wireless positioning systems. Due to BPNN's learning capacity, the study discovered it improved area positioning accuracy more than kNN-AVG. Meanwhile, in Worland et al. (2018) [13] study, the performance of eight machine learning models (including k-NN) and four baseline models to forecast hydrological low-flow indices in ungauged basins in South Carolina, Georgia, and Alabama, USA, was examined. The study found that machine learning models produced much-reduced cross-validation errors than baseline models. Another study used four different approaches to substitute missing meteorological data: linear interpolation, mode imputation, k-NN, and multivariate imputation by chain equations (MICE). When the k-NN method was used to the test data, the prediction performance provided results closest to the original data with no missing values. The prediction model's performance remained steady even when the missing data rate rose [14].
Several recent studies have proposed techniques for filling in missing hydrological data utilising CART approaches or random forests derived from CART for missing streamflow records imputation. According to these studies, the CART model outperformed other classification methods in terms of explained variance [15][16][17]. Similarly, in Erdal et al. (2013) [18] study, CART was utilised for monthly streamflow forecasting, with a support vector regression (SVR) model serving as the benchmark model. CART was demonstrated to outperform SVR in both the training and testing stages.
Regression methods have long been utilised in statistical approaches to reconstruct missing streamflow data [19]. MICE, a well-known technique for performing multiple imputations, employs sequential regression modelling as well [20,21]. The aim is to simulate flow at one gauge as a function of flow at another or several gauges. In Beauchamp et al. (1989) [19], regression and time series methods were utilised to synthesise and predict streamflow at a downstream gauge over an upstream gauge in California. According to the study, either method produced logically good estimations and projections of the flow at the downstream gauge. In another study, an MLR model was employed to estimate streamflow in the Wainganga River. According to the study, the method employed was one of the simplest and fastest ways to compute runoff [22]. Furthermore, Gyau-Boakye and Schultz (1994) study [23] examined ten well-known techniques, including interpolation, recursive models, autoregressive models, regression, and non-linear models and concluded that the interpolation and multiple regression models fared well. A detailed outline of techniques used in hydrology for the reconstruction of missing data, including an applied comparison of simple and multiple regression models was presented in Harvey et al. (2012) [7]. The study revealed that when multiple input variables are included, accuracy improves.
However, no research on the reconstruction of missing streamflow data utilising effective RRRI techniques had been conducted before. As a result, this study's objectives were twofold: (1) to reconstruct missing flow data from the Langat River Basin using RRRI from the statistical field in comparison to machine learning techniques: k-NN and CART; and (2) to evaluate the performance of imputation methods coupled with the MLR model in forecasting future daily streamflow values. The findings of this study are likely to aid in the development of the best techniques for data imputation that allow for the reconstruction of entire daily streamflow datasets.

Area of Study
The Langat River Basin is located in the southernmost state of Selangor and upstate of Negeri Sembilan, especially between latitudes 2o 40'M 152" N to 3o 16'M 15" N and longitudes 101o 19'M 20" E to 102o 1'M 10" E in the western part of Peninsular Malaysia, as seen in Figure 1. The basin comprised an area of approximately 2,394.38 km2, with a major river channel stretching for around 141 km. The river runs southward towards the lower mainland and westwards towards the coast of the Selangor state, with its mouth is located in the Straits of Malacca [24]. This river basin, Malaysia's most urbanised river basin, is considered to compensate for the advantages of Klang Valley spill-over development [25,26]. It is a critical raw water resource for drinking water as well as other activities such as recreation, industrial usage, fishing, and agriculture [27]. This study looked at four Langat River sub-basins (Kajang, Dengkil, Lui, and Semenyih).

Figure 1. Map of Langat River Basin
The Langat Basin is impacted by two types of monsoons in terms of hydrometeorology: the northeast and southwest monsoons, which occur from November to March and May to September, respectively [28,29]. The southwest monsoon, which blows over the Malacca Strait has the most impact on the climate of the basin [30]. The Langat River Basin has four flow rate gauging stations: Dengkil and Kajang at Langat River, Kg. Rinching at Semenyih River, and Kg. Lui at Lui River. The characteristics of sub-basins connected with Langat Basin gauging stations regulated by the Department of Irrigation and Drainage (DID) are depicted in Table 1. High-dimensional data obtained from Malaysia's DID, Ampang, Selangor, recorded from 1978 to 2016 was utilized in this study. The streams have been monitored constantly and recorded in m 3 /s as daily mean flow rates. Of the 56,980 data points, 12.5% had missing values. Datasets containing 10 to 25% missing values are classified as moderate data [31]. Furthermore, as stated in Bennett (2001) study [32], if the percentage of missing data exceeds 10%, the statistical analysis is likely to be skewed. A large number of time series data are necessary to get a precise outline of the streamflow patterns [33]. Aside from that, the reliability of a frequency estimator for a lengthy time series dataset is extremely valuable in data analysis since it is closely related to sample size.

Research Methodology
This section is split into two major subsections. The techniques for estimating missing data are discussed in the first subsection. Meanwhile, the second subsection describes how the performance of the methods utilised is evaluated. This study employed a cross-validation methodology on data from 2012 to 2014 to assess the competency of infilling methods. This period was chosen as the baseline due to the availability of comprehensive data. The missing daily streamflow data were recovered from the entire time series data after being simulated at random. Figure 2 depicts the flowchart of imputation and the procedure for integrating missing data into the entire time series.

Imputation Methods
This study compared three methods of imputation to determine the best fit technique to impute missing values in the streamflow datasets. The RRRI method is a statistical methodology, whereas k-NN and CART are machine learning techniques. Initially, multiple imputation methods were used to fill up all missing values with replacement from the observed values. In general, the first variable with missing values ( 1 ) is regressed on all other variables ( 2 , 3 , … ), but only on individuals with the observed 1 . Missing values in 1 are replaced with simulated draws from 1 's posterior predictive distribution. The next variable with missing values ( 2 ) is then regressed on all other variables ( 1 , 3 , … ), restricted to individuals with the observed 2 , and using the imputed values of 1 . Missing values in 2 are again replaced by draws from 2 's posterior predictive distribution. The process is repeated for each variable with missing values in turn -this is referred to as a cycle. To stabilise the results, the procedure is typically repeated for several cycles (e.g. 10 or 20) to produce a single imputed dataset, and the entire procedure is repeated times to produce imputed datasets. Following the missing values imputation process, the CE, RMSE, and MAPE for each of the three predicted values were then computed. Finally, all three methods were employed in conjunction with MLR to restore streamflow rates in Malaysia's Langat River Basin from 1978 to 2016.

k-Nearest-Neighbor Imputation (k-NN)
The machine learning-based k-NN imputation, also known as distance function matching, is a donor approach in which the donor is chosen by minimising a specified 'distance' and the mean is utilized as an imputation estimate [34,35]. It represents the local estimate approach, which predicts using just neighbouring states. Local estimators are regularly believed to give good results in chaotic time series [11]. The missing values are based on a set number of cases, one of which is very certainly the instance of interest [36]. This procedure involves calculating an appropriate distance measure, with the distance defined by the auxiliary variables. This study employed the Euclidean distance, one of the most prominent methods for measuring distance, and the formula is as follows: where and are the query point and a case from the streamflow data sample, respectively.
The first step in creating predictions using the k-NN method is determining the value of k. According to Yang (1999) [37], a larger value of k provides greater weight to accuracy and is more stable since it reduces total noise, but there is no assurance. The Elshorbagy et al. (2002) study [11], on the other hand, stated that the smaller the k value, the better the estimation of the missing value. The most commonly used rule of thumb is that k equals the square root of the number of points in the training dataset [38]. As a result, this study opted to have a maximum number of k that is less than or equal to the square root of the size of the training dataset utilized. This yielded far superior results than 1NN, which merely nominated to the class of its nearest neighbour.
After determining the value of k, predictions can be made using the k-NN method. The imputation procedure by the nearest neighbour can be summarized for k neighbours as follows: Assuming that there are m observations on n covariates, = denotes the corresponding × matrix, where represent the ℎ observation of the ℎ variable. Let = represent the corresponding × dummy matrix with the following entries: The metric for the data observed can be used to compute the distances between two observations and , which are represented in the data matrix by rows. The distance is then calculated as follows: represents the number of valid components in the computation of distances. Because parallel views were utilised to conceptualise distances, nearest neighbours were employed.

Multiple Classification and Regression Tree (CART)
CART [39] is a well-known classification of machine learning algorithms [39] that employs the concept depicted in Figure 3. CART models need predictors as well as cut-points in predictors used to split the sample. The cut-points split the sample into larger, homogenous subsamples. The splitting procedure is repeated on both subsamples, allowing a succession of splits to form a binary tree [18]. For regression problems, each node in the tree has a splitting rule defined by minimising the relative error (RE), which is similar to minimising the sums-of-squares of the split: where and are the left and right partitions, respectively, with and observations of in each, and respective means ̅ and ̅ . The decision rule is a point in some estimator variable that decides which branches go left and which go right. The partitioning rule that minimises the RE is then used to construct the tree node. Figure 3 shows an example of a CART framework.

Robust Random Regression Imputation (RRRI)
RRRI is a less stringent variant of least squares regression that operates with looser assumptions. It offers considerably better estimations of the regression coefficients when the data are ambiguous. There are numerous good examples of robust methods in the literature, the most often used: the M-estimator, the least median of squares (LMS) estimator, and the least trimmed sum of squares (LTS) S-estimator, and the MM-estimator. This study adopted the high breakdown and high-efficiency MM-estimator proposed by Yohai (1987) [40]. The following is a simple linear regression model: where is the response variable, is the regressor, is the intercept and is the slope, and is the random error term.

Multiple Linear Regression (MLR)
Following the replacement of all missing values with multiple techniques, the complete dataset is analysed using MLR to determine the best approaches for dealing with missing data in daily streamflow datasets. Regression analysis is a statistical technique that examines the relationship between at least two quantitative variables and their predicted variables [42]. The MLR model is a widely used statistical technique in many fields, including hydrological data [43,44]. The following is how the MLR model parameter is expressed: where is the response variable's value, 0 , 1 , 2 are unknown constants, is the predictor variable's value, and is the random error.

Performance of the Estimation Methods
In this study, three performance metrics were utilised to assess imputation methods: CE, RMSE, and MAPE. The CE index is a well-known metric for assessing the prediction power of hydrological models. The most effective performance models aim for a CE value of one (1). Meanwhile, RMSE is a common statistical metric used to assess model performance in meteorology, air quality, and climate research investigations. The RMSE statistic, which is a measure of the difference between predicted and observed values, offers information about short-term efficiency. Another useful measure used extensively in model evaluations is MAPE. Especially, MAPE provides an insight into the average deviation of the predicted values from the observed values and the long-term performance of these models. The lower the values of RMSE and MAPE values, the better findings of the long-term model. The following formulae can be used to compute these statistics: where is the observed streamflow data, ̃ is the predicted value, ̅ indicates the average streamflow data, represents the sample size and represents the number of independent variables in the regression equation over daily streamflow datasets from the Langat River Basin.

Results and Discussion
This study was conducted to find the best imputation method for calculating missing streamflow data comparing k-NN, CART, and RRRI. The models were first evaluated without missing values on the training dataset from the year 2012 to 2014. The simulation process was performed in the following flow: a conventional training dataset was generated using the missing data rates (i.e. 5, 10, 15, 20, 25, and 30%), and the missing values were substituted with new values, acquired using each of the previously mentioned imputation methods. The error was computed by subtracting the trained model's predicted value from the reference model's predicted value and the data acquired using the missing value replacement method. The trained model with the original training data and test data with no missing values was used as the reference model. Each method's performance was evaluated using CE, RMSE, and MAPE. When the gap between the estimated and observed values is minimal, RMSE and MAPE will provide the smallest value. Meanwhile, CE values can vary from −∞ to 1, with values larger than 0.5 deemed acceptable. The method with the greatest CE value and the lowest RMSE and MAPE values was chosen as the best. Table 2 shows the prediction model errors, whereas Tables 3-5 display the deviation results.    The consistency of each imputation method was determined using gap analysis, as indicated by reduced gaps between training and validation results. According to the aforementioned data (Tables 2 to 5), the RRRI method had the highest CE and the lowest RMSE and MAPE values. Conversely, CART was the worst imputation method, having the lowest CE and the greatest RMSE and MAPE values. Despite this, CE values indicated that all imputation methods gave acceptable results, with values near to one and deviating from the training set by less than 10%. Nevertheless, RMSE produced somewhat lower values than the training sets as the missing data rate increased. Meanwhile, MAPE measured the magnitude of the error in percentage terms, and the values varied slightly depending on the mean difference between the observed known outcome values and the values predicted by the model. As the rate of missing data increased, so did the error between the reference and validation models with missing data imputation. This indicated a small error when training data were used with no missing values. Even if the missing data rate was only 30%, the training model would have followed the pattern of the remaining 70% training data rather than the 30% missing data. Consequently, a substantial error went unnoticed despite the existence of missing values in the training data. Later, the model was evaluated for all four sub-basins using a dataset spanning the years 1978 to 2016. The findings were then computed as an average of each imputation method results. Table 6 shows the overall performance of each method in the reconstruction of data from 1978 to 2016. The RRRI method produced the lowest RMSE and greatest CE values, according to the results. Despite this, CE values indicated that all imputation methods gave acceptable results, with values near to one. Based on the findings, it is possible to infer that the RRRI method gave the best results, whereas k-NN was the worst imputation method for daily streamflow data in Malaysia's Langat River Basin due to the lowest CE and greatest RMSE values among the other methods. Following the completion of the missing values, the MLR model was used to analyze the whole dataset in this study and find the best approaches for dealing with missing data when imputation values were coupled with modelling. The performance of imputation methods coupled with an MLR model was evaluated using MAPE and RMSE.  Finally, visual inspection/ analytic data were drawn up with observed and predicted values for the kNN-MLR, CART-MLR, and RRRI-MLR models. Figure 4 depicts the results of three imputation methods for replacing 7,124 missing daily streamflow data points in Malaysia's Langat River Basin and shows the comparable patterns of imputed daily streamflow values from all three methods. All models, for instance, responded with comparable peaks and durations in streamflow events.
It can be concluded that the RRRI-MLR method achieved the best performance, whereas k-NN-MLR performed the least. This was not surprising for the k-NN method from the aspects of the most popular methods and the methodological simplicity; the method is a lazy learner, unable to learn anything from the training data. The training data was instead used for classification, which can result in poor algorithm generalization and outliers susceptibility [45]. This was consistent with a recent finding in Miró et al. (2017) [46], where advanced linear methods produced far superior results than traditional methods, such as k-NN. To predict a new instance label, the k-NN algorithm searches the data for the k closest neighbours to the new instance and sets the predicted class label to be the most prevalent label among the k closest neighbouring points. In each prediction, the algorithm must compute and sort the distance and data for a number of training instances, which might be time-consuming. Another possible consequence of changing the k value is the change in the class label [47].
On the other hand, the CART method performed better than k-NN when coupled with MLR. The CART method is recognised for its simplicity, robustness, ability to handle multicollinearity and skewed distributions, and adaptability to interactions and non-linear relationships [21]. This conclusion was consistent with previous studies [17,18], which found that the CART model outperformed other classification algorithms in terms of explained variance. Regardless of how flexible and interpretable CART is, it is critical to understand how it works to establish and cross-validate suitable tuning parameters, such as tree depth or split number [48]. The CART technique can also be time-intensive when applied to large datasets.

Figure 4. kNN-MLR, CART-MLR and RRRI-MLR data imputation results for 7,124 missing streamflow data
Comparatively, the results revealed that the robust technique of this study prevailed over the investigated non-robust approaches. The obvious reason for this is that RRRI with MM-estimator has high asymptotic efficiency, about 95% compared to ordinary least squares (OLS) under Gauss-Markov assumptions, and a high breakdown (about 50% ~ n/2) [49]. RRRI add a random error term in which without adding this term, for the same value of independent variables, it will result in the same response which is not true in reality. Furthermore, when the error percentage is mirrored by the missing data proportion, the RRRI technique produces a lower error than the k-NN and CART techniques. These simulations demonstrate unequivocally that the RRRI technique is the most effective missing data imputation method for reconstructing missing streamflow data.

Conclusion
Missing data is a frequent constraint of hydrological research and usually leads to misinterpretations of statistical output and hydrological modelling techniques. Therefore, method performance evaluation is necessary to reduce the impact of missing data. Several techniques have been proposed in the literature to manage missing data. However, a suitable approach to be used as the missing data trend and mechanism are still unclear. Researchers typically exclude observations with missing data or replace them through naive methods, such as the mean or mode of all other observations, because of convenience, even though these methods are statistically significantly worse. In this study, a novel statistical approach to treating and estimating missing data in streamflow datasets was proposed. The findings demonstrated that when coupled with MLR, the proposed method, RRRI with MM-estimator, gave the best results. The RRRI-MLR method outperformed k-NN and CART on all three performance metrics (CE, RMSE, and MAPE), with a higher adjusted CE and lower RMSE and MAPE. This result indicates that the RRRI technique has the lowest variance between the reference model and the prediction model with missing data imputation. As a result, using the RRRI technique to address missing streamflow data can produce the best results. As such, this method should be classified among suitable contenders for managing missing data in streamflow datasets.

Limitations and Directions for Future Research
This study was based on k-NN, CART, and RRRI performance as conditional models for imputing missing flow records. Four gauging stations were used to analyze the data matrices of the Langat River Basin. However, other critical streamflow characteristic contributors, such as rainfall, temperature, topography, or other study area parameters were not examined due to data limitations. Ignoring such parameters may result in erroneous data prediction. In future studies, the performance of the proposed imputation method should be compared to other imputation methods like support vector machines and artificial neural networks. In order to examine the robustness of results, a sensitivity analysis may also be beneficial with several methods for managing missing data.

Data Availability Statement
Restrictions apply to the availability of these data. Data was obtained from Department of Irrigation and Drainage (DID), Ampang, Selangor, Malaysia.