Landslide Susceptibility Mapping using Machine Learning Algorithm

Landslides are natural disasters that have resulted in the loss of economies and lives over the years. The landslides caused by the 2005 Muzaffarabad earthquake heavily impacted the area, and slopes in the region have become unstable. This research was carried out to find out which areas, as in Muzaffarabad district, are sensitive to landslides and to define the relationship between landslides and geo-environmental factors using three tree-based classifiers, namely, Extreme Gradient Boosting (XGBoost), Random Forest (RF), and k-Nearest Neighbors (KNN). These machine learning models are innovative and can assess environmental problems and hazards for any given area on a regional scale. The research consists of three steps: Firstly, for training and validation, 94 historical landslides were randomly split into a proportion of 7/3. Secondly, topographical and geological data as well as satellite imagery were gathered, analyzed, and built into a spatial database using GIS Environment. Nine layers of landslide-conditioning factors were developed, including Aspect, Elevation, Slope, NDVI, Curvature, SPI, TWI, Lithology, and Landcover. Finally, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) value were used to estimate the model's efficiency. The area under the curve values for the RF, XGBoost, and KNN models are 0.895 (89.5%), 0.893 (89.3%), and 0.790 (79.0%), respectively. Based on the three machine learning techniques, the innovative outputs show that the performance of the Random Forest model has a maximum AUC value of 0.895, and it is more efficient than the other tree-based classifiers. Elevation and Slope were determined as the most important factors affecting landslides in this research area. The landslide susceptibility maps were classified into four classes: low, moderate, high, and very high susceptibility. The result maps are useful for future generalized construction operations, such as selecting and conserving new urban and infrastructural areas.


Introduction
Landslides are caused by many natural and human action-induced disruptions to the soil [1][2][3]. Landslides inflict significant economic loss every year, whether caused by natural or human activity [4][5][6]. The Himalayan Mountain Range is vulnerable to frequent and devastating landslides due to robust topography, poor lithology, climate change, soil, fragile sloping infrastructure, and active tectonic action. The probability of landslides happening in an area is In this article, two main concise contributions are as follows: the first, the study area, Muzaffarabad, being the capital of Azad Jammu and Kashmir (Pakistan), and the densely populated area is exposed to the threat of landslides, especially after the 2005 earthquake. Therefore, the need to make an accurate Landslide Susceptibility Map for future planning using ML algorithms is imperative to minimize this impending threat subsequently. Secondly, to the best of our knowledge, there is no publication to investigate and compare the tree-based classifiers by tunning the hyper-parameters in the research area for the LSM. The LSM can be used for land-use planning to mitigate risks associated with landslides and is valuable for future general planned construction operations, such as selecting and environmental conservation of the new urban and infrastructural areas.

Study Area
The Study area covers 77.45 km 2 in Muzaffarabad District, northern Pakistan ( Figure 1). Muzaffarabad is situated at the convergence of the Jhelum and Neelum River. The 94 landslides were mapped using visual interpretation of optical images and field investigations. In the field, some active landslides were captured, shown in Figure 2. The study area's key geographical features are valleys, plains, hills, lakes, and rivers. The weather of the region is damp to the subtropical highlands. The Elevation ranges between 615 and 2014 meters above sea level. The maximum and minimum average temperatures are around 23° and 35°C in the study area for July and 3° and 16° C for January. The average rainfall in the region is 1,511 mm/year [39]. The region consists of active tectonics, steep slopes, and delicate geological situations. Pre-Cambrian, Cambrian, Paleocene to Eocene, Early Miocene, and Quaternary rocks are exposed from a stratigraphic perspective in the investigated area. (Figure 3). A Digital Elevation Model with a resolution of 12.5×12.5 meters produced is used to create variables (source: http://www.gscloud.cn). The USGS (https://earthexplorer.usgs.gov/) given a Landsat-8 satellite image acquired in December 2020. All the data presented is analyzed for output of QGIS 2.18.16, ArcGIS 10.6, and R-Studio-related variables

Landslide Susceptibility Mapping
The workflow is shown in Figure 4. Nine evaluation parameters are used as landslide predisposing factors, whereas the landslide is the dependent variable. A preliminary decision table is created using the circumstances attribute and intention characteristics; the input characteristic settings are then used to form the three models. Eventually, landslide susceptibility mapping is performed the following fundamental principle of landslide sensitivity mapping: The following measures are as follows:  Model unit division: As a model unit in this analysis, the grid unit was used. Remote sensing data and DEM data spatial resolution correspond to 12.5 m, and evaluation factors have all been re-evaluated at 12.5 m.
 Initial decision table: A two-dimensional table was created by a condition attribute that corresponds to 9 evaluation factors and a landslide decision attribute (1 represents landslides, 0 represents non-landslides), each line of which defines an object, and each column is the object attribute.
 Two sections have been changed: training (70%) and testing the two-dimensional table (30%). The Training data apply to develop the models and to forecast the studies.
 The evolution of landslide sensitivity: The three models above were used to calculate all model units in the research field. To generated landslide prediction maps (LPI), each model unit's prediction values per group have been developed.
 Division of LPI maps: The results acquired using the three algorithms were importing into GIS, and the Janks natural breakpoint approach [18] was applied to division landslide susceptibility into four grades [40]: low, moderate, high, and very high.
 Findings are analyzed: The three models have been extensively tested using the ROC curve and the area under the ROC curve.
The landslide susceptibility assessment has described nine variables with highly explained variances as input factors.

Random Forest
RF is a commonly used algorithm that is based on the bagging ensemble approach. The RF was selected for an ML algorithm for various classification and regression tasks [41]. The RF model is very flexible and fast. The algorithm is one of the entire tree learning methods based on CART (Tree of Regression and Classification). The number of bootstrap samples is used to balance the number of trees picked. An RF algorithm's key advantage is that it needs minor hyperparameter tuning in for accurate results than the other boosting algorithms [42]. Table 2 displays the RF hyperparameter ranges and maximized values discovered by grid search. In RF models analysis, some hyper-parameters must be tuned, such as:

Extreme Gradient Boosting
XGBoost is a scalable endwise tree boosting system that's applied in a variety of machine learning problems. XGBoost follows the gradient boosting principle that incorporates all the poor Leaner predictions [43]. Model XGBoost needs several model preview parameters to be chosen. Three main user-friendly parameters are required: nrounds (maximum number of iterations boosting), colsample_bytree (columns ratios sub-sample when each tree is constructed), and subsamples (the training instance subsample ratio); (Table 3).

k-Nearest Neighbor
k-Nearest Neighbors has been recently used in several fields (e.g., LSM) and is the easiest of all ML algorithms [44,45]. The input in KNN consists essentially of the k most similar training illustration in the specified interval. The results are a possibility of becoming members; in classified snags, membership possibilities represent the ambiguity of a particular object allocated to a specific class [46]. According to [47], In KNN, an entity is identified by its Neighbor's majority vote; the entity is allocated to the most general class of its closest Neighbors (k is an optimistic number, characteristically minor). If k = 1, the entity is assigned to the next single Neighbor class.
KNN usually is stated as an instance-based, nonparametric, and "lazy" algorithm. Here, the terms 'lazy learning' mention models in which the acquired experience is deferred until an inquiry is posted on the process instead of 'eagerness learning.' The algorithm oversimplifies the training data previously having investigations [48].
The expression 'nonparametric' mentions that the algorithms are not dependent on the dispersion of the data and that the models respond to the data independently. For LSM, where landslide distribution usually doesn't suit smooth distributions, it is beneficial. Lazy learning, on the other hand, implies that the role is just approx. Domestically, and all estimation has been delayed. Consequently, KNN does not learn from training data a discriminatory feature but instead stores the training data set.

Model Assessment
The capacity of Landslide Susceptibility Models was tested using two statistical methods in this analysis. Specificity and Sensitivity are the percentages of the landslide pixels properly categorized and the proportion of the non-landslide pixels adequately classified in the following equations as Sensitivity and specificity [49];

Sensitivity of LSM = TP TP + TN
(1) The number of correctly classified pixels is denoted by TP (true positive) and TN (true negative). FP and TP indicate the number of incorrectly classified pixels. The ROC curve is used here, the area under ROC, and the Recall value as appraisal metrics to determine LSM models' capability based on Sensitivity and specificity. The ROC curve, the output of diagnostic signals such as land change, is often measured [47,48]. Their Sensitivity toward unique characteristics is used through the X-axis with different cut-off levels [24].
Moreover, the AUC reveals that a model can simulate landslides and non-landslides cells. AUC having value 1 illustrates an optimum model; an integer 0 corresponds to an informational model [50]. We also addressed the applicability of the precise value under unbalanced landslide sample conditions. The accuracy and recall value calculation equations are as follows: Accuracy = TP + TN TP + FN + TN + FP (4)

Important Landslide variables
The selection of predictor variables is one of the essential aspects of landslide susceptibility evaluation and mapping [51]. GIS is frequently utilized to extract important variables for susceptibility evaluation from digital elevation models, such as Topographic Wetness Index (TWI), curvature, Elevation, aspect, and Slope [10,52]. This research selected nine contributing factors in line with previous literature [29]: Slope, Elevation, Lithology, Curvature, Aspect, TWI, NDVI, SPI, and Landcover ( Table 1). The R Studio (Window 4.0.3) was used to determine whether the landslide factors were significant in this analysis or not.
The Slope and Elevation were the essential landslide parameters, while the influence rate of aspect was the minimum (Figure 8). Therefore, in the research of Dai et al. and Ohlmacher et al. [54,55], it was found that Elevation and Slope were the best predictor variables for estimating the possibility of landslide events. Xiong et al. [53] showed numerous geomorphologic factors (e.g., Elevation) as the maximum significant parameters for predicting the LSM. Hong et al. [49] developed the LSM using J48 Decision Trees, and the results showed that the slope parameter significance was greater than other parameters. Similarly, Dou et al. [54] generated the LSM using advanced random forest and decision tree algorithms, and findings showed that the slope variable was the primary cause of landslides.
Furthermore, this study found that the Muzzafarabad formation was the most prone to landslides, pursued by the Slope. Therefore, the results of our LSM corroborated the findings of Riaz et al. [29] at a certain level. Similarly, Torizin et al. [55] created the LSM for the Torghar and Mansehra districts of Pakistan. Their LSM was created by combining four variables: Landcover, Slope, Lithology, and Distance to the fault. The included variables were similar to the contributive variables considered in this study; nevertheless, we developed the LSM using nine causative variables. Figure 9 indicates using the three mapping responses graded by the landslide index (LPI). The greater the landslide chance, the greater the LPI. The natural breakpoint method [18] used the class-division probability distribution histogram to divide the LPI into four classes (Jenks method). The decreasing value should be the distribution of possibility, the groups with a loss of balance (http://www.svds.com). In this analysis, the LPI was divided into four categories using the natural breakpoint procedure, which considers the histogram.

KNN Landslide Susceptibility Map
The research area, 55.2%, is covered by the KNN model's very high and high susceptibility classes (Figure 9-a). In contrast, the low and moderate vulnerability classes have 23.6% and 21.2% respective areas.

XGBoost Landslide Susceptibility Map
The XGBoost model classified 18.1% of the research area shows very high susceptibility, with values of 17.5%, 24.2%, and 40.2% for low, moderate, and high vulnerability, respectively (Figure 9-b).

RF Landslide Susceptibility Map
The research area, 44.3%, is covered by the RF model's very high and high vulnerability classes (Figure 9-c). On the other hand, the moderate and low vulnerability classes have respective areas of 26.2% and 29.5%.

Verification and Comparison
Based on the effects of the three algorithms, the Janks natural breakpoint approach [18] was used to identify landslide vulnerability into four classes [40]: low, moderate, high, and very high, based on the results obtained using the three algorithms. For each algorithm, landslide susceptibility zoning maps were developed using this tool. For the XGBoost model, the LPI ranged from 0 to 1. In normal break conditions, the map of landslide vulnerability was divided into four susceptibility classes: low (0.00-0.17), moderate (0.17-0.43), high (0.43-0.71), and very high (0.71-1). The LPI for the KNN model varied from 0 to 1; four categories of landslide susceptibility maps were reclassified: low (0.00-0.01), moderate (0.01-0.49), high (0.49-0.85), and very high (0.85-1). The LPI ranged from 0 to 1 for the RF model, and the natural split system was used to reclassify a four-grade, low (0.00-0.19), moderate (0.19-0.40), high (0.40-0.63), and very high (0.63-1) susceptibility chart. The ROC curve was being designed to test the model's predictive capability.
To compare the prediction precision of the model, the confusion matrix was used (Table 4). We observed from Table  4 that all models were highly reliable. The precision of these models is 0.858(XGBoost), 0.861(KNN), and 0.856(RF), correspondingly. However, the accuracy was identical, and all three versions tend to be strong on the surface. The evaluation metrics seen in Figure 10 and Table 5 are the ROC curve and AUC. With values of 0.790, the KNN Model had the lowest AUC efficiency. The RF and XGBoost models displayed an AUC value more remarkable than the KNN, with AUC values of 0.895 and 0.893; correspondingly, the RF approach is best predictive.    The findings show that the XGBoost, KNN, and RF have fair precision for LSM in the studied area. Based on the AUC, RF and XGBoost achieved the highest accuracy compared to KNN, which indicates the results are in line with previous studies [10,28,29]. Similarly, other studies have compared several LSM approaches performance to determine the best model [56,57]. The collaborative model's RF and XGBoost provide exceptional output with a limited range of hyper-parameters to tuning in, but they take a lot of time to run. The XGB, in particular, can process vast amounts of data while using less memory. Unlike the additional tree-based model, the XGB grows trees vertically rather than horizontally [58]. The XGBoost method verified the highest execution and an adequate time of training. The XGBoost is a specific application of the gradient boosting process and uses a memory-efficient technique to improve computational speed and model performance [59]. Whereas the KNN model is easy and basic to perform, its key drawback becomes exponentially slower as the amount of data grows, making it an unfeasible alternative in situations where forecasts must be complete quickly [58]. The Area under curve was used to determine the accuracy of the forecast effects, and AUC denotes the area under the curve bounded by the coordinate axis [60]. The RF model is a wellfunctioning and reliable model. The strength of RF and XGBoost algorithms is when a large amount of data is missing, RF and XGBoost are suitable methods for estimating missing data and retaining validity [61].
Furthermore, the RF model offers certain benefits and drawbacks. The primary benefits are that the model is not overly sensitive to the variables used to run it, and it is easy to control which variables to use. Random forests tend to exaggerate low values while underestimating high ones, which is its major drawback. When the technique is utilized for regression, this restriction becomes apparent [62].

Conclusion
Landslides have been one of Pakistan's most severe natural adversities, causing significant deaths and socioeconomic losses annually. So far, the process of landslide mapping has been quite complicated and volatile, making it difficult to predict the vulnerability of landslides accurately and timely in the most susceptible areas. For this purpose, various attempts have been made to enhance predictability based on several forecast models for mapping the landslide vulnerability targeting different regions. It is essential for decision-makers to create more appropriate landslide vulnerability maps, to enhance the efficiency of the prediction model. In this study, various prediction models have been evaluated and compared, such as XGB, RF, and KNN, to select the most suitable and accurate one to forecast the landslide susceptibility. The model's comprehensibility helps determine the potential association between manipulating Parameters and LSM from data, which is critical for considering the parameters that influence landslide susceptibility. The LSM allows decision-makers to choose appropriate sites for the implementation of development projects. This research has defined the relationship between landslides and geo-environmental factors using three tree-based classifiers: XGBoost, RF, and KNN. The collaborative model's RF and XGBoost provide exceptional output with a limited range of hyper-parameters for tuning in. Based on the three machine learning techniques, the innovative consequences show that the performance of the Random Forest model has a maximum AUC value of 0.895 and is more efficient than the other tree-based classifiers. The contributing factors, such as Slope, and Elevation were the essential landslide parameters, while the influence rate of aspect was the minimum. These findings could be employed as fundamental data to help with land-use development and slope management. The robust results imply that the RF is the most appropriate model, based on the AUC, for mapping landslide vulnerability in the study and its vicinity. Therefore, the stakeholders should employ this model to forecast the landslide susceptibility in Azad Jammu and Kashmir and the adjacent districts more accurately. However, these could be less valuable on a particular site scale where regional geological and geographical diversities may dominate. Additional landslide data is required and implemented in other regions for the model to be widely implemented. The current research offers advanced knowledge to create more robust and detailed predictive models for landslides, which can be minimized in highly susceptible regions.

Data Availability Statement
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest
The authors declare no conflict of interest.