Predicting Travel Times of Bus Transit in Washington, D.C. Using Artificial Neural Networks

This study aimed to develop travel time prediction models for transit buses to assist decision-makers in improving service quality and patronage. Six months' worth of Automatic Vehicle Location and Automatic Passenger Counting data for six Washington Metropolitan Area Transit Authority bus routes operating in Washington, D.C. were used for this study. Artificial Neural Network (ANN) models were developed to predict bus travel times for different peak periods. The analysis included variables such as the length of route between stops, average dwell time, and the number of intersections between bus stops, among others. The Quasi-Newton algorithm was used to train the models and to determine the number of perceptron layers that generated the least error for all peak models. The models were evaluated by comparing the Normalized Squared Errors generated during the training process. Travel time equations for buses were obtained for the different peaks using ANN. The results indicate that the prediction models can effectively predict bus travel times on the selected routes during different peaks of the day with minimal percentage errors. These prediction models can be adopted by transit agencies to provide patrons with more accurate travel time information at bus stops or online.


Introduction
Washington, D.C. ranks second among U.S. cities in public transit commuting, with approximately 9% of the working population using the Washington Metropolitan Area Transit Authority (WMATA) Metrobuses to commute [1]. The Metrobus system in D.C. is the fifth largest bus system in the United States. It has over 1,450 buses and serves approximately 350 routes across the D.C., Maryland, and Virginia area [2]. The buses serve 11,129 stops, including 2,554 stops with bus shelters [3].
The accurate prediction of travel time is necessary to enable public transit agencies to provide patrons with efficient transit service and to allow patrons to effectively plan their commutes in the region. Transit agencies are continuously evaluating best practices to improve the reliability of their services. The use of technology, particularly in bus transit, has been critical for this purpose. This includes Automatic Vehicle Location (AVL) technology, which has been instrumental in tracking buses in real time. Automatic Passenger Counters (APC) installed on buses count the number of passengers alighting and boarding at each bus stop, which helps in computing the total number of patrons onboard.
The use of public transportation instead of personal vehicles has been encouraged as one of the solutions to traffic congestion. A study conducted by Abdulrazzaq et al. (2020) in Kanjang City, Malaysia indicated that travel time and travel distance, coupled with schedule accuracy, fare reductions, and increased accessibility, significantly influenced riders' decisions to use public transport [4]. Obtaining accurate travel times for the Metrobuses is an important task for transit authorities seeking to provide reliable service to their patrons. A study conducted in 2013 found that transit buses in DC had an overall on-time performance of approximately 75%, with mean deviations between scheduled and actual arrival times ranging from 1.99 to 5.03 minutes [5]. Such cumulative deviations in arrival times can negatively affect patrons' perception of transit reliability. It is therefore important that travel time prediction models be developed to provide more accurate information to patrons based on the pertinent factors that affect travel and arrival times.
Several bus arrival prediction models have been developed using techniques such as historical average models, regression models, Kalman filter models, and Artificial Neural Network (ANN) models, based on several different variables [6]. Significant variables used in the various bus arrival prediction models include time of day, vehicle arrival/departure, speed, distance, passengers boarding/alighting, and en-route traffic conditions. Ranjitkar et al. (2019) conducted a study that introduced 10 independent variables (factors) into seven (7) different models to compare their accuracy in predicting bus travel time using AVL and APC data. The models developed included multivariate linear regression, ANN, decision tree, and gene expression programming models, among others. The study concluded that the ANN model performed best in comparison to the other models and that the distance between two stops was the most significant variable for all models [7].
Machine learning models such as ANNs provide a much more effective alternative with better accuracy than conventional models [8]. ANNs are mathematical models inspired by the biological neural networks in the human brain. The effectiveness of an ANN rests on its ability to approximate both linear and nonlinear functions to a required degree of accuracy using a learning algorithm, building ''piece-wise'' approximations of the functions. Jeong and Rilett (2004) developed historical data-based, regression, and ANN models and compared their performance using AVL data to predict bus arrival times in Houston, Texas. The prediction of bus arrival time was based on dwell time at stops and traffic congestion. The ANN model's Mean Absolute Percentage Error (MAPE) was about 54.24% and 48.61% lower than those of the historical data-based and regression models, respectively. The results from the study indicated that, in terms of prediction accuracy, the ANN models outperformed the historical data-based and regression models [8]. A study conducted by Chien et al. (2002) developed two ANN predictive models, trained on link-based and stop-based data respectively, to accurately predict bus arrival times in an urban road network. Though each model performed better under different scenarios, the study concluded that a hybrid ANN model integrating both the link-based and stop-based models would further improve the accuracy of bus arrival time predictions [9]. Another study [10] developed accelerated time survival and linear regression models to estimate both travel times and the level of uncertainty associated with these predictions (travel time variance) based on headway bus route data from the Pennsylvania State University-University Park campus.
The results indicated that though the bus travel time prediction accuracies of both models were similar, the accelerated time survival models performed better, with 76% smaller uncertainties and a 12% reduction in travel time prediction variation [10]. Treethidtaphat et al. (2017) designed a bus arrival prediction model using a Deep Neural Network (DNN) based on GPS data from a public transportation bus line in Bangkok, Thailand. The results determined that the DNN model improved bus arrival time predictions by up to 55% when compared to an Ordinary Least Squares (OLS) model and the bus line's current prediction model [11]. Chen (2018) proposed an Arrival Time Prediction Model (ATPM) for passenger and/or tourism systems using the Hsinchu and Yosemite bus systems in Taiwan as case studies. Three months' worth of data for 14 highway routes and 40 urban roads were used in the ATPM, which is based on Random Neural Networks (RNN). The ATPM produced better accuracy than conventional ANN models, and a smart bus system designed with the proposed ATPM achieved accuracies of 94.75% for highways and 78.22% for urban roads in providing travel time information to the agency and riders [12]. A study by Yu et al. (2018) proposed and compared bus travel time prediction models based on Random Forests Near Neighbor (RFNN), Linear Regression (LR), K-Nearest Neighbor (KNN), Support Vector Machines (SVM), and classic Random Forests (RF). AVL data from two bus routes in Shenyang, China with comparable traffic conditions were used. The results determined that though the RFNN bus travel time prediction model had the longest computation times, it achieved better accuracy in terms of MAE, MAPE, and RMSE [13]. Petersen et al. (2019) designed and implemented a multi-output, multi-time-step Deep Neural Network model that blends convolutional and Long Short-Term Memory (LSTM) layers to form a hybrid model (ConvLSTM).
The model used six months' worth of urban bus transportation data provided by the Movia Public Transport Authority, Copenhagen, comprising AVL and GPS data. Results from the study showed that the ConvLSTM model performed better across all peaks, with a MAPE of 4.04% in the morning peak and 5.61% in the afternoon peak, when compared to the historical average model, the bus line's current prediction model, the LSTM model, and Google Maps traffic models. In addition, the ConvLSTM model could predict travel times for multiple links and multiple time-steps ahead [14].
Despite the studies and models developed, none has been developed specifically to predict bus travel time using ANN while considering the unique nature of traffic patterns in the DC area. This research aimed to develop ANN models to predict the travel time of transit buses in Washington, DC using AVL and APC data. The models will enable public transit agencies to provide more accurate travel time information to patrons, improving reliability and consequently increasing bus ridership. This report presents the findings of the case study conducted to predict bus transit travel times in Washington, DC with ANN using AVL and APC data. The materials and methods section describes the study site and the selection process for the bus routes. It also describes how the data were obtained and filtered based on relevant independent variables to generate the data sets required for each peak period. The data analysis comprises descriptive statistics and the ANN model development used to predict bus travel times; the training strategy and the optimization algorithm used to perform the neural network analysis are also described. The results section presents the findings of the neural network training for the different case scenarios (varying numbers of perceptron layers). The initial and final training and selection errors are compared for every iteration to identify the highest percentage change in error. Furthermore, the Normalized Squared Errors of the testing datasets are compared to determine the accuracy and reliability of the training strategies and the travel time equations. A summary of the study is presented in the conclusion section.

Materials and Methods
This section describes the steps that were followed to collect the data required for the analysis. Figure 1 presents a flowchart of the methodology followed to obtain travel time equations for buses using neural network training.

Site Description and Selection of Bus Routes
The Washington Metropolitan Area Transit Authority (WMATA) oversees the operations of the Metrobus service in Washington, DC. The city is divided into four (unequal) quadrants: Northwest (NW), Northeast (NE), Southwest (SW), and Southeast (SE). As of July 2018, the population of Washington, DC was approximately 702,455, with an annual growth rate of approximately 1.4% [15]. WMATA has a fleet of approximately 1,600 buses that operate on 325 routes in Washington, DC, portions of Maryland, and Northern Virginia, covering a total land area of about 1,500 square miles. Metrobuses operate 24 hours a day, 7 days a week and make more than 400,000 trips each weekday. Of the total number of bus stops, 2,556 (22.2%) have shelters, while the remainder do not [16]. Figure 2 presents a location map of Washington, DC (shown in red) on the left and a road map of Washington, DC on the right [17].

Figure 2. Location and Road map of Washington, DC
WMATA buses are equipped with onboard Global Positioning System (GPS) units that track their locations and display them on a geographical map of the area (Automatic Vehicle Location). Automatic Passenger Counters (APC) record the number of passengers alighting and boarding at each bus stop, from which the total number of patrons onboard is computed. For this study, six (6) months' (January 2019 - June 2019) worth of AVL and APC data for 6 WMATA bus routes were collected for analysis. Bus routes on two functional roadway classifications were considered: arterials and collectors. In general, bus routes with the following characteristics were considered:
- Routes with high-patronage bus stops: data for bus routes with relatively higher patronage were provided by the WMATA officials consulted.
- Routes with bus stops with longer headways: routes can have several bus stops that accumulate larger groups of patrons boarding or alighting buses. Such stops can account for higher bus dwell times along the route.
- Routes with bus stops near metro rail stations: bus stops near metro rail stations usually have a high number of patrons, since they provide patrons access to bus services and vice versa.
Based on the criteria, the following bus routes were selected for the study:

Data Extraction
The ANN models were developed based on AVL (Automatic Vehicle Location) and APC (Automatic Passenger Counter) data. Six (6) months' (January 2019 - June 2019) worth of AVL and APC data for buses operating in the DC area were used for this research. Excel sheets obtained from the WMATA database containing the data from the first week of every month were filtered to retain only the information required for the analysis. Based on the significance of their impact on transit bus travel time (from the literature reviewed), the following independent variables for a bus trip were extracted for each week and for the selected routes:

 Number of Bus Stops Served (X1)
A bus "serves" a stop if there are passengers boarding or alighting at that stop. For the purpose of this research, X1 denotes the number of bus stops between any two "origin" and "destination" points along a bus route. Hence, if a bus serves Stop 1 and Stop 2, X1 = 0; if a bus serves Stop 1 and Stop 5, X1 = 3.

 Length of Route between Bus Stops (X2)
The data provided by WMATA included the odometer readings of all buses along a route. Hence, X2 was obtained by taking the difference between the odometer readings at any two served bus stops. Since traffic characteristics depend on the time of day, separate ANN models were developed for the AM Peak (7:00 AM - 9:30 AM), PM Peak (4:00 PM - 6:30 PM), and Mid-Day Peak (10:00 AM - 2:30 PM) periods. The general form of the matrix containing the independent variables and the dependent variable (travel time) used for the neural network analysis is presented in Table 1. For this study, a minimum sample of 500 origin-to-destination trips of multiple transit buses on a route was extracted for each peak period from the 6-month AVL/APC data obtained from WMATA. Thus, a minimum sample of 1,500 origin-to-destination trips per bus route was extracted and exported into a Comma Separated Values (CSV) file for the analysis of the three peak periods.
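As an illustration of how the two predictors could be derived from AVL/APC records, the sketch below computes X1 and X2 for one origin-destination pair. The record layout and field names (stop_seq, odometer, ons, offs) are hypothetical and do not reflect WMATA's actual schema; in this example every intermediate stop is served.

```python
def served(record):
    """A stop is 'served' only if passengers board or alight there."""
    return record["ons"] > 0 or record["offs"] > 0

def extract_predictors(origin, destination, records):
    """X1: served stops strictly between origin and destination.
    X2: route length as the odometer difference (source units)."""
    between = [r for r in records
               if origin["stop_seq"] < r["stop_seq"] < destination["stop_seq"]]
    x1 = sum(1 for r in between if served(r))
    x2 = destination["odometer"] - origin["odometer"]
    return x1, x2

# Example: a bus serving Stop 1 and Stop 5, with Stops 2-4 in between
trip = [{"stop_seq": i, "odometer": 0.3 * i, "ons": 1, "offs": 0}
        for i in range(1, 6)]
x1, x2 = extract_predictors(trip[0], trip[-1], trip[1:-1])
# x1 == 3, matching the Stop 1 -> Stop 5 example in the text
```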

Data Analysis
Neural Designer software was used for the neural network analysis. The software incorporates data science and machine learning techniques that help build, train, and deploy neural network models. The analysis for the project involved the following steps.

Descriptive Statistics
Descriptive statistics including the mean, median and standard deviation were computed for the bus travel times as well as the other predictor or independent variables. The averages of predictors such as dwell times and number of passengers per peak periods were also obtained.

ANN Model Development
The purpose of developing an ANN model in this research is to determine the travel time of a bus on a route using the approximation technique in the Neural Designer software. In the approximation technique, the neural network learns from input-target examples provided by the user. It should be noted that the objective of approximation is to produce a neural network that generalizes well and makes good predictions for unseen data (a good fit) rather than capturing specific details of the data set (overfitting). The software was used to split the data into a training set (75%) and a testing set (25%). The training dataset was used to train and develop the model, while the testing dataset was used to validate it.
Training was conducted through an iterative process of feed-forward and error back-propagation until the gradient normalization goal or the stopping criterion of 1,000 epochs (iterations) was met. The following adjustments were made prior to performing the neural network analysis.

Perceptron Layers
The training of the model was done using a Multilayer Perceptron (MLP). Perceptron layers are the layers that enable the neural network to learn. Each perceptron neuron receives numerical inputs (X1, …, Xn) and produces a numerical output y (travel time). The output is determined by a bias (b) added to the weighted sum of the inputs, with individual weights (w1, …, wn) applied to the independent variables.
The MLP used for this research consisted of three layers: input layer, hidden layer, and output layer. A typical ANN architecture is presented in Figure 4.
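The forward pass of such a network can be sketched as follows. The weights here are random placeholders standing in for trained parameters, and the layer sizes (5 inputs, 4 hidden neurons) are illustrative; the hidden layer uses the hyperbolic tangent activation and the output layer is linear, matching the architectures described later.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W_hidden, b_hidden, W_out, b_out):
    """One hidden layer with tanh activation, linear output layer."""
    h = np.tanh(W_hidden @ x + b_hidden)   # each hidden perceptron: tanh(b + w.x)
    return W_out @ h + b_out               # linear output (predicted travel time)

n_inputs, n_hidden = 5, 4                  # e.g. X1..X5 -> 4 hidden neurons
W_h = rng.normal(size=(n_hidden, n_inputs))
b_h = rng.normal(size=n_hidden)
W_o = rng.normal(size=(1, n_hidden))
b_o = rng.normal(size=1)

x = np.array([2.0, 1.5, 30.0, 4.0, 1.0])   # one (unscaled) input vector
y = mlp_forward(x, W_h, b_h, W_o, b_o)     # one-element array: predicted travel time
```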

Data Standardization
The inputs in the data sets did not have the same ranges. Hence, an automatic scaling layer was applied to make the values of all the independent variables comparable. The outputs were converted back to the original units by an unscaling layer following the perceptron layers.
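A minimal sketch of min-max scaling to [-1, 1] (one common convention; Neural Designer's automatic scaling may choose a different method per variable) and the matching unscaling step:

```python
import numpy as np

def minmax_scale(x, lo, hi):
    """Map values from [lo, hi] into [-1, 1], column-wise."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def minmax_unscale(x_scaled, lo, hi):
    """Invert the scaling to recover the original units."""
    return lo + (x_scaled + 1.0) * (hi - lo) / 2.0

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])  # two input columns
lo, hi = X.min(axis=0), X.max(axis=0)
X_scaled = minmax_scale(X, lo, hi)          # every column now spans [-1, 1]
assert np.allclose(minmax_unscale(X_scaled, lo, hi), X)  # round-trips exactly
```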

Training Strategy
The training strategy refers to the procedure used to carry out the learning process applied to the neural network. It is done to obtain the minimum possible error in the loss index, which evaluates the performance of a neural network by assessing its parameters. Minimizing the error amounts to finding a set of parameters that fit the neural network to the data set; at the lowest value of the loss index, the gradient is zero. The optimization algorithm varies the parameters at each training iteration (epoch) to gradually decrease the loss, and stops once specific conditions or criteria have been met. The Quasi-Newton optimization algorithm was used to train the data sets for all peak periods of the 6 bus routes. Rather than computing second derivatives directly, the algorithm builds an approximation to the Hessian from successive gradient evaluations, yielding a low-loss solution at moderate computational cost. It is the default optimization method in Neural Designer and is recommended for training medium-sized data sets (10-1,000 variables, 1,000-1,000,000 instances).
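Neural Designer's quasi-Newton trainer is not exposed as code, but the same idea can be sketched with SciPy's BFGS implementation, here minimizing the mean squared error of a small tanh network on synthetic data. The data, network size, and parameter layout are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-in data: 200 samples, 2 scaled inputs, 1 target.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] ** 2

n_hidden = 4  # small tanh hidden layer, linear output

def unpack(theta):
    """Split the flat parameter vector into layer weights and biases."""
    i = 0
    W_h = theta[i:i + n_hidden * 2].reshape(n_hidden, 2); i += n_hidden * 2
    b_h = theta[i:i + n_hidden]; i += n_hidden
    W_o = theta[i:i + n_hidden]; i += n_hidden
    b_o = theta[i]
    return W_h, b_h, W_o, b_o

def loss(theta):
    """Mean squared error of the network on the data set (the loss index)."""
    W_h, b_h, W_o, b_o = unpack(theta)
    h = np.tanh(X @ W_h.T + b_h)
    pred = h @ W_o + b_o
    return np.mean((pred - y) ** 2)

theta0 = rng.normal(scale=0.5, size=n_hidden * 2 + n_hidden + n_hidden + 1)
# BFGS builds an approximation to the inverse Hessian from successive
# gradient evaluations instead of computing second derivatives directly.
result = minimize(loss, theta0, method="BFGS", options={"maxiter": 1000})
# result.fun holds the final (reduced) loss; result.x the trained parameters
```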
Model selection in the Neural Designer software refers to finding the network architecture with the best generalization properties. Order selection was performed to find the order that generated an adequate fit to the data provided. The incremental order selection process was used to obtain the optimal order and the corresponding training and selection errors. Following order selection, the models were trained to reduce the errors.
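Incremental order selection can be sketched as follows, with scikit-learn's MLPRegressor (using its quasi-Newton lbfgs solver) as a stand-in for Neural Designer: candidate hidden-layer sizes are tried in increasing order and the order with the lowest selection (validation) error is kept. The data set here is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data: 3 inputs, 1 target.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 3))
y = X[:, 0] + 0.5 * np.sin(3 * X[:, 1]) - X[:, 2] ** 2

# Hold out a selection (validation) set, mirroring the 75/25 split above.
X_train, X_sel, y_train, y_sel = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_order, best_error = None, np.inf
for order in range(1, 11):              # candidate hidden-layer sizes
    net = MLPRegressor(hidden_layer_sizes=(order,), activation="tanh",
                       solver="lbfgs", max_iter=2000,
                       random_state=0).fit(X_train, y_train)
    sel_error = np.mean((net.predict(X_sel) - y_sel) ** 2)
    if sel_error < best_error:          # keep the order that generalizes best
        best_order, best_error = order, sel_error
```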

Mathematical Equations, Model Testing and Evaluation
Neural Designer was also used to obtain the mathematical equations for approximation of travel times for different peak periods. The errors obtained from conducting the neural network analyses were also documented.
After training the network for the required number of epochs, the models were tested using the set-aside test dataset. The Normalized Squared Error (NSE), the default error term when solving approximation problems, was used to evaluate the models. It yields a value between 0 (perfect prediction) and 1 (predicting on the basis of the mean) and can be represented as follows:
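The NSE, as commonly defined (and consistent with the 0-to-1 interpretation above), divides the sum of squared prediction errors by the sum of squared deviations of the targets from their mean:

NSE = Σ(ŷᵢ − yᵢ)² / Σ(yᵢ − ȳ)²

A small numerical check of this definition:

```python
import numpy as np

def normalized_squared_error(targets, predictions):
    """NSE = sum of squared errors / sum of squared deviations of the
    targets from their mean; 0 is a perfect fit, 1 matches predicting
    the mean of the targets."""
    residual = np.sum((predictions - targets) ** 2)
    normalization = np.sum((targets - np.mean(targets)) ** 2)
    return residual / normalization

y_true = np.array([10.0, 12.0, 14.0, 16.0])
assert normalized_squared_error(y_true, y_true) == 0.0        # perfect prediction
assert np.isclose(normalized_squared_error(
    y_true, np.full_like(y_true, y_true.mean())), 1.0)        # mean baseline
```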

Results
This section provides the findings of the research.

Summary Statistics
The distribution of the data points used to develop the models is presented in Figure 5, which gives an overview of the data for the 6 study routes for the AM, Mid-Day, and PM Peak hours. From Figure 5, totals of 7,190, 5,185, and 6,800 data points were used to develop the neural network models for the AM, Mid-Day, and PM Peak periods, respectively.
A summary of the average travel time of the buses based on the number of bus stops served during each peak period is presented in Table 2, and the descriptive statistics of the data sets are presented in Table 3. It can be observed from Table 2 that travel times generally increased over the course of the day for all numbers of bus stops served (0-9) along the route. The mean and standard deviation of all variables (dependent and independent) used in all three peak periods are presented in Table 3.

Neural Network Training
This section presents the results of the neural network training of models to predict bus travel times using the Neural Designer software. The matrices for all three peaks were analyzed using the Quasi-Newton algorithm with 2, 3, and 5 perceptron layers, separately. The inputs used for all Quasi-Newton analyses were scaled using the automatic scaling method; the size of the scaling layer was 5 (the number of inputs). For all peak periods, the outputs were converted back to the original units by the unscaling layer following the perceptron layers, using the minimum-and-maximum method for the output layer.
Order selection was performed to identify the model that generated an adequate fit for each peak period. The details of the results obtained from the neural network analysis using the Quasi-Newton algorithm are discussed in the following subsections.

Two-Perceptron Layers
Neural networks were developed for all three peak periods using 2 perceptron layers and analyzed using the Quasi-Newton optimization algorithm. The activation functions of the first layer were set to the hyperbolic tangent, while those of the second layer were set to linear. Table 4 shows the results of training for all peak periods.

Three-Perceptron Layers
Neural networks for all three peak periods having 3 perceptron layers were also modeled and analyzed using the Quasi-Newton optimization algorithm. The activation functions of the first two layers were set to the hyperbolic tangent, while those of the third layer of each peak period were set to linear. Table 5 shows the results of training for all peak periods. It can be observed from Table 5 that for the AM Peak period, the initial training error value of 16.17 decreased to 0.145 after 821 epochs. The initial value of the selection error for the AM Peak period was 16.08, and the final value after 821 epochs decreased to 0.154. From Table 5 it can also be observed that for the Mid-Day Peak period, the initial training error of 93.28 decreased to 0.0532 after 404 epochs. Similarly, after 404 epochs, the initial value of the selection error for the Mid-Day Peak period decreased from 72.12 to 0.0415. For the PM Peak period, the initial training error decreased from 18.00 to 0.205 after 1,000 epochs and the initial selection error decreased from 18.32 to 0.229.
The highest final training and selection errors were observed for the PM Peak period while the Mid-Day Peak period had the lowest final training and selection errors. The selection error for the Mid-Day Peak had the greatest change in error (99.942%) while the lowest change in error (98.75%) was obtained for the selection error of the PM Peak period after order selection.

Five-Perceptron Layers
Neural networks for all three peak periods having 5 perceptron layers were also modeled and analyzed using the Quasi-Newton optimization algorithm. The activation functions of the first four layers were set to the hyperbolic tangent, while those of the last layer of each peak period were set to linear. Table 6 shows the results of training for all peak periods. It can be observed from Table 6 that for the AM Peak period, the initial training error value of 96.80 decreased to 0.150 after 1,000 epochs. The initial value of the selection error for the AM Peak period was 96.64, and the final value after 1,000 epochs decreased to 0.159. From Table 6, it can also be observed that for the Mid-Day Peak period, the initial training error of 59.49 decreased to 0.094 after 1,000 epochs. Similarly, after 1,000 epochs, the initial value of the selection error for the Mid-Day Peak period decreased from 56.44 to 0.106. For the PM Peak period, the initial training error decreased from 61.08 to 0.235 after 1,000 epochs and the initial selection error decreased from 65.75 to 0.257. The highest final training and selection errors were observed in the PM Peak model, while the Mid-Day Peak period had the lowest final training and selection errors. The training error for the AM Peak period had the highest change in error (99.845%), while the lowest change in error (99.638%) was obtained for the selection error of the PM Peak period after order selection.

Errors and Error Statistics
This section presents the comparison of Normalized Squared Errors and the error statistics after performing neural network training.

Normalized Squared Error
The normalized squared errors were obtained for all instances to evaluate the models in each use (training, selection, and testing). Table 7 presents the normalized squared errors for the training, selection, and testing instances obtained from the models for all numbers of perceptron layers. It can be observed that the highest NSE values for the training, selection, and testing errors of all peak models were generally obtained for the models with 5 perceptron layers. The models with 2 perceptron layers trained with the Quasi-Newton method for the AM and Mid-Day Peak periods produced the lowest NSE for the training, selection, and testing sets.

Error Statistics
The Mean Absolute Errors (MAE) and Mean Percentage Errors (MPE) were also obtained for the models during the neural network analyses. MAE and MPE are error measures computed as averages over paired observations and can be used to gauge a model's accuracy. Figure 9 presents the MAE and MPE for all 3 peak period models.
The Mean Absolute Error measures the difference between two continuous variables. The prediction error, i.e., the difference between the observed and predicted values, is taken as a positive quantity to give the absolute error [18]. The mean of all recorded absolute errors (the MAE) gives an idea of the average error to expect from the prediction model. However, the MAE does not convey the relative size of the error, especially when comparing several models. The Mean Percentage Error (MPE) expresses the mean error in percentage terms; this indicates how large or small an error is and provides a better means of comparing the various models [19].
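The two statistics can be sketched as follows (note that sign conventions for MPE vary across sources; this sketch uses observed minus predicted, relative to the observed value):

```python
import numpy as np

def mae(observed, predicted):
    """Mean Absolute Error: average magnitude of the prediction errors."""
    return np.mean(np.abs(observed - predicted))

def mpe(observed, predicted):
    """Mean Percentage Error: average signed error relative to the
    observed values, expressed in percent."""
    return np.mean((observed - predicted) / observed) * 100.0

# Illustrative observed vs. predicted travel times (minutes)
obs = np.array([10.0, 20.0, 40.0])
pred = np.array([12.0, 18.0, 40.0])
# MAE = (2 + 2 + 0) / 3 ≈ 1.33 minutes; MPE = (-20% + 10% + 0%) / 3 ≈ -3.33%
```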

Figure 9. Mean absolute and mean percentage errors obtained after neural network analyses
It can be observed from Figure 9 that the lowest Mean Absolute Errors for the AM and Mid-Day Peak periods were obtained with the Quasi-Newton model with 2 layers, whereas the lowest MAE for the PM Peak period was obtained with the Quasi-Newton model with 5 layers. The models with 5 layers displayed the highest overall Mean Percentage Errors.

Since a very low training error can be a symptom of overfitting, training error alone may not be a good validation of the predictive model. Overfitting leads to inaccuracy in predicting the correct output from unseen data during testing. Hence, the testing error was analyzed for all models to evaluate the accuracy of the approximation, and the mean absolute error and the mean percentage error were compared to test the quality of the predictive models. The models trained using the Quasi-Newton algorithm with 2 perceptron layers had the lowest normalized squared (testing) errors, followed by the training models with 3 perceptron layers. The Mid-Day Peak models had the lowest overall errors.

Table 8 presents the mathematical expressions obtained from the neural network analysis for all peak models undergoing the Quasi-Newton analyses. It gives general equations for all three peak periods for the different analyses (with multiple perceptron layers). It should be noted that the scaled values of YAM, YMid-Day and YPM change with each analysis. The mathematical expressions for the scaled Y outputs were likewise obtained from the neural network analysis for all peak models. The scaled values of the independent variables are provided in Table 9, and the scaled values of y_x_x for all peak periods are provided in Table 10. The scaled values of X1 to X5 used in the calculations for the Quasi-Newton method with 3 perceptron layers were the same as those used for all peak period calculations with 2 perceptron layers.

Conclusion
Neural network models that can potentially help transit agencies improve bus travel time prediction were developed in this research. Delivering real-time travel time information to patrons conveniently will improve the reliability of such transit services. Hence, improvement in the credibility and performance of the WMATA online NextBus Arrival service can increase ridership among patrons and decrease dependence on privately owned vehicles. Beyond improved bus travel times, the short- and long-term benefits include alleviation of traffic congestion, reduction of travel times for all road users in urban areas like Washington, DC, and decreased vehicular emissions. Moreover, the prediction models can serve as an excellent tool for building schedules for new bus routes in or around Washington, DC. The results of the analyses indicate that ANN models can effectively predict the travel times of buses on the selected routes with minimal percentage errors. From the results, the highest MPE was observed for the PM Peak model with 2 perceptron layers (4.9%). This value is lower than the lowest MPE value (8.19%) obtained in the study conducted by Yin et al. [6]. The ANN models could be incorporated into several other predictive models used by WMATA to provide patrons with travel time information online or at bus stops. These models could also be adopted by transit agencies in other jurisdictions with characteristics similar to those of the Washington, DC area.