Yun Wang, Sudha Ram, Faiz Currim
MIS Department, Eller College of Management
University of Arizona
Tucson, USA

Ezequiel Dantas, Luiz Alberto Sabia
Secretaria Municipal de Conservação e Serviços Públicos
Prefeitura Municipal de Fortaleza
Fortaleza, Brazil
Abstract: Urbanization in developing countries has resulted in increased demand for public transportation in the face of limited resources. This requires smart transportation management that allows urban planners to evaluate the impact of their policies and design targeted interventions. This paper proposes a three-layer management system to support smart urban mobility with an emphasis on bus transportation. In Layer-1, we apply novel Big Data techniques to compute bus travel time and passenger demands in an efficient and economical way. Layer-2 contains two analytic components: network analysis of passenger transit patterns and causal relationship analysis for bus delays. The third layer provides decision support in an interactive visualization environment. The proposed system is developed and validated in cooperation with the city of Fortaleza in Brazil. The use of generally available urban transportation data makes our methodology adaptable and customizable for other cities.

Keywords: Smart City; Urban Transportation; MapReduce; Network Analysis

I. INTRODUCTION

Global development trends are creating a surge in urban development and associated population sizes. Much of the growth is expected to take place in cities of developing countries, widening the gap between urban mobility demands and the existing transportation infrastructure. Encouraging the use of public transportation options, such as buses, is one of the prospective solutions to alleviate this problem. However, bus transportation in developing countries typically must address issues such as overcrowding, delays and traffic congestion, owing to the high population density and lack of efficient system management [1]. One of the critical issues in developing smart cities is monitoring and evaluating the performance of a bus transportation system. Of particular interest to urban planners are fundamental performance indicators such as passenger demands or bus travel speeds and times.

Many existing studies have examined various issues such as forecasting passenger demands [2], [3] or predicting bus arrival time [4], [5]. However, most existing approaches require data from Automatic Vehicle Location (AVL) and Automatic Passenger Counter (APC) devices, which limits the generalizability and transferability of their findings due to the considerable costs of AVL-APC systems. A TCRP (Transit Cooperative Research Program) synthesis report in 2008 reported that for bus systems with fleet sizes below 750 vehicles, the cost of installing AVL was roughly $2.5 million plus a cost between $2,500 and $10,000 per bus for APC equipment [6]. On the other hand, Global Positioning System (GPS) tracking along with Automatic Fare Collection (AFC) provides a more economical alternative for capturing data on bus trips and passenger mobility. This is because a GPS tracking unit costs approximately $200 per bus and AFC is already widely used in many countries [7]. The challenges preventing widespread use of data from GPS-AFC systems for decision support by urban planners are twofold: data integration and data volumes. While the AFC and GPS systems were designed for different purposes (and their data are typically collected independently by different agencies), by connecting the two with other related datasets, we can derive very useful insights into bus movement patterns on a route and passenger boarding counts. This process is non-trivial because the GPS tracking signals and AFC records are coarse-grained. Careful processing is needed to extract the arrival time at each bus stop from GPS sensor data streams. Similarly, most AFC systems provide a transaction history, but do not associate transactions with a particular bus stop. Further, the prodigious amounts of data generated by a city-wide transportation system mean that insufficient computing resources or inefficient algorithms can be a barrier to information extraction.

In this study, we address the challenges described above by leveraging the power of Big Data. This includes novel ways of analyzing data using distributed processing techniques for efficient handling of large data streams. A three-layer system is proposed to support smart transportation management in developing countries where buses are the primary method for public transportation. The first layer serves as the support layer where we process large amounts of GPS signals and AFC records to compute bus travel time and passenger boarding locations using MapReduce [8]. The second layer contains two components that take the indicators from the first layer for data analysis. The first component uses boarding information to cluster bus stops into regions based on their geographic distance and user base. Using sequential boarding records of all passengers across the entire city, a novel trip transfer network is extracted to measure the importance of regions based on PageRank centrality [9] modified for bus system analysis. The second component applies regression analysis to investigate a variety of factors affecting bus travel time. The last layer implements a map-based visualization tool that enables urban planners to [...]
[...] of bus arrival time [4], [5], aiming at improving passenger experiences. Research has also suggested building an intelligent transportation management system through investing in AVL-APC systems [10]. However, due to its high cost, this solution may not be feasible for developing countries where access to expensive equipment on a system-wide basis is usually limited by resource constraints [1]. With the wide usage of AFC, passenger boarding information can be extracted even without AVL-APC systems, by integrating the AFC data with other data streams like GPS.

Forecasting passenger demands has attracted increased attention in the past years. Predictions are made based on time series models including autoregressive integrated moving average (ARIMA), seasonal ARIMA (SARIMA) or a hybrid approach [2], [3]. In these studies, passenger demands are modeled in an aggregated way; therefore, individual transit patterns are not considered. Another limitation is that these approaches are built on a limited number of routes. As a result, passenger transfer behavior is usually ignored.

Recent work on smart transportation systems has used smartphone applications to collect information such as bus trajectory and passenger on-board experience via crowd-sensing [11]. This approach does not depend on special bus equipment. However, its data quality largely depends on how many users are motivated to download and use the app. Also, the collected data may be biased due to self-reporting errors.

Human movement behaviors exhibit reproducible patterns [12]. Yet, few research studies have been devoted to the study of human mobility patterns in the context of bus transportation. The challenge here is obtaining passenger alighting information when passengers scan their cards only during boarding. Ground truth has to be collected through special surveys [13], which is expensive. This limits the number of trips observed and in turn reduces the generalizability of results. In this study, we examine bus passenger patterns using clustering techniques with innovative network analytics in place of recording the actual alighting location of a passenger.

III. METHODOLOGY

In this section, we first describe the data used in our study with an overview of the system architecture. The discussion then focuses on the mechanism we used to process input data, followed by quantitative techniques that utilize the outcomes.

A. System Overview

Our study used real-world data obtained from a large bus system containing 300 routes spanning nearly 5,000 bus stops in the city of Fortaleza, the fifth largest city in Brazil. The data were generated from approximately 2,000 buses over a period of 7 months, from May 2014 through November 2014. Most buses are equipped with GPS tracking devices sending latitude/longitude coordinates every 15-30 seconds. The AFC system supports both cash and smart card payments. The AFC data contain transactions at all bus stops except terminals. During the period of study, the city introduced several dedicated bus lanes to reduce bus delays.

Table I shows examples from the two major components of the dataset. Other complementary datasets include stop locations, bus schedules and dedicated lane locations. As mentioned previously, one objective of this study is extracting fine-grained information (e.g., bus travel time and passenger boarding locations) from raw data to work as a more economical and generalizable alternative to AVL-APC systems. This processing is non-trivial, considering that a typical month generates 200 million GPS signals and 30 million AFC records.

The architecture of our system is shown in Fig. 1. We designed a MapReduce algorithm in Layer-1 to efficiently compute bus travel time and passenger boarding stops as the fundamental indicators. They are then used by the two components in Layer-2 to extract insights about the current system. In Layer-3 we implemented a data visualization dashboard providing urban planners with metrics about the bus network on a city map.

Fig. 1. Architecture of the three-layer system
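As a rough illustration of the Layer-1 idea (group GPS signals by bus and day, then match each fare transaction to the GPS point closest in time and to the stop nearest that point), the following single-machine Python sketch mimics the map and reduce roles. It is not the authors' implementation. The function names, the pre-parsed timestamps given as (date, seconds since midnight) pairs, the planar Euclidean distance, and the choice to keep the earliest matched GPS time per stop are all assumptions made for the example. It covers only stops that have at least one fare record.

```python
from bisect import bisect_left
from collections import defaultdict
from math import hypot


def map_gps(gps_stream):
    """Map role: key every GPS tuple by (date, bus ID)."""
    for bus_id, coords, (date, seconds) in gps_stream:
        yield (date, bus_id), (coords, seconds)


def group_by_key(pairs):
    """Stand-in for the shuffle step: collect points under their key."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    return groups


def closest_in_time(points, times, seconds):
    """Return the GPS point whose time is closest to `seconds`.

    `points` must be sorted by time and `times` is the matching list of times.
    """
    i = bisect_left(times, seconds)
    candidates = points[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda p: abs(p[1] - seconds))


def nearest_stop(stops, coords):
    """Return the ID of the stop nearest to `coords` (planar distance, an assumption)."""
    return min(
        stops,
        key=lambda s: hypot(stops[s][0] - coords[0], stops[s][1] - coords[1]),
    )


def reduce_trip(points, afc_records, stops):
    """Reduce role for one bus on one day: estimate boarding stop and arrival time."""
    points = sorted(points, key=lambda p: p[1])
    times = [seconds for _, seconds in points]
    arrival, boarded = {}, defaultdict(list)
    for card_id, seconds in afc_records:
        coords, gps_seconds = closest_in_time(points, times, seconds)
        stop_id = nearest_stop(stops, coords)
        # Assumption of this sketch: keep the earliest matched GPS time per stop.
        arrival[stop_id] = min(arrival.get(stop_id, gps_seconds), gps_seconds)
        boarded[stop_id].append(card_id)
    return {s: (arrival[s], boarded[s]) for s in arrival}
```

Because each (date, bus ID) group is processed independently in `reduce_trip`, the groups could be distributed across workers, which is the property a MapReduce formulation relies on.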
B. Computing Fundamental Indicators in MapReduce

Conceptually, the computation of fundamental indicators can be solved as a multi-step Nearest Neighbor Search (NNS):

- For an AFC record, search the GPS records of the same bus on the same day for the one with the closest timestamp. Then, using that GPS record, search for the nearest stop on the route as the boarding location. Meanwhile, we also estimate the bus arrival time at that stop.
- For the same bus trip, if the arrival time of a stop has not been resolved (e.g., no boarding records), we first identify the nearest resolved stop to this target stop. We use its arrival time to filter GPS records within a time interval determined by the scheduled trip time. Finally, we search the filtered GPS records for the one closest to the target stop.

In practice, AFC records of the same boarding event can be grouped together to reduce search attempts. GPS data are indexed by timestamps in B-Trees for time-based NNS or indexed by coordinates in R-Trees for distance-based NNS. Still, processing all buses across the city can be time-consuming due to the input scale.

MapReduce as a programming framework simplifies the complexity of distributed computing [8]. Given the fact that the previous computation can be performed simultaneously for each individual bus on a particular day, we implemented a MapReduce algorithm to arrange the computation in a parallel manner. GPS data are streaming signals received citywide. In this study, we simulate the process by streaming archived GPS data on a monthly basis. AFC records are organized by trips; therefore, we can segment the data by each trip. In the map phase, the GPS data are partitioned by a composite key of bus ID and date. Searching is executed in the reduce phase. A detailed algorithm is listed in Table II.

TABLE II. LAYER-1 MAPREDUCE ALGORITHM

Map function: partition GPS stream
Input:  GPS stream $G$ of tuples $\langle B_i, C_i, T_i \rangle$ where $B_i$ is bus ID, $C_i$ is a pair of coordinates and $T_i$ is timestamp
Output: id: a composite key of date and bus ID
        point: a tuple of coordinates and time in seconds
  1. for all $\langle B_i, C_i, T_i \rangle \in G$
  2.   Extract date from $T_i$ as $D_i$ and convert time of day to seconds stored as $S_i$
  3.   Store id $= \langle D_i, B_i \rangle$, point $= \langle C_i, S_i \rangle$
  4. end for
  5. yield all $\langle$id, point$\rangle$ pairs

Reduce function: estimate boarding stop and bus arrival time
Input:  $\langle$id, point$\rangle$ pairs merged by id
        One trip of AFC records $A$ containing tuples $\langle U_i, B, R, V, T_i \rangle$ where $U_i$ is card ID, $B$ is bus ID, $T_i$ is timestamp, $R$ is route and $V$ is direction. For a single trip, $B$, $R$ and $V$ are fixed
Output: id: a bus stop ID
        estimation: a tuple of estimated arrival time and boarding passengers
  1. Retrieve stops $P$ on route $(R, V)$
  2. for all $\langle U_i, B, R, V, T_i \rangle \in A$
  3.   Extract date from $T_i$ as $D_i$ and convert time of day to seconds stored as $S_i$
  4.   Search $S_i$ in point for closest $S_n$ and record $(C_n, S_n)$
  5.   Search $C_n$ in $P$ for nearest stop $P_m$ and store arrival time $Q_m = S_n$, $U_m \leftarrow U_m \cup \{U_i\}$
  6. end for
  7. for all $P_i$ in $P$
  8.   if $Q_i$ = None then
  9.     Search $P_i$ in $P$ for nearest stop $P_k$ with $Q_k \neq$ None
  10.    Filter point by a time range $[Q_k - T, Q_k + T]$ and store result in target
  11.    Search $P_i$ in target for nearest $(C_v, S_v)$ and store arrival time $Q_i = S_v$
  12.  end if
  13.  Store id $= P_i$, estimation $= (Q_i, U_i)$
  14.  yield all $\langle$id, estimation$\rangle$ pairs
  15. end for

C. Network Analysis of Transit Patterns

In this section, we explore transit patterns that are commonly shared by passengers. Relying on this insight, urban planners can evaluate if the current bus system meets all passengers' needs. In a bus network, a transit pattern is characterized by a pair of boarding and alighting stops. Boarding stops are estimated in Layer-1. However, the alighting location is not directly available (the AFC system is designed to collect fares, and generates a single transaction for a passenger). To address this issue, we first apply a bus stop clustering algorithm to increase the reliability of pattern extraction. Then we introduce our network analysis method to reveal passenger transit patterns without explicitly knowing their alighting locations.

1) Bus Stop Clustering: A transit pattern can be inferred by observing people who have similar demands. For example, school students who live in nearby neighborhoods would share a commuting pattern. For this reason, a single pair of boarding/alighting bus stops is not enough to capture all trips following the same pattern. It is better to use a region (approximating a neighborhood) to represent the origin and destination in a transit pattern. In this study, we used a modified k-means clustering to identify these regions.

Instead of determining the number of clusters in advance, we start with an informed seeding strategy. Let $B = \{b_1, b_2, \ldots, b_n\}$ be a set of bus stops we want to cluster and $U(b_i)$ be the number of passengers who boarded at $b_i$ in a past period. A new initial center comes from a subset $D \subseteq B$ defined as:

\[
D = \{\, b_x \mid b_x \in (B \setminus C),\ U(b_x) \ge a \ \text{and}\ \forall b_y \in C : E(b_x, b_y) \ge d \,\} \tag{1}
\]

where $C$ is the set of selected centers, $E(b_x, b_y)$ calculates the Euclidean distance between $b_x$ and $b_y$, $a$ is the threshold to control the minimum boarding passengers of an initial center and $d$ controls the minimum distance between the new center and old ones (a minimum distance ensures each cluster corresponds to a different region). A bus stop $b_i \in D$ is randomly selected based on a probability:

\[
P(b_i) = \frac{U(b_i)}{\sum_{x \in D} U(x)} \tag{2}
\]
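The seeding rule in (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names, the dictionary-based inputs, and the use of planar coordinates for $E$ are assumptions. The sketch repeats the draw until no stop qualifies, and it assumes $a > 0$ so that every candidate has a positive selection weight.

```python
import random
from math import hypot


def distance(p, q):
    """Planar Euclidean distance E between two coordinate pairs (an assumption)."""
    return hypot(p[0] - q[0], p[1] - q[1])


def eligible(stops, boardings, centers, a, d):
    """Set D from (1): unselected stops with at least `a` boardings that lie
    at least `d` away from every center chosen so far."""
    return [
        b for b in stops
        if b not in centers
        and boardings[b] >= a
        and all(distance(stops[b], stops[c]) >= d for c in centers)
    ]


def seed_centers(stops, boardings, a, d, rng=None):
    """Draw initial centers one at a time with probability (2)."""
    rng = rng or random.Random()
    centers = []
    while True:
        candidates = eligible(stops, boardings, centers, a, d)
        if not candidates:
            return centers
        weights = [boardings[b] for b in candidates]  # proportional to U(b_i)
        centers.append(rng.choices(candidates, weights=weights, k=1)[0])
```

The number of returned centers then plays the role of k: raising `a` or `d` yields fewer, larger regions, while lowering them yields more.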
The seeding process will stop when $D$ is empty. The algorithm then proceeds with standard k-means. Intuitively, this strategy is designed to form clusters that have a high user base and are evenly distributed across the city. Using parameters $a$ and $d$, a pre-defined cluster number is not needed. Besides, the two parameters are domain-related, making them friendlier for transportation engineers to adjust compared with tweaking the value of $k$.

2) Rank Region Importance in a Transfer Network: Alighting stops can be inferred if a passenger has multiple [...] estimated boarding stop and $L_i$ indicates the unknown alighting stop. If the distance between $B_{i+1}$ and a certain stop on the route of $B_i$ is within a walking range, we assume this stop to be $L_i$ (the latter may represent a transfer stop on a different route or the boarding point for the return trip). Using this heuristic [...]

[...] self-loop is avoided during the network generation. Considering the seasonal pattern of urban mobility, we generate the transfer network on a monthly basis. From our dataset, about 2 million sequences were used for the monthly network generation. The edge weight $w_{i \to j}$ of an outgoing edge from $R_i$ to $R_j$ was normalized following Equation (3):

\[
w_{i \to j} = \frac{C_{ij}^{p}}{\sum_{k \in N_i^{p}} C_{ik}^{p}} \tag{3}
\]

where [...] $p$ and $N_i^{p}$ is the set of out-neighbors of $R_i$ in the network. Based on (3), we update each node's weighted PageRank in iteration $t$ as in (4):

\[
PR_t(R_i) = \sum_{R_j \in I_i^{p}} PR_{t-1}(R_j)\, w_{j \to i} \tag{4}
\]
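The normalization in (3) and the update in (4) can be illustrated with a short Python sketch. It is a minimal reading of the two formulas as they appear here, not the authors' implementation: the function names, the dictionary of region-to-region transfer counts, the uniform starting rank and the fixed iteration count are assumptions. No damping factor is included because none appears in (4), so the iteration is only guaranteed to settle on networks that are strongly connected and aperiodic.

```python
from collections import defaultdict


def normalize_weights(counts):
    """Equation (3): divide each transfer count by the total outgoing
    count of its source region. Self-loops are skipped."""
    counts = {(i, j): c for (i, j), c in counts.items() if i != j}
    out_total = defaultdict(float)
    for (i, _), c in counts.items():
        out_total[i] += c
    return {(i, j): c / out_total[i] for (i, j), c in counts.items()}


def weighted_pagerank(weights, iterations=50):
    """Equation (4): each region's new rank is the weighted sum of the
    previous ranks of the regions that point to it."""
    nodes = sorted({n for edge in weights for n in edge})
    rank = {n: 1.0 / len(nodes) for n in nodes}  # uniform start (assumption)
    for _ in range(iterations):
        new_rank = dict.fromkeys(nodes, 0.0)
        for (j, i), w in weights.items():
            new_rank[i] += rank[j] * w
        rank = new_rank
    return rank


# Hypothetical monthly transfer counts between three regions.
counts = {("A", "B"): 3, ("A", "C"): 1, ("B", "A"): 1, ("B", "C"): 1, ("C", "A"): 4}
ranks = weighted_pagerank(normalize_weights(counts))
```

In this toy network region A receives the most weighted transfers and ends up with the highest rank, followed by B and then C.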
Fig. 4. Identified regions of high mobility demands based on: (a) average passenger boarding volume and (b) PageRank centrality

TABLE III. REGRESSION ANALYSIS OF ROUTE-LEVEL DELAYS

Independent Variables    Type       B          Std. Error
Weekday                  Binary     -12.196    2.704         -.069***
Morning Rush Hour        Binary     -20.840    1.790         -.165***
Afternoon Rush Hour      Binary     10.056     1.888         -.077***
Rain                     Binary     -.290      1.868         -.002
Number of Passengers     Numeric    -3.68      .018          -.306***
Dedicated Lane           Binary     5.311      1.427         .052***

[...] analyses show that in the segments where dedicated lanes are implemented, there is an improvement in average transit speeds. Another result worth mentioning is the low negative impact of the afternoon peak hour compared with the morning peak hour. By further investigation of the schedule data, we learned that this is in part due to the route schedules budgeting on average 15-20 more minutes during afternoon peak hours (relative to the morning). For more details on this analysis, we refer the reader to our earlier work [7].

D. Segment Level Delay Analysis

Map-based visualization of big data analyses is an important component of smart transportation management, as it enables domain experts to apply their knowledge while evaluating metrics. Fig. 5 shows a few interface screens in Layer-3. Users (city administrators) select routes and a time range (this could be multiple periods to support a comparative study). Fundamental indicators (e.g., bus travel time, bus speed, or passenger boarding counts) are provided on the map so that the users can apply their domain knowledge to evaluate the metrics.

V. CONCLUSION AND FUTURE WORK

In this paper, we have presented a systematic solution for smart transportation management on bus networks. A three-layer system is proposed to help urban planners, managers and technicians in their management work. We leverage the power of Big Data in the implementation of the system. The system is being tested in cooperation with the city of Fortaleza in Brazil.

We have several working directions for future research. To validate our analyses, we are in the process of collecting ground truth on passenger alighting locations using surveys. This can be used to build a boarding/alighting prediction model at the individual level. Another direction being worked on is to use the analytic insights from the bus transportation system to design and implement actionable interventions and test the outcomes in the field (for example, anomalous speed patterns on dedicated lanes may indicate a need to examine patterns of parking obstructions). Lastly, we plan to examine how bus transportation works with other smart city urban mobility programs such as bike sharing.

REFERENCES

[1] K. Gwilliam, "Urban transport in developing countries," Transp. Rev., vol. 23, no. 2, pp. 197-216, Jan. 2003.
[2] R. Xue, D. (Jian) Sun, and S. Chen, "Short-term bus passenger demand prediction based on time series model and interactive multiple model approach," Discret. Dyn. Nat. Soc., vol. 2015, no. i, pp. 1-11, 2015.
[3] L. Moreira-Matias, J. Gama, M. Ferreira, J. Mendes-Moreira, and L. Damas, "Predicting taxi-passenger demand using streaming data," IEEE Trans. Intell. Transp. Syst., vol. 14, no. 3, pp. 1393-1402, 2013.
[4] C. Bai, Z. Peng, Q. Lu, and J. Sun, "Dynamic Bus Travel Time Prediction Models on Road with Multiple Bus Routes," Comput. Intell. Neurosci., vol. 2015, pp. 1-9, 2015.
[5] P. R. Tétreault and A. M. El-Geneidy, "Estimating bus run times for new limited-stop service using archived AVL and APC data," Transp. Res. Part A Policy Pract., vol. 44, no. 6, pp. 390-402, Jul. 2010.
[6] D. J. Parker, AVL systems for bus transit: Update. Transp. Res. Board, 2008.
[7] S. Ram, Y. Wang, F. Currim, F. Dong, E. Dantas, and L. A. Sabia, "SMARTBUS: A Web Application for Smart Urban Mobility and Transportation," in 25th International Conference on World Wide Web Companion, 2016.
[8] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Commun. ACM, vol. 53, no. 1, pp. 72-77, Jan. 2010.
[9] D. F. Gleich, "PageRank beyond the Web," SIAM Rev., vol. 57, no. 3, pp. 321-363, Jul. 2014.
[10] P. G. Furth, B. Hemily, T. H. J. Muller, and J. G. Strathman, "Uses of archived AVL-APC data to improve transit performance and management: Review and potential," Transp. Res. Board, vol. 23, no. June 2003, 2003.
[11] K. Farkas, G. Feher, A. Benczur, and C. Sidlo, "Crowdsending based public transport information service in smart cities," IEEE Commun. Mag., vol. 53, no. 8, pp. 158-165, Aug. 2015.
[12] M. C. González, C. A. Hidalgo, and A.-L. Barabási, "Understanding individual human mobility patterns," Nature, vol. 453, no. 7196, pp. 779-82, Jun. 2008.
[13] N. J. Yuan, Y. Wang, F. Zhang, X. Xie, and G. Sun, "Reconstructing individual mobility from smart card transactions: A space alignment approach," in Proceedings - IEEE International Conference on Data Mining, ICDM, 2013, pp. 877-886.
Fig. 5. Interfaces of segment level delay analysis