People In Motion: Spatio-temporal Analytics on Call Detail Records

Vinay Kolar, Sayan Ranu, Anand Prabhu Subramainan, Yedendra Shrinivasan, Aditya Telang, Ravi Kokku, Sriram Raghavan
IBM Research, India {vinkolar,sayanranu,anandprabhu,yshriniv,aaditya.telang,ravkokku,sriramraghavan}@in.ibm.com

ABSTRACT
The data about how people move in a city can potentially be used by various enterprises and government organizations to strategically optimize their operations and maximize their revenue. However, fine-grained and real-time data is currently unavailable to the enterprises. We believe that cellular network operators can deliver such data and insights to enterprises. Call records collected in the networks embed a wealth of information about where, when and how a large fraction of the city moves. However, this information is untapped; a majority of the cellular operators are not deriving spatio-temporal insights or monetizing the data that is already available. In this paper, we demonstrate People in Motion: an end-to-end Hadoop-based system with a library of spatio-temporal algorithms that operates on the call record data to derive business insights. We identify the hangouts and trajectories of users with different interests. Finally, we demonstrate a visual analytics tool that helps business users compute, compare and contrast the importance of spatial regions at different times for different categories of users.

Categories and Subject Descriptors


H.3.4 [Information Systems]: Information Storage and Retrieval - Systems and Software; C.2.3 [Computer Systems Organization]: Computer-Communication Networks - Network Operations

Keywords
Call Detail Records, Spatio-temporal Analytics

1. INTRODUCTION

Cellular network operators constantly collect vast amounts of data that provide invaluable information about how people in a city move. Network operators generally maintain Call Detail Records (CDRs) that record each voice, data

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. COMSNETS '14, Bangalore, India. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

and SMS transaction of every subscriber. Among other parameters, these records collect the subscriber number, the base-station connected to, and the duration of the transaction. They can be joined with other side-channel information, such as city maps and base-station location data, to infer how a subscriber moves in a city [3]. These records capture the whereabouts of a large fraction of the population; for example, major cellular network operators in India have more than 150 million subscribers, whose movement can be captured [4, 12]. CDR data also provides large spatial coverage; generally, hundreds of thousands of base-stations are deployed across the nation to provide wide coverage to all subscribers [4]. CDR data collected by the cellular network operator hence provides a wealth of information that can be used to exhaustively describe the spatio-temporal movement of the population. The network data provides much finer-grained and nearer real-time information about population movement than traditional population survey schemes such as the census.

Rich CDR data recorded by the cellular operators has applications in several domains. First, insights from CDR data can be used by the cellular operators for Network Planning and Performance Optimization. For example, the spatio-temporal variation of loads recorded in the CDR data can provide hints about where base-stations should be installed or upgraded. Second, network operators can use the data to discover novel Marketing Opportunities; the marketing division can understand the demography and movement of subscribers, and launch appropriate campaigns for a targeted population in focused regions. Finally, the data and insights can be used for Data Monetization. Without compromising the privacy of individual users, the cellular network operator can sell insights into how the city or a category of the population behaves in different regions [11]. These insights can be used by various other enterprises to plan their operations.
For example, a sports retailer can use these insights to understand where and when sports enthusiasts hang out and travel in a city, and then choose when and on which digital billboards to display its advertisements, or where to set up its next retail store. In turn, the cellular network operators benefit by selling such aggregate insights to other enterprises, and hence monetize the data collected. While the data collected by cellular networks is rich in describing the population and has the potential to be used in several novel use cases, it currently remains untapped. There are several technical challenges. The first challenge is to develop a system that can continuously ingest large volumes of CDR data and operate on it to provide fast analytics; the system

Figure 1: System Architecture

connect to, and interest data about users. This varied set of input data sources is fused into a holistic data model in a big data key-value store. In our system, we use Apache HBase [1] on top of the Hadoop Distributed File System (HDFS) as our key-value store to store and efficiently retrieve large amounts of CDR data. Our analytic algorithms (described in Section 3) are implemented as fast map-reduce jobs that run on the data stored in the HBase data store. Our deep analytics library consists of trajectory and hangout detection algorithms, and trajectory similarity, aggregation and mining algorithms. The visualization front end (Section 4) issues queries to the backend, which returns the data in JavaScript Object Notation (JSON) format. The analytics derived from our system have applications in several domains such as real estate, commerce, pollution and disease control, retail, transportation, banking and insurance.

should be capable of large-scale analytics since the operators typically generate billions of CDRs per day. Another challenge is to build a library of fundamental algorithms that operate on the CDR and map data, and provide valuable spatio-temporal insights. Traditional spatio-temporal analytics algorithms operate on location samples provided by accurate location estimates, such as GPS information. However, CDRs provide coarse-grained location estimates; the subscriber can be kilometers away from the base-station whose location is reported. Similarly, unlike continuous GPS samples, CDRs are aperiodic and infrequent; CDRs are produced only when a subscriber engages in a transaction. In this demonstration, we describe an end-to-end Spatio-temporal Network Analytics System, called People In Motion, that ingests the CDR data and provides algorithms for detecting and representing population movement. Section 2 describes how our Hadoop-based distributed system organizes CDR data to provide efficient retrieval for many spatio-temporal applications. Section 3 describes Deep Analytics, which provides a fast and efficient library of algorithms that operate on the CDR data and provide insights about the population movement. These algorithms include Hangout Rank, which detects and ranks the prominent places where subscribers with a particular interest spend significant time. We also describe the Trajectory Rank algorithm, which ranks different regions in the city based on how a subset of users with a particular interest travel. We explain the visualization framework in Section 4. We discuss the tool developed for business analysts to explore, compare and contrast the spatio-temporal variations in the region rankings for different interests at different times. We believe that our end-to-end system benefits various enterprises, as well as the cellular network operators. Finally, we conclude and describe our future plans to expand the spatio-temporal network analytics toolkit in Section 5.

3. DEEP ANALYTICS LIBRARY

The Deep Analytics library consists of fundamental algorithms that can be used in several application domains. We explain a subset of the relevant fundamental algorithms in this section.

3.1 Trajectory and Hangout Construction Algorithms

A fundamental part of the analytics library consists of building the commonly used hangouts and trajectories of subscribers from the CDR data. We currently assume that handoff information is stored in the CDRs; that is, a subscriber's CDRs record all the base-stations encountered by the subscriber (see footnote 1).

Master Trajectory Representation.


From the set of CDRs, we create one master trajectory for each subscriber. A master trajectory for a subscriber is a sequence of trajectory points. Each trajectory point is a (latitude, longitude, start-time, end-time) tuple, and the sequence of trajectory points is sorted by start-time. The latitude and longitude map to the base-station location, and the time values represent the interval during which the subscriber was associated with that base-station.
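The construction above can be sketched as follows. This is a minimal Python illustration, not the system's implementation: the `Cdr` record layout, the `station_location` lookup table, and the function name `build_master_trajectories` are assumptions made for the example, and real CDRs carry many more fields.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Cdr(NamedTuple):
    """Hypothetical, simplified CDR; real records carry many more fields."""
    subscriber: str
    station_id: str
    start_time: int  # seconds since epoch
    end_time: int


class TrajectoryPoint(NamedTuple):
    latitude: float
    longitude: float
    start_time: int
    end_time: int


def build_master_trajectories(
    cdrs: List[Cdr],
    station_location: Dict[str, Tuple[float, float]],
) -> Dict[str, List[TrajectoryPoint]]:
    """Group CDRs by subscriber and emit trajectory points sorted by start-time."""
    grouped: Dict[str, List[TrajectoryPoint]] = defaultdict(list)
    for cdr in cdrs:
        # The point's coordinates are those of the base-station, not the handset.
        lat, lon = station_location[cdr.station_id]
        grouped[cdr.subscriber].append(
            TrajectoryPoint(lat, lon, cdr.start_time, cdr.end_time)
        )
    return {
        subscriber: sorted(points, key=lambda p: p.start_time)
        for subscriber, points in grouped.items()
    }
```

In a map-reduce setting, the grouping step would correspond to keying records by subscriber, with the sort performed per key.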

Hangout Detection.
We first construct equi-sized rectangular grids on the city's map. For each subscriber, we parse the Master Trajectory to find grids where the user has halted for more than a threshold time (say, 30 minutes). For each such halt, we increment the halt-count of that grid for the subscriber. We then take the top-k grids with the highest halt-counts and tag these as the subscriber's hangouts.
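A minimal Python sketch of this halt-counting procedure is given below. It rests on two assumptions that the text does not specify: grid cells are defined by a fixed step in degrees (`cell_deg`) rather than in meters, and a halt is a run of consecutive trajectory points in the same cell whose total time span exceeds the threshold. The names `cell_of` and `detect_hangouts` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

Point = Tuple[float, float, int, int]  # (latitude, longitude, start, end), times in seconds
Cell = Tuple[int, int]


def cell_of(lat: float, lon: float, cell_deg: float = 0.005) -> Cell:
    """Map a coordinate to a grid cell; cell_deg is an assumed cell size in degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def detect_hangouts(master: List[Point], min_halt_s: int = 1800, k: int = 3,
                    cell_deg: float = 0.005) -> List[Cell]:
    """Return the top-k cells by halt-count for one subscriber's master trajectory."""
    halt_counts: Counter = Counter()
    run_cell: Optional[Cell] = None
    run_start = run_end = 0
    for lat, lon, start, end in master:
        cell = cell_of(lat, lon, cell_deg)
        if cell == run_cell:
            run_end = max(run_end, end)  # still in the same cell: extend the run
            continue
        # The subscriber left the previous cell: count the run if it was a halt.
        if run_cell is not None and run_end - run_start > min_halt_s:
            halt_counts[run_cell] += 1
        run_cell, run_start, run_end = cell, start, end
    if run_cell is not None and run_end - run_start > min_halt_s:
        halt_counts[run_cell] += 1
    return [cell for cell, _ in halt_counts.most_common(k)]
```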

Trajectory Detection.
We break the Master Trajectory of each subscriber into multiple trajectories based on where the user has halted. These trajectories are then fed into a Trajectory Pattern Mining algorithm, which computes the most common trajectories used by the subscriber [8]. Such trajectories represent
1 This assumption is not always true; some telcos store only the starting and the ending base-stations to which the user was connected. Extending our approach to such data is part of our future work.

2. SYSTEM OVERVIEW

The overall system architecture of People in Motion is shown in Figure 1. The input data consists of Call Detail Records, base station GIS information that provides the location (latitude, longitude) of the base stations that users

the paths over which the subscriber often moves. These form the subscriber trajectories on which the rest of the algorithms operate.
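The first step of Trajectory Detection, breaking a master trajectory at halts, might look like the following Python sketch. It assumes that a halt is a single trajectory point whose dwell time exceeds the threshold and that each halt serves as both the destination of one trip and the origin of the next; the text does not fix either detail. The Trajectory Pattern Mining step of [8] is not shown, and `split_at_halts` is a hypothetical name.

```python
from typing import List, Tuple

Point = Tuple[float, float, int, int]  # (latitude, longitude, start, end), times in seconds


def split_at_halts(master: List[Point], min_halt_s: int = 1800) -> List[List[Point]]:
    """Cut a master trajectory into trips; each halt ends one trip and starts the next."""
    trips: List[List[Point]] = []
    current: List[Point] = []
    for point in master:
        current.append(point)
        _, _, start, end = point
        if end - start > min_halt_s:  # assumed halt criterion: long dwell at one point
            if len(current) > 1:
                trips.append(current)
            current = [point]  # the halt is also the origin of the next trip
    if len(current) > 1:
        trips.append(current)
    return trips
```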

3.2 Region Summarization Algorithms

Using the individual subscribers' hangouts and trajectories identified in the previous subsection, we develop spatio-temporal algorithms that summarize the movement of the population in the city. These algorithms provide insights that can be used by the operators and other enterprises for solving problems in several application domains, without compromising the identity of a subscriber. Our algorithms are generic and privacy-preserving; they can be configured to provide various insights based on the input to the algorithms. The input to the algorithms below is a set of trajectories and hangouts. The output can summarize several characteristics of the city population, depending on which set of trajectories is provided as the input.

Figure 2: Screen shot of the Visualization Framework (controls: Choose Layout, Choose Interest, Choose Metric, Choose Time Range, Show summarized Hangouts; overlays: Trajectories, Heat-map; panels: Map Instance 1, Map Instance 2)

the subset of input trajectories that pass through that grid. This output data is fed to the visualization framework.
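The four output data structures of Section 3.3 could be modeled as in the Python sketch below. The field names `gridId`, `interestId`, `interestName`, `trajectoryId` and `metric` come from the text; every other name (`GeoTile`, `minLat`, `points`, `to_json` and so on) and the exact JSON shape are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class GeoTile:
    gridId: int
    minLat: float  # bounding box of the tile (hypothetical field names)
    maxLat: float
    minLon: float
    maxLon: float


@dataclass
class Interest:
    interestId: int
    interestName: str


@dataclass
class Trajectory:
    trajectoryId: int
    points: List[Tuple[float, float, str]]  # (latitude, longitude, summarized time)


@dataclass
class MetricRecord:
    gridId: int
    metric: float             # the "heat" of the grid
    trajectoryIds: List[int]  # trajectories that pass through the grid


def to_json(records: list) -> str:
    """Serialize a list of the records above for the visualization front end."""
    return json.dumps([asdict(record) for record in records], sort_keys=True)
```

A front end could then join `MetricRecord` entries to `GeoTile` entries on `gridId` to color each tile.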

Hangout-Rank.
Given a list of the subscribers' hangouts, the Hangout Rank algorithm provides a ranking of the different grids on the city map. This can be used to create a heat-map of the different hangouts of the users. Such heat-maps are useful for enterprises that want to know where a certain set of subscribers hang out at different times.
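The text does not state the scoring function behind Hangout Rank. As one plausible reading, the Python sketch below scores each grid by the number of distinct subscribers who have it as a hangout; this scoring rule and the name `hangout_rank` are assumptions, not the system's definition.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def hangout_rank(hangouts_by_subscriber: Dict[str, Iterable[str]]) -> List[Tuple[str, int]]:
    """Rank gridIds by how many subscribers have a hangout there (assumed score)."""
    counts: Counter = Counter()
    for cells in hangouts_by_subscriber.values():
        counts.update(set(cells))  # count each subscriber at most once per grid
    # Highest score first; ties are broken by gridId to keep the output deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Because only per-grid counts leave the function, the output carries no subscriber identifiers, which matches the aggregate nature of the insights described above.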

4. VISUALIZATION FRAMEWORK

Trajectory Rank.
The Trajectory Rank algorithm takes a set of trajectories as input, and ranks each grid by the number of times it is visited by the trajectories. Trajectory Rank answers questions such as "Which physical regions are well connected?". This has applications in identifying new places to construct retail outlets, gas stations, ATMs or shopping malls.
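A minimal Python sketch of this visit counting follows. It assumes each trajectory has already been mapped to the sequence of gridIds it passes through, and that consecutive repeats of the same grid count as a single visit; both are assumptions of the example. The function name `trajectory_rank` is hypothetical. The set of trajectory ids per grid is kept because the metric data of Section 3.3 reports it alongside the metric.

```python
from collections import defaultdict
from itertools import groupby
from typing import Dict, List, Set, Tuple


def trajectory_rank(trajectories: Dict[str, List[str]]) -> List[Tuple[str, int, Set[str]]]:
    """Rank gridIds by visit count; also return the trajectories passing through each."""
    visits: Dict[str, int] = defaultdict(int)
    passing: Dict[str, Set[str]] = defaultdict(set)
    for trajectory_id, cells in trajectories.items():
        for cell, _ in groupby(cells):  # collapse consecutive repeats of a grid
            visits[cell] += 1
            passing[cell].add(trajectory_id)
    ranked = sorted(visits, key=lambda cell: (-visits[cell], cell))
    return [(cell, visits[cell], passing[cell]) for cell in ranked]
```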

An interactive visualization framework is built to explore the trajectory analytics discussed in Section 3. The framework is designed to simultaneously explore different analytics results and compare user interests over regional trajectory summaries. Flow maps [10] are used to represent aggregated trajectory data, where a summary trajectory is drawn with varying width based on the number of individual trajectories or their associated attributes. Andrienko et al. [2] suggest different spatial generalizations of aggregated trajectory data over time on a flow map. Flowstrates [5] uses two separate maps to display the origins and destinations, and a heat map in the middle to represent the changes in flow magnitudes over time. These techniques help in studying mass movement, but do not support visualizing user interest information over regional trajectory summaries.

Trajectory Similarity.
We have implemented state-of-the-art trajectory similarity algorithms [6] that identify users sharing common trajectories. These algorithms can be used for several social applications, such as dynamic car pooling.
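The similarity measure introduced in [6] is Edit Distance on Real sequence (EDR): two points match if they are within a tolerance eps in each coordinate, and the distance is the minimum number of insert, delete and replace operations needed to make the two sequences match. The Python sketch below is a plain dynamic-programming version of that definition, without the pruning techniques of [6]; whether the deployed system uses exactly this formulation is not stated in the text.

```python
from typing import List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


def edr(r: List[Coord], s: List[Coord], eps: float) -> int:
    """Edit Distance on Real sequence between two coordinate sequences."""
    previous = list(range(len(s) + 1))  # distance from an empty prefix of r
    for i in range(1, len(r) + 1):
        current = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            # Two points match when both coordinates differ by at most eps.
            match = (abs(r[i - 1][0] - s[j - 1][0]) <= eps
                     and abs(r[i - 1][1] - s[j - 1][1]) <= eps)
            current[j] = min(
                previous[j - 1] + (0 if match else 1),  # match or replace
                previous[j] + 1,                        # delete from r
                current[j - 1] + 1,                     # insert into r
            )
        previous = current
    return previous[len(s)]
```

A smaller distance means more similar trajectories, so candidate car-pooling pairs would be those whose EDR falls below some chosen cut-off.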

4.1 Geo-Heat Map Matrix

3.3 Output of Deep Analytics

Analytics results are organized into the following four data structures:
1. Geo Data: This represents how the regions are split in the map during the summarization process. We provide an overall bounding box of the region, and we create rectangular tiles of size 500 m x 500 m. Each tile is assigned a gridId and has a bounding box (minimum and maximum latitude and longitude).
2. Interest Data: We assign each interest identified in the system an interestId and an interestName.
3. Trajectory Data: Each trajectory identified in Section 3.1 is assigned a trajectoryId and contains a sequence of trajectory points. Note that trajectory points are spatio-temporal, i.e., they contain the (latitude, longitude) information along with the (summarized) time of travel.
4. Metric Data: The region summarization algorithms defined in Section 3.2 compute various metrics for each region (grid). The metric data contains three fields: gridId, metric, {set of trajectoryIds}. The metric indicates the heat of each grid. The trajectory set indicates

In this visualization framework, we use a matrix of geographic maps with synchronized zoom and pan across views to support simultaneous analysis of different trajectory analytics results. Users can select a matrix layout of n x m maps based on the analytic task at hand. Figure 2 shows a map-matrix with 1 row and 2 columns. The grid framework used to summarize the trajectories is overlaid on top of these maps to generate a heat map. Each map instance has options to independently filter the analytics results and generate a corresponding heat map that is overlaid on top of it. We currently have three filters that invoke different algorithms with different inputs to show the appropriate output. We filter based on:
1. Metric type: This defines which Region Summarization algorithm needs to be run.
2. Interest: Selecting an interest restricts the input to the trajectories and hangouts of the subscribers who have the selected interest.
3. Time: Choosing a time range filters the set of hangouts and trajectories that are provided as input to the algorithm. Currently, we expose six pre-defined time ranges based on the time of day. In the future, this can be generalized to any time interval.
The interactive visualization framework runs in a web browser and is developed using leaflet.js [9] and d3.js [7].

5. CONCLUSION

Figure 3: Hangouts (blue dots) identified for subscribers with different interests: (a) Music, (b) Sports, (c) Finance

People In Motion enables a variety of spatio-temporal analytics and visualizations on big-data platforms using call records collected from cellular network operators. Our system, tool and derived insights enable enterprises in different domains to design novel ways of optimizing their operations and maximizing their revenue. The system also enables cellular network operators to monetize the data that is already collected. The interactive visualization enables business analysts to visually explore the analytics. The analyst can compute, compare and contrast the importance of spatio-temporal regions along different dimensions, such as different interests, different times and different metrics.

6. REFERENCES

Figure 4: Heat-map of different regions at different times of the day for subscribers with music interest (panels: At 10:00 AM, At 6:00 PM, All Day)

It interacts with a J2EE web server that executes the trajectory analytics based on the user interactions in the web browser. The client transmits the selected algorithm and filters to the server. There is also an option (marked "Show Hangouts" in Figure 2) to show the hangouts of the summarized users; we mark the grid-ids of the hangouts that have a Hangout Rank (see Section 3.2) above a threshold. The server selects the subset of hangouts and trajectories that match the above filters and invokes the required Region Summarization algorithm. Further, users can study the trajectories passing through a cell by selecting it.
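The server-side step of selecting inputs by filter and then invoking the chosen algorithm can be sketched in Python as follows. This is an illustration only: the actual server is a J2EE application, and the `Trip` record, the `ALGORITHMS` registry, the metric key `"trajectory_rank"` and the stand-in `visit_counts` function are all hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Trip(NamedTuple):
    trajectory_id: str
    interest: str      # interest of the subscriber who made the trip
    hour: int          # hour of day at which the trip starts (0-23)
    cells: List[str]   # gridIds the trip passes through


def visit_counts(trips: List[Trip]) -> Dict[str, int]:
    """Stand-in summarization: number of selected trips touching each grid."""
    counts: Dict[str, int] = {}
    for trip in trips:
        for cell in set(trip.cells):
            counts[cell] = counts.get(cell, 0) + 1
    return counts


# Metric type -> summarization algorithm (hypothetical registry).
ALGORITHMS: Dict[str, Callable[[List[Trip]], Dict[str, int]]] = {
    "trajectory_rank": visit_counts,
}


def handle_query(trips: List[Trip], metric: str, interest: str,
                 hours: Tuple[int, int]) -> Dict[str, int]:
    """Apply the interest and time filters, then run the selected algorithm."""
    first_hour, last_hour = hours  # half-open range [first_hour, last_hour)
    selected = [
        trip for trip in trips
        if trip.interest == interest and first_hour <= trip.hour < last_hour
    ]
    return ALGORITHMS[metric](selected)
```

Keeping the filters separate from the algorithms means a new metric only needs a new entry in the registry, which mirrors the way the three filters are described as independent choices.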

4.2 Illustration of Interactive Analysis

Each map instance displays the heat-map based on the received metrics for each grid. The heat-map can be overlaid with hangouts and trajectories. Once an analyst clicks on a particular grid, we display the trajectories that pass through that grid. The left pane in Figure 2 shows the trajectories passing through one grid in the interval 10:00 to 14:00 for the subscribers who have a music interest. Also, selecting the "Show Hangouts" option overlays the summarized hangouts of the selected subset of subscribers. The hangouts for subscribers with different interests are shown in Figure 3. Figure 4 shows an instance where the Trajectory Rank of subscribers with a music interest is being explored. It can be seen that music lovers are active at different places at different times. Such information helps a music company to, say, dynamically place its advertisements at different places at different times. It also helps music retailers find the right location for setting up a music store, and provision resources in the store for different traffic at different times of the day.

[1] Apache HBase. online. Available at http://hbase.apache.org/.
[2] N. Andrienko and G. Andrienko. Spatial generalization and aggregation of massive movement data. Visualization and Computer Graphics, IEEE Transactions on, 17(2):205-219, 2011.
[3] R. Becker, R. Cáceres, K. Hanson, S. Isaacman, J. M. Loh, M. Martonosi, J. Rowland, S. Urbanek, A. Varshavsky, and C. Volinsky. Human mobility characterization from cellular network data. Commun. ACM, 56(1):74-82, Jan. 2013.
[4] Bharti Airtel Limited. Quarterly report on the results for the fourth quarter and full year ended March 31, 2013. online, May 2013. Available at http://www.airtel.in/wps/wcm/connect/0ef3180d-7ea7-4be5-a38d-db3cbf103507/Quarterly_report_Q4_May_13.pdf?MOD=AJPERES.
[5] I. Boyandin, E. Bertini, P. Bak, and D. Lalanne. Flowstrates: an approach for visual exploration of temporal origin-destination data. In 13th Eurographics / IEEE - VGTC conference on Visualization, EuroVis'11, pages 971-980, 2011.
[6] L. Chen and M. T. Özsu. Robust and fast similarity search for moving object trajectories. In SIGMOD, pages 491-502, 2005.
[7] d3.js. Data-Driven Documents (d3). online, Nov 2013. Available at http://d3js.org/.
[8] F. Giannotti, M. Nanni, F. Pinelli, and D. Pedreschi. Trajectory pattern mining. In 13th ACM SIGKDD, KDD '07, pages 330-339, New York, NY, USA, 2007. ACM.
[9] leaflet.js. Leaflet. online, Nov 2013. Available at http://leafletjs.com/.
[10] D. Phan, L. Xiao, R. Yeh, P. Hanrahan, and T. Winograd. Flow map layout. In Proceedings of the 2005 IEEE Symposium on Information Visualization, INFOVIS '05, page 29, Washington, DC, USA, 2005. IEEE Computer Society.
[11] Telefonica. Dynamic Insights. online, May 2013. Available at http://dynamicinsights.telefonica.com/.
[12] Vodafone India. Vodafone India Full Year FY13 Results. online, May 2013. Available at https://www.vodafone.in/documents/pdfs/pressreleases/pr_1280.pdf.
