
Resolving MultiParty Conflicts in Social Media

ABSTRACT

Detection of emerging topics is now receiving renewed interest, motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged in social-network posts includes not only text but also images, URLs, and videos. We focus on the emergence of topics signaled by the social aspects of these networks. Specifically, we focus on mentions of users: links between users that are generated dynamically (intentionally or unintentionally) through replies, mentions, and retweets. We propose a probability model of the mentioning behaviour of a social network user and detect the emergence of a new topic from the anomalies measured through the model. Aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics based only on the reply/mention relationships in social-network posts. We demonstrate our technique on several real data sets gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as text-anomaly-based approaches, and in some cases much earlier, when the topic is poorly identified by the textual contents of posts.

INTRODUCTION

The main goal of the service is to make your social life, and that of your friends, more active and stimulating. A social network can help you both maintain existing relationships and establish new ones by reaching out to people you have never met before. Before getting to know a fellow member, you can even see how they are connected to you through your friends network. This software is provided as an online-only resource so that it may be continually extended and updated.

Communication through social networks, such as Facebook and Twitter, is becoming increasingly important in our daily lives. Since the information exchanged over social networks is not only text but also URLs, images, and videos, these networks are challenging test beds for the study of data mining. There is another type of information that is intentionally or unintentionally exchanged over social networks: mentions. By mentions we mean links to other users of the same social network, in the form of message-to, reply-to, retweet-of, or explicit references in the text. One post may contain a number of mentions. Some users may include mentions in their posts rarely; other users may be mentioning their friends all the time. Some users (like celebrities) may receive mentions every minute; for others, being mentioned might be a rare occasion. In this sense, mention is like a language with a number of words equal to the number of users in a social network.

We are interested in detecting emerging topics from social network streams based on monitoring the mentioning behaviour of users. Our basic assumption is that a new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends. Conventional approaches to topic detection have mainly been concerned with the frequencies of (textual) words. A term-frequency-based approach can suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language. Moreover, it cannot be applied when the contents of the messages are mostly non-textual. On the other hand, the “words” formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.

In this project, we propose a probability model that can capture the normal mentioning
behaviour of a user, which consists of both the number of mentions per post and the frequency of
users occurring in the mentions. Then this model is used to measure the anomaly of future user
behaviour. Using the proposed probability model, we can quantitatively measure the novelty or
possible impact of a post reflected in the mentioning behaviour of the user. We aggregate the
anomaly scores obtained in this way over hundreds of users and apply a recently proposed
change-point detection technique based on the Sequentially Discounting Normalized Maximum
Likelihood (SDNML) coding.

This technique can detect a change in the statistical dependence structure in the time series of aggregated anomaly scores, and pinpoint where the topic emergence occurs; see Figure 1. The effectiveness of the proposed approach is demonstrated on two data sets we have collected from Twitter. We show that our approach can detect the emergence of a new topic at least as early as detection based on the best keyword, which is not obvious in advance. Furthermore, we show that in two out of two data sets, the proposed link-anomaly-based method can detect the emergence of the topics earlier than keyword-frequency-based methods, which can be explained by the keyword ambiguity mentioned above.

We assume that the data arrive from a social network service in a sequential manner through some API. For each new post, we use samples within the past time interval T for the corresponding user to train the mention model we propose below. We assign an anomaly score to each post based on the learned probability distribution. The score is then aggregated over users and further fed into a change-point analysis.

PROBABILITY MODEL

We characterize a post in a social network stream by the number of mentions k it contains, and by the set V of names (IDs) of the users mentioned in the post. Formally, we consider the following joint probability distribution:

P(k, V | θ, {θ_v}) = P(k | θ) Π_{v ∈ V} θ_v.    (1)

Here the joint distribution consists of two parts: the probability of the number of mentions k, and the probability of each mention given the number of mentions. The probability of the number of mentions P(k | θ) is defined as a geometric distribution with parameter θ:

P(k | θ) = (1 - θ)^k θ.    (2)

On the other hand, the probability of mentioning the users in V is defined as an independent, identical multinomial distribution with parameters θ_v (Σ_v θ_v = 1). The anomaly score is computed for each user depending on the current post of user u and his/her past behaviour T_u^(t). In order to measure the general trend of user behaviour, we propose to aggregate the anomaly scores obtained for posts x_1, ..., x_n using a discretization of window size τ > 0.
Given an aggregated measure of anomaly, we apply a change-point detection technique based on SDNML coding. This technique detects a change in the statistical dependence structure of a time series by monitoring the compressibility of each new piece of data. The SDNML code length is an approximation of the normalized maximum likelihood (NML) code length that can be computed sequentially and employs discounting in the learning of the AR models. Algorithmically, the change-point detection procedure computes the SDNML code length of the aggregated anomaly-score sequence; for convenience, we denote the aggregate anomaly score as x_j instead of s′_j.
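
As a concrete illustration of how such a per-post score could be computed, the following minimal C# sketch scores a post under the geometric/multinomial mention model. It substitutes additively smoothed maximum-likelihood estimates for the exact predictive distributions used in the model above; the class, method, and parameter names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of a per-user link-anomaly score. It uses additively smoothed
// maximum-likelihood estimates rather than the exact predictive distributions;
// names and the smoothing constant are assumptions for illustration.
public class MentionModel
{
    private readonly List<int> pastMentionCounts = new List<int>();                 // k of each training post
    private readonly Dictionary<string, int> mentioneeCounts = new Dictionary<string, int>();
    private int totalMentions;
    private const double Gamma = 0.5;   // smoothing mass for unseen mentionees (assumption)

    // Add one post from the user's training period T to the model.
    public void AddTrainingPost(IEnumerable<string> mentionedUsers)
    {
        var users = mentionedUsers.ToList();
        pastMentionCounts.Add(users.Count);
        foreach (var v in users)
        {
            mentioneeCounts.TryGetValue(v, out var c);
            mentioneeCounts[v] = c + 1;
            totalMentions++;
        }
    }

    // Anomaly score of a new post: negative log-likelihood under the geometric
    // part (number of mentions) plus the multinomial part (who is mentioned).
    public double Score(IEnumerable<string> mentionedUsers)
    {
        var users = mentionedUsers.ToList();
        int k = users.Count;

        // Geometric part: P(k | theta) = (1 - theta)^k * theta, with theta
        // estimated from the average number of mentions per training post.
        double meanK = pastMentionCounts.Count > 0 ? pastMentionCounts.Average() : 0.0;
        double theta = 1.0 / (1.0 + meanK);
        double logGeom = Math.Log(theta) + (k > 0 ? k * Math.Log(1.0 - theta) : 0.0);

        // Multinomial part: smoothed relative frequency of each mentioned user.
        double logMentionees = 0.0;
        foreach (var v in users)
        {
            mentioneeCounts.TryGetValue(v, out var c);
            logMentionees += Math.Log((c + Gamma) /
                                      (totalMentions + Gamma * (mentioneeCounts.Count + 1)));
        }

        return -(logGeom + logMentionees);
    }
}
```

In the full method, these per-post scores would then be aggregated over all users within each discretization window and fed to the SDNML change-point detector.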

DYNAMIC THRESHOLD OPTIMIZATION (DTO)

We raise an alarm if the change-point score exceeds a threshold, which is determined adaptively using the method of dynamic threshold optimization (DTO). In DTO, we use a one-dimensional histogram to represent the score distribution and learn it in a sequential and discounting way. Then, for a specified value ρ, we determine the threshold to be the largest score value such that the tail probability beyond that value does not exceed ρ. We call ρ the threshold parameter.
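
The following is a minimal C# sketch of the DTO idea, assuming a fixed score range divided into equal-width bins and a simple exponential-discounting update of the histogram; the bin layout, initialization, and update rule are illustrative choices rather than the exact published procedure.

```csharp
using System;

// Minimal sketch of dynamic threshold optimization (DTO); the histogram layout
// and discounting update are assumptions for illustration.
public class DynamicThreshold
{
    private readonly double[] binProb;   // sequentially learned histogram of scores
    private readonly double minScore;
    private readonly double binWidth;
    private readonly double discount;    // discounting rate for the histogram update
    private readonly double rho;         // threshold parameter

    public DynamicThreshold(int numBins, double minScore, double maxScore,
                            double discount, double rho)
    {
        binProb = new double[numBins];
        for (int i = 0; i < numBins; i++) binProb[i] = 1.0 / numBins;  // start uniform
        this.minScore = minScore;
        binWidth = (maxScore - minScore) / numBins;
        this.discount = discount;
        this.rho = rho;
    }

    // Update the histogram with a new change-point score, discounting older data.
    public void Update(double score)
    {
        int hit = BinOf(score);
        for (int i = 0; i < binProb.Length; i++)
            binProb[i] = (1.0 - discount) * binProb[i] + (i == hit ? discount : 0.0);
    }

    // Smallest bin edge whose estimated tail probability does not exceed rho,
    // i.e. an approximation of the (1 - rho) quantile of the score distribution.
    public double Threshold()
    {
        double tail = 0.0;
        for (int i = binProb.Length - 1; i >= 0; i--)
        {
            tail += binProb[i];
            if (tail > rho)
                return minScore + (i + 1) * binWidth;
        }
        return minScore;
    }

    // An alarm is raised when the change-point score exceeds the threshold.
    public bool IsAlarm(double score) { return score > Threshold(); }

    private int BinOf(double score)
    {
        int bin = (int)((score - minScore) / binWidth);
        return Math.Max(0, Math.Min(binProb.Length - 1, bin));
    }
}
```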

We collected four data sets, “Job hunting”, “Youtube”, “NASA”, and “BBC”, from Twitter. Each data set is associated with a list of posts in a service called Togetter; Togetter is a collaborative service where people can tag Twitter posts that are related to each other and organize a list of posts that belong to a certain topic. Our goal is to evaluate whether the proposed approach can detect the emergence of the topics recognized and collected by people. We have selected two data sets, “Job hunting” and “BBC”, each corresponding to a user-organized list in Togetter. For each data set we collected posts from the users that appeared in the corresponding list (the participants). The number of participants was 200 for the “Job hunting” data set and 47 for the “BBC” data set, respectively. Results for the other two data sets, and for the combination of the proposed link-anomaly-based method with Kleinberg’s burst model, can be found in our technical report. We compared our proposed approach with a keyword-based change-point detection method. In the keyword-based method, we looked at the sequence of occurrence frequencies (observed within one minute) of a keyword related to the topic; the keyword was manually selected to best capture the topic.
Then we applied DTO to the sequence of keyword frequencies. In our experience, the sparsity of the keyword frequency seems to be a bad combination with the SDNML method; therefore we did not use SDNML in the keyword-based method. In the experiments we set the smoothing parameter to 15 and the order of the AR model to 30; the parameters in DTO were set as ρ = 0.05, NH = 20, λH = 0.01, rH = 0.005. A drawback of keyword-based dynamic thresholding is that the keyword related to the topic must be known in advance, which is not always the case in practice. The change point detected by the keyword-based methods can be thought of as the time when the topic really emerges. Hence our goal is to detect emerging topics as early as the keyword-based methods.

“JOB HUNTING” DATA SET

This data set is related to a controversial post by a famous person in Japan saying that “the reason students have difficulty finding jobs is that they are stupid”, and the various replies to it. The keyword used in the keyword-based methods was “job hunting.” The first alarm time of the proposed link-anomaly-based change-point analysis was 22:55, whereas that of the keyword-frequency-based counterpart was 22:57; the proposed link-anomaly-based method was thus able to detect the emerging topic as early as the keyword-frequency-based method.

“BBC” DATA SET

This data set is related to angry reactions among Japanese Twitter users to a BBC comedy show that asked “who is the unluckiest person in the world” (the answer being a Japanese man who was hit by atomic bombs in both Hiroshima and Nagasaki and survived). The keyword used in the keyword-based models is “British” (or “Britain”). The first alarm time of the link-anomaly-based method was 19:52, which is earlier than that of the keyword-frequency-based method at 22:41.

The proposed link-anomaly-based method detects an emerging topic significantly earlier than the keyword-frequency-based method on the “BBC” data set, whereas for the “Job hunting” data set the detection times are almost the same. This observation is natural: for the “Job hunting” data set, the keyword seems to have been unambiguously defined from the beginning of the emergence of the topic, whereas for the “BBC” data set the keywords are more ambiguous. For the “BBC” data set, interestingly, the “bursty” areas found by the link-anomaly-based change-point analysis and by the keyword-frequency-based method seem to be disjoint. This is probably because there was an initial stage in which people reacted individually using different words, and a later stage in which the keywords became more unified.

In our approach, an alarm is raised if the change-point score exceeds a dynamically optimized threshold based on the significance-level parameter ρ. We examined a number of threshold parameter values and observed that as ρ increased, the number of false alarms also increased. Meanwhile, even when ρ was small, our approach was still able to detect the emerging topics as early as the keyword-based methods. We set ρ = 0.05 as the default parameter value in our experiments.

LITERATURE SURVEY

Jheser Guzman describes how on-line social networks have become a massive communication and information channel for users world-wide. In particular, the microblogging platform Twitter is characterized by short-text message exchanges at extremely high rates. In this type of scenario, the detection of emerging topics in text streams becomes an important research area, essential for identifying relevant new conversation topics such as breaking news and trends. Although emerging-topic detection in text is a well-established research area, its application to large volumes of streaming text data is quite novel, making scalability, efficiency, and rapidness the key aspects for any emerging-topic detection algorithm in this type of environment. Our research addresses the aforementioned problem by focusing on detecting significant and unusual bursts in keyword arrival rates, or bursty keywords.

We propose a scalable and fast on-line method that uses normalized individual frequency signals per term and a windowing variation technique. This method reports keyword bursts, which can be composed of single or multiple terms, ranked according to their importance. The average complexity of our method is O(n log n), where n is the number of messages in the time window. This complexity allows our approach to scale to large streaming datasets. If bursts are only detected and not ranked, the algorithm remains of linear complexity O(n), making it the fastest in comparison to the current state of the art. We validate our approach by comparing our performance to similar systems using the TREC Tweet 2011 Challenge tweets, obtaining 91% matches with LDA, an off-line gold standard used in similar evaluations. In addition, we study Twitter messages related to the SuperBowl football events in 2011 and 2013.

Social media microblogging platforms, such as Twitter, are characterized by extremely high exchange rates. This quality, as well as its network structure, makes Twitter ideal for the fast dissemination of information, such as breaking news. Furthermore, Twitter is identified as the first media source in which important news is posted and national disasters (e.g. earthquakes, disease outbreaks) are reported. Each message posted on Twitter is called a tweet, and messages are at most 140 characters long. In addition, Twitter users connect to each other using a follower/followee directed graph structure.

Users can support and propagate messages by using a re-tweet feature which boosts the
original message by reposting it to the user’s followers. Given that Twitter has been adopted as a
preferred source for instant news, real-time detection of emerging events and topics has become
one of the priorities in on-line social network analysis. Research on emerging topic detection in
text is now focused on the analysis of streaming data and on-line identification. In particular, the
analysis of microblog text is quite challenging, mostly because of the large volume and high
arrival rate of new data.

In this scenario an important part of the analysis must be performed in real time (or close to real time), requiring efficient and scalable algorithms. This problem has generated increasing interest from the private and academic communities. New models are constantly being developed to better understand human behavior based on social media data, in interdisciplinary fields such as sociology, political science, economics, and business markets. We target the problem of emerging-topic detection in short-text streams by proposing an algorithm for on-line Bursty Keyword Detection (BD). This is generally considered a first important step in the identification of emerging topics in this context. Our approach uses window slicing and window relevance variation rate analysis on keywords.

We validate our methodology on the TREC Tweet 2011 dataset, using it to simulate streaming data. Our experiments indicate that this concept works better at detecting keyword bursts in text streams than other, more complex, state-of-the-art solutions based on queueing theory. We also analyze tweets from the SuperBowl football events in 2011 and 2013 as case studies. We perform a detailed discussion of the behavior of noise in bursty keyword signals, which is constituted mostly of stopwords. We compare our solution to LDA as a ground truth, similarly to prior work. Using LDA in a one-keyword-per-topic mode, we achieved more than 91% topic matches, a clear improvement over similar approaches. Moreover, our implementation is efficient, achieving a complexity of O(n log n) in the average case, where n is the number of messages in the current window.

In detail, the contributions of our work are three-fold:

1. We introduce a scalable and efficient keyword burst detection algorithm for microblog
text streams, based on window slicing and relevance window variations.
2. We present a technique for eliminating non-informative and irrelevant words from
microblog text streams.
3. We present a detailed proof-of-concept system and validate it on a public dataset.

Our work involves research in the areas of event detection and trend analysis for microblog
text streams. In particular our goal is to identify current popular candidate events listed by most
popular bursty terms in the data stream. In relation to this topic, several applications exist for
detecting events like natural disasters and health alerts. For example, epidemics, wildfires,
hurricanes and floods, earthquakes and tornados. Events have been modeled and analyzed over
time using keyword graphs, link-based topic models, and infinite state automatons.

Leskovec et al. perform analyses of memes for news stories over blogs and news data, and on Twitter data. Swan et al. deal with constructing overview timelines of a set of news stories. On-line bursty keyword detection is considered the basis for on-line emerging-topic detection. For topic detection in an off-line fashion, algorithms such as Latent Dirichlet Allocation (LDA) or the Phrase Graph Generation method can be used. Statistical methods and data distribution tests can also be used to detect bursty keywords. Mathioudakis and Koudas introduce Twitter Monitor, a system that performs detection of topic trends (emerging topics) in the Twitter stream. Trends are identified based on individual keyword bursts and are detected in two steps.

This system identifies bursts of keywords by computing the occurrence of individual keywords in tweets. The system then groups keyword trends based on their co-occurrences. To detect bursts of keywords, they introduce the QueueBurst algorithm with the following characteristics: (1) one-step analysis per keyword, (2) real-time analysis (based on the tweet stream), (3) adjustable against “false” explosions, and (4) adjustable against spam. In order to group sets of related bursty keywords, the authors introduce an algorithm named GroupBurst, which evaluates co-occurrences in recent tweets. Our current work focuses on keyword detection of up to three-term keywords, for which we compare against QueueBurst. Twitter Monitor requires an intensive pre-processing step for determining its optimal parameter settings for each keyword and also for global variables. These parameter settings must be computed with a historical dataset.

A different approach is presented by Weng et al. with their system EDCoW (Event Detection with Clustering of Wavelet-based Signals). EDCoW builds individual word signals by applying wavelet analysis to word frequencies. It filters away trivial words by looking at their corresponding signal auto-correlations. The remaining words are clustered to form events with a modularity-based graph-partitioning technique. The wavelet transform is applied to a signal (time series) created using the TF-IDF index. Their approach was implemented in a proof-of-concept system, which they used to analyze online discussions about the Singapore General Election of 2011.

Another relevant study is that of Naaman et al. In this work the authors make two
contributions for interpreting emerging temporal trends. First, they develop a taxonomy of trends
found in data, based on a large Twitter message dataset. Secondly, they identify important
features by which trends can be categorized, as well as the key features for each category. The
dataset used by Naaman et al. consists of over 48 million messages posted on Twitter between
September 2009 and March 2010 by 855,000 unique New York users. For each tweet in this
dataset, they recorded its textual content, the associated timestamp and the user ID.

BURST DETECTION MODEL

We propose a methodology based on time-window analysis. We compute keyword frequencies, normalize them by relevance, and compare them in adjacent time windows. This comparison consists of analyzing variations in term arrival rates and their respective variation percentages per window. A similar notion (the discrete second derivative) has been used in the context of detecting bursts in academic citations. We define Relevance Rates (RR) as the probability of occurrence of a non-stopword term in a window. We use RR to generalize burst detection, making it independent of the arrival rate. Even though the public Twitter API only provides a stream that is said to be less than 10% of the actual tweets posted on Twitter, we believe our method can be easily adapted for the complete data stream using a MapReduce schema.
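
The exact formulas are not reproduced above; one plausible formalization consistent with this description is

RR(w, t) = n_{w,t} / Σ_{w′ non-stopword} n_{w′,t},    Δ(w, t) = (RR(w, t) - RR(w, t-1)) / RR(w, t-1),

where n_{w,t} is the number of occurrences of term w in window t; a keyword is reported as bursty when its relevance variation Δ(w, t) is large and positive.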

Arrival rates vary periodically (in a non-bursty way) during the day, depending on the hour, time zone, user region, global events, and language. Bursty keywords are ranked according to their relevance variation rate. Our method avoids the use of statistical distribution analysis methods for keyword frequencies; the main reason is that this approach, commonly used in the state of the art, increases the complexity of the process. We show that a simple relevance-variation concept is sufficient for our purposes if we use a good stop-word filter and noise minimization analysis.

To study the efficiency of our algorithm, which we name Window Variation Keyword Burst Detection, we implemented a proof-of-concept system. The system involves five processes. These modules are independent of each other and have been structured for processing in threads. Stream Data Pre-Processing: in this stage, data is pre-processed by extracting keywords from each message, so that later on our burst-detection algorithm can analyze them.

This stage is composed of the following three modules: (1) Stream Listener, (2) Tweet Filter and (3) Tweet Packing. 1) Stream Listener Module: this module receives streaming data in the form of Twitter messages, which can come directly from the Twitter API or some other source. Messages are received in JSON format. This data is parsed and encapsulated. After the encapsulation of each message, it is enqueued in memory for the next module in the pipeline. It should be noted that message encapsulation is prone to delays caused by the Internet bandwidth connection and Twitter’s information delivery rate, which can cause data loss.

2) Tweet Filter Module: this module discards messages that are not written in languages accepted by our system. We perform language classification using a Naive Bayes classifier. This module also standardizes tweets according to the following rules. Treatment of special characters and separation marks: replacing special characters and removing accents, apostrophes, etc. Standardization of data: upper- and lower-case conversion and replacement of special characters.
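
As an illustration of these standardization rules, the following minimal C# sketch lower-cases a tweet, strips accents, and replaces special characters; the exact character classes kept by the original system are not specified, so the regular expression used here is an assumption.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Minimal sketch of tweet standardization: lower-casing, accent removal,
// and replacement of special characters and separation marks.
public static class TweetNormalizer
{
    public static string Normalize(string text)
    {
        // Lower-case conversion.
        string lowered = text.ToLowerInvariant();

        // Remove accents by decomposing characters and dropping combining marks.
        string decomposed = lowered.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder();
        foreach (char c in decomposed)
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                builder.Append(c);

        // Replace remaining special characters and separation marks with spaces,
        // keeping letters, digits, hashtags and @-mentions (an assumed rule set).
        return Regex.Replace(builder.ToString(), @"[^\p{L}\p{Nd}#@ ]+", " ").Trim();
    }
}
```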

After normalization and language detection, the tweet is enqueued into queue Q1 for posterior analysis. 3) Tweet Packing Module: filtered and standardized tweets in queue Q1 are grouped into a common set determined by creation timestamp, as shown in Figure 2. This set of tweets, which we refer to as a Bag of Tweets, represents an individual time window. It is important to note that the arrival of tweets maintains a chronological order. In the case that an old or delayed tweet appears, it is included in the current window. Each of these windows is sent to the following stage for processing.

BURSTY KEYWORD DETECTION

This process involves two modules. The second module must wait for the first module to finish processing a window in order to process an entire window at a time (serial mode). The original algorithm listing describes this process, with the second module starting at line 24. 1) Window Processing Module: each keyword, composed of single or adjacent word n-grams, is mapped into a hash-table data structure. This structure manages keywords in addition to the information of their two adjacent windows and their rates. We consider as n-grams the n ordered correlative words. The hash table allows access to keyword information in constant time in most cases, O(1), and in the worst case with complexity O(n) when collisions occur. This data structure controls the complexity of the algorithm with optimal insertions and search, O(1).
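
Since the algorithm listing referred to above is not reproduced here, the following minimal C# sketch illustrates the window-processing and ranking idea, assuming unigram keywords only, a caller-supplied stop-word list, and add-one smoothing of the previous-window rate; the names and details are ours, not the original system's.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of window-variation keyword burst detection over a hash table
// that keeps, for each keyword, its counts in the two most recent windows.
public class WindowBurstDetector
{
    private class KeywordStats { public int Previous; public int Current; }

    private readonly Dictionary<string, KeywordStats> table = new Dictionary<string, KeywordStats>();
    private readonly HashSet<string> stopWords;
    private int previousTotal = 1, currentTotal;

    public WindowBurstDetector(IEnumerable<string> stopWords)
    {
        this.stopWords = new HashSet<string>(stopWords);
    }

    // Called once per time window with the tokens of every tweet in that window.
    public void ProcessWindow(IEnumerable<IEnumerable<string>> tweets)
    {
        // Shift the current window into the "previous" slot.
        foreach (var stats in table.Values) { stats.Previous = stats.Current; stats.Current = 0; }
        previousTotal = Math.Max(1, currentTotal);
        currentTotal = 0;

        foreach (var tweet in tweets)
            foreach (var token in tweet)
            {
                if (stopWords.Contains(token)) continue;
                if (!table.TryGetValue(token, out var stats))
                    table[token] = stats = new KeywordStats();
                stats.Current++;
                currentTotal++;
            }
    }

    // Keywords whose relevance rate grew between the two most recent windows,
    // ranked by their relative variation.
    public IEnumerable<KeyValuePair<string, double>> BurstyKeywords()
    {
        return table
            .Select(kv =>
            {
                double prevRate = (kv.Value.Previous + 1.0) / previousTotal;   // +1 avoids division by zero
                double currRate = (double)kv.Value.Current / Math.Max(1, currentTotal);
                return new KeyValuePair<string, double>(kv.Key, (currRate - prevRate) / prevRate);
            })
            .Where(kv => kv.Value > 0)
            .OrderByDescending(kv => kv.Value);
    }
}
```

Sorting the candidate keywords is the dominant cost, in line with the ranking complexity discussed later; returning them unranked would keep the pass linear.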

2) Keyword Ranker Module: bursty keywords are included implicitly in the hash table. Therefore, we extract bursty keywords by discarding those that do not have a positive relevance variation. We discard non-bursty keywords using this criterion. TwitterMonitor (TM) is one of the earliest works in the field of detecting emerging topics on Twitter. Its core algorithm, named QueueBurst, uses five parameters and concepts from M/M/1 queueing theory. We used the recommended parameter settings from the technical paper provided directly to us by the authors, setting the two tolerance parameters to 10.

Weng et al. developed EDCoW (Event Detection with Clustering of Wavelet-based Signals), a system that uses a queueing technique for bursty keyword detection and wavelet techniques to detect trends on Twitter. As mentioned earlier, LDA (with Gibbs sampling for parameter estimation) is a reasonable gold standard for our evaluation. This follows the approach used in the EDCoW article to compare TM and EDCoW with our proposal. Given that no implementation details are available for EDCoW, we can only perform a similar experiment and compare the results to TM and our method.

Next, we analyze the complexity of TM, LDA and BD. We cannot analyze the complexity of EDCoW. It should be noted that LDA is an off-line method; it therefore competes with an advantage over the on-line methods TM and BD (and EDCoW), given that it has a complete view of the information, as opposed to the limited data that on-line methods use. TM is O(wni), where n is the number of tweets to be processed, w is the average number of words per message, and i is the number of iterations for generating the exponential random variables for the M/M/1 queue. The parameter i cannot be determined exactly because of randomness, leaving the complexity of the algorithm at O(ni).

This algorithm analyzes each tweet in one pass, but this property creates delays because it cannot be parallelized. LDA is a statistical method that determines the k topics that represent a document, where k is the number of top events in that document. Assume we have N documents and a vocabulary of size V. The complexity of mean-field variational inference for LDA is O(NkV). In this scenario, we assume a document to be the concatenation of all of the tweets in the same time window.

It should be noted that LDA does not constitute a perfect gold standard for burst detection in streaming data, but rather an approximation: some topics might not be bursty, and some bursts do not correspond to topics. For our Window Variation Keyword Burst Detection (BD), a threshold T can be included in order to truncate the number of bursty keywords returned. This, together with an optimal selection algorithm, reduces the complexity of the ranking algorithm to O(Tn). Because T is constant, the algorithm remains linear, O(n). This technique does not require parameters to be reset at run time. Otherwise, if we decide not to use a threshold, the complete ranking takes O(n log n).

We compare our BD algorithm to the topics returned by LDA on a one-keyword-per-topic basis. We do this to determine the percentage of topic matches on the TREC Tweet 2011 dataset. We also compared Twitter Monitor with LDA using the same dataset and assumptions, and discuss results obtained similarly to EDCoW. Detection and tracking of topics have been studied extensively in the area of topic detection and tracking (TDT). In this context, the main task is either to classify a new document into one of the known topics (tracking) or to detect that it belongs to none of the known categories. Subsequently, the temporal structure of topics has been modeled and analyzed through dynamic model selection, temporal text mining, and factorial hidden Markov models.

Another line of research is concerned with formalizing the notion of “bursts” in a stream of documents. In his seminal paper, Kleinberg modeled bursts using a time-varying Poisson process with a hidden discrete process that controls the firing rate. Recently, He and Parker developed a physics-inspired model of bursts based on the change in the momentum of topics. All the above-mentioned studies make use of the textual content of the documents, but not their social content. The social content (links) has been utilized in the study of citation networks. However, citation networks are often analyzed in a stationary setting. The novelty of the current work lies in focusing on the social content of the documents (posts) and in combining this with a change-point analysis.

Yuki Takeichi, Kazutoshi Sasahara, Reiji Suzuki and Takaya Arita describe how Twitter often behaves like a “social sensor” in which users actively sense real-world events and spontaneously mention these events in cyberspace. Here, we study the temporal dynamics and structural properties of Twitter as a social sensor during major sporting events. By examining Japanese professional baseball games, we found that Twitter as a social sensor can immediately show reactions to positive and negative events through a burst of tweets, but only positive events induce a subsequent burst of retweets. In addition, retweet networks during the baseball games exhibit clear polarization into user clusters depending on baseball teams, as well as a scale-free in-degree distribution. These empirical findings provide mechanistic insights into the emergence and evolution of social sensors.

Online social media sites, such as Twitter and Facebook, have become increasingly popular, to the point that they are now essential tools in everyday life, facilitating massive, near-real-time, and networked social interactions in cyberspace. In addition, these media can have an impact not only in cyberspace but also in the physical world. For example, it was reported that Twitter helped Arab Spring activists to spread and share information, playing a key role in the ensuing revolutionary social movements. Thus, online social media can work as interfaces between cyberspace and real-world environments, connecting people and information in some nontrivial ways.

Consequently, online social media form a hybrid system of users and the web, which may behave like a single organism that evolves in time, providing a new research subject for the study of artificial life. Many social media studies have already been conducted, though not in the context of artificial life. Focusing particularly on Twitter, previous studies have reported its unique characteristics, such as the structural properties of user networks (Kwak et al., 2010; Bollen et al., 2011a), the nature of social interactions (Grabowicz et al., 2012; Conover et al., 2012) and information diffusion (Romero et al., 2011; Weng et al., 2012), collective attention (Lehmann et al., 2012; Sasahara et al., 2013) and collective mood (Golder and Macy, 2011; Dodds et al., 2011), and users’ dynamics related to particular real-life events (Sakaki et al., 2010; Borge-Holthoefer et al., 2011; Gonzalez-Bailon et al., 2011). Twitter data have also been used to detect emerging topics (Takahashi et al., 2014) and to predict stock markets (Bollen et al., 2011b).

This paper focuses on Twitter as a “social sensor,” a new type of emergent collective behavior in the social age. Twitter allows users to read, post, and forward short text messages of 140 characters or less, called “tweets,” in online user networks. As shown in Fig. 1, Twitter users actively sense real-world events and spontaneously make utterances about these events by posting tweets, which immediately spread over online user networks. In addition, such information cascades can be amplified by chains of “retweets” (forwarded tweets) from other users, called followers. This is not a passive one-shot process, but rather an active process that recurs and constantly evolves due to changes both in the physical world and in cyberspace. Consequently, the Twitter system can behave like a social sensor, exhibiting collective dynamics and a distinct structure linked with target events. This is true in principle, and the previous studies mentioned above have revealed some aspects of social sensors. However, little is known about the dynamic nature of social sensors, which cannot be explained solely by “bursts of tweets.”

We therefore conducted a case study of Twitter as a dynamic social sensor during major sporting events, Japan’s 2013 Nippon Professional Baseball (NPB) games, by focusing on co-occurrences of tweets and retweets. These target events were suitable for our primary study because it is known that major sporting events are the subject of strong collective attention of viewers, which gives rise to a large volume of tweets and retweets (Bagrow et al., 2011; Sasahara et al., 2013). Our study provides key insights into how and when the collective dynamics of social media users emerge and function as a social sensor.

CONSTRUCTION OF RETWEET NETWORKS

The structures of social sensors linked with major sporting events are examined using complex networks. Complex networks consist of a large number of nodes with sparse connections between them, and they are used to describe, analyze, and model real-world networks ranging from biological systems to social systems to artificial systems (Newman, 2010). Using official retweets (not manual retweets, i.e. posts with “RT” typed by hand), we constructed “retweet networks,” in which each node represents a user and a directed edge is attached from user B to user A if user B retweets a tweet posted originally by user A. Note that if another user C retweets user B’s retweet, a directed edge is connected from user C to user A (i.e., the tweet origin). This is due to the official retweet specification of the Twitter system. Thus, influential users (also known as “hub” users), whose tweets are preferentially retweeted, are represented as nodes with many incoming edges. The resulting retweet networks are visualized with a force-directed layout algorithm called OpenOrd, using Gephi. The size of a node is proportional to the logarithm of its in-degree. In addition, cumulative in-degree distributions are calculated from the retweet networks to assess their structural properties.
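
A minimal sketch of this construction in C# follows, assuming a simple list of (retweeting user, original author) records; the field and method names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of retweet-network construction: each official retweet adds a
// directed edge from the retweeting user to the original author, so an author's
// in-degree counts how often he or she was retweeted.
public class Retweet
{
    public string RetweetingUser;   // the user who pressed "retweet"
    public string OriginalAuthor;   // the author of the original tweet (the edge target)
}

public static class RetweetNetwork
{
    // In-degree of every original author over a stream of retweets.
    public static Dictionary<string, int> InDegrees(IEnumerable<Retweet> retweets)
    {
        var inDegree = new Dictionary<string, int>();
        foreach (var rt in retweets)
        {
            // Official retweets always reference the tweet origin, so a retweet
            // of a retweet still adds an edge pointing at the original author.
            inDegree.TryGetValue(rt.OriginalAuthor, out var d);
            inDegree[rt.OriginalAuthor] = d + 1;
        }
        return inDegree;
    }

    // Cumulative in-degree distribution P(K >= k), used to inspect whether the
    // distribution looks scale-free.
    public static IEnumerable<KeyValuePair<int, double>> CumulativeDistribution(
        IDictionary<string, int> inDegree)
    {
        int n = inDegree.Count;
        var degrees = inDegree.Values.ToArray();
        return degrees.Distinct().OrderBy(k => k)
                      .Select(k => new KeyValuePair<int, double>(
                          k, (double)degrees.Count(x => x >= k) / n));
    }
}
```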

Amogh Mahapatra, Nisheeth Srivastava and Jaideep Srivastava describe using side information to further inform anomaly detection algorithms of the semantic context of the text data they are analyzing, thereby considering both divergence from the statistical pattern seen in particular datasets and divergence from more general semantic expectations. Computational experiments show that our algorithm performs as expected on data that reflect real-world events with contextual ambiguity, while replicating conventional clustering on data that are either too specialized or too generic for contextual information to be actionable. These results suggest that our algorithm could potentially reduce false-positive rates in existing anomaly detection systems. Anomaly detection techniques in the textual domain aim at uncovering novel and interesting topics and words in a document corpus. The data is usually represented in the form of a document-to-word co-occurrence matrix, which makes it very sparse and high-dimensional. Hence, one of the major challenges that most learning techniques have to deal with when working on textual data is the curse of dimensionality.

Manevitz et al. have used neural networks to separate positive documents from negative ones. They use a feed-forward neural network which is first trained on a set of labeled positive examples; in the test phase the network then filters the positive documents from the negative ones. In another work, also based on supervised learning, Manevitz et al. have used one-class SVMs to separate outliers from the normal set of documents. They show that this works better than techniques based on naive Bayes and nearest-neighbor algorithms, and performs as well as neural-network-based techniques. The above approach might not be very useful in an unsupervised setting, whereas our approach can work in both supervised and unsupervised settings. This problem has been studied in unsupervised settings as well. Srivastava et al. have used various clustering techniques such as k-means, Sammon's mapping, von Mises–Fisher clustering and spectral clustering to cluster and visualize textual data. Sammon's mapping gives the best set of well-separated clusters, followed by von Mises–Fisher clustering, spectral clustering and k-means.

Their technique requires manual examination of the textual clusters and does not describe a method for ordering the clusters. Agovic et al. have used topic models to detect anomalous topics in aviation logs. Guthrie et al. have performed anomaly detection in texts under unsupervised conditions by trying to detect deviations in author, genre, topic and tone. They define about 200 stylistic features to characterize various kinds of writing and then use statistical measures to find deviations. All the techniques described above rely entirely on the content of the dataset being evaluated to make their predictions. To the best of our knowledge, ours is the first attempt that makes use of external contextual information to find anomalies in text logs. Since the topics detected in statistical content analysis strip lexical sequence information away from text samples, our efforts to reintroduce context information must look to techniques of automatic meaning or semantic-sense determination. Natural language processing techniques have focused deeply on identifying the lexical structure of text samples. However, research into computationally identifying the semantic relationships between words automatically is far sparser, since the problem is much harder.

In particular, while lexical structure can be inferred purely statistically given a dictionary of known senses of word meanings in particular sequences, such a task becomes almost quixotically difficult when it comes to trying to identify semantic relations between individual words. However, a significant number of researchers have tried to define measures of similarity for words based on, e.g., information-theoretic and corpus-overlap criteria. Cilibrasi and Vitanyi have shown, very promisingly, that it is possible to infer word similarities even from an uncurated corpus of semantic data, viz. the Web accessed through Google search queries. This observation has been subsequently developed and refined, and presents possibilities for significant improvements to current corpus-based methods of meaning construction.
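
For reference, the similarity measure proposed in that work is the normalized Google distance,

NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}),

where f(x) is the number of pages containing the term x, f(x, y) the number containing both terms, and N the (estimated) total number of pages indexed.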

Semantic similarity measures have been used in the past to accomplish several semantic tasks, such as ranking of tags within an image and approximate word-ontology matching, and thus hold the promise of further simplifying cognitively challenging tasks in the future. Methods: we would like to detect the occurrence of abnormal/deviant topics and themes in large-scale textual logs (such as emails and blogs) that could help us infer anomalous shifts in behavioral patterns. Our system therefore uses two inputs: a text log whose content is of immediate interest, and an external corpus of text and semantic relationships from which to derive contextual information. The output is the set of final topics considered anomalous by our system. It is possible to subsequently evaluate the topics discovered in our dataset manually and flag a certain number of topics as anomalous.

The efficacy of our approach is measured as the ratio of the number of anomalies correctly detected by the system to the number of anomalies flagged through manual inspection. Consider all words to live on a large semantic graph, where nodes represent word labels and edges are continuous quantities that represent degrees of semantic relatedness to other words. Some subset of these words populates the documents that we evaluate in clustering-based text analysis schemes. A document can then be described by an indicator vector that selects a subset of the word labels to populate an individual document. Traditional document clustering creates clusters of typical word co-occurrences across multiple documents, which are called topics. However, such an approach throws away all information embedded in the semantic network structure. The goal of a semantically sensitive anomaly detection technique is to integrate information potentially available in the edges of the semantic graph to improve predictive performance.
Rather than attempting to include semantic information within a traditional clustering scheme by flattening the relatedness graph into additional components of the data feature vector, we introduce context as a post-processing filter for regular topic-modeling-based anomaly detection techniques. By doing so, we simplify the construction of our algorithm and also ensure that the results of both the unfiltered and filtered versions of the algorithm are clearly visible, so that the relative value added by introducing semantic information is apparent. Since our approach involves combining two separate modes of analyzing data, one content-driven and one context-driven, we now describe both these modalities in turn, and subsequently our technique for combining them.

SYSTEM ANALYSIS

FEASIBILITY STUDY

A feasibility study is a process which defines exactly what a project is and what strategic issues need to be considered to assess its feasibility, or likelihood of succeeding. Feasibility studies are useful both when starting a new business and when identifying a new opportunity for an existing business. Ideally, the feasibility study process involves making rational decisions about a number of enduring characteristics of a project, including:

 Technical feasibility: do we have the technology? If not, can we get it?
 Operational feasibility: do we have the resources to build the system? Will the system be acceptable? Will people use it?
 Economic feasibility: are the benefits greater than the costs?

TECHNICAL FEASIBILITY

Technical feasibility is concerned with the existing computer system (hardware, software, etc.) and to what extent it can support the proposed addition. For example, if particular software will work only on a computer with a higher configuration, additional hardware is required. This involves financial considerations, and if the budget is a serious constraint, then the proposal will be considered not feasible.

OPERATIONAL FEASIBILITY

Operational feasibility is a measure of how well a proposed system solves the problems and takes advantage of the opportunities identified during scope definition, and how well it satisfies the requirements identified in the requirements-analysis phase of system development.

ECONOMIC FEASIBILITY

Economic analysis is the most frequently used method for evaluating the effectiveness of
a candidate system. More commonly known as cost/benefit analysis, the procedure is to
determine the benefits and savings that are expected from a candidate system and compare them
with costs. If benefits outweigh costs, then the decision is made to design and implement the
system.

EXISTING SYSTEM

 Conventional approaches for topic detection have mainly been concerned with the
frequencies of (textual) words.
 We are interested in detecting emerging topics from social network streams based on
monitoring the mentioning behaviour of users.
 The information exchanged over social networks such as Facebook and Twitter is not only text but also URLs, images, and videos, which makes these networks challenging test beds for the study of data mining.
 Our basic assumption is that a new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends.
 A term-frequency-based approach can suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language.
 It cannot be applied when the contents of the messages are mostly non-textual information.
 The “words” formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.

ISSUES OF EXISTING SYSTEM

1. A term-frequency-based approach can suffer from the ambiguity caused by synonyms or homonyms.
2. It may also require complicated preprocessing (e.g., segmentation) depending on the target language.
3. It cannot be applied when the contents of the messages are mostly non-textual information.
4. The “words” formed by mentions are unique, require little preprocessing to obtain, and are available regardless of the nature of the contents.
5. The keyword must be manually selected to best capture the topic.
6. The sparsity of the keyword frequency seems to be a bad combination with the SDNML method.
7. A drawback of keyword-based dynamic thresholding is that the keyword related to the topic must be known in advance.

PROPOSED SYSTEM

 Using this project, we can quantitatively measure the novelty or possible impact of a post
reflected in the mentioning behaviour of the user.
 This project is used to measure the anomaly of future user behaviour.
 It proposes a probability model that can capture the normal mentioning behaviour of a user, which consists of both the number of mentions per post and the frequency of users occurring in the mentions.
 It aggregates the anomaly scores obtained in this way over hundreds of users and applies a recently proposed change-point detection technique based on Sequentially Discounting Normalized Maximum Likelihood (SDNML) coding.
 This technique can detect a change in the statistical dependence structure in the time series of aggregated anomaly scores, and pinpoint where the topic emergence occurs.
 The effectiveness of the proposed approach is demonstrated on two data sets we have collected from Twitter.
 It shows that this approach can detect the emergence of a new topic at least as early as detection based on the best keyword, which is not obvious in advance.
 The proposed link-anomaly-based method can detect the emergence of topics earlier than keyword-frequency-based methods.

ADVANTAGES OF PROPOSED SYSTEM

1. The proposed method does not rely on the textual contents of social network posts.
2. It is robust to rephrasing.
3. It uses a probability model that captures both the number of mentions per post and the frequency of mentionees.
4. It can be applied to cases where the information exchanged consists not only of text but also of images, URLs, and videos.
5. The proposed link-anomaly-based approach detected the emergence of the topics much earlier than the keyword-based approach.
6. The proposed link-anomaly-based approach performed even better than the keyword-based approach on the “NASA” and “BBC” datasets.

SYSTEM CONFIGURATION

HARDWARE SPECIFICATION:

Processor : Pentium-IV

Speed : 1.1GHz

RAM : 512MB

Hard Disk : 40GB

General : Keyboard, Monitor, Mouse

SOFTWARE SPECIFICATION:

Operating System : Windows XP

Front End : ASP.Net

Programming interface : C#

Back End : SQL Server

SOFTWARE DESCRIPTION

ABOUT THE SOFTWARE

FEATURES OF VISUAL BASIC .NET

THE .NET FRAMEWORK


The .NET Framework is a new computing platform that simplifies application
development in the highly distributed environment of the Internet.

OBJECTIVES OF .NET FRAMEWORK:

1. To provide a consistent object-oriented programming environment, whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.

2. To provide a code-execution environment that minimizes software deployment and versioning conflicts and guarantees safe execution of code.

3. To eliminate performance problems.

There are different types of applications, such as Windows-based applications and Web-based applications. To enable communication in a distributed environment, the .NET Framework ensures that code targeting it can integrate with any other code.

COMPONENTS OF .NET FRAMEWORK

1. THE COMMON LANGUAGE RUNTIME (CLR):

The common language runtime is the foundation of the .NET Framework. It manages code at execution time, providing important services such as memory management, thread management, and remoting, and also ensures security and robustness. The concept of code management is a fundamental principle of the runtime. Code that targets the runtime is known as managed code, while code that does not target the runtime is known as unmanaged code.

THE .NET FRAMEWORK CLASS LIBRARY:

It is a comprehensive, object-oriented collection of reusable types used to develop
applications ranging from traditional command-line or graphical user interface (GUI)
applications to applications based on the latest innovations provided by ASP.NET, such as Web
Forms and XML Web services.

The .NET Framework can be hosted by unmanaged components that load the common
language runtime into their processes and initiate the execution of managed code, thereby
creating a software environment that can exploit both managed and unmanaged features. The
.NET Framework not only provides several runtime hosts, but also supports the development of
third-party runtime hosts.

Internet Explorer is an example of an unmanaged application that hosts the runtime (in the form of a MIME type extension). Hosting the runtime in Internet Explorer enables you to embed managed components or Windows Forms controls in HTML documents.

FEATURES OF THE COMMON LANGUAGE RUNTIME:

The common language runtime manages memory, thread execution, code execution, code-safety verification, compilation, and other system services. Its main features are:

Security.

Robustness.

Productivity.

Performance.

SECURITY

The runtime enforces code access security. The security features of the runtime thus enable legitimate Internet-deployed software to be exceptionally feature-rich. With regard to security, managed components are awarded varying degrees of trust, depending on a number of factors that include their origin; this determines whether they are allowed to perform file-access operations, registry-access operations, or other sensitive functions.

ROBUSTNESS:

The runtime also enforces code robustness by implementing a strict type- and code-verification infrastructure called the common type system (CTS). The CTS ensures that all managed code is self-describing. The managed environment of the runtime eliminates many common software issues.

PRODUCTIVITY:

The runtime also accelerates developer productivity. For example, programmers can
write applications in their development language of choice, yet take full advantage of the
runtime, the class library, and components written in other languages by other developers.

PERFORMANCE:

The runtime is designed to enhance performance. Although the common language runtime provides many standard runtime services, managed code is never interpreted. A feature called just-in-time (JIT) compiling enables all managed code to run in the native machine language of the system on which it is executing. Finally, the runtime can be hosted by high-performance, server-side applications, such as Microsoft® SQL Server™ and Internet Information Services (IIS).

ASP.NET

ASP.NET is the next version of Active Server Pages (ASP); it is a unified Web development platform that provides the services necessary for developers to build enterprise-class Web applications. While ASP.NET is largely syntax-compatible with ASP, it also provides a new programming model and infrastructure for more secure, scalable, and stable applications.

ASP.NET is a compiled, .NET-based environment; we can author applications in any .NET-compatible language, including Visual Basic .NET, C#, and JScript .NET. Additionally, the entire .NET Framework is available to any ASP.NET application. Developers can easily access the benefits of these technologies, which include the managed common language runtime environment (CLR), type safety, inheritance, and so on.

ASP.NET has been designed to work seamlessly with WYSIWYG HTML editors and
other programming tools, including Microsoft Visual Studio .NET. Not only does this make Web
development easier, but it also provides all the benefits that these tools have to offer, including a
GUI that developers can use to drop server controls onto a Web page and fully integrated
debugging support.

Developers can choose from the following two features when creating an ASP.NET application, Web Forms and Web services, or combine these in any way they see fit. Each is supported by the same infrastructure that allows you to use authentication schemes, cache frequently used data, or customize your application's configuration, to name only a few possibilities.

Web Forms allows us to build powerful forms-based Web pages. When building these pages, we can use ASP.NET server controls to create common UI elements and program them for common tasks. These controls allow us to rapidly build a Web Form out of reusable built-in or custom components, simplifying the code of a page.
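
As a small hedged example of programming a server control, the following sketch creates the controls in code so that it is self-contained; in a typical project the controls would instead be declared in the .aspx markup and only the event handler written by hand. The page and control names are hypothetical.

```csharp
using System;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;

// Hypothetical Web Forms page built entirely in code for illustration.
public class GreetingPage : Page
{
    private readonly TextBox nameBox = new TextBox();
    private readonly Button submitButton = new Button();
    private readonly Label greetingLabel = new Label();

    protected override void OnInit(EventArgs e)
    {
        base.OnInit(e);

        var form = new HtmlForm();           // server controls must live inside a server-side form
        Controls.Add(form);
        form.Controls.Add(nameBox);
        form.Controls.Add(submitButton);
        form.Controls.Add(greetingLabel);

        submitButton.Text = "Submit";
        submitButton.Click += SubmitButton_Click;   // server-side event, raised on postback
    }

    private void SubmitButton_Click(object sender, EventArgs e)
    {
        greetingLabel.Text = "Hello, " + nameBox.Text;
    }
}
```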

An XML Web service provides the means to access server functionality remotely. Using
Web services, businesses can expose programmatic interfaces to their data or business logic,
which in turn can be obtained and manipulated by client and server applications. XML Web
services enable the exchange of data in client-server or server-server scenarios, using standards
like HTTP and XML messaging to move data across firewalls. XML Web services are not tied to
a particular component technology or object-calling convention. As a result, programs written in
any language, using any component model, and running on any operating system can access
XML Web services.
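
A minimal sketch of such an XML Web service, with an illustrative namespace and method, might look as follows.

```csharp
using System.Web.Services;

// Minimal sketch of an ASMX XML Web service; the namespace URI, class and
// method are illustrative only.
[WebService(Namespace = "http://example.org/emergingtopics/")]
public class TopicService : WebService
{
    // Exposed over HTTP/SOAP; callable from any client that can exchange XML.
    [WebMethod]
    public string GetLatestTopic()
    {
        return "job hunting";
    }
}
```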

Each of these models can take full advantage of all ASP.NET features, as well as the
power of the .NET Framework and .NET Framework common language runtime.

Accessing databases from ASP.NET applications is an often-used technique for displaying data to Web site visitors. ASP.NET makes it easier than ever to access databases for this purpose. It also allows us to manage the database from our code.

ASP.NET provides a simple model that enables Web developers to write logic that runs at the application level. Developers can write this code in the Global.asax text file or in a compiled class deployed as an assembly. This logic can include application-level events, and developers can easily extend this model to suit the needs of their Web application.

ASP.NET provides easy-to-use application and session-state facilities that are familiar
to ASP developers and are readily compatible with all other .NET Framework APIs.
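
A brief sketch of these state facilities, with hypothetical keys and values, is shown below.

```csharp
using System;
using System.Web.UI;

// Minimal sketch of ASP.NET session and application state inside a page.
public class StatePage : Page
{
    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);

        // Per-user data survives across requests within the same session.
        Session["LastKeyword"] = "job hunting";
        string lastKeyword = (string)Session["LastKeyword"];

        // Application state is shared by every user of the application.
        Application.Lock();
        int visits = Application["VisitCount"] is int n ? n : 0;
        Application["VisitCount"] = visits + 1;
        Application.UnLock();

        Response.Write("Keyword: " + lastKeyword + ", visits: " + Application["VisitCount"]);
    }
}
```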

ASP.NET offers the IHttpHandler and IHttpModule interfaces. Implementing the IHttpHandler interface gives you a means of interacting with the low-level request and response services of the IIS Web server and provides functionality much like ISAPI extensions, but with a simpler programming model. Implementing the IHttpModule interface allows you to include custom events that participate in every request made to your application.
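
For illustration, a minimal custom handler implementing IHttpHandler could look like the following sketch; registering it in web.config is not shown.

```csharp
using System.Web;

// Minimal sketch of a custom IHttpHandler; the class name and response are
// illustrative, and the handler must also be mapped to a URL or extension.
public class PingHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "text/plain";
        context.Response.Write("pong");
    }

    // The same handler instance can be reused for multiple requests.
    public bool IsReusable
    {
        get { return true; }
    }
}
```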

ASP.NET takes advantage of performance enhancements found in the .NET Framework and common language runtime. Additionally, it has been designed to offer significant performance improvements over ASP and other Web development platforms. All ASP.NET code is compiled, rather than interpreted, which allows early binding, strong typing, and just-in-time (JIT) compilation to native code, to name only a few of its benefits. ASP.NET is also easily factorable, meaning that developers can remove modules (a session module, for instance) that are not relevant to the application they are developing.

ASP.NET provides extensive caching services (both built-in services and caching APIs).
ASP.NET also ships with performance counters that developers and system administrators can
monitor to test new applications and gather metrics on existing applications.

Writing custom debug statements to your Web page can help immensely in
troubleshooting your application's code. However, it can cause embarrassment if it is not
removed. The problem is that removing the debug statements from your pages when your
application is ready to be ported to a production server can require significant effort.

ASP.NET offers the TraceContext class, which allows us to write custom debug
statements to our pages as we develop them. They appear only when tracing has been enabled
for a page or an entire application. Enabling tracing also appends details about each request to the
page or, if you so specify, to a custom trace viewer that is stored in the root directory of your
application.
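For example, a page with tracing enabled (Trace="true" in its @ Page directive) might emit
diagnostic messages as sketched below in C#; the page class, category, and messages are
illustrative only.

using System;
using System.Web.UI;

public partial class OrdersPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // These statements are ignored unless tracing is enabled for the page or application.
        Trace.Write("Orders", "Page_Load entered");
        Trace.Warn("Orders", "No order ID was supplied");   // warnings are highlighted in the trace output
    }
}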

The .NET Framework and ASP.NET provide default authorization and authentication
schemes for Web applications. We can easily remove, add to, or replace these schemes,
depending upon the needs of our application.

DATA ACCESS WITH ADO.NET

As you develop applications using ADO.NET, you will have different requirements for
working with data. You might never need to directly edit an XML file containing data - but it is
very useful to understand the data architecture in ADO.NET.

ADO.NET offers several advantages over previous versions of ADO:

Interoperability

Maintainability

Programmability

Performance

Scalability

INTEROPERABILITY:

ADO.NET applications can take advantage of the flexibility and broad acceptance of
XML. Because XML is the format for transmitting datasets across the network, any component
that can read the XML format can process data. The receiving component need not be an
ADO.NET component.

The transmitting component can simply transmit the dataset to its destination without
regard to how the receiving component is implemented. The destination component might be a
Visual Studio application or any other application implemented with any tool whatsoever.

The only requirement is that the receiving component be able to read XML; indeed, XML
was designed with exactly this kind of interoperability in mind.

MAINTAINABILITY:

In the life of a deployed system, modest changes are possible, but substantial
architectural changes are rarely attempted because they are so difficult. As the performance load
on a deployed application server grows, system resources can become scarce and response time
or throughput can suffer. Faced with this problem, software architects can choose to divide the
server's business-logic processing and user-interface processing onto separate tiers on separate
machines.

In effect, the application server tier is replaced with two tiers, alleviating the shortage of
system resources. If the original application is implemented in ADO.NET using datasets, this
transformation is made easier.

PROGRAMMABILITY:

ADO.NET data components in Visual Studio encapsulate data access functionality in
various ways that help you program more quickly and with fewer mistakes.

PERFORMANCE:

ADO.NET datasets offer performance advantages over ADO disconnected record sets. In
ADO.NET data-type conversion is not necessary.

SCALABILITY:

ADO.NET accommodates scalability by encouraging programmers to conserve limited
resources. An ADO.NET application employs disconnected access to data; it does not retain
database locks or active database connections for long durations.

VISUAL STUDIO .NET


Visual Studio .NET is a complete set of development tools for building ASP Web
applications, XML Web services, desktop applications, and mobile applications. In addition to
building high-performing desktop applications, you can use Visual Studio's powerful
component-based development tools and other technologies to simplify team-based design,
development, and deployment of enterprise solutions.

Visual Basic .NET, Visual C++ .NET, and Visual C# .NET all use the same integrated
development environment (IDE), which allows them to share tools and facilitates the creation
of mixed-language solutions. In addition, these languages leverage the functionality of the .NET
Framework and simplify the development of ASP Web applications and XML Web services.

Visual Studio supports the .NET Framework, which provides a common language
runtime and unified programming classes; ASP.NET uses these components to create ASP Web
applications and XML Web services. Visual Studio also includes the MSDN Library, which
contains all the documentation for these development tools.

XML WEB SERVICES:


XML Web services are applications that exchange the requested data as XML over
HTTP. XML Web services are not tied to a particular component technology or object-calling
convention and can be accessed by any language, component model, or operating system. In
Visual Studio .NET, you can quickly create and include XML Web services using Visual Basic,
Visual C#, JScript, Managed Extensions for C++, or ATL Server.

XML SUPPORT:

Extensible Markup Language (XML) provides a method for describing structured data.
XML is a subset of SGML that is optimized for delivery over the Web. The World Wide Web
Consortium (W3C) defines XML standards so that structured data will be uniform and
independent of applications. Visual Studio .NET fully supports XML, providing the XML
Designer to make it easier to edit XML and create XML schemas.

VISUAL BASIC .NET

Visual Basic .NET, the latest version of Visual Basic, includes many new features.
Earlier versions of Visual Basic supported interfaces but not implementation inheritance.

Visual Basic .NET supports implementation inheritance, interfaces, and overloading. In
addition, Visual Basic .NET supports multithreading.

COMMON LANGUAGE SPECIFICATION (CLS):

Visual Basic .NET is also compliant with the CLS (Common Language Specification) and
supports structured exception handling. The CLS is a set of rules and constructs that are supported by
the CLR (Common Language Runtime). The CLR is the runtime environment provided by the .NET
Framework; it manages the execution of code and also makes the development process easier
by providing services.

Visual Basic .NET is a CLS-compliant language. Any objects, classes, or components
created in Visual Basic .NET can be used in any other CLS-compliant language. In addition, we
can use objects, classes, and components created in other CLS-compliant languages in Visual
Basic .NET. The use of the CLS ensures complete interoperability among applications, regardless of
the languages used to create them.

IMPLEMENTATION INHERITANCE:

Visual Basic .NET supports implementation inheritance. This means that, while creating
applications in Visual Basic .NET, we can derive from another class, known as the base
class; the derived class inherits all the methods and properties of the base class. In the derived
class, we can either use the existing code of the base class or override it.
Therefore, with the help of implementation inheritance, code can be reused.
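As a rough illustration of implementation inheritance (shown in C# for consistency with the
C# material later in this chapter; the class names are hypothetical):

// Base class providing a default implementation.
public class Account
{
    public decimal Balance { get; protected set; }

    public virtual void Deposit(decimal amount)
    {
        Balance += amount;
    }
}

// Derived class: inherits Balance and Deposit, and overrides Deposit while reusing it.
public class SavingsAccount : Account
{
    public override void Deposit(decimal amount)
    {
        base.Deposit(amount);          // reuse the existing code of the base class
        Balance += amount * 0.01m;     // then extend it with new behaviour
    }
}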

CONSTRUCTORS AND DESTRUCTORS:

Constructors are used to initialize objects, whereas destructors are used to destroy them.
In other words, destructors are used to release the resources allocated to the object. In Visual
Basic .NET the Sub Finalize procedure is available. The Sub Finalize procedure is used to complete
the tasks that must be performed when an object is destroyed, and it is
called automatically when an object is destroyed. In addition, the Sub Finalize procedure can be
called only from the class it belongs to or from derived classes.
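A small sketch of the idea, written in C#, where the finalizer plays the role of VB.NET's
Sub Finalize; the class and the resource it pretends to release are hypothetical:

using System;

public class TempFile
{
    private readonly string path;

    // Constructor: initializes the object when it is created.
    public TempFile(string path)
    {
        this.path = path;
    }

    // Finalizer: called automatically when the garbage collector destroys the object.
    ~TempFile()
    {
        Console.WriteLine("Releasing resources for " + path);
    }
}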

GARBAGE COLLECTION:

Garbage Collection is another new feature in Visual Basic.NET. The .NET Framework
monitors allocated resources, such as objects and variables. In addition, the .NET Framework
automatically releases memory for reuse by destroying objects that are no longer in use. In
Visual Basic.NET, the garbage collector checks for the objects that are not currently in use by
applications. When the garbage collector comes across an object that is marked for garbage
collection, it releases the memory occupied by the object.

OVERLOADING:

Overloading is another feature in Visual Basic.NET. Overloading enables us to define
multiple procedures with the same name, where each procedure has a different set of arguments.
Besides using overloading for procedures, we can use it for constructors and properties in a class.
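A brief C# sketch of overloading (the Logger class and its methods are illustrative only):

public class Logger
{
    // Three procedures share the same name but take different sets of arguments.
    public void Log(string message)
    {
        Log(message, "INFO");
    }

    public void Log(string message, string level)
    {
        System.Console.WriteLine("[" + level + "] " + message);
    }

    public void Log(System.Exception error)
    {
        Log(error.Message, "ERROR");
    }
}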

MULTITHREADING:

Visual Basic .NET also supports multithreading. An application that supports
multithreading can handle multiple tasks simultaneously. We can use multithreading to decrease
the time taken by an application to respond to user interaction; to do so, we must ensure that a
separate thread in the application handles user interaction.
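A minimal C# sketch of the same idea, starting a background thread so that the main thread
remains free; the simulated work is only a placeholder:

using System;
using System.Threading;

class MultithreadingDemo
{
    static void Main()
    {
        var worker = new Thread(() =>
        {
            Thread.Sleep(2000);                         // stands in for a slow task
            Console.WriteLine("Background work finished.");
        });
        worker.IsBackground = true;
        worker.Start();

        Console.WriteLine("Main thread remains responsive to user interaction.");
        worker.Join();                                  // wait before exiting (demo only)
    }
}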

STRUCTURED EXCEPTION HANDLING:

Visual Basic .NET supports structured exception handling, which enables us to detect and
handle errors at runtime. In Visual Basic .NET, we use Try…Catch…Finally statements to create
exception handlers. Using Try…Catch…Finally statements, we can create robust and effective
exception handlers that make our application more reliable.
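The equivalent construct is sketched below in C# (try/catch/finally); in VB.NET the same
structure is written with the Try…Catch…Finally keywords. The example and its exception are
illustrative only.

using System;

class SafeDivision
{
    static void Main()
    {
        try
        {
            int[] values = { 10, 0 };
            Console.WriteLine(values[0] / values[1]);   // throws DivideByZeroException
        }
        catch (DivideByZeroException ex)
        {
            Console.WriteLine("Recovered from: " + ex.Message);
        }
        finally
        {
            Console.WriteLine("Cleanup runs whether or not an exception occurred.");
        }
    }
}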

BACK END DESIGN


Features of SQL-SERVER
The OLAP Services feature available in SQL Server version 7.0 is now called SQL
Server 2000 Analysis Services. The term OLAP Services has been replaced with the term
Analysis Services. Analysis Services also includes a new data mining component. The
Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server
2000 Meta Data Services. References to the component now use the term Meta Data Services.
The term repository is used only in reference to the repository engine within Meta Data Services.

A SQL Server database consists of the following types of objects:

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

TABLE

A table is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two views:

1. Design View

2. Datasheet View

Design View

To build or modify the structure of a table, we work in the table's Design view. We can specify
what kind of data the table will hold.

Datasheet View

To add, edit, or analyse the data itself, we work in the table's Datasheet view.

QUERY:

A query is a question that is asked of the data. Access gathers the data that answers the
question from one or more tables. The data that makes up the answer is either a dynaset (if you can
edit it) or a snapshot (which cannot be edited). Each time we run a query, we get the latest
information in the dynaset. Access either displays the dynaset or snapshot for us to view, or performs
an action on it, such as deleting or updating.

FORMS:

A form is used to view and edit information in the database record by record. A form
displays only the information we want to see, in the way we want to see it. Forms use familiar
controls such as textboxes and checkboxes, which makes viewing and entering data easy.

Views of Form:

We can work with forms in several views; primarily there are two:

1. Design View

2. Form View

Design View

To build or modify the structure of a form, we work in the form's Design view. We can add
controls to the form that are bound to fields in a table or query, including textboxes, option buttons,
graphs, and pictures.

Form View

Form view displays the complete design of the form.

REPORT:

A report is used to view and print information from the database. A report can group
records into many levels and compute totals and averages by checking values from many records
at once. A report is also attractive and distinctive because we have control over its size and
appearance.

MACRO:

A macro is a set of actions. Each action in a macro does something, such as opening a
form or printing a report. We write macros to automate common tasks, to make work easier and
save time.

C# (pronounced C Sharp) is a multi-paradigm programming language that encompasses
functional, imperative, generic, object-oriented (class-based), and component-oriented
programming disciplines. It was developed by Microsoft as part of the .NET initiative and later
approved as a standard by ECMA (ECMA-334) and ISO (ISO/IEC 23270). C# is one of the 44
programming languages supported by the .NET Framework's Common Language Runtime.

C# is intended to be a simple, modern, general-purpose, object-oriented
programming language. Anders Hejlsberg, the designer of Delphi, leads the team which is
developing C#. It has an object-oriented syntax based on C++ and is heavily influenced by other
programming languages such as Delphi and Java. It was initially named Cool, which stood for
"C like Object Oriented Language". However, in July 2000, when Microsoft made the project
public, the name of the programming language was given as C#. The most recent version of the
language is C# 3.0 which was released in conjunction with the .NET Framework 3.5 in 2007.
The next proposed version, C# 4.0, is in development.

History:-

In 1996, Sun Microsystems released the Java programming language with Microsoft soon
purchasing a license to implement it in their operating system. Java was originally meant to be a
platform independent language, but Microsoft, in their implementation, broke their license
agreement and made a few changes that would essentially inhibit Java's platform-independent
capabilities. Sun filed a lawsuit and Microsoft settled, deciding to create their own version of a
partially compiled, partially interpreted object-oriented programming language with syntax
closely related to that of C++.

During the development of .NET, the class libraries were originally written in a
language/compiler called Simple Managed C (SMC). In January 1999, Anders Hejlsberg formed
a team to build a new language at the time called Cool, which stood for "C like Object Oriented
Language". Microsoft had considered keeping the name "Cool" as the final name of the language,
but chose not to do so for trademark reasons. By the time the .NET project was publicly
announced at the July 2000 Professional Developers Conference, the language had been renamed
C#, and the class libraries and ASP.NET runtime had been ported to C#.

C#'s principal designer and lead architect at Microsoft is Anders Hejlsberg, who was
previously involved with the design of Visual J++, Borland Delphi, and Turbo Pascal. In
interviews and technical papers he has stated that flaws in most major programming languages
(e.g. C++, Java, Delphi, and Smalltalk) drove the fundamentals of the Common Language
Runtime (CLR), which, in turn, drove the design of the C# programming language itself. Some
argue that C# shares roots in other languages.

Features of C#:-

By design, C# is the programming language that most directly reflects the underlying
Common Language Infrastructure (CLI). Most of C#'s intrinsic types correspond to value-types
implemented by the CLI framework. However, the C# language specification does not state the
code generation requirements of the compiler: that is, it does not state that a C# compiler must
target a Common Language Runtime (CLR), or generate Common Intermediate Language (CIL),
or generate any other specific format. Theoretically, a C# compiler could generate machine code
like traditional compilers of C++ or FORTRAN; in practice, all existing C# implementations
target CIL.

Some notable C# distinguishing features are:

 There are no global variables or functions. All methods and members must be declared
within classes. It is possible, however, to use static methods/variables within public
classes instead of global variables/functions.
 Local variables cannot shadow variables of the enclosing block, unlike C and C++.
Variable shadowing is often considered confusing by C++ texts.
 C# supports a strict Boolean data type, bool. Statements that take conditions, such as
while and if, require an expression of a Boolean type. While C++ also has a Boolean
type, it can be freely converted to and from integers, and expressions such as if(a) require
only that a is convertible to bool, allowing a to be an int, or a pointer. C# disallows this
"integer meaning true or false" approach on the grounds that forcing programmers to use
expressions that return exactly bool can prevent certain types of programming mistakes
such as if (a = b) (use of = instead of ==).
 In C#, memory address pointers can only be used within blocks specifically marked as
unsafe, and programs with unsafe code need appropriate permissions to run. Most object
access is done through safe object references, which are always either pointing to a valid,
existing object, or have the well-defined null value; a reference to a garbage-collected
object, or to a random block of memory, is impossible to obtain. An unsafe pointer can
point to an instance of a value-type, array, string, or a block of memory allocated on a
stack. Code that is not marked as unsafe can still store and manipulate pointers through
the System.IntPtr type, but cannot dereference them.
 Managed memory cannot be explicitly freed, but is automatically garbage collected.
Garbage collection addresses memory leaks. C# also provides direct support for
deterministic finalization with the using statement (supporting the Resource Acquisition
Is Initialization idiom).
 Multiple inheritance is not supported, although a class can implement any number of
interfaces. This was a design decision by the language's lead architect to avoid
complication, avoid dependency hell and simplify architectural requirements throughout
CLI.
 C# is more type safe than C++. The only implicit conversions by default are those which
are considered safe, such as widening of integers and conversion from a derived type to a
base type. This is enforced at compile-time, during JIT, and, in some cases, at runtime.
There are no implicit conversions between Booleans and integers, nor between
enumeration members and integers (except for literal 0, which can be implicitly
converted to any enumerated type). Any user-defined conversion must be explicitly
marked as explicit or implicit, unlike C++ copy constructors (which are implicit by
default) and conversion operators (which are always implicit).
 Enumeration members are placed in their own scope.

 C# provides syntactic sugar for a common pattern of a pair of methods, accessor (getter)
and mutator (setter), encapsulating operations on a single attribute of a class, in the form of
properties (see the sketch after this list).
 Full type reflection and discovery is available.
 C# currently (as of 3 June 2008) has 77 reserved words.
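As noted in the list above, a property wraps an accessor/mutator pair around a single
attribute; a small C# sketch follows (the class and attribute are hypothetical):

public class Temperature
{
    private double celsius;                   // the single attribute being encapsulated

    // Accessor (get) and mutator (set) wrapped up as a property.
    public double Celsius
    {
        get { return celsius; }
        set { celsius = value; }
    }

    // A read-only property computed from the stored value.
    public double Fahrenheit
    {
        get { return celsius * 9.0 / 5.0 + 32.0; }
    }
}

Callers write t.Celsius = 20 as if assigning a field, but the assignment goes through the mutator.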

Common Type system (CTS)

C# has a unified type system. This unified type system is called Common Type System
(CTS).

A unified type system implies that all types, including primitives such as integers, are
subclasses of the System.Object class. For example, every type inherits a ToString() method.
For performance reasons, primitive types (and value types in general) are internally
allocated on the stack.

Categories of data types

CTS separate data types into two categories:

 Value types
 Reference types

Value types are plain aggregations of data. Instances of value types do not have
referential identity or referential comparison semantics - equality and inequality comparisons for
value types compare the actual data values within the instances, unless the corresponding
operators are overloaded. Value types are derived from System.ValueType, always have a
default value, and can always be created and copied. Some other limitations on value types are
that they cannot derive from each other (but can implement interfaces) and cannot have a default
(parameterless) constructor. Examples of value types are some primitive types, such as int (a
signed 32-bit integer), float (a 32-bit IEEE floating-point number), char (a 16-bit Unicode code
point), and System.DateTime (identifies a specific point in time with millisecond precision).

In contrast, reference types have the notion of referential identity - each instance of a
reference type is inherently distinct from every other instance, even if the data within both
instances is the same. This is reflected in default equality and inequality comparisons for
reference types, which test for referential rather than structural equality, unless the corresponding
operators are overloaded (as is the case for System.String). In general, it is not
always possible to create an instance of a reference type, nor to copy an existing instance, or
perform a value comparison on two existing instances, though specific reference types can
provide such services by exposing a public constructor or implementing a corresponding
interface (such as ICloneable or IComparable). Examples of reference types are object (the
ultimate base class for all other C# classes), System.String (a string of Unicode characters), and
System.Array (the base class for all C# arrays).

Both type categories are extensible with user-defined types.
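A small C# sketch of the different default comparison semantics (the two point types are
hypothetical):

using System;

struct PointValue          // value type: compared by its data
{
    public int X, Y;
}

class PointReference       // reference type: compared by identity by default
{
    public int X, Y;
}

class TypeSystemDemo
{
    static void Main()
    {
        var a = new PointValue { X = 1, Y = 2 };
        var b = new PointValue { X = 1, Y = 2 };
        Console.WriteLine(a.Equals(b));             // True: the data values match

        var c = new PointReference { X = 1, Y = 2 };
        var d = new PointReference { X = 1, Y = 2 };
        Console.WriteLine(c.Equals(d));             // False: two distinct instances
        Console.WriteLine(ReferenceEquals(c, c));   // True: the same instance
    }
}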

Boxing and unboxing

Boxing is the operation of converting a value of a value type into a value of a
corresponding reference type. Unboxing is the reverse operation: it extracts the value type from the
object and requires an explicit cast.
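A minimal C# sketch of boxing and unboxing:

using System;

class BoxingDemo
{
    static void Main()
    {
        int number = 42;

        object boxed = number;       // boxing: the int value is wrapped in an object on the heap
        int unboxed = (int)boxed;    // unboxing: an explicit cast copies the value back out

        Console.WriteLine(boxed);    // 42
        Console.WriteLine(unboxed);  // 42
    }
}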

ACCESS PRIVILEGES

IIS provides several new access levels. The following values can set the type of access
allowed to specific directories:

o Read
o Write
o Script
o Execute
o Log Access
o Directory Browsing.
IIS WEBSITE ADMINISTRATION

Administering websites can be time consuming and costly, especially for people who
manage large Internet Service Provider (ISP) installations. To save time and money, ISPs support
only large company web sites at the expense of personal websites. But is there a cost-effective
way to support both? The answer is yes, if you can automate administrative tasks and let users
administer their own sites from remote computers. This solution reduces the amount of time and
money it takes to manually administer a large installation, without reducing the number of web
sites supported.

Microsoft Internet Information Server (IIS) version 4.0 offers two technologies to do this:

1. Windows Scripting Host (WSH)

2. IIS Admin Objects built on top of the Active Directory Service Interface (ADSI)

With these technologies working together behind the scenes, administrators can administer
sites from the command line of a central computer and can group frequently used commands in
batch files.

Then all users need to do is run the batch files to add new accounts, change permissions, add a
virtual server to a site, and perform many other tasks.

.NET FRAMEWORK

The .NET Framework is many things, but it is worthwhile listing its most important
aspects. In short, the .NET Framework is:

A Platform designed from the start for writing Internet-aware and Internet-enabled
applications that embrace and adopt open standards such as XML, HTTP, and SOAP.

A Platform that provides a number of very rich and powerful application development
technologies, such as Windows Forms, used to build classic GUI applications, and of course
ASP.NET, used to build web applications.

A Platform with an extensive class library that provides extensive support for data access
(relational and XML), directory services, message queuing, and much more.

A platform that has a base class library that contains hundreds of classes for performing
common tasks such as file manipulation, registry access, security, threading, and searching of
text using regular expressions.

A platform that doesn’t forget its origins, and has great interoperability support for
existing components that you or third parties have written, using COM or standard DLLs.

A Platform with an independent code execution and management environment called the
Common Language Runtime (CLR), which ensures code is safe to run, and provides an abstraction
layer on top of the operating system, meaning that elements of the .NET Framework can run on
many operating systems and devices.

ASP.NET

ASP.NET is part of the whole .NET Framework, built on top of the Common Language
Runtime (also known as the CLR) - a rich and flexible architecture, designed not just to cater for
the needs of developers today, but to allow for the long future we have ahead of us. What you
might not realize is that, unlike previous updates of ASP, ASP.NET is very much more than just
an upgrade of existing technology – it is the gateway to a whole new era of web development.

ASP.NET is a feature of the following web server releases:

o Microsoft IIS 5.0 on WINDOWS 2000 Server


o Microsoft IIS 5.1 on WINDOWS XP
ASP.NET has been designed to try and maintain syntax and run-time compatibility with
existing ASP pages wherever possible. The motivation behind this is to allow existing ASP
pages to be initially migrated to ASP.NET by simply renaming the file to have an extension of
.aspx. For the most part this goal has been achieved, although there are typically some basic
code changes that have to be made, since VBScript is no longer supported, and the VB language
itself has changed.

Some of the key goals of ASP.NET were to

o Remove the dependency on script engines, enabling pages to be type safe and
compiled.
o Reduce the amount of code required to develop web applications.
o Make ASP.NET well factored, allowing customers to add in their own custom
functionality, and extend/replace built-in ASP.NET functionality.
o Make ASP.NET a logical evolution of ASP, where existing ASP investment and
therefore code can be reused with little, if any, change.
o Realize that bugs are a fact of life, so ASP.NET should be as fault tolerant as
possible.
Benefits of ASP.NET

The .NET Framework includes a new data access technology named ADO.NET, an
evolutionary improvement to ADO. Though the new data access technology is evolutionary, the
classes that make up ADO.NET bear little resemblance to the ADO objects with which you
might be familiar. Some fairly significant changes must be made to existing ADO applications to
convert them to ADO.NET. The changes don't have to be made immediately to existing ADO
applications to run under ASP.NET, however.

ADO will function under ASP.NET. However, the work necessary to convert ADO
applications to ADO.NET is worthwhile. For disconnected applications, ADO.NET should offer
performance advantages over ADO disconnected record sets. ADO requires that transmitting and
receiving components be COM objects. ADO.NET transmits data in a standard XML-format file
so that COM marshaling or data type conversions are not required.

ASP.NET has several advantages over ASP.

The following are some of the benefits of ASP.NET:

o Make code cleaner.


o Improve deployment, scalability, and reliability.
o Provide better support for different browsers and devices.
o Enable a new breed of web applications.

VB Script

VB Script, sometimes known as Visual Basic Scripting Edition, is Microsoft's answer to
Java Script. Just as Java Script's syntax is loosely based on Java, VB Script's syntax is loosely
based on Microsoft Visual Basic, a popular programming language for Windows machines.

Like Java Script, VB Script is a simple scripting language, and we can include VB Script
statements within an HTML document. To begin a VB Script, we use the
<script LANGUAGE="VBScript"> tag.

VB Script can do many of the same things as Java Script and it even looks similar in
some cases.

It has two main advantages:

For those who already know Visual Basic, it may be easier to learn than Java Script.

It is closely integrated with ActiveX, Microsoft’s standard for Web-embedded


applications.

VB Script's main disadvantage is that only Microsoft Internet Explorer supports it, whereas both
Netscape and Internet Explorer support Java Script. Java Script is also a much more
popular language, and we can see it in use all over the Web.

ActiveX

ActiveX is a specification developed by Microsoft that allows ordinary Windows programs
to be run within a Web page. ActiveX programs can be written in languages such as Visual Basic,
and they are compiled before being placed on the Web server.

ActiveX applications, called controls, are downloaded and executed by the Web browser,
like Java applets. Unlike Java applets, controls can be installed permanently when they are
downloaded, eliminating the need to download them again. ActiveX's main advantage is that it
can do just about anything.

This can also be a disadvantage:

Several enterprising programmers have already used ActiveX to bring exciting new
capabilities to Web pages, such as "the Web page that turns off your computer" and "the Web
page that formats your disk drive".

Fortunately, ActiveX includes a signature feature that identifies the source of the control
and prevents controls from being modified. While this won't prevent a control from damaging your
system, we can specify which sources of controls we trust.

ActiveX has two main disadvantages

It isn't as easy to program as a scripting language or Java.

ActiveX is proprietary.

It works only in Microsoft Internet Explorer and only on Windows platforms.

ADO.NET

ADO.NET provides consistent access to data sources such as Microsoft SQL Server, as
well as data sources exposed via OLE DB and XML. Data-sharing consumer applications can
use ADO.NET to connect to these data sources and retrieve, manipulate, and update data.

ADO.NET cleanly factors data access from data manipulation into discrete components
that can be used separately or in tandem. ADO.NET includes .NET data providers for connecting
to a database, executing commands, and retrieving results. Those results are either processed
directly, or placed in an ADO.NET DataSet object in order to be exposed to the user in an ad-hoc
manner, combined with data from multiple sources, or remoted between tiers. The ADO.NET
DataSet object can also be used independently of a .NET data provider to manage data local to
the application or sourced from XML.
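The following C# sketch shows the disconnected pattern with a .NET data provider and a
DataSet; the connection string, table, and column names are placeholders rather than this
project's actual schema.

using System;
using System.Data;
using System.Data.SqlClient;

class DataSetDemo
{
    static void Main()
    {
        // Placeholder connection string for a local SQL Server database.
        string connectionString = "Server=.;Database=SocialDb;Integrated Security=true";

        using (var connection = new SqlConnection(connectionString))
        {
            var adapter = new SqlDataAdapter("SELECT PostID, Topic FROM Posts", connection);
            var posts = new DataSet();

            adapter.Fill(posts, "Posts");   // opens and closes the connection itself

            // The data is now held locally; no connection or lock is kept open.
            foreach (DataRow row in posts.Tables["Posts"].Rows)
            {
                Console.WriteLine("{0}: {1}", row["PostID"], row["Topic"]);
            }
        }
    }
}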

Why ADO.NET?

As application development has evolved, new applications have become loosely coupled
based on the Web application model. More and more of today's applications use XML to encode
data to be passed over network connections. Web applications use HTTP as the fabric for
communication between tiers, and therefore must explicitly handle maintaining state between
requests. This new model is very different from the connected, tightly coupled style of
programming that characterized the client/server era, where a connection was held open for the
duration of the program's lifetime and no special handling of state was required.

In designing tools and technologies to meet the needs of today's developer, Microsoft
recognized that an entirely new programming model for data access was needed, one that is built
upon the .NET Framework. Building on the .NET Framework ensured that the data access
technology would be uniform—components would share a common type system, design
patterns, and naming conventions.

ADO.NET was designed to meet the needs of this new programming model:
disconnected data architecture, tight integration with XML, common data representation with the
ability to combine data from multiple and varied data sources, and optimized facilities for
interacting with a database, all native to the .NET Framework.

Leverage Current ADO Knowledge

Microsoft's design for ADO.NET addresses many of the requirements of today's
application development model. At the same time, the programming model stays as similar as
possible to ADO, so current ADO developers do not have to start from scratch in learning a
brand new data access technology. ADO.NET is an intrinsic part of the .NET Framework
without seeming completely foreign to the ADO programmer.

ADO.NET coexists with ADO. While most new .NET applications will be written using
ADO.NET, ADO remains available to the .NET programmer through .NET COM
interoperability services. For more information about the similarities and the differences between
ADO.NET and ADO, see the .NET Framework documentation. ADO.NET provides first-class
support for the disconnected, n-tier
programming environment for which many new applications are written. The concept of working
with a disconnected set of data has become a focal point in the programming model. The
ADO.NET solution for n-tier programming is the Dataset.

XML Support

XML and data access are intimately tied—XML is all about encoding data, and data
access is increasingly becoming all about XML. The .NET Framework does not just support
Web standards—it is built entirely on top of them.

SYSTEM DESIGN

SYSTEM ARCHITECTURE

DFD:

Level 0: (Figure: context-level data flow diagram of the system.)
Level 1: (Figure.) The Level 1 DFD shows the processes 1.1 Make Friend, 2.1 Post Topic,
3.1 Post Comments on Topic, 4.1 Detect Anomaly, 5.1 Aggregate Anomaly, and 6.1 Emerging
Topic Process, together with the external entities User and Admin and the data stores User
Records, Friend Records, Post Records, Comment Records, and Anomaly Records.
Level 2: (Figure.) The Level 2 DFD refines the detection flow into 1.2 Post Topic, 2.2
Training, 3.2 Compute Individual Anomaly Score, and 4.2 Aggregate Mention Anomalies,
drawing on the Post, Comment, Mention, and Anomaly records; the User posts topics and the
Admin obtains the detected emerging topic.
ER DIAGRAM:

(Figure.) The ER diagram contains the entities User (username, password, name, email),
Friend (friend ID, friend name, username), Post (post ID, post content, post date, username),
Comment (comment ID, post content, post date, username), and Admin (username, password),
related through Make Friend, Write Post, Comment, Detect Anomaly, and Find Emerging Topic.
UML DIAGRAMS

Use case diagram:

(Figure.) The User actor takes part in the Post Topic, Post Comment, Add Friend, and View
Friend Lists use cases. The Admin actor takes part in Find Emerging Topic, Anomaly Detection,
Aggregate Anomaly Scores, Burst Detection, and Manage User, with «uses» relationships linking
the detection use cases.
Class diagram:

(Figure.) The class diagram comprises the classes User (username, password, name, email,
mobile; searchFriend, viewFriendRequest, postTopic, postComments), Friend (friendid, username,
password, name, email, mobile; acceptInvitation, postComment), Training (trainingID, username,
topic, status; trainingTopic, getTraining), Comment (commentID, username, comment, topic, date;
setComments, getComments), Post (postID, topic, username, date, count; setPosts, getPosts),
Anomaly (anomalyID, trainingID, topic, status, anomalyCount; detectAnomaly, countAnomaly),
and Admin (username, password; viewAnomalyScore, viewEmergingTopic).
Sequence diagram:

(Figure.) The sequence diagram involves the objects User, Friend, Post, Comment, Training,
Admin, and Anomaly: the user sends and accepts friend requests, posts topics, and posts
comments; the admin enters the posted topics, requests the trained topic, detects the anomaly,
aggregates the anomaly scores, and receives the emerging topic.
Activity diagram:

(Figure.) The user invites friends, accepts friend requests, posts topics, and writes comments.
After training, the user posts are combined and an anomaly is detected for each user's posts. The
admin aggregates the anomalies, calculates the anomaly score, enters a date for burst detection,
finds the emerging topic, and displays it.
MODULES

 Event Detection Streams
 Event Description Module
 User Profiling in Social Media
 Kleinberg's Burst-Detection Method
 Data Set

1. Event Detection Streams

Microblogs have become an important source for reporting real-world events. A real-world
occurrence reported in microblogs is also called a social event. Social events may hold critical
materials that describe the situations during a crisis. In real applications, such as crisis
management and decision making, monitoring the critical events over social streams will enable
watch officers to analyze a whole situation that is a composite event, and make the right decision
based on the detailed contexts such as what is happening, where an event is happening, and who
are involved. Although there has been significant research effort on detecting a target event in
social networks based on a single source, in crisis, we often want to analyze the composite events
contributed by different social users. So far, the problem of integrating ambiguous views from
different users is not well investigated. To address this issue, we propose a novel framework to
detect composite social events over streams, which fully exploits the information of social data
over multiple dimensions. Specifically, we first propose a graphical model called location-time
constrained topic (LTT) to capture the content, time, and location of social messages. Using
LTT, a social message is represented as a probability distribution over a set of topics by
inference, and the similarity between two messages is measured by the distance between their
distributions.

Then, the events are identified by conducting efficient similarity joins over social media
streams. To accelerate the similarity join, we also propose a variable dimensional extendible
hash over social streams. We have conducted extensive experiments to prove the high
effectiveness and efficiency of the proposed approach.

2. Event description module

The rise of Social Media services in the last years has created huge streams of information
that can be very valuable in a variety of scenarios. What precisely these scenarios are and how
the data streams can efficiently be analyzed for each scenario is still largely unclear at this point
in time and has therefore created significant interest in industry and academia. In this paper, we
describe a novel algorithm for geo-spatial event detection on Social Media streams. We monitor
all posts on Twitter issued in a given geographic region and identify places that show a high
amount of activity. In a second processing step, we analyze the resulting spatio-temporal clusters
of posts with a Machine Learning component in order to detect whether they constitute real-
world events or not. We show that this can be done with high precision and recall. The detected
events are finally displayed to a user on a map, at the location where they happen and while they
happen.

3. User profiling in social media

A user profile is a visual display of personal data associated with a specific user, or a
customized desktop environment. A profile therefore refers to the explicit digital representation
of a person's identity. A user profile can also be considered the computer representation of a
user. A profile can be used to store a description of the characteristics of a person. This
information can be exploited by systems that take into account the person's characteristics and
preferences. Profiling is the process of constructing a profile by extracting information from
a set of data. User profiles can be found in operating systems, computer programs, recommender
systems, and dynamic websites (such as online social networking sites or bulletin boards).

A social networking service is a platform to build social networks or social
relations among people who share interests, activities, backgrounds or real-life connections. A
social network service consists of a representation of each user (often a profile), his or her social
links, and a variety of additional services. Social networks are web-based services that allow
individuals to create a public profile, to create a list of users with whom to share connections,
and view and cross the connections within the system. Most social network services are web-
based and provide means for users to interact over the Internet, such as e-mail and instant
messaging. Social network sites are varied and they incorporate new information and
communication tools such as mobile connectivity, photo/video/sharing and blogging. Online
community services are sometimes considered as a social network service, though in a broader
sense, social network service usually means an individual-centered service whereas online
community services are group-centered. Social networking sites allow users to share ideas,
pictures, posts, activities, events, interests with people in their network.

A social network is a social structure made up of a set of social actors (such as
individuals or organizations) and a set of the dyadic ties between these actors. The social
network perspective provides a set of methods for analyzing the structure of whole social entities
as well as a variety of theories explaining the patterns observed in these structures.[1] The study
of these structures uses social network analysis to identify local and global patterns, locate
influential entities, and examine network dynamics.

4. Kleinberg’s Burst-Detection Method

In addition to the change-point detection based on SDNML followed by DTO described
in the previous sections, we also test the combination of our method with Kleinberg's burst-detection
method. More specifically, we implemented a two-state version of Kleinberg's burst-detection
model. We chose the two-state version because, in this experiment, we do not expect a
hierarchical structure of bursts.
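To make the idea concrete, the following is a minimal C# sketch of a two-state burst detector
in the spirit of Kleinberg's model, operating on the gaps between consecutive events. The rate
ratio s, the transition-cost weight gamma, and the exponential emission model are assumptions
of this sketch and not the project's actual implementation.

using System;

class TwoStateBurstDetector
{
    // Returns, for each gap, 0 (base state) or 1 (burst state) along the cheapest state sequence.
    public static int[] DetectBursts(double[] gaps, double s, double gamma)
    {
        int n = gaps.Length;
        if (n == 0) return new int[0];

        double totalTime = 0;
        foreach (double g in gaps) totalTime += g;

        double baseRate = n / totalTime;               // lambda_0
        double[] rates = { baseRate, s * baseRate };   // lambda_0 and the faster burst rate lambda_1
        double upCost = gamma * Math.Log(n);           // cost charged for moving up into the burst state

        double[] cost = { 0.0, double.PositiveInfinity };   // start in the base state
        int[][] back = new int[n][];

        for (int t = 0; t < n; t++)
        {
            back[t] = new int[2];
            var newCost = new double[2];
            for (int j = 0; j < 2; j++)
            {
                // Negative log-likelihood of the gap under an exponential with rate lambda_j.
                double emit = -Math.Log(rates[j]) + rates[j] * gaps[t];

                double fromBase = cost[0] + (j == 1 ? upCost : 0.0);   // only upward moves are charged
                double fromBurst = cost[1];
                if (fromBase <= fromBurst) { newCost[j] = fromBase + emit; back[t][j] = 0; }
                else { newCost[j] = fromBurst + emit; back[t][j] = 1; }
            }
            cost = newCost;
        }

        // Trace the cheapest sequence back from the final gap.
        var states = new int[n];
        states[n - 1] = cost[0] <= cost[1] ? 0 : 1;
        for (int t = n - 1; t > 0; t--)
            states[t - 1] = back[t][states[t]];
        return states;
    }
}

Runs of 1s in the returned array mark burst periods; the dynamic program visits each gap once,
so the cost is linear in the number of events.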

5. Data set.

One data set is related to the recent leakage of a confidential video by a Japan
Coast Guard officer. The keyword used in the keyword-based methods was "Senkaku." For each
data set we compare link-anomaly-based change detection and burst detection against
text-anomaly-based change detection and burst detection. Another data set is related to a
controversial post by a famous person in Japan that "the reason students having difficulty finding
jobs is because they are stupid" and the various replies to that post. The keyword used in the
keyword-based methods was "Job hunting." The four data sets we collected are called "Job
hunting", "Youtube", "NASA", and "BBC", and each of them corresponds to a user-organized list
in Togetter.
For each list, we extracted the Twitter users that appeared in the list and collected the
Twitter posts from those users, recording the number of participants and the number of posts
collected for each data set. Note that we collected Twitter posts up to 30 days before the time
period of interest for each user; thus, the number of posts we analyzed was much larger than the
number of posts listed in Togetter. The "NASA" data set is related to the discussion among Twitter
users interested in astronomy that preceded NASA's press conference about the discovery of an
arsenic-eating organism. The "BBC" data set is related to angry reactions among Japanese Twitter
users against a BBC comedy show that asked "who is the unluckiest person in the world" (the
answer being a Japanese man who was hit by nuclear bombs in both Hiroshima and Nagasaki but
survived).

NORMALIZATION

The basic objective of normalization is to reduce redundancy, which means that each piece of
information is stored only once. Storing information several times leads to wastage of
storage space and an increase in the total size of the data stored.

If a database is not properly designed, it can give rise to modification anomalies. Modification
anomalies arise when data is added to, changed in, or deleted from a database table. Similarly, in
traditional databases as well as improperly designed relational databases, data redundancy can be
a problem. These problems can be eliminated by normalizing the database.

Normalization is the process of breaking down a table into smaller tables, so that each
table deals with a single theme. There are three different kinds of modification anomalies, and the
first, second, and third normal forms were formulated to remove them; third normal form (3NF) is
considered sufficient for most practical purposes. Normalization should be carried out only after a
thorough analysis and complete understanding of its implications.

FIRST NORMAL FORM (1NF):

This form is also called a "flat file". Each column should contain data in respect of a
single attribute, and no two rows may be identical. To bring a table to First Normal Form,
repeating groups of fields should be identified and moved to another table.

SECOND NORMAL FORM (2NF):

A relation is said to be in 2NF if it is in 1NF and the non-key attributes are functionally
dependent on the key attributes. A 'functional dependency' is a relationship among attributes:
one attribute is said to be functionally dependent on another if the value of the first attribute
depends on the value of the second attribute.

THIRD NORMAL FORM (3NF) :

Third Normal Form normalization is needed where all attributes in a relation tuple are not
functionally dependent only on the key attribute. A transitive dependency is one in which one
attribute depends on a second, which in turn depends on a third, and so on.

IMPLEMENTATION

SCALABILITY OF THE PROPOSED ALGORITHM MODULE

The proposed link-anomaly-based change-point detection is highly scalable. Every step
described in the previous subsections requires only linear time against the length of the analyzed
time period. Computation of the predictive distribution for the number of mentions can be
performed in linear time against the number of mentions. Computation of the predictive
distribution for the mention probability can be efficiently performed using a hash table.
Aggregation of the anomaly scores from different users takes linear time against the number of
users, which could be a computational bottleneck but can be easily parallelized. SDNML-based
change-point detection requires two passes over the analyzed time period. Kleinberg's burst-
detection method can be efficiently implemented with dynamic programming.
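As an illustration of the hash-table aggregation step, the sketch below (in C#) sums per-post
anomaly scores into fixed-length time windows using a Dictionary; the window length and the
input representation are assumptions for this sketch only.

using System;
using System.Collections.Generic;

class AnomalyAggregator
{
    // scoredPosts: (timestamp, anomaly score) pairs gathered from all users.
    // Returns the aggregated score per time window in a single linear pass.
    public static Dictionary<long, double> Aggregate(
        IEnumerable<Tuple<DateTime, double>> scoredPosts, TimeSpan window)
    {
        var totals = new Dictionary<long, double>();   // window index -> aggregated anomaly score

        foreach (var post in scoredPosts)
        {
            long bucket = post.Item1.Ticks / window.Ticks;   // constant-time bucket lookup
            double current;
            totals.TryGetValue(bucket, out current);         // current stays 0 if the bucket is new
            totals[bucket] = current + post.Item2;
        }
        return totals;
    }
}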

SYSTEM TESTING

The goal of testing is to improve the program's quality. Quality is assured primarily
through some form of software testing. The history of testing goes back to the beginning of the
computing field. Testing is done at two levels: testing individual modules and testing the
entire system. During system testing, the system is used experimentally to ensure that the
software will run according to the specifications and in the way the user expects. Testing is very
tedious and time consuming. Each test case is designed with the intent of finding errors in the
way the system will process it.

LEVELS OF TESTING

The software underwent the following tests by the system analyst.

WHITE BOX TESTING

By using this technique it was tested that all the individual logical paths were executed at
least once. All the logical decisions were tested on both their true and false sides. All the loops
were tested with data in between the ranges and especially at the boundary values.

BLACK BOX TESTING

By the use of this technique, the missing functions were identified and placed in their
positions. The errors in the interfaces were identified and corrected. This technique was also used
to identify the initialization and termination errors and correct them.

UNIT TESTING

It is the verification of a single module, usually in an isolated environment. The System
Analyst tests each and every module individually by giving a set of known input data and
verifying for the required output data. The System Analyst tested the software with a top-down
model, starting from the top of the model. The units in a system are the modules and routines that
are assembled to perform a specific function. The modules should be tested for correctness of the
logic applied and should detect errors in coding. This is the verification of the system to its initial
objective. It is a verification process when it is done in a simulated environment and a
validation process when it is done in a live environment.

INTEGRATION TESTING

The purpose of unit testing is to determine that each independent module is correctly
implemented. This gives little chance to determine that the interface between modules is also
correct and for this reason integration testing must be performed. One specific target of
integration testing is the interface. Whether parameters match on both sides as to type,
permissible ranges, meaning and utilization. Module testing assures us that the detailed design
was correctly implemented; now it is necessary to verify that the architectural design
specifications were met. Chosen portions of the structure tree of the software are put together.
Each sub tree should have some logical reason for being tested. It may be a particularly difficult
or tricky part of the code; or it may be essential to the function of the rest of the product. As the
testing progresses, we find ourselves putting together larger and longer parts of the tree, until the
entire product has been integrated.

VALIDATION TESTING

The main aim of this testing is to verify that the software system does what it was
designed for. The system was tested to ensure that the purpose of automating the system was met.
Alpha testing was carried out to ensure the validity of the system.

OUTPUT TESTING

The outputs generated by the system under consideration were tested by asking the users about
the format required by them. The output format on the screen was found to be correct, as the
format was designed in the system design. Output testing was done and it did not result in any
change or correction in the system.

USER ACCEPTANCE TESTING

The system under consideration is tested for user acceptance by constantly keeping in
touch with prospective system users at the time of developing it and by making changes whenever
required. The following points are considered:

 Input screen design

 Output screen design

 Online message to guide the user

 Menu driven system

 Format of adhoc queries and reports

PERFORMANCE TESTING

Performance is taken as the last part of implementation. Performance is perceived as
response time for user queries, report generation, and process-related activities.

Test Cases:

We are following ad-hoc testing. Severity and priority were not recorded (marked "-") for any of
the cases below.

Test Case 001, Login. Description: login stage; verify the login name, password, and
transaction code. Result: login successful (Pass).

Test Case 002, Registration. Description: enter the required details (name, gender, contact,
address, email id, occupation, password) and log in. Result: registration successful (Pass).

Test Case 003, Certificate creation. Description: enter the transaction ID, name, contact no,
certification validity, amount, and bank, and upload biometrics. Result: certificate creation
successful (Pass).

Test Case 004, Download certificate. Description: browse and download the certificate after
verifying the username, transaction code, and biometrics. Result: download successful (Pass).

Test Case 005, Upload file and encryption. Description: enter the certificate no, upload the
file, and select the outsourcing server for the encrypted data. Result: file encrypted (Pass).

Test Case 006, Server file list. Description: select a file from the server file list and download
it. Result: download successful (Pass).

Test Case 007, Save file. Description: save the encrypted file. Result: encrypted file saved
(Pass).
CONCLUSION

In this paper, we have proposed a new approach to detect the emergence of topics in a
social network stream. The basic idea of our approach is to focus on the social aspect of the posts
reflected in the mentioning behaviour of users instead of the textual contents. We have proposed
a probability model that captures both the number of mentions per post and the frequency of
mentionee. Furthermore, we have combined the proposed mention model with recently proposed
SDNML change-point detection algorithm to pin-point the emergence of a topic.

We have applied the proposed approach to real data sets we have collected from
Twitter. In all the data sets our proposed approach showed promising performance; the detection
by the proposed approach was as early as that of term-frequency-based approaches using the
keywords that best describe the topic, chosen manually in hindsight. Furthermore,
for the "BBC" data set, in which the keyword that defines the topic is more ambiguous than in the
other data sets, the proposed link-anomaly-based approach detected the emergence of the topic
much earlier than the keyword-based approach.

REFERENCES

[1] J. Allan, J. Carbonell, G. Doddington, J. Yamron, Y. Yang et al., “Topic detection and
tracking pilot study: Final report,” in Proceedings of the DARPA broadcast news transcription
and understanding workshop, 1998.
[2] J. Kleinberg, “Bursty and hierarchical structure in streams,” Data Min. Knowl. Disc., vol. 7,
no. 4, pp. 373–397, 2003.

[3] Y. Urabe, K. Yamanishi, R. Tomioka, and H. Iwai, "Real-time change-point detection using
sequentially discounting normalized maximum likelihood coding," in Proceedings of the 15th
PAKDD, 2011.

[4] S. Morinaga and K. Yamanishi, “Tracking dynamics of topic trends using a finite mixture
model,” in Proceedings of the 10th ACM SIGKDD, 2004, pp. 811–816.

[5] Q. Mei and C. Zhai, “Discovering evolutionary theme patterns from text: an exploration of
temporal text mining,” in Proceedings of the 11th ACM SIGKDD, 2005, pp. 198–207.

[6] A. Krause, J. Leskovec, and C. Guestrin, “Data association for topic intensity tracking,” in
Proceedings of the 23rd ICML, 2006, pp. 497–504.

[7] D. He and D. S. Parker, "Topic dynamics: an alternative model of bursts in streams of
topics," in Proceedings of the 16th ACM SIGKDD, 2010, pp. 443–452.

[8] H. Small, “Visualizing science by citation mapping,” Journal of the American society for
Information Science, vol. 50, no. 9, pp. 799–813, 1999.

[9] D. Aldous, "Exchangeability and related topics," in École d'Été de Probabilités de Saint-
Flour XIII—1983. Springer, 1985, pp. 1–198.

[10] J. Takeuchi and K. Yamanishi, "A unifying framework for detecting outliers and change
points from time series," IEEE T. Knowl. Data En., vol. 18, no. 4, pp. 482–492, 2006.

[11] J. Rissanen, “Strong optimality of the normalized ML models as universal codes and
information in data,” IEEE T. Inform. Theory, vol. 47, no. 5, pp. 1712–1717, 2002.

[12] J. Rissanen, T. Roos, and P. Myllymäki, "Model selection by sequentially normalized least
squares," Journal of Multivariate Analysis, vol. 101, no. 4, pp. 839–849, 2010.

[13] K. Yamanishi and Y. Maruyama, “Dynamic syslog mining for network failure monitoring,”
Proceeding of the 11th ACM SIGKDD, p. 499, 2005.

[14] T. Takahashi, R. Tomioka, and K. Yamanishi, "Discovering emerging topics in social
streams via link anomaly detection," arXiv:1110.2899v1 [stat.ML], Tech. Rep., 2011.

