
Respected teachers and my fellow classmates:

Our presentation topic is Big Data, and it will be presented amidst you all by:
Arpan Chakroborty
Brinta Roy Chowdhury
Sourish Dutta, and myself, Smriti Verma.
INTRODUCTION: SLIDE 1 (Smriti):
Today we are going to talk about big data and exactly what it is.
It starts with a progression, and the word progression is used
because of the history of the evolution of data. Historically, data
was created manually by employees who entered it, and gradually
it came to be used by people. Then we moved into the era of the
internet, or evolved to the present time, where every single
person creates data. The magnitude of difference between earlier
data and user-created data was large, really large. But now we
also have a third level. Technically speaking, it is machine-accumulated
data; in simpler terms, every computer around the world, buildings
with CCTVs, monitors for temperature and humidity, and the
technologies making our lives easy are all generating data. Even
the satellites that monitor the earth 24 hours a day generate
large amounts of data that are difficult to store and process by
traditional methods.
Hence the term big data came into existence.
We will discuss the

Definition
Characteristics
Challenges
Analytical techniques and
Benefits of big data through our presentation.

Moving ahead:

What is Big Data? Slide 2: (Smriti)


We discussed the evolution of data and a brief history. Let's
move on to the basic definition of the term:
Big data is a term for data sets that are so large and complex
that traditional data processing applications are inadequate.
It is the next big thing in the IT industry. And by this we mean
that every organisation that needs data storage, processing, or
backups needs it.
It is a result of the data explosion. Graphically, it is shown this
way:
(explaining graph) Over time, data grew from small to big. Our
ability to do anything with data lies here, while big data has
reached here in volume.
Let's move on to another point: 2.5 exabytes of data are
created every day.
For a start, retailers are building databases of recorded
customers.
We have organisations working in finance, logistics, and
healthcare storing more data. Today hospitals keep past records
of patients. Every financial transaction is recorded. Then we
have social media, which is one of the biggest sources of data
today. Vision recognition also adds to the contribution.
Lastly, the Internet of Things is one big source too, as here
things not directly related to technology are put on the internet
for various purposes, and hence data is created again.

Next, Arpan will explain the need, sources, and appeal of
big data:

Need for big data? Slide 3 (Arpan)


After understanding the basic definition of big data, let's have a
look at the need for big data, or more specifically the "why"
factor.
Why big data?

We have certain points that explain why we need big data and
its solutions:
Firstly, it's a fact that 90% of the data present today has been
generated in the last two years, which means the exponential
increase is really very high.
Moreover, 80% of the data is unstructured or exists in widely
varying structures, which are difficult to analyse.
And among the structured data, we face a limitation with
respect to handling large quantities.
The next point is that it becomes difficult to integrate
information distributed across multiple systems, that is,
different sources.
Next, organisations and business people face difficulty in
analysis.
And regarding backups, the databases hold a lot of useful
data, but their life span is really short because of storage
problems.
Then we also face an issue with the potential value of data
being discarded or not being realised.

These facts lead us to the need to solve big data problems.


In the next slide we will have a look at the importance of the
term and a reality check on everything we have stated till now.

Big data sources: Arpan (Slide 4)


This portion is all about where this data comes from. An idea
has already been provided that big data sources basically
comprise:
Web logs
Sensor networks
Social media
Internet text and documents
Internet pages
Search index data
Atmospheric science
Astronomy
Biochemical data
Medical Records
Scientific Research
Military surveillance
Photography archives
Now, after having an overall idea of what big data is, let's go
further and study the characteristics of big data, which will be
presented by Brinta:

Characteristics of Big Data: Slides 5 and 6


(brinta)

Characteristics of big data, or we can say the properties or
challenges that differentiate it from small data.
The first characteristic is
Variety:
This is the basic type or nature of the data, i.e. structured
vs. unstructured, and the different sources the data comes from.
Big Data isn't just numbers, dates, and strings. Big Data is
also geospatial data, 3D data, audio and video, and
unstructured text, including log files and social media.
The second characteristic is
Volume:
This characteristic is both the greatest challenge and the
greatest opportunity. Large volume is what defines big data.
Then we have
Velocity:
This characteristic is about the speed at which data is
generated and processed to meet the demands and
challenges that lie in the path of growth and development.
Next is
Value:
We have data from different statistics and events, and hence
different values for which data is generated. Determining its
potential becomes tough.
And lastly
Veracity:
This refers to biases, noise, and abnormality in data. It is
basically about separating dirty data from clean data.
Veracity concerns how trustworthy the data is and depends on
availability and accountability.
These 5 Vs explain big data's characteristics.
In short we can say:
Volume is data at rest
Velocity is data in motion
Variety is data in many forms
Veracity is data in doubt and
Value is data in limbo

Types of data:
One of the most important aspects of our discussion today:
This differentiates between data according to its types and
further the analysis depends on the categorization done here:
First comes structured, then unstructured, and finally hybrid or
multi-structured data.
Structured data can be processed and stored by traditional
methods, like:
(state the points from the image)
MySQL
Mainframe
Oracle
DB2
Sybase
Access, Excel, TXT, etc.
Teradata
Netezza and other MPP systems
SAP, JDE, JDA, etc.
Unstructured data does not have a formal data model but might
have some resemblance of structure, e.g. XML files,
and comes from sources like:
Social media
Digital video
Audio
Geospatial data
The third kind, hybrid, has mixed properties:
emerging market data
e-commerce
weather
currency conversions
demographics, etc.
Types include POS, POL, IR, etc.
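To make the structured/unstructured distinction concrete, here is a minimal Python sketch; the sample records and field names are invented for illustration. A structured row parses directly against its fixed schema, while an unstructured log line needs a pattern-matching pass first:

```python
import csv
import io
import re

# Structured data: a fixed schema, so traditional tools parse it directly.
structured = io.StringIO("id,name,amount\n1,alice,30\n2,bob,45\n")
rows = list(csv.DictReader(structured))

# Unstructured data: no formal model, so we extract fields with a pattern.
log_line = "2024-01-05 ERROR payment failed for user=bob amount=45"
match = re.search(r"user=(\w+) amount=(\d+)", log_line)
extracted = {"name": match.group(1), "amount": int(match.group(2))}

print(rows[1]["name"], extracted["name"])  # both resolve to "bob"
```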

Now we will move on to the importance of big data, presented by
Sourish.

Why is it important: Slides 8 and 9 (Sourish)

Importance of big data


We have an insight into the data collected every 60 seconds:

694,445+ Google search queries
20,000+ new posts on Tumblr
1,600+ reads on Scribd
168 million emails sent
70+ domains registered
60+ new videos with 48+ hours of duration
1,500+ blog posts
695,000+ Facebook status updates
430,000+ tweets
44,000,000 messages processed on WhatsApp
56,000 photos uploaded on Instagram

And a total of 1,820 TB of data created.

Reminding you again, this is for just 60 seconds.
Another image gives a look into the landscape of big data.
We can see some facts, like:

Today's smartphone is more powerful than 1985's computers, and
streamed video takes up a third of total online traffic.
There are many more facts to prove the data explosion
and the other terms we used for explaining big data.
But talking about statistics: which sector actually provides a
large amount of data and which a small amount? For that, a graph
has been made to be more specific.
This shows the types of data analysed across different sources,
and we will look at the percentage of respondents:
Transactions provide the largest share, i.e. 70%
Log data stands at 55%
Machine or sensor data at 42%
Emails and social media at 36% and 32% respectively
Free-form text at 26%
Geospatial data, images, video, and audio at 23%, 16%, 9%, and 6%
respectively, and 12% is provided by other sources combined.
This analysis covers many industries.
Next we will move on to the analytical portion, or in simpler
words, the ways to overcome these challenges and how to analyse
data. I'll hand it over to Arpan for the next section.

Why big data analysis is needed?


(Arpan) (Slide 10)
Why do we need big data analysis?
Big data analytics examines large amounts of data to uncover
hidden patterns, correlations and other insights.
Some of the main points that can be big reasons for analysis
are
Examining large amounts of data

As we have seen, the data collected is of different sizes, types,
and properties. When we go for analysis, we need to work with
clusters of such data. Proper examination is needed to get
proper results.
Next we have
Appropriate information
Very clear from the words themselves, as everyone loves
to-the-point, clean data. This is the first and foremost
requirement.
Next we have
Identification of hidden patterns and unknown correlations
This basically means that such large data sets have many
different relation patterns. We need proper analytics to
understand them.
Competitive advantage
and
Better business decisions: strategic and operational
In earlier times, before data scientists had the algorithmic
techniques and other tools we have today, every data set had to
be processed with long strategies and mathematics; then better
ways had to be discovered to solve the problems. More growth
brought more needs. These analytical processes gave us better
business decision making, and we rose above competitively.
Lastly,
analysis also helps in effective marketing, customer
satisfaction, and increased revenue.

These are the main reasons and goals behind big data analytics.
Let's look further into the phases and challenges faced
during analysis.

Phases and challenges of big data?


(Arpan) (Slide 11)

The phases of a proper analysis can be shown in 5 steps.

Data Acquisition and Recording
This acquisition and recording phase decreases human effort.

Information Extraction and Cleaning
This is cleaning the data and extracting usable information from
dirty data.

Data Integration, Aggregation, and Representation
Given the heterogeneity of the flood of data, the tools then
come to this phase of analysis for better results.

Query Processing, Data Modeling, and Analysis

Last is
Interpretation,
which is a combined overlook over all the phases covered above.
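The five phases above can be sketched as a toy pipeline in Python. The sample records and helper names are invented, so this is only an illustration of the flow, not a real analysis system:

```python
# A toy pipeline mirroring the five phases above; the sample records
# and field names are invented for illustration.

def acquire():
    # Phase 1: acquisition and recording (here, hard-coded raw lines).
    return ["alice, 30 ", "bob,45", "", "carol,abc"]

def extract_and_clean(raw):
    # Phase 2: extract fields and drop dirty records.
    cleaned = []
    for line in raw:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2 and parts[1].isdigit():
            cleaned.append({"name": parts[0], "amount": int(parts[1])})
    return cleaned

def integrate(records):
    # Phase 3: aggregate into a single representation keyed by name.
    return {r["name"]: r["amount"] for r in records}

def analyse(table):
    # Phase 4: query processing and analysis, here just a total.
    return sum(table.values())

def interpret(total):
    # Phase 5: interpretation, an overlook on the result.
    return f"total amount processed: {total}"

print(interpret(analyse(integrate(extract_and_clean(acquire())))))
# prints "total amount processed: 75"
```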

The second portion is about the challenges, and the first one is

Timeliness
The larger the data set to be processed, the longer it will
take to analyse.

Heterogeneity and Incompleteness
Hence difficulty in analysing.

Human Collaboration
This simply means that however much computational advancement is
gained, we will still need a human hand.

Scale and Privacy
This particular point is very obvious, as ever more data will be
accessed, and hence scale will have a major effect and privacy
may be hampered.

After looking at the phases and challenges, let's look at the
tools and techniques.

Tools and techniques used for big data?


(Slides 11 and 12) (Smriti)
We will discuss the tools and techniques of big data.
The different ways used are:
1. Big data analytics in the cloud: when traditional storage
ceased to solve problems, cloud sources came into existence.
And the technique was:
Hadoop, a framework and set of tools for processing very large
data sets, was originally designed to work on clusters of
physical machines. That has changed. Now an increasing
number of technologies are available for processing data in the
cloud.
Examples include:
Amazon's Redshift hosted BI data warehouse
Google's BigQuery data analytics service
IBM's Bluemix cloud platform
Amazon's Kinesis data processing service
These technologies are used by Smarter Remarketer, a provider of
SaaS-based retail analytics based in Indianapolis.

2. Hadoop: the new enterprise data operating system

Distributed analytic frameworks, such as MapReduce, are
evolving into distributed resource managers that are gradually
turning Hadoop into a general-purpose data operating system.
With these systems you can perform many different data
manipulations and analytics operations by plugging them into
Hadoop as the distributed file storage system.
What does this mean for the enterprise?
As SQL, MapReduce, in-memory, stream processing, graph
analytics, and other types of workloads are able to run on
Hadoop with adequate performance, more businesses will use
Hadoop as an enterprise data hub. The ability to run many
different kinds of queries and data operations against data in
Hadoop will make it a low-cost, general-purpose place to put
data that you want to be able to analyse.
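The MapReduce style mentioned above can be sketched in plain Python. No Hadoop cluster is assumed; this just shows the map, shuffle, and reduce steps of a word count on an in-memory list of invented documents:

```python
from collections import defaultdict

# A toy map/reduce word count; real Hadoop distributes these steps
# across a cluster, but the shape of the computation is the same.
documents = ["big data big insights", "data at scale"]

# Map: emit (word, 1) pairs from each document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(values) for word, values in grouped.items()}

print(counts["big"], counts["data"])  # → 2 2
```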
3. Big data lakes
Traditional database theory dictates that you design the data
set before entering any data. A data lake, also called an
enterprise data lake or enterprise data hub, turns that model on
its head. It provides tools for people to analyze the data, along
with a high-level definition of what data exists in the lake.
People build their views into the data as they go along. It's a
very incremental, organic model for building a large-scale
database.
On the downside, the people who use it must be highly skilled.
We still want the capabilities that traditional enterprise
databases have had for decades: monitoring, access control,
encryption, securing the data, and tracing the lineage of data
from source to destination.
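A rough sketch of the schema-on-read idea behind a data lake, with invented records and field names: raw data lands as-is, and each consumer defines their own view only at query time:

```python
import json

# A toy data lake: raw records land as-is (schema-on-read), and each
# consumer builds its own view later. Records and fields are invented.
lake = [
    '{"user": "alice", "event": "click", "ms": 120}',
    '{"user": "bob", "event": "buy", "price": 9.5}',
    '{"user": "alice", "event": "buy", "price": 4.0}',
]

# One analyst's view, defined at read time: revenue per user.
revenue = {}
for line in lake:
    record = json.loads(line)
    if record["event"] == "buy":
        revenue[record["user"]] = revenue.get(record["user"], 0) + record["price"]

print(revenue)  # → {'bob': 9.5, 'alice': 4.0}
```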
4. More predictive analytics

With big data, analysts have not only more data to work with,
but also the processing power to handle large numbers of
records with many attributes. Traditional machine learning uses
statistical analysis based on a sample of a total data set. You
now have the ability to process very large numbers of records
with very large numbers of attributes per record, and that
increases predictability.
The combination of big data and compute power also lets
analysts explore new behavioral data throughout the day, such
as websites visited or location. This is sparse data, because to
find something of interest you must wade through a lot of data
that doesn't matter. Now you can find which variables are best
analytically by throwing huge computing resources at the
problem. It really is a game changer.
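A minimal sketch of that variable-screening idea, with invented records and feature names: score each candidate attribute by how strongly it separates the two outcome groups, then keep the best one. Real systems screen thousands of attributes this way:

```python
# A toy screen over candidate attributes: score each one by how well
# it separates two outcome groups (data and feature names invented).
records = [
    {"visits": 1, "night_user": 0, "bought": 0},
    {"visits": 8, "night_user": 1, "bought": 1},
    {"visits": 2, "night_user": 1, "bought": 0},
    {"visits": 9, "night_user": 0, "bought": 1},
]

def score(feature):
    # Difference of the feature's mean between buyers and non-buyers.
    buyers = [r[feature] for r in records if r["bought"]]
    others = [r[feature] for r in records if not r["bought"]]
    return abs(sum(buyers) / len(buyers) - sum(others) / len(others))

best = max(["visits", "night_user"], key=score)
print(best)  # → visits
```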
To enable real-time analysis and predictive modeling out of the
same Hadoop core is where the interest lies. The problem has
been speed, with Hadoop taking up to 20 times longer to get
questions answered than more established technologies did.
Hence Apache Spark, a large-scale data processing engine, and
its associated SQL query tool, Spark SQL. Spark offers fast
interactive querying as well as graph services and streaming
capabilities. It keeps the data within Hadoop but gives enough
performance to close the gap.
5. SQL on Hadoop: faster, better
If you're a smart coder and mathematician, you can drop data
in and do an analysis on anything in Hadoop. That's the
promise, and the problem. But we need someone to put it
into a format and language structure that we are familiar with.
That's where SQL-for-Hadoop products come in, although any
familiar language could work. Tools that support SQL-like
querying let business users who already understand SQL apply
similar techniques to that data, without having to write scripts
in Java, JavaScript, or Python, something Hadoop users have
traditionally needed to do. SQL on Hadoop opens the door
to Hadoop in the enterprise.
These tools are nothing new. Apache Hive has offered a
structured, SQL-like query language for Hadoop for
some time. But commercial alternatives from Cloudera,
Pivotal Software, IBM, and other vendors not only offer much
higher performance, but are also getting faster all the time.
That makes the technology a good fit for iterative analytics,
where an analyst asks one question, receives an answer, and
then asks another one. That type of work has traditionally
required building a data warehouse. SQL on Hadoop isn't going
to replace data warehouses, at least not anytime soon, but it
does offer alternatives to more costly software and appliances
for certain types of analytics.
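The iterative ask-an-answer, ask-again style can be illustrated with plain SQL. Here Python's built-in sqlite3 is only a local stand-in for a SQL-on-Hadoop engine such as Hive, and the table, columns, and rows are invented:

```python
import sqlite3

# sqlite3 stands in locally for a SQL-on-Hadoop engine; the point is
# the iterative SQL querying style, not the storage layer.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT, action TEXT, amount REAL)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("alice", "buy", 4.0), ("bob", "buy", 9.5), ("alice", "view", 0.0)],
)

# Question 1: how many purchase events are there?
purchases = con.execute(
    "SELECT COUNT(*) FROM events WHERE action = 'buy'"
).fetchone()[0]

# The answer prompts question 2: who spent the most?
top = con.execute(
    "SELECT user, SUM(amount) AS total FROM events "
    "WHERE action = 'buy' GROUP BY user ORDER BY total DESC"
).fetchone()[0]

print(purchases, top)  # → 2 bob
```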
6. More, better NoSQL
Alternatives to traditional SQL-based relational databases,
called NoSQL (short for Not Only SQL) databases, are rapidly
gaining popularity as tools for use in specific kinds of analytic
applications.. He estimates that there are 15 to 20 open-source
NoSQL databases out there, each with its own specialization.
For example, a NoSQL product with graph database capability,
such as ArangoDB, offers a faster, more direct way to analyze
the network of relationships between customers or salespeople
than does a relational database.
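A toy illustration of why relationship queries suit a graph model, in plain Python with an invented network: following edges is a direct traversal, rather than the repeated joins a relational database would need:

```python
from collections import deque

# A toy graph of customer relationships, the kind of network a graph
# database is built to query; the names and edges are invented.
knows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": [],
    "dave": [],
}

def reachable(start):
    # Breadth-first walk over the relationship network.
    seen, queue = {start}, deque([start])
    while queue:
        person = queue.popleft()
        for friend in knows.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return seen - {start}

print(sorted(reachable("alice")))  # → ['bob', 'carol', 'dave']
```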

7. Deep learning
Deep learning, a set of machine-learning techniques based on
neural networks, is still evolving but shows great potential for
solving business problems. Deep learning enables
computers to recognize items of interest in large quantities of
unstructured and binary data, and to deduce relationships
without needing specific models or programming instructions.
It could be used to recognize many different kinds of data, such
as the shapes, colors, and objects in a video, or even the
presence of a cat within images. This notion of cognitive
engagement and advanced analytics, and the things it implies,
is an important future trend.
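Deep networks stack many simple learned units. As a deliberately tiny sketch of the underlying idea, here is a single artificial neuron trained by gradient descent to reproduce the AND function, in pure standard-library Python with toy data:

```python
import math
import random

# One artificial neuron learning the AND function by gradient descent.
# Real deep learning stacks many such units into layers; this only
# shows the core learn-from-error loop.
random.seed(0)
w = [random.uniform(-1, 1) for _ in range(2)]
b = 0.0
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def predict(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1 / (1 + math.exp(-z))  # sigmoid activation

for _ in range(5000):
    for x, target in data:
        err = predict(x) - target
        # Gradient step for the logistic loss, learning rate 0.5.
        w[0] -= 0.5 * err * x[0]
        w[1] -= 0.5 * err * x[1]
        b -= 0.5 * err

print([round(predict(x)) for x, _ in data])  # → [0, 0, 0, 1]
```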
8. In-memory analytics

The use of in-memory databases to speed up analytic
processing is increasingly popular and highly beneficial in the
right setting, says Beyer. In fact, many businesses are already
leveraging hybrid transaction/analytical processing (HTAP),
allowing transactions and analytic processing to reside in the
same in-memory database.
But there's a lot of hype around HTAP, and businesses have
been overusing it, Beyer says. For systems where the user
needs to see the same data in the same way many times
during the day, and there's no significant change in the data,
in-memory is a waste of money.
And while you can perform analytics faster with HTAP, all of the
transactions must reside within the same database.
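The HTAP idea, transactions and analytics sharing one in-memory store, can be miniaturized with sqlite3's `:memory:` mode as a stand-in for a real in-memory HTAP database; the table and data are invented:

```python
import sqlite3

# HTAP in miniature: transactional writes and an analytic query hit
# the same in-memory store, with no separate warehouse load step.
# (sqlite3's :memory: mode stands in for an in-memory HTAP database.)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

# Transactional side: individual orders arrive and are committed.
for customer, amount in [("alice", 20.0), ("bob", 15.0), ("alice", 5.0)]:
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (customer, amount))

# Analytic side: the same data is immediately queryable in aggregate.
total, top = db.execute(
    "SELECT SUM(amount), customer FROM orders GROUP BY customer "
    "ORDER BY SUM(amount) DESC"
).fetchone()

print(total, top)  # → 25.0 alice
```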

Opportunities of Big Data in different sectors: (Slides 13 and 14) (Brinta):
We have seen the different techniques adopted by companies and
organisations.
Let's now discuss the usefulness gained by different sectors.
Smarter healthcare: imagine having all information related
to patients recorded. There will be better understanding
and treatment, and millions of dollars will be saved.
Homeland security: again, a better place to be if we are
secure.
Traffic control, manufacturing, and multi-channel sales:
it's a fact that every few years the data recorded in these
areas has to be deleted. Big data will overcome this
problem.
Other sectors that will be affected positively by big data are
telecom, trading analytics, and search quality.
Potential value of Big data:
Let's have a look at the financial side involved in big data.
$300 billion potential annual value to US health care.
$600 billion potential annual consumer surplus from using
personal location data.
60% potential increase in retailers' operating margins.
These figures are just examples; the total worth is proof
that this is actually the next big thing in the IT industry.
The table shows the potential annual talent pool for big data
projects in India. The data is in numbers of people in different
sections.
For IT professionals it is highest, at 28,00,000 (28 lakh), and it
gradually decreases: maths/science graduates at 7 lakh,
engineers at 5 lakh, economics students at around 3 lakh 50
thousand, maths/science postgraduates at 3 lakh, MBAs at 2 lakh
50 thousand, PhDs at 1 thousand, and statistics graduates at 5,000.
This is estimated data.
Last but not least we will discuss the benefits.

Benefits: (Slides 15 and 16) (Sourish)

After studying all the other aspects of big data, let us discuss
the benefits we gain from it.
Better ability to make strategic decisions: a 59% improvement is
achieved.
Better steering of operational processes is increased by 51%.
Faster analysis by 50%.
More detailed analysis by 43%.
Improved customer service by 32%.
Better targeted marketing by 31%.
Better insight by 28%.
Cost reduction at 28%.
Better products and better customer retention by 25% and 16%.
Big Data is Timely: 60% of each workday is spent attempting
to find and manage data.
Big Data is Accessible: half of senior executives report that
accessing the right data is difficult.
Big Data is Holistic: information is currently kept in silos
within the organization, with each analysis focused only on its
particular silo.
Big Data is Trustworthy: things as simple as monitoring
multiple systems for customer contact information updates can
save millions of dollars.
Big Data is Relevant: 43% of companies are dissatisfied with
their tools' ability.
Proper analytics can provide a ton of insight into your
acquisition efforts.
Big Data is Secure: the average data security breach costs
$214 per customer.
Big Data is Authoritative: 80% of organizations struggle with
multiple versions of the truth depending on the source of their
data. By combining multiple, vetted sources, more companies
can produce highly accurate intelligence sources.
Big Data is Actionable: outdated or bad data results in 46%
of companies making bad decisions that can cost billions.
