COPYRIGHT CHIRAG AHUJA RESTRICTED CIRCULATION

Contents
- What is Big Data
- Conventional Approaches
- Problems with Conventional Approaches
- Welcome to the world of Big Data

What is Big Data
- Every day, the world creates 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
- Gartner defines Big Data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

What is Big Data
- According to IBM, 80% of data captured today is unstructured. It comes from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few.
- All of this unstructured data is also Big Data.

Why Big Data
Huge competition in the market:
- Retail: customer analytics
- Travel: travel patterns of customers
- Websites: understanding users' navigation patterns, interests, conversion, etc.
- Sensors, satellite, geospatial data
- Military and intelligence

Essence of Big Data

Volume
- Today we are living in a world of data.
- There are multiple factors contributing to data growth.
- Huge volumes of data are generated from various sources:
  - Transaction-based data (stored through the years)
  - Text, images and videos from social media
  - Increased amounts of data generated by sensors

Volume
- Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
- Convert 350 billion annual meter readings to better predict power consumption
- Analyze billions of customer complaints to find the root cause of customer churn

Velocity
- According to Gartner, velocity "means both how fast data is being produced and how fast the data must be processed to meet demand."
- Scrutinize 5 million trade events created each day to identify potential fraud
- Analyze customers' searching and buying patterns and show them advertisements for attractive offers in real time

Velocity (example)
Take Google's example of data processing:
- As soon as a blog is posted, it appears in the search results.
- If we search for travel, shopping (electronics, apparel, shoes, watches, etc.), jobs, etc., it shows us relevant advertisements while we browse.
- Even ads in mail are highly content-driven.

Variety
- Data today comes in all types of formats: from traditional databases to hierarchical data stores created by end users and OLAP systems, to text documents, email, meter-collected data, video, audio, stock ticker data and financial transactions.

Veracity
- Big Data veracity refers to the biases, noise and abnormality in data.
- Is the data that is being stored and mined meaningful to the problem being analyzed?
- Veracity in data analysis is the biggest challenge when compared to things like volume and velocity.
- Keep your data clean, and put processes in place to keep dirty data from accumulating in your systems.
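Veracity (example)
One way to keep dirty data from accumulating is to validate records before they are stored. The sketch below is only an illustration of that idea, not a prescribed method: the `Reading` record, its field names, the `max_kwh` threshold and the rule that a repeated (meter, day) pair is a duplicate are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Reading:
    """Hypothetical meter reading; field names are illustrative only."""
    meter_id: str
    day: str
    kwh: Optional[float]  # None models a missing value


def split_clean_dirty(
    readings: Iterable[Reading], max_kwh: float = 1000.0
) -> Tuple[List[Reading], List[Reading]]:
    """Separate readings that pass basic checks from those that do not."""
    seen_keys = set()
    clean: List[Reading] = []
    dirty: List[Reading] = []
    for r in readings:
        key = (r.meter_id, r.day)
        valid = (
            r.meter_id != ""                # meter must be identified
            and r.kwh is not None           # value must be present
            and 0.0 <= r.kwh <= max_kwh     # value must be in a plausible range
            and key not in seen_keys        # assumed rule: one reading per meter per day
        )
        if valid:
            seen_keys.add(key)
            clean.append(r)
        else:
            dirty.append(r)
    return clean, dirty


sample = [
    Reading("m1", "2024-01-01", 12.5),
    Reading("m1", "2024-01-01", 12.5),  # duplicate of the first reading
    Reading("m2", "2024-01-01", None),  # missing value
    Reading("m3", "2024-01-01", -4.0),  # negative consumption (noise)
    Reading("", "2024-01-01", 3.0),     # missing meter id
    Reading("m4", "2024-01-01", 8.0),
]
clean, dirty = split_clean_dirty(sample)
print(len(clean), len(dirty))  # prints "2 4"
```

Rejected records are returned rather than silently dropped, so they can be inspected or corrected later instead of being mixed into the stored data.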

Conventional Approaches
Storage
- RDBMS (Oracle, DB2, MySQL, etc.)
- OS filesystem
Processing
- SQL queries
- Custom frameworks
  - C/C++
  - Python/Perl

Why Big Data Technologies
- Conventional approaches/technologies are not able to solve current problems
- They are good for certain use cases
- But they cannot handle data in the petabyte range

Problems with Conventional Approaches
- Limited storage capacity
- Limited processing capacity
- No scalability
- Single point of failure
- Sequential processing
- RDBMSs can handle only structured data
- Requires preprocessing of data
- Information is collected according to current business needs

Limited Storage capacity
- Installed on a single machine
- Have specified storage limits
- Data must be archived again and again
- Problems reloading archived data back into the repository as business needs require
- Can only process the data that can be stored on a single machine

Limited Processing capacity
- Installed on a single machine
- Have specified processing limits
- Have a fixed number of processing elements (CPUs)
- Not able to process large amounts of data efficiently

No scalability
- One of the biggest limitations of conventional RDBMSs is the lack of scalability
- We cannot add more resources on the fly

Thank You