
Links

Using color and size in visualisation

https://blog.datawrapper.de/colors/ 
https://www.educba.com/what-is-data-warehouse/

Data visualisation
What is Data Visualization?
Data visualization is a graphic representation that expresses the significance of
data. It reveals insights and patterns that are not immediately visible in the raw
data. It is an art through which information, numbers, and measurements can be
made more understandable. According to (Friedman 2008):
The primary goal of data visualization is to communicate information clearly and
effectively through graphical means. It does not mean that data visualization
needs to look boring to be functional or extremely sophisticated to look
beautiful. To convey ideas effectively, both aesthetic form and functionality need
to go hand in hand, providing insights into a rather sparse and complex data set
by communicating its key-aspects more intuitively.
“Data is the new oil” may be a cliche, but it is true. Like oil, data in its raw,
unrefined form is pretty worthless. To unlock its value, data needs to be refined,
analyzed and understood (Disney 2017). More and more organizations are seeing
potential in their data connections, but how do you allow non-experts to analyze
data at scale and extract potentially complex insights? One answer is through
interactive graph visualization.
Information visualization is the art of representing data so that it is easy to
understand and manipulate, thus making the information useful. Visualization
can make sense of information by helping to find relationships in the data and
supporting (or disproving) ideas about the data.
Why data visualization is such a powerful tool:

 Intuitive: Presenting a graph as a node-link structure instantly makes
sense, even to people who have never worked with graphs before.
 Fast: Our brains are great at identifying patterns, but only when data is
presented in a tangible format. Armed with visualization, we can spot trends
and outliers very effectively.
 Flexible: The world is densely connected, so as long as there is an
interesting relationship in your data somewhere, you will find value in graph
visualization.
 Insightful: Exploring graph data interactively allows users to gain more
in-depth knowledge, understand the context and ask more questions, compared to
static visualization or raw data.

Key Figures in the History of Data Visualization


The idea of visualizing data is old: After all, that’s what a map is—a
representation of geographic information—and we’ve had maps for about 8,000
years. But it was rare to graph anything other than geography. (n.d.)
(https://www.smithsonianmag.com/history/surprising-history-infographic-
180959563/) The history of data visualization is full of incredible stories marked
by significant events, led by a few key players. The
article (Infogram 2016) introduces some of the fantastic men and women who
paved the way by combining art, science, and statistics. One of them is Charles
Joseph Minard, whose most famous work, the map of Napoleon's Russian
campaign of 1812, can be seen as an early data product for data visualization.
Below are some of these visualization pioneers, with their famous works and other
stories from the article (Infogram 2016).

1.3.1.1 William Playfair (1759–1823)


William Playfair is considered the father of statistical graphics, having invented
the line and bar charts that are used so often today. He is also credited with having
created the area and pie chart. Playfair was a Scottish engineer and political
economist who published “The Commercial and Political Atlas” in 1786. This
book featured a variety of graphs including the image below. In this famous
example, he compares exports from England with imports into England from
Denmark and Norway from 1700 to 1780.
1.3.1.2 John Snow (1813–1858)
In 1854, a cholera epidemic spread quickly through Soho in London. The Broad
Street area had seen over 600 dead, and the surviving residents and business
owners had primarily fled the terrible disease. Physician John Snow plotted the
locations of cholera deaths on a map. The surviving maps of his work show a
method of tallying the death counts, drawn as lines parallel to the street, at the
appropriate addresses. Snow’s research revealed a pattern. He saw an evident
concentration around the water pump on Broad Street, which helped find the
cause of the infection.

1.3.1.3 Charles Joseph Minard (1781–1870)


Charles Joseph Minard was a French civil engineer famous for his representation
of numerical data on maps. His most famous work is the map of Napoleon’s
Russian campaign of 1812 illustrating the dramatic loss of his army over the
advance on Moscow and the following retreat. This classic lithograph dates back
to 1869, displaying the number of men in Napoleon’s 1812 Russian army, their
movements, and the temperatures they encountered along their way. It has been
called one of the “best statistical drawings ever created.” The work is an
essential reminder that the fundamentals of data visualization lie in a nuanced
understanding of the many dimensions of data. Tools like D3.js and HTML are
of little use without a firm grasp of your dataset and sharp communication skills. The
work also represents one of the earliest examples of data journalism.

1.3.1.4 Florence Nightingale (1820–1910)


Florence Nightingale is famous for her work as a nurse during the Crimean War,
but she was also a data journalist. She realized soldiers were dying from poor
sanitation and malnutrition, so she kept meticulous records of the death tolls in
the hospitals and visualized the data. Her coxcomb or rose diagrams helped her
fight for better hospital conditions and ultimately save lives.

1.3.1.5 Edmond Halley (1656–1742)


Edmond Halley was an English astronomer, geophysicist, mathematician,
meteorologist, and physicist who is best known for computing the orbit of
Halley’s Comet. According to the BBC, Halley developed the use of contour
lines on maps to connect and describe areas that display differences in
atmospheric conditions from place to place. These lines are now commonly used
to describe the meteorological variation familiar to us from weather reports.
1.3.1.6 Charles de Fourcroy (1766–1824)
Charles de Fourcroy was a French mathematician and scholar. He produced a
visual analysis of the work of French civil engineers and a comparison of the
demographics of European cities. In 1782 he published Tableau Poléometrique, a
treatise on engineering and civil construction. His use of geometric shapes
predates the modern treemap, which is widely used today to display hierarchical
data.

1.3.1.7 Luigi Perozzo (1856–1916)


Luigi Perozzo was an Italian mathematician and statistician who stood out for
being the first to introduce 3D graphical representations, showing the
relationships between three variables on the same graph. Perozzo published one
of the first 3D representations of data showing the age group of the Swedish
population between the 18th and 19th centuries.

Data visualization tools


1. Tableau
2. Microsoft Power BI
3. Sisense
4. Zoho Reports
5. Jupyter
6. Google Charts
7. Infogram
8. Plotly
9. QlikView
10. Klipfolio
11. Visme
12. Adaptive Discovery
13. Watson Analytics
14. Domo
15. Highcharts

Let's see the significance of the individual data visualization tools in detail.

1. Tableau: Tableau is often regarded as a powerful business intelligence tool. It
allows us to handle extensive and huge data sets used in fields like artificial
intelligence, machine learning, and business intelligence, and it has a customer base
across many IT organizations due to its simplicity in solving data problems. Tableau
helps in importing data of all sizes and managing metadata. The data is pulled from
various sources on different platforms and connected via Tableau Desktop, and then
published to Tableau Server. Tableau Reader helps the user read and view the files.
Tableau has a huge number of data connectors and offers a large community of users.

2. Microsoft Power BI: Power BI has the ability to create personalized
dashboards with an easy-to-use platform. It supports and imports data from various
sources and embeds charts, maps, and tables to make better visualizations. Power
BI can use the R language for better visualizations and offers cloud capabilities to
complement the desktop. Cloud storage is limited to 10 GB. Power BI
helps in publishing data online for collaborators.

3. Sisense: It is a licensed business intelligence software that makes
data analysis very easy as a self-service. Sisense makes data very interactive and
connects to different data sources, which are put into a repository for easy
access from the dashboards. This real-time data visualization provides ready
insights for a particular organization. It has a simple, user-friendly drag-and-drop
interface with good interactive graphics, charts, and visualizations. Sisense is often chosen
for its easy setup and good data-exporting range.

4. Zoho Reports: It is a BI analytics tool founded in 1996 with a free plan for two
users. It allows users to create ad hoc reports and provides reporting options like sharing,
mail scheduling, summary views, and pivot tables on a highly secure platform. This
tool is of primary importance to app developers and ISVs. It offers intuitive
visualizations for great insights. Zoho Reports (or Zoho Analytics) is an online
reporting service with various capabilities like data blending and merging, real-time
collaboration, and highly secure SSL connections. Its features include good financial
reports, scheduled reports, and streaming analytics.

5. Jupyter: Jupyter is an open-source web application widely used to share
source code and documents that can be executed individually. It supports interactive
3D visualization and is known to be a handy tool with supportive GUI toolkits. Jupyter is
powerful, easy to use, and shareable, and most data science projects perform their
visualization in this IDE since everyone can work in the same environment. It is rated
highly when compared with other standard tools. Jupyter offers flexible publishing for
users in PDF, dashboards (Plotly's), and HTML.

6. Google Charts: It is an open-source service that allows data visualization on websites. It
is free of cost and can be used for commercial and educational purposes, with a
rich gallery of features. It is a web service with a variety of chart types like pie
charts, bubble charts, line, area, and scatter charts. All these charts can take either static
data or data from databases. It is essentially a JavaScript library with
API packages. The API embeds a graph onto an SVG canvas, which gives high-definition
clarity. Google Charts are used for business needs, financial
reports, and statistical website reports. The typical workflow is an HTML file with
embedded JavaScript for the charts; the library executes an AJAX request and renders
the result into SVG, and finally the HTML file places the canvas on the web page.

7. Infogram: Infogram is developed for business purposes, with free and paid plans,
and is known as one of the best infographic tools for handling complex data. This tool builds
slick reports, infographics, and one-pagers from ready-made template designs.
Infogram doesn't require any code to work with, and its operating structure allows a user
to save a lot of time. Infogram visually represents data with a variety of options and
formats and forms a very reliable platform; the working interface is dynamic. Its
visualization content grabs viewers, and it operates with the intention of
making and discovering new facts and outliers.

8. Plotly: It is an interactive and open-source visualization tool that needs only a few lines
of code; specifically, it is a Python charting library. Plotly is colorful,
with open-source scripts that are easy to modify as needed. Its figures are composed of
data components and layout components. Plotly products provide high-level API
wrappers that save time. Plotly technically provides online graphing and statistical tools
with an easy-to-use interactive Python library. Plotly is built on plotly.js, a JavaScript
library. All the graphs and charts, with stunning visualizations, are completely
interactive for presentations. The Python library makes use of declarative
programming with a complete framework for implementation.
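
As a concrete illustration of the "few lines of code" point above, here is a minimal Plotly
Express sketch. It assumes the plotly package is installed and uses the Gapminder sample
data that ships with Plotly; the column names (gdpPercap, lifeExp, pop, continent, country)
come from that sample, not from the text.

import plotly.express as px

df = px.data.gapminder().query("year == 2007")  # sample dataset bundled with Plotly

# Declarative call: describe what to plot, Plotly builds the interactive figure.
fig = px.scatter(
    df,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    hover_name="country",
    log_x=True,
    title="GDP per capita vs. life expectancy (2007)",
)
fig.show()  # renders an interactive chart in the browser or notebook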

9. QlikView: It is a powerful business intelligence discovery platform created for
analytics applications. It is a paid, query-based tool with an in-memory engine.
It does not require any professional development skills to build an analytical
application. Its benefits include flexible charts, better control over
transformations, and a very short implementation period. QlikView can take up
data from multiple sources to deliver deeper insights for business challenges. QlikView
helps users make the right decisions and is easy to access. It
analyzes the data and separates useful associations from unwanted data (with filtering done
as the user wishes). It is a data discovery product that lets the user refine the search and
build their own application to fit their needs.
10. Klipfolio: This tool is rich in dashboard features and helpful in real-life
business. It is customizable for businesses of all sizes and can manipulate
complex data using data engines. It connects to multiple sources, and switching
between them is very simple. Klipfolio is flexible enough to embed third-party
visualizations into its own dashboards. Klipfolio connects to Google Analytics,
Twitter, and data warehouses.

11. Visme: It is a free-to-use design tool for creating infographics as part of
visual storytelling. Its most attractive benefit is in creating
presentations and other content. Visme content can be published and shared
everywhere, and it has thousands of built-in templates and graphics. Visme also
integrates data from Microsoft applications.

12. Adaptive Discovery: The Adaptive Insights tool allows identifying issues
spotted through interactive drill-down. It is specifically designed for proper
decision-making in business. It was created to analyze a company's financial data
for performance management and the planning process. These data visualization
tools have the power of self-service dashboards (charts and bars are easy to navigate,
making it easy to monitor variances). They have the capability to do calculations and
display them in context; most importantly, no code is required. The tool manages
financial needs and makes collaboration and sharing easy.

13. Highcharts: This tool is very helpful for interactive visualizations on web
pages. Its free version serves non-commercial users. The Highcharts software offers
a variety of chart types and can even combine multiple charts into a single one.

14. Watson Analytics: IBM released this analytics tool for statistical procedures,
planned for non-commercial use. It visualizes unstructured content by
discovering new patterns, automates prediction and insights on the available data,
and uses natural language for communicating with the data. Its features include
faster pattern detection and data presentation with ready-made
templates, all with one-click access. With customer satisfaction algorithms, we
can gather reviews and feedback across social media.

15. Domo: This cloud-based dashboard tool enables us to identify insights and
share business stories in real time. Communication is done via
notifications and messaging to update or alter the data sets. More importantly, it
has data connectors and 350 data streams.
Types of data visualisation
Data visualization is broadly classified into six different types. Though the
area of data visualization is ever-growing, it won't be a surprise if the number
of categories increases.

Temporal: Data for these types of visualization should satisfy two conditions: the data
represented should be linear and it should be one-dimensional. These types of visualization
are represented through lines that might overlap and that have a common start and finish
data point.

 Scatter plots – Use dots to represent data points. Most common today in machine
learning during exploratory data analysis.
 Pie chart – A circular graphic where the arc length signifies the magnitude.
 Polar area diagram – Like a pie chart, the polar area diagram is a circular plot, except
that the sector angles are equal and the distance each sector extends from the center
signifies the magnitude.
 Line graphs – Like the scatter plot, the data is represented by points, except that the
points are joined by lines to maintain continuity.
 Timelines – Display a list of data points in chronological order.
 Time series sequences – Represent the magnitude of the data in a 2-D graph in
chronological order of the timestamps in the data (a small sketch follows this list).
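
Below is an illustrative sketch of the line graph / time series idea, assuming pandas and
matplotlib are available; the daily "visits" numbers are made up for the example.

import pandas as pd
import matplotlib.pyplot as plt

# Synthetic daily data: 30 days of made-up visit counts.
dates = pd.date_range("2021-01-01", periods=30, freq="D")
visits = pd.Series(range(100, 130), index=dates)

fig, ax = plt.subplots()
ax.plot(visits.index, visits.values, marker="o")  # points joined by lines for continuity
ax.set_xlabel("Date")
ax.set_ylabel("Daily visits")
ax.set_title("Time series rendered as a line graph")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
plt.show()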

Hierarchical: These types of visualizations portray ordered groups within a larger group. In
simple terms, the main intuition behind these visualizations is that clusters can be displayed
if the flow of the clusters starts from a single point.

 Tree diagram – The hierarchical flow is represented in the form of a tree, as the name
suggests. A few terms for this representation are: root node (origination point), child
node (has a parent above) and leaf node (no further child nodes).
 Ring chart / Sunburst diagram – The tree representation of the tree diagram converted
onto a radial basis. This type helps present the tree in a concise size. The innermost
circle is the root node, and the area of each child node signifies its percentage of the
data (a small sketch of this size encoding follows the list).
 Treemap – The tree is represented in the form of closely packed rectangles. The area
signifies the quantity contained.
 Circle packing – Similar to a treemap, but it uses circles instead of rectangles.
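
A minimal sketch of the size encoding used by sunburst charts and treemaps follows: each
node's area is its share of the root total. The two-level category tree below is invented
purely for illustration.

# Invented two-level hierarchy: parent categories with leaf values.
sales_tree = {
    "Electronics": {"Phones": 400, "Laptops": 250},
    "Clothing": {"Men": 150, "Women": 200},
}

total = sum(v for children in sales_tree.values() for v in children.values())

for parent, children in sales_tree.items():
    parent_total = sum(children.values())
    print(f"{parent}: {parent_total / total:.0%} of the root total")
    for child, value in children.items():
        # This percentage is what the child's area encodes in a treemap or sunburst.
        print(f"  {child}: {value / total:.0%} of the root total")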

Network: These types of visualization connect datasets to datasets and portray how those
datasets relate to one another within a network.

 Matrix charts – Widely used to find the connections between different variables, for
example a correlation plot.
 Alluvial diagrams – A type of flow diagram in which the changes in the flow of the
network are represented over intervals chosen by the user.
 Word cloud – Typically used for representing text data. The words are closely packed,
and the size of the text signifies the frequency of the word.
 Node-link diagrams – The nodes are represented as dots and the connections between
nodes are drawn as links (a small sketch follows this list).
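
Below is a small node-link diagram sketch, assuming the networkx and matplotlib packages
are installed; the relationships between the named people are invented for illustration.

import networkx as nx
import matplotlib.pyplot as plt

# Build a tiny graph: nodes become dots, edges become the links between them.
G = nx.Graph()
G.add_edges_from([
    ("Alice", "Bob"),
    ("Alice", "Carol"),
    ("Bob", "Dave"),
    ("Carol", "Dave"),
])

nx.draw(G, with_labels=True, node_color="lightblue", node_size=1200)
plt.show()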

Multidimensional: In contrast to the temporal type of visualization, these types can have
multiple dimensions. Here we can use two or more features to create a 3-D visualization
through concurrent layers. These enable the user to present key takeaways by breaking down
a lot of otherwise non-useful data.

 Scatter plots – In multidimensional data, we select any two features and plot them in a
2-D scatter plot. Doing this for every pair of features gives nC2 = n(n-1)/2 graphs
(a sketch follows this list).
 Stacked bar graphs – Represent segmented bars on top of each other. It can be either a
100% stacked bar graph, where the segregation is represented in percentages, or a
simple stacked bar graph, which denotes the actual magnitudes.
 Parallel coordinate plot – A backdrop is drawn, and n parallel lines are drawn (for
n-dimensional data).
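
The sketch below works out the nC2 pairing for a synthetic four-feature dataset, assuming
pandas, NumPy and matplotlib are available: four features give 4*3/2 = 6 pairwise scatter
plots.

from itertools import combinations

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic 4-dimensional dataset with 100 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])

pairs = list(combinations(df.columns, 2))  # nC2 = 4*3/2 = 6 feature pairs
fig, axes = plt.subplots(2, 3, figsize=(12, 7))

for ax, (x, y) in zip(axes.ravel(), pairs):
    ax.scatter(df[x], df[y], s=10)
    ax.set_xlabel(x)
    ax.set_ylabel(y)

fig.suptitle("All pairwise 2-D scatter plots of a 4-dimensional dataset")
plt.tight_layout()
plt.show()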

Geospatial: These visualizations relate data to real-life physical locations by overlaying it on
maps (geospatial or spatial maps). The intuition behind these visualizations is to create a
holistic view of performance.

 Flow map – The movement of information or objects from one location to another is
presented, where the size of the arrow signifies the amount.
 Choropleth map – The geospatial map is colored on the basis of a particular data
variable.
 Cartogram – This type of representation uses a thematic variable for mapping. These
maps distort reality to present information, meaning the map is exaggerated with
respect to a particular variable; for example, a spatial map distorted into a bee-hive
structure.
 Heat map – Very similar to the choropleth in the geospatial genre, but it can also be
used in areas apart from geospatial data.


Miscellaneous: These visualizations can't be generalized into one particularly large group, so
instead of forming smaller groups for each individual type, we group them as miscellaneous.
A few examples are below:

 Open-high-low-close (OHLC) chart – This type of graph is typically used for
representing stock prices. An increasing trend is called bullish and a decreasing one
bearish.
 Kagi chart – Typically, the demand and supply of an asset are represented using this
chart.

Introduction to Tableau
Tableau is a fast-growing and powerful data visualization tool. Tableau is a business
intelligence tool that helps us analyze raw data in a visual manner; it
may be a graph, report, etc.

Example: If you have data in Big Data platforms such as Hadoop, in SQL databases, or in
any cloud data source, and you want to analyze that data as a pictorial representation,
you can use Tableau.

Data analysis is very fast with Tableau, and the visualizations created are in the form of
worksheets and dashboards. Any professional can understand the data created using
Tableau.

Tableau software doesn't require any technical or programming skills to operate.
Tableau is easy and fast for creating visual dashboards.

Features of Tableau
o Data Blending: Data blending is the most important feature in Tableau. It is used
when we combine related data from multiple data sources that we want to
analyze together in a single view and represent in the form of a graph.

Example: Assume we have Sales data in a relational database and Sales Target data in an
Excel sheet. Now we have to compare actual sales with target sales, so we blend the data
based on common dimensions to get access to both sources. The two sources involved in data
blending are referred to as the primary and secondary data sources. A left join is created
between the primary data source and the secondary data source, with all the data rows from
the primary and the matching data rows from the secondary data source, to blend the data.
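
A hedged pandas sketch of the left join described in this example follows; the table and
column names are illustrative and are not Tableau internals.

import pandas as pd

# Primary source: actual sales (standing in for the relational database).
actual = pd.DataFrame({
    "region": ["East", "West", "North"],
    "actual_sales": [120, 95, 60],
})

# Secondary source: sales targets (standing in for the Excel sheet).
target = pd.DataFrame({
    "region": ["East", "West"],
    "target_sales": [100, 110],
})

# Left join: keep every row from the primary source and only the matching
# rows from the secondary source, which mirrors how blending behaves.
blended = actual.merge(target, on="region", how="left")
print(blended)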

o Real-time analysis: Real-time analysis enables users to quickly understand
and analyze dynamic data. When the velocity is high, real-time analysis of data
is complicated; Tableau can help extract valuable information from fast-moving data
with interactive analytics.
o The Collaboration of data: Data analysis is not an isolating task. That's why Tableau
is built for collaboration. Team members can share data, make follow up queries,
and forward easy-to-digest visualizations to others who could gain value from the
data. Making sure everyone understands the data and can make informed decisions
is critical to success.

Tools of Tableau
A list of Tableau tools:

o Tableau Desktop
o Tableau Public
o Tableau Online
o Tableau Server
o Tableau Reader

Data analytics tools in Tableau are classified into two parts:

1. Developer Tools: The Tableau tools which are used for development, such as the
creation of charts, dashboards, report generation and visualization, are known as
developer tools. Tableau Desktop and Tableau Public are examples of this
type.
2. Sharing Tools: The role of these tools is to share the reports, visualizations, and
dashboards that were created using the developer tools. The Tableau tools that fall
into this category are Tableau Server, Tableau Online, and Tableau Reader.

1. Tableau Desktop
Tableau Desktop has a rich feature set and allows us to code and customize reports. Right
from creating the reports and charts to blending them all to form a dashboard, all the
necessary work is done in Tableau Desktop.

For live data analysis, Tableau Desktop establishes connectivity between the data warehouse
and various other types of files. The dashboards and the workbooks created here can be
either shared locally or publicly.

Based on the connectivity to the data sources and the publishing options, Tableau Desktop is
also classified into two parts:

o Tableau Desktop Personal: The personal version of Tableau Desktop keeps
the workbook private, and access is limited. The workbooks can't be published
online, so they should be distributed either offline or on Tableau Public.
o Tableau Desktop Professional: It is similar to Tableau Desktop Personal. The main
difference is that the workbooks created in this version can be published
online or on Tableau Server. The professional version has full access to all
sorts of data types. It is best for those who want to publish their workbooks on Tableau
Server.

2. Tableau Public
This Tableau version is specially built for cost-conscious users. The word 'Public' means that
the created workbooks cannot be saved locally. They should be kept on Tableau's public
cloud, where they can be accessed and viewed by anyone.

There is no privacy for the files saved on the cloud, so anyone can access and download the
same data. This version is best for those who want to share their data with the general
public and for individuals who want to learn Tableau.

3. Tableau Online
Its functionality is similar to Tableau Server, but the data is stored on servers
hosted in the cloud, which are maintained by the Tableau group.

There is no storage limit on the data published to Tableau Online. Tableau
Online creates a direct link to over 40 data sources hosted in the cloud, such as
Hive, MySQL, Spark SQL, Amazon Aurora, and many more.
To be published, both Tableau Server and Tableau Online require workbooks that are
created in Tableau Desktop. Data that flows from web applications is also supported:
both Tableau Server and Tableau Online support Google Analytics and Salesforce.com.

4. Tableau Server
The software is used to share the workbooks and visualizations created in
the Tableau Desktop application across the organization. To share dashboards on Tableau
Server, you should first publish your workbook from Tableau Desktop. Once the workbook
has been uploaded to the server, it will be accessible only to authorized users.

It is not necessary for the authorized users to have Tableau Server installed on their
machines. They only require the login credentials with which they can check reports in a
web browser. Security is very high on Tableau Server, and it is beneficial for quick and
effective sharing of data.

The admin of the organization has full control over the server. The organization maintains
the hardware and the software.

5. Tableau Reader
Tableau Reader is a free tool which allows us to view the visualizations and workbooks
created using Tableau Desktop or Tableau Public. The data can be filtered, but
modifications and editing are restricted. There is no security in Tableau Reader, as anyone
can view a workbook using it.

Tableau Architecture
Tableau Server is designed to connect many data tiers. It can connect clients from Mobile,
Web, and Desktop. Tableau Desktop is a powerful data visualization tool. It is very secure
and highly available.

It can run on both physical and virtual machines. It is a multi-process,
multi-user, and multi-threaded system.

Providing such powerful features requires unique architecture.


Let's study about the different component of the Tableau architecture:

1. Data server: The primary component of the Tableau architecture is the data sources that
can connect to it.

Tableau can connect with multiple data sources. It can blend the data from various data
sources. It can connect to an Excel file, a database, and a web application at the same
time. It can also create relationships between different types of data sources.

2. Data connector: The data connectors provide an interface to connect external data
sources with the Tableau data server.

Tableau has an in-built SQL/ODBC connector. This ODBC connector can connect to any
database without using its native connector. Tableau Desktop has an option to select
both extract and live data. Depending on the use case, one can easily switch between live
and extracted data.

o Real-time data or live connection: Tableau can be connected to live data by
linking to the external database directly. It uses the existing database infrastructure
by sending dynamic Multidimensional Expressions (MDX) and SQL statements.
This feature links the live data to Tableau rather than
importing the data, and it makes for an optimized and fast database system. In most
enterprises, the size of the database is large, and it is updated periodically. In these
cases, Tableau works as a front-end visualization tool by connecting to the live
data.
o Extracted or in-memory data: Tableau has an option to extract the data from
external data sources. We make a local copy in the form of a Tableau extract file. It
can pull millions of records into the Tableau data engine with a single click.
Tableau's data engine uses storage such as ROM, RAM, and cache memory to
process and store data. Using filters, Tableau can extract a few records from a large
dataset. This improves performance, especially when we are working on massive
datasets. Extracted data allows the users to visualize the data offline, without
connecting to the data source.
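
The sketch below is a conceptual illustration of the live-versus-extract distinction (it is not
Tableau's actual engine), using sqlite3 and pandas as stand-ins; the sales.db database and
its sales table are hypothetical.

import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")  # hypothetical operational database

# Live connection: every question is pushed down to the source as a query,
# so the answer always reflects the current state of the database.
live_summary = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", conn
)

# Extract: pull the records once into a local snapshot and work offline;
# faster and independent of the source, but only as fresh as the last refresh.
extract = pd.read_sql_query("SELECT * FROM sales", conn)
extract.to_csv("sales_extract.csv", index=False)

offline = pd.read_csv("sales_extract.csv")
print(offline.groupby("region")["amount"].sum())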

3. Components of Tableau server: The different components of the Tableau server
are:

o Application server
o VizQL server
o Data server

A. Application server: The application server is used to provide authorization and
authentication. It handles the permissions and administration for mobile and web interfaces.
It guarantees security by recording each session ID on Tableau Server. The
administrator can configure the default session timeout on the server.

B. VizQL server: The VizQL server is used to convert the queries from the data source into
visualizations. Once the client request is forwarded to the VizQL process, it sends the query
directly to the data source and retrieves the information in the form of images. The
visualization or image is then presented to the user. Tableau Server creates a cache of
visualizations to reduce the load time. The cache can be shared among the many users who
have permission to view the visualization.

C. Data server: Data server is used to store and manage the data from external data
sources. It is a central data management system. It provides data security, metadata
management, data connection, driver requirements, and data storage. It stores the
related details of the data set, such as calculated fields, metadata, groups, sets,
and parameters. The data server can extract data as well as make live connections
with external data sources.

4. Gateway: The gateway directs the requests from users to the Tableau components. When
the client sends a request, it is forwarded to the external load balancer for processing. The
gateway works as a distributor of processes to the different components. In the absence of
an external load balancer, the gateway also works as a load balancer. For a single-server
configuration, one gateway or primary server manages all the processes. For multiple-server
configurations, one physical system works as the primary server, and the others are used as
worker servers. Only one machine is used as the primary server in a Tableau Server
environment.

5. Clients: The visualizations and dashboards on Tableau Server can be edited and viewed
using different clients. The clients are web browsers, mobile applications, and Tableau
Desktop.

o Web Browser: Web browsers like Google Chrome, Safari, and Firefox support
Tableau Server. The visualizations and contents in the dashboard can be edited by
using these web browsers.
o Mobile Application: The dashboards from the server can be interactively visualized
using the mobile application and browser. They can be used to edit and view the contents
of the workbook.
o Tableau Desktop: Tableau Desktop is a business analytics tool. It is used to view,
create, and publish dashboards on Tableau Server. Users can access various
data sources and build visualizations in Tableau Desktop.

Advantages of Tableau
o Data Visualization: Tableau is a data visualization tool, and provides complex
computation, data blending, and dashboarding for creating beautiful data
visualizations.
o Quickly Create Interactive Visualizations: Users can create very interactive
visuals by using the drag-and-drop functionality of Tableau.
o Comfortable Implementation: Many types of visualization options are
available in Tableau, which enhances the user experience. Tableau is very easy to
learn in comparison to Python. Those who don't have any idea about coding can also
quickly learn Tableau.
o Tableau can Handle Large Amounts of Data: Tableau can easily handle millions
of rows of data. A large amount of data can feed different types of visualizations
without disturbing the performance of the dashboards. As well, there is an option
in Tableau to make a 'live' connection to different data sources like SQL,
etc.
o Use of other Scripting Languages in Tableau: To avoid performance issues
and to do complex table calculations in Tableau, users can include Python or R.
Using a Python script, the user can reduce the load on the software by performing data
cleansing tasks with packages. However, Python is not a native scripting language
accepted by Tableau, so you can only import some of the packages or visuals.
o Mobile Support and Responsive Dashboards: Tableau Dashboard has an
excellent reporting feature that allows you to customize dashboards specifically for
devices like a mobile or laptop. Tableau automatically detects which device the user is
viewing the report on and makes adjustments to ensure that the right report
is delivered to the right device.

Disadvantages of Tableau
o Scheduling of Reports: Tableau does not provide automatic scheduling of
reports. That's why there is always some manual effort required when the user
needs to update the data in the back end.
o No Custom Visual Imports: In other tools like Power BI, a developer can create
custom visuals that can be easily imported, but in Tableau any new visual has to be
recreated rather than imported, as Tableau is not a completely open tool.
o Custom Formatting in Tableau: Tableau's conditional formatting and its limit of 16
columns per table are very inconvenient for users. Also, to apply the same format
to multiple fields, there is no way for users to do it for all fields
directly. Users have to do it manually for each field, which is very time-consuming.
o Static and Single Value Parameters: Tableau parameters are static, and each always
selects a single value. Whenever the data changes, these
parameters also have to be updated manually every time. There is no option
for users to automate the updating of parameters.
o Screen Resolution of Tableau Dashboards: The layout of the dashboards is
disturbed if the Tableau developer's screen resolution is different from the user's screen
resolution.
Example: If the dashboard is created at a screen resolution of 1920 x 1080 and
it is viewed at 2560 x 1440, then the layout of the dashboard will be broken a little
bit, because dashboards are not responsive. So you will need to create separate dashboards
for desktop and mobile.

Data warehousing
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to
Inmon, a data warehouse is a subject oriented, integrated, time-variant, and non-
volatile collection of data. This data helps analysts to take informed decisions in an
organization.
An operational database undergoes frequent changes on a daily basis on account of
the transactions that take place. Suppose a business executive wants to analyze
previous feedback on any data such as a product, a supplier, or any consumer data,
then the executive will have no data available to analyze because the previous data
has been updated due to transactions.
A data warehouse provides us with generalized and consolidated data in a multidimensional
view. Along with this generalized and consolidated view of data, a data warehouse also
provides us with Online Analytical Processing (OLAP) tools. These tools help us in the
interactive and effective analysis of data in a multidimensional space. This analysis
results in data generalization and data mining.
Data mining functions such as association, clustering, classification, and prediction can be
integrated with OLAP operations to enhance the interactive mining of knowledge at
multiple levels of abstraction. That's why the data warehouse has now become an important
platform for data analysis and online analytical processing.

Data Warehouse Features


The key features of a data warehouse are discussed below −
 Subject Oriented − A data warehouse is subject oriented because it provides information
around a subject rather than the organization's ongoing operations. These subjects can be
product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the
ongoing operations, rather it focuses on modelling and analysis of data for decision making.
 Integrated − A data warehouse is constructed by integrating data from heterogeneous
sources such as relational databases, flat files, etc. This integration enhances the effective
analysis of data.
 Time Variant − The data collected in a data warehouse is identified with a particular time
period. The data in a data warehouse provides information from the historical point of view.
 Non-volatile − Non-volatile means the previous data is not erased when new data is added
to it. A data warehouse is kept separate from the operational database, and therefore
frequent changes in the operational database are not reflected in the data warehouse.

Types of Data Warehouse


Information processing, analytical processing, and data mining are the three types of
data warehouse applications that are discussed below −
 Information Processing − A data warehouse allows us to process the data stored in it. The
data can be processed by means of querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs.
 Analytical Processing − A data warehouse supports analytical processing of the
information stored in it. The data can be analyzed by means of basic OLAP operations,
including slice-and-dice, drill down, drill up, and pivoting.
 Data Mining − Data mining supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, performing classification and prediction. These
mining results can be presented using the visualization tools.

Functions of Data Warehouse Tools and Utilities


The following are the functions of data warehouse tools and utilities −
 Data Extraction − Involves gathering data from multiple heterogeneous sources.
 Data Cleaning − Involves finding and correcting the errors in data.
 Data Transformation − Involves converting the data from legacy format to warehouse
format.
 Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and
building indices and partitions.
 Refreshing − Involves updating from data sources to warehouse.
Data Warehouse Architectures
There are mainly three types of data warehouse architectures:

Single-tier architecture

The objective of a single layer is to minimize the amount of data stored; the
goal is to remove data redundancy. This architecture is not frequently used in
practice.

Two-tier architecture

Two-tier architecture physically separates the available sources from the data
warehouse. This architecture is not expandable and does not support a
large number of end-users. It also has connectivity problems because of
network limitations.

Three-tier architecture

This is the most widely used architecture.

It consists of the Top, Middle and Bottom Tier.

1. Bottom Tier: The database of the data warehouse serves as the
bottom tier. It is usually a relational database system. Data is cleansed,
transformed, and loaded into this layer using back-end tools.
2. Middle Tier: The middle tier in Data warehouse is an OLAP server
which is implemented using either ROLAP or MOLAP model. For a
user, this application tier presents an abstracted view of the database.
This layer also acts as a mediator between the end-user and the
database.
3. Top Tier: The top tier is a front-end client layer. It holds the tools and
APIs that you connect to in order to get data out of the data warehouse. These
could be query tools, reporting tools, managed query tools, analysis tools and
data mining tools.

Datawarehouse Components
The data warehouse is based on an RDBMS server which is a central
information repository that is surrounded by some key components to make
the entire environment functional, manageable and accessible

There are mainly five components of Data Warehouse:

A. Data Warehouse Database


The central database is the foundation of the data warehousing environment.
This database is implemented with RDBMS technology. However, this kind
of implementation is constrained by the fact that a traditional RDBMS is
optimized for transactional database processing and not for data
warehousing. For instance, ad hoc queries, multi-table joins, and aggregates are
resource intensive and slow down performance.

Hence, alternative approaches to the database are used, as listed below:

 In a data warehouse, relational databases are deployed in parallel to
allow for scalability. Parallel relational databases also allow shared-memory
or shared-nothing models on various multiprocessor
configurations or massively parallel processors.
 New index structures are used to bypass relational table scans and
improve speed.
 Multidimensional databases (MDDBs) are used to overcome limitations
imposed by the relational data model. Example:
Essbase from Oracle.
B. Sourcing, Acquisition, Clean-up and Transformation
Tools (ETL)
The data sourcing, transformation, and migration tools are used for performing
all the conversions, summarizations, and all the changes needed to transform
data into a unified format in the data warehouse. They are also called Extract,
Transform and Load (ETL) tools.

Their functionality includes:

 Anonymizing data as per regulatory stipulations.
 Eliminating unwanted data in operational databases from being loaded into the
data warehouse.
 Searching for and replacing common names and definitions for data arriving
from different sources.
 Calculating summaries and derived data.
 Populating missing data with defaults.
 De-duplicating repeated data arriving from multiple data sources.

These Extract, Transform, and Load tools may generate cron jobs,
background jobs, COBOL programs, shell scripts, etc. that regularly update data
in the data warehouse. These tools are also helpful for maintaining the metadata.

These ETL Tools have to deal with challenges of Database & Data
heterogeneity.
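
A minimal ETL sketch along the lines described above is shown below; the source file,
column names and rules are hypothetical, and pandas with the standard library stand in for
a real ETL tool.

import sqlite3
import pandas as pd

# Extract: gather data from a source (a CSV file as a stand-in for
# heterogeneous operational systems).
raw = pd.read_csv("operational_orders.csv")

# Transform: de-duplicate, fill defaults and calculate derived data.
raw = raw.drop_duplicates(subset="order_id")        # de-duplicate repeated rows
raw["country"] = raw["country"].fillna("UNKNOWN")   # populate missing data with defaults
raw["total"] = raw["quantity"] * raw["unit_price"]  # derived summary column

# Load: write the unified format into the warehouse (SQLite as a stand-in).
warehouse = sqlite3.connect("warehouse.db")
raw.to_sql("fact_orders", warehouse, if_exists="append", index=False)
warehouse.close()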

C. Metadata
The name metadata suggests some high-level technological concept.
However, it is quite simple. Metadata is data about data, and it defines the
data warehouse. It is used for building, maintaining and managing the data
warehouse.

In the Data Warehouse Architecture, meta-data plays an important role as it


specifies the source, usage, values, and features of data warehouse data. It
also defines how data can be changed and processed. It is closely connected
to the data warehouse.

For example, a line in a sales database may contain:

4030 KJ732 299.90


This is meaningless data until we consult the metadata, which tells us it was:

 Model number: 4030


 Sales Agent ID: KJ732
 Total sales amount of $299.90

Therefore, Meta Data are essential ingredients in the transformation of data


into knowledge.
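
A small sketch of how metadata turns the raw record above into meaning follows; the field
layout is taken from the example, while the code itself is purely illustrative.

record = "4030 KJ732 299.90"

# Metadata: data about the data, i.e. what each field in the record represents.
field_metadata = [
    ("model_number", str),
    ("sales_agent_id", str),
    ("total_sales_usd", float),
]

parsed = {
    name: cast(value)
    for (name, cast), value in zip(field_metadata, record.split())
}
print(parsed)  # {'model_number': '4030', 'sales_agent_id': 'KJ732', 'total_sales_usd': 299.9}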

Metadata helps to answer the following questions

 What tables, attributes, and keys does the Data Warehouse contain?
 Where did the data come from?
 How many times do data get reloaded?
 What transformations were applied with cleansing?

Metadata can be classified into following categories:

1. Technical Metadata: This kind of metadata contains information about the
warehouse which is used by data warehouse designers and
administrators.
2. Business Metadata: This kind of metadata contains detail that gives
end-users an easy way to understand the information stored in the data
warehouse.

D. Query Tools
One of the primary objectives of data warehousing is to provide information to
businesses to make strategic decisions. Query tools allow users to interact
with the data warehouse system.

These tools fall into four different categories:

1. Query and reporting tools


2. Application Development tools
3. Data mining tools
4. OLAP tools

1. Query and reporting tools:


Query and reporting tools can be further divided into
 Reporting tools
 Managed query tools

Reporting tools: Reporting tools can be further divided into production
reporting tools and desktop report writers.

1. Report writers: These reporting tools are designed for end-users
for their own analysis.
2. Production reporting: These tools allow organizations to generate
regular operational reports. They also support high-volume batch jobs like
printing and calculating. Some popular reporting tools are Brio,
Business Objects, Oracle, PowerSoft, and SAS Institute.

Managed query tools:

This kind of access tool helps end users resolve snags with SQL and the
database structure by inserting a meta-layer between the users and the
database.

2. Application development tools:


Sometimes built-in graphical and analytical tools do not satisfy the analytical
needs of an organization. In such cases, custom reports are developed using
Application development tools.

3. Data mining tools:


Data mining is a process of discovering meaningful new correlations, patterns,
and trends by mining large amounts of data. Data mining tools are used to make
this process automatic.

4. OLAP tools:
These tools are based on the concept of a multidimensional database. They allow
users to analyse the data using elaborate and complex multidimensional
views.
E. Data warehouse Bus Architecture
Data warehouse Bus determines the flow of data in your warehouse. The data
flow in a data warehouse can be categorized as Inflow, Upflow, Downflow,
Outflow and Meta flow.

While designing a data bus, one needs to consider the shared dimensions and
facts across data marts.

Different Layers of Data Warehouse Architecture

There are four different types of layers which will always be present in Data

Warehouse Architecture.

1. Data Source Layer

 The Data Source Layer is the layer where the data from the source is
encountered and subsequently sent to the other layers for desired operations.
 The data can be of any type.
 The Source Data can be a database, a Spreadsheet or any other kinds of a
text file.
 The Source Data can be of any format. We cannot expect to get data with the
same format considering the sources are vastly different.
 In real life, some examples of source data can be:
 Log files of each specific application or job, or entries of employees in a
company.
 Survey Data, Stock Exchange Data, etc.
 Web Browser Data and many more.

2. Data Staging Layer


The following steps take place in Data Staging Layer.

1. Data Extraction

The data received from the source layer is fed into the staging layer, where the
first process that takes place with the acquired data is extraction.

2. Landing Database

 The extracted data is temporarily stored in a landing database.
 The landing database receives the data as soon as it is extracted from the source.

3. Staging Area

 The data in the landing database is taken, and several quality checks and
staging operations are performed on it in the staging area.
 The structure and schema are also identified, and adjustments are made to
data that is unordered, thus trying to bring about commonality among
the data that has been acquired.
 Having a place or set up for the data just before transformation and changes
is an added advantage that makes the Staging process very important.
 It makes data processing easier.

4. ETL

 ETL stands for Extraction, Transformation, and Load.


 ETL Tools are used for integration and processing of data where logic is
applied to rather raw but somewhat ordered data.
 This data is extracted as per the analytical requirements and
transformed into data that is deemed fit to be stored in the data warehouse.
 After transformation, the data, or rather the information, is finally loaded into
the data warehouse.
 Some examples of ETL tools are Informatica, SSIS, etc.

3. Data Storage Layer

 The processed data is stored in the Data Warehouse.


 This data is cleansed, transformed and prepared with a definite structure, and
thus provides opportunities to use the data as required by the
business.
 Depending upon the approach of the Architecture, the data will be stored in
Data Warehouse as well as Data Marts. Data Marts will be discussed in the
later stages.
 Some also include an Operational Data Store.

4. Data Presentation Layer

 This is the layer where the users get to interact with the data stored in the data
warehouse.
 Queries and several tools will be employed to get different types of
information based on the data.
 The information reaches the user through the graphical representation of
data.
 Reporting Tools are used to get Business Data and Business logic is also
applied to gather several kinds of information.
 Meta Data Information and System operations and performance are also
maintained and viewed in this layer.

Star schema
Star schema is the simplest dimensional model, in which the fact tables and
dimensions are arranged in an organized manner,
and it is mostly applied in business intelligence and data warehousing. A
star schema is formed by arranging each fact with its related dimensions so that the layout
resembles a star. A fact is a measurable outcome or event, such as sales details and
login counts. A dimension is a collection of reference data about the facts,
such as dates, details about the product, and customers. Star schema is
optimized for huge data queries in data warehousing, Online Analytical
Processing data cubes, and also ad hoc queries.

How to Create a Star Schema?
Here the user creates a star schema by converting an entity-relationship model.
Entity-relationship models are too complex to explain the
functional quantities and attributes, so they are simplified into a dimensional star schema as
follows:

 Find the business process in the entity-relationship view and understand how the
model can be split into several dimensional models. An entity-relationship model
consists of business data.
 Find the many-to-many tables in the entity-relationship model that describe the company
procedure and convert them into dimensional-model fact tables. Such a table
comprises the fact table and a dimension table, with
numeric values and unique key attributes.
 The idea behind this process is to differentiate the transaction-based
information tables from the rest of the information tables, so it is necessary to
design the many-to-many relationships. For example, in an ERP database,
the invoice details form the transaction table; details that are
updated and refreshed are transaction-based tables. Comparing both kinds of
table shows which data is genuinely static.
 The fact table is a representation of a dimensional model that shows the many-to-many
networks between finite measurements. As a result, the foreign
keys in the fact table share a many-to-many, countable
relationship. Most of these tables fall under the transaction-based tables.
 The last step in designing the star schema is to de-normalize the remaining tables
into dimension tables. The mandatory key is a duplicate (surrogate) key.
This key refers back to the fact table, which helps in better understanding. Find
the date and time attributes from the entity-relationship design and fill the dimension
table. Dates are saved as date and time stamps. A date dimension column can
represent the year, month, date or time.

Example: The time dimension table has TIMEID, QuarterName, QuarterNo,
MonthName, MonthNo, DayName, DayOfMonth and DayOfWeek, which can be
important attributes of dimension tables. Similarly, all tables have a unique ID and
attributes. Query languages such as SQL can be applied to data mining, the data
warehouse, and data analytics.
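
Below is a hedged SQL sketch of such a star schema, run through sqlite3 only to keep it
self-contained; the table and column names follow the time-dimension example above but
are otherwise illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (
    time_id      INTEGER PRIMARY KEY,
    day_name     TEXT,
    day_of_month INTEGER,
    month_name   TEXT,
    quarter_no   INTEGER,
    year         INTEGER
);

CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);

-- The fact table sits at the centre of the star and references each dimension
-- through a foreign key; the measures are the numeric columns.
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,
    revenue    REAL
);
""")

# A typical star-schema query: join the fact table to its dimensions and aggregate.
conn.execute("""
SELECT t.month_name, p.category, SUM(f.revenue)
FROM fact_sales f
JOIN dim_time t    ON f.time_id = t.time_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY t.month_name, p.category
""")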

Syntax of Cube Definition:

Define cube (cube-name)(dimension-list): (measure-list)

Cubes are deployed to address queries at various levels, and the response time to
answer a query is minimal. A cube is available as a pre-built design and is applicable in
the required situations. Creating a star schema is very easy, efficient to apply, and
adaptable too. Completing the fact table and the dimension tables is
mandatory, which in turn forms the star, and this can be done using SQL queries or
running code. This design is made for better understanding and easy fetching of
data.
Characteristics of Star Schema
1. Star schema provides fast aggregations and calculations, such as total items sold
and revenue gained at the end of every month. These details and processes
can be filtered according to the requirements by framing suitable queries.

2. It has the capacity to filter data from normalized data and serve data
warehousing needs. The associated information of the normalized tables is stacked
in multiple dimension tables. A unique key is generated for each fact table to identify
each row.

3. The fact table is the measurement of specific events, including finite numeric values,
and consists of foreign keys related to the dimension tables. This table is framed with
fact values at the atomic level and permits storing multiple records at a time.
There are three different types of fact table.

4. Transaction fact tables consist of data about specific events, such as holiday
events and sales events.

5. Snapshot fact tables record facts for given periods, like account information at the end
of every quarter.

6. Tables with rapid aggregation over a certain period are called accumulating
snapshot tables.

7. Dimension tables provide detailed attribute data for the records found in the fact table.
A dimension table can have varied features. Dimension tables are used mainly
as time and date dimension tables, product and purchase order dimension tables,
employee and account details dimension tables, and geography and location
dimension tables. These tables are assigned a single-integer surrogate key as the
primary key.

8. The user can design a table according to requirements. For example, he may
need a sales table with a product and customer key, a date and time key, and
a revenue key, or frame a product
dimension table with key attributes such as color, date of the purchased item,
promotion key and client key.
Advantages
 It is formed with simple logic, and queries can easily extract the data from the
transactional process.

 It has a common reporting logic which is applied dynamically.

 Star schema can offer an increase in performance for reporting applications.
 A star schema can be designed by feeding the cubes used by the online transaction
process, to build the cubes and make them work effectively.

Disadvantages
 It is highly de-normalized, so data integrity must be enforced carefully; if the user fails
to update the values, the complete process can collapse. The protection and
security are not reliable up to the limit. It is not as flexible as an analytical
model and does not extend efficient support to many-to-many relationships.

 Star schema is deployed in the database to enable faster retrieval of
data. A query is employed to select what is needed rather than searching the
whole database, and the filtered and selected data can be applied in different
cases. Hence the star schema is a simple model that is adopted easily.

Snowflake schema
Snowflake Schema is the Schema type widely used in the process of ‘Data
Warehousing Schema Architecture’ amongst other types available in today’s Data
warehousing methods. In this process, the Schema is structured into a sensible
design of tables/ data for the given Database. One can say that Snowflake Schema
is an expanded or fully normalized version of the star schema. This schema type is
picked based on the parameters gathered by the project team, which should
align with the requirements they have received for the given particular
project.

Snowflake Schema must contain a single Fact Table in the center, with single or
multiple levels of Dimension Table. All the Dimension Tables are completely
Normalized that can lead to any number of levels. Normalization is nothing but
breaking down one Dimension table into two or more Dimension tables, to make
sure minimum or no redundancy. While all the first level Dimension tables are
linked to the center Fact table, all the other Dimension tables can be linked to one
another if required. This Structure resembles a Snowflake (Fig. 01), hence the
name ‘Snowflake Schema’.

Characteristics

 This model can involve only one fact table and multiple dimension tables
which must be further normalized until there is no more room for further
normalization.
 Snowflake Schema makes it possible for the data in the Database to be more
defined, in contrast to other schemas, as normalization is the main attribute
in this schema type.
 Normalization is the key feature that distinguishes Snowflake schema from
other schema types available in the Database Management System
Architecture.
 The Fact Table will have all the facts/ measures, while the Dimension Tables
will have foreign keys to connect with the Fact Table.
 Snowflake Schema allows the Dimension Tables to be linked to other
Dimension tables, except for the Dimension Tables in the first level.
 This Multidimensional nature makes it easy to implement on complex
Relational Database systems, thus resulting in effective Analysis &
Reporting processes.
 In terms of Accessibility, Complex multiple levels of Join queries are
required to fetch aggregated data from the central fact table, using the
foreign keys to access all the required Dimension tables.
 Multiple Dimension tables, which are created as a result of normalization,
serve as lookup tables when querying with Joins.
 The process of breaking down all the Dimension tables into multiple small
Dimensions until it is completely normalized takes up a lot of storage space
compared to other schemas.
 As the querying process is complex, the pace of data retrieval is comparatively
low.

Workflow of snowflake schema


Here we will discuss the workflow of the snowflake schema by explaining how to
create one, along with its pros and cons.

How to Create a Snowflake Schema?


When the requirement is to create a schema with a fact table 'A' that has 6
dimension tables 'B, C, D, E, F, G', and each of these dimension tables has
further normalization in scope, then the snowflake schema will be the right pick
in this case.

These dimension tables 'B, C, D, E, F, G' are then disintegrated into further
dimension tables. This process continues until there is no further way
to break down the already normalized dimension tables.

Say our 'A' is 'Clothing Sales'; it could have the dimensions below as its 'B, C,
D, E, F, G', each of which has scope for further normalization –

 Employees
 Customer
 Store
 Products
 Sales
 Exchange

Now let us design a Snowflake Schema for this –


The above Dimension tables can be further broken as –

 Stores – Owned & Rented, which can be further broken into location,
country, state, region, city/ town, etc in each level, depending on the
available data and requirements.
 Sales – Limited Editions & other Branded, which can be further broken into
seasonal, nonseasonal, etc.
 Exchanges – Reasons as the second level, Exchange for ‘money-back’ &
‘different product’ as the third level of Dimensions.
 Products – ‘Product Types’ table as second-level Dimensions, and levels for
each type of product. This can be continued until the last level of
normalization.
 Customers – ‘customer types’ as Men & Women, which can be additionally
split as members, non-members, types of membership, etc.
 Employees – the type of employees as ‘Permanent’, ‘Temporary/ Part-time’
employees. The next level here can be departments, location, Salary Grade,
etc.

This can be further normalized to its final level of dimension tables, as it helps in
reducing redundancy in final data. This Schema can be used for Analysis or
Reporting when the focus is mainly on the Clothing Sales alone (fact table), and
the first level dimensions as specified above.
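
Below is a hedged SQL sketch of one branch of this Clothing Sales example in snowflake
form: the product dimension is split into a first-level dim_product and a second-level
dim_product_type. The names are illustrative, and sqlite3 is used only to keep the example
self-contained.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Second-level dimension created by normalizing type information out of the
-- first-level product dimension.
CREATE TABLE dim_product_type (
    product_type_id INTEGER PRIMARY KEY,
    type_name       TEXT            -- e.g. seasonal, non-seasonal
);

-- First-level dimension: linked to the fact table and to the deeper level.
CREATE TABLE dim_product (
    product_id      INTEGER PRIMARY KEY,
    product_name    TEXT,
    product_type_id INTEGER REFERENCES dim_product_type(product_type_id)
);

-- Central fact table 'A' (Clothing Sales) referencing its first-level dimension.
CREATE TABLE fact_clothing_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    amount     REAL
);
""")

# Fetching aggregated data requires joins across each level of the snowflake.
conn.execute("""
SELECT pt.type_name, SUM(f.amount)
FROM fact_clothing_sales f
JOIN dim_product p       ON f.product_id = p.product_id
JOIN dim_product_type pt ON p.product_type_id = pt.product_type_id
GROUP BY pt.type_name
""")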

Pros & Cons of snowflake schema


The pros and cons are mentioned below –

1. Minimum or no redundancy, as a result of normalization, which is the core
quality of the snowflake schema.
2. The snowflake schema is a complex system, as it can have any number of levels
of normalization depending on the depth of the given database.
3. Data quality will be exceptional, as normalization grants the benefit of
well-defined tables and data.
4. If any new requirement creates a need for denormalization, data quality will
be set back and redundancy may occur. This may lead to restructuring
the entire schema.
5. When queried with joins, clear and accurate data is retrieved.
6. Maintenance is difficult, as the higher-level dimensions need to be expanded
constantly.
7. High data quality and accuracy help in facilitating efficient reporting and
analysis.
8. Low performance, as it requires complex join queries.
9. Easy implementation process when provided with multipart relational
databases.
10. Large storage space is required for full normalization and the elaborate
querying process.
