Sources:
https://blog.datawrapper.de/colors/
https://www.educba.com/what-is-data-warehouse/
Data Visualization
What is Data Visualization?
Data visualization is a graphic representation that expresses the significance of
data. It reveals insights and patterns that are not immediately visible in the raw
data. It is an art through which information, numbers, and measurements can be
made more understandable. According to Friedman (2008):
The primary goal of data visualization is to communicate information clearly and
effectively through graphical means. It does not mean that data visualization
needs to look boring to be functional or extremely sophisticated to look
beautiful. To convey ideas effectively, both aesthetic form and functionality need
to go hand in hand, providing insights into a rather sparse and complex data set
by communicating its key-aspects more intuitively.
“Data is the new oil” may be a cliché, but it is true. Like oil, data in its raw,
unrefined form is of little worth. To unlock its value, data needs to be refined,
analyzed, and understood (Disney 2017). More and more organizations are seeing
potential in their data connections, but how do you allow non-experts to analyze
data at scale and extract potentially complex insights? One answer is through
interactive graph visualization.
Information visualization is the art of representing data so that it is easy to
understand and manipulate, thus making the information useful. Visualization
can make sense of information by helping to find relationships in the data and by
supporting (or disproving) ideas about the data.
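As a minimal sketch of this idea in plain Python (the sales figures are invented for illustration), mapping numbers to visual marks makes the pattern in the data visible at a glance:

```python
# Invented figures: four months of sales for a hypothetical product.
monthly_sales = {"Jan": 12, "Feb": 18, "Mar": 7, "Apr": 25}

def bar_chart(data, mark="#"):
    """Render a tiny text bar chart: one mark per unit of value."""
    lines = []
    for label, value in data.items():
        lines.append(f"{label:>3} | {mark * value} ({value})")
    return "\n".join(lines)

print(bar_chart(monthly_sales))
```

Even this crude rendering shows the March dip and April peak faster than scanning the raw numbers does, which is the whole point of the technique.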
This is why data visualization is such a powerful tool.
Let's see the significance of the individual data visualization tools in detail.
4. Zoho Reports: A BI analytics tool, founded in 1996, with a free tier for two
users. It allows users to create ad-hoc reports and provides reporting options such
as sharing, mail scheduling, summary views, and pivot tables on a highly secure
platform. This tool is of primary importance to app developers and ISVs. It offers
intuitive visualizations that yield great insights. Zoho Reports (now Zoho
Analytics) is an online reporting service with capabilities such as data blending
and merging, real-time collaboration, and highly secure SSL connections. Its
features include good financial reports, scheduled reports, and streaming analytics.
11. Visme: A tool for creating infographics, with a free-to-use design tool as part
of its visual offering. Its most attractive benefit is creating presentations for
content creation. Visme content can be published and shared anywhere, and it
ships with thousands of built-in templates and graphics. Visme also integrates its
data with Microsoft applications.
13. Highcharts: This tool is very helpful for interactive visualizations on web
pages. Its free version serves non-commercial users. The Highcharts software
offers a variety of chart types and can even combine multiple charts into a single one.
14. Watson Analytics: IBM released this analytics tool for statistical procedures,
with a free tier intended for non-commercial purposes. It visualizes unstructured
content by discovering new patterns, automates prediction and insights on the
available data, and uses natural language for communicating with the data. Its
features include faster pattern detection and data presentation with ready-made
templates accessible in one click. With customer-satisfaction algorithms, it can
gather reviews and feedback across social media.
Temporal: Data for these visualizations should satisfy two conditions: it should be
linear and one-dimensional. These visualizations are represented through lines that
may overlap and share a common start and finish point.
Scatter plot: uses dots to represent data points; the most common chart in analysis.
Pie chart: a circular graphic in which the circle is divided into slices whose sizes
are proportional to the values they represent.
Polar area diagram: like the pie chart, a circular plot, except that each sector
extends from the center at a radius proportional to its value.
Line graph: like the scatter plot, the data is represented by points, except that the
points are connected by lines in order of time.
Time series: represents the magnitude of data on a 2-D plane with time along one axis.
Hierarchical: These visualizations portray ordered groups within a larger group. In
simple language, the main intuition behind them is that clusters can be displayed
along with the flow from parent clusters to child clusters.
Ring chart / Sunburst diagram: the tree representation of a tree diagram converted
to a radial basis. This type helps present the tree in a concise size: the innermost
circle is the root node, and the area of each ring segment plays the role that the
area of the rectangles plays in a treemap.
Network: These visualizations connect datasets to datasets, portraying how they
relate to one another.
Correlation plot: displays the pairwise relationships (correlations) between variables.
Alluvial diagram: a type of flow diagram in which the changes in the flow over time
or between categories are shown to the user.
Word cloud: typically used for representing text data; the words are closely
packed, and the size of the text signifies the frequency of the word.
Node-link diagram: the nodes are represented as dots and the connections between
them as lines.
Multidimensional: In contrast to the temporal type of visualization, these can have
multiple dimensions. Here we can use two or more features, for example to create a
3-D visualization through concurrent layers. They enable the user to present key
takeaways by cutting through a lot of non-useful data.
Scatter plot: in multi-dimensional data, we select any two features and plot them
against each other, one per axis.
Geospatial: These visualizations relate data to real-life physical locations by
overlaying it on maps (geospatial or spatial maps). The intuition behind them is to
create a holistic view of performance by location.
Flow map: shows the movement of information or objects from one location to
another, often with line width indicating the amount.
Choropleth map: the geospatial map is colored on the basis of a particular data variable.
Heat map: very similar to the choropleth in the geospatial genre, except that
regions are shaded by the intensity of a measure rather than by discrete classes.
Kagi chart: typically used to represent the demand and supply of an asset;
thickening lines mark increasing demand as bullish and thinning lines mark
decreasing demand as bearish.
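The word-cloud mapping mentioned above (text size signifies word frequency) can be sketched in plain Python; the sample sentence and the point-size range are invented for illustration:

```python
from collections import Counter

# Invented sample text; in a real word cloud this would be a document corpus.
text = "data reveals patterns and data tells stories and stories need data"

def word_sizes(text, min_pt=10, max_pt=40):
    """Map each word's frequency to a font size between min_pt and max_pt."""
    counts = Counter(text.lower().split())
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {w: min_pt + (c - lo) * (max_pt - min_pt) // span
            for w, c in counts.items()}

sizes = word_sizes(text)
# "data" occurs most often, so it gets the largest font size.
```

A rendering library would then pack the words tightly, but the frequency-to-size mapping above is the core of the visualization.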
Introduction to Tableau
Tableau is a fast-growing and powerful data visualization tool. Tableau is a business
intelligence tool that helps us analyze raw data in a visual manner, whether as a
graph, a report, etc.
Example: If you have data in Big Data platforms such as Hadoop, in SQL, or in any
cloud store, and you want to analyze it as a pictorial representation, you can use
Tableau.
Data analysis is very fast with Tableau, and the visualizations created take the form
of worksheets and dashboards. Any professional can understand the data presented
with Tableau.
Tableau doesn't require any technical or programming skills to operate, and it is
easy and fast for creating visual dashboards.
Features of Tableau
o Data Blending: Data blending is one of the most important features in Tableau. It
is used when we combine related data from multiple data sources that we want to
analyze together in a single view and represent in the form of a graph.
Example: Assume we have sales data in a relational database and sales-target data in
an Excel sheet. To compare actual sales with target sales, we blend the data based
on common dimensions. The two sources involved in data blending are referred to as
the primary and secondary data sources. To blend the data, a left join is created
between the primary and the secondary data source, keeping all the data rows from
the primary source and the matching data rows from the secondary source.
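The blend-as-left-join behavior in the example can be sketched in plain Python; the region, actual, and target rows are invented stand-ins for the database and Excel sources:

```python
# Hypothetical rows standing in for the two sources in the example above:
# actual sales from a relational database, targets from an Excel sheet.
sales = [
    {"region": "East", "actual": 120},
    {"region": "West", "actual": 95},
    {"region": "North", "actual": 60},
]
targets = [
    {"region": "East", "target": 100},
    {"region": "West", "target": 110},
]

def blend(primary, secondary, key):
    """Left join: every primary row survives; secondary fields attach where the key matches."""
    lookup = {row[key]: row for row in secondary}
    blended = []
    for row in primary:
        match = lookup.get(row[key], {})
        merged = {**row, **{k: v for k, v in match.items() if k != key}}
        blended.append(merged)
    return blended

result = blend(sales, targets, "region")
# "North" has no matching target row, so it carries no "target" field.
```

This is why, in a blend, rows that exist only in the secondary source never appear in the view: the primary source drives the result.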
Tools of Tableau
A list of Tableau tools:
o Tableau Desktop
o Tableau Public
o Tableau Online
o Tableau Server
o Tableau Reader
1. Developer Tools:- The Tableau tools used for development, such as the creation of
charts, dashboards, reports, and visualizations, are known as developer tools.
Tableau Desktop and Tableau Public are examples of this type.
2. Sharing Tools:- The role of these tools is to share the reports, visualizations, and
dashboards that were created using the developer tools. The Tableau tools that fall
into this category are Tableau Server, Tableau Online, and Tableau Reader.
1. Tableau Desktop
Tableau Desktop has a rich feature set and allows us to code and customize reports.
From creating the reports and charts to blending them all into a dashboard, all the
necessary work is done in Tableau Desktop.
For live data analysis, Tableau Desktop establishes connectivity between the data
warehouse and various other types of files. The dashboards and the workbooks
created here can be shared either locally or publicly.
Based on connectivity to the data sources and the publishing options, Tableau
Desktop is classified into two parts: Tableau Desktop Personal and Tableau Desktop
Professional.
2. Tableau Public
This Tableau version is specially built for cost-conscious users. The word 'Public'
means that workbooks created here cannot be saved locally; they must be kept on
Tableau's public cloud, where they can be accessed and viewed by anyone.
There is no privacy for files saved to this cloud, so anyone can access and download
the same data. This version is best for those who want to share their data with the
general public and for individuals who want to learn Tableau.
3. Tableau Online
Its functionality is similar to Tableau Server, but the data is stored on servers
hosted in the cloud, which are maintained by the Tableau group.
There is no storage limit on the data published to Tableau Online. Tableau Online
creates a direct link to over 40 data sources hosted in the cloud, such as Hive,
MySQL, Spark SQL, Amazon Aurora, and many more.
To publish, both Tableau Server and Tableau Online require workbooks created in
Tableau Desktop. Both also support data that flows from web applications such as
Google Analytics and Salesforce.com.
4. Tableau Server
This software is used to share, across the organization, the workbooks and
visualizations created in the Tableau Desktop application. To share dashboards on
Tableau Server, you must first publish your workbook from Tableau Desktop. Once the
workbook has been uploaded to the server, it is accessible only to authorized users.
Authorized users do not need Tableau Server installed on their machines; they only
require login credentials with which they can check reports in a web browser.
Security is very high on Tableau Server, and it is well suited for quick and
effective sharing of data.
The admin of the organization has full control over the server, and the organization
maintains the hardware and the software.
5. Tableau Reader
Tableau Reader is a free tool that allows us to view the visualizations and
workbooks created using Tableau Desktop or Tableau Public. The data can be filtered,
but modifications and editing are restricted. There is no security in Tableau
Reader, as anyone can view a workbook with it.
Tableau Architecture
Tableau Server is designed to connect many data tiers. It can serve clients on
mobile, web, and desktop. Tableau Desktop is a powerful data visualization tool,
and the server is very secure and highly available.
It can run on both physical machines and virtual machines. It is a multi-process,
multi-user, and multi-threaded system.
1. Data server:- The primary component of the Tableau architecture is the set of
data sources that can connect to it.
Tableau can connect to multiple data sources and blend the data from them. It can
connect to an Excel file, a database, and a web application at the same time, and it
can also relate different types of data sources to one another.
Tableau has a built-in SQL/ODBC connector. This ODBC connector can connect to any
database without using its native connector. Tableau Desktop offers a choice of both
extract and live data; depending on the use case, one can easily switch between live
and extracted data.
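The live-versus-extract distinction can be illustrated with a toy in-memory SQLite database standing in for any ODBC-reachable source (table and column names are invented):

```python
import sqlite3

# Toy database standing in for an ODBC-reachable source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

# "Live" mode: every question goes straight to the source.
def live_total(conn):
    return conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# "Extract" mode: snapshot the rows once, then answer from the copy.
extract = conn.execute("SELECT id, amount FROM orders").fetchall()
def extract_total(snapshot):
    return sum(amount for _id, amount in snapshot)

# A new row reaches live queries immediately, but not the extract
# until it is refreshed.
conn.execute("INSERT INTO orders VALUES (3, 4.5)")
```

After the insert, the live query reflects the new row while the extract still reports the older snapshot, which is exactly the trade-off between freshness and speed that the two modes represent.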
3. Components of Tableau Server: Tableau Server comprises three main components:
o Application server
o VizQL server
o Data server
A. Application server: The application server handles authentication and
authorization, and it manages the browsing and searching activity of web and
mobile clients.
B. VizQL server: The VizQL server is used to convert queries from the data source
into visualizations. Once the client request is forwarded to the VizQL process, it
sends the query directly to the data source and retrieves the information in the
form of images. This visualization or image is presented to the user. Tableau Server
creates a cache of visualizations to reduce load time; the cache can be shared among
the many users who have permission to view the visualization.
C. Data server: The data server is used to store and manage the data from external
data sources. It is a central data management system. It provides data security,
metadata management, data connections, driver requirements, and data storage. It
stores the related details of a data set, such as calculated fields, metadata,
groups, sets, and parameters. The data server can serve extracts of the data as well
as maintain live connections with external data sources.
4. Gateway: The gateway directs requests from users to the Tableau components. When
a client sends a request, it is forwarded to the external load balancer for
processing, and the gateway works as a distributor of processes to the different
components. In the absence of an external load balancer, the gateway itself works as
a load balancer. For a single-server configuration, one gateway or primary server
manages all the processes. For multiple-server configurations, one physical system
works as the primary server while the others are used as worker servers; only one
machine serves as the primary server in a Tableau Server environment.
5. Clients: The visualizations and dashboards on Tableau Server can be edited and
viewed using different clients: web browsers, mobile applications, and Tableau
Desktop.
Advantages of Tableau
o Data Visualization:- Tableau is a data visualization tool that provides complex
computation, data blending, and dashboarding for creating beautiful data
visualizations.
o Quickly Create Interactive Visualizations:- Users can create highly interactive
visuals using Tableau's drag-and-drop functionality.
o Comfortable Implementation:- Many types of visualization options are available in
Tableau, which enhances the user experience. Tableau is very easy to learn compared
with Python; even those with no coding background can quickly learn Tableau.
o Tableau can Handle Large Amounts of Data:- Tableau can easily handle millions of
rows of data. Different types of visualizations can be created over large data
without disturbing the performance of the dashboards. There is also an option in
Tableau to connect 'live' to different data sources like SQL, etc.
o Use of other Scripting Languages in Tableau:- To avoid performance issues and to
do complex table calculations, users can include Python or R in Tableau. Using
Python scripts, users can offload work from the software by performing data-cleansing
tasks with packages. However, Python is not a native scripting language accepted by
Tableau, so you can import only some of the packages or visuals.
o Mobile Support and Responsive Dashboards:- Tableau Dashboard has an excellent
reporting feature that allows you to customize a dashboard specifically for a device
such as a mobile phone or laptop. Tableau automatically detects which device the
user is viewing the report on and makes adjustments to ensure that the right report
is delivered to the right device.
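The data-cleansing advantage mentioned above can be sketched as follows; the field names and cleaning rules are hypothetical and are not Tableau APIs, just the kind of pre-processing one might run in Python before handing rows to a visualization tool:

```python
# Hypothetical raw rows with the usual problems: stray whitespace,
# a missing value, and an exact duplicate.
raw_rows = [
    {"name": "  Alice ", "revenue": "1200"},
    {"name": "Bob", "revenue": ""},           # missing value
    {"name": "  Alice ", "revenue": "1200"},  # duplicate
]

def cleanse(rows):
    """Trim whitespace, coerce numerics, drop incomplete and duplicate rows."""
    seen, cleaned = set(), []
    for row in rows:
        name = row["name"].strip()
        if not row["revenue"]:
            continue  # drop rows with a missing revenue figure
        key = (name, row["revenue"])
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        cleaned.append({"name": name, "revenue": float(row["revenue"])})
    return cleaned

clean = cleanse(raw_rows)
```

Doing this kind of work in a script keeps the load off the visualization layer, which is the point the bullet above makes.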
Disadvantages of Tableau
o Scheduling of Reports:- Tableau does not provide automatic scheduling of reports,
so some manual effort is always required when the user needs to update the data in
the back end.
o No Custom Visual Imports:- In other tools like Power BI, a developer can create a
custom visual that is easily imported; in Tableau, any new visual must be recreated
rather than imported, as Tableau is not a completely open tool.
o Custom Formatting in Tableau:- Tableau's conditional formatting and its 16-column
table limit are very inconvenient for users. Also, there is no way to apply the same
formatting to multiple fields at once; users have to do it manually for each field,
which is very time-consuming.
o Static and Single-Value Parameters:- Tableau parameters are static and always
select a single value. Whenever the data changes, these parameters have to be
updated manually every time; there is no option for users to automate the updating
of parameters.
o Screen Resolution on Tableau Dashboards:- The layout of a dashboard is disturbed
if the Tableau developer's screen resolution differs from the user's screen
resolution.
Example:- If the dashboard is created at a resolution of 1920 x 1080 and viewed at
2560 x 1440, the layout of the dashboard will be slightly broken; dashboards are not
responsive. So you will need to create separate dashboards for desktop and mobile.
Data warehousing
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to
Inmon, a data warehouse is a subject oriented, integrated, time-variant, and non-
volatile collection of data. This data helps analysts to take informed decisions in an
organization.
An operational database undergoes frequent changes on a daily basis on account of
the transactions that take place. Suppose a business executive wants to analyze
previous feedback on any data, such as a product, a supplier, or any consumer data.
The executive will then have no data available to analyze, because the previous data
has been overwritten by the transactions.
A data warehouse provides us generalized and consolidated data in a multidimensional
view. Along with this generalized and consolidated view of data, a data warehouse
also provides us Online Analytical Processing (OLAP) tools. These tools help us in
interactive and effective analysis of data in a multidimensional space. This
analysis results in data generalization and data mining.
Data mining functions such as association, clustering, classification, and
prediction can be integrated with OLAP operations to enhance the interactive mining
of knowledge at multiple levels of abstraction. That is why the data warehouse has
now become an important platform for data analysis and online analytical processing.
Single-tier architecture
The objective of a single layer is to minimize the amount of data stored; this goal
is achieved by removing data redundancy. This architecture is not frequently used in
practice.
Two-tier architecture
A two-tier architecture separates the physically available data sources from the
data warehouse itself, but it is not easily expandable and supports only a limited
number of end-users.
Three-tier architecture
A three-tier architecture adds a reconciled layer between the sources and the
warehouse: the bottom tier is the warehouse database server, the middle tier is an
OLAP server, and the top tier holds the client query and reporting tools.
Data Warehouse Components
The data warehouse is based on an RDBMS server, which is a central information
repository surrounded by some key components that make the entire environment
functional, manageable, and accessible.
B. ETL Tools
These Extract, Transform, and Load tools may generate cron jobs, background jobs,
COBOL programs, shell scripts, etc. that regularly update data in the data
warehouse. These tools also help maintain the metadata. ETL tools have to deal with
the challenges of database and data heterogeneity.
C. Metadata
The name metadata suggests some high-level technological concept, but it is quite
simple: metadata is data about data that defines the data warehouse. It is used for
building, maintaining, and managing the data warehouse. Metadata answers questions
such as:
What tables, attributes, and keys does the data warehouse contain?
Where did the data come from?
How many times does the data get reloaded?
What transformations were applied during cleansing?
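A sketch of the kind of record a metadata repository might keep to answer those questions; every field name and value here is illustrative, not a real warehouse's catalog:

```python
# Illustrative metadata record for one hypothetical warehouse table.
table_metadata = {
    "table": "fact_sales",
    "attributes": ["date_key", "product_key", "amount"],
    "primary_key": "sales_id",
    "source": "orders table in the operational ERP database",
    "reload_frequency": "daily",
    "transformations": ["trimmed whitespace", "converted currency to USD"],
}

def answers(meta):
    """Metadata answers the questions above: what, where from, how often, how transformed."""
    return {
        "What does it contain?": meta["attributes"],
        "Where did it come from?": meta["source"],
        "How often is it reloaded?": meta["reload_frequency"],
        "What transformations were applied?": meta["transformations"],
    }
```

Real metadata repositories hold the same kinds of facts, just for every table, column, and load job in the warehouse.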
D. Query Tools
One of the primary objectives of data warehousing is to provide information to
businesses so they can make strategic decisions. Query tools allow users to interact
with the data warehouse system.
1. Report writers: This kind of reporting tool is designed for end-users to do
their own analysis.
2. Production reporting: This kind of tool allows organizations to generate regular
operational reports. It also supports high-volume batch jobs like printing and
calculating. Some popular reporting tools are Brio, Business Objects, Oracle,
PowerSoft, and SAS Institute.
3. Application development tools: This kind of access tool helps end-users get
around snags in SQL and database structure by inserting a meta-layer between the
users and the database.
4. OLAP tools:
These tools are based on the concept of a multidimensional database. They allow
users to analyse the data using elaborate and complex multidimensional views.
E. Data warehouse Bus Architecture
Data warehouse Bus determines the flow of data in your warehouse. The data
flow in a data warehouse can be categorized as Inflow, Upflow, Downflow,
Outflow and Meta flow.
While designing a Data Bus, one needs to consider the dimensions and facts shared
across the data marts.
There are four different types of layers which will always be present in Data
Warehouse Architecture.
The Data Source Layer is the layer where data from the source is encountered and
subsequently sent to the other layers for the desired operations.
The data can be of any type: the source data can be a database, a spreadsheet, or
any other kind of text file.
The source data can also be of any format; we cannot expect to get data in the same
format, considering the sources are vastly different.
In Real Life, Some examples of Source Data can be
Log Files of each specific application or job or entry of employers in a
company.
Survey Data, Stock Exchange Data, etc.
Web Browser Data and many more.
1. Data Extraction
The data received by the Source Layer is fed into the Staging Layer, where the first
process that takes place on the acquired data is extraction.
2. Landing Database
The extracted data is temporarily stored in a landing database before being passed
on for staging.
3. Staging Area
The data in the landing database is then taken up, and several quality checks and
staging operations are performed on it in the staging area.
The structure and schema are also identified, and adjustments are made to unordered
data, thus trying to bring about a commonality among the data that has been acquired.
Having a place set up for the data just before transformation and changes is an
added advantage that makes the staging process very important. It makes data
processing easier.
4. ETL
The transformation logic is applied in this step: the cleansed and conformed data
from the staging area is transformed and then loaded into the data warehouse.
Data Presentation Layer
This is the layer where users get to interact with the data stored in the data
warehouse. Queries and various tools are employed to get different types of
information based on the data, and the information reaches the user through
graphical representations of the data.
Reporting tools are used to get business data, and business logic is also applied to
gather several kinds of information.
Metadata information and system operations and performance are also maintained and
viewed in this layer.
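The extraction-staging-load flow described across these layers can be sketched in miniature with Python's built-in sqlite3; all table and field names are illustrative:

```python
import sqlite3

# 1. Extract: rows as they arrive from a hypothetical source system.
extracted = [("2024-01-05", " widget ", "19.99"),
             ("2024-01-06", "gadget", "5.00")]

# 2. Transform: the staging step cleans strings and coerces types.
transformed = [(date, name.strip(), float(price))
               for date, name, price in extracted]

# 3. Load: write the conformed rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (sale_date TEXT, product TEXT, price REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", transformed)

# The presentation layer then queries the warehouse for business information.
total = warehouse.execute("SELECT SUM(price) FROM sales").fetchone()[0]
```

Each comment corresponds to one of the layers above: source, staging, storage, and finally the query that the presentation layer would issue.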
Star schema
The star schema is the simplest approach to dimensional modeling: the fact tables
and dimension tables are arranged in an organized manner, and it is mostly applied
in business intelligence and data warehousing. A star schema is formed by arranging
each fact with its related dimensions so that the layout resembles a star. A fact is
a measurable event, such as sales details or login counts. A dimension is a
collection of reference data about the facts, such as dates, details about products,
and customers. The star schema is optimized for large data queries in data
warehousing, Online Analytical Processing (OLAP) data cubes, and also ad-hoc queries.
Identify the business process from the entity-relationship view and understand how
the model can be split into several dimensional models; an entity-relationship model
consists of the business data.
Find the many-to-many tables in the entity-relationship model that describe the
business process and convert them into dimensional-model fact tables. Such a table
comprises a fact table with numeric values and the unique key attributes that link
it to the dimension tables.
The idea behind this step is to distinguish the transaction-based information tables
from the more static reference tables, so the many-to-many relationships need to be
identified. For example, in an ERP database, the invoice details are a transaction
table; details that are updated and refreshed belong to the transaction-based
tables. Comparing both kinds of tables shows which data is genuinely static.
The fact table is the representation in a dimensional model that captures the
many-to-many relationships between the finite dimensions. As a result, the foreign
keys in fact tables share a many-to-many (countable) relationship. Most of these
tables fall under the transaction-based tables.
The last step in designing a star schema is to de-normalize the remaining tables
into dimension tables. The mandatory key is a duplicate (surrogate) key; this key
refers back to the fact table and helps in better understanding. Find the date and
time in the entity-relationship design and fill the date dimension table. Dates are
saved as date and time stamps, and a date dimension column can represent the year,
month, date, or time.
Cubes are deployed to address queries at various levels, and the response time to
answer a query is minimal. They are available as pre-built designs and applicable in
the required situations. Creating a star schema is easy and efficient to apply, and
it is adaptable too. Completing the fact table and the dimension tables is
mandatory; together they form the star, and they can be created using SQL queries or
running code. This design is made for better understanding and easy fetching of data.
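A star schema of this kind can indeed be formed with SQL queries; here is a minimal sketch using Python's sqlite3, with invented fact and dimension names:

```python
import sqlite3

# A miniature star schema: one fact table in the center, each dimension
# joined to it by a foreign key. All names are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, color TEXT);
CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, amount REAL);
""")
db.execute("INSERT INTO dim_date VALUES (1, 2024, 1)")
db.execute("INSERT INTO dim_product VALUES (10, 'widget', 'red')")
db.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
               [(1, 10, 19.99), (1, 10, 5.00)])

# The typical star query: aggregate facts, sliced by dimension attributes.
row = db.execute("""
    SELECT d.year, p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.name
""").fetchone()
```

Note how every join is a single hop from the fact table to a dimension, which is what keeps star-schema queries simple and fast.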
Characteristics of Star Schema
1. The star schema provides fast aggregations and calculations, such as total items
sold and the revenue gained at the end of every month. These details and processes
can be filtered according to requirements by framing suitable queries.
2. It has the capacity to take normalized data and provide for data warehousing
needs; the associated information of the normalized tables is stacked into multiple
dimension tables. A unique key is generated for each fact table to identify each row.
3. The fact table is a measurement of specific events, holding finite numeric
values, and consists of foreign keys related to the dimension tables. This table is
framed with fact values at the atomic level and permits storing multiple records at
a time. There are three different types of fact table.
4. Transaction fact tables hold data about specific events, such as holiday events
and sales events.
5. Periodic snapshot fact tables record facts for given periods, like account
information at the end of every quarter.
6. Accumulating snapshot fact tables record facts for a process with a defined
beginning and end, such as order fulfillment.
7. Dimension tables provide the detailed attribute data for the records found in
the fact table. A dimension table can have varied features. Dimension tables are
used mainly as time-and-date dimension tables, product and purchase-order dimension
tables, employee and account-detail dimension tables, and geography and location
dimension tables. These tables are assigned a single integer key, which acts as the
duplicate (surrogate) primary key.
8. The user can design tables according to requirements. For example, a sales fact
might need a product key and customer key, a date-and-time key, and a
revenue-generated measure, while a product dimension table might be framed with key
attributes such as color, date of the purchased item, promotion key, and client key.
Advantages
It is formed with simple logic, and simple queries make it easy to extract the data
from the transactional process.
Disadvantages
It is highly de-normalized, so maintaining integrity is hard: if the user fails to
update the values, the complete process can collapse. Its protections and security
are not reliable beyond a limit. It is not as flexible as an analytical model and
does not extend efficient support to many-to-many relationships.
Snowflake schema
The snowflake schema is a schema type widely used in data warehousing schema
architecture, amongst the other types available in today's data warehousing methods.
In this approach, the schema is structured into a sensible design of tables and data
for the given database. One can say that a snowflake schema is an expanded, fully
normalized version of the star schema. This schema type is picked based on the
parameters agreed by the project team, which should align with the requirements
they have received for the given project.
A snowflake schema must contain a single fact table in the center, with single or
multiple levels of dimension tables. All the dimension tables are completely
normalized, which can lead to any number of levels. Normalization here means
breaking one dimension table into two or more dimension tables to ensure minimal or
no redundancy. While all the first-level dimension tables are linked to the central
fact table, all the other dimension tables can be linked to one another if required.
This structure resembles a snowflake (Fig. 01), hence the name 'snowflake schema'.
Characteristics
This model involves only one fact table and multiple dimension tables, which are
further normalized until there is no more room for normalization.
Snowflake Schema makes it possible for the data in the Database to be more
defined, in contrast to other schemas, as normalization is the main attribute
in this schema type.
Normalization is the key feature that distinguishes Snowflake schema from
other schema types available in the Database Management System
Architecture.
The Fact Table will have all the facts/ measures, while the Dimension Tables
will have foreign keys to connect with the Fact Table.
Snowflake Schema allows the Dimension Tables to be linked to other
Dimension tables, except for the Dimension Tables in the first level.
This Multidimensional nature makes it easy to implement on complex
Relational Database systems, thus resulting in effective Analysis &
Reporting processes.
In terms of Accessibility, Complex multiple levels of Join queries are
required to fetch aggregated data from the central fact table, using the
foreign keys to access all the required Dimension tables.
Multiple Dimension tables, which are created as a result of normalization,
serve as lookup tables when querying with Joins.
The process of breaking down all the dimension tables into multiple small dimension
tables until everything is completely normalized takes up a lot of storage space
compared to other schemas. And as the querying process is complex, the pace of data
retrieval is by far lower.
Say our ‘A’ is a ‘Clothing Sales’ fact table; it could have the dimensions below as
its ‘B, C, D, E, F, G’, each with scope for further normalization –
Employees
Customer
Store
Products
Sales
Exchange
Stores – Owned & Rented, which can be further broken into location: country, state,
region, city/town, etc., at each level, depending on the available data and
requirements.
Sales – Limited Editions & other branded items, which can be further broken into
seasonal, non-seasonal, etc.
Exchanges – reasons as the second level, and exchange for ‘money-back’ or a
‘different product’ as the third level of dimensions.
Products – a ‘Product Types’ table as the second-level dimension, with further
levels for each type of product; this can be continued until the last level of
normalization.
Customers – ‘customer types’ such as Men & Women, which can be additionally split
into members, non-members, types of membership, etc.
Employees – the types of employees, such as ‘Permanent’ and ‘Temporary/Part-time’;
the next levels here can be department, location, salary grade, etc.
These can be further normalized to the final level of dimension tables, as this
helps reduce redundancy in the final data. This schema can be used for analysis or
reporting when the focus is mainly on clothing sales alone (the fact table) and the
first-level dimensions as specified above.
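Snowflaking the hypothetical Products dimension above into a second-level 'product type' table can be sketched with Python's sqlite3; all table, column, and value names are illustrative:

```python
import sqlite3

# Snowflaked dimensions: dim_product no longer repeats the type name;
# it points at a second-level dim_product_type lookup table instead.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product_type (type_key INTEGER PRIMARY KEY, type_name TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT,
                          type_key INTEGER REFERENCES dim_product_type(type_key));
CREATE TABLE fact_sales (product_key INTEGER, amount REAL);
""")
db.execute("INSERT INTO dim_product_type VALUES (1, 'Clothing')")
db.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
               [(10, 'shirt', 1), (11, 'scarf', 1)])
db.executemany("INSERT INTO fact_sales VALUES (?, ?)",
               [(10, 25.0), (11, 15.0)])

# Fetching aggregated data now requires the multi-level joins described above:
# fact -> product -> product type.
row = db.execute("""
    SELECT t.type_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p      ON f.product_key = p.product_key
    JOIN dim_product_type t ON p.type_key = t.type_key
    GROUP BY t.type_name
""").fetchone()
```

Compared with a star schema, the extra join to the lookup table is the price paid for removing the repeated type name from every product row.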