
International Journal of Research
e-ISSN: 2348-6848 | p-ISSN: 2348-795X
Available at https://edupediapublications.org/journals
Volume 05, Issue 17, July 2018

Big Data Analytics: A Classification of Data Quality Assessment and Improvement Methods

K Deepthi Reddy & K Sinduja
Assistant Professors, CSE Department, CVR College of Engineering, JNTUH

Abstract:

The categorization of data and information requires re-evaluation in the age of Big Data in order to ensure that appropriate protections are given to different types of data. The aggregation of large amounts of data requires an assessment of the harms and benefits that pertain to large datasets linked together, rather than simply assessing each datum or dataset in isolation. Big Data produces new data via inferences, and this must be recognized in ethical assessments. We propose a classification of data quality assessment and improvement methods. The use of schemata such as this will assist decision-making by providing research ethics committees and information governance bodies with guidance about the relative sensitivities of data. This will ensure that appropriate and proportionate safeguards are provided for data research subjects and will reduce inconsistency in decision making.

Keywords

Big data analytics; Massive data; Structured data; Unstructured data; DQ

Introduction

Big Data technology involves collecting data from different sources and merging it so that it becomes available to deliver a data product useful to the organization's business. It is the process of converting large amounts of unstructured raw data received from different sources into a data product useful for organizations and users.

Most big data problems can be categorized in the following ways:

• Supervised classification
• Supervised regression
• Unsupervised learning
• Learning to rank

Let us now learn more about these four concepts.

Supervised Classification

Given a matrix of features X = {x1, x2, ..., xn}, we develop a model M to predict different classes defined as y = {c1, c2, ..., cn}. For example, given transactional data of customers in an insurance company, it is possible to develop a model that will predict whether a client will churn or not. The latter is a binary classification problem, where there are two classes or target variables: churn and not churn.

Other problems involve predicting more than one class. For example, we could be interested in digit recognition, in which case the response vector would be defined as y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; a state-of-the-art model would be a convolutional neural network, and the matrix of features would be defined as the pixels of the image.

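To make the churn example above concrete, the following minimal sketch trains a binary classifier with scikit-learn. It is an illustration, not the paper's method: the three usage features and the synthetic labels are invented.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X is the matrix of features {x1, ..., xn}; y holds the two classes
# (1 = churn, 0 = not churn). The columns stand in for hypothetical
# insurance features such as premium paid, years as a client, claims.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)  # the model M
print("held-out accuracy:", model.score(X_test, y_test))
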
Supervised Regression

In this case, the problem definition is rather similar to the previous example; the difference lies in the response. In a regression problem the response y ∈ ℝ, which means the response is real-valued. For example, we can develop a model to predict the hourly salary of individuals given the corpus of their CVs.

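A minimal sketch of this idea, assuming scikit-learn is available: the CV snippets and salaries below are invented, and TF-IDF features with ridge regression stand in for whatever text model one would actually tune.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

cvs = [  # hypothetical CV fragments
    "python machine learning engineer",
    "junior data entry clerk",
    "senior software architect, cloud platforms",
    "marketing intern",
]
hourly_salary = [45.0, 15.0, 70.0, 12.0]  # invented real-valued targets, y in R

X = TfidfVectorizer().fit_transform(cvs)  # CV corpus -> feature matrix
model = Ridge().fit(X, hourly_salary)
print(model.predict(X[:1]))  # predicted hourly salary for the first CV
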
Unsupervised Learning

Management is often thirsty for new insights. Segmentation models can provide this insight so that the marketing department can develop products for different segments. A good approach for developing a segmentation model, rather than thinking of algorithms, is to select features that are relevant to the desired segmentation.

For example, in a telecommunications company it is interesting to segment clients by their cell phone usage. This would involve disregarding features that have nothing to do with the segmentation objective and including only those that do. In this case, this would mean selecting features such as the number of SMS messages sent in a month, the number of inbound and outbound minutes, and so on.

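A sketch of such a segmentation, assuming scikit-learn; the usage matrix is invented, and k-means with two clusters is one reasonable choice among many.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows are clients; columns are only the usage features relevant to the
# segmentation objective: [sms_per_month, inbound_minutes, outbound_minutes].
usage = np.array([
    [300, 120,  80],
    [ 10, 600, 550],
    [280, 100,  90],
    [  5, 650, 500],
])
X = StandardScaler().fit_transform(usage)  # put features on a common scale
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(segments)  # cluster label per client, e.g. [0 1 0 1]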

Learning to Rank

This problem can be considered as a regression problem, but it has particular characteristics and deserves separate treatment. The problem is: given a collection of documents, we seek to find the most relevant ordering for a given query. In order to develop a supervised learning algorithm, it is necessary to label how relevant an ordering is for a given query.

It is worth noting that developing a supervised learning algorithm requires labeled training data. This means that in order to train a model that will, for example, recognize digits from an image, we need to label a significant number of examples by hand. There are web services that can speed up this process and are commonly used for this task, such as Amazon Mechanical Turk. It has been shown that learning algorithms improve their performance when provided with more data, so labeling a decent number of examples is practically mandatory in supervised learning.

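The sketch below shows the simplest (pointwise) formulation of learning to rank: regress the hand-labeled relevance of each query-document pair, then sort by predicted score. The two features and the labels are invented, and production systems typically use pairwise or listwise objectives instead.

import numpy as np
from sklearn.linear_model import LinearRegression

# Each row holds hypothetical features of one (query, document) pair,
# e.g. [term overlap with the query, document freshness].
X = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]])
relevance = np.array([3, 0, 2, 1])  # hand-labeled relevance, as the text requires

model = LinearRegression().fit(X, relevance)
scores = model.predict(X)
ranking = np.argsort(-scores)  # document indices, most relevant first
print(ranking)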

In large organizations, successfully developing a big data project requires having management back the project. This normally involves finding a way to show the business advantages of the project. We do not have a unique solution to the problem of finding sponsors for a project, but a few guidelines are given below:

• Check who and where the sponsors of projects similar to the one that interests you are.

• Having personal contacts in key management positions helps, so any contact can be triggered if the project is promising.

• Who would benefit from your project? Who would be your client once the project is on track?

• Develop a simple, clear, and exciting proposal and share it with the key players in your organization.

The best way to find sponsors for a project is to understand the problem and what the resulting data product will be once it has been implemented. This understanding will give you an edge in convincing management of the importance of the big data project.

Figure 1: Big Data Properties with Data Warehouses

Big Data Challenges

Adaptability

Data is growing and is being created at the scale of terabytes. How do I store it? Where do I keep it? What algorithms will be used to process it? Will any data mining method be able to cope with such enormous volumes? A few scalable systems are used by organizations such as Microsoft. The transfer of data onto the cloud is a slow process, and we need a proper framework that does it at high speed, especially when the data is dynamic in nature and huge.

Data rebalancing algorithms exist and rely on load balancing and histogram build-up. Scalability exists at three levels in the cloud stack; at the platform level, there is horizontal and vertical scalability.

Security and Access Control: Security is an aspect that arises as an issue from within an organization, or when an individual uses a cloud to transfer "their own data". When a client uploads data, and also pays for the service, questions arise: who is responsible for access to the data, for permissions to use the data, for the location of the data and its loss, and for the authority to use the data stored on clusters? What right does the cloud service provider have to use the client's personal data? One of the major solutions is encrypting the data.

Privacy and Integrity Issues: The data being generated might be too personal for an individual or an organization. Big data might be collected from Facebook accounts or WhatsApp applications, each of these being more personal than other applications. In addition to this online data, there may be data pertaining to health records, purchases, and so on. These might lead to identification issues, profiling, loss of control, and tracking of a person's whereabouts through supermarket purchases, among many others. Thus anonymization of this data, or its encryption, comes as a solution to this issue. Privacy can also be addressed through user consent over how the data is used or shared. Several privacy and protection laws exist for this as part of the regulatory framework.

Big Data Difficulties

1. Information Protection:

Information security is a significant component that warrants examination. Enterprises are hesitant to buy an assurance of business data security from vendors. They fear losing data to competitors, and they fear for the data privacy of their customers. In many instances, the actual storage location is not disclosed, adding to the security concerns of enterprises. In the current models, firewalls across data centers (owned by the enterprises) protect this sensitive data. In the cloud model, service providers are in charge of maintaining data security, and enterprises have to rely on them.

2. Information Recovery and Availability

All business applications have service level agreements that are strictly followed. Operational teams play a key role in the management of service level agreements and in the runtime management of applications. In production environments, operational teams support proper clustering and failover for:

• Data replication
• System monitoring (transaction monitoring, log monitoring, and others)
• Maintenance (runtime governance)
• Capacity and performance management

3. Management Capabilities

Despite there being numerous cloud providers, the management of platform and infrastructure is still in its infancy. Features like "auto-scaling", for instance, are a crucial requirement for many enterprises. There is huge potential to improve the scalability and load-balancing features provided today.

4. Regulatory and Compliance Restrictions

In some European countries, government regulations do not permit customers' personal data and other sensitive data to be physically located outside the state or country. In order to meet such requirements, cloud providers need to set up a data center or a storage site exclusively within the country to comply with the regulations. Having such an infrastructure may not always be feasible and is a major challenge for cloud providers.

Proposed System: A Classification of Data Quality Assessment and Improvement Methods

Figure 2: Data Quality Problems

DQ Assessment and Improvement Methods


To obtain a list of DQ methods, we reviewed the existing software tools for both DQ assessment and improvement and extracted the different methods provided within these tools. The landscape of DQ software tools is regularly reviewed by the information technology research and advisory firm Gartner, and we used their latest review to scope the search for DQ tools from which to extract DQ methods. The list of DQ tools reviewed is as follows:

To perform the extraction of methods, we reviewed the actual tool (for those that were freely available) and any documentation of the tool, including information on the organizations' websites. We also augmented this review with a general review of the DQ literature that describes DQ methods, and have cited the relevant works in our resulting list of DQ methods in the following section. Once we had reviewed each tool and extracted the DQ methods, the methods were validated (for completeness and validity) by an expert with 10 years' practitioner experience of current practices in the data quality industry. The resulting methods are described in the following two subsections and have been split according to whether they are for DQ assessment or improvement.

DQ Methods for Assessment

As noted before, the aim of DQ assessment is to inspect data to determine the current level of DQ and the extent of any DQ deficiencies. The following DQ methods, obtained from the review of the DQ tools above, support this activity and provide an automated means to detect DQ problems.

Column analysis typically computes the following information: the number of (unique) values and the number of instances per value as a percentage of the total number of instances in that column; the number of null values; the minimal and maximal values; the total and the standard deviation of the values for numerical columns; median and average value scores; and so on. In addition, column analysis also computes inferred type information. For example, a column could be declared as a 'string' column in the physical data model, but the values found would lead to the inferred data type 'date'.

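A minimal column-analysis sketch, assuming pandas; the sample frame and the 0.75 threshold for accepting an inferred type are invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "amount": [10.5, 99.0, 54.2, 10.5, None],
    "created": ["2018-07-01", "2018-07-02", "2018-07-03", "2018-07-04", "bad"],
})

col = df["amount"]
profile = {
    "unique_values": col.nunique(),
    "share_per_value": col.value_counts(normalize=True).to_dict(),
    "null_count": int(col.isna().sum()),
    "min": col.min(), "max": col.max(),
    "sum": col.sum(), "std": col.std(),
    "median": col.median(), "mean": col.mean(),
}
print(profile)

# Inferred type: 'created' is declared as string, but most values parse as dates.
parsed = pd.to_datetime(df["created"], errors="coerce")
print("inferred 'date'?", parsed.notna().mean() >= 0.75)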

The frequency distribution of the values in a column is another key metric, which can influence the weight factors in some probabilistic matching algorithms. A further metric is format distribution, where, for example, only 5-digit numeric entries are expected for a column holding German zip codes. Some DQ profiling tools (for example, the Talend profiler) differentiate between analyses that are applicable to a single column and those applicable to a "column set".

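For instance, a format-distribution check for the German zip-code case can be sketched with the standard library alone; the sample values are invented.

import re

FIVE_DIGITS = re.compile(r"\d{5}")  # expected format for German zip codes

zips = ["10115", "8031", "80331", "1011A"]  # invented sample column values
violations = [z for z in zips if not FIVE_DIGITS.fullmatch(z)]
print(violations)  # ['8031', '1011A'] fail the 5-digit format
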
Column set analysis refers to how values from multiple columns can be compared against one another. For this research, we include this functionality within the term "column analysis".
cannot be carried out automatically without human
Cross-domain analysis (also known as functional dependency analysis in some tools) can be applied to data integration scenarios with dozens of source systems. It enables the identification of redundant data across tables from different, and in some cases even the same, sources. Cross-domain analysis is done across columns from different tables to identify the percentage of values that are the same, and hence indicates whether the columns are redundant.

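A sketch of the overlap computation behind cross-domain analysis; the two ID columns are invented, and the 75% result is the kind of signal that would flag the columns as likely redundant.

def overlap_pct(col_a, col_b):
    """Percentage of distinct values of col_a that also appear in col_b."""
    a, b = set(col_a), set(col_b)
    return 100.0 * len(a & b) / len(a) if a else 0.0

customer_ids_src1 = ["C001", "C002", "C003", "C004"]  # hypothetical source tables
client_ids_src2 = ["C002", "C003", "C004", "C999"]
print(overlap_pct(customer_ids_src1, client_ids_src2))  # 75.0
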
Data verification algorithms verify whether a value or a set of values is found in a reference data set; these are sometimes referred to as data validation algorithms in some DQ tools. A typical example of automated data verification is checking whether an address is a real address by using a postal dictionary. It is not possible to check that it is the correct address, but these algorithms verify that the address refers to a real, occupied address. The results depend on high-quality input data. For example, verification against the postal dictionary will only produce good results if the address information has been standardized.

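A toy verification sketch against an in-memory stand-in for a postal dictionary; the dictionary entries and the standardization rules are invented, but they illustrate why standardized input matters.

# A stand-in reference data set; a real postal dictionary would be far larger.
POSTAL_DICTIONARY = {
    "10115 BERLIN INVALIDENSTR 1",
    "80331 MUENCHEN TAL 12",
}

def standardize(address):
    # Minimal standardization: uppercase, drop punctuation, collapse spaces.
    cleaned = address.upper().replace(",", "").replace(".", "")
    return " ".join(cleaned.split())

def verify(address):
    # True means the address exists, not that it is the *correct* one.
    return standardize(address) in POSTAL_DICTIONARY

print(verify("10115 Berlin, Invalidenstr. 1"))  # True once standardized
print(verify("99999 Nowhere St"))               # False: not a real address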

Conclusion

This paper describes DQ problems. Although some problems can be detected automatically, and the correction methods can also be automated, in most cases the whole process cannot be carried out without human intervention. For example, to find a common spelling error across many different instances of a word, a human is often needed to develop the correct regular expression that will then automatically find and replace all the incorrect instances. So, between the application of the automated assessment and improvement methods, there often exists a manual analysis and configuration step. The aim of this research was to provide a review of methods for DQ assessment and improvement and to identify gaps where no existing methods address particular DQ problems.

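As a small illustration of that manual step, a human might inspect the observed variants of a misspelled word and then hand a single regular expression to the automated replacement; the records and the pattern below are invented.

import re

records = ["Adress of client", "shipping adress missing", "billing address ok"]
pattern = re.compile(r"\badd?ress\b", re.IGNORECASE)  # human-crafted pattern

fixed = [pattern.sub("address", r) for r in records]
print(fixed)  # all variants replaced with the canonical spelling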