response vector would be defined as: y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, a state-of-the-art model would be a convolutional neural network, and the matrix of features would be defined as the pixels of the image.
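A minimal sketch of this setup in Python, assuming the Keras API and the MNIST digit data (neither is named in the text), where the pixel matrix is the feature matrix and y takes values in {0, ..., 9}:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Load 28x28 grayscale digit images; the labels are the digits 0-9.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., np.newaxis] / 255.0   # pixels as features, scaled to [0, 1]
x_test = x_test[..., np.newaxis] / 255.0

# A small convolutional neural network with one output per digit class.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128,
          validation_data=(x_test, y_test))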
Supervised Regression
In this case, the problem definition is rather similar to the previous example; the difference lies in the response. In a regression problem the response y ∈ ℜ, which means the response is real-valued. For example, we can develop a model to predict the hourly salary of individuals given the corpus of their CVs.
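A minimal sketch of such a model, assuming scikit-learn with TF-IDF text features and a linear model; the CVs and salaries below are hypothetical stand-ins:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

cvs = [
    "senior software engineer, 10 years experience, python, cloud",
    "junior data analyst, excel, sql, 1 year experience",
    "project manager, agile, 7 years experience, telecom",
]
hourly_salary = [85.0, 25.0, 60.0]   # real-valued response y

# TF-IDF turns each CV into a feature vector; Ridge fits a linear model.
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(cvs, hourly_salary)
print(model.predict(["data analyst, sql, python, 3 years experience"]))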
Unsupervised Learning
Management is often thirsty for new insights. Segmentation models can provide this insight so that the marketing department can develop products for different segments. A good approach for developing a segmentation model, rather than thinking of algorithms, is to select features that are relevant to the desired segmentation.

For example, in a telecommunications company it is interesting to segment clients by their cell phone usage. This would involve disregarding features that have nothing to do with the segmentation objective and including only those that do. In this case, this would be selecting features such as the number of SMS messages used in a month, the number of inbound and outbound minutes, etc.
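A minimal sketch of such a segmentation, assuming scikit-learn's k-means on exactly these usage features; the client records are hypothetical:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per client: [SMS per month, inbound minutes, outbound minutes]
usage = np.array([
    [300,  50,  40],
    [280,  60,  35],
    [ 10, 400, 500],
    [  5, 350, 450],
    [120, 150, 100],
])

# Standardize so that no single feature dominates the distance metric.
X = StandardScaler().fit_transform(usage)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(segments)   # cluster label per client, e.g. heavy texters vs. heavy callers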
Learning to Rank
This problem can be considered a regression problem, but it has particular characteristics and deserves a separate treatment. The problem is the following: given a collection of documents and a query, we seek the most relevant ordering of the documents. In order to develop a supervised learning algorithm, it is necessary to label how relevant an ordering is for a given query.
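A minimal pointwise sketch of this idea, assuming scikit-learn; the query-document features and relevance grades are hypothetical. A regressor is fitted to labeled relevance grades, and candidate documents are then ordered by predicted score:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row describes one query-document pair, e.g.
# [term overlap, retrieval score, document popularity].
X_train = np.array([
    [0.9, 12.0, 0.7],
    [0.4,  5.5, 0.9],
    [0.1,  1.0, 0.2],
    [0.7,  9.0, 0.1],
])
relevance = np.array([3, 2, 0, 2])   # human-labeled relevance grades

ranker = GradientBoostingRegressor().fit(X_train, relevance)

# Rank candidate documents for a new query by predicted relevance.
candidates = np.array([[0.8, 10.0, 0.5],
                       [0.2,  2.0, 0.8]])
order = np.argsort(-ranker.predict(candidates))
print(order)   # document indices, most relevant first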
It is relevant to note that, in order to develop a supervised learning algorithm, the training data must be labeled. This means that in order to train a model that will, for example, recognize digits from an image, we need to label a significant number of examples by hand. There are web services that can speed up this process and are commonly used for this task, such as Amazon Mechanical Turk. It is well established that learning algorithms improve their performance when provided with more data, so labeling a decent number of examples is practically mandatory in supervised learning.
In large organizations, in order to successfully develop a big data project, it is ...

The data being generated might be too ... organization. This big data might be ... WhatsApp applications, each of these being ...

All business applications have service level agreements that are strictly followed. Operational teams play a key part in the management of service level agreements and in the runtime administration of applications. In production environments, operational teams support proper clustering and failover.

Data Replication
... across data centers (owned by the cloud provider). In the cloud model, service providers are in ... and enterprises would need to depend on ... physically located outside the state or country. In order to meet such requirements, cloud providers need to set up a data center or a storage site exclusively within that country to comply with regulations. Having such an infrastructure may not always be feasible and is a major challenge for cloud providers.

3. Management Capabilities
reviewed the existing software tools for both DQ assessment and improvement and extracted the different methods provided within these tools. The landscape of DQ software tools is regularly reviewed by the information technology research and advisory firm Gartner, and we used their latest review to scope the search for DQ tools from which to extract DQ methods. The list of DQ tools reviewed is as follows: ...

To perform the extraction of methods, we reviewed the actual tool (for those that were freely available) and any documentation of the tool, including information on the organizations’ websites. We also augmented this with literature that describes DQ methods, and have cited the relevant works in our resulting list of DQ methods in the following section. Once we had reviewed each tool and extracted the DQ methods, the methods were validated (for completeness and validity) by an expert with ten years’ practitioner experience of current practices in the data quality industry. The resulting methods are described in the following two subsections and have been split according to whether they are for DQ assessment or improvement.

DQ Methods for Assessment
As noted before, the aim of DQ assessment is to inspect data to determine the current level of DQ and the extent of any DQ deficiencies. The following DQ methods, obtained from the review of the DQ tools above, support this activity and provide an automated means to detect DQ problems.
Column analysis typically computes the following information: the number of (unique) values and the number of instances per value as a percentage of the total number of instances in that column, the number of null values, the minimal and maximal value, the total and standard deviation of a value for numerical columns, median and average value scores, etc. In addition, column analysis also computes the inferred type. For example, a column could be declared as a ‘string’ column in the physical data model, but the values found would lead to the inferred data type ‘date’. The frequency distribution of the values in a column is another key metric, which can influence how values are assessed against one another. For this research, we include this functionality within the term “column analysis”.
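A minimal sketch of column analysis, assuming pandas; the column values are hypothetical. It computes value frequencies, the null count, minimum and maximum, the usual numerical statistics, and an inferred type for a column declared as ‘string’:

import pandas as pd

col = pd.Series(["2018-01-03", "2018-02-14", None, "2018-02-14"], name="start")

print(col.value_counts(normalize=True))   # instances per value as a fraction
print("nulls:", col.isna().sum())
print("min/max:", col.min(), col.max())

# Inferred type: declared as string, but every non-null value parses as a date.
inferred = pd.to_datetime(col, errors="coerce")
print("inferred type:",
      "date" if inferred.notna().sum() == col.notna().sum() else "string")

# For a numerical column the same analysis adds total, mean, median and std.
nums = pd.Series([10.0, 12.5, 11.0], name="amount")
print(nums.sum(), nums.mean(), nums.median(), nums.std())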
Cross-domain analysis (also known as functional dependency analysis in some tools) can be applied to data integration scenarios with dozens of source systems. It enables the identification of redundant data across tables from different, and in some cases even the same, sources. Cross-domain analysis is done across columns from different tables to identify the percentage of values that are the same and hence indicates whether the columns are redundant.
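A minimal sketch of cross-domain analysis, assuming pandas; the tables and column names are hypothetical. The percentage of distinct values shared between two columns is used as a signal that the columns are redundant:

import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 3, 4, 4]})
crm = pd.DataFrame({"client_no": [1, 2, 3, 9]})

def overlap(a: pd.Series, b: pd.Series) -> float:
    """Percentage of distinct values of `a` that also occur in `b`."""
    a_vals, b_vals = set(a.dropna()), set(b.dropna())
    return 100.0 * len(a_vals & b_vals) / len(a_vals)

# A high percentage suggests the two columns cover the same value domain.
print(overlap(orders["customer_id"], crm["client_no"]))   # 75.0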
Data verification algorithms verify whether a value or a set of values is found in a reference data set; these are sometimes referred to as data validation algorithms in some DQ tools. A typical example of automated data verification is checking whether an address is a real address by ...
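A minimal sketch of data verification; the reference list of cities is hypothetical and stands in for reference data such as valid postal addresses:

reference_cities = {"London", "Cambridge", "Manchester"}

records = ["London", "Lundon", "Cambridge"]
for value in records:
    status = "ok" if value in reference_cities else "not in reference data"
    print(value, "->", status)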
Despite the fact that some problems can be automatically detected, and the correction methods can also be automated, the whole process cannot be carried out automatically without human intervention in most cases. For example, in finding a common spelling error in many different instances of a word, a human is often needed to develop the correct regular expression that automatically finds and replaces all the incorrect instances. So, between the application of the automated assessment and improvement methods, there often exists a manual analysis and configuration step. The aim of this research was to provide a review of methods for DQ assessment and improvement and to identify gaps where there are no existing methods to address particular DQ problems.
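A minimal sketch of that manual configuration step: a human inspects a recurring misspelling and writes a regular expression that then corrects every instance automatically. The misspelling (‘adress’) and the sample records are hypothetical:

import re

pattern = re.compile(r"\badress\b", flags=re.IGNORECASE)   # authored by a human

records = ["shipping adress missing", "Adress verified", "address ok"]
cleaned = [pattern.sub("address", r) for r in records]
print(cleaned)   # ['shipping address missing', 'address verified', 'address ok']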