
MCA – 27

Q-1
OLAP, short for Online Analytical Processing, is a category of software tools that provides analysis of data stored in a database.
OLAP tools enable users to analyze different dimensions of multidimensional data. For example, they provide time-series
and trend-analysis views. OLAP is often used in data mining.
The chief component of OLAP is the OLAP server, which sits between a client and a database management system
(DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the
data. OLAP servers are available for nearly all the major database systems.
Q-2
Meta learning was originally described by Donald B. Maudsley (1979) as "the process by which learners become aware of
and increasingly in control of habits of perception, inquiry, learning, and growth that they have internalized".[1] Maudsley
sets the conceptual basis of his theory as synthesized under the headings of assumptions, structures, change process, and
facilitation. Five principles were enunciated to facilitate meta-learning. Learners must:
(a) have a theory, however primitive;
(b) work in a safe supportive social and physical environment;
(c) discover their rules and assumptions;
(d) reconnect with reality-information from the environment; and
(e) reorganize themselves by changing their rules/assumptions.
The idea of meta learning was later used by John Biggs (1985) to describe the state of "being aware of and taking control
of one’s own learning".[2] You can define meta learning as an awareness and understanding of the phenomenon of learning
itself as opposed to subject knowledge. Implicit in this definition is the learner’s perception of the learning context, which
includes knowing what the expectations of the discipline are and, more narrowly, the demands of a given learning task.
Within this context, meta learning depends on the learner’s conceptions of learning, epistemological beliefs, learning
processes and academic skills, summarized here as a learning approach. A student who has a high level of meta learning
awareness is able to assess the effectiveness of her/his learning approach and regulate it according to the demands of the
learning task. Conversely, a student who is low in meta learning awareness will not be able to reflect on her/his learning
approach or the nature of the learning task set. In consequence, s/he will be unable to adapt successfully when studying
becomes more difficult and demanding.
Q-5
Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational
database or other information repository. An example of an association rule would be "If a customer buys a dozen eggs,
he is 80% likely to also purchase milk."
An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data.
A consequent is an item that is found in combination with the antecedent.
Association rules are created by analyzing data for frequent if/then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an indication of how frequently the
items appear in the database. Confidence indicates how often the if/then statement has been found to be true, i.e. the
proportion of records containing the antecedent that also contain the consequent.
In data mining, association rules are useful for analyzing and predicting customer behavior. They play an important part in
shopping basket data analysis, product clustering, catalog design and store layout.
Programmers use association rules to build programs capable of machine learning. Machine learning is a type of artificial
intelligence (AI) that seeks to build programs with the ability to improve automatically from experience, without being
explicitly programmed.
Q-6
In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset
selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
Feature selection techniques are used for four reasons:
 simplification of models to make them easier to interpret by researchers/users,[1]
 shorter training times,
 to avoid the curse of dimensionality,
 enhanced generalization by reducing overfitting[2] (formally, reduction of variance[1])
The central premise when using a feature selection technique is that the data contains many features that are
either redundant or irrelevant, and can thus be removed without incurring much loss of
information.[2] Redundant and irrelevant features are two distinct notions, since one relevant feature may be redundant
in the presence of another relevant feature with which it is strongly correlated.[3]
Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features
from functions of the original features, whereas feature selection returns a subset of the features. Feature selection
techniques are often used in domains where there are many features and comparatively few samples (or data points).
Archetypal cases for the application of feature selection include the analysis of written texts and DNA
microarray data, where there are many thousands of features, and a few tens to hundreds of samples.
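One simple filter-style technique, shown here only as an illustrative sketch (the data matrix, its dimensions and the threshold are invented, and this is only one of many possible selection methods, not necessarily the best), is to drop features whose variance across the samples falls below a threshold, on the grounds that a nearly constant feature is unlikely to help the model:

/* Minimal filter-style feature selection sketch: keep only features whose
 * variance across the samples exceeds a threshold. Data and threshold are
 * illustrative only. */
#include <stdio.h>

#define N_SAMPLES  4
#define N_FEATURES 3

int main(void)
{
    /* Rows are samples, columns are features; feature 1 is nearly constant. */
    double X[N_SAMPLES][N_FEATURES] = {
        {1.0, 5.0, 0.2},
        {2.0, 5.0, 0.9},
        {3.0, 5.1, 0.4},
        {4.0, 5.0, 0.8}
    };
    double threshold = 0.05;

    for (int j = 0; j < N_FEATURES; j++) {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < N_SAMPLES; i++)
            mean += X[i][j];
        mean /= N_SAMPLES;
        for (int i = 0; i < N_SAMPLES; i++)
            var += (X[i][j] - mean) * (X[i][j] - mean);
        var /= N_SAMPLES;

        printf("feature %d: variance %.4f -> %s\n",
               j, var, var > threshold ? "keep" : "drop");
    }
    return 0;
}

In this toy data, feature 1 is nearly constant and would be dropped, while features 0 and 2 would be kept.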
Q-8
KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group
on Knowledge Discovery and Data Mining, the leading professional organization of data miners. Year to year archives
including datasets, instructions, and winners are available for most years.
KDD Cup 2016: Whose papers are accepted the most: towards measuring the impact of research institutions
Finding influential nodes in a social network for identifying patterns or maximizing information diffusion has been an
actively researched area with many practical applications. In addition to the obvious value to the advertising industry, the
research community has long sought mechanisms to effectively disseminate new scientific discoveries and technological
breakthroughs so as to advance our collective knowledge and elevate our civilization. For students, parents and funding
agencies that are planning their academic pursuits or evaluating grant proposals, having an objective picture of the
institutions in question is particularly essential. Partly against this backdrop we have witnessed that releasing a yearly
Research Institution or University Ranking has become a tradition for many popular newspapers, magazines and academic
institutes. Such rankings not only attract attention from governments, universities, students and parents, but also create
debates on the scientific correctness behind the rankings. The most criticized aspect of these rankings is: the data used and
the methodology employed for the ranking are mostly unknown to the public.
The 2016 KDD Cup will address this very important problem through publicly available datasets, such as the Microsoft
Academic Graph (MAG), a freely available dataset that includes information on academic publications and citations. This
dataset is a heterogeneous graph that can be used to study influential nodes of various types, including authors,
affiliations and venues; we choose to focus on affiliations in this competition. In effect, given a research field, we are
challenging the KDD Cup community to jointly develop data mining techniques to identify the best research institutions
based on their publications and how those publications are cited in research articles.
Join us in San Francisco!
Prizes
1st Place - $10,000
2nd Place - $6,500
3rd Place - $3,500

Q-9
Temporal Data Mining is a single step in the process of Knowledge Discovery in Temporal Databases that enumerates
structures (temporal patterns or models) over the temporal data, and any algorithm that enumerates temporal patterns
from, or fits models to, temporal data is a Temporal Data Mining Algorithm. Basically, temporal data mining is concerned
with the analysis of temporal data and with finding temporal patterns and regularities in sets of temporal data. Temporal
data mining techniques also allow for the possibility of computer-driven, automatic exploration of the data. Temporal
data mining has led to a new way of interacting with a temporal database: specifying queries at a much more abstract level
than, say, Temporal Structured Query Language (TSQL) permits (e.g., [17], [16]). It also facilitates data exploration for
problems that, due to their multiplicity and multi-dimensionality, would otherwise be very difficult for humans to explore,
regardless of the use of, or efficiency issues with, TSQL. Temporal data mining tends to work from the data up, and the
best-known techniques are those developed with an orientation towards large volumes of time-related data, making use of
as much of the collected temporal data as possible to arrive at reliable conclusions. The analysis process starts with a set
of temporal data and uses a methodology to develop an optimal representation of the structure of the data, during which
knowledge is acquired. Once temporal knowledge has been acquired, the process can be extended to a larger set of data,
working on the assumption that the larger data set has a structure similar to the sample data.
A relevant and important question is how to apply data mining techniques on a temporal database. According to the
techniques of data mining and the theory of statistical time series analysis, the theory of temporal data mining may involve
the following areas of investigation, since a general theory for this purpose is yet to be developed:
1. Temporal data mining tasks include:
• Temporal data characterization and comparison,
• Temporal clustering analysis,
• Temporal classification,
• Temporal association rules,
• Temporal pattern analysis, and
• Temporal prediction and trend analysis.
2. A new temporal data model (supporting time granularity and time-hierarchies) may need to be developed based on:
• Temporal data structures, and
• Temporal semantics.
3. A new temporal data mining concept may need to be developed based on the following issues:
• the task of temporal data mining can be seen as a problem of extracting an interesting part of the logical theory of a
model, and
• the theory of a model may be formulated in a logical formalism able to express quantitative knowledge and approximate
truth.
Q-11
A Decision Tree Classifier repetitively divides the working area (plot) into sub-parts by identifying splitting lines
(repetitively because there may be two distant regions of the same class separated by a region of another class).

So when does it terminate?

1. Either it has divided the area into classes that are pure (containing members of a single class only), or
2. some stopping criteria on the classifier attributes are met.
1. Impurity
In the division above we had a clear separation of classes, but what if the classes cannot be separated so cleanly?
Impurity is when traces of one class end up inside the division of another. This can arise for the following reasons:
1. We run out of available features to divide the classes on.
2. We tolerate some percentage of impurity (we stop further division) for faster performance (there is always a trade-off
between accuracy and performance).
For example, in the second case we may stop dividing when only a small number of elements is left in a region. A
common numerical measure of such impurity is the Gini impurity, G = 1 - Σ p_i², where p_i is the fraction of elements of
class i in the region.

[Figure: division of the working area based on some features.]


2. Entropy
Entropy is the degree of randomness of the elements, or in other words it is a measure of impurity. Mathematically, it can
be calculated from the probabilities of the items as

H = - Σ p(x) log p(x)

where p(x) is the probability of item x: the negative summation, over all items x, of the probability of item x times the
logarithm of that probability.
For example, if the faces observed in four throws of a die are 1, 1, 2, 3, then
p(1) = 0.5
p(2) = 0.25
p(3) = 0.25
entropy = -(0.5 * log(0.5)) - (0.25 * log(0.25)) - (0.25 * log(0.25))
= 0.45 using base-10 logarithms (with base-2 logarithms, the usual choice, the entropy is 1.5 bits).
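As a minimal sketch, the following C program computes both the entropy (with base-2 logarithms, so the result is 1.5 bits) and the Gini impurity of the distribution above; it assumes a C99 compiler and must be linked with the math library (e.g. -lm):

/* Minimal sketch: entropy and Gini impurity of the class distribution
 * p = {0.5, 0.25, 0.25} from the dice example above (faces 1, 1, 2, 3). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double p[] = {0.5, 0.25, 0.25};
    int n = 3;

    double entropy = 0.0, gini = 1.0;
    for (int i = 0; i < n; i++) {
        entropy -= p[i] * log2(p[i]);  /* H = -sum p_i * log2(p_i) */
        gini    -= p[i] * p[i];        /* G = 1 - sum p_i^2        */
    }

    printf("entropy = %.2f bits, gini = %.3f\n", entropy, gini);
    return 0;
}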
Q-12
Active learning is a form of learning in which teaching strives to involve students in the learning process more directly
than in other methods.
The term active learning "was introduced by the English scholar R W Revans (1907–2003)."[1] Bonwell (1991) "states that
in active learning, students participate in the process and students participate when they are doing something besides
passively listening." (Weltman, p. 7) Active learning is "a method of learning in which students are actively or
experientially involved in the learning process and where there are different levels of active learning, depending on
student involvement" (Bonwell & Eison 1991). In the Association for the Study of Higher Education (ASHE) report, the
authors discuss a variety of methodologies for promoting "active learning". They cite literature that indicates that to learn,
students must do more than just listen: they must read, write, discuss, or be engaged in solving problems. This relates to the
three learning domains referred to as knowledge, skills and attitudes (KSA), and this taxonomy of learning behaviours
can be thought of as "the goals of the learning process" (Bloom, 1956). In particular, students must engage in such higher-
order thinking tasks as analysis, synthesis, and evaluation.[2] Active learning engages students in two aspects – doing
things and thinking about the things they are doing.
Q-13
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called
a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of
exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine
learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer
graphics.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various
algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular
notions of clusters include groups with small distances between cluster members, dense areas of the data space, intervals
or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem.
The appropriate clustering algorithm and parameter settings (including parameters such as the distance function to use, a
density threshold or the number of expected clusters) depend on the individual data set and intended use of the results.
Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-
objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model
parameters until the result achieves the desired properties.
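As an illustrative sketch of one widely used clustering algorithm, k-means (not singled out above, and only one of many possible choices), the following C program clusters a small, invented one-dimensional data set into k = 2 groups; a real implementation would add a convergence test and more careful initialization:

/* Minimal k-means sketch: k = 2 clusters on 1-D data, fixed iteration count.
 * Real implementations need convergence tests and careful initialization. */
#include <stdio.h>
#include <math.h>

#define N 8
#define K 2

int main(void)
{
    double x[N] = {1.0, 1.2, 0.8, 1.1, 8.0, 8.3, 7.9, 8.1};
    double centroid[K] = {x[0], x[N - 1]};  /* crude initial guesses */
    int label[N];

    for (int iter = 0; iter < 10; iter++) {
        /* Assignment step: attach each point to its nearest centroid. */
        for (int i = 0; i < N; i++)
            label[i] = fabs(x[i] - centroid[0]) <= fabs(x[i] - centroid[1]) ? 0 : 1;

        /* Update step: move each centroid to the mean of its points. */
        for (int k = 0; k < K; k++) {
            double sum = 0.0;
            int count = 0;
            for (int i = 0; i < N; i++)
                if (label[i] == k) { sum += x[i]; count++; }
            if (count > 0)
                centroid[k] = sum / count;
        }
    }

    printf("centroids: %.2f and %.2f\n", centroid[0], centroid[1]);
    for (int i = 0; i < N; i++)
        printf("x = %.1f -> cluster %d\n", x[i], label[i]);
    return 0;
}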
Q-14
Forecasting is a planning tool that helps management in its attempts to cope with the uncertainty of the future, relying
mainly on data from the past and present and on analysis of trends.

Forecasting starts with certain assumptions based on the management's experience, knowledge, and judgment. These
estimates are projected into the coming months or years using one or more techniques such as Box-Jenkins models,
Delphi method, exponential smoothing, moving averages, regression analysis, and trend projection. Since any error in the
assumptions will result in a similar or magnified error in forecasting, the technique of sensitivity analysis is used which
assigns a range of values to the uncertain factors (variables).
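As a minimal sketch of two of the techniques named above, the following C program computes a 3-period moving-average forecast and a single-exponential-smoothing forecast over a short, invented sales series (the smoothing constant alpha = 0.3 is an arbitrary illustrative choice):

/* Minimal forecasting sketch: 3-period moving average and single
 * exponential smoothing over a short, invented sales series. */
#include <stdio.h>

#define N 6

int main(void)
{
    double sales[N] = {100, 104, 101, 108, 112, 110};
    double alpha = 0.3;  /* smoothing constant, chosen for illustration */

    /* 3-period moving average forecast for the next period. */
    double ma = (sales[N - 1] + sales[N - 2] + sales[N - 3]) / 3.0;

    /* Single exponential smoothing: s_t = alpha*x_t + (1 - alpha)*s_{t-1}. */
    double s = sales[0];
    for (int t = 1; t < N; t++)
        s = alpha * sales[t] + (1.0 - alpha) * s;

    printf("moving-average forecast: %.1f\n", ma);
    printf("exponentially smoothed forecast: %.1f\n", s);
    return 0;
}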
Q-15
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of
management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a
particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have
different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12
months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most
recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data
warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never
be altered.

MCA – 28
Q-1
In compiler theory, loop optimization is the process of increasing execution speed and reducing the overheads associated
with loops. It plays an important role in improving cache performance and making effective use of parallel
processing capabilities. Most of the execution time of a scientific program is spent in loops; as such, many compiler
optimization techniques have been developed to make them faster.
Common loop transformations include:
 Fission or distribution – loop fission attempts to break a loop into multiple loops over the same index range, but each new
loop takes only part of the original loop's body. This can improve locality of reference, both of the data being accessed in
the loop and the code in the loop's body.
 Fusion or combining – this combines the bodies of two adjacent loops that would iterate the same number of times
(whether or not that number is known at compile time), as long as they make no reference to each other's data; a small
before/after sketch follows this list.
 Interchange or permutation – these optimizations exchange inner loops with outer loops. When the loop variables index
into an array, such a transformation can improve locality of reference, depending on the array's layout.
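The following C sketch illustrates loop fusion on two invented, independent array loops; reading the transformation in the other direction illustrates loop fission. It is a hand-written example of the effect, not compiler output:

/* Illustrative sketch of loop fusion: the two loops in separate_loops
 * iterate over the same range and touch independent data, so a compiler
 * (or programmer) may combine them into one loop, as in fused_loop;
 * splitting the fused loop back apart would be loop fission. */
#define N 1000

void separate_loops(double a[N], double b[N], const double c[N])
{
    for (int i = 0; i < N; i++)   /* loop 1 */
        a[i] = c[i] * 2.0;
    for (int i = 0; i < N; i++)   /* loop 2: same range, independent data */
        b[i] = c[i] + 1.0;
}

void fused_loop(double a[N], double b[N], const double c[N])
{
    for (int i = 0; i < N; i++) { /* fused body: one pass over c[] */
        a[i] = c[i] * 2.0;
        b[i] = c[i] + 1.0;
    }
}

int main(void)
{
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++)
        c[i] = (double)i;
    separate_loops(a, b, c);
    fused_loop(a, b, c);
    return 0;
}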
Q-2
In compiler theory, dead code elimination (also known as DCE, dead code removal, dead code stripping, or dead code strip) is
a compiler optimization to remove code which does not affect the program results. Removing such code has several benefits: it
shrinks program size, an important consideration in some contexts, and it allows the running program to avoid executing
irrelevant operations, which reduces its running time. It can also enable further optimizations by simplifying program
structure. Dead code includes code that can never be executed (unreachable code) and code that only affects dead
variables (written to, but never read again), that is, code that is irrelevant to the program.
Examples
Consider the following example written in C.

int foo(void)
{
    int a = 24;
    int b = 25; /* Assignment to dead variable */
    int c;
    c = a * 4;
    return c;
    b = 24; /* Unreachable code */
    return 0;
}

Simple analysis of the uses of values would show that the value of b after the first assignment is not used inside foo.
Furthermore, b is declared as a local variable inside foo, so its value cannot be used outside foo. Thus, the
variable b is dead and an optimizer can reclaim its storage space and eliminate its initialization.
Furthermore, because the first return statement is executed unconditionally, no feasible execution path reaches the second
assignment to b. Thus, the assignment is unreachable and can be removed. If the procedure had a more complex control flow,
such as a label after the return statement and a goto elsewhere in the procedure, then a feasible execution path might exist to the
assignment to b.
Also, even though some calculations are performed in the function, their values are not stored in locations accessible outside
the scope of this function. Furthermore, given that the function returns a constant value (96), it may be simplified to the value it
returns (this simplification is called constant folding).
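As a sketch of the combined effect described above (dead-code elimination plus constant folding), an optimizer might effectively reduce foo to the equivalent of:

int foo(void)
{
    return 96; /* 24 * 4, computed at compile time */
}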
Q-3
An Intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to
represent source code. An IR is designed to be conducive for further processing, such as optimization and translation.[1] A
"good" IR must be accurate – capable of representing the source code without loss of information[2] – and independent of any
particular source or target language.[1] An IR may take one of several forms: an in-memory data structure, or a special tuple-
or stack-based code readable by the program.[3] In the latter case it is also called an intermediate language.
A canonical example is found in most modern compilers, where the linear human-readable text representing a program is
transformed into an intermediate graph structure that allows flow analysis and re-arrangement before creating a sequence of
actual CPU instructions. Use of an intermediate representation such as this allows compiler systems like the GNU Compiler
Collection and LLVM to be used by many different source languages to generate code for many different target architectures.
Intermediate language
An intermediate language is the language of an abstract machine designed to aid in the analysis of computer programs. The term
comes from their use in compilers, where the source code of a program is translated into a form more suitable for code-
improving transformations before being used to generate object or machine code for a target machine. The design of an
intermediate language typically differs from that of a practical machine language in three fundamental ways:
 Each instruction represents exactly one fundamental operation; e.g. "shift-add" addressing modes common
in microprocessors are not present (a small three-address sketch follows this list).
 Control flow information may not be included in the instruction set.
 The number of processor registers available may be large, even limitless.
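As an illustrative sketch (the quadruple layout and the temporary names t1 and t2 are invented for this example and do not correspond to any real compiler's IR), the C statement x = a + b * c; might be lowered to a three-address form in which each instruction performs exactly one fundamental operation:

/* Illustrative sketch of a three-address ("quadruple") IR for the C
 * statement  x = a + b * c;  The temporaries t1 and t2 and the struct
 * layout are invented for illustration, not any real compiler's IR. */
#include <stdio.h>

struct quad {
    char op;            /* operation: '*', '+', '=' */
    const char *arg1;   /* first operand            */
    const char *arg2;   /* second operand (or NULL) */
    const char *result; /* name receiving the value */
};

int main(void)
{
    /* Each instruction performs exactly one fundamental operation. */
    struct quad ir[] = {
        {'*', "b",  "c",  "t1"},   /* t1 = b * c  */
        {'+', "a",  "t1", "t2"},   /* t2 = a + t1 */
        {'=', "t2", NULL, "x"}     /* x  = t2     */
    };

    for (int i = 0; i < 3; i++)
        printf("%s = %s %c %s\n", ir[i].result, ir[i].arg1,
               ir[i].op == '=' ? ' ' : ir[i].op,
               ir[i].arg2 ? ir[i].arg2 : "");
    return 0;
}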

Q-6
The compilation process is a sequence of various phases. Each phase takes input from its previous stage, has its own
representation of the source program, and feeds its output to the next phase of the compiler. The main phases, in order, are
lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization and code generation,
supported throughout by the symbol table and the error handler.

Q-7
Syntax-directed translation refers to a method
of compiler implementation where the source
language translation is completely driven by
the parser.
A common method of syntax-directed translation
is translating a string into a sequence of actions by
attaching one such action to each rule of
a grammar.[1] Thus, parsing a string of the grammar
produces a sequence of rule applications. SDT
provides a simple way to attach semantics to any
such syntax.
Overview
Syntax-directed translation fundamentally works by adding actions to the productions in a context-free grammar, resulting in a
Syntax-Directed Definition (SDD).[2] Actions are steps or procedures that will be carried out when that production is used in a
derivation. A grammar specification embedded with actions to be performed is called a syntax-directed translation
scheme[1] (sometimes simply called a 'translation scheme').
Each symbol in the grammar can have an attribute, which is a value that is to be associated with the symbol. Common attributes
could include a variable type, the value of an expression, etc. Given a symbol X, with an attribute t, that attribute is referred to
as X.t
Thus, given actions and attributes, the grammar can be used for translating strings from its language by applying the actions and
carrying information through each symbol's attribute.
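A minimal sketch of this idea, using an invented toy grammar E -> E + T | T, T -> digit with the action E.val = E.val + T.val attached to the addition production, is the following recursive-descent evaluator in C (left recursion is rewritten as iteration, and the input string is hard-coded for illustration):

/* Minimal syntax-directed translation sketch for the toy grammar
 *     E -> E + T | T        { E.val = E.val + T.val }
 *     T -> digit            { T.val = digit value   }
 * Each parse function returns the synthesized attribute "val", so the
 * semantic actions run as the productions are recognized. The grammar,
 * input string and function names are invented for illustration. */
#include <stdio.h>

static const char *input = "3+4+2";
static int pos = 0;

static int parse_T(void)                 /* T -> digit */
{
    int val = input[pos] - '0';          /* action: T.val = digit value */
    pos++;
    return val;
}

static int parse_E(void)                 /* E -> T { + T } */
{
    int val = parse_T();                 /* action: E.val = T.val */
    while (input[pos] == '+') {
        pos++;                           /* consume '+' */
        val += parse_T();                /* action: E.val = E.val + T.val */
    }
    return val;
}

int main(void)
{
    printf("%s = %d\n", input, parse_E());   /* prints: 3+4+2 = 9 */
    return 0;
}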
Q-9
A directed acyclic graph (DAG) is a directed graph that contains no cycles. A rooted tree is a special
kind of DAG, and a DAG is a special kind of directed graph. For example, a DAG may be used to
represent common subexpressions in an optimising compiler.
[Figure: the expression a*b + f(a*b) drawn as a tree and as a DAG; in the DAG the common subexpression a*b is a single
shared node referenced from both places it is used. Caption: "Example of Common Subexpression."]

The common subexpression a*b need only be compiled once, but its value can be used twice.
A DAG can be used to represent prerequisites in a university course, constraints on operations to be
carried out in building construction, and in fact an arbitrary partial order `<'. An edge is drawn from a
to b whenever a<b. A partial order `<' satisfies:
(i) transitivity: a<b and b<c implies a<c
(ii) non-reflexivity: not(a<a)
These conditions prevent cycles, because v1<v2<...<vn<v1 would imply that v1<v1. The word `partial'
indicates that not every pair of values is ordered. Examples of partial orders are numerical less-than
(also a total order) and `subset-of'; note that {1,2} is a subset of {1,2,3} but that {1,2} and
{2,3} are incomparable, i.e. there is no order relationship between them.
Constraints for a small building example are given below.

[Figure: "Simplified Construction Constraints" — foundation precedes frame; frame precedes roof, brickwork and the
remaining tasks; plaster follows both roof and brickwork; paint comes last, after plaster, windows and doors.]

Note that no order is imposed between `roof' and `brick-work', but the plaster cannot be applied
until the walls are there for it to stick to and the roof exists to protect it.
Topological Sorting
A topological sort of a DAG is a (total) linear ordering of the vertices such that vi appears before
vj whenever there is an edge <vi,vj> (or whenever vi<vj).

[Figure: "Example Topological Sort" — one admissible ordering of the construction DAG is foundation, frame, roof,
brick-work, windows, plaster, doors, paint.]

Topological sorting can obviously be useful in the management of construction and manufacturing
tasks. It gives an allowable (total) order for carrying out the basic operations one at a time. There
may be several different topological sorts for a given DAG, but there must be at least one. Note that
there may be reasons to prefer one ordering to another and even to do some tasks simultaneously.
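One standard way to compute such an ordering is Kahn's algorithm: repeatedly output a vertex that has no remaining predecessors, then delete its outgoing edges. The C sketch below applies it to an assumed, simplified edge list for the construction example (the exact edges are an illustrative reconstruction of the figure, not a definitive constraint set):

/* Minimal topological-sort sketch (Kahn's algorithm): repeatedly output a
 * vertex with in-degree 0 and remove its outgoing edges. The edge list is a
 * simplified, assumed version of the construction example above. */
#include <stdio.h>

#define V 8
#define E 10

static const char *name[V] = {
    "foundation", "frame", "roof", "brickwork",
    "windows", "doors", "plaster", "paint"
};

int main(void)
{
    /* edges[i] = {from, to} means "from" must precede "to". */
    int edges[E][2] = {
        {0,1}, {1,2}, {1,3}, {1,5}, {3,4},
        {2,6}, {3,6}, {6,7}, {4,7}, {5,7}
    };
    int indegree[V] = {0};
    int done[V] = {0};

    for (int e = 0; e < E; e++)
        indegree[edges[e][1]]++;

    for (int printed = 0; printed < V; printed++) {
        int v = -1;
        for (int i = 0; i < V; i++)           /* find an unprinted vertex   */
            if (!done[i] && indegree[i] == 0) /* with no pending predecessors */
                { v = i; break; }
        if (v < 0) { printf("cycle detected\n"); return 1; }

        printf("%s ", name[v]);
        done[v] = 1;
        for (int e = 0; e < E; e++)           /* remove v's outgoing edges  */
            if (edges[e][0] == v)
                indegree[edges[e][1]]--;
    }
    printf("\n");
    return 0;
}

For this edge list the program prints one valid order: foundation frame roof brickwork windows doors plaster paint.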
Q-11
In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure
of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code. The
syntax is "abstract" in not representing every detail appearing in the real syntax. For instance, grouping parentheses are implicit
in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node
with three branches.
This distinguishes abstract syntax trees from concrete syntax trees, traditionally designated parse trees, which are typically built
by a parser during the source code translation and compiling process. Once built, additional information is added to the AST by
means of subsequent processing, e.g., contextual analysis.
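A minimal sketch of how such nodes might be declared in C (the node kinds, field names and helper function are invented for illustration and are not a fragment of any real compiler):

/* Minimal sketch of AST node types for a number, a binary operator and an
 * if-then-else construct; node kinds and field names are illustrative. */
#include <stdlib.h>

enum node_kind { NODE_NUM, NODE_BINOP, NODE_IF };

struct ast {
    enum node_kind kind;
    union {
        int value;                                            /* NODE_NUM   */
        struct { char op; struct ast *lhs, *rhs; } binop;     /* NODE_BINOP */
        struct { struct ast *cond, *then_branch, *else_branch; } if_stmt; /* NODE_IF */
    } u;
};

/* Helper: build a binary-operator node such as (a + b). */
struct ast *make_binop(char op, struct ast *lhs, struct ast *rhs)
{
    struct ast *n = malloc(sizeof *n);
    n->kind = NODE_BINOP;
    n->u.binop.op = op;
    n->u.binop.lhs = lhs;
    n->u.binop.rhs = rhs;
    return n;
}

int main(void)
{
    /* Builds the tree for the expression 1 + 2; any grouping parentheses in
     * the source would not appear as nodes, since the tree structure already
     * encodes the grouping. */
    struct ast *one = malloc(sizeof *one);
    one->kind = NODE_NUM; one->u.value = 1;
    struct ast *two = malloc(sizeof *two);
    two->kind = NODE_NUM; two->u.value = 2;
    struct ast *sum = make_binop('+', one, two);
    (void)sum;
    return 0;
}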
CPM Phase Tree
This model shows how to use the Eye Diagram block to view the phase trajectory, phase tree, and instantaneous frequency of a CPM
modulated signal.
Structure of the Example
This example, doc_cpm_phase_tree, uses various Communications System Toolbox, DSP System Toolbox, and
Simulink blocks to model a baseband CPM signal.
In particular, the example model includes these blocks:
 Random Integer Generator block
 Integer to Bit Converter block
 CPM Modulator Baseband block
 Complex to Magnitude-Angle block
 Phase Unwrap block

Q-13
In computing, a hash table (hash map) is a data structure that implements an associative array abstract data type, a structure
that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which
the desired value can be found.
Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash
function, which might cause hash collisions where the hash function generates the same index for more than one key. Such
collisions must be accommodated in some way.
In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of
elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at
(amortized[2]) constant average cost per operation.[3][4]
In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For
this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database
indexing, caches, and sets.
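A minimal sketch of the core mechanism in C (the djb2-style hash function, the fixed table size and the chaining strategy are illustrative choices among many possibilities; deletion and resizing are omitted):

/* Minimal hash-table sketch: hash a string key to a bucket index and chain
 * colliding entries in a linked list. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N_BUCKETS 16

struct entry {
    const char *key;
    int value;
    struct entry *next;      /* next entry in the same bucket (collision chain) */
};

static struct entry *table[N_BUCKETS];

static unsigned hash(const char *key)
{
    unsigned h = 5381;
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return h % N_BUCKETS;
}

static void put(const char *key, int value)
{
    struct entry *e = malloc(sizeof *e);
    unsigned b = hash(key);
    e->key = key;
    e->value = value;
    e->next = table[b];      /* insert at head of the bucket's chain */
    table[b] = e;
}

static int get(const char *key, int *value)
{
    for (struct entry *e = table[hash(key)]; e; e = e->next)
        if (strcmp(e->key, key) == 0) { *value = e->value; return 1; }
    return 0;                /* key not present */
}

int main(void)
{
    int v;
    put("apples", 3);
    put("milk", 1);
    if (get("milk", &v))
        printf("milk -> %d\n", v);
    return 0;
}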
Q-16
In computer science, LR parsers are a type of bottom-up parser that efficiently read deterministic context-free languages, in
guaranteed linear time.[1] There are several variants of LR parsers: SLR parsers, LALR parsers, Canonical
LR(1) parsers, Minimal LR(1) parsers, GLR parsers. LR parsers are generated by a parser generator which reads a formal
grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.
The name LR is an initialism. The L means that the parser reads input text in one direction without backing up; that direction is
typically Left to right within each line, and top to bottom across the lines of the full input file. (This is true for most parsers.)
The R means that the parser produces a Rightmost derivation in reverse: it does a bottom-up parse - not a top-down LL parse or
ad-hoc parse. The name LR is often followed by a numeric qualifier, as in LR(1) or sometimes LR(k). To avoid backtracking or
guessing, the LR parser is allowed to peek ahead at k lookahead input symbols before deciding how to parse earlier symbols.
Typically k is 1 and is not mentioned. The name LR is often preceded by other qualifiers, as in SLR and LALR.
LR parsers are deterministic; they produce a single correct parse without guesswork or backtracking, in linear time. This is ideal
for computer languages, but LR parsers are not suited for human languages which need more flexible but inevitably slower
methods. Some methods which can parse arbitrary context-free languages (e.g., Cocke-Younger-Kasami, Earley, GLR) have
worst-case performance of O(n³) time. Other methods which backtrack or yield multiple parses may even take exponential time
when they guess badly.
Q-14
Context-free grammars can generate context-free languages. They do this by taking a set of variables which are
defined recursively, in terms of one another, by a set of production rules. Context-free grammars are named as such
because any of the production rules in the grammar can be applied regardless of context—it does not depend on any
other symbols that may or may not be around a given symbol that is having a rule applied to it.
Context-free grammars have the following components:
 A set of terminal symbols which are the characters that appear in the language/strings generated by the
grammar. Terminal symbols never appear on the left-hand side of the production rule and are always on the
right-hand side.
 A set of nonterminal symbols (or variables) which are placeholders for patterns of terminal symbols that can
be generated by the nonterminal symbols. These are the symbols that will always appear on the left-hand side
of the production rules, though they can also be included on the right-hand side. The strings that a CFG
ultimately produces contain only symbols from the set of terminal symbols.
 A set of production rules which are the rules for replacing nonterminal symbols. Production rules have the
following form: variable → string of variables and terminals.
 A start symbol which is a special nonterminal symbol that appears in the initial string generated by the
grammar.
For example, a grammar with start symbol S and the productions S → aSb and S → ε generates the strings ε, ab, aabb,
aaabbb, …, i.e. aⁿbⁿ for every n ≥ 0.
