Cloudera

DS-200 PRACTICE EXAM


Data Science Essentials

________________________________________________________________________________________________

https://www. pass4sures.com/

Question 1

Why should you stop an iterative machine learning algorithm as soon as the performance of the model on a test set
stops improving?

A. To avoid the need for cross-validating the model
B. To prevent overfitting
C. To increase the VC (Vapnik-Chervonenkis) dimension for the model
D. To keep the number of terms in the model as small as possible
E. To maintain the highest VC (Vapnik-Chervonenkis) dimension for the model

Answer: B

Question 2

What is the default delimiter for Hive tables?

A. ^A (Control-A)
B. , (comma)
C. \t (tab)
D. : (colon)

Answer: A

Explanation:
Reference:
http://blog.spryinc.com/2013/10/four-useful-tricks-for-working-with-hive.html (change the delimiter when exporting
a Hive table)

Question 3

Certain individuals are more susceptible to autism if they have particular combinations of genes expressed in their
DNA. Given a sample of DNA from persons who have autism and a sample of DNA from persons who do not have
autism, determine the best technique for predicting whether or not a given individual is susceptible to developing
autism?

A. Naïve Bayes
B. Linear Regression
C. Survival analysis
D. Sequence alignment

Answer: B

Question 4

You are working with a logistic regression model to predict the probability that a user will click on an ad. Your model
has hundreds of features, and you’re not sure if all of those features are helping your prediction. Which regularization
technique should you use to prune features that aren’t contributing to the model?

A. Convex
B. Uniform
C. L2
D. L1

Answer: A

Question 5

Refer to the exhibit.

Which point in the figure is the median?

A. A
B. B
C. C

Answer: A

Question 6

Refer to the exhibit.

Which point in the figure is the mode?

A. A
B. B
C. C

Answer: C

Question 7

Refer to the exhibit.

Which point in the figure is the mean?

A. A
B. B
C. C

Answer: B

Question 8

Under what two conditions does stochastic gradient descent outperform 2nd-order optimization techniques such as
iteratively reweighted least squares?

A. When the volume of input data is so large and diverse that a 2nd-order optimization technique can be fit to a
sample of the data
B. When the model’s estimates must be updated in real-time in order to account for new observations.
C. When the input data can easily fit into memory on a single machine, but we want to calculate confidence intervals
for all of the parameters in the model.
D. When we are required to find the parameters that return the optimal value of the objective function.

Answer: A,B

Question 9

What is the result of the following command (the database username is foo and password is bar)?
$ sqoop list-tables --connect jdbc:mysql://localhost/databasename --table --username foo --password bar

A. sqoop lists only those tables in the specified MySQL database that have not already been imported into HDFS
B. sqoop returns an error
C. sqoop lists the available tables from the database
D. sqoop imports all the tables from SQL into HDFS

Answer: C

Explanation:
Reference:
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-15/getting-sqoop

Question 10

What is the most common reason for a k-means clustering algorithm to return a sub-optimal clustering of its input?

A. Non-negative values for the distance function
B. Input data set is too large
C. Non-normal distribution of the input data
D. Poor selection of the initial centroids

Answer: C

Question 11

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute myeloid leukemia (AML),
both variants of a blood cancer.
The makeup of the groups is as follows:

Each individual has an expression value for each of 10000 different genes. The expression value for each gene is a
continuous value between -1 and 1.
You’ve built your model for discriminating between AML and ALL patients and you find that it works quite well on
your current data. One month later, a collaborator tells you she has fresh data from 100 new AML/ALL patients. You
run the samples through your model, and it turns out your model has very poor predictive accuracy on the new
samples; specifically, your model predicts that all males have ALL. What is the most reliable way to fix this problem?

A. Change the distance metric
B. Reduce the number of dimensions
C. Use a Gibbs sampler on a Bayesian network
D. Perform matched sampling across other provided variables

Answer: D

Question 12

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute myeloid leukemia (AML),
both variants of a blood cancer.
The makeup of the groups is as follows:

Each individual has an expression value for each of 10000 different genes. The expression value for each gene is a
continuous value between -1 and 1.
You want to use the data from the 52 patients in the scenario to improve the ability of doctors to
distinguish between ALL and AML. What type of data science problem is this?

A. Classification
B. Regression
C. Clustering
D. Filtering

Answer: D

Question 13

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute myeloid leukemia (AML),
both variants of a blood cancer.
The makeup of the groups is as follows:

Each individual has an expression value for each of 10000 different genes. The expression value for each gene is a
continuous value between -1 and 1.
With which type of plot can you encode the most data visually?

A. A heat map sorting the individuals by group
B. A histogram of the expression values
C. A scatter plot of the two largest principal components

Answer: C

Question 14

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute myeloid leukemia (AML),
both variants of a blood cancer.
The makeup of the groups is as follows:

Each individual has an expression value for each of 10000 different genes. The expression value for each gene is a
continuous value between -1 and 1.
Rather than use all 10,000 features to separate AML from ALL, you pick a small subset of features to separate them
optimally. Your feature vectors have 10,000 dimensions while you only have 52 data points. You use cross-validation to
test your chosen set of features. What three methods will choose the features in an optimal way?

A. Singular Value Decomposition
B. Bootstrapping
C. Markov chain Monte Carlo
D. Hidden Markov
E. Bayesian Information Criterion
F. Mutual Information

Answer: C,D,F

Question 15

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute myeloid leukemia (AML),
both variants of a blood cancer.
The makeup of the groups is as follows:

Each individual has an expression value for each of 10000 different genes. The expression value for each gene is a
continuous value between -1 and 1.
You choose to perform agglomerative hierarchical clustering on the 10,000 features. How much RAM do you need to
hold the distance matrix, assuming each distance value is a 64-bit double?

A. ~ 800 MB
B. ~ 400 MB
C. ~ 160 KB
D. ~ 4 MB

Answer: B
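
The memory estimate can be checked with a few lines of arithmetic. The sketch below assumes 10,000 items, 8-byte doubles, and a condensed layout that stores each unordered pair once; the function name is illustrative only.

```python
# Back-of-the-envelope memory estimate for a pairwise distance matrix.
# Assumption: only the n*(n-1)/2 distinct pairs are stored (condensed form).

def condensed_distance_bytes(n_items: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for the upper triangle of an n x n distance matrix."""
    n_pairs = n_items * (n_items - 1) // 2
    return n_pairs * bytes_per_value

n_features = 10_000                       # items being clustered
condensed = condensed_distance_bytes(n_features)
full = n_features * n_features * 8        # naive full square matrix
print(f"condensed: {condensed / 1e6:.0f} MB, full square: {full / 1e6:.0f} MB")
```

Storing the full square matrix doubles the requirement, which is why the two largest options differ by a factor of two.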

Question 16

You have a large m x n data matrix M. You decide you want to perform dimension reduction/clustering on your data
and have decided to use the singular value decomposition (SVD; also called principal components analysis, PCA).
You performed singular value decomposition (SVD; also called principal components analysis or PCA) on your data
matrix but you did not center your data first. What does your first singular component describe?

A. The mean of the data set
B. The variance of the data set
C. The standard deviation of the data set
D. The maximum of the data set
E. The median of the data set

Answer: C

Question 17

You have a large m x n data matrix M. You decide you want to perform dimension reduction/clustering on your data
and have decided to use the singular value decomposition (SVD; also called principal components analysis, PCA).
Refer to the passage above.
What represents the SVD of the matrix M given the following information:
U is m x m unitary
V is n x n unitary
S is m x n diagonal
Q is n x n invertible
D is n x n diagonal
L is m x m lower triangular
U is m x m upper triangular

A. M = U S V
B. M = U P
C. M = Q D Q^-1
D. M = L U

Answer: A

Question 18

You have a large m x n data matrix M. You decide you want to perform dimension reduction/clustering on your data
and have decided to use the singular value decomposition (SVD; also called principal components analysis, PCA).
For the moment, assume that your data matrix M is 500 x 2. The figure below shows a plot of the data.

Which line represents the second principal component?

A. Blue
B. Yellow

Answer: A

Question 19

Many machine learning algorithms involve finding the global minimum of a convex loss function, primarily because:

A. The additive inverse of a convex function is concave
B. The derivative of a convex function is always defined
C. The second derivative of a convex function is a constant
D. Any local minimum of a convex function is also a global minimum

Answer: B

Question 20

Which two techniques should you use to avoid overfitting a classification model to a data set?

A. Include a small number of “noise” features that are not thought to be correlated with the dependent variable.
B. Replicate features that are thought to be significant predictors of the dependent variable multiple times for each
observation.
C. Separate your input data into a training set that is used for fitting and a test set that is used for evaluating the
model’s performance
D. Include a regularization term in the model’s objective function to control how precisely the model fits the data
E. Preprocess the data to exclude atypical observations from the model input

Answer: A,E

Question 21

You are building a k-nearest neighbor classifier (k-NN) on a labeled set of points in a high-dimensional space. You
determine that the classifier has a large error on the training data. What is the most likely problem?

A. High-dimensional spaces effectively make local neighborhoods global
B. k-NN computation does not converge in high dimensions
C. k was too small
D. The VC-dimension of a k-NN classifier is too high

Answer: B

Question 22

Which best describes the primary function of Flume?

A. Flume is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis
programs, coupled with an infrastructure consisting of sources and sinks for importing and evaluating large data sets
B. Flume acts as a Hadoop filesystem for log files
C. Flume imports data from SQL/relational databases into your Hadoop cluster
D. Flume provides a query language for Hadoop similar to SQL
E. Flume is a distributed server for collecting and moving large amounts of data into HDFS as it’s produced from
streaming data flows

Answer: D

Question 23

You have a directory containing a number of comma-separated files. Each file has three columns and each filename
has a .csv extension. You want to have a single tab-separated file (all.tsv) that contains all the rows from all the files.
Which command is guaranteed to produce the desired output if you have more than 20,000 files to process?

A. find . -name '*.csv' -print0 | xargs -0 cat | tr ',' '\t' > all.tsv
B. find . -name 'name*.csv' | cat | awk 'BEGIN {FS = "," OFS = "\t"} {print $1, $2, $3}' > all.tsv
C. find . -name '*.csv' | tr ',' '\t' | cat > all.tsv
D. find . -name '*.csv' | cat > all.tsv
E. cat *.csv > all.tsv

Answer: B

Question 24

What are three benefits of running feature selection analysis before fitting a classification model?

A. Eliminates the need to include a regularization term
B. Reduces the number of subjective decisions required to construct the model
C. Guarantees the optimality of the final model
D. Speeds up the model fitting process
E. Develops an understanding of the importance of different features
F. Improves the predictive performance of the model

Answer: D,E,F

Question 25

When optimizing a function using stochastic gradient descent, how frequently should you update your estimate of the
gradient?

A. Once after every pass through the data set
B. Once per observation
C. For each observation with a probability that you choose ahead of time
D. After a random number of observations
E. Once every N observations, where you decide N ahead of time

Answer: A,C
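
To make the update schedules concrete, here is a minimal sketch of stochastic gradient descent for a one-variable linear model. The function name, the `batch_size` parameter, the learning rate, and the toy data are all assumptions made for illustration; `batch_size=1` updates once per observation and `batch_size=N` updates once every N observations.

```python
import random

def sgd_fit(xs, ys, batch_size=1, lr=0.5, epochs=500, seed=0):
    """Fit y ~ w*x + b by SGD, updating once every `batch_size` observations."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit observations in a fresh random order each pass
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Average gradient of the squared error 0.5*(w*x + b - y)^2 over the batch.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [i / 50 for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]          # noise-free toy data: slope 2, intercept 1
print(sgd_fit(xs, ys, batch_size=1))      # update once per observation
print(sgd_fit(xs, ys, batch_size=10))     # update once every N = 10 observations
```

Both schedules recover the slope and intercept on this toy problem; they differ in how many parameter updates are made per pass over the data.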

Question 26

In what format are web server log files usually generated and how must you transform them in order to make them
usable for analysis in Hadoop?

A. XML files that you need to convert to JSON
B. Text files that require parsing into useful fields
C. CSV files that require parsing into useful fields
D. HTML files that you need to convert to plain text or CSV
E. Binary files that may require decompression and conversion using Avro

Answer: A,B

Question 27

Which recommender system technique is domain specific?

A. Content-based collaborative filtering
B. Item-based collaborative filtering
C. User-based collaborative filtering
D. Naïve Bayes classifier

Answer: C

Explanation:
Reference:
http://www.cs.cmu.edu/~srosenth/papers/RosenthallRecSys20.pdf

Question 28

You are about to sample a 100-dimensional unit cube. To adequately sample any single given dimension, you need only
capture 10 points. How many points do you need in order to sample the complete 100-dimensional unit cube
adequately?

A. 100^10
B. 10^10
C. log2(100)
D. 100
E. 1000
F. 10^10

Answer: E

Question 29

You have acquired a new data source of millions of customer records, and you’ve loaded this data into HDFS. Prior to
analysis, you want to change all customer registration dates to the same date format, make all addresses uppercase,
and remove all customer names (for anonymization). Which process will accomplish all three objectives?

A. Adapt the data cleansing module in Mahout to your data, and invoke the Mahout library when you run your
analysis
B. Pull this data into an RDBMS using Sqoop and scrub records using stored procedures
C. Write a script that receives records on stdin, corrects them, and then writes them to stdout.
Then, invoke this script in a map-only Hadoop Streaming job
D. Write a MapReduce job with a mapper to change words to uppercase and to reduce different forms of dates to a
single form

Answer: C
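
A stdin-to-stdout cleaning script of the kind described in option C might look like the sketch below. The record layout (name, registration date, address, separated by tabs), the list of accepted date formats, and all function names are assumptions for illustration; the question does not specify them.

```python
import io
import sys
from datetime import datetime

# Hypothetical record layout: name<TAB>registration_date<TAB>address
# Date formats expected in the input; this list is an assumption.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or the stripped input if it cannot be parsed."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return text.strip()

def clean_record(line):
    """Drop the name, normalize the date, and uppercase the address."""
    _name, reg_date, address = line.rstrip("\n").split("\t", 2)
    return "\t".join([normalize_date(reg_date), address.upper()])

def run(stream_in, stream_out):
    """Read records from stream_in and write cleaned records to stream_out."""
    for raw in stream_in:
        if raw.strip():
            stream_out.write(clean_record(raw) + "\n")

# As a streaming mapper this would be called as run(sys.stdin, sys.stdout);
# here a small in-memory stream stands in for the input.
run(io.StringIO("Jane Doe\t03/15/2012\t12 Main St\n"), sys.stdout)
```

Because each record is processed independently, no reducer is needed, which is what makes a map-only job sufficient.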

Question 30

A company has 20 software engineers working to fix bugs on a project. Over the past week, the team has fixed 100 bugs.
Although the average number of bugs fixed per engineer is five, none of the engineers fixed exactly five bugs last week.
You want to understand how productive each engineer is at fixing bugs. What is the best way to visualize the
distribution of bug fixes per engineer?

A. A bar chart of engineers vs. number of bugs fixed
B. A scatter plot of engineers vs. number of bugs fixed
C. A normal distribution of the mean and standard deviation of bug fixes per engineer
D. A histogram that groups engineers together based on the number of bugs they fixed

Answer: A

Question 31

A company has 20 software engineers working to fix bugs on a project. Over the past week, the team has fixed 100 bugs.
Although the average number of bugs fixed per engineer is five, none of the engineers fixed exactly five bugs last week.
One engineer points out that some bugs are more difficult to fix than others. What metric should you use to estimate
how hard a particular bug is to fix?

A. The tech lead’s estimate of how many hours would be needed to fix the bug.
B. The priority of the bug according to the project manager
C. The number of years that the engineer who was assigned the bug has worked at the company
D. The number of bugs that had been found in each sub-component of the project

Answer: D

Question 32

In what way can Hadoop be used to improve the performance of Lloyd’s algorithm for k-means clustering on large
data sets?

A. Parallelizing the centroid computations to improve numerical stability
B. Distributing the updates of the cluster centroids
C. Reducing the number of iterations required for the centroids to converge
D. Mapping the input data into a non-Euclidean metric space

Answer: B
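
Lloyd's algorithm alternates an assignment step and a centroid-update step, which map naturally onto a map phase and a reduce phase. The sketch below runs both steps in memory on one-dimensional points; it is a simplified illustration with hypothetical function names, not Hadoop code.

```python
from collections import defaultdict

def assign(point, centroids):
    """'Map' step: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda j: (point - centroids[j]) ** 2)
    return nearest, point

def update(assignments, centroids):
    """'Reduce' step: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for j, point in assignments:
        groups[j].append(point)
    # A centroid with no assigned points keeps its previous position.
    return [sum(groups[j]) / len(groups[j]) if groups[j] else c
            for j, c in enumerate(centroids)]

def lloyd(points, centroids, max_iter=100):
    """Repeat assign/update until the centroids stop moving."""
    for _ in range(max_iter):
        new_centroids = update([assign(p, centroids) for p in points], centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(lloyd(points, [1.0, 12.0]))  # [2.0, 11.0]
```

The assignment step touches each point independently, and each centroid's average depends only on its own group, so both steps can be spread across many machines within one iteration.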

Question 33

You have a data file that contains two trillion records, one record per line (comma separated).
Each record lists two friends and a unique message sent between them. Their names will not have commas.
Michael, John, Pabst, Blue Ribbon
Tiffany, James, BMX Racing
John, Michael, Natural Lemon Flavor
Analyze the pseudo code examples below and determine which set of mappers and reducers in the below pseudo
code snippets will solve for the mean number of messages each user sends to all of his friends?
For example, Michael may have three friends to whom he sends 6, 10, and 200 messages, respectively, so
Michael’s mean would be (6+10+200)/3. The solution may require a pipeline of two MapReduce jobs.

A. def mapper1(line):
       key1, key2, message = line.split(',')
       emit((key1, key2), 1)
   def reducer1(key, values):
       emit(key, sum(values))
   def mapper2(key, value):
       key1, key2 = key  # unpack both friends' names into separate keys
       emit(key1, value)
   def reducer2(key, values):
       emit(key, mean(values))
B. def mapper1(line):
       key1, key2, message = line.split(',')
       emit((key1, key2), 1)
       emit((key1, key2), 1)
   def reducer1(key, values):
       emit(key, sum(values))
   def mapper2(key, value):
       key1, key2 = key  # unpack both friends' names into separate keys
       emit(key1, value)
   def reducer2(key, values):
       emit(key, mean(values))
C. def mapper1(line):
       key1, key2, message = line.split(',')
       emit((key1, key2), 1)
       emit((key1, key2), 1)
   def reducer1(key, values):
       emit(key, sum(values))
D. def mapper1(line):
       key1, key2, message = line.split(',')
       sort(key1, key2)  # a given pair will always be sorted the same
       emit((key1, key2), 1)
   def reducer1(key, values):
       emit(key, sum(values))
   def mapper2(key, value):
       key1, key2 = key  # unpack both friends' names into separate keys
       emit(key1, value)
       emit(key2, value)
   def reducer2(key, values):
       emit(key, mean(values))

Answer: B
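
To experiment with a two-job pipeline like the ones above, the following sketch simulates MapReduce in memory. `run_job` is a hypothetical stand-in for the framework, the first name on each line is assumed to be the sender, and the last two input lines are made-up records added so that one user has more than one friend. It illustrates the count-then-average structure only and is not offered as the answer key.

```python
from collections import defaultdict
from statistics import mean

def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key, values in groups.items() for out in reducer(key, values)]

def mapper1(line):
    # Split on the first two commas only, since the message may contain commas.
    sender, receiver, _message = [part.strip() for part in line.split(",", 2)]
    yield (sender, receiver), 1

def reducer1(pair, counts):
    yield pair, sum(counts)          # messages per (sender, receiver) pair

def mapper2(record):
    (sender, _receiver), count = record
    yield sender, count

def reducer2(sender, counts):
    yield sender, mean(counts)       # mean messages per friend for this sender

lines = [
    "Michael, John, Pabst, Blue Ribbon",
    "Tiffany, James, BMX Racing",
    "John, Michael, Natural Lemon Flavor",
    "Michael, John, Second message",   # hypothetical extra record
    "Michael, Tiffany, Hello",         # hypothetical extra record
]
per_pair = run_job(lines, mapper1, reducer1)
print(dict(run_job(per_pair, mapper2, reducer2)))
```

Here Michael sends two messages to John and one to Tiffany, so his mean is 1.5 messages per friend.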

Question 34

You have just run a MapReduce job to filter user messages to only those of a selected geographical region. The output
for this job is in a directory named westUsers, located just below your home directory in HDFS. Which command gathers
these records into a single file on your local file system?

A. hadoop fs -getmerge westUsers westUsers.txt
B. hadoop fs -get westUsers westUsers.txt
C. hadoop fs -cp westUsers/* westUsers.txt
D. hadoop fs -getmerge -R westUsers westUsers.txt

Answer: B

Question 35

A function f is convex if the line segment between any two points (a, f(a)) and (b, f(b)) on its graph lies on or
above the graph of f between a and b.

Which two functions are convex?

A. x^(1/2)
B. e^x
C. 2x-1
D. 1-x^2

Answer: A
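
The chord condition for convexity, f(t·a + (1−t)·b) ≤ t·f(a) + (1−t)·f(b), can be probed numerically. The sketch below checks it on a grid for a few sample functions; the helper name, intervals, and tolerance are arbitrary choices, and passing the check is evidence rather than proof.

```python
import math

def looks_convex(f, lo, hi, samples=50):
    """Numerically test f(t*a + (1-t)*b) <= t*f(a) + (1-t)*f(b) on a grid.

    Passing is only evidence of convexity on [lo, hi], not a proof.
    """
    grid = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    for a in grid:
        for b in grid:
            for t in (0.25, 0.5, 0.75):
                chord = t * f(a) + (1 - t) * f(b)
                # Small tolerance absorbs floating-point rounding.
                if f(t * a + (1 - t) * b) > chord + 1e-9:
                    return False
    return True

print(looks_convex(math.exp, -3, 3))              # exponential
print(looks_convex(math.sqrt, 0, 9))              # square root
print(looks_convex(lambda x: 1 - x * x, -3, 3))   # downward parabola
print(looks_convex(lambda x: 2 * x - 1, -3, 3))   # straight line
```

A straight line satisfies the inequality with equality, so it counts as convex (and also concave).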

Question 36

You need to analyze 60,000,000 images stored in JPEG format, each of which is approximately 25 KB. Because your
Hadoop cluster isn't optimized for storing and processing many small files you decide to do the following actions:
1. Group the individual images into a set of larger files
2. Use the set of larger files as input for a MapReduce job that processes them directly with Python using Hadoop
Streaming
Which data serialization system gives you the flexibility to do this?

A. CSV
B. XML
C. HTML
D. Avro
E. Sequence Files
F. JSON

Answer: B,F

Question 37

You have user profile records in an OLTP database that you want to join with web server logs which you have already
ingested into HDFS. What is the best way to acquire the user profiles for use in HDFS?

A. Ingest with Hadoop Streaming
B. Ingest with Apache Flume
C. Ingest using Hive’s LOAD DATA command
D. Ingest using Sqoop
E. Ingest using Pig’s LOAD command

Answer: B,D

Explanation:
Reference:
https://thinkbiganalytics.com/leading_big_data_technologies/ingestion-and-streaming-with-storm-kafka-flume/

Question 38

You are building a system to perform outlier detection for a large online retailer. You need to build a system to detect
if the total dollar value of sales is outside the norm for each U.S. state, as determined from the physical location of
the buyer for each purchase. The retailer's data sources are scattered across multiple systems and databases and are
unorganized with little coordination or shared data or keys between the various data sources.
Below are the sources of data available to you. Determine which three will give you the smallest set of data sources
but still allow you to implement the outlier detector by state.

A. Database of employees that includes only the employee ID, start date, and department
B. Database of users that contains only their user ID, name, and a list of every item the user has viewed
C. Transaction log that contains only basket ID, basket amount, time of sale completion, and a session ID
D. Database of user sessions that includes only session ID, corresponding user ID, and the corresponding IP address
E. External database mapping IP addresses to geographic locations
F. Database of items that includes only the item name, item ID, and warehouse location
G. Database of shipments that includes only the basket ID, shipment address, shipment date, and shipment method

Answer: A,D,F

Question 39

How can the naïveté of the naïve Bayes classifier be advantageous?

A. It does not require you to make strong assumptions about the data because it is nonparametric
B. It significantly reduces the size of the parameter space, thus reducing the risk of overfitting
C. It allows you to reduce bias with no tradeoff in variance
D. It guarantees convergence of the estimator

Answer: A

Question 40

What are two defining features of RMSE (root-mean-square error or root-mean-square deviation)?

A. It is sensitive to outliers
B. It is the mean value of recommendations of the K-equal partitions in the input data
C. It is the square of the median value of the error where error is the difference between predicted ratings and actual
ratings
D. It is appropriate for numeric data
E. It considers the order of recommendations

Answer: B,D
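
For reference, RMSE is the square root of the mean of the squared errors. The sketch below computes it next to the mean absolute error (included only for comparison) on made-up ratings, to show how a single large miss affects each metric.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length numeric sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def mae(predicted, actual):
    """Mean absolute error, shown only for comparison."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

actual = [3, 4, 5, 2, 4]                 # illustrative ratings
close = [3.5, 3.5, 4.5, 2.5, 4.5]        # every prediction off by 0.5
one_miss = [3, 4, 5, 2, 1.5]             # four exact, one off by 2.5

print(rmse(close, actual), mae(close, actual))
print(rmse(one_miss, actual), mae(one_miss, actual))
```

Both prediction sets have the same mean absolute error, but the set with one large miss has more than double the RMSE, because squaring weights large errors more heavily.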

Question 41

Consider the following sample from a distribution that contains a continuous X and label Y that is either A or B:

Which is the best cut point for X if you want to discretize these values into two buckets in a way that minimizes the
sum of chi-square values?

A. X8
B. X6
C. X5
D. X4
E. X2

Answer: D

Question 42

Consider the following sample from a distribution that contains a continuous X and label Y that is either A or B:

Which is the best choice of cut points for X if you want to discretize these values into three buckets that minimizes the
sum of chi-square values?

A. X5 and X8
B. X4 and X6
C. X3 and X8
D. X3 and X6
E. X2 and X1

Answer: E

Question 43

You want to understand more about how users browse your public website. For example, you want to know which pages
they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most
efficient process to gather these web servers’ access logs into your Hadoop cluster for analysis?

A. Sample the web server logs from the web servers and copy them into HDFS using curl
B. Channel these clickstreams into Hadoop using Hadoop Streaming
C. Write a MapReduce job with the web servers for mappers and the Hadoop cluster nodes for reducers
D. Import all user clicks from your OLTP databases into Hadoop using Sqoop
E. Ingest the server web logs into HDFS using Flume

Answer: C

Question 44

You have a large file of N records (one per line), and want to randomly sample 10% of them. You have two functions that
are perfect random number generators (though they are a bit slow):
random_uniform() generates a uniformly distributed number in the interval [0, 1]
random_permutation(M) generates a random permutation of the numbers 0 through M-1.
Below are three different functions that implement the sampling.
Method A
for line in file:
    if random_uniform() < 0.1:
        print(line)
Method B
i = 0
for line in file:
    if i % 10 == 0:
        print(line)
    i += 1
Method C
idxs = random_permutation(N)[:(N // 10)]
i = 0
for line in file:
    if i in idxs:
        print(line)
    i += 1
Which method will have the best runtime performance?

A. Method A
B. Method B
C. Method C

Answer: A
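
The three methods can be tried out with the standard library. In the sketch below, `random.Random.random` and `random.Random.sample` stand in for the question's `random_uniform` and `random_permutation` (an assumption, since those helpers are not defined in the source), and Method C keeps its indices in a set so that membership tests are fast.

```python
import random

def method_a(lines, rng):
    """Independent coin flip per line: sample size varies around 10%."""
    return [line for line in lines if rng.random() < 0.1]

def method_b(lines):
    """Every 10th line: deterministic, so periodic structure in the file carries over."""
    return [line for i, line in enumerate(lines) if i % 10 == 0]

def method_c(lines, rng):
    """Exactly N // 10 distinct random indices, all held in memory at once."""
    n = len(lines)
    idxs = set(rng.sample(range(n), n // 10))
    return [line for i, line in enumerate(lines) if i in idxs]

rng = random.Random(42)
lines = [f"record-{i}" for i in range(1000)]
print(len(method_a(lines, rng)), len(method_b(lines)), len(method_c(lines, rng)))
```

Methods B and C always return 100 of the 1,000 records here, while the size of Method A's sample changes from run to run; Method C is the only one whose memory use grows with N.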

Question 45

You have a large file of N records (one per line), and want to randomly sample 10% of them. You have two functions that
are perfect random number generators (though they are a bit slow):
random_uniform() generates a uniformly distributed number in the interval [0, 1]
random_permutation(M) generates a random permutation of the numbers 0 through M-1.
Below are three different functions that implement the sampling.
Method A
for line in file:
    if random_uniform() < 0.1:
        print(line)
Method B
i = 0
for line in file:
    if i % 10 == 0:
        print(line)
    i += 1
Method C
idxs = random_permutation(N)[:(N // 10)]
i = 0
for line in file:
    if i in idxs:
        print(line)
    i += 1
Which method requires the most RAM?

A. Method A
B. Method B
C. Method C

Answer: B

Question 46

You have a large file of N records (one per line), and want to randomly sample 10% of them. You have two functions that
are perfect random number generators (though they are a bit slow):
random_uniform() generates a uniformly distributed number in the interval [0, 1]
random_permutation(M) generates a random permutation of the numbers 0 through M-1.
Below are three different functions that implement the sampling.
Method A
for line in file:
    if random_uniform() < 0.1:
        print(line)
Method B
i = 0
for line in file:
    if i % 10 == 0:
        print(line)
    i += 1
Method C
idxs = random_permutation(N)[:(N // 10)]
i = 0
for line in file:
    if i in idxs:
        print(line)
    i += 1
Which method might introduce unexpected correlations?

A. Method A
B. Method B
C. Method C

Answer: C

Question 47

You have a large file of N records (one per line), and want to randomly sample 10% of them. You have two functions that
are perfect random number generators (though they are a bit slow):
random_uniform() generates a uniformly distributed number in the interval [0, 1]
random_permutation(M) generates a random permutation of the numbers 0 through M-1.
Below are three different functions that implement the sampling.
Method A
for line in file:
    if random_uniform() < 0.1:
        print(line)
Method B
i = 0
for line in file:
    if i % 10 == 0:
        print(line)
    i += 1
Method C
idxs = random_permutation(N)[:(N // 10)]
i = 0
for line in file:
    if i in idxs:
        print(line)
    i += 1
Which method is least likely to give you exactly 10% of your data?

A. Method A
B. Method B
C. Method C

Answer: B

Question 48

Assuming the trends shown in this chart continue, what would we expect the value of the revenue to be in Q1 of
2013?

A. $125,000
B. $170,000
C. $220,000
D. $250,000

Answer: A

Question 49

From historical data, you know that 50% of students who take Cloudera’s Introduction to Data Science: Building
Recommender Systems training course pass this exam, while only 25% of students who did not take the training
course pass this exam. You also know that 50% of this exam’s candidates also take Cloudera’s Introduction to Data
Science: Building Recommender Systems training course.
If we know that a person has passed this exam, what is the probability that they took Cloudera’s Introduction to Data
Science: Building Recommender Systems training course?

A. 2/3
B. 1/2
C. 3/4
D. 3/5

Answer: A
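
The posterior follows from Bayes' rule. The sketch below works it out with exact fractions, reading the scenario's figures as a 50% pass rate with the course, a 25% pass rate without it, and 50% of candidates taking the course; the variable names are illustrative.

```python
from fractions import Fraction

p_course = Fraction(1, 2)                 # P(took the course)
p_pass_given_course = Fraction(1, 2)      # P(pass | course)
p_pass_given_no_course = Fraction(1, 4)   # P(pass | no course)

# Law of total probability: overall pass rate.
p_pass = p_pass_given_course * p_course + p_pass_given_no_course * (1 - p_course)

# Bayes' rule: P(course | pass).
p_course_given_pass = p_pass_given_course * p_course / p_pass

print(p_pass, p_course_given_pass)  # 3/8 2/3
```

The intermediate value `p_pass` is the overall probability that an arbitrary candidate passes.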

Question 50

From historical data, you know that 50% of students who take Cloudera’s Introduction to Data Science: Building
Recommender Systems training course pass this exam, while only 25% of students who did not take the training
course pass this exam. You also know that 50% of this exam’s candidates also take Cloudera’s Introduction to Data
Science: Building Recommender Systems training course.
What is the probability that any individual exam candidate will pass the data science exam?

A. 3/8
B. 1/4
C. 1/8
D. 1/2

Answer: A

Question 51

You want to build a classification model to identify spam comments on a blog. You decide to use the words in the
comment text as inputs to your model. Which criteria should you use when deciding which words to use as features in
order to contribute to making the correct classification decision?

A. Choose words for your sample that are most correlated with the spam label
B. Choose words for your sample that occur most frequently in the text
C. Choose words for your sample that have the largest mutual information with the spam label
D. Choose words for your sample that are least correlated with the spam label

Answer: A

Question 52

Given the following sample of numbers from a distribution:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
What are the five numbers that summarize this distribution (the five-number summary of sample
percentiles)?

A. 1, 3, 8, 34, 89
B. 1, 4, 13, 34, 89
C. 1, 1.5, 5, 24.5, 89
D. 1, 2.5, 8, 27.5, 89

Answer: A
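
Quartiles depend on the interpolation convention, so a five-number summary can differ between tools. The sketch below assumes the sample is the sequence 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89 and uses Python's `statistics.quantiles` with both of its built-in methods; the helper name is illustrative.

```python
from statistics import quantiles

sample = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

def five_number_summary(data, method):
    """Return (min, Q1, median, Q3, max) using statistics.quantiles."""
    ordered = sorted(data)
    q1, q2, q3 = quantiles(ordered, n=4, method=method)
    return ordered[0], q1, q2, q3, ordered[-1]

print(five_number_summary(sample, "inclusive"))  # (1, 2.5, 8.0, 27.5, 89)
print(five_number_summary(sample, "exclusive"))  # (1, 2.0, 8.0, 34.0, 89)
```

The minimum, median, and maximum agree under both conventions; only the first and third quartiles move.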

Question 53

Given the following sample of numbers from a distribution:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
What are two benefits of using the five-number summary of sample percentiles to summarize a data set?

A. You can calculate unbiased estimators for the parameters of the distribution
B. It’s robust to outliers
C. It’s well-defined for any probability distribution
D. You can calculate it quickly using a relational database like MySQL, even when we have a large sample

Answer: B,D

Question 54

Given the following sample of numbers from a distribution:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
How do high-level languages like Apache Hive and Apache Pig efficiently calculate approximate percentiles for a
distribution?

A. They sort all of the input samples and then look up the samples for each percentile
B. They maintain an index of input data as it is loaded into HDFS and load it into memory
C. They use pivots to assign each observation to the reducer that calculates each percentile
D. They assign sample observations to buckets and then aggregate the buckets to compute the approximations

Answer: D
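The bucket-and-aggregate idea can be illustrated with a simplified fixed-width histogram. This is a sketch of the general technique only, not the actual Hive or Pig implementation; the function names, the fixed value range, and the two-chunk input are assumptions made for the example.

```python
from collections import Counter


def bucket_counts(chunk, lo, width, num_buckets):
    """Count how many values of one data chunk fall into each bucket."""
    counts = Counter()
    for x in chunk:
        idx = min(int((x - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts


def approx_percentiles(chunks, percentiles, lo, hi, num_buckets=100):
    """Approximate percentiles by merging per-chunk bucket counts.

    Assumes every value lies in [lo, hi].
    """
    width = (hi - lo) / num_buckets
    merged = Counter()
    for chunk in chunks:  # each chunk could be processed by a separate worker
        merged.update(bucket_counts(chunk, lo, width, num_buckets))
    total = sum(merged.values())
    results = []
    for p in percentiles:
        target = p * total / 100
        running = 0
        for idx in range(num_buckets):
            running += merged[idx]
            if running > 0 and running >= target:
                # Report the midpoint of the bucket that crosses the target rank.
                results.append(lo + (idx + 0.5) * width)
                break
    return results


chunks = [range(0, 500), range(500, 1000)]
estimates = approx_percentiles(chunks, [50, 90], lo=0, hi=1000)
print(estimates)  # [495.0, 895.0]
```

Only the bucket counts need to be combined across chunks, so no global sort is required; the error of each estimate is bounded by the bucket width.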

Question 55

What is the best way to determine the learning rate parameters for stochastic gradient descent when the distribution
of the input data shifts over time?

A. The learning rate should be adjusted periodically based on the setting that optimizes the objective function over a
sample of recent observations
B. The learning rate should be a fixed number that decays as the number of observations in the data set increases
C. The learning rate should be the value that optimizes the value of the objective function over the first N samples in
the dataset
D. The learning rate should be a fixed number with a constant decay factor
E. The learning rate should be continuously adjusted based on the value that optimizes the objective function for the
most recent observation from the input data

Answer: A

Question 56

Which two machine learning algorithms should you consider as likely to benefit from discretizing continuous features?

A. Support vector machine
B. Naïve Bayes
C. Decision trees
D. Logistic regression
E. Singular value decomposition

Answer: A,B

Explanation:
Reference:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2656082/

Question 57

You’ve built a model that has ten different variables with complicated independence relationships between them, and
both continuous and discrete variables that have complicated, multi-parameter distributions.
Computing the joint probability distribution is complex, but it turns out that computing the conditional probabilities
for the variables is easy. What is the most computationally efficient method for computing the expected value?

A. Method of moments
B. Markov Chain Monte Carlo
C. Gibbs sampling
D. Numerical quadrature

Answer: C
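Gibbs sampling draws each variable in turn from its conditional distribution given the others, then averages over the draws to estimate an expectation. The sketch below uses a two-variable stand-in for the ten-variable model: a standard bivariate normal with correlation `rho`, whose conditionals are known in closed form. The function name, seed, and sample sizes are choices made for the example.

```python
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=1000, seed=0):
    """Sample (x, y) from a standard bivariate normal using only conditionals.

    x | y ~ Normal(rho * y, 1 - rho**2) and y | x ~ Normal(rho * x, 1 - rho**2).
    """
    rng = random.Random(seed)
    cond_sd = (1 - rho ** 2) ** 0.5
    x, y = 0.0, 0.0
    draws = []
    for step in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_sd)  # update x given the current y
        y = rng.gauss(rho * x, cond_sd)  # update y given the new x
        if step >= burn_in:  # discard early draws taken before the chain settles
            draws.append((x, y))
    return draws


draws = gibbs_bivariate_normal(rho=0.8, n_samples=50000)
mean_x = sum(x for x, _ in draws) / len(draws)
mean_xy = sum(x * y for x, y in draws) / len(draws)
print(mean_x, mean_xy)  # expected to be close to 0 and 0.8 respectively
```

The joint density is never evaluated; each step needs only a conditional draw, which is the situation the question describes.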

Question 58

What is one limitation encountered by all systems that employ collaborative filtering and use preferences as input in
order to output product recommendations to consumers?

A. Consumers do not have stable ratings for the same product over time
B. There are too many consumers and too few products
C. Not every product has been rated by every consumer
D. There are too few consumers and too many products

Answer: C

Question 59

Why is the naive Bayes classifier "naive"?

A. It generally performs worse than more complex methods
B. It is an unbiased estimator
C. It assumes independence between all features
D. It makes no assumptions on the underlying distributions (i.e., it is non-parametric)

Answer: C

Explanation:
Reference:
http://www.mathworks.com/help/stats/naive-bayes-classification.html

Question 60

Which three metrics are useful in measuring the accuracy and quality of a recommender system?

A. Mutual Information
B. RMSE
C. Tanimoto coefficient
D. Pearson correlation
E. Precision
F. Recall

Answer: B,E,F

Explanation:
Reference:
htps :::lirias.kuleuien.be:bitstream:0/1456780:/80821:1:datasets-cameraready.pdf
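Three of the listed options, root-mean-square error (RMSE), precision, and recall, are commonly computed for a recommender as sketched below. The ratings and item identifiers are made up for illustration, and the helper names are not taken from any particular library.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against relevant items."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical predicted vs. held-out ratings for three items.
error = rmse([4.0, 3.0, 5.0], [5.0, 3.0, 3.0])

# Hypothetical top-4 recommendations vs. items the user actually liked.
precision, recall = precision_recall(["a", "b", "c", "d"], ["b", "d", "e"])
print(error, precision, recall)
```

RMSE measures how far predicted ratings are from held-out ratings, while precision and recall measure how well the recommended list overlaps with the items a user actually found relevant.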
