
Course: Name: No. ______________________________________________________________________________

Computational Biology
2nd Test, 14th January 2012 Duration 2h00

Written exam, closed book
Remark: Answer the questions in the spaces reserved for this purpose.

GROUP I Clustering and Biclustering (11.0 points)

Problem 1 (3.0 points + 1.0 points + 3.0 points) Consider the following expression matrix where the expression levels of 2 genes (G1 and G2) were analyzed in 8 experimental conditions (C1 to C8). Consider also the problem of grouping the conditions given the expression profiles of the genes G1 and G2 using clustering algorithms.
      C1   C2   C3   C4   C5   C6   C7   C8
G1     1    1   -1    1    3    3    3    5
G2     3    2    1    0    0    2    0    1

a. Determine the dendrogram found by a hierarchical clustering algorithm (HCA) using a bottom-up approach, the Euclidean distance to compute the distance between conditions, and the complete-link distance to compute the distance between groups (intercluster distance). Justify the decisions taken at each step of the HCA.
b. How would you use the dendrogram to group the conditions into 3 clusters, and what would those clusters be?
c. Determine the groups found by the K-means (K=3) algorithm when the centroids are initialized with C1 = (1,3), C4 = (1,0) and C8 = (5,1). For each iteration of the algorithm, present the centroids and the conditions in each group (cluster).
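Both procedures named in this problem can be checked mechanically. The following is a minimal Python sketch, not a reference solution: it assumes each condition is the point (G1, G2) read from the matrix above, and it breaks ties by keeping the first minimal candidate in list order, which is an arbitrary choice. The function names `complete_link`, `agglomerate` and `kmeans` are illustrative.

```python
from itertools import combinations
from math import dist

# Conditions as (G1, G2) points, as read from the expression matrix above.
points = {
    "C1": (1, 3), "C2": (1, 2), "C3": (-1, 1), "C4": (1, 0),
    "C5": (3, 0), "C6": (3, 2), "C7": (3, 0), "C8": (5, 1),
}

def complete_link(a, b):
    """Largest Euclidean distance between a member of a and a member of b."""
    return max(dist(points[p], points[q]) for p in a for q in b)

def agglomerate():
    """Bottom-up clustering; returns the merges as (cluster, cluster, distance)."""
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        # min() keeps the first minimal pair, so ties follow list order (assumption).
        a, b = min(combinations(clusters, 2), key=lambda pair: complete_link(*pair))
        merges.append((a, b, complete_link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

def kmeans(initial):
    """K-means iterations; returns one (assignment, updated centroids) entry per iteration."""
    centroids, assignment, history = list(initial), None, []
    while True:
        # Assign each condition to its nearest centroid (lowest index wins ties).
        new_assignment = {
            name: min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            return history  # assignments stopped changing
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
        history.append((assignment, list(centroids)))

merges = agglomerate()
history = kmeans([points["C1"], points["C4"], points["C8"]])
```

Ties do occur on this data: several candidate merges share the same complete-link distance, and C6 is equidistant from two of the initial centroids. A written answer should therefore state the tie-breaking rule it uses, since a different rule can give a different dendrogram or grouping.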

Problem 2 (2.0 points + 2.0 points) Consider the matrix below corresponding to a discretized version of a gene expression matrix where 5 genes were analysed in 3 consecutive time points. The symbols D, N and U identify the following expression levels: down-regulated, no-change and up-regulated.
      T1   T2   T3
G1    D    U    N
G2    N    D    U
G3    D    U    N
G4    D    D    U
G5    N    U    N

a. Construct a generalized suffix tree with all the information needed for the application of the CCC-Biclustering algorithm.
b. Apply the CCC-Biclustering algorithm and identify all maximal CCC-Biclusters. For each maximal CCC-Bicluster indicate the node in the generalized suffix tree that identifies it, together with the set of genes, the set of time points, and the expression pattern that defines the bicluster.
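The preparation step for the suffix tree can be sketched in Python. This sketch assumes the usual alphabet transformation for CCC-Biclustering, in which each symbol is suffixed with its time-point number so that a pattern can only match at the same columns, and it assumes one terminator per gene. It lists the strings to insert and, by brute force, the genes sharing each contiguous pattern. It does not build the tree or decide maximality, and the names `transform`, `suffixes` and `occurrences` are illustrative.

```python
from collections import defaultdict

# Discretized matrix, one string per gene over time points T1..T3
# (as read from the table above).
matrix = {"G1": "DUN", "G2": "NDU", "G3": "DUN", "G4": "DDU", "G5": "NUN"}

def transform(pattern):
    """Append the 1-based time point to each symbol, e.g. 'DUN' -> ['D1', 'U2', 'N3']."""
    return [f"{symbol}{t}" for t, symbol in enumerate(pattern, start=1)]

def suffixes(gene_index, symbols):
    """All suffixes of one transformed string, each closed by a per-gene terminator."""
    terminator = f"${gene_index}"
    return [symbols[i:] + [terminator] for i in range(len(symbols))]

transformed = {gene: transform(p) for gene, p in matrix.items()}

# Suffixes that would be inserted into the generalized suffix tree.
all_suffixes = {
    gene: suffixes(i, transformed[gene])
    for i, gene in enumerate(matrix, start=1)
}

# Brute-force view of what the tree nodes encode: for every contiguous
# transformed pattern, the set of genes in which it occurs.
occurrences = defaultdict(set)
for gene, symbols in transformed.items():
    for i in range(len(symbols)):
        for j in range(i + 1, len(symbols) + 1):
            occurrences[tuple(symbols[i:j])].add(gene)
```

Each key of `occurrences` corresponds to a path label from the root, and the size of its gene set corresponds to the number of leaves below that point in the tree. This gives a way to cross-check a hand-drawn tree.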

GROUP II Data Mining (9.0 points)

Problem 1 (2.0 points) What is the difference between supervised and unsupervised learning? Give an example of a biomedical problem where supervised learning could be used, and an example of a biomedical problem where unsupervised learning could be used but supervised learning could not.

Problem 2 (7.0 points) Consider the following set of examples (individuals) describing the probability of having lung cancer (LungCancer in {High, Low}) based on three attributes collected for each person: family history of lung cancer (History in {Yes, No}), active smoking (ActiveSmoking in {Yes, No}), and passive smoking (PassiveSmoking in {High, Moderate, Low}).
       History   ActiveSmoking   PassiveSmoking   LungCancer
I1     No        Yes             High             High
I2     Yes       No              Moderate         High
I3     No        No              High             Low
I4     No        Yes             Low              High
I5     No        Yes             Moderate         High
I6     Yes       No              Moderate         High
I7     Yes       No              Low              High
I8     No        No              Moderate         Low
I9     Yes       Yes             Moderate         High
I10    No        No              Low              Low

a. Compute a classifier based on decision trees using the ID3 algorithm. Justify all the options taken by the algorithm while computing the decision tree.
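ID3 chooses each split by information gain, so the central computation is the entropy of the class label before and after partitioning on an attribute. The following is a minimal Python sketch of that computation at the root, assuming the ten examples as read from the table above. It does not build the full tree, and the names `entropy` and `information_gain` are illustrative.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("History", "ActiveSmoking", "PassiveSmoking")

# Rows I1..I10 as (History, ActiveSmoking, PassiveSmoking, LungCancer),
# as read from the table above.
examples = [
    ("No", "Yes", "High", "High"),
    ("Yes", "No", "Moderate", "High"),
    ("No", "No", "High", "Low"),
    ("No", "Yes", "Low", "High"),
    ("No", "Yes", "Moderate", "High"),
    ("Yes", "No", "Moderate", "High"),
    ("Yes", "No", "Low", "High"),
    ("No", "No", "Moderate", "Low"),
    ("Yes", "Yes", "Moderate", "High"),
    ("No", "No", "Low", "Low"),
]

def entropy(rows):
    """Shannon entropy (in bits) of the class label, the last field of each row."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def information_gain(rows, attr_index):
    """Entropy of rows minus the weighted entropy after splitting on one attribute."""
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row for row in rows if row[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

# Gain of each attribute at the root of the tree.
gains = {name: information_gain(examples, i) for i, name in enumerate(ATTRIBUTES)}
```

Applying `information_gain` again to the subset of rows in each branch gives the gains for the next level. When two attributes have equal gain, the choice between them is a tie-breaking decision that the justification should state explicitly.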
