
Functional Annotation

Presented by: Deepika Jaggi

2 Main Steps
IDENTIFICATION OF GENES (find ORFs)
(find the genes encoded in the DNA sequence and deduce the function of the encoded proteins and RNAs)
Gene finders: Glimmer (Delcher et al., 1999), GeneMark (Lukashin and Borodovsky, 1998)
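As a rough illustration of what "find ORFs" means, the sketch below scans the three forward reading frames of a DNA string for an ATG followed in frame by a stop codon. It is a naive scan, not the statistical models used by Glimmer or GeneMark; the function name `find_orfs`, the `min_codons` cutoff and the example sequence are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive.

    An ORF here runs from an ATG to the first in-frame stop codon
    (stop included). min_codons counts codons including start and stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

# Made-up sequence containing one short ORF: ATG AAA TTT TAG
example = "CCATGAAATTTTAGGG"
orfs = find_orfs(example)
```

A real gene finder would also scan the reverse strand and weigh codon usage; this sketch only shows the reading-frame bookkeeping.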

GENE PREDICTION
Take the set of predictions and search for hits against one or more protein or protein-domain databases using BLAST or HMMER.

(For each gene that has a significant match, the BLAST output, together with the annotation of the hit, can be used to assign a name and function to the protein.)
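The name-transfer step described above can be sketched as follows, assuming BLAST was run with its default 12-column tabular output (`-outfmt 6`). The e-value cutoff of 1e-5, the helper names, the `descriptions` lookup table and all example hits are hypothetical and only serve the illustration.

```python
def best_hits(blast_tab_lines, max_evalue=1e-5):
    """Map each query ID to its best significant hit as (subject ID, e-value).

    Expects the default BLAST tabular columns: qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # not significant under the assumed cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return {q: (s, e) for q, (s, e, _) in best.items()}

def annotate(queries, hits, descriptions):
    """Transfer the hit's description to the query; no hit means hypothetical."""
    names = {}
    for q in queries:
        if q in hits:
            names[q] = descriptions.get(hits[q][0], "unnamed database entry")
        else:
            names[q] = "hypothetical protein"
    return names

# Made-up BLAST lines and a made-up description table.
lines = [
    "gene1\tP001\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t380",
    "gene1\tP002\t40.0\t150\t90\t2\t10\t160\t5\t155\t1e-20\t95",
    "gene2\tP003\t25.0\t60\t45\t1\t1\t60\t1\t60\t0.5\t22",
]
hits = best_hits(lines)
names = annotate(["gene1", "gene2"], hits, {"P001": "example kinase"})
```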

An annotation is a note that is made while reading any form of text.

(This can be as simple as underlined or highlighted passages.)

Annotation is the process by which information about raw DNA sequences is added to the genome databases. This involves describing different regions of the code and identifying which regions can be called genes.

The diagram below represents a tiny fragment of DNA, a single hypothetical gene. Notice that there are various parts within the gene. Some of these parts will code for a protein, others contain regulatory information, some will not be translated and their function is still unclear.

One important aspect of annotation is identifying which parts of a genome are transcribed into mRNA. Annotation can occur at many levels (nucleotide, structure and function). Annotation connects the raw sequence data to the gene structures, gene products, and the function of those products. It is the annotation that bridges the gap from the sequence to the biology of the organism (Stein, L. 2001).

Nucleotide annotation involves placing landmarks in the genome: genetic markers and the locations of regulatory elements, mRNA, tRNA and rRNA.

Structural annotation consists of the identification of genomic elements: ORFs and their localisation, gene structure, coding regions, and the location of regulatory motifs.

Functional annotation consists of attaching biological information to genomic elements: biochemical function, biological function, involved regulation and interactions, and expression.

Nomenclature of genes in annotation:
Known gene: if a predicted gene matches the entire length of a gene in the database, it is a KNOWN gene.
Putative gene: if the predicted gene shows regions of similarity with a gene in the database, it is referred to as PUTATIVE.
Unknown gene: if a predicted gene matches an EST sequence whose function is not known, it is referred to as UNKNOWN.
Hypothetical gene: if the predicted gene does not match or show any similarity with a gene in the database, it is known as HYPOTHETICAL.
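The four labels form a simple decision order, which can be sketched as a small function. The three boolean inputs are hypothetical summaries of a database search; how they would be computed (coverage and similarity thresholds) is not specified in the slides and is left out here.

```python
def classify_gene(full_length_match, partial_similarity, est_match_only):
    """Return the nomenclature label for a predicted gene.

    The checks run from strongest to weakest evidence, so a gene with a
    full-length database match is 'known' even if other flags are also set.
    """
    if full_length_match:
        return "known"
    if partial_similarity:
        return "putative"
    if est_match_only:
        return "unknown"
    return "hypothetical"

labels = [
    classify_gene(True, True, False),
    classify_gene(False, True, False),
    classify_gene(False, False, True),
    classify_gene(False, False, False),
]
```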

GENE ANNOTATION (in detail.)


Gene annotation is the process of finding genes within a genome and defining the structure and/or function of the genes. It consists of two main steps:
1. identifying elements on the genome, a process called gene prediction, and
2. attaching biological information to these elements.

For DNA annotation, a previously unknown sequence representation of genetic material is enriched with information relating genomic position to intron-exon boundaries, regulatory sequences, repeats, gene names and protein products. This annotation is stored in genomic databases such as Mouse Genome Informatics, FlyBase, and WormBase.

The National Center for Biomedical Ontology (www.bioontology.org) develops tools for automated annotation of database records based on the textual descriptions of the records.

GENE PREDICTION
Ab initio and splice alignment programs are used for gene prediction and for verifying gene locations within genomes.

Ab initio gene prediction programs are based on our prior knowledge of genes. They use statistical analysis to predict the most probable gene models.

Instead of searching the genomic DNA for specific sequences that might indicate the presence of a gene, splice alignment programs begin with a nucleotide sequence that has been acquired from a genetic transcript. Splice sites are locations along the sequence where the order of the nucleotide bases indicates that an exon and an intron are adjacent to each other: the DONOR SITE lies at the 5' end of the intron and the ACCEPTOR SITE at the 3' end. Splice alignment programs use these sites when predicting gene locations.

Gene function annotation relies heavily on sequence similarity searches against protein sequence databases; many entries are annotated automatically based on BLAST hits to NCBI databases.
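To make the donor and acceptor sites concrete, the sketch below lists candidate introns using the canonical GT-AG rule (most introns begin with GT and end with AG). It only enumerates candidates on one strand; a real splice alignment program would choose among them by aligning a transcript. The function name, the `min_len` cutoff and the example sequence are assumptions.

```python
def candidate_introns(dna, min_len=6):
    """List (start, end) spans that begin with GT and end with AG, end exclusive."""
    dna = dna.upper()
    # Donor site: first two bases of the intron (5' end).
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    # Acceptor site: last two bases of the intron (3' end).
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]

# Made-up sequence with one GT...AG span.
seq = "ATGGTAAACAGCC"
introns = candidate_introns(seq)
```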

FUNCTIONAL ANNOTATION
Functional category: Gene Ontology (www.geneontology.org)
Pathways: KEGG (www.genome.jp/kegg), GenMAPP (www.genmapp.org), BioCarta (www.biocarta.com)
Protein interactions: STRING (http://string.embl.de), BIND (http://bind.ca)

GENE ONTOLOGY CONSORTIUM


The Gene Ontology project provides an ontology of defined terms representing gene product properties. The ontology covers three domains: cellular component, the parts of a cell or its extracellular environment; molecular function, the elemental activities of a gene product at the molecular level, such as binding or catalysis; and biological process, operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms. For example, the gene product cytochrome c can be described by the molecular function term oxidoreductase activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.

The GO is structured as a directed acyclic graph, and each term has defined relationships to one or more other terms in the same domain, and sometimes to other domains. The GO vocabulary is designed to be species-neutral, and includes terms applicable to prokaryotes and eukaryotes, single and multicellular organisms.

Annotation is the practice of capturing the activities and localization of a gene product with GO terms, providing references and indicating what kind of evidence is available to support the annotations.

[Figure: GO Inflammatory Response. A directed acyclic graph (DAG) in which the terms below are linked by "is a" relationships, with specificity increasing from biological_process down to inflammatory response. Genes can be associated with many GO terms.]

Terms shown in the figure:
- Inflammatory response GO:0006954
- Response to wounding GO:0009611
- Defense response GO:0006952
- Response to external stimulus GO:0009605
- Response to stress GO:0006950
- Response to stimulus GO:0050896
- Biological_process GO:0008150
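The DAG structure can be sketched as a mapping from each term to its "is a" parents, with a traversal that collects every ancestor. The terms are keyed by name for readability, and the exact edges are an assumption reconstructed from the term names, since the arrows of the original figure are not recoverable; the current ontology may differ.

```python
# term -> list of "is a" parents (assumed edges, see the note above)
IS_A = {
    "inflammatory response": ["response to wounding", "defense response"],
    "response to wounding": ["response to external stimulus", "response to stress"],
    "defense response": ["response to stress"],
    "response to external stimulus": ["response to stimulus"],
    "response to stress": ["response to stimulus"],
    "response to stimulus": ["biological_process"],
    "biological_process": [],
}

def ancestors(term, is_a=IS_A):
    """Return all terms reachable from `term` by following "is a" edges."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:  # a DAG can reach the same parent by two paths
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen

inflammatory_ancestors = ancestors("inflammatory response")
```

Because a term can have more than one parent, the `seen` set is what keeps shared ancestors such as "response to stress" from being visited twice.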

What is a Gene Ontology annotation?
A Gene Ontology annotation is a manually or automatically assigned record in a text file that contains the following information:
- Gene name and symbol
- Gene Ontology term
- Published reference
- Evidence code
- Date of the annotation
- Optional information (qualifiers such as "NOT", "contributes_to", "colocalizes_with"; notes with both free text and controlled vocabulary information)

What is manual annotation?
Scientific curators use published literature to assign GO terms, recording evidence codes that support each annotation.
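The fields of an annotation can be pictured as one record. The sketch below uses a simplified, hypothetical tab-separated layout with only those fields; the annotation file formats actually distributed by the GO Consortium have more columns. The gene symbol, reference and date in the example line are made up.

```python
from dataclasses import dataclass, field

@dataclass
class GOAnnotation:
    gene_symbol: str
    go_term: str
    reference: str
    evidence_code: str
    date: str
    qualifiers: list = field(default_factory=list)

def parse_annotation(line):
    """Parse one line of the assumed layout:
    symbol, GO term, reference, evidence code, date, qualifiers.
    Qualifiers are '|'-separated and the column may be empty or absent.
    """
    parts = line.rstrip("\n").split("\t")
    symbol, go_term, reference, evidence, date = parts[:5]
    qualifiers = parts[5].split("|") if len(parts) > 5 and parts[5] else []
    return GOAnnotation(symbol, go_term, reference, evidence, date, qualifiers)

# Made-up record: a hypothetical gene annotated as NOT involved in GO:0006954.
record = parse_annotation("GENE1\tGO:0006954\tPMID:0000000\tIDA\t20240101\tNOT")
plain = parse_annotation("GENE2\tGO:0006954\tPMID:0000000\tIDA\t20240101")
```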

GO Tools
Blast2GO (B2G)
Bioinformatics Department at the Centro de Investigacion Principe Felipe, Valencia, Spain
Blast2GO is an all-in-one tool for functional annotation of (novel) sequences and the analysis of annotation data. It offers easy start-up and low maintenance, a user-friendly interface, and high-throughput, interactive data mining. Gene Ontology, KEGG maps, InterPro and Enzyme Codes are supported by Blast2GO. Gene Ontology functional annotation follows three application steps:
1. BLAST against public or private databases.
2. Mapping against GO resources to fetch functional data.
3. Annotation to generate reliable functional assignments.

GOanna
GOanna is used to find annotations for proteins using a similarity search. The input can be a list of IDs or a list of sequences in FASTA format. GOanna will retrieve the sequences if necessary and conduct the specified BLAST search against a user-specified database. The resulting file contains GO annotations of the top BLAST hits. The sequence alignments are also provided so that the user can assess the quality of each match.

g:Profiler
University of Tartu, Estonia
g:Profiler is a public web server for characterising and manipulating gene lists resulting from mining high-throughput genomic data. g:Profiler has a simple, user-friendly web interface with powerful visualisation for capturing Gene Ontology, pathway, or transcription factor binding site enrichments down to individual gene levels. Other important aspects of practical data analysis are supported by modules tightly integrated with g:Profiler. These are: g:Convert for converting between different database identifiers; g:Orth for finding orthologous genes from other species; and g:Sorter for searching a large body of public gene expression data for coexpression. g:Profiler supports 31 different species, and underlying data is updated regularly from sources like the Ensembl database.

Gene Ontology for Motifs (GOMO)

GOMO is an alignment- and threshold-free comparative genomics approach for assigning functional roles to DNA regulatory motifs from DNA sequence. The algorithm detects associations between a user-specified DNA regulatory motif (expressed as a position weight matrix, PWM) and Gene Ontology terms. The original method for predicting the roles of transcription factors (TFs) starts with a PWM motif describing the DNA-binding affinity of the TF. GOMO uses the PWM to score the promoter region of each gene in the genome for its likelihood of being bound by the TF. The resulting affinity scores are then used to test each term in the Gene Ontology for association with high-scoring genes.
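The promoter-scoring step can be illustrated with a generic log-odds PWM scan. This is not GOMO's actual affinity score or its statistical test, which are not described here; the pseudocount, the uniform background, the motif counts and the promoter sequences are all assumptions. Sequences are assumed to be uppercase A, C, G, T and at least as long as the motif.

```python
import math

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_site_score(pwm, promoter):
    """Highest PWM score over all windows of the promoter (forward strand only)."""
    width = len(pwm)
    return max(
        sum(pwm[j][promoter[i + j]] for j in range(width))
        for i in range(len(promoter) - width + 1)
    )

# Made-up motif (TATA-like) and made-up promoter sequences.
pwm = log_odds_pwm([{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}])
scores = {
    "geneA": best_site_score(pwm, "GGTATAGG"),
    "geneB": best_site_score(pwm, "GGCCGGCC"),
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Ranking genes by such a score gives the ordered gene list against which each GO term would then be tested for enrichment among the high-scoring genes.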

Gene Ontology Categorizer (GOCat)


Text Mining group, University of Geneva and Swiss-Prot group, Swiss Institute of Bioinformatics
GOCat is an automatic text categorizer. The tool classifies any input text (a few words, an abstract, a set of PubMed identifiers...) into Gene Ontology categories. The system, originally developed for the first BioCreative evaluation campaign, aims at facilitating functional annotation of genes and gene products using text mining methods. For every predicted category, a confidence score and a short text passage, extracted from the input text, are provided.

PubSearch
The Arabidopsis Information Resource
PubSearch is a web-based literature curation tool that allows curators to search articles and annotate genes with keywords from them. It has a simple MySQL database backend and uses a set of Java Servlets for querying, modifying, and adding gene, gene annotation, and literature information.

Manatee
The J. Craig Venter Institute
Manatee is a web-based tool used to perform manual functional annotation. It has been specifically designed to optimize the ability of curators to evaluate all available sequence-based and experimental data to assign the best possible annotation to a given gene product. Manatee allows users to view, modify, and store annotation through interactions with an underlying relational database where all of the information is stored.

Manatee supports the storage of multiple types of functional annotation including protein names, gene symbols, EC numbers, Gene Ontology terms, and associated supporting evidence.
In addition, Manatee provides summary views of statistics and information from the genome as a whole.

InterProScan
European Bioinformatics Institute
InterProScan is a Perl-based program that combines different protein signature recognition methods into one resource.

Automatic annotation tools try to perform all this by computer analysis, as opposed to manual annotation (a.k.a. curation) which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.

Automated annotation
The most powerful current approach for inferring the function of new proteins is by studying the annotations of their homologues, since their common origin is assumed to be reflected in their structure and function. Unfortunately, as proteins evolve they acquire new functions, so annotation based on homology must be carried out in the context of orthologues or subfamilies.

Evolution adds new complications through domain shuffling: homology (or orthology) frequently corresponds to domains rather than complete proteins. Moreover, the function of a protein may be seen as the result of combining the functions of its domains. Additionally, automatic annotation has to deal with problems related to the annotations in the databases: errors (which are likely to be propagated), inconsistencies, or different degrees of function specification.
Automated genome sequence analysis and annotation may provide ways to understand genomes.

METHOD:
Sequence relationships are detected and measured to obtain a map of the sequence space, which is searched for differentiated groups of proteins (similar to islands on the map), which are expected to have a common function and correspond to groups of orthologues or subfamilies. This mapmaking is done by applying a clustering algorithm based on Normalized cuts in graphs. Pairwise local alignments are analyzed to determine the extent to which they cover the entire sequence lengths of the two proteins. This analysis determines both what homologues are preferred for functional inheritance and the level of confidence of the annotation.
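The coverage analysis of pairwise local alignments can be sketched as below. The 0.8 threshold and the two-level "high"/"low" rule are hypothetical stand-ins for the method's actual confidence levels, which the slides do not specify; coordinates are assumed to be 1-based and inclusive.

```python
def alignment_coverage(q_start, q_end, q_len, s_start, s_end, s_len):
    """Fraction of the query and of the subject covered by a local alignment."""
    q_cov = (q_end - q_start + 1) / q_len
    s_cov = (s_end - s_start + 1) / s_len
    return q_cov, s_cov

def confidence(q_cov, s_cov, high=0.8):
    """Assumed rule: 'high' only if both sequences are largely covered."""
    return "high" if min(q_cov, s_cov) >= high else "low"

# Made-up alignment: covers most of a 200-residue query but under half
# of a 400-residue subject, e.g. a match to a single shared domain.
q_cov, s_cov = alignment_coverage(1, 180, 200, 11, 190, 400)
level = confidence(q_cov, s_cov)
```

Requiring coverage of both sequences is what separates full-length homologues from matches that involve only one shared domain.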

Eg: Buchnera aphidicola

The remaining genes either (i) are homologous to genes of unknown function, and are typically referred to as conserved hypothetical genes, or (ii) have no known homologs and are termed hypothetical, uncharacterized, or unknown, because it is unclear whether they encode actual proteins.

The "Gene Function Predictor" is an automatic tool that aims to help biologists by providing hypothetical functional predictions from genomic context characteristics.

AutoFACT: An Automatic Functional Annotation and Classification Tool. It is a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, Gene Ontology terms and locus names; and (4) generates output in HTML and text formats for the user's convenience.

Protein function can be predicted using sequence similarity, phylogenetic profiles, protein-protein interactions, protein complexes and gene expression profiles.
The classical way to infer function is based on sequence similarity using sequence database searching programs such as FASTA and PSI-BLAST. Lack of sequence similarity in the database to the protein of interest creates difficulties for functional predictions.

Methods based on protein-protein interactions
Proteins often interact with one another in a mutually dependent way to perform a common function. As an example, transcription factors interact among themselves to bring about transcription. It is therefore possible to infer the functions of proteins based on their interaction partners. The Rosetta Stone approach is a method to predict function based on protein fusion events. Two polypeptides A and B in one organism are likely to interact if their homologs are expressed as a single polypeptide AB in another organism. The latter polypeptide (AB) is called a Rosetta Stone protein, as it contains information about both A and B. This method can be effective because a biochemical function in many cases depends on the action of a multimeric complex, demonstrating a correlation between co-interacting proteins and their functions.
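The fusion test behind the Rosetta Stone approach can be sketched as follows: two different query proteins are linked if they match non-overlapping regions of the same protein in another organism. The input format, the function name and the example hits are hypothetical; coordinates are positions on the fused protein, 1-based and inclusive.

```python
def rosetta_stone_links(hits):
    """Return pairs of queries that match disjoint regions of one protein.

    hits: list of (query, fused_protein, start, end) local matches.
    """
    links = set()
    for i, (qa, pa, sa, ea) in enumerate(hits):
        for qb, pb, sb, eb in hits[i + 1:]:
            same_target = pa == pb
            disjoint = ea < sb or eb < sa  # the two matched regions do not overlap
            if qa != qb and same_target and disjoint:
                links.add(tuple(sorted((qa, qb))))
    return links

# Made-up hits: A and B match separate halves of AB; C overlaps both.
hits = [
    ("A", "AB", 1, 150),
    ("B", "AB", 160, 300),
    ("C", "AB", 100, 200),
]
links = rosetta_stone_links(hits)
```

The non-overlap condition matters: two queries that hit the same region of AB are more likely homologs of each other than interaction partners.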

Methods based on comparative genomics
Comparative genomics is the study of relationships between the genomes of different species.
Assumption: proteins that function together, either in a metabolic pathway or in a structural complex, are expected to evolve together. During evolution, all such functionally linked proteins tend to be either preserved or eliminated together in a new species. Proteins within these groups are defined as functionally linked. For example, two proteins are functionally linked if their homologs are present in, and absent from, the same set of organisms. Phylogenetic profiling can detect such functionally linked proteins: a phylogenetic profile records the presence or absence of a protein's homologs across a set of genomes, and proteins with identical or similar profiles are considered functionally linked. This method can identify functionally linked proteins with no amino acid sequence similarity, so that the function of a hypothetical protein can be inferred. The method was tested using three proteins: the ribosomal protein RL7, the flagellar structural protein FlgL, and a protein known to participate in a metabolic pathway, the histidine biosynthetic protein HIS5.
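Phylogenetic profile comparison can be sketched with presence/absence vectors and a Hamming distance. The protein names, the five-genome profiles and the `max_distance` cutoff are made up for the illustration.

```python
def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def linked_pairs(profiles, max_distance=1):
    """Pairs of proteins whose profiles differ in at most max_distance genomes."""
    names = sorted(profiles)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if hamming(profiles[a], profiles[b]) <= max_distance
    ]

# 1 = a homolog is present in that genome, 0 = absent (made-up data).
profiles = {
    "P1": (1, 0, 1, 1, 0),
    "P2": (1, 0, 1, 1, 0),
    "P3": (0, 1, 0, 0, 1),
    "P4": (1, 0, 1, 0, 0),
}
pairs = linked_pairs(profiles)
```

No sequence comparison between P1, P2 and P4 is involved: the link comes only from their co-occurrence pattern across genomes.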

Function assignment based on 3D structures
Structures of hypothetical proteins may provide a hint of their biochemical or biophysical functions, so 3D structure can aid the assignment of function for uncharacterized proteins. During evolution, the folding patterns of proteins are often preserved, and hence structure-based comparisons can identify homologs where sequence-based comparisons fail. As an example, the crystal structure of a hypothetical protein, MJ0577, from the hyperthermophile Methanococcus jannaschii, at 1.7 Å resolution contains a bound ATP, suggesting that MJ0577 is an ATPase. The structure also shows different ATP-binding motifs that are shared among many homologous hypothetical proteins in this family. Prediction of protein function from sequence and structure remains a difficult problem, because homologous proteins perform different functions in several cases.

Clustering approaches
Clustering groups genes on the assumption that genes in the same cluster are involved in similar functions, so the proteins they encode are expected to share that function. The COG (Clusters of Orthologous Groups) database groups proteins that are orthologs; its groups can also involve one-to-many and many-to-many relationships. However, it should be noted that the COG database contains a large set of uncharacterized proteins.

Genome context methods
These methods predict functional associations between protein-coding genes by analyzing gene fusion events, the conservation of gene neighborhood, or the significant co-occurrence of genes across different species. Unlike homology-based annotation, genomic context methods predict functional associations between proteins, such as physical interactions or co-membership in pathways, regulons or other cellular processes. The types of genomic context that they use are (1) fusion genes; (2) conservation of gene order or co-occurrence of genes in potential operons; and (3) co-occurrence of genes across genomes (phylogenetic profiles).

Genome annotation is an active area of investigation and involves a number of different organizations in the life science community, which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means. Here is a listing of ongoing projects relevant to genome annotation:
- ENCyclopedia Of DNA Elements (ENCODE)
- Entrez Gene
- Ensembl
- GENCODE
- Gene Ontology Consortium
- GeneRIF
- RefSeq
- UniProt
- Vertebrate and Genome Annotation Project (Vega)

NEED: With the accelerated accumulation of genomic sequence data, there is a pressing need to develop computational methods and advanced bioinformatics infrastructure for reliable, large-scale protein annotation and biological knowledge discovery.

The Protein Information Resource (PIR) serves as an integrated public resource of protein informatics and functional annotation to support genomic and proteomic research and scientific discovery. PIR produces the Protein Sequence Database of functionally annotated protein sequences. In collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), PIR produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250,000 proteins.

The approach allows sensitive identification, consistent and rich annotation, and systematic detection of annotation errors. The knowledge base consists of two new databases, sequence analysis tools, and graphical interfaces. The two databases are:
1. PIR-NREF, a non-redundant reference database.
2. iProClass, an integrated database of protein family, function, and structure information, which provides extensive value-added features for about 830,000 proteins with rich links to over 50 molecular databases.

http://www.plantgdb.org/tutorial/annotatemodule/studentsection.html?searchFor=functional+annotation+&goButton=go#
http://www.mad-cow.org/00/annotation_tutorial.html

Thank you
