
INTRODUCTION

• A database is a collection of data that is
organized so that its contents can easily
be accessed, managed, and modified
by a computer. In the biosciences, a
database is a curated repository of raw
data containing annotations, further
analysis, and links to other databases.
• Currently, a lot of bioinformatics work is concerned
with the technology of databases.
• These databases include both "public" repositories of
gene data like GenBank or the Protein Data Bank (the
PDB), and private databases like those used by
research groups involved in gene mapping projects or
those held by biotech companies.
THE “UNITS OF INFORMATION”

• DNA
• RNA
• PROTEIN
• SEQUENCE
• STRUCTURE
• EVOLUTION
• PATHWAYS
• MUTATION
BASIC STRUCTURE

CORE DATA - The data the database was
generated to organize.

ANNOTATION - Extra information that rounds
out our picture of the core data.
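
The split between core data and annotation can be sketched as a small record type. This is only an illustration: the class name, field names, and example values are hypothetical and do not correspond to any real database schema.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseEntry:
    """Hypothetical entry separating core data from annotation."""

    # Core data: what the database exists to organize.
    identifier: str
    sequence: str
    # Annotation: extra information that describes the core data.
    annotations: dict = field(default_factory=dict)


# Made-up values for illustration only; not a real database record.
entry = DatabaseEntry(identifier="EXAMPLE0001", sequence="ATGGCCTAA")
entry.annotations["organism"] = "hypothetical organism"
entry.annotations["note"] = "illustrative record"

print(entry.sequence)
print(sorted(entry.annotations))
```

Keeping annotation in a separate container means it can grow or be corrected without touching the sequence itself.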
THE MAIN SEQUENCE DATABASES

• There are three main nucleic acid
sequence databases and one main
protein sequence database in
widespread general use.
• For nucleic acids these are EMBL, GenBank
and DDBJ.
• For proteins this is SWISS-PROT.
THE DNA DATABASES
• DATA SOURCES FOR DNA
DATABASES
• Direct scientific submission
• Genome sequencing labs and groups
• Scientific literature
• Patent applications
DIFFERENT DNA DATABASES

• Important DNA databases are:
• GenBank at NCBI
(http://www.ncbi.nlm.nih.gov/)
• EMBL at EBI
(http://www.ebi.ac.uk/embl/)
• DDBJ in Japan
(http://www.ddbj.nig.ac.jp/)
GenBank (Genetic Sequence Databank)

• One of the fastest growing repositories of
known genetic sequences.
• It has a flat file structure: each entry is an ASCII text
file, readable by both humans and computers.
• In addition to sequence data, GenBank files
contain information like accession numbers
and gene names, phylogenetic classification
and references to published literature.
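
Because a flat file is plain ASCII text, it can be read with ordinary string handling. The sketch below assumes a simplified GenBank-style layout (keyword lines starting in the first column, an ORIGIN section holding the sequence, and `//` closing the record). The sample record is invented for illustration and is not a real GenBank entry; a production parser would need to handle many more fields.

```python
# Invented, simplified record for illustration; not a real GenBank entry.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01   24 bp   DNA
DEFINITION  Illustrative record only.
ACCESSION   EXAMPLE01
SOURCE      hypothetical organism
ORIGIN
        1 atggccatta aaggtgcctg ataa
//
"""


def parse_flat_file(text):
    """Return (fields, sequence) for one simplified GenBank-style record."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if in_sequence:
            # Sequence lines: a position number, then blocks of bases.
            sequence_parts.extend(line.split()[1:])
        elif line[:1].isalpha():
            # Keyword lines start in the first column; others are skipped.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    return fields, "".join(sequence_parts).upper()


fields, sequence = parse_flat_file(SAMPLE_RECORD)
print(fields["ACCESSION"], len(sequence))
```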
The EMBL Nucleotide Sequence
Database
• The EMBL Nucleotide Sequence
Database is a comprehensive database
of DNA and RNA sequences collected
from the scientific literature and patent
applications and directly submitted from
researchers and sequencing groups.
REFSEQ from NCBI
(Reference sequence database)

• The Reference Sequence (RefSeq) collection
aims to provide a comprehensive, integrated,
non-redundant set of sequences, including
genomic DNA, transcript (RNA), and protein
products, for major research organisms.
Entrez Gene
• Entrez Gene is a database for gene-specific
information.
• It does not include all known or predicted
genes.
• Instead, Entrez Gene focuses on the
genomes that have been completely
sequenced, that have an active research
community to contribute gene-specific
information, or that are scheduled for intense
sequence analysis.
HOW THE DATA IS ENTERED

• Sequences are placed in the databases from
published papers describing them, or
• More commonly, they are submitted directly
by their authors to the database
organisations.
• The sequence is then deemed to belong to
the author, and only they can update or amend it.
Contd.
• Webin is the WWW site for submitting
nucleotide sequence data and associated
biological information to the EMBL database at
the EBI.
• BankIt is the equivalent NCBI WWW site
for submitting to GenBank.
• Sequin is a program that can be
downloaded and run on the author's local
computer to prepare a sequence for
submission. The result is then sent by e-mail
to the NCBI or the EBI.
DATA RELIABILITY IN DATABASES

• The huge amount of data collected in DNA
databases presents a lot of problems:
• Data accuracy
• Redundancy
• Inconsistent nomenclature
• Inaccurate annotation
• Sequence contamination (vectors, bacterial)
THE MAIN PROTEIN DATABASES

• SOURCES OF PROTEIN
• Proteins that have been worked on experimentally
• mRNA whose product has been worked on
experimentally (no actual protein sequencing
done)
• Translated DNA (mRNA) sequences
INFORMATION CONTAINED IN
PROTEIN DATABASES

• Primary amino acid sequences
• Secondary structure
• 3D structure
• Protein family domains
• Consensus active sites
PROTEIN PRIMARY SEQUENCE
DATABASES
• SWISS-PROT
• SWISS-PROT is the main protein sequence database.
• Produced collaboratively by Amos Bairoch (University of
Geneva) and the EBI.
• The data in SWISS-PROT are derived from translations of
DNA sequences from the EMBL Nucleotide Sequence
Database. SWISS-PROT is a curated protein database,
which strives to provide a high level of annotation, a
minimal level of redundancy and a high level of integration
with other databases.
• The core data consists of the
sequence; the citation information
(bibliographical references) and the
taxonomic data (description of the
biological source of the protein).
The annotation consists of the description of:
• Function(s) of the protein
• Post-translational modification(s)
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Disease(s) associated with deficiencies in the
protein
• Sequence conflicts, variants, etc.
UniProt (Universal Protein Resource)

• UniProt (Universal Protein Resource) is the world's
most comprehensive catalog of information on
proteins. It is a central repository of protein sequence
and function created by joining the information
contained in Swiss-Prot, TrEMBL, and PIR.
• The UniProt Knowledgebase (UniProtKB) is the central
access point for extensive curated protein
information, including function, classification, and
cross-references.
• The UniProt Non-redundant Reference (UniRef) databases
combine closely related sequences into a single record
to speed searches.
• The UniProt Archive (UniParc) is a comprehensive
repository, reflecting the history of all protein sequences.
TrEMBL

• TrEMBL is a computer-annotated protein sequence database supplementing the SWISS-PROT database.
It contains translations of all coding sequences present in the EMBL database that are not yet integrated
into SWISS-PROT. TrEMBL can be considered a preliminary section of SWISS-PROT. It is split into two
sections: SP-TrEMBL and REM-TrEMBL.
• SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries that should eventually be incorporated into
SWISS-PROT.
• REM-TrEMBL (REMaining TrEMBL) contains the entries that are not to be incorporated into
SWISS-PROT. It includes immunoglobulins and T-cell receptors, synthetic sequences, patent application
sequences, small sequences and coding sequence translations where there is strong evidence that
the proteins are not real.
NR DATABASE
(Primary Database from NCBI)

• The NR Protein database contains
sequence data from the translated
coding regions of DNA sequences in
GenBank, EMBL and DDBJ as well as
protein sequences submitted to PIR,
SWISS-PROT, PRF and PDB (sequences
from solved structures).
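
The idea of pooling several sources into one non-redundant set can be sketched as grouping records by sequence. The source does not describe how NR is actually built, so exact-match grouping is an assumption here, and the records and identifiers below are made up.

```python
from collections import defaultdict

# Hypothetical (source, identifier, sequence) records; not real entries.
records = [
    ("GenBank translation", "gb_example_1", "MKTAYIAK"),
    ("SWISS-PROT", "sp_example_1", "MKTAYIAK"),
    ("PIR", "pir_example_1", "MSTNPKPQ"),
    ("PDB", "pdb_example_1", "mktayiak"),
]


def merge_identical(records):
    """Group records whose sequences are identical, ignoring letter case."""
    merged = defaultdict(list)
    for source, identifier, sequence in records:
        merged[sequence.upper()].append((source, identifier))
    return dict(merged)


non_redundant = merge_identical(records)
for sequence, origins in non_redundant.items():
    print(sequence, [identifier for _, identifier in origins])
```

Each distinct sequence is stored once, while the list of origins keeps track of every database that contributed it.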
HOW THE DATA IS ENTERED

• The entries in SWISS-PROT are derived from much the same
sources as the nucleotide database entries, with the addition of
translations of the coding sequences in EMBL entries.
• Submissions to SWISS-PROT of directly sequenced
peptides should be made via the site at the EBI. The EBI does not
provide accession numbers in advance for protein
sequences that are the result of translation of nucleic acid
sequences. These translations are automatically
forwarded to SWISS-PROT from the EMBL nucleotide
database and are assigned SWISS-PROT accession
numbers on incorporation into TrEMBL.
OTHER PROTEIN DATABASES

• GenPept is GenBank's equivalent of TrEMBL. It is an automatic translation
of all coding sequences present in the GenBank database.
• OWL is a non-redundant protein sequence database produced from
SWISS-PROT, PIR, NRL-3D and GenPept.
• The International Protein Sequence Database (PIR) is a
collaborative database from PIR, MIPS and JIPID that contains much
the same sequence information as SWISS-PROT. However, it has a
substantial number of duplicated sequence entries, is hard to read and
is not well annotated. In particular, it lacks SWISS-PROT's superb
cross-referencing to other databases.
• NRL-3D is produced by PIR from sequence and annotation information
extracted from the Brookhaven Protein Data Bank (PDB) of crystallographic
3D protein structures. It is useful for similarity searches.
• The Kabat Database of Sequences of Proteins of Immunological
Interest is a database of sequences involved in the immune system.
DATA RELIABILITY IN PROTEIN
DATABASES
• About 30% of the proteins in the
databases have erroneous sequences
due to:
– Missing exons in the DNA translation.
– Introns mistakenly translated.
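
Both error types can be illustrated with a toy translation. The sketch below uses the standard genetic code; the exon and intron sequences are invented and far shorter than real ones, so this only shows the principle, not a real gene.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 1 until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented toy gene: three exons and one intron.
exon1, intron, exon2, exon3 = "ATGGCC", "GTAAGT", "AAAGGT", "TGGTAA"

correct = translate(exon1 + exon2 + exon3)
missing_exon = translate(exon1 + exon3)                     # exon2 was not predicted
intron_included = translate(exon1 + intron + exon2 + exon3)  # intron read as coding

print(correct, missing_exon, intron_included)
```

In this toy case the intron length is a multiple of three, so the reading frame survives and the protein simply gains extra residues; an intron of any other length would also shift the frame of everything downstream.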
ACCESSING THE DATABASE

The following publicly available WWW sites support keyword and
similarity searches:
• (1) Entrez provides good cross-linking between the nucleic acid
and protein databases and the Medline bibliographic database.
A very powerful feature is the ability to find other entries like the
one already found.
• (2) SRS, the Sequence Retrieval System, provides a powerful
means of finding entries in related sets of databases.
• (3) BLAST and FASTA are popular sequence similarity searching
programs, accessible through publicly available sites.
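
Keyword searches of this kind can also be issued programmatically. As a rough sketch, the code below only builds a query URL for NCBI's E-utilities `esearch` endpoint. The endpoint and its `db`, `term`, and `retmax` parameters are not mentioned in the slides and are brought in here as an assumption about how one might script an Entrez search; the search term is just an example. No network request is made.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an addition, not taken from the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=5):
    """Build a keyword-search URL; fetching and parsing are left to the caller."""
    params = {"db": database, "term": term, "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Example query: the term is illustrative only.
url = build_esearch_url("protein", "hemoglobin AND human[Organism]")
print(url)
```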
MISCELLANEOUS OTHER DATABASES
GENOMES
• GOLD - Genomes OnLine Database
• KEGG - Kyoto Encyclopedia of Genes and
Genomes
• FlyBase
