
Figure S1. Prediction of ORFs
(A) Schematic diagram for the prediction of ORFs.

This diagram illustrates the ORF prediction method applied to all H-Inv cDNAs. The method was based on alignments from similarity searches with FASTY and BLASTX, and gene prediction was carried out with GeneMark. Before predicting ORFs, we checked whether each sequence contained frameshift errors or remaining introns; during ORF prediction, these sequence irregularities were corrected computationally. How the irregularities were predicted is described in (B) and (C).

Input: 41,118 H-Invitational cDNAs

Step 1. Is the top BLASTX hit a curated human SwissProt/RefSeq entry (status = review) with % identity = 100% and % length coverage = 100%?
    Yes: Does the translation include both a Met and a stop codon?
        Yes: Assign predicted ORF = ORF in SwissProt/RefSeq.
        No: Assign predicted ORF = translation of the aligned region of the cDNA in the frame of the BLASTX alignment.
    No: go to step 2.

Step 2. Is there a top hit in FASTY (excluding the hit to itself) at the E-value cutoff of 10^-10?
    Yes: Does the top FASTY hit have % identity = 100% and % length coverage = 100%, and does the translation start with Met?
        Yes: Assign predicted ORF = translation of the aligned target, Met to stop codon.
        No: Assign predicted ORF = translation of the aligned target based on the alignment with the target, correcting predicted immatures and frameshifts. Consider 3 frames and choose the longest frame.
    No: go to step 3.

Step 3. Is the ORF predicted by GeneMark?
    Yes: Assign predicted ORF = the longest ORF among the GeneMark predictions, correcting predicted immatures and frameshifts. Consider 3 frames and choose the longest.
    No: go to step 4.

Step 4. Is ORF information registered in DNA_db?
    Yes: Assign predicted ORF = ORF in DNA_db.
    No: go to step 5.

Step 5. Is the longest ORF of length > 80 aa?
    Yes: Assign predicted ORF = the longest ORF of length > 80 aa, correcting predicted immatures and frameshifts.
    No: No ORF predicted.
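
The cascade in panel (A) can be read as an ordered fallback over evidence sources. The following is a minimal Python sketch of that reading, not the H-Inv pipeline itself. The `Evidence` container, its field names, and `assign_orf` are hypothetical; the sketch assumes the BLASTX, FASTY, GeneMark, and DNA_db results have already been reduced to a candidate ORF (or `None`), and it leaves out the correction of frameshifts and remaining introns. Only the final step, the longest ORF over three frames with the > 80 aa threshold, is computed from the sequence.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_ORF_AA = 80  # threshold from the flowchart: longest ORF must exceed 80 aa


def longest_orf(seq: str) -> str:
    """Return the longest ATG-to-stop ORF (stop included) over the 3 forward frames."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first Met of a new open frame
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


@dataclass
class Evidence:
    """Hypothetical container: candidate ORFs precomputed from each source."""
    blastx_orf: Optional[str] = None    # from a 100% identity / 100% coverage curated hit
    fasty_orf: Optional[str] = None     # from the top FASTY alignment
    genemark_orf: Optional[str] = None  # longest GeneMark prediction
    db_orf: Optional[str] = None        # ORF annotation registered in DNA_db


def assign_orf(cdna: str, ev: Evidence) -> Tuple[Optional[str], str]:
    """Walk the sources in flowchart order and return (ORF, source label)."""
    ordered = (
        ("BLASTX", ev.blastx_orf),
        ("FASTY", ev.fasty_orf),
        ("GeneMark", ev.genemark_orf),
        ("DNA_db", ev.db_orf),
    )
    for source, orf in ordered:
        if orf:
            return orf, source
    orf = longest_orf(cdna)
    aa_len = len(orf) // 3 - 1  # amino acids, not counting the stop codon
    if aa_len > MIN_ORF_AA:
        return orf, "longest ORF"
    return None, "no ORF predicted"
```

With no external evidence, `assign_orf(cdna, Evidence())` falls through to the longest-ORF step and reports "no ORF predicted" when that ORF is 80 aa or shorter.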



(B) Schematic diagram for the prediction of unspliced introns. This diagram illustrates the method used to predict unspliced introns.

Flowchart (B), nodes as labeled in the figure:
Inputs: gap information from the BLASTX result; gap information from the FASTY result (gap >= 2 aa).
Alignment conditions: length of query > length of subject; no overlap between query; correspondent order and direction of query and subject.
Decision nodes (Yes/No): identity > 90% and gap length >= 60; identity > 80% and gap length >= 60; in-frame at exon boundary (3 bp of exon boundary).
Outcomes: "We predict that the sequence contains an unspliced intron" or "Unspliced intron not predicted".



(C) Schematic diagram for the prediction of frameshift errors. Frameshift errors were inferred from gaps (insertions or deletions) in the cDNA-genome pairwise alignment, with the exception of gaps that are a multiple of 3 bp or longer than 10 bp in either the query cDNA or the genome.

Flowchart (C), nodes as labeled in the figure:
Gap supported by genome: 1. Only insertion/deletion. 2. The insertion/deletion is not a multiple of three. 3. The insertion/deletion is > 10 bp.
FASTY support (Yes/No): > 80% identity and range 9 bp of gap.
GeneMark support (Yes/No): overlaps the predicted genes (stop - stop).
Outcomes: "Frameshift error predicted" or "No frameshift error predicted".
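
The gap criteria stated in the caption of panel (C) amount to a simple length filter. Here is a minimal sketch of that filter only, assuming the caption's wording (gaps that are a multiple of 3 bp or longer than 10 bp are excluded). The function name and the `max_len_bp` parameter are hypothetical, and the FASTY and GeneMark support checks from the flowchart are not modelled.

```python
def is_candidate_frameshift_gap(gap_len_bp: int, max_len_bp: int = 10) -> bool:
    """True if an alignment gap could indicate a frameshift under the caption's criteria.

    Assumption: a gap qualifies when its length is not a multiple of 3 bp
    (so it shifts the reading frame) and does not exceed max_len_bp.
    """
    if gap_len_bp <= 0:
        return False
    return gap_len_bp % 3 != 0 and gap_len_bp <= max_len_bp


# Example: filter a list of gap lengths (in bp) taken from a cDNA-genome alignment.
gap_lengths = [1, 2, 3, 6, 10, 11, 60]
candidates = [g for g in gap_lengths if is_candidate_frameshift_gap(g)]
```

A 3 bp or 6 bp gap preserves the reading frame and is skipped, while long gaps are left to the unspliced-intron prediction in panel (B).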



(D) Statistics for the predicted frameshifts and unspliced introns.

Irregularity            Frameshifts   Unspliced introns
Category I protein               48                 290
Category II protein             117                 456
Category III protein             71                 131
Category IV protein              60                  87
Category V protein               38                  46
Total                           372                1056
