
Structural analysis and modelling of human spliceosomal proteins

Iga Korneta

Doctoral thesis carried out in the Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw

Supervisor: prof. dr hab. Janusz M. Bujnicki

Warsaw, 2012

Table of contents
Thesis summary
    Introduction
    Research project
    Structural analysis of ordered regions of human spliceosomal proteins and creation of a library of structural models
    Structural analysis of disordered regions of human spliceosomal proteins
    Comparison of the proteins and structural regions of the human and G. lamblia spliceosomal proteins
    Publication of the data
    Summary of project results
    Bibliography
Publications
    Structural Bioinformatics of the Human Spliceosomal Proteome
    Intrinsic Disorder in the Human Spliceosomal Proteome
Statements
    Note from the supervisor
    Contribution of the doctoral candidate to the publications
    Statement, Janusz M. Bujnicki
    Statement, Marcin Magnus
    Statement, Iga Korneta
Summary in English

Thesis summary


Introduction
Spliceosome
The spliceosome is a macromolecular machine that carries out splicing in eukaryotic cells: the removal of introns (non-coding sequences) from precursor mRNA (pre-mRNA) and the joining of exons (coding sequences). The human spliceosome occurs in two forms: the major spliceosome, which carries out >99% of splicing reactions in humans, and the minor spliceosome, which carries out the remaining <1% [1]. The two forms found in humans are related and have a similar structure. Neither spliceosome is a stable or uniform complex. On the contrary, each consists of many elements that associate and dissociate during the splicing process.
The major spliceosome consists mainly of four protein-RNA subunits named after the (sn)RNA molecules they contain: U1 snRNP, U2 snRNP (with the SF3A and SF3B subcomplexes), U4/U6 di-snRNP and U5 snRNP. In the minor spliceosome, a single U11/U12 subunit replaces the U1 and U2 subunits, and the U4atac/U6atac subunit replaces the U4/U6 subunit. The four subunits of the human major spliceosome comprise 45 unique proteins in total. Seven of them (the so-called Sm proteins) occur in four copies, one in each subunit, where they form a platform that supports the RNA of that subunit. In the U4/U6 subunit, the Sm proteins are associated with U4 snRNA, whereas U6 snRNA is associated with a platform formed by seven Lsm ("like-Sm") proteins, which are evolutionarily related to the Sm proteins (for a review of the proteins, see [2]).
In addition to the proteins of the spliceosomal subunits, about 70-80 additional proteins are abundant in the generalised spliceosomal complex (without division into the minor or major spliceosome), while up to more than 100 further proteins occur at low abundance. The set of abundant proteins, together with the proteins and RNAs of the spliceosomal subunits, can be regarded as an experimental approximation of the set of proteins required for the spliceosome to function in humans [3].
Additional spliceosomal proteins that occur at low abundance may take part in spliceosome function only under particular conditions, or may mediate between intron removal and other mRNA processing steps, such as mRNA transcription, capping of the 5' end of the mRNA, polyadenylation of the 3' end of the mRNA, mRNA export, localisation and degradation, and the formation of C/D family snoRNP complexes [4]. Additional proteins may be part of protein complexes or act as independent splicing factors. Protein complexes functionally associated with the spliceosome include the hPrp19/CDC5L complex and the EJC (exon junction complex), CBP (cap-binding proteins), TREX (transport and exchange) and RES (retention and splicing) complexes. Apart from the TREX complex, these complexes consist of proteins that are abundant in the spliceosome (and are therefore abundant themselves) [3].

The splicing process (review: [5])
The splicing process has three main stages, which is reflected in the dynamic behaviour of the spliceosomal subunits:
• The first stage is the definition of the boundaries between the intron to be excised and the flanking exons. Alternative intron boundaries are possible here, which gives rise to so-called alternative splicing. In the human major spliceosome, intron boundaries are defined by the U1 and U2 subunits of the spliceosome, the independent protein SF1 and the two-protein U2AF65/U2AF35 complex, which scan the pre-mRNA for the functional sites of the intron [the 5' and 3' ends of the intron and the so-called branch point site (BPS)]. This stage has two steps:
o the U1 subunit binds to the 5' end of the intron, the U2AF65/U2AF35 complex to the 3' end of the intron, and the SF1 protein to the BPS [complex E (entry)];
o the SF1 protein is replaced by the U2 subunit of the spliceosome (pre-splicing complex, complex A).

In the minor spliceosome, the role of the U1 and U2 subunits is taken over by the U11/U12 subunit.
• The second stage is the splicing process proper, part of which is the catalytic splicing reaction. In the human major spliceosome, this process begins with the attachment of the triple U4/U6.U5 tri-snRNP subunit, formed by joining the U4/U6 and U5 subunits, to complex A, and it also has several steps:
o the U4/U6.U5 tri-snRNP subunit joins the pre-mRNA with the bound U1 and U2 subunits (complex A), giving the pre-catalytic complex (complex B);
o the U1 and U4 subunits dissociate (activated complex B, complex B*);
o the first catalytic step of splicing takes place at the interface of the U2, U5 and U6 snRNAs: nucleophilic attack of the BPS on the 5' end of the intron (yielding the catalytic complex, complex C);
o the second catalytic step of splicing takes place at the interface of the U2, U5 and U6 snRNAs: attack of the released 5' exon on the 3' end of the intron (yielding the post-splicing complex).
In the minor spliceosome, the role of the U4/U6 subunit is taken over by the U4atac/U6atac subunit.
• The third stage is the recycling of the subunits and the regeneration of the subunits present at the start of the first stage. The post-reaction complex splits into the U2, U5 and U6 subunits. The U6 subunit joins the U4 subunit, forming the relatively stable U4/U6 di-snRNP subunit. The U4/U6 di-snRNP subunit then joins the U5 subunit, regenerating the U4/U6.U5 tri-snRNP subunit.

Each step of the splicing reaction (complexes A, B, B*, C) is associated with its own set of additional splicing proteins and protein complexes [3]. Because splicing is a multi-stage process, the basic function of the spliceosome, i.e. the catalytic excision of introns and joining of exons, depends on the proper operation of many additional functionalities of the spliceosomal machinery, such as recognition of the 5' and 3' ends of the intron (intron and exon definition), mutual recognition of the spliceosomal subunits, correct assembly of the subunits, and the dynamics and regulation of the active spliceosome. In contrast to the splicing reaction itself, the early phases of splicing (recognition) do not require catalytic reactions and instead rely on the formation of many weak contacts between the participating units. In turn, the later catalytic transitions between different arrangements of RNA-RNA bonds between the spliceosomal snRNAs and the intron pre-mRNA are assisted by proteins, such as the RNA helicases DDX23 and hBrr2 of the U5 subunit. Isolated cases have also been reported in which the dynamics of splicing is controlled by elements based on post-translational modifications of spliceosomal proteins [6], including ubiquitination and the domains associated with it [7], and by structurally disordered elements (see below) [8]. The spliceosome also continuously proofreads the intermediates and the final product of the process (i.e. the spliced mRNA). Proofreading is likewise performed by RNA helicases, in this case ones not stably associated with any complex, such as hPrp5, hPrp16 and hPrp22 [9].

Structural studies of the spliceosome
At the time of writing this summary, the experimental structural data available for the spliceosome include maps of the complete (human) spliceosome from cryo-electron microscopy experiments [10] and higher-resolution (atomic) models of various fragments of it, such as the nearly complete U1 subunit of the human spliceosome [11] or U4 snRNA bound to the Sm protein platform [12]. However, for many regions and entire spliceosomal proteins no experimental model exists. On the other hand, many of the available experimental models cover the same protein fragments.

A thorough understanding of spliceosome function requires knowledge of its structure. The first task of my research project was to create a library of experimental models of human spliceosomal proteins and to build structural models for protein regions without experimentally solved structures. To a large extent, the motivation for creating such a library was the vision of building a model of the complete spliceosome in atomic representation. By combining high-resolution structural models of individual regions of the complex with the results of experimental analyses such as mass spectrometry and cryo-electron microscopy, it would be possible to obtain an atomic-scale structural model of the spliceosome, which could then be used for further analyses [13]. Such projects have succeeded before, e.g. in the case of the model of RNA polymerase from the bacterium Escherichia coli [14].

Structural disorder in the spliceosome
Analyses of the ontology of biological processes and protein functions have shown that splicing is one of the processes whose terminology is strongly associated with protein structural disorder [15]. Structural disorder in proteins means the lack of a stable tertiary structure in a given protein fragment in solution, although elements of secondary structure may exist and/or the region may acquire structure under certain conditions (for example, when the protein is bound in a complex). Disordered regions are found in proteins in a variety of roles: as linkers between ordered domains, as sites of post-translational modification, and as sites of protein-protein and protein-RNA contacts. In particular, one place where structural disorder plays a significant role is the ribosome. Many ribosomal proteins consist of ordered fragments with long disordered tails attached. In the mature ribosome, these tails penetrate deep into the complex, forming a glue that supports the rRNA structure [16].
The high capacity of disordered proteins to form contacts also means that they often function in the assembly of larger protein complexes. Proteins that are central hubs of protein networks are often partly or wholly disordered [15]. Despite the strong terminological association of splicing with structural disorder, structural disorder in the spliceosome had not been the subject of a systematic analysis before I began my research project. The literature data on disorder were therefore scattered across publications, sometimes on entirely different topics, and I first had to find them myself. For this reason I cannot refer here to a single review source. The following are synthetic descriptions of the disordered regions of the human spliceosome that I used in the project:
• RS domains and disordered RS-like regions: regions rich in arginine and serine residues, first found in splicing factors from the SR protein group (RS domains) and later in other spliceosomal proteins (RS-like regions). For RS domains, it has been shown that these regions are structurally disordered. RS domains mediate various types of intermolecular contacts, including contacts between SR proteins and pre-mRNA, between different SR proteins, and between SR proteins and other proteins. In particular, RS domains may mediate the definition of intron boundaries by stabilising cross-intron interactions between the U1 snRNP subunit at the 5' end of the intron and the U2AF65 protein at the 3' end of the intron. RS domains and RS-like regions can be phosphorylated on serine residues; phosphorylation promotes the formation of intermolecular contacts, and dephosphorylation promotes the transition to the catalytic stage of splicing. Among the RS-like regions found in other proteins, it has been shown that phosphorylation of such a region in the DDX23 protein of the U5 subunit promotes stable binding of this protein to the U4/U6.U5 tri-snRNP subunit and the incorporation of the U4/U6.U5 tri-snRNP subunit into the spliceosome (main sources: [17][18]);

• regions rich in polyproline and polyglutamine motifs: these occur e.g. in the SmB/B' protein of the Sm platform; they are able to form polyproline (polyglutamine) helices; they may contain motifs that bind the ordered protein domains GYF and WW; it has been shown that saturating polyproline motifs with a GYF domain inhibits splicing at the level of complex A; the structural disorder of these regions has been demonstrated on proteins unrelated to splicing (main source: [19]);
• glycine-rich (and arginine-rich) regions: these contain RGG and related triplets (e.g. YGG, RAG). They can be divided into long (~100 amino-acid residues) and short regions. The former occur in splicing factors of the hnRNP group, while the latter have been found in other splicing proteins, such as the SmB/B' protein of the Sm platform, the SF2/ASF protein from the SR protein group, and the U1-70K protein of the U1 subunit. In contrast to the regions of the previous two groups, these regions are predicted to be highly buried (i.e. lacking contact with the solvent, which indicates some type of local compactness of the region), but like the previous two groups they show no typical secondary structure predictions. The arginine residues can be methylated. The glycine-rich region of hnRNP A1 has been shown to bind itself and other hnRNP proteins in vitro. This region is also required for the binding of hnRNP A1 to the U2 and U4 subunits, and it silences splicing. Arginine methylation in the glycine- and arginine-rich region of the yeast homologue of U1-70K reduces the binding of this protein by Npl3 (which itself contains RGG triplets but is considered a homologue of the SR proteins) (main sources: [20][21]);
• ULMs (UHM ligand motifs): short protein motifs (~20 amino acids) that are predicted to be disordered but are found in experimental models of spliceosomal proteins as ligands binding UHM domains, i.e. RRM domains whose structure is altered so that they bind protein rather than RNA.
They are included in the ELM database of protein motifs as record LIG_ULM_U2AF65_1 (pattern: [KR]{1,4}[KR].[KR]W.). They occur e.g. in the U2AF65 and U2AF35 proteins, which bind the 3' end of the intron: the UHM of U2AF35 binds the ULM of U2AF65 (main source: [22]).
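The pattern quoted for the ULM record is a regular expression, so candidate ULM sites in a protein sequence can be located with a plain regex scan. The following is a minimal sketch, assuming Python's standard `re` module. The function name `find_ulm_candidates` and the toy sequence are hypothetical, and the pattern is copied as printed in this summary rather than retrieved from the ELM database. A match is only a candidate: it says nothing about whether the site is disordered or actually binds a UHM domain.

```python
import re

# Pattern as quoted in the text for ELM record LIG_ULM_U2AF65_1.
ULM_PATTERN = re.compile(r"[KR]{1,4}[KR].[KR]W.")


def find_ulm_candidates(sequence):
    """Return (start, end, motif) for each non-overlapping pattern match.

    Positions are 1-based and inclusive, as is usual for residue numbering.
    """
    hits = []
    for match in ULM_PATTERN.finditer(sequence.upper()):
        hits.append((match.start() + 1, match.end(), match.group()))
    return hits


if __name__ == "__main__":
    toy_sequence = "AAKKRSRWDGG"  # hypothetical sequence, not a real protein
    for start, end, motif in find_ulm_candidates(toy_sequence):
        print(f"candidate ULM at {start}-{end}: {motif}")
```

Because `finditer` reports non-overlapping matches only, a run of basic residues yields a single hit; a lookahead-based variant would be needed to enumerate every overlapping candidate.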

Fig. 1 (next page): Experimental models of the human spliceosome. A: Cryo-EM (cryo-electron microscopy) map of the human spliceosome at 22 Å resolution (EMD ID EMD-1294) [10]. B: Crystal structure of the U1 subunit at 5.5 Å resolution (PDB ID 3CW1) [11]. For the protein components of the subunit, only the Cα carbons are shown.



Research project
My research project focused on the structural analysis and structure modelling of 252 human spliceosomal proteins, including all proteins of the major spliceosome subunits and the abundant additional proteins. The project was initiated by prof. dr hab. Janusz M. Bujnicki, who is also the supervisor of the doctoral thesis. The work carried out within the project was funded by grant LSHG-CT-2005-518238 of the EU Sixth Framework Programme, EURASNET. Some calculations were performed at the Interdisciplinary Centre for Mathematical and Computational Modelling of the University of Warsaw under computational grant G27-4. My work on the project can be divided into four parts:
• Structural analysis of the ordered regions of human spliceosomal proteins, together with a survey of the experimentally solved structures of human spliceosomal proteins that have models in atomic representation, and the building of structural models for protein regions without experimentally solved structures. This part belonged to the original scope of the project, and the main motivation for it was the vision of building a model of the complete spliceosome in atomic representation. Work on the model of the complete spliceosome is being continued by other members of Prof. Bujnicki's research group. The results of the analysis of the ordered regions of human spliceosomal proteins are described in the attached publication "Structural Bioinformatics of the Human Spliceosomal Proteome" (Korneta I., Magnus M., Bujnicki JM., 2012, doi: 10.1093/nar/gks347, PMID: 22573172).
• Analysis of structural disorder in spliceosomal proteins. The study of structural disorder was not part of the original plan of the research project. The need for it arose only when, after a preliminary analysis, I realised that more than one third of the combined length of the proteins of the human major spliceosome subunits, and more than half of the combined length of all spliceosomal proteins, is predicted to be structurally disordered.
The results of the analysis of structural disorder in human spliceosomal proteins are described in the attached publication "Intrinsic Disorder in the Human Spliceosomal Proteome" (Korneta I., Bujnicki JM., 2012, doi: 10.1371/journal.pcbi.1002641, PMID: 22912569).
• Comparison of the repertoire of proteins and domains found in the human spliceosomal proteome with the known repertoire of proteins and domains of the spliceosomal proteome of the protozoan Giardia lamblia. This protozoan is characterised by genomic minimalism, including a minimal number of introns in its genome [23]. The result of the analysis may help to set priorities in modelling the structure of the human spliceosome, because regions found in both the human and the G. lamblia spliceosome should most likely be treated as a first priority during structure modelling. The results of the comparative analysis of the human and G. lamblia spliceosomal proteomes are described in the attached publication "Structural Bioinformatics of the Human Spliceosomal Proteome" (Korneta I., Magnus M., Bujnicki JM., 2012, doi: 10.1093/nar/gks347, PMID: 22573172).
• Publication of the data on a website. The data and structural models I produced within the project are available online at http://iimcb.genesilico.pl/SpliProt3D. The programmer of the service is Marcin Magnus, MSc; I took part in its creation as the designer. The service is described in the attached publication "Structural Bioinformatics of the Human Spliceosomal Proteome" (Korneta I., Magnus M., Bujnicki JM., 2012, doi: 10.1093/nar/gks347, PMID: 22573172). It is the only result in that publication of which I am not the author.
All results of the project are described in the attached publications "Structural Bioinformatics of the Human Spliceosomal Proteome" (Korneta I., Magnus M., Bujnicki JM., 2012, doi: 10.1093/nar/gks347, PMID: 22573172) and "Intrinsic Disorder in the Human Spliceosomal Proteome" (Korneta I., Bujnicki JM., 2012, doi: 10.1371/journal.pcbi.1002641, PMID: 22912569), which together constitute the doctoral thesis.


Structural analysis of ordered regions of human spliceosomal proteins and creation of a library of structural models
Methodology:
Domain detection: I detected ordered structural domains using software, mainly the GeneSilico metaserver (https://genesilico.pl/meta2/) [24], and then corrected the domain boundaries manually. Where possible, I assigned the domains their numbers in the SCOP structural classification [25] and their identifiers in the PFAM classification of evolutionarily conserved domains [26].
Analysis of the locations of ordered domains in the human spliceosome: I compared the list of domains with the list of human spliceosomal proteins divided into groups according to proteomic data.
Creation of the model library: I assigned models to structural regions according to the following procedure:
1. if an experimental model (crystallographic or NMR) existed for the region, I assigned that model to the region;
2. if there was no experimental model but a comparative model could be built, I built a comparative model on the template identified during domain detection;
3. if a comparative model could not be built but the region was short (up to approx. 100 amino acids), I built a de novo model;
4. if a de novo model could not be built, I built a pro forma construct, in which only the primary and secondary structure was reliably reproduced. Most of the pro forma constructs I built represented disordered regions.
The quality of all models (including the experimental ones) was assessed with the MetaMQAPII software [27] and on the QMEAN server [28].
Most important results:
• In the 252 human spliceosomal proteins I detected 465 ordered structural domains, including 80 domains in proteins of the major spliceosome subunits. Ordered structural domains account for ~90% of the ordered part of human spliceosomal proteins and ~50% of the full length of the proteins (about half of the protein length is predicted to be structurally disordered).
• I also found 25 regions with properties indicating that they may be ordered structural domains, but which cannot be assigned to any known group of domains. Finally, while reviewing the experimental models of spliceosomal protein complexes, I also found 9 regions that can be called disordered domains that acquire structure: they have a confirmed independent function and are structured in experimental models, but are predicted to be disordered in isolation (see below).
• The main types of ordered structural domains in human spliceosomal proteins are:
o small RNA-binding domains (e.g. RRM, PWI);
o small domains that bind structurally disordered protein regions (e.g. GYF, WW);
o protein-binding domains composed of structural repeats (e.g. TPR, WD40);
o domains associated with ubiquitin and the ubiquitination process (e.g. zf-UBP, U-box);
o heat-shock-related domains (e.g. HSP20);
o proline isomerase domains (Pro_isomerase);
o domains that form part of the stable architectures of RNA helicases (e.g. DEAD);
o small domains that act as ligands binding larger domains (e.g. PRP4);
o LSM domains, which form the basis of the structural plan of the Sm/Lsm proteins.
• A novel finding here is my detection of a considerable number of ubiquitination-related domains, i.e. domains usually found in proteins that function in the ubiquitination process. The importance of the ubiquitination of spliceosomal proteins for the control of splicing has been shown experimentally only in isolated cases.


Ubiquitination-related domains occur mainly in spliceosomal proteins associated with the second stage of splicing (complexes B-C). Since in the described cases the reversible process of ubiquitination regulates the activity of the spliceosome, one can hypothesise the existence of a regulatory subsystem of the spliceosomal machinery based on (de)ubiquitination. In that case, the fact that proteins containing ubiquitination-related domains tend to occur at the late steps of splicing could result from these stages requiring more precise control than the early (recognition) steps.
• I created a library of 104 experimental models (43 crystallographic, 61 NMR), 297 reliable computational models (255 comparative, 43 de novo) and more than 500 pro forma constructs. The models I built (excluding the pro forma constructs) and the available experimentally solved structures cover more than 90% of the combined length of the protein sequence predicted to be ordered (~50% of the combined total protein sequence). The models I built (again excluding the pro forma constructs) have parameters suitable for use in further studies of the spliceosome, including combining them with the results of cryo-electron microscopy analyses of spliceosomal subunits in order to determine the structure of the whole complex.
• The domains that I found in human spliceosomal proteins, and that had previously been neither found by automatic annotation services nor described in the literature, include domains from proteins of the spliceosomal subunits and from important abundant proteins [e.g. the degenerate C2H2 zinc finger domain of the SF3a120 protein of the U2 subunit, the BLUF domain of the hPrp3 protein of the U4/U6 di-snRNP subunit (annotated as a DUF1115 domain), and the PWI domain of the hBrr2 protein of the U5 subunit and of the RNA helicases hPrp2 and hPrp22].
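The four-step procedure for assigning models to structural regions, described in the Methodology of this section, is a simple priority cascade. The following is a minimal sketch of that cascade in Python, not the workflow actually used in the project. The `Region` record, its field names, the `de_novo_feasible` flag and the exact cut-off of 100 residues (the text says "up to approx. 100 amino acids") are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed hard cut-off; the text only says "up to approx. 100 amino acids".
DE_NOVO_MAX_LENGTH = 100


@dataclass
class Region:
    """Hypothetical record describing one structural region of a protein."""
    name: str
    length: int                            # number of amino-acid residues
    has_experimental_model: bool = False   # crystallographic or NMR model exists
    has_template: bool = False             # template found during domain detection
    de_novo_feasible: bool = True          # assumption: False if de novo modelling fails


def assign_model_type(region):
    """Return the kind of model the cascade would assign to the region."""
    if region.has_experimental_model:      # step 1
        return "experimental"
    if region.has_template:                # step 2
        return "comparative"
    if region.length <= DE_NOVO_MAX_LENGTH and region.de_novo_feasible:  # step 3
        return "de novo"
    return "pro forma"                     # step 4: fallback construct


if __name__ == "__main__":
    examples = [
        Region("region_A", 85, has_experimental_model=True),
        Region("region_B", 240, has_template=True),
        Region("region_C", 70),
        Region("region_D", 310),
    ]
    for region in examples:
        print(region.name, "->", assign_model_type(region))
```

The order of the checks encodes the preference stated in the text: an experimental model always wins, and a pro forma construct is used only when every other option is unavailable.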

Most interesting result: Detecting the domains and building the structural models was the hardest part of the analysis for me, for several reasons. First, the task was extremely laborious and took up the lion's share of the project time. Second, it required more scientific craft than intellectual creativity. Third and last, the ultimate test of the value of the models will only be their use in practice. Nevertheless, there were exciting moments: the most satisfying result of this part of the analysis was, of course, that I found new structural domains in some of the most important spliceosomal proteins, which had previously been analysed many times (e.g. hBrr2). It is satisfying to find something that others have overlooked.

Fig. 2 (next page): Models of domains of human spliceosomal proteins. A: BLUF (DUF1115) domain of the U4/U6-90K protein (hPrp3) (amino acids 540-683). The position of the conserved residue W604 is marked. Predicted RMSD 3.7 Å, QMEAN Z-score -3.06. B: Conserved core of the PRO8NT domain of the hPrp8 protein. De novo model. Predicted RMSD 2.4 Å, QMEAN Z-score -1.93. C-E: PWI domains. C: PWI domain from the helicase hPrp22 (DHX8; residues 1-120 are shown, but the domain may end at amino acid 92). Predicted RMSD 2.4 Å, QMEAN Z-score -2.76. D: PWI domain from the helicase hPrp2 (DHX16; residues 1-95). Predicted RMSD 5.8 Å, QMEAN Z-score -2.19. E: PWI domain from the helicase U5-200K (hBrr2; residues 259-338). Predicted RMSD 3.8 Å, QMEAN Z-score -0.79.



Structural analysis of disordered regions of human spliceosomal proteins


Methodology:
Detection of the boundaries of predicted disordered regions: I detected the boundaries of disordered regions in human spliceosomal proteins using software, mainly the GeneSilico metaserver (https://genesilico.pl/meta2/), and then corrected them manually.
Division of disordered regions into types: I divided the disordered regions into the following types:
1. disordered regions with predicted secondary structure elements (with a subtype showing coiled-coil predictions);
2. longer disordered regions (… amino-acid residues) with a strong bias in amino-acid composition;
3. other regions.
Among the regions with a strong bias in amino-acid composition I distinguished three subtypes corresponding to the synthetic descriptions given in the Introduction of this summary [RS-like, polyproline/polyglutamine-rich, and glycine- (and arginine-)rich], and I added two complementary types (charged and uncharged).
Analysis of the occurrence of disordered regions in the human spliceosome: I compared the list of locations of the different types of disordered regions with the list of human spliceosomal proteins divided into groups according to proteomic data.
Analysis of post-translational modification sites in disordered regions: I retrieved the list of positions of post-translational modification sites in spliceosomal proteins from the UniProt protein sequence database [29]. I then compared the locations of post-translational modifications in human spliceosomal proteins with the list of locations of the different types of disordered regions.
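Comparing modification sites with disordered regions amounts to asking, for each annotated site, which region type (if any) contains its position. The following is a minimal sketch of such a tally in Python, assuming that regions are given as non-overlapping 1-based inclusive intervals. The function name, the tuple layout and the toy data are hypothetical, and no real UniProt records are used.

```python
from collections import Counter

# Label used for sites that fall in none of the listed regions (an assumption).
OUTSIDE = "outside listed regions"


def count_ptms_by_region_type(regions, ptm_sites):
    """Count modification sites per (region type, modification) pair.

    regions   -- list of (start, end, region_type); 1-based, inclusive,
                 assumed not to overlap
    ptm_sites -- list of (position, modification); 1-based residue positions
    """
    counts = Counter()
    for position, modification in ptm_sites:
        region_type = OUTSIDE
        for start, end, candidate_type in regions:
            if start <= position <= end:
                region_type = candidate_type
                break
        counts[(region_type, modification)] += 1
    return counts


if __name__ == "__main__":
    # Toy annotation for a single hypothetical protein.
    regions = [(1, 50, "RS-like"), (80, 120, "glycine-rich")]
    ptm_sites = [
        (10, "phosphoserine"),
        (45, "phosphoserine"),
        (90, "methylarginine"),
        (60, "acetyllysine"),
    ]
    for key, n in sorted(count_ptms_by_region_type(regions, ptm_sites).items()):
        print(key, n)
```

Summing such tallies over all proteins gives a contingency table of modification type against region type, which is the kind of table from which an association between a modification and a region type could be assessed.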

Analysis of predicted disordered regions found in experimental models, and prediction of additional regions from these domain classes: I compared the list of disordered domains that acquire structure, found while creating the library of protein models (i.e. predicted disordered regions that are found in experimental models and have an independent function), with the list of human spliceosomal proteins divided into groups according to proteomic data, in order to find out in which protein groups these domains occur most often. Using pattern recognition methods (for motifs of <30 amino-acid residues) and domain detection, I predicted additional potential locations of these domains in the proteins.
Prediction and analysis of additional regions that potentially acquire structure: By comparing the list of evolutionarily conserved PFAM domains with the list of disordered regions, I identified additional potential disordered domains. From the list of the most disordered proteins containing potential disordered domains, I selected the proteins that are both evolutionarily conserved and abundant in the human spliceosome, and in this way identified the important proteins.
Analysis of the relative age of disordered and ordered regions in human spliceosomal proteins: [Note: Because spliceosomal proteins are strongly conserved (especially the abundant proteins) [30], the evolution of the whole spliceosomal proteome can be inferred from the conserved domains present in the human proteins.] I compared the list of evolutionarily conserved PFAM domains found in human spliceosomal proteins with the list of domains predicted to have been present in the last eukaryotic common ancestor (LECA), and checked whether these domains are currently widespread in bacteria and/or Archaea. I then compared the relative age and prevalence of the PFAM domains that fall within the ordered and the disordered regions of the proteins.
Comparative analysis of structural disorder in the subunits of the human spliceosome and in the subunits of the human and Escherichia coli ribosomes: I divided the proteins of the human and E. coli ribosomes into ordered regions, disordered regions with predicted secondary structure, and disordered regions without predicted secondary structure. I then compared the structural disorder parameters of these two ribosomes with those of the human spliceosome. Finally, for the E. coli ribosome I checked what fraction of the predicted disordered regions is found in the experimental model of the ribosome. For the human ribosome such an analysis was impossible, because no experimental model of that ribosome existed.
Most important results:
• Human spliceosomal proteins are highly disordered. >30% of the length of the proteins of the major spliceosome subunits, >40% of the length of the 122 most important proteins of the major spliceosome (119 abundant proteins + 3 additional ones selected to complete protein complexes) and >50% of the length of all human spliceosomal proteins is predicted to be structurally disordered.
• Among the spliceosomal subunits, the proteins of U1 snRNP, of the U2 SF3A subcomplex and of U11/U12 di-snRNP, the U2-related proteins and the proteins specific to the U4/U6.U5 tri-snRNP complex are more disordered than the proteins of the U2 SF3B subcomplex, of U4/U6 di-snRNP and of U5 snRNP, and the Sm and Lsm proteins. This means that, apart from the proteins specific to the U4/U6.U5 tri-snRNP complex, the early spliceosomal subunit proteins (those present at the recognition stage) are more disordered than the late proteins (those present at the catalysis stage). The same holds for proteins that are independent splicing factors: proteins characteristic of complex A are more disordered than those characteristic of complexes B-C.
• Early spliceosomal proteins contain more structural disorder without predicted secondary structure but with a strongly biased amino-acid composition than late proteins do, whereas late proteins (including the proteins of the U4/U6.U5 tri-snRNP subunit) contain more structural disorder with predicted secondary structure.
Structural disorder with predicted secondary structure is usually found in the experimental models of spliceosomal complexes, whereas long disordered regions without predicted secondary structure and with strongly biased amino acid composition are not. This means that most of the structural disorder of the early splicing proteins may not acquire any structure during splicing, whereas most of the structural disorder of the late proteins may potentially acquire structure in the course of splicing. For the U5 subunit, in whose proteins only ~20% of residues are predicted to be disordered, more than half of the predicted disordered residues have predicted secondary structure. This means that this subunit may be almost entirely ordered.

Among the different types of disorder with strongly biased amino acid composition, all three types that I defined by synthesizing previously published information [RS-like, polyproline/polyglutamine-rich, and glycine- (and arginine-)rich] occur commonly in early proteins, whereas only the RS-like type occurs commonly in late proteins. The prevalence of these regions in early proteins, combined with experimental findings on their role, suggests that they are an important element of the first stage of splicing (the definition of intron boundaries). The occurrence of RS-like regions in late proteins, again combined with experimental findings, suggests that they may also be responsible for controlling the dynamics of the splicing process.

In addition, given that RS-like regions and glycine- and arginine-rich regions often co-occur in the same proteins and occur in proteins that interact with one another (mutually inhibiting each other's activity; in particular SR proteins and the hnRNP A1 protein), it is possible that these two types of structural disorder interact with each other, both in these and in other proteins.

Non-abundant human spliceosomal proteins, besides containing on average more structural disorder than abundant proteins, also contain more disorder with strongly biased amino acid composition. These proteins contain all three types of disordered regions defined on the basis of earlier literature descriptions.

Among post-translational modifications, serine phosphorylation is systematically associated with RS-like disordered regions of human spliceosomal proteins, and arginine methylation with disordered regions rich in glycine and arginine. Lysine N-acetylation and acetylation of the N-terminal residues of human spliceosomal proteins do not depend on the degree of order.

Disordered domains that can be found in the experimental models of the structures of spliceosomal protein complexes can be divided into two types: ULMs and others. ULMs occur in multiple copies in the human spliceosomal proteome, among early proteins (the U2 subunit, U2-associated proteins, complex A proteins). Using pattern-recognition methods, I found several additional potential ULM sites, also mainly in early proteins. The prevalence of ULMs in early proteins suggests that they too are an important element of the first stage of splicing (the definition of intron boundaries). By contrast, domains other than ULMs occur in fewer copies in the spliceosomal proteome and usually bind a specific partner.
Overall, the human spliceosomal proteome contains 51 evolutionarily conserved PFAM domains that cover disordered protein regions (46 different domain types). These PFAM domains may indicate the location of disordered domains, including disordered domains that can acquire structure. In particular, several highly disordered proteins for which these PFAM domains are the only conserved part of the protein are strongly conserved in evolution and abundant in the human spliceosomal proteome. These conserved, highly disordered proteins occur mostly at the late stage of splicing (they are two of the three proteins of the U4/U6.U5 tri-snRNP subunit and several independent splicing factors) and may be potential hub proteins of the spliceosomal protein network.

Most of both the disordered and the ordered conserved PFAM domains found in human spliceosomal proteins were present in the last common ancestor of eukaryotes. However, almost none of the disordered domains is currently widespread outside Eukaryota, whereas about 1/3 of the ordered domains are. In particular, a group of proteins centered on the U4/U6 di-snRNP and U5 subunits (including the Sm/Lsm proteins and the C-terminal domains of the RNA helicases hPrp2/22/16/43) either has bacterial homologs or consists of domains widespread in all three superkingdoms of life. This group of proteins may be the oldest part of the spliceosome and constitute its core, onto which disordered regions, among other elements, were later added.

Structural disorder in the human spliceosome differs considerably from that in the human and E. coli ribosomes. Disordered regions in ribosomes are much shorter, and most of them show predicted secondary structure. In the E. coli ribosome, most residues predicted to be disordered in the isolated protein can be found in the structure of the complex. The subunits of both ribosomes also show less variation in the degree of predicted structural disorder than the subunits of the human spliceosome.

The smaller variation of structural disorder in ribosomes is probably due to the fact that most structural disorder of ribosomal proteins has a similar function: it forms a "glue" that supports the rRNA structure. On the basis of bioinformatics analyses, I could not predict whether the RNA-glue function is common in the spliceosome. In the experimental models of spliceosomal proteins I found only one predicted disordered fragment that binds snRNA: the fragment at the N-terminus of the U1-70K protein. There is, however, an important reason why the glue function may be less common in the spliceosome than in the ribosome. Ribosomal RNA is much longer than spliceosomal RNA (e.g. human 28S rRNA has 5070 nucleotides, whereas the longest human snRNA, U2 snRNA, has 188). This means that snRNA can (probably) fold "on its own" much more easily, without the help of a glue.
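The check described above for the E. coli ribosome (what fraction of residues predicted as disordered is nevertheless visible in the experimental model) can be illustrated with a minimal Python sketch. The function name `fraction_resolved` and the residue numbers are hypothetical and serve only as an illustration; they are not taken from the thesis or from any real structure.

```python
def fraction_resolved(predicted_disordered, observed):
    """Fraction of predicted-disordered residues that are present in a model.

    Both arguments are iterables of residue numbers for a single protein.
    """
    predicted = set(predicted_disordered)
    if not predicted:
        return 0.0
    return len(predicted & set(observed)) / len(predicted)


# Toy data (hypothetical): residues 1-20 are predicted to be disordered,
# while the experimental model of the complex resolves residues 6-120.
predicted_disordered = range(1, 21)
observed_in_complex = range(6, 121)

print(fraction_resolved(predicted_disordered, observed_in_complex))  # 0.75
```

A value close to 1 would correspond to the ribosomal situation described above, where most predicted disorder becomes ordered in the complex; a low value would correspond to disorder that stays invisible in experimental models.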
On the basis of the above analyses, I created a conceptual model that divides the human spliceosome into three layers:
o an inner layer (the "hard core"): highly ordered proteins (domains) that directly support the catalysis carried out by snRNA; precise mechanisms of action; in the major spliceosome, mainly the proteins of the U2 snRNP SF3B, U4/U6 di-snRNP and U5 snRNP subunits, the Sm/Lsm proteins and the ordered C-terminal domains of the RNA helicases hPrp2/22/16/43; potentially the evolutionarily oldest regions of the spliceosome;
o an intermediate layer (the "mantle"): mainly structural disorder that can adopt structure under some conditions (mainly disorder that shows predicted secondary structure), including conserved disordered PFAM domains; functionally associated mainly with spliceosome dynamics; in the major spliceosome, mainly proteins characteristic of the U4/U6.U5 tri-snRNP subunit and independent splicing factors characteristic of complexes B-C; also RS domains and possibly ubiquitin-related domains;
o an outer layer (the "atmosphere"): mainly regions of structural disorder that do not adopt structure under any conditions, especially long disordered regions with strongly biased amino acid composition [including RS-like, polyproline/polyglutamine-rich and glycine- (and arginine-)rich regions] and ULMs; these regions may serve as "sensors" or "protrusions" that contact one another, the pre-mRNA, and the small ordered domains also present in this layer (e.g. GYF, WW, UHM). Other small ordered domains (e.g. RRM, PWI) may also bind pre-mRNA. The main function is the recognition and definition of intron boundaries (and hence the regulation of alternative splicing). Mainly early spliceosomal proteins (complex A; the U1, U2 SF3A and U11/U12 di-snRNP subunits; U2-related proteins; among non-abundant proteins, SR proteins, hnRNPs, SRm160/300 and the RES complex). Function may be regulated through post-translational modifications: serine phosphorylation in RS-like regions and arginine methylation in glycine- (and arginine-)rich regions.
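The three region classes that underlie this model (ordered, disordered with predicted secondary structure, disordered without predicted secondary structure) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the per-residue inputs, the H/E/C secondary-structure encoding and the function names are hypothetical, and the actual predictors and thresholds used in the thesis are not reproduced here.

```python
from itertools import groupby


def classify_residues(disorder, secondary):
    """Label residues: 'O' ordered, 'S' disordered with predicted secondary
    structure, 'D' disordered without predicted secondary structure.

    disorder  -- per-residue booleans (True = predicted disordered)
    secondary -- per-residue string, assumed encoding: H helix, E strand, C coil
    """
    labels = []
    for is_disordered, ss in zip(disorder, secondary):
        if not is_disordered:
            labels.append("O")
        elif ss in ("H", "E"):
            labels.append("S")
        else:
            labels.append("D")
    return "".join(labels)


def segments(labels):
    """Collapse per-residue labels into (label, start, end), 1-based inclusive."""
    result, position = [], 1
    for label, run in groupby(labels):
        length = len(list(run))
        result.append((label, position, position + length - 1))
        position += length
    return result


def disorder_fraction(labels):
    """Fraction of the protein length that is labeled as disordered."""
    return sum(label != "O" for label in labels) / len(labels)


# Toy 10-residue protein (hypothetical predictions).
disorder = [True, True, True, True, False, False, False, False, True, True]
secondary = "CCHHHHEECC"

labels = classify_residues(disorder, secondary)
print(labels)                     # DDSSOOOODD
print(segments(labels))           # [('D', 1, 2), ('S', 3, 4), ('O', 5, 8), ('D', 9, 10)]
print(disorder_fraction(labels))  # 0.6
```

Summing `disorder_fraction` weighted by protein length over a set of proteins would give figures of the same kind as the ">30% of the length" statistics quoted in the key results.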

Most interesting result: This part of the project was considerably more interesting to me than the first part, because after combining data from the different analyses I was, by its end, able to create a single coherent model (conceptual, not structural) of a phenomenon that had not previously been described systematically: the occurrence and function of structural disorder across the whole human spliceosome. My model can serve as a reference point for further studies of disorder in the spliceosome. At the same time, the detailed results of my analyses concerning specific protein regions, which I published together with the general model, can be tested experimentally.

Fig. 3 (next page): The three-layer model of the (human) spliceosome.


Comparison of the proteins and protein structural regions of the human and G. lamblia spliceosomes


Methodology: Using the BLASTP and PSI-BLAST variants of the BLAST tool available from the NCBI website (http://blast.ncbi.nlm.nih.gov/Blast.cgi), I found G. lamblia homologs of human spliceosomal proteins in the Protein sequence database. I then detected ordered structural domains in the G. lamblia proteins and divided the proteins into regions using the methodology described earlier for the human proteins. Finally, I compared the human and G. lamblia sets of proteins and domains.

Key results: The known spliceosomal proteome of G. lamblia ATCC50803 contains homologs of 30 human spliceosomal proteins; the known proteome of G. lamblia P15 contains homologs of two additional proteins. For comparison, the known proteome of the yeast S. cerevisiae contains homologs of 61 of the 119 abundant proteins of the human spliceosome.

The proteins that have homologs in G. lamblia are mainly the Sm and Lsm proteins, proteins of the U2 and U5 subunits of the spliceosome, and the spliceosomal RNA helicases. In other words, these are mainly "hard core" proteins that directly support the catalytic splicing process carried out by spliceosomal RNA.

G. lamblia spliceosomal proteins are usually shorter than the human ones. The missing regions fall into three main types:
o ubiquitin-related domains;
o disordered regions with a predicted independent role (i.e. not interdomain linkers);
o short protein fragments that serve as ligands binding protein domains (usually fragments predicted to be disordered in isolation but structured in the complex), and their partners.

In other words, accepting the hypotheses presented earlier in this work about the role of ubiquitin-related ordered domains and of structural disorder in spliceosomal proteins, the regions that are degenerate or missing among the proteins and domains of the known G. lamblia spliceosomal proteome are those responsible for functions such as the initial recognition of intron boundaries, the mutual recognition of spliceosomal subunits, and the control of spliceosome dynamics. By contrast, the proteins that directly support the basic activity of the spliceosome, i.e. the catalytic splicing process, are almost unchanged; evolutionarily, these are potentially the oldest elements of the spliceosome, borrowed from bacterial systems.

Because the lists of proteins and protein structural regions that are present in and absent from G. lamblia relative to human give a coherent picture, it can be assumed that the analysis is meaningful. This means that the list of structural regions present in G. lamblia is a good starting point for modeling the spliceosome: it is a list of regions that should be included in a model of the spliceosome (from any organism).
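The tallying step of such a homolog search can be sketched in Python. The sketch below assumes that the BLAST hits have been saved in the standard 12-column tabular format (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The E-value cutoff of 1e-5, the function name and all identifiers in the toy data are assumptions made for illustration; the thesis does not state which cutoffs were applied, and the searches themselves were run through the NCBI web service.

```python
import csv
import io


def queries_with_homologs(tabular_text, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    tabular_text -- BLAST hits in 12-column tab-separated format,
                    where column 11 (index 10) holds the E-value.
    """
    found = set()
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) <= max_evalue:
            found.add(row[0])
    return found


# Toy hits (hypothetical identifiers and scores, not real search results).
rows = [
    ["human_A", "giardia_1", "45.0", "200", "110", "0", "1", "200", "5", "204", "1e-50", "180"],
    ["human_B", "giardia_2", "30.0", "90", "63", "0", "10", "99", "1", "90", "2e-08", "55"],
    ["human_C", "giardia_3", "25.0", "40", "30", "0", "1", "40", "7", "46", "0.5", "20"],
]
toy_output = "\n".join("\t".join(row) for row in rows)

hits = queries_with_homologs(toy_output)
print(sorted(hits))  # ['human_A', 'human_B']
print(f"{len(hits)} of {len(rows)} toy queries have a putative homolog")
```

An E-value filter alone is only a first pass; as described above, candidate homologs were additionally examined at the level of structural domains and regions.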

Most interesting result: This entire analysis! The idea arose from the fact that the model organism for studying the spliceosome is the yeast S. cerevisiae. The yeast spliceosome is simpler than the human one, but not by much; on the other hand, Saccharomycetes yeasts have their own spliceosomal proteins that lack homologs even in other fungi. While searching for homologs of human proteins in other organisms, I noticed that the organism with the minimal set of spliceosomal proteins is G. lamblia. There is a good reason for this: this protozoan is characterized by general genomic minimalism and by a small number of introns in its genome, which means that it should not need a spliceosome with a complicated regulatory mechanism. I therefore decided that, to determine the minimal set of protein sequences in the spliceosomal proteome, I should compare the human and G. lamblia proteomes; the yeast proteome is still too complicated.

Because G. lamblia is a parasite, and there is always a possibility that it obtains most of its spliceosomal proteins from its host, one cannot claim (at least until experimental studies are done, if they ever are) that G. lamblia possesses a minimal spliceosome. Nevertheless, the fact that both the list of proteins and regions present in the G. lamblia spliceosomal proteome and the list of those absent from it form functionally coherent sets indicates that this may indeed be the case. In any case, the list of regions common to G. lamblia and human is a good initial set of regions for modeling the spliceosome.

The second interesting result of this analysis is the discovery that the strongly conserved protein Prp8 has, in G. lamblia, a different C-terminal domain than in almost all other organisms: instead of a domain that potentially binds ubiquitin, it has a domain with the ubiquitin fold (this is the only ubiquitin-related domain present in the set of G. lamblia spliceosomal proteins). Prp8 lies almost at the very heart of the spliceosome: it binds, permanently or transiently, most of the catalytic snRNAs of the spliceosome and plays a central role in its correct functioning. Analysis of the causes of such a drastic change in the structure of this protein, and of its effect on the overall structure of the spliceosome, may yield interesting results.


Publication of the data
Methodology: The data were published in a web service created by Marcin Magnus, MSc, at http://iimcb.genesilico.pl/SpliProt3D. The service is the only result of the research project of which I am not the main author. I did, however, take part in its creation as designer, co-author of the description, etc.

Key results: The service contains all structural models of human spliceosomal proteins that make up the library described in the section "Structural analysis of the ordered regions", including experimental, comparative and de novo models, as well as constructs of regions that are disordered or cannot be modeled with the above methods. The models can be searched, downloaded to a local computer, etc. In addition, the service contains sequence alignments of homologs of the human proteins from representative eukaryotic species, annotated with prediction results and structural data for the human protein. The annotations included are: predictions of structural disorder, secondary structure, disorder with protein-binding potential, residue burial and coiled coils, and data on post-translational modification sites retrieved from the UniProt database.

A complete description of the service (in English) is available at http://iimcb.genesilico.pl/SpliProt3D/home/. The full archive of files available for download takes up about 250 MB.

Most interesting result: For me, the most interesting challenge in designing the service was the need to visualize legibly the combination of a sequence alignment of homologs of the human proteins with a description of the properties of the human protein. Existing tools that aggregate different types of data on protein properties (e.g. the GeneSilico metaserver, https://genesilico.pl/meta2/) are powerful, but they are usually oriented toward including as much data as possible at the expense of visual clarity and are therefore unsuitable for visualization. For the service and for publishing the results in articles, however, the alignment and prediction data had to be integrated in a compact form. I think that the final result of my work, for which I used the Jalview program [31], meets this requirement well.

Fig. 4 (next page): Visualization of the sequence alignment and structural predictions for the SNRPD3 protein in the SpliProt3D database (made in Jalview).


Summary of project results


Within the project I managed to:
o systematically analyze the ordered regions of the proteins of the human spliceosomal proteome; identify regions that, from a structural point of view, are trivial, interesting, or currently impossible to model; create a library of experimental and computationally generated models that can be used in further research;
o establish the existence of an interesting phenomenon that was previously almost entirely undescribed in the literature, namely structural disorder in spliceosomal proteins; gather the scattered information on this subject and then systematically analyze human spliceosomal proteins with respect to various aspects of structural disorder; finally, create a coherent model that describes this phenomenon and can serve as a foundation for further research; at the same time, present specific predictions concerning particular protein fragments that are predicted to be disordered;
o by comparing the human spliceosomal proteome with the spliceosomal proteome of G. lamblia, create a list of protein regions that should be included in a structural model of the spliceosome.

Perhaps the most important discovery I made in the project is also the simplest one: human spliceosomal proteins are highly disordered. Moreover, according to the predictions, a considerable part of the structural disorder of spliceosomal proteins will not acquire structure even after binding in protein complexes. For protein regions that are structurally disordered, single reliable high-quality structural models (whether experimental or, above all, computational) cannot be built. This means that building a model of the structure of the spliceosome (whether experimental or, above all, computational) may be considerably more difficult. Structural characterization of the disordered fragments of human spliceosomal proteins will certainly require specialized experimental and computational methods aimed at studying disordered proteins, such as EOM (Ensemble Optimization Method, a method for optimizing results from SAXS/SANS experiments) [32].


Bibliography
1. Tarn WY, Steitz JA (1996) A novel spliceosome containing U11, U12, and U5 snRNPs excises a minor class (AT-AC) intron in vitro. Cell 84: 801-811. PMID: 8625417.
2. Valadkhan S, Jaladat Y (2010) The spliceosomal proteome: at the heart of the largest cellular ribonucleoprotein machine. Proteomics 10: 4128-4141. PMID: 21080498.
3. Agafonov DE, Deckert J, Wolf E, Odenwalder P, Bessonov S, et al. (2011) Semiquantitative proteomic analysis of the human spliceosome via a novel two-dimensional gel electrophoresis method. Mol Cell Biol 31: 2667-2682. PMID: 21536652.
4. Collins LJ, Penny D (2009) The RNA infrastructure: dark matter of the eukaryotic cell? Trends Genet 25: 120-128. PMID: 19171405.
5. Wahl MC, Will CL, Luhrmann R (2009) The spliceosome: design principles of a dynamic RNP machine. Cell 136: 701-718. PMID: 19239890.
6. McKay SL, Johnson TL (2010) A bird's-eye view of post-translational modifications in the spliceosome and their roles in spliceosome dynamics. Mol Biosyst 6: 2093-2102. PMID: 20672149.
7. Bellare P, Small EC, Huang X, Wohlschlegel JA, Staley JP, et al. (2008) A role for ubiquitin in the spliceosome assembly pathway. Nat Struct Mol Biol 15: 444-451. PMID: 18425143.
8. Mathew R, Hartmuth K, Mohlmann S, Urlaub H, Ficner R, et al. (2008) Phosphorylation of human PRP28 by SRPK2 is required for integration of the U4/U6-U5 tri-snRNP into the spliceosome. Nat Struct Mol Biol 15: 435-443. PMID: 18425142.
9. Cordin O, Hahn D, Beggs JD (2012) Structure, function and regulation of spliceosomal RNA helicases. Curr Opin Cell Biol 24: 431-438. PMID: 22464735.
10. Azubel M, Wolf SG, Sperling J, Sperling R (2004) Three-dimensional structure of the native spliceosome by cryo-electron microscopy. Mol Cell 15: 833-839. PMID: 15350226.
11. Pomeranz Krummel DA, Oubridge C, Leung AK, Li J, Nagai K (2009) Crystal structure of human spliceosomal U1 snRNP at 5.5 Å resolution. Nature 458: 475-480. PMID: 19325628.
12. Leung AK, Nagai K, Li J (2011) Structure of the spliceosomal U4 snRNP core domain and its implication for snRNP biogenesis. Nature 473: 536-539. PMID: 21516107.
13. Jurica MS (2008) Detailed close-ups and the big picture of spliceosomes. Curr Opin Struct Biol 18: 315-320. PMID: 18550358.
14. Opalka N, Brown J, Lane WJ, Twist KA, Landick R, Asturias FJ, Darst SA (2010) Complete structural model of Escherichia coli RNA polymerase from a hybrid approach. PLoS Biol 8(9): e1000483. PMID: 20856905.
15. Tompa P (2009) Structure and Function of Intrinsically Disordered Proteins. Chapman & Hall.
16. Wimberly BT, Brodersen DE, Clemons WM Jr, Morgan-Warren RJ, Carter AP, et al. (2000) Structure of the 30S ribosomal subunit. Nature 407: 327-339. PMID: 11014182.
17. Haynes C, Iakoucheva LM (2006) Serine/arginine-rich splicing factors belong to a class of intrinsically disordered proteins. Nucleic Acids Res 34: 305-312. PMID: 16407336.
18. Long JC, Caceres JF (2009) The SR protein family of splicing factors: master regulators of gene expression. Biochem J 417: 15-27. PMID: 19061484.
19. Kofler M, Schuemann M, Merz C, Kosslick D, Schlundt A, et al. (2009) Proline-rich sequence recognition: I. Marking GYF and WW domain assembly sites in early spliceosomal complexes. Mol Cell Proteomics 8: 2461-2473. PMID: 19483244.
20. Han SP, Tang YH, Smith R (2010) Functional diversity of the hnRNPs: past, present and perspectives. Biochem J 430: 379-392. PMID: 20795951.
21. Bedford MT, Richard S (2005) Arginine methylation - an emerging regulator of protein function. Mol Cell 18: 263-272. PMID: 15866169.
22. Kielkopf CL, Rodionova NA, Green MR, Burley SK (2001) A novel peptide recognition mode revealed by the X-ray structure of a core U2AF35/U2AF65 heterodimer. Cell 106: 595-605. PMID: 11551507.
23. Morrison HG, McArthur AG, Gillin FD, Aley SB, Adam RD, Olsen GJ, Best AA, Cande WZ, Chen F, Cipriano MJ, et al. (2007) Genomic minimalism in the early diverging intestinal parasite Giardia lamblia. Science 317: 1921-1926. PMID: 17901334.
24. Kurowski MA, Bujnicki JM (2003) GeneSilico protein structure prediction meta-server. Nucleic Acids Res 31: 3305-3307. PMID: 12824313.
25. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536-540. PMID: 7723011.
26. Finn RD, Mistry J, Tate J, Coggill P, Heger A, et al. (2010) The Pfam protein families database. Nucleic Acids Res 38: D211-222. PMID: 22127870.
27. Pawlowski M, Gajda MJ, Matlak R, Bujnicki JM (2008) MetaMQAP: a meta-server for the quality assessment of protein models. BMC Bioinformatics 9: 403. PMID: 18823532.
28. Benkert P, Kunzli M, Schwede T (2009) QMEAN server for protein model quality estimation. Nucleic Acids Res 37: W510-W514. PMID: 19429685.
29. Magrane M, Consortium U (2011) UniProt Knowledgebase: a hub of integrated protein data. Database 2011: bar009. PMID: 21447597.
30. Collins L, Penny D (2005) Complex spliceosomal organization ancestral to extant eukaryotes. Mol Biol Evol 22: 1053-1066. PMID: 15659557.
31. Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ (2009) Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics 25: 1189-1191. PMID: 19151095.
32. Bernado P, Mylonas E, Petoukhov MV, Blackledge M, Svergun DI (2007) Structural characterization of flexible proteins using small-angle X-ray scattering. J Am Chem Soc 129: 5656-5664. PMID: 17411046.

Publications

Nucleic Acids Research Advance Access published May 9, 2012


Nucleic Acids Research, 2012, 1-20 doi:10.1093/nar/gks347

Structural bioinformatics of the human spliceosomal proteome


Iga Korneta1, Marcin Magnus1 and Janusz M. Bujnicki1,2,*
1Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw PL-02-109 and 2Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznan PL-61-614, Poland

Received January 18, 2012; Revised March 27, 2012; Accepted March 30, 2012

ABSTRACT

In this work, we describe the results of a comprehensive structural bioinformatics analysis of the spliceosomal proteome. We used fold recognition analysis to complement prior data on the ordered domains of 252 human splicing proteins. Examples of newly identified domains include a PWI domain in the U5 snRNP protein 200K (hBrr2, residues 258-338), while examples of previously known domains with a newly determined fold include the DUF1115 domain of the U4/U6 di-snRNP protein 90K (hPrp3, residues 540-683). We also established a non-redundant set of experimental models of spliceosomal proteins, as well as constructed in silico models for regions without an experimental structure. The combined set of structural models is available for download. Altogether, over 90% of the ordered regions of the spliceosomal proteome can be represented structurally with a high degree of confidence. We analyzed the reduced spliceosomal proteome of the intron-poor organism Giardia lamblia, and as a result, we proposed a candidate set of ordered structural regions necessary for a functional spliceosome. The results of this work will aid experimental and structural analyses of the spliceosomal proteins and complexes, and can serve as a starting point for multiscale modeling of the structure of the entire spliceosome.

INTRODUCTION

The spliceosome is a eukaryotic macromolecular ribonucleoprotein (RNP) complex that performs the excision of introns (non-coding sequences) from pre-mRNAs following transcription. In humans, two forms of the spliceosome exist. The major spliceosome, which excises >99% of human introns, is composed primarily out of four stable small nuclear ribonucleoprotein (snRNP) particles (subunits), named after their small nuclear RNA (snRNA) components: U1, U2, U4/U6 and U5. The minor spliceosome, which is absent in many species and which in human excises the remaining <1% introns, contains a U5 snRNP identical to the one from the major spliceosome, as well as two other snRNPs: U11/U12, and U4atac/U6atac. The U11/U12, and U4atac/U6atac di-snRNPs are distinct from, but structurally and functionally analogous to, the U1 and U2, and U4/U6 di-snRNP, respectively (1).

The major human spliceosome contains 45 distinct proteins in its snRNP subunits in addition to around 80 abundant non-snRNP proteins (2). These proteins, together with the snRNAs, may be considered to be an experimental approximation of the core of the spliceosome, that is the set of structural elements necessary for the procession of the splicing reaction. Proteomics analyses of spliceosomal proteomes from various species yield also up to over 100 non-abundant splicing proteins (2-8), which may be active e.g. in certain instances of splicing. Out of the 45 distinct snRNP proteins, only seven, the so-called Sm proteins, are present in more than one copy. The Sm proteins form heteroheptamers with a toric shape, one per each of the U1, U2, U4 and U5 snRNPs. In each snRNP, the Sm heteroheptamer forms a platform that supports the respective snRNA. A similar platform associated with the U6 snRNA is composed of a set of seven related like-Sm proteins (9). Splicing-related proteins may also participate in other cellular events, including mRNA transcription (10,11), 5' capping, 3' cleavage and polyadenylation, as well as mRNA export, localization and decay (12,13) and box C/D snoRNP formation (14).
Downloaded from http://nar.oxfordjournals.org/ by guest on August 13, 2012

*To whom correspondence should be addressed. Tel: +48 22 597 0750; Fax: +48 22 597 0715; Email: iamb@genesilico.pl
© The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

While the majority of non-snRNP proteins are independent factors, some associate into non-snRNP protein complexes, which include the hPrp19/CDC5L (NTC) complex (15), the exon-junction complex (EJC) (16), the cap-binding complex (CBP) (17), the retention-and-splicing complex (RES) (18), and the transport-and-exchange complex (TREX) (19). These complexes may also have non-splicing functions (16,20).

A characteristic feature of the spliceosome is its extraordinary dynamism, as the snRNP composition of a spliceosome entity bound to the substrate pre-mRNA changes depending on the stage of the splicing reaction. For the major spliceosome, an E (entry) complex spliceosome contains U1 snRNP, an A complex contains U1 and U2 snRNP, a B complex contains U1 and U2 snRNP in addition to a tri-snRNP entity composed of the U4/U6 and U5 snRNPs, called U4/U6.U5, while the activated B (B-act) and catalytic (C) complexes contain U2, U5 and U6 snRNPs. After the splicing catalysis occurs and the mRNA is released, the initial configuration of the snRNPs (U1, U2 and U4/U6 and U5 separately) is recycled (21). Each stage-specific configuration of the snRNP subunits is also associated with a different non-snRNP protein complement. As a result, just like the snRNP composition, the non-snRNP composition of a given instance of the spliceosome also varies (2). In recent years, evidence has surfaced that ubiquitin-based (22-24) and intrinsic disorder-based (25) systems may contribute to the regulation of splicing assembly and dynamics.

To further the studies of the spliceosome and the association between splicing and other cellular processes, it is useful to determine the domain architecture and the three-dimensional structures of spliceosomal proteins. Detailed knowledge of protein structure can help determine how molecules perform their biological functions. Structure can also aid in understanding the effects of variations, resulting, e.g. from SNPs or from alternative splicing, which may have implications for disease. Besides, identification of structural similarities can reveal distant evolutionary relationships between proteins that cannot be detected from a comparison of their sequences alone (26). Of particular importance is the structural analysis of components of larger systems and complexes that have eluded high-resolution structural characterization.
For instance, it has been suggested that highresolution models of individual snRNP components may be t into molecular envelopes created by low-resolution cryo-electron microscopy (cryo-EM) maps (27) to construct structures of the spliceosome at different stages of its action (28). Thereby, structural characterization of individual components of the spliceosome can bring us closer to modeling the structure and function of the entire system. There are two main potential gaps in our understanding of the structure of the protein components of the spliceosome. The rst one lies in recognizing the protein architecture at the primary level, e.g. the detection of conserved/structured domains and disordered regions. Most structural domains of splicing proteins are annotated by automated inferences in protein sequence databases such as UniProt (29). Many domains, especially those of the core splicing proteins, have also been characterized in literature. However, automated annotations are limited in that they can only either spread information that is already available in the system (such as through homology inferences) or information that conforms to tight preset standards (such as in the detection of domains that conform to PFAM domain proles) (30). Hence, at times, elements of protein architecture remain undetected throughout automated annotation,

and can only be determined through additional analyses and human interpretation of other data.

The second gap lies in the lack of structural representation. Partial or complete structures have been determined for many splicing-related proteins and their complexes. These include a nearly complete U1 snRNP (31), U4 snRNP core with the Sm ring (32), several complexes associated with the spliceosome such as the human EJC (33) or the human CBP (34) and various protein–protein and protein–RNA complexes, such as the human U2 snRNP protein p14 (SF3b14a) bound to a region of SF3b155 (35). In total, as of December 2011, data from the Protein Data Bank (PDB) (36) show that at least 340 structures have been determined by X-ray crystallography and NMR for human spliceosomal proteins or their domains, either alone or in various complexes. Many of these structural models are redundant because they represent the same regions of the same proteins. However, for many regions, no three-dimensional models are available.

As an essential step towards enhancing our current understanding of the spliceosome, we have carried out a systematic structural bioinformatics analysis of the proteins of the human spliceosomal proteome, with a dual focus on characterizing their ordered parts and modeling their structures. In an effort to help set the priorities for future modeling of the entire spliceosome, we also compared the human spliceosomal proteome with the proteome of the parasitic diplomonad Giardia lamblia, known for its genomic minimalism. We put forward the set of structural regions common for human and G. lamblia as an attractive target for future studies. This analysis complements a parallel study of the unstructured part of the proteins of the spliceosome (I.K. and J.M.B., submitted for publication), and runs alongside efforts of many research groups to characterize the structure of spliceosomal RNAs and map out the interactions between the spliceosomal components.

Downloaded from http://nar.oxfordjournals.org/ by guest on August 13, 2012

MATERIALS AND METHODS

Collection and classification of spliceosome proteins

A total of 244 proteins found in the proteomics analyses of the major human spliceosome [sourced from one or more of the following references (2,4,8,37–41)], and 8 proteins specific to the U11/U12 di-snRNP subunit of the minor spliceosome (Supplementary Table S1) (42), were downloaded from the NCBI Protein (nr) database. Proteins were classified as "abundant" and "non-abundant" according to (2), and they were assigned into groups based mainly on (2), followed by references (4,38–40). Proteins classified here as "miscellaneous" were classified in primary sources, variably, as "miscellaneous proteins", "miscellaneous splicing factors", "additional proteins", "proteins not reproducibly detected" and "proteins not previously detected". We disclaim any responsibility for the factual accuracy of the association of proteins with the relevant groups beyond the point of following the primary sources.

Nucleic Acids Research, 2012 3

Sequence searches, alignments and clustering

Searches of protein homologs in the NCBI Protein (nr) database were carried out at the NCBI using BLASTP/PSI-BLAST (43) with default parameter settings. Putative homology was validated by reciprocal BLASTP searches against the Protein database with human (NCBI taxon id: 9606) as a taxon search delimiter. Sequence alignments were calculated with the MAFFT server using the Auto strategy (http://mafft.cbrc.jp/alignment/server/) (44). Clustering analysis of helicase sequences was performed with CLANS (45).

Identification and description of structural regions of proteins

Identification of intrinsically ordered and disordered regions of proteins, prediction of protein secondary structure and domain boundaries, as well as fold-recognition (FR) analyses, were carried out via the GeneSilico MetaServer gateway (for references to the original methods, see https://genesilico.pl/meta2) (46). In non-trivial cases (usually when putative modeling templates returned by FR scored low and/or various methods disagreed on the best template), FR alignments to the top-scoring templates from the PDB were compared, evaluated and ranked by the PCONS server (47), and the PCONS result was used to identify region boundaries. Additional searches were performed on the HHPRED server (48). SCOP database (49) IDs used for the purpose of structural domain identification were either extracted from the Protein Data Bank or from the SCOP parseable files on the SCOP website (http://scop.mrc-lmb.cam.ac.uk/scop/parse/index.html) or assigned using the fastSCOP server (http://fastscop.life.nctu.edu.tw/) (50). PFAM domain names were assigned on the PFAM website (http://pfam.sanger.ac.uk/). SCOP v. 1.75 and PFAM v. 25.0 were used. Structural similarity was compared using the DALI server (51).

Assignment of models to structural regions of proteins

In assigning structural models to regions, we followed a four-step procedure (Figure 1).
Whenever a high-resolution experimental structural model (either X-ray or NMR

Figure 1. Rules for selecting and producing structural representations of protein regions. From left to right, structural representations decrease in the average confidence.

structure) was available, we assigned it to the corresponding sequence region. If a structural similarity to a protein of known structure was predicted for a given region by fold-recognition algorithms (see below for details), we constructed a model for this region by a comparative (template-based) modeling technique, using the detected experimental structures as templates. In the absence of confidently predicted templates, we used de novo folding methods for relatively small fragments likely to form globular domains. For the remaining regions (those without experimentally solved structures and for which the current modeling methodology cannot provide confident predictions of the 3D structure), we generated "pro forma" models, in which only the primary and (predicted) secondary structure was represented explicitly, while the tertiary arrangement was arbitrary. Pro forma models are not supposed to be reliable at the tertiary level and were constructed for the sake of further analyses (e.g. to initialize protein folding analyses that require some kind of a structural representation as an input).

For regions with multiple solved structures in the Protein Data Bank, the following criteria of preference were used: (i) structures of the region in complex with other proteins and/or nucleic acids (i.e. in a potentially active or functionally relevant state) were given priority over structures of the region in isolation, (ii) crystallographic structures were given priority over NMR structures, (iii) higher-resolution crystallographic structures were given priority over lower-resolution structures and (iv) more complete structures were given priority over less complete structures. The following experimental artifacts were removed from experimental structure files or corrected by standard modeling procedures: non-native sequences added to aid in the protein expression and structure determination process (e.g. affinity tags), non-standard amino acids (e.g. selenomethionine was replaced by methionine), and gaps in sequences (e.g. short disordered loop fragments were added). Single chains only were retained if the original PDB file contained multiple chains of the same protein.

Comparative models were constructed by default with MODELLER (52) based on templates identified in the fold-recognition process. Selected challenging models were constructed using the I-TASSER server (53). Selected models were also adjusted with ROSETTA 3.0/3.1 using the loop modeling mode (54). De novo models were produced with the ROSETTA 3.0/3.1 AbInitioRelax application and clustered with the Rosetta 3.0/3.1 Cluster Application, following the protocols set out in the ROSETTA User Guide for version 3.1 (http://www.rosettacommons.org/manual_guide) (54). De novo folding was attempted if the following conditions were fulfilled: the region was ≤125 residues in length, predicted to be completely ordered and predicted to contain secondary structure elements. These conditions correspond to the current practical limit of utility of this type of methods (55).

Artificial pro forma spatial representations of protein chains of unknown/uncertain structure or predicted to lack a stable structure were built with UCSF Chimera (v.1.4/1.5) using the Tools>Structure Editing>Build Structure command (56). Pro forma constructs reflect only the known primary and predicted


secondary structure of the corresponding regions, while their tertiary structure should be regarded as unassigned (and remains to be modeled in the future). Miscellaneous manipulations of structures and models of molecules during this stage were performed in UCSF Chimera (56) and Swiss-PdbViewer v. 4.0.1 (57).

Protein model quality assessment

Assessment of model quality was performed with MetaMQAPII [https://genesilico.pl/toolkit/unimod?method=MetaMQAPII, an updated version of a method described in (58)] and QMEAN [http://swissmodel.expasy.org/qmean/ (59)]. MetaMQAP predicts the deviation of the query model from the (unknown) native structure and expresses it as the predicted global root mean square deviation (RMSD) and the predicted global distance test total score (GDT_TS) (60). The lower the predicted RMSD and the higher the predicted GDT_TS score, the better the model. QMEAN first calculates an internal score, and then the QMEAN Z-score indicates by how many standard deviations the QMEAN score of the model differs from expected values for experimental structures that have a similar length to the model. High quality models are expected to have positive QMEAN Z-scores, and good models are expected to have a QMEAN Z-score above −2.0. Indicators of accuracy of individual residues were generated by MetaMQAPII and are supplied as B-factor values inside the model files available from the SpliProt3D database website (see below). They can be visualized with the UCSF Chimera command Render By Attribute > (attributes of residues: average B-factor) or with equivalent commands in other molecular visualization programs. Mean values and standard deviations of the QMEAN Z-scores for the six QMEAN contributing factors are provided with this publication (Supplementary Table S4) and the values for all models are provided with the model files.
Models of low quality are expected to have not only a strongly negative QMEAN Z-score, but also strongly negative Z-scores for most of the contributing terms. As MetaMQAPII is not capable of evaluating multimeric models, for models of protein complexes (11 X-ray models and 2 NMR models) only the quality of the longest chain was evaluated by MetaMQAPII.
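Because the per-residue accuracy indicators are stored in the B-factor column of the model files, they can also be read without a molecular viewer. The following is a minimal sketch, assuming the model files follow the standard fixed-column PDB ATOM record layout; the function name `per_residue_bfactor` is hypothetical and is not part of SpliProt3D or MetaMQAPII.

```python
from collections import defaultdict


def per_residue_bfactor(pdb_lines):
    """Average the B-factor column per residue over PDB ATOM records.

    Assumes the standard fixed-column PDB layout: chain ID in column 22,
    residue number in columns 23-26, B-factor in columns 61-66.
    Insertion codes are ignored in this sketch.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (chain, resseq) -> [total, count]
    for line in pdb_lines:
        # Skip non-ATOM records and lines too short to hold a B-factor.
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        chain = line[21]
        resseq = int(line[22:26])
        bfactor = float(line[60:66])
        entry = sums[(chain, resseq)]
        entry[0] += bfactor
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

A typical call would pass an open file handle, e.g. `per_residue_bfactor(open(path))`, where `path` points to a downloaded model file. How the stored values should be interpreted is defined by MetaMQAPII, not by this sketch.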

Website/database of models

Models and additional data, including alignments of representative sequences annotated with predictions of order/disorder, secondary structure, binding disorder, solvent accessibility and coiled coils, as well as annotations of sites of post-translational modification from UniProt (29), are available via the SpliProt3D web server at http://iimcb.genesilico.pl/spliprot3D. The entire archive of files available for download is approximately 250 MB in size.

Visualization of sequence alignments and molecular structures

Sequence alignments were visualized with Jalview v. 2.6.1 (61), while molecular structure graphics were produced with UCSF Chimera (56).

RESULTS AND DISCUSSION

Identification of structural domains of splicing proteins

Our main priorities in identifying structural domains of splicing proteins were to check and correct previously reported domain boundaries and to identify and characterize domains that were not available in UniProt and other databases. We focused on 252 proteins of the human spliceosome, including 244 proteins found in the results of proteomics analyses of the major human spliceosome and 8 proteins specific to the U11/U12 subunits of the minor spliceosome (see Materials and Methods section for references to protein sources and Supplementary Table S1 for protein GIs). We did not find any references to U4atac/U6atac-specific proteins either in literature or in the Gene Ontology (GO) database [http://geneontology.org (62)]. A total of 118 proteins were classified as abundant as in (2); other proteins were classified as non-abundant. Abundant proteins are suggested to be the most important for the correct action of the spliceosome (2).

Using a combination of protein fold-recognition and sequence conservation-based domain identification methods, we identified 465 ordered structural domains in the 252 proteins, including 80 domains in the snRNP proteins of the major human spliceosome (Table 1 and Supplementary Table S2). Ordered structural domains cover >80% of the ordered regions of the proteins, and ~50% of all residues in the splicing proteins. Correspondingly, close to a half of the human spliceosomal

Table 1. Statistics of structural domains detected in the human spliceosomal proteome

Feature | Major spliceosome snRNP | All proteins
Number of proteins | 45 | 252
Number of residues | 20 390 | 133 040
Number of ordered residues | 13 427 | 63 242
Number of ordered structural domains | 80 | 465
Number of suspected ordered structural domains | 7 | 25
Number of domains predicted to be disordered, but found to be ordered in experimentally determined structures | 3 | 9
Fraction of ordered residues covered by ordered structural domains (%) | 89.6 | 90.3
Fraction of total number of residues covered by ordered and disordered structural domains (%) | 61.0 | 43.4


proteome is predicted to be intrinsically disordered. The analysis of various structural and functional types of intrinsic disorder in the spliceosome brought about a quantity of data whose presentation is beyond the scope of this article and that has been consequently made the subject of an independent article (I.K. and J.M.B., submitted for publication).

Based on the predicted order/disorder boundaries and the presence/absence of predicted secondary structure elements, we also detected 25 regions that we termed "suspected" domains. This category included two groups of regions. The first group comprised domain-length (>40 residues) regions without a recognized fold that were the only ordered regions of otherwise highly intrinsically disordered proteins (≥70% of residues predicted to be disordered). The second group comprised regions present in proteins with low-to-middle intrinsic disorder content (<70% of residues predicted to be disordered) that contained other ordered structural domains. The suspected domains in these proteins were ordered regions that had clear order/disorder boundaries and contained predicted secondary structure elements, but lacked a PFAM domain assignment (30) and showed no clear relationship to any known folds according to protein fold-recognition analyses.

Ordered domains of splicing proteins classified in the SCOP (49) catalogue belong to classes a–e and g, with

Table 2. Statistics of ordered structural domains of the human spliceosome according to the SCOP classification

SCOP ID | Description | Number of domains
a | All α | 79
b | All β | 83
c | α and β (α/β) | 53
d | α and β (α+β) | 159
e | Multi-domain (α and β) | 1
g | Small | 49

an over-representation of class d, which contains superfamily d.58.7 [RNA-binding domain, RRM (RBD), which usually corresponds to PFAM domain PF00076, RRM_1; Table 2]. RRM is present in the 252 proteins in as many as 117 copies. This means that roughly every fourth to fifth domain in the spliceosomal proteome is an RRM. As RRM is a small domain that usually binds single-stranded RNA (63,64), this reflects the key character of protein–RNA interactions in the splicing process. Other common types of ordered protein regions found in the human spliceosomal proteome include other small RNA-binding domains, large α- and β-repeat-based protein-binding domains, small protein disorder-binding domains, ubiquitin-related domains and stable multidomain RNA helicase architectures (Table 3). Repeat-based domains are often found as building blocks of protein complexes, while some of the ubiquitin-related domains have been shown to be part of a putative ubiquitin-based system of controlling spliceosome assembly and dynamics (22,65).

In addition to ordered domains, we found nine regions with an expected independent function that were predicted to be disordered, but that were either found in experimental structures or could be confidently modeled due to strong sequence matches to known domains. We considered these nine regions to be putative disordered domains that undergo a transition to order upon entering a complex. We discuss the features of these domains in an independent article that focuses specifically on intrinsic disorder in the spliceosomal proteome (I.K. and J.M.B., submitted for publication). Here, we will only note that, in general, the identification of disordered structural domains is currently a non-trivial task in comparison with the identification of ordered structural domains, as fewer experimentally validated examples of disorder exist in databases and the properties of disorder make automated identification and propagation more difficult.


Table 3. Common types of ordered structural domains in the human spliceosomal proteome

Domain type | Example PFAM domains | Number of copies | Examples of proteins
Small RNA-binding domains | RRM_1 (a), PWI, KH_1, S1, KOW, dsrm, G-patch, Surp (b), SAP, zf-CCCH, zf-U1 (c), zf-met (c), zf-C2H2_jaz (c), zf-U11-48K, zf-CCHC, FYVE | ≥201 | U1-A, U1-70K, U1-C
Small protein disorder-binding domains | WW, FHA, FF, GYF, SMN, SH3_1 | ≥24 | FBP11, U5-52K (CD2BP2)
Repeat-based protein-binding domains | Arm, TPR/HAT, HEAT, LRR_4, WD40 repeats | ≥28 | U4/U6-60K (hPrp4), U5-102K (hPrp6), SF3b155, U2-A
Ubiquitin-related domains | Ubiquitin, U-box, zf-UBP, UCH, Rtf2, zf-C3HC4, ZZ, DWNN, RWD, JAB+PROCT | ≥19 | SF3a120, U4/U6.U5-65K, RNF113A
Heat shock-related | DnaJ, HSP70, HSP20, CS | ≥6 | CCAP1
Proline isomerase | Pro_isomerase | 8 | U4/U6-20K (PPIH)
Stable helicase architectures | DEAD+Helicase_C, DEAD+Helicase_C+HA2+OB_NTP_bind, (DEAD+Helicase_C+Sec63)×2, Upf1p-like | ≥19 | hPrp43 (DHX15), U5-200K (hBrr2), KIAA0560 (AQR)
Small domains that act as ligands | U1snRNP70_N, SF3b1, PRP4, SF3a60_bindingd | ≥6 | SF3b155, U4/U6-60K (hPrp4)
Sm/Lsm domains | LSM | 14 | Sm, Lsm proteins

(a) Some RRM domains bind peptide ligands (66).
(b) The Surp domain is predicted to bind RNA. However, in the only structure of a Surp domain in complex (PDB ID: 2DT7), the Surp domain binds a peptide ligand.
(c) Some zf-C2H2 domains mediate protein binding.


Non-redundant set of experimental and theoretical structural models

Following the identification of domains, we constructed a non-redundant set of experimental and theoretical structural models of regions in splicing proteins. As the utility and credibility of models, both experimental and theoretical, depends on their accuracy, we set some simple heuristic rules of preference to increase the chance that we chose the models with the best quality. We preferred experimental models over theoretical models, X-ray experimental models over NMR experimental models and comparative theoretical models over de novo theoretical models (Figure 1). The lowest tier in the hierarchy was "pro forma" constructs, in which only the primary and secondary structure were represented explicitly, while the tertiary arrangement was arbitrary.

As a result, we mapped 104 non-redundant experimental models to the sequences of the spliceosomal proteins, and created 255 comparative and 43 de novo models (Table 4 and Supplementary Table S3), as well as over 500 constructs. The 104 non-redundant experimental models include 23 models of (nucleo)protein complexes, of which 13 complexes have residues from more than one spliceosome-associated protein. While models of complexes tend to have lower accuracy than models of isolated chains, we considered them to be more informative about protein function than models of isolated chains. This was the only instance where we favored the availability of additional information over plain accuracy of the structure.

Over 90% of ordered regions of splicing proteins can be associated with experimental structural information or with comparative and de novo models (Figure 2).
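The preference rules described above can be illustrated with a short sort-key sketch. It is not the procedure used to build the data set, which involved manual curation; the class `Representation`, its fields and the function names are hypothetical, and the ordering simply encodes the stated priorities: experimental before comparative before de novo before pro forma, and, among experimental structures, complexes before isolated chains, X-ray before NMR, higher resolution first and more complete structures first.

```python
from dataclasses import dataclass
from typing import Optional

# Tier of each representation type (lower is preferred).
TIER = {"xray": 0, "nmr": 0, "comparative": 1, "de_novo": 2, "pro_forma": 3}
# Among experimental structures, X-ray is preferred over NMR.
METHOD = {"xray": 0, "nmr": 1}


@dataclass
class Representation:
    name: str                            # hypothetical label, e.g. a PDB ID
    kind: str                            # one of the keys of TIER
    in_complex: bool = False             # solved in complex with partners
    resolution: Optional[float] = None   # in angstroms, X-ray only
    coverage: float = 0.0                # fraction of the region covered


def preference_key(rep):
    """Smaller keys are preferred; fields are compared left to right."""
    resolution = rep.resolution if rep.resolution is not None else float("inf")
    return (
        TIER[rep.kind],           # experimental > comparative > de novo > pro forma
        not rep.in_complex,       # complexes before isolated chains
        METHOD.get(rep.kind, 0),  # X-ray before NMR
        resolution,               # lower value means higher resolution
        -rep.coverage,            # more complete structures first
    )


def pick_representation(candidates):
    """Return the preferred representation for one region."""
    return min(candidates, key=preference_key)
```

Placing the complex criterion before the method and resolution criteria mirrors the statement that functional information was favored over plain accuracy in this one instance.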

Table 4. Structural representations of regions of proteins of the human spliceosomal proteome

Feature | Major spliceosome snRNP | All proteins
Number of proteins | 45 | 252
Number of residues | 20 390 | 133 040
Number of ordered residues | 13 427 | 63 242
Number of non-redundant experimental models | 20 | 104
Number of non-redundant X-ray models | 11 | 43
Mean resolution of X-ray models (Å) | 2.20 | 2.08
Number of non-redundant NMR models | 9 | 61
Number of non-redundant theoretical models | 49 | 297
Number of non-redundant comparative models | 37 | 255
Number of non-redundant de novo models | 13 | 43
Total number of non-redundant representations | 139 | 803
Number of experimental models containing residues of more than one splicing protein (X-ray/NMR) | 9 (8/1) | 13 (11/2)
Total fraction of structural order covered (%) | 91.2 | 92.7
Total fraction of combined protein sequence covered (%) | 64.3 | 48.7

This coverage of ordered regions is similar for the proteins of the snRNP subunits of the major spliceosome and other proteins associated with the human spliceosome. Between different types of structural representations, experimentally determined structural models cover 20.6% of all ordered residues, the comparative models we generated cover 67.4% of all ordered residues, and the de novo models cover 4.8% of all ordered residues. Hence, our theoretical models cover three times the length of ordered protein sequence covered by experimental models.

X-ray crystallography is useful for the structure determination of large proteins (>30 kDa) and protein complexes, while NMR is well-suited for the structure determination of relatively small proteins. Not surprisingly, the ratio of the number of ordered residues in proteins from snRNP subunit structures solved by X-ray crystallography versus NMR is ~3:1 (15.7%:4.7%), while this ratio for all splicing proteins is ~1.77:1 (13.4%:7.2%). The main reason for this is that small domains are statistically more populous in the general set of splicing proteins compared to the snRNP subunits. Conversely, most structures of protein–protein complexes available for splicing proteins include regions from snRNP proteins. Since the resolution (and hence accuracy) of experimentally determined structures is typically inversely correlated with the molecule or complex size, X-ray models of snRNP proteins have on average a slightly worse resolution (mean 2.20 Å) than X-ray models of all spliceosomal proteins (mean 2.08 Å).

For predicted disordered regions, confident structural coverage is very low in comparison to ordered regions. Less than 2% of residues predicted to be disordered are covered by experimental models, and even together with our theoretical models, we could only cover 8.9% of all disordered residues. Moreover, most of the residues covered belong to linkers between ordered structural domains or short regions in protein termini. This low coverage of intrinsically disordered regions by structural models may in the future pose a considerable challenge in producing a comprehensive structural model of the spliceosome.

Assessment of model quality

For all models except pro forma constructs, we also independently evaluated their accuracy to determine how credible they were. To do this, we used two methods: MetaMQAPII (58) and QMEAN (59). Both of them provide a global score for the entire model (predicted RMSD for MetaMQAPII, QMEAN Z-score for QMEAN) as well as a local score for individual residues (in this analysis, only the MetaMQAPII score was used). Functionally relevant and evolutionarily conserved regions (e.g. binding interfaces) are typically predicted with a higher than average accuracy, in particular when comparative modeling is used. Consequently, even a model with a poor global score can be useful for functional considerations, if its functionally important parts are scored well and are likely to be accurate. Some readers may also be interested in scores that describe only the model's quality with respect to a particular feature (e.g. secondary structure). To help describe


Figure 2. Coverage of structural order and disorder with different types of structural models. The values displayed on the graph are the number of residues covered by a given type of structural model, followed by the percentage value.

different features of models, we recorded the mean values and standard deviations of QMEAN Z-scores for six QMEAN contributing factors. These values for all models are provided with the manuscript (Supplementary Table S4).

For comparison with theoretical models, we also predicted the global quality of experimentally determined structures (Supplementary Figure S1). As expected, both X-ray and NMR models we selected for our data set are highly scored by both MetaMQAPII and QMEAN, which is an indicator of the high accuracy of these structures (Table 5; for RMSD, the lower the score, the better the model; for the QMEAN Z-score, good models are scored higher). Mean QMEAN Z-scores for models of both types (−0.42 for X-ray and −0.08 for NMR) compare favorably to mean QMEAN Z-scores of models across the entire PDB (−0.58 and −1.19, respectively) (67). As X-ray models in our database were scored slightly better than NMR models, we used scores for X-ray models as a benchmark with

which to classify theoretical models into those likely to be globally accurate or unlikely to be globally accurate. The worst-scored X-ray models in our data set have a predicted RMSD of 4.5 Å (PDB ID 2ok3, resolution 2.0 Å) and a QMEAN Z-score of −1.99 (PDB ID 2qfj, resolution 2.10 Å). Consequently, we divided all non-X-ray models into four classes depending on passing one or both thresholds: predicted RMSD ≤4.5 Å and QMEAN Z-score ≥−2.0 (Figure 3).

The majority of both NMR and theoretical models belong to the most reliable class (i.e. scored not worse than the worst crystal structures in the data set). These models are expected to be generally correct, although their local accuracy may vary. Models scored well only by one method should be treated with more caution than models scored well by both methods. However, poor scoring by one method may also be due to the model being either very short or very long. Models that are scored poorly by MetaMQAPII, but are scored well according to the


Figure 3. Models of regions of human splicing proteins divided by quality. This bubble graph displays the numbers of models of different types that belong to different classes of quality. Mean length_comp is the mean length of a comparative model of a given quality class.

Table 5. Predicted quality of models of regions of human spliceosomal proteins

Feature | X-ray, mean (SD) | NMR, mean (SD) | Comparative, mean (SD) | De novo, mean (SD)
Number of models | 43 | 61 | 255 | 43
Predicted RMSD (MetaMQAPII) | 1.90 (0.84) | 3.85 (1.82) | 4.53 (1.96) | 4.02 (1.50)
Predicted GDT_TS (MetaMQAPII) | 78.56 (12.78) | 55.94 (19.45) | 47.28 (21.35) | 45.59 (15.85)
QMEAN total score | 0.805 (0.087) | 0.744 (0.110) | 0.585 (0.164) | 0.562 (0.132)
QMEAN Z-score | −0.42 (0.87) | −0.08 (0.86) | −1.30 (1.43) | −1.42 (1.33)

QMEAN Z-score are usually short, while models that are scored high by MetaMQAPII and low by QMEAN are usually long. The mean length of a model scored well by both methods is 220 residues, but the mean length of a model scored well only by QMEAN is 70 residues and the mean length of a model scored well only by MetaMQAPII is 362 residues. Therefore, we urge the reader to consider the length of the model before using models scored poorly by only one method.

Over 40 models are scored poorly by both MetaMQAPII and QMEAN. These models may have been built on remotely related templates or did not fold well when modeled de novo, and are expected to contain various errors. Based on our previous experience, we believe that some of these cases may represent new protein folds or interesting variations of known folds that present a considerable challenge for protein modeling methods. Hence, while we regard these models as unreliable, we propose the corresponding proteins or domains as attractive targets both for experimental protein structure determination, and for protein modeling with other advanced techniques.
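The four quality classes used in Figure 3 can be sketched as a simple two-threshold test. The sketch below assumes the thresholds derived from the worst-scored X-ray models in the data set (predicted RMSD of at most 4.5 Å and QMEAN Z-score of at least −2.0); the function name and the class labels are hypothetical and chosen only for illustration.

```python
RMSD_MAX = 4.5      # predicted RMSD threshold in angstroms (MetaMQAPII)
QMEAN_Z_MIN = -2.0  # QMEAN Z-score threshold


def quality_class(predicted_rmsd, qmean_z):
    """Assign a model to one of four classes by the two global scores."""
    rmsd_ok = predicted_rmsd <= RMSD_MAX
    z_ok = qmean_z >= QMEAN_Z_MIN
    if rmsd_ok and z_ok:
        return "both"             # scored well by both methods
    if rmsd_ok:
        return "MetaMQAPII only"  # often long models, per the text
    if z_ok:
        return "QMEAN only"       # often short models, per the text
    return "neither"              # regarded as unreliable
```

As noted in the text, a model that passes only one threshold should be read together with its length before it is used.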

Database

The entire non-redundant set of representations (including selected representative models determined by experimental methods, and all theoretical models built with computational methods) is available as an online database SpliProt3D at http://iimcb.genesilico.pl/SpliProt3D. The web server allows for browsing, selecting and downloading the models. Proteins are also associated with sequence alignments annotated with predictions of intrinsic order versus disorder, predictions of secondary structure, protein-binding disorder, solvent accessibility and coiled-coils, as well as the positions of post-translational modifications. The database will be curated and new entries will be added and obsolete ones archived following the progress in structure determination of new spliceosomal proteins and/or publication of new theoretical models with better predicted accuracy. We would like to encourage structural biologists working on structure determination or prediction for spliceosomal proteins to contact us to have their models included and referenced in our database.


Comparison of predictions with the experimentally determined SF3A structure

After submission of this article for review, a crystal structure of the yeast U2 snRNP SF3A sub-complex was published (68), giving us an opportunity to compare some of our predictions with the independently determined experimental structure. The structure of the yeast SF3A complex includes, in addition to several regions composed of individual secondary structure elements, three ordered domains for which an experimental structure had not been published before. One domain in the yeast protein Prp9 is >200 residues long (its counterpart in the human protein SF3a60 is situated roughly between residues 1–77, 129–244 and 310–372); it features a novel helical architecture. Originally, we made no tertiary structural predictions for this domain (i.e. our database contained only constructs), and it is highly unlikely that the structure of this domain could have been predicted accurately by a standard bioinformatics approach. Another domain in the yeast Prp9 is a zf-C2H2 zinc finger inserted into the long helical domain, whose counterpart in the human protein SF3a60 lacks the Zn-binding residues and is closely neighbored by another insertion, of a SAP domain. Despite these differences, in our original model of this domain (with a predicted RMSD of 8.8 Å and QMEAN Z-score of −1.93), we correctly predicted the fold and the position of nearly all residues in this zinc finger. We also correctly predicted the boundaries and the fold of an all-β domain in the human protein SF3a66, a counterpart of the yeast protein Prp11. The original comparative model of this domain had a predicted RMSD of 4.7 Å and a QMEAN Z-score of −0.92, with a medium reliability of the fold prediction. In practice, upon comparison, this translated to predicting the position of approximately a half of the residues in the domain correctly. This analysis demonstrates the utility of the predictions, and that even models with a predicted
relatively low accuracy can, in fact, exhibit correct folds, spatial shapes and locations of some of the functionally important residues. Given the availability of the new template, we generated new models for the human counterparts of the SF3A crystal structure, using the comparative approach. We also generated a new comparative model for a domain in the C-complex-related protein cactin (NY-REN-24/C19orf29, gi: 126723149) as this protein is predicted to have a domain with the same all-β fold as the SF3a66 domain. The new models have been deposited in the database, while the old models have been moved to the archive of the obsolete entries and are still available for analysis.

Ubiquitin-related domains are most common in the proteins of the late stages of splicing

Given the known role of ubiquitin in controlling spliceosome assembly and dynamics (21,22), and the fact that ubiquitin-related domains are one of the largest groups of domains in splicing proteins, we were interested in learning how these domains were distributed across the different groups of splicing proteins. We found 19 potential or known ubiquitin-related domains in 15 splicing-related proteins, including 12 abundant proteins of the major spliceosome and one protein of the U11/U12 di-snRNP subunit of the minor spliceosome (Table 6 and Figure 4). These domains cover most of the main classes of ubiquitin-related domains, including ubiquitin fold domains, RING zinc finger/U-box domains that may act as ubiquitin ligases, a ubiquitin-conjugating enzyme-like domain, a ubiquitin carboxyl-terminal hydrolase domain and the JAB1/MPN domain of protein U5-220K (hPrp8) described in (23). In several cases, such as that of the abundant C-complex-specific protein FLJ35382 (C1orf55) and the TREX complex protein THOC5, only similarity of a protein region to a known ubiquitin-related fold could be detected.

Downloaded from http://nar.oxfordjournals.org/ by guest on August 13, 2012

Table 6. Ubiquitin-related regions in the spliceosomal proteome

Type of domain | SCOP ID | PFAM ID | Protein | Protein region | Protein group
Ubiquitin | d.15.1 | Ubiquitin | SF3a120 (a) | 689-785 | U2 snRNP
Ubiquitin | d.15.1 | Ubiquitin | U11/U12-25K (C16orf33) | 41-132 | U11/U12 di-snRNP
Ubiquitin | d.15.1 | SAP18 | SAP18 (a) | 18-140 | EJC
Ubiquitin | d.15.1 | ubiquitin | UBL5 | 1-73 | B complex
Ubiquitin | d.15.1 | - | FLJ35382 (C1orf55) (a) | 7-74 | C complex
Ubiquitin | d.15.1 | XAP5 | XAP-5 (FAM50A) (a) | 197-283 | C complex
DWNN | d.15.2 | DWNN | RBQ-1 | 3-77 | Miscellaneous
RING zinc finger/U-box | g.44.1 | zf-UBP | U4/U6.U5-65K (USP39) (a) | 97-200 | U4/U6.U5 tri-snRNP
RING zinc finger/U-box | g.44.1 | U-box | hPRP19 (a) | 1-60 | hPrp19/CDC5L
RING zinc finger/U-box | g.44.1 | Rtf2 | Cyp-60 (a) | 36-94 | B-act complex
RING zinc finger/U-box | g.44.1 | Rtf2 | Cyp-60 (a) | 101-161 | B-act complex
RING zinc finger/U-box | g.44.1 | zf-C3HC4 | RNF113A (a) | 256-319 | B-act complex
RING zinc finger/U-box | g.44.1 | Rtf2 | NOSIP (a) | 33-79 | C complex
RING zinc finger/U-box | g.44.1 | Rtf2 | NOSIP (a) | 217-286 | C complex
RING zinc finger/U-box | g.44.1 | DUF572 (ZZ) | CCDC130 | 43-117 | C complex
RING zinc finger/U-box | g.44.1 | U-box | RBQ-1 | 258-312 | Miscellaneous
UCH | d.3.1 | UCH | U4/U6.U5-65K (USP39) (a) | 220-556 | U4/U6.U5 tri-snRNP
UBC-like (RWD) | d.20.1 | - | THOC5 | 468-640 | TREX
JAB1/MPN | c.97.3 | JAB+PROCT | U5-220K (Prp8) (a) | 2064-2335 | U5 snRNP

(a) Abundant protein.
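The distribution of the Table 6 regions across protein groups can be tallied directly. The following Python sketch is illustrative only: the (protein, group) pairs are transcribed from Table 6, whereas the names `TABLE6_REGIONS`, `regions_per_group` and `proteins_per_group` are hypothetical helpers and not part of the original analysis.

```python
from collections import Counter

# (protein, protein group) for each of the 19 regions listed in Table 6.
TABLE6_REGIONS = [
    ("SF3a120", "U2 snRNP"),
    ("U11/U12-25K", "U11/U12 di-snRNP"),
    ("SAP18", "EJC"),
    ("UBL5", "B complex"),
    ("FLJ35382", "C complex"),
    ("XAP-5", "C complex"),
    ("RBQ-1", "Miscellaneous"),
    ("U4/U6.U5-65K", "U4/U6.U5 tri-snRNP"),
    ("hPRP19", "hPrp19/CDC5L"),
    ("Cyp-60", "B-act complex"),
    ("Cyp-60", "B-act complex"),
    ("RNF113A", "B-act complex"),
    ("NOSIP", "C complex"),
    ("NOSIP", "C complex"),
    ("CCDC130", "C complex"),
    ("RBQ-1", "Miscellaneous"),
    ("U4/U6.U5-65K", "U4/U6.U5 tri-snRNP"),
    ("THOC5", "TREX"),
    ("U5-220K", "U5 snRNP"),
]


def regions_per_group(rows):
    """Count ubiquitin-related regions in each protein group."""
    return Counter(group for _, group in rows)


def proteins_per_group(rows):
    """Count distinct proteins carrying such regions in each protein group."""
    members = {}
    for protein, group in rows:
        members.setdefault(group, set()).add(protein)
    return {group: len(proteins) for group, proteins in members.items()}


region_counts = regions_per_group(TABLE6_REGIONS)
protein_counts = proteins_per_group(TABLE6_REGIONS)
for group, n_regions in region_counts.most_common():
    print(f"{group}: {n_regions} regions in {protein_counts[group]} proteins")
```

Counting regions and distinct proteins separately matters here, because several proteins (Cyp-60, NOSIP, RBQ-1, U4/U6.U5-65K) contribute two regions each.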


Figure 4. Ubiquitin-related structural regions of human splicing proteins. (A) Ubiquitin-fold region of protein FLJ35382 (C1orf55; residues 1-80). Predicted RMSD 3.5 Å, QMEAN Z-score 1.33. (B) RWD-like region of protein THOC5 (residues 458-641). Predicted RMSD 3.9 Å, QMEAN Z-score 1.85.

Ubiquitin-related domains are more abundant in proteins active in the late stages of splicing (B, B-act and C complexes). The ubiquitin-fold domain of protein SF3a120 is the only ubiquitin-related domain found in the U2 snRNP (its counterpart is found in the U11/U12 di-snRNP). On the other hand, as many as three proteins of the B/B-act complex (UBL5, Cyp-60 and RNF113A) and four proteins of the C complex (FLJ35382/C1orf55, XAP-5/FAM50A, NOSIP and CCDC130) contain ubiquitin-related domains, in addition to a domain in the U5 snRNP (the JAB1/MPN of U5-220K) and a protein in the U4/U6.U5 tri-snRNP (U4/U6.U5-65K). In summary, this distribution suggests that the late stages of splicing are probably under a stricter ubiquitin-based control than the early stages. This may be because the earlier stages of splicing, such as intron/exon definition, are more dependent on weak, disorder-based interactions, while the later catalytic stages require precise subunit rearrangements.

Zinc finger-like domains flanked by conserved intrinsically disordered regions in U2 snRNP SF3a120 and other splicing proteins

Our FR analysis detected that the human SF3A sub-complex contains, in addition to the zinc finger in protein SF3a60, another degenerate C2H2 (g.37.1)-type zinc finger in the middle conserved region of protein SF3a120 (conserved region: residues 217-530, PFAM domain PRP21_like_P; zinc finger: residues 407-435). In Saccharomyces cerevisiae, this zinc finger is absent entirely. However, in the majority of non-animal species, especially other fungi, amoeba and Apicomplexa, this zinc finger retains some of the cysteine and histidine zinc-binding residues (Figure 5A). The zinc finger remnant is surrounded on both sides by intrinsically unstructured regions that are in part predicted to form helical (potentially coiled-coil) structures. The short motifs lying on the distal ends of the disordered linkers are conserved. An additional coiled-coil region connects the N-terminal conserved motif with the previously described (69) second Surp module of SF3a120. Thus, the PRP21_like_P module consists of three motifs, the second of which is a zinc-finger remnant, connected by flexible linkers, with an N-terminal coiled coil that connects the N-terminal motif to the Surp region (Figure 5B). Structural modules of this type usually serve to simultaneously contact a binding partner of the protein in several locations. In the particular case of SF3a120, it has been suggested that both the U2 snRNA and an as yet unidentified splicing protein are potential partners (69).

Through a systematic search, we found several other examples of zinc finger and zinc finger-like domains embedded in conserved disordered regions in the spliceosomal proteome (Table 7). Alternatively, tandem zinc fingers can be separated, e.g. by predicted coiled-coil regions. The new zinc-finger domains we found usually belong to the zf-C2H2 (g.37.1) type, which can bind RNA and/or mediate protein-protein interactions. The pre-mRNA/mRNA-binding protein ARS2 contains a ZZ RING zinc finger, while the C complex protein NOSIP contains two RING zinc finger/U-box-like regions.

BLUF-like domain (DUF1115) of the U4/U6 di-snRNP protein 90K (hPrp3)

The C-terminal ordered domain of protein U4/U6-90K (hPrp3), which corresponds to PFAM domain DUF1115 (PFAM ID: PF06544; residues 540-683), was predicted in our analysis to have a ferredoxin-like fold. It is predicted to be related to the acylphosphatase/BLUF domain-like superfamily (SCOP ID: d.58.10). BLUF family domains have two additional helices in the C-terminus compared to acylphosphatase family domains. These helices are present in the DUF1115 domain, and so this domain is predicted to be a BLUF-like domain (Figure 6). This is an unusual assignment, because the BLUF domain is a FAD/FMN-binding blue light photoreceptor domain found primarily in bacteria. In Eukaryota, it is found almost exclusively in euglenids and Heterolobosea. On the other hand, DUF1115 is found exclusively in eukaryotes. However, very high scores of BLUF domain templates yielded by FR methods for the hPrp3 DUF1115 sequence suggest that this protein is definitely homologous to the BLUF family. Nevertheless, DUF1115 differs from BLUF domains in some key features. The conserved FAD/FMN-binding residues are not conserved in DUF1115, nor is a tryptophan residue whose position is altered depending on the excitation state of the photoreceptor (70) (Supplementary Figure S2). On the other hand, DUF1115 contains a disordered loop between the second α-helix and the fifth β-strand. The presence of this loop, though not its length, is conserved in DUF1115 domains. Moreover, a conserved tryptophan residue, W604 in hPrp3, is located next to the disordered loop. Based on biochemical data, the DUF1115 domain may be a region of interaction of hPrp3 with the U5 snRNP protein hPrp6 and/or the U4/U6.U5 tri-snRNP protein U4/U6.U5-110K (SART-1) (71). However, it is also possible


Table 7. Zinc-finger domains flanked by or embedded in predicted disordered regions

PFAM domain | Protein | Protein group | Region | SCOP superfamily ID | PFAM domain of template | SCOP description | Confidence | Region-superfamily similarity
PRP21_like_P | SF3a120 (a) | U2 snRNP SF3A | 406-435 | g.37.1 | zf-U11-48K | ββα zinc fingers | High | High
LUC7 | LUC7B1 | A complex | 30-74 | g.66.1 | zf-CCCH | CCCH zinc finger | High | High
LUC7 | LUC7B1 | A complex | 186-232 | g.37.1 | zf-C2H2_jaz | ββα zinc fingers | High | High
DUF572 | CCDC130 | C complex | 43-117 | g.44.1 | ZZ | RING/U-box | High | High
Rtf2 | NOSIP (a) | C complex | 33-79 | g.44.1 | RING | RING/U-box | High | High
Rtf2 | NOSIP (a) | C complex | 217-286 | g.44.1 | zf-C3HC4 | RING/U-box | High | High
Fra10Ac1 | Fra10Ac1 | C complex | 166-220 | d.325.1 | Ribosomal_L28 | L28p-like | Low (b) | Low
ARS2 | ASR2B (a) | pre-mRNA/mRNA-binding | 714-738 | g.37.1 | zf-C2H2 | ββα zinc fingers | High | High

(a) Abundant protein. (b) Alternative templates: FYVE, fn1.
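Whether a zinc-finger remnant, such as the degenerate C2H2 finger of SF3a120, still carries its zinc-binding positions can be checked column by column in an alignment. The sketch below is a minimal illustration under stated assumptions: it presumes a pre-computed alignment in which the four ligand columns (Cys, Cys, His, His) are already known, and all sequences shown are invented toy strings, not the sequences of Figure 5A. The names `LIGAND_COLUMNS` and `retained_ligands` are hypothetical.

```python
# Toy alignment (invented sequences; '-' marks an alignment gap).
# The reference follows the classical C-x2-C-x12-H-x3-H spacing.
REFERENCE = "YKCPECGKSFSRSDHLQRHQRTH"
DEGENERATE = "YKCPESGKSFSRSDHLQRHQRTN"  # second Cys and last His replaced
ABSENT = "-" * len(REFERENCE)           # finger missing entirely

# Alignment column (0-based) -> expected zinc-binding residue.
LIGAND_COLUMNS = {2: "C", 5: "C", 18: "H", 22: "H"}


def retained_ligands(aligned_seq, ligand_columns=LIGAND_COLUMNS):
    """Count ligand columns where the expected Cys/His is still present."""
    return sum(
        1
        for column, residue in ligand_columns.items()
        if column < len(aligned_seq) and aligned_seq[column] == residue
    )


for name, seq in [("reference", REFERENCE),
                  ("degenerate", DEGENERATE),
                  ("absent", ABSENT)]:
    print(f"{name}: {retained_ligands(seq)} of {len(LIGAND_COLUMNS)} ligands retained")
```

Requiring an exact residue match is a deliberate simplification; a real analysis might also accept Cys/His substitutions at a ligand position.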


Figure 6. BLUF-like region of protein U4/U6-90K (hPrp3) (domain DUF1115, residues 540-683). The position of the conserved residue W604 is displayed. Predicted RMSD 3.7 Å, QMEAN Z-score 3.06.

N-terminal PWI-like domains of the helicases hPrp22 (DHX8), hPrp2 (DHX16) and hBrr2 (U5-200K)
Figure 5. Architecture of the conserved middle region of protein SF3a120 (residues 217-530). (A) Alignment of the residues of a zinc-finger domain in the middle part of SF3a120 (residues 407-435). The 'g.37.1' annotation row displays residues predicted to form a part of a g.37.1 (zf-C2H2) zinc finger. The 'jnetpred SF3a120' annotation row displays predicted secondary structure elements of the human SF3a120 (ovals represent α-helices, while arrows represent β-strands). (B) Architecture of the middle region of SF3a120; disordered linkers denoted as 'IDR linker' (intrinsically disordered region linker). (C) Model of the middle region.

that this interaction proceeds through the disordered PRP3 domain of this protein (71). A possible alternative role for DUF1115 is suggested by the fact that, apart from proteins from the hPrp3 family, it is found only in a family of proteins containing the RWD domain. The RWD domain belongs to the ubiquitin conjugating enzyme superfamily (72). Hence, the hPrp3 DUF1115 may be a part of the spliceosomal ubiquitin-based system.

hPrp22 (DHX8) and hPrp2 (DHX16) are RNA helicases that function in the remodeling of the spliceosome (6). According to our predictions, these two helicases contain N-terminal ordered helical bundles with a PWI superfamily fold (SCOP superfamily a.188.1) and similarity to the PFAM PWI domain (Figures 7 and 8). PWI is a nucleic acid-binding domain first described in the splicing protein SRm160 (73,74). PWI is also found in the animal protein U4/U6-90K (hPrp3). The hPrp22 and hPrp2 PWI-like bundles (hPrp22: residues 1-92 or 1-120; hPrp2: residues 1-95) are not found in a search with the profile of the PFAM PWI domain, possibly because their eponymous PWI tripeptide motifs are degenerate. In hPrp22 and its homologs, only the third position of this motif is conserved: [x][x][IV], while in hPrp2 and its homologs, the second and third positions are usually conserved: [x][WFY][IV]. However, PFAM displays several putative hPrp2/hPrp22 homologs when queried for proteins that contain PWI domains. Furthermore, stable binding to


Figure 7. PWI-like regions of splicing helicases. (A) hPrp22 (DHX8; residues 1-120 shown, but domain may end at residue 92). Predicted RMSD 2.4 Å, QMEAN Z-score 2.76. (B) hPrp2 (DHX16; residues 1-95). Predicted RMSD 5.8 Å, QMEAN Z-score 2.19. (C) U5-200K (hBrr2; residues 259-338). Predicted RMSD 3.8 Å, QMEAN Z-score 0.79.

nucleic acids by PWI requires an adjacent basic-rich region (74). We found potential candidates for such ancillary regions both in hPrp22 and in hPrp2 (hPrp22: residues 93-116; hPrp2: residues 120-132). We also found a PWI-like helical bundle in the N-terminus of the human protein U5-200K (hBrr2; residues 258-338; Figure 7). This helical bundle is conserved across the majority of eukaryotes, and is found, for instance, in the S. cerevisiae Brr2. The PWI-like domain of U5-200K retains relatively well-conserved second and third positions of the tripeptide PWI motif: [x][WFY][ILV]. Notably, if correct, this prediction represents the first case in which a PWI-like domain is located in the middle of a protein. Usually, as is the case for SRm160, hPrp3, hPrp22 and hPrp2, a PWI domain is located either in the immediate N-terminus or in the immediate C-terminus of a protein. There are at least three candidate basic-rich regions in the vicinity of the U5-200K PWI-like domain (residues 254-259; 343-349; 373-386). Sequences of proteins from the hPrp22 (DHX8) and hPrp2 (DHX16) families are very similar, to the extent that we could not easily separate them in a clustering analysis (Supplementary Figure S3). The most important discriminant between the two families appears to be the presence of an S1 RNA-binding domain (PDB ID: 2eqs; DOI:10.2210/pdb2eqs/pdb, manuscript to be published) between the N-terminal PWI-like bundle and the C-terminal helicase domains. This domain is present in hPrp22 and its homologs, but not in hPrp2 and its homologs. This led us to the hypothesis that Prp2, with the PWI-like domain, was the ancestral protein, which then underwent the insertion of the S1 domain. Nevertheless, the PWI-like domains of hPrp22 and hPrp2 differ in several aspects. The first difference lies in the above-mentioned degree of degeneration of the tripeptide PWI motif, which is larger in hPrp22 and its homologs than in hPrp2 and its homologs.
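The degrees of degeneration of the PWI tripeptide described above can be expressed as nested patterns. The following Python sketch is illustrative only: the bracketed patterns are quoted from the text, with [x] (any residue) written as '.', while the labels, the function name `matching_patterns` and the example tripeptides are hypothetical and do not come from real sequences.

```python
import re

# Motif variants as given in the text, ordered from most to least strict.
PWI_PATTERNS = {
    "canonical PWI": re.compile(r"PWI"),
    "hPrp2-like [x][WFY][IV]": re.compile(r".[WFY][IV]"),
    "U5-200K-like [x][WFY][ILV]": re.compile(r".[WFY][ILV]"),
    "hPrp22-like [x][x][IV]": re.compile(r"..[IV]"),
}


def matching_patterns(tripeptide):
    """Return the labels of all motif variants that the tripeptide satisfies."""
    if len(tripeptide) != 3:
        raise ValueError("expected exactly three residues")
    return [
        label
        for label, pattern in PWI_PATTERNS.items()
        if pattern.fullmatch(tripeptide)
    ]


# Arbitrary example tripeptides (not taken from any alignment).
for tripeptide in ("PWI", "AFV", "AFL", "AKI", "AKL"):
    print(tripeptide, matching_patterns(tripeptide))
```

Because the patterns are nested, a canonical PWI tripeptide satisfies all four variants, whereas a tripeptide that only keeps the third position matches the hPrp22-like pattern alone.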
In an extreme case, the N-terminus of the Prp22 protein of S. cerevisiae and the related organism Eremothecium (Ashbya) gossypii is located inside the motif, which is therefore incomplete. The degeneration of the PWI motif may be offset by the heavy conservation of a [DE][FY] motif in the second helix of the bundle. The main reason for the conservation of the PWI motif in canonical PWI domains is that it stabilizes the structure of the PWI domain (74). It is possible that the conservation of the [DE][FY] motif is sufficient to guarantee the stabilization of the bundle in conjunction with the conservation of the third position of the PWI motif.

Second, there is also a possible difference in either the number or the arrangement of helices comprising the PWI domain. SCOP describes superfamily a.188.1 as a four-helix bundle. However, in the structure of the PWI domain from protein SRm160, the bundle is followed by an additional short α-helix orthogonal to the bundle (PDB ID: 1mp1) (74). The presence of this α-helix is also predicted for the hPrp3 PWI domain, although it is missing from the available experimental structure (PDB ID: 1x4q; DOI:10.2210/pdb1x4q/pdb, manuscript to be published). Similarly, secondary structure predictions for hPrp2 also indicated that this protein is likely to contain an additional α-helix. However, for hPrp22, predictions of domain boundaries are less decisive. The hPrp22 PWI-like domain is either predicted to be a four-helix bundle (in which case it is confined to residues 1-92), or to contain an additional α-helix, but separated from the bundle by an intrinsically disordered region (in which case the domain spans residues 1-120). In either case, the helix arrangement is predicted to be different from that in hPrp2. Of note, the U5-200K PWI-like domain is predicted to be a five-helix domain.

Third, the pattern of evolutionary conservation of the PWI-like domains is different in hPrp22 and hPrp2. Fewer putative and confirmed hPrp2 homologs from different species have the PWI-like domain than do hPrp22 homologs. For instance, the functional analog of hPrp2 in S. cerevisiae, Prp2, is considered to be its homolog, but lacks the PWI-like domain. The Prp22 combination of PWI+S1 appears to be retained, while the Prp2 PWI is missing, also in putative homologs in organisms such as kinetoplastids (Trypanosoma brucei, Leishmania major), some Apicomplexa (Plasmodium falciparum, Babesia bovis, but not Tetrahymena thermophila, which has both), Trichomonas vaginalis and Entamoeba histolytica. Altogether, the PWI-like domain of hPrp22 is more diverged from the canon, but more often retained, while the PWI-like domain of hPrp2 is less diverged from the canon, but more often completely lost. This result does not contradict the hypothesis that the Prp22 protein was formed by the insertion of the S1 domain into the ancestral Prp2. It rather suggests the possibility that some property of the degenerated PWI-like domain ensured its retention


Figure 8. The PWI domain and PWI-like regions in splicing helicases. In all alignments, the 'PWI' annotation row displays the residues of the PWI motif conserved in a given protein. The 'jnetpred (...)' annotation row displays secondary structure elements predicted in the relevant human proteins (ovals represent α-helices, while arrows represent β-strands). Vertical lines indicate hidden columns (inserted residues present in only one or two sequences in the alignment). (A) Alignment of a canonical PWI domain from protein SRm160. The 'PDB ID: 1mp1' annotation row displays the actual secondary structure elements found in the structure of the PWI domain of the human protein SRm160. (B) PWI-like region from protein hPrp22 (DHX8). The 'disorder' annotation row displays the position of a disordered region in the hPrp22 protein. (C) PWI-like region from protein hPrp2 (DHX16). (D) PWI-like region from protein U5-200K (hBrr2).


in evolution. An in-depth structural study of this region may elucidate the reason why. As hinted above, the U5-200K PWI-like domain is in many respects a canonical PWI-like domain similar to that of hPrp2: it retains two out of three of the positions of the tripeptide PWI motif, and is predicted to be a five-helix domain. However, U5-200K is in general highly conserved, and unlike in hPrp2, this conservation also applies to its PWI-like domain. The N-termini of S. cerevisiae Prp2 and Prp22 are dispensable for splicing (75,76), while the N-terminus of S. cerevisiae Brr2 was shown not to contact any of the proteins of the U4/U6.U5 tri-snRNP (71). Hence, the N-terminal PWI-like domains of hPrp2, hPrp22 and U5-200K are likely to have only a supporting role in splicing, one that is not revealed in the activity of the yeast proteins. We suggest that they may help in the correct positioning of the C-terminal helicase domains on the relevant snRNAs. Nevertheless, we could not find any data on the activity of the N-termini of hPrp2, hPrp22 and U5-200K. Furthermore, no experimental model of a PWI domain bound to RNA exists, to which we could compare the mode of binding of the hPrp2, hPrp22 and U5-200K PWI-like domains. Hence, as far as this publication is concerned, the question of what is bound to the PWI-like domains of the splicing helicases remains open.

An N-terminal domain of the hPrp8 protein (U5-220K)

We could not confirm a published prediction of a bromo-domain encompassing hPrp8 residues 127-242 (a part of the N-terminal PFAM domain PRO8NT), originally made for yeast Prp8 residues 200-315 (77). In our view, the bromo-domain assignment does not correspond to a consistent evolutionary conservation pattern. It encompasses 20 residues universally conserved in Prp8 homologs from all known species and nearly 100 residues conserved only in some eukaryotic Prp8 homologs. On the other hand, we were able to construct a de novo model for the most conserved part (residues 86-150) of the PRO8NT domain (Supplementary Figure S4). Quality evaluation indicates that the model of the putative Prp8 bromo-domain described in (77) has low predicted accuracy (predicted RMSD 8.7 Å, QMEAN Z-score 4.25) compared to our de novo model of residues 86-150 (predicted RMSD of 2.4 Å, QMEAN Z-score 1.93). Altogether, although we cannot exclude the possibility that PRO8NT encases a bromo-domain, we suggest that further studies (ideally: experimental structure determination) will be required to provide a confident structural model of this region.

Other previously uncharacterized structural regions of abundant splicing proteins

We found several other new types of structured regions in abundant splicing proteins that we were able to assign to known folds and/or are similar to existing structures, with varying degree of confidence (Table 7). For instance, a region in the C-terminus of the hPrp19/CDC5L-related protein KIAA0560 (IBP160/Aquarius homolog; residues 453-1485) has a helicase architecture similar to the nonsense-mediated decay protein Upf1p (Figure 9). KIAA0560 is a 1485-residue-long protein, whose binding to pre-mRNA introns is necessary for the successful deposition of the exon junction complex on the pre-mRNA (78) and for successful release of box C/D snoRNAs (small nucleolar RNAs) from introns (14). Upf1p contains two RNA helicase domains (c.37.1), the first of which is interrupted by two insertions: an all-β and an all-α domain insertion (79). In KIAA0560, this first c.37.1 domain is interrupted three times: both of the original insertions are kept, but a third insertion, largely disordered, has appeared between them.

Another previously undescribed region lies in the C-terminus of the B complex protein TFIP11 (homolog of the yeast protein Spp382). The results of our FR analysis suggest that this region is a potential double-stranded RNA binding domain (dsRBD) (Figure 9). In other splicing proteins, such as the non-abundant A complex protein DHX9, dsRBD domains often occur in tandem, but the TFIP11 region does not have a partner. However, TFIP11 also contains another previously structurally uncharacterized region with a putative RNA-binding function, a G-patch domain. While the G-patch domain does not show sequence similarity to any other known domains, a highly scoring de novo model of this domain shows structural similarity to a dsRBD domain (Figure 9). In fact, in the non-abundant splicing-related protein SON, the G-patch domain occurs in tandem with a dsRBD domain partner. If the G-patch domain has a dsRBD-like fold, the TFIP11 G-patch domain could provide the functionality of a second tandem dsRBD-like domain for the previously undescribed putative dsRBD of TFIP11. We were also able to construct highly scored de novo models with a clear structural similarity to known folds for ordered helical regions located on the N-termini of proteins hnRNP R and Q.
No known structural domain is assigned to these regions, but our de novo models of these regions exhibit fairly high scores (predicted RMSD 1.3 Å, QMEAN Z-score 0.12 for the region in protein hnRNP R). Based on structural similarity scores yielded by the DALI server (51), these may be helix-turn-helix domains (Figure 9). Other new putative structural domains are described in Table 8.

Comparison of the human and Giardia lamblia spliceosomal proteome: setting priorities for spliceosome structure modeling

The human spliceosome, with its 119 abundant proteins, represents a fairly challenging target for both experimental and theoretical structural analyses. To round off our analysis, we wanted to put forth a candidate minimum set of structural regions in a functional spliceosome that, in our opinion, should be prioritized during the modeling of the structure of the complex. In general, eukaryotic species with fewer introns have fewer splicing proteins. The yeast Saccharomyces cerevisiae has homologs of only 61 of the human abundant splicing-related proteins (2). On the other hand, S. cerevisiae has


Figure 9. Other previously uncharacterized structural regions of the spliceosomal proteome. (A) The C-terminus of protein KIAA0560 (AQR), structurally similar to protein Upf1p (residues 453-1485). RMSD 3.3 Å, QMEAN Z-score 4.97. (B) Dsrm-like region of protein TFIP11 (residues 701-838). Predicted RMSD 4.5 Å, QMEAN Z-score 2.28. (C) The G-patch domain of LUCA15 (residues 741-815). Predicted RMSD 3.0 Å, QMEAN Z-score 1.22. (D) HTH-like region of protein hnRNP R (residues 23-92). Predicted RMSD 1.3 Å, QMEAN Z-score 0.12.

Table 8. New types of predicted structural regions in the human spliceosomal proteome that can be classified into known superfamilies

PFAM domain | Protein (a) | Protein group | Region | SCOP superfamily ID | PFAM domain of template | SCOP description | Confidence | Region-superfamily similarity
- | KIAA0560 (A) | hPrp19/CDC5L-related | 1-452 | a.118.1 | Arm repeats | ARM repeat | Medium | High
- | KIAA0560 (A) | hPrp19/CDC5L-related | 453-1348 | - | Upf1p | - | Medium | High
- | TFIP11 | B-complex | 771-837 | d.50.1 | dsrm | dsRNA-binding domain-like | Medium (b) | High
G-patch | LUCA15 (A) | A-complex | 741-815 | d.50.1 | dsrm | dsRNA-binding domain-like | Medium (c) | High
- | hnRNP R | hnRNP | 28-92 | a.4.14 | KorB (clan HTH) | KorB DNA-binding domain-like | Medium (d) | High
DUF2414 | ELG | pre-mRNA/mRNA-binding | 124-182 | d.58.7 | RNA_bind | RNA-binding domain, RBD | High | High
DUF1604 | Q9BRR8 | C-complex | 28-53 | b.34.2 | SH3_1 | SH3-domain | High | High
CTK3 | SR140 | U2 snRNP-related | 534-680 | a.118.9 | DUF618 | ENTH/VHS domain | High | High
Slu7 | hSlu7 (A) | step 2 factors | 424-457 | - | BTK motif | - | Low (e) | High
PRP38 | hPrp38 (A) | B-complex | 26-206 | a.96.1 | HhH-GPD | DNA-glycosylase | Low (f) | Medium
- | TRAP150 (A) | A-complex | 861-934 | - | Btz | - | High (g) | High
- | BCLAF1 | pre-mRNA/mRNA-binding | 827-899 | - | Btz | - | High (g) | High
DZF | NFAR | A-complex | 82-177 | d.218.1 | NTP_transf_2 | Nucleotidyl transferase | High (h) | High
DZF | NFAR | A-complex | 194-325 | a.160.1 | OAS1_C | PAP/OAS1 substrate-binding domain | High (h) | High

(a) Protein. (b) Highly scored alternative template TcpQ (bacterial). (c) De novo model, highly scored, structural similarity only (1DI2_B). (d) De novo model, highly scored, structural similarity only (1R71_A). (e) Short; BTK motif always found C-terminal to PH domains, which is not found in Slu7. (f) Alternative templates: HtH motifs. (g) Predicted disordered region. (h) DZF is a member of clan NTP_transf.


also some Saccharomycetes-specific splicing proteins, such as Prp24 (41), which do not appear in other fungi. In the search for a minimum set of regions to include in the model of a functional spliceosome, we turned to the extremely intron-scarce (80,81) parasitic organism G. lamblia, which is also known for its genome minimalism (82). This organism apparently underwent a reversed process with respect to the diversified and specialized human spliceosomal proteome, namely the loss of many genes encoding spliceosomal proteins. The genome of G. lamblia ATCC50803 encodes homologs of only 30 human abundant splicing proteins (Table 9). Two more proteins can be found in G. lamblia P15. However, not all of these homologs may be involved in splicing. For instance, G. lamblia ATCC50803 possesses orthologs of U4/U6-15.5K and EIF4A3. In humans, U4/U6-15.5K is a component of the U4/U6 di-snRNP, where it binds to U4/U6-61K (hPrp31) (83), while EIF4A3 is a protein of the EJC (33). U4/U6-61K and all EJC proteins save EIF4A3 are missing in G. lamblia. However, the human U4/U6-15.5K protein also participates in box C/D snoRNP formation (83), where it binds a different protein, which does have a G. lamblia homolog, and the human EIF4A3 is an isoform of the eukaryotic translation initiation factor 4A. It is therefore possible that their orthologs in G. lamblia perform only these splicing-unrelated functions.

There is a pattern to the presence and absence of abundant splicing-related proteins and/or their domains and disordered regions in the G. lamblia proteome. Almost all the proteins of the U2 snRNP are present in G. lamblia, as well as a homolog of U2AF35K, but only some core proteins of the U5 snRNP, such as Prp8 and Brr2. Snu114, which, according to the current understanding, is in other organisms the third part of the troika of U5 proteins essential to splicing (21), is an important absentee.
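The loss of individual regions can be read off by comparing the human and G. lamblia architecture strings of Table 9. The sketch below is a minimal illustration, assuming that architectures are written as '+'-separated element names as in Table 9; the function names `parse_architecture` and `lost_elements` are hypothetical, and matching elements by identical name is a simplification that ignores footnoted or reclassified domains.

```python
def parse_architecture(architecture):
    """Split a '+'-separated architecture string into element names."""
    return [part.strip() for part in architecture.split("+") if part.strip()]


def lost_elements(human_architecture, giardia_architecture):
    """List human elements that have no identically named G. lamblia element.

    Repeated elements are matched one-to-one, so two copies in the human
    protein need two copies in the G. lamblia protein to count as retained.
    """
    remaining = parse_architecture(giardia_architecture)
    lost = []
    for element in parse_architecture(human_architecture):
        if element in remaining:
            remaining.remove(element)
        else:
            lost.append(element)
    return lost


# Architecture pairs transcribed from Table 9: (human, G. lamblia).
TABLE9_EXAMPLES = {
    "SF3a66": ("PRP4+zf-met+b.15.1+poly-P disorder", "zf-met+b.15.1"),
    "SF3b145": ("SAP+poly-P disorder+RS-like disorder+DUF382+PSP", "DUF382+PSP"),
    "U2AF35": ("zf-CCCH+RRM_1+zf-CCCH+G-rich disorder", "zf-CCCH+RRM_1+zf-CCCH"),
    "PHF5A": ("PHF5", "PHF5"),
}

for protein, (human, giardia) in TABLE9_EXAMPLES.items():
    print(protein, "lost:", lost_elements(human, giardia))
```

In these four examples, the elements reported as lost are disordered regions and short ligand peptides such as PRP4, while the ordered RNA-binding and zinc-finger domains are retained.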
Many proteins of the U1 snRNP and U4/U6 di-snRNP proteome are missing, as are all proteins specific to the human U4/U6.U5 tri-snRNP. The set of Step 2 factors is reduced to three RNA helicases, and these helicases are reduced to C-terminal regions of their human counterparts, with a common architecture. The G. lamblia helicases are also impossible to assign unambiguously to their human or yeast counterparts. Clustering analysis of helicase sequences from different organisms places the G. lamblia helicases away from any major cluster (Supplementary Figure S3). Finally, G. lamblia has very few homologs of human proteins of the auxiliary complexes, and only two non-snRNP stage-specific proteins (PRP38 and RNF113A) are present in this organism.

The snRNP protein homologs present in the G. lamblia proteome are shorter than their human counterparts. Three main types of structural features that are common for human spliceosomal proteins are largely absent from the G. lamblia spliceosomal proteome: (i) intrinsically disordered proteins or disordered regions with possibly autonomous function (long protein disorder that does not form inter-domain linkers, including compositionally biased disorder and some regions of disorder with preformed structural elements); consequently, highly disordered proteins, such as the U4/U6.U5-specific proteins U4/U6.U5-110K and U4/U6.U5-27K; (ii) short peptide regions that act as ligand partners for other splicing proteins (PRP4, SF3a60_bindingd, SF3b1 and the ULM-containing region of protein SF3b155) and their partners (PRP4 partner: U4/U6-20K; SF3a60_bindingd partner: the second Surp domain of protein SF3a120, a protein that is missing entirely (see below); SF3b1 partner: p14; SF3b155 ULM partner: U2AF65K); (iii) ubiquitin-related domains. This includes: the entire protein SF3a120 (which contains a ubiquitin domain in addition to the Surp domains); the U4/U6.U5-specific protein U4/U6.U5-65K, which contains the ubiquitin hydrolase domains zf-UBP and UCH; the zf-C3HC4 RING zinc finger of protein RNF113A. In contrast, the zf-CCCH zinc finger of RNF113A, which is a putative RNA-binding domain, is present.

In our analysis of intrinsic disorder in the human spliceosomal proteome (I.K. and J.M.B., submitted for publication), we discuss how disordered regions of splicing proteins are tied to functions of dynamics, assembly and regulation of the spliceosome. This is also the function of known ubiquitin-related regions. Hence, it appears that G. lamblia is missing most proteins and/or protein regions primarily responsible for splicing regulation and dynamics. On the other hand, G. lamblia retained pre-mRNA and snRNA-binding proteins and/or regions, as well as proteins that directly assist in splicing, such as the catalytic factor helicases. It also appears that this parasitic organism's ubiquitin-based system of splicing control is reduced, rather than entirely missing. The C-terminal Mov34/MPN/JAB1 domain present in Prp8 from human or yeast (SCOP superfamily c.97.3), which may be implicated in a ubiquitin-based system (65), is absent from the G. lamblia Prp8 (84), but the corresponding region in the latter protein is predicted by FR analysis to be a domain with a ubiquitin-like fold (SCOP superfamily d.15.1).

It is possible that, like yeast, G. lamblia evolved its own specialized splicing proteins, which would not be detected in sequence similarity searches done with proteins from other organisms. Since G. lamblia is a parasite, it is also possible that it supplements some of its missing proteins (such as Snu114) from the host. Finally, it is also possible that some information was missed by our bioinformatics analysis but may be uncovered by an in-depth experimental analysis. With the caveat of possible gaps in the data (such as, possibly, Snu114), it is not single proteins that are missing, reduced or degenerated, but entire systems. The cropped set of proteins remaining in our G. lamblia spliceosomal proteome data set corresponds to a system much less dynamic than the human spliceosome, less precisely regulated and less able to adapt to variable conditions. However, such a spliceosome may still be functional. Hence, we propose that from a practical standpoint, the set of structural regions with homologs in G. lamblia is a good starting point for the higher order

Downloaded from http://nar.oxfordjournals.org/ by guest on August 13, 2012

Nucleic Acids Research, 2012 17

Table 9. Human spliceosomal proteins with potential G. lamblia homologs, and these potential homologs

| Protein group | Human protein | GI of G. lamblia homolog | Human protein architecture | G. lamblia protein architecture |
|---|---|---|---|---|
| Sm | Sm-B/B' | 159117899 | LSM+G-rich disorder+poly-P disorder | LSM |
| Sm | Sm-D1 | 159116502 | LSM+G-rich disorder | LSM |
| Sm | Sm-D2 | 159111944 | LSM | LSM |
| Sm | Sm-D3 | 159107430 | LSM+G-rich disorder | LSM |
| Sm | Sm-E | 159110758 | LSM | LSM |
| Sm | Sm-F | 159114826 | LSM | LSM |
| Lsm | Lsm2 | 159109501 | LSM | LSM |
| Lsm | Lsm3 | 159118879 | LSM | LSM |
| Lsm | Lsm4 | 159110729 | LSM+G-rich disorder | LSM |
| U1 snRNP/U2 snRNP | U1-A/U2-B'' | 253745584 | (RRM_1)×2 | RRM_1 |
| U1 snRNP | U1-C | 308158556 | zf-U1+poly-P disorder | zf-U1 [a] |
| U2 snRNP | U2-A' | 159115402 | (LRR_4)×2 | (LRR_4)×2 |
| U2 snRNP | SF3a66 | 159112716 | PRP4+zf-met+b.15.1+poly-P disorder | zf-met+b.15.1 |
| U2 snRNP | SF3a60 | 159115731 | SF3a60_bindingd+SAP+g.37.1+g.37.1 | zf-met (g.37.1)+g.37.1 [b] |
| U2 snRNP | SF3b155 | 253747536 | ULM+SF3b1+a.118.1 (HEAT) repeats | a.118.1 repeats [c] |
| U2 snRNP | SF3b145 | 159118535 | SAP+poly-P disorder+RS-like disorder+DUF382+PSP | DUF382+PSP |
| U2 snRNP | SF3b130 | 308162520 | WD40 repeats+CPSF_A | CPSF_A [d] |
| U2 snRNP | SF3b49 | 159117358 | (RRM_1)×2+poly-P disorder | (RRM_1)×2 |
| U2 snRNP | PHF5A | 159114698 | PHF5 | PHF5 |
| U2 snRNP-related | U2AF35 | 159112951 | zf-CCCH+RRM_1+zf-CCCH+G-rich disorder | zf-CCCH+RRM_1+zf-CCCH |
| U4/U6 di-snRNP | NHP2L1 | 159112698 | Ribosomal_L7Ae | Ribosomal_L7Ae [e] |
| U4/U6 di-snRNP | NHP2L1 | 159111753 | Ribosomal_L7Ae | Ribosomal_L7Ae [e] |
| U5 snRNP | U5-15K | 159116909 | DIM1 | DIM1 |
| U5 snRNP | U5-200K | 159109491 | a.188.1+(DEAD+Helicase_C+Sec63)×2 | DEAD+Helicase_C+Sec63 |
| U5 snRNP | U5-220K | 159109144 | PRO8NT+PROCN+RRM_4+U5_2-snRNA_bdg+U6-snRNA_bdg+PRP8_domainIV+c.97.3 (JAB+PROCT) | PRO8NT+PROCN+RRM_4+U5_2-snRNA_bdg+U6-snRNA_bdg+PRP8_domainIV+d.15.3 [f] |
| U2 snRNP-related | hPrp43 (DHX15) [g] | see list below | RS-like disorder+DEAD+Helicase_C+HA2+OB_NTP_bind | see list below |
| B-act complex | hPrp2 (DHX16) [g] | see list below | a.188.1+DEAD+Helicase_C+HA2+OB_NTP_bind | see list below |
| step 2 factors | hPrp22 (DHX8) [g] | see list below | a.188.1+RS-like disorder+S1+DEAD+Helicase_C+HA2+OB_NTP_bind | see list below |
| step 2 factors | hPrp16 (DHX38) [g] | see list below | RS-like disorder+DEAD+Helicase_C+HA2+OB_NTP_bind | see list below |
| B complex | hPrp38A | see list below | PRP38+RS-like disorder | PRP38 |
| B-act complex | RNF113A | see list below | zf-CCCH+zf-C3HC4 | zf-CCCH |
| hPrp19/CDC5L | CCAP2 | see list below | Cwf_Cwc_15 | |
| EJC | EIF4A3 | see list below | DEAD+Helicase_C | DEAD+Helicase_C [i] |

G. lamblia architectures for the helicase rows, which are not assigned to individual human helicases [g]: ATP11+DEAD+Helicase_C+HA2 [g,h]; DEAD+Helicase_C+HA2+OB_NTP_bind [g]; DEAD+Helicase_C+HA2 [g,h].

GIs of G. lamblia homologs for the last eight rows, in the order listed in the table column: 159108899, 159113861, 159117264, 159116389, 159114937, 159115167, 159117719.

Only abundant human splicing proteins with homologs in G. lamblia are shown. Predicted disordered regions with an independent function are included in italics. Ordered structural regions are usually described with their PFAM domains; SCOP IDs are used if the structural region does not correspond to a PFAM domain.
[a] Only in G. lamblia P15.
[b] SAP domain insertion is limited to animals and plants.
[c] Similarity to human SF3b155 only in C-terminal region (human SF3b155: 998–1304).
[d] Only in G. lamblia P15; WD40 repeat-like domain may be found via FR.
[e] May not participate in splicing (other possible human homologs: ribosomal protein L7, 15.5K).
[f] Ubiquitin-like fold (d.15) found in protein instead of c.97.3 domain.
[g] The human splicing helicases hPrp43, hPrp2, hPrp22 and hPrp16 and potential G. lamblia homologs cannot be unequivocally assigned to one another.
[h] OB_NTP_bind found via FR.
[i] May not participate in splicing (other possible human homolog: initiation factor EIF4A).
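The comparison that Table 9 invites, namely which regions of a human protein have no counterpart in its potential G. lamblia homolog, can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `parse_architecture` and `lost_in_giardia` are hypothetical, the architecture strings are copied from the table, and regions are matched purely by identical label, which ignores repeat counts and any homology that goes by a different name.

```python
def parse_architecture(architecture: str) -> list[str]:
    """Split a '+'-joined architecture string into its region labels."""
    return [part.strip() for part in architecture.split("+") if part.strip()]


def lost_in_giardia(human: str, giardia: str) -> list[str]:
    """Return human region labels with no identically named G. lamblia label.

    Assumption: label identity stands in for region correspondence.
    """
    giardia_labels = set(parse_architecture(giardia))
    return [label for label in parse_architecture(human)
            if label not in giardia_labels]


# (human architecture, G. lamblia architecture), copied from Table 9
examples = {
    "SF3a66": ("PRP4+zf-met+b.15.1+poly-P disorder", "zf-met+b.15.1"),
    "SF3b145": ("SAP+poly-P disorder+RS-like disorder+DUF382+PSP", "DUF382+PSP"),
    "RNF113A": ("zf-CCCH+zf-C3HC4", "zf-CCCH"),
}

for protein, (human, giardia) in examples.items():
    print(protein, lost_in_giardia(human, giardia))
```

For these three rows the regions without a G. lamblia counterpart are disordered regions, a peptide ligand region (PRP4), a SAP insertion and a RING finger, which is in line with the pattern described in the text.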

structural modeling of the spliceosome, and also constitutes an attractive list of targets for experimental structural determination.

CONCLUSIONS AND FUTURE PROSPECTS

This work has been intended to review the existing structural information about human spliceosomal proteins and to fill in gaps, providing a framework of reference for future structural analyses of the spliceosome. We used protein structure prediction methods to identify ordered spliceosomal protein structural elements either not characterized at all on the structural level or characterized insufficiently, and thus underreported in databases and literature. Examples of such un-/under-characterized elements include the zinc-finger domain in protein SF3a120 of the U2 snRNP, PWI-like domains in the essential splicing helicases hPrp22 (DHX8), hPrp2 (DHX16) and the U5 snRNP protein hBrr2 (U5-200K), and several ubiquitin-related regions in abundant splicing proteins. In the latter case, by combining database data with our results, we determined that ubiquitin processing-related domains are especially common in non-snRNP splicing factors active in the later stages of the splicing reaction.

Having completed the characterization of ordered domains of splicing proteins, we constructed a minimum non-redundant set of experimental structural representations of the proteins of the human spliceosome and modeled most of the (potentially) ordered structural elements without experimental structural models. Confident high-resolution structural models can be assigned to over 90% of structural order in the spliceosome proteins, which corresponds to about 50% of all amino acid residues.

We analyzed the spliceosomal proteome of the intron-poor organism G. lamblia to determine a candidate minimum set of structural elements present in a functional spliceosome. We found that the G. lamblia spliceosome does not contain the majority of disordered regions found in the human splicing proteome, and has retained only a vestigial ubiquitin-based system of control. Overall, the G. lamblia spliceosome appears to be much simpler than the human or the yeast one, in accordance with this organism's overall genomic minimalism and its genome's intron-poorness.

The results of our analysis of the structural domains in proteins of the human spliceosome may be used to guide experimental characterization of these regions. The characterization of the reduced G. lamblia spliceosome may help set priorities in selecting the structural regions for experimental structural determination, and those to be included in a first draft of a model of a functional spliceosome. We suggest that in the event of modeling the structure of a functional spliceosome, the ordered protein regions found in G. lamblia proteins should take priority. Finally, as long as the corresponding structural information is absent, the models we constructed may be used in further structural studies, for instance in modeling the structure of the entire spliceosome. Models of noncore proteins can be used to broaden our understanding of alternative splicing. Our models, domain characterizations and suggested priorities thus form a framework of reference for future structural studies of the spliceosome, and in particular, for the modeling of the structure of the functional spliceosome.

Following the (near) completion of the parts list of the spliceosome, we are also advancing our understanding of the structure of these parts. This work provides working structural models for a majority of the parts that appear to be ordered regardless of their functional state. While experimental determination of high-resolution structures for all of these elements would be desirable, theoretical models can be used to design experiments or perform calculations/simulations that require protein structure as a basis. The next step in the structural analysis of the spliceosome would be to use integrative modeling techniques to generate three-dimensional pictures of the splicing machinery, in analogy to the previous work on the nuclear pore complex (85,86). The even greater challenge ahead will be to model the dynamics of the splicing cycle, for which an even greater union of experimental and theoretical techniques will be required.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–4 and Supplementary Figures 1–4.

ACKNOWLEDGEMENTS

We thank Lukasz Kozlowski, Albert Bogdanowicz, Marcin Pawlowski, Geoff Barton, Jim Procter and Pascal Benkert for help with their software. We also thank Reinhard Luhrmann, Elzbieta Purta, Lukasz Kozlowski, Joanna Kasprzak, and Anna Czerwoniec for critical reading of the article, useful comments and suggestions.

FUNDING

EU 6th Framework Programme Network of Excellence EURASNET [EU FP6 contract no LSHG-CT-2005-518238]. J.M.B. has been additionally supported by the 7th Framework Programme of the European Commission [EC FP7, grant HEALTHPROT, contract number 229676], by the European Research Council [ERC, StG grant RNA + P = 123D] and by the Ideas for Poland fellowship from the Foundation for Polish Science. Computing power has been provided in part by the Interdisciplinary Centre for Mathematical and Computational Modeling of the University of Warsaw [grant number G27-4]. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the article. Funding for open access charge: EC FP7 contract number 229676 (HEALTHPROT) and ERC (RNA + P = 123D).

Conflict of interest statement. None declared.

REFERENCES
1. Tarn,W.Y. and Steitz,J.A. (1996) A novel spliceosome containing U11, U12, and U5 snRNPs excises a minor class (AT-AC) intron in vitro. Cell, 84, 801–811.
2. Agafonov,D.E., Deckert,J., Wolf,E., Odenwalder,P., Bessonov,S., Will,C.L., Urlaub,H. and Luhrmann,R. (2011) Semi-quantitative proteomic analysis of the human spliceosome via a novel two-dimensional gel electrophoresis method. Mol. Cell Biol., 31, 2667–2682.
3. Zhou,Z., Licklider,L.J., Gygi,S.P. and Reed,R. (2002) Comprehensive proteomic analysis of the human spliceosome. Nature, 419, 182–185.
4. Jurica,M.S. and Moore,M.J. (2003) Pre-mRNA splicing: awash in a sea of proteins. Mol. Cell, 12, 5–14.
5. Luz Ambrosio,D., Lee,J.H., Panigrahi,A.K., Nguyen,T.N., Cicarelli,R.M. and Gunzl,A. (2009) Spliceosomal proteomics in Trypanosoma brucei reveal new RNA splicing factors. Eukaryot. Cell, 8, 990–1000.
6. Valadkhan,S. and Jaladat,Y. (2010) The spliceosomal proteome: at the heart of the largest cellular ribonucleoprotein machine. Proteomics, 10, 4128–4141.
7. Ren,L., McLean,J.R., Hazbun,T.R., Fields,S., Vander Kooi,C., Ohi,M.D. and Gould,K.L. (2011) Systematic two-hybrid and comparative proteomic analyses reveal novel yeast pre-mRNA splicing factors connected to Prp19. PLoS One, 6, e16719.
8. Bessonov,S., Anokhina,M., Krasauskas,A., Golas,M.M., Sander,B., Will,C.L., Urlaub,H., Stark,H. and Luhrmann,R. (2010) Characterization of purified human Bact spliceosomal complexes reveals compositional and morphological changes during spliceosome activation and first step catalysis. RNA, 16, 2384–2403.
9. Veretnik,S., Wills,C., Youkharibache,P., Valas,R.E. and Bourne,P.E. (2009) Sm/Lsm genes provide a glimpse into the early evolution of the spliceosome. PLoS Comput. Biol., 5, e1000315.
10. Kornblihtt,A.R., de la Mata,M., Fededa,J.P., Munoz,M.J. and Nogues,G. (2004) Multiple links between transcription and splicing. RNA, 10, 1489–1498.
11. Alexander,R. and Beggs,J.D. (2010) Cross-talk in transcription, splicing and chromatin: who makes the first call? Biochem. Soc. Trans., 38, 1251–1256.
12. Hsu,S.N. and Hertel,K.J. (2009) Spliceosomes walk the line: splicing errors and their impact on cellular function. RNA Biol., 6, 526–530.
13. Dreyfuss,G., Kim,V.N. and Kataoka,N. (2002) Messenger-RNA-binding proteins and the messages they carry. Nat. Rev. Mol. Cell Biol., 3, 195–205.
14. Hirose,T., Ideue,T., Nagai,M., Hagiwara,M., Shu,M.D. and Steitz,J.A. (2006) A spliceosomal intron binding protein, IBP160, links position-dependent assembly of intron-encoded box C/D snoRNP to pre-mRNA splicing. Mol. Cell, 23, 673–684.
15. Hogg,R., McGrail,J.C. and O'Keefe,R.T. (2010) The function of the NineTeen Complex (NTC) in regulating spliceosome conformations and fidelity during pre-mRNA splicing. Biochem. Soc. Trans., 38, 1110–1115.
16. Tange,T.O., Nott,A. and Moore,M.J. (2004) The ever-increasing complexities of the exon junction complex. Curr. Opin. Cell Biol., 16, 279–284.
17. Lewis,J.D. and Izaurralde,E. (1997) The role of the cap structure in RNA processing and nuclear export. Eur. J. Biochem., 247, 461–469.
18. Dziembowski,A., Ventura,A.P., Rutz,B., Caspary,F., Faux,C., Halgand,F., Laprevote,O. and Seraphin,B. (2004) Proteomic analysis identifies a new complex required for nuclear pre-mRNA retention and splicing. EMBO J., 23, 4847–4856.
19. Katahira,J. (2009) Regulation of nuclear export and cytoplasmic localization of mRNAs by NXF family proteins. Tanpakushitsu Kakusan Koso, 54, 2109–2113.
20. Zhang,N., Kaur,R., Lu,X., Shen,X., Li,L. and Legerski,R.J. (2005) The Pso4 mRNA splicing and DNA repair complex interacts with WRN for processing of DNA interstrand cross-links. J. Biol. Chem., 280, 40559–40567.
21. Wahl,M.C., Will,C.L. and Luhrmann,R. (2009) The spliceosome: design principles of a dynamic RNP machine. Cell, 136, 701–718.
22. Bellare,P., Small,E.C., Huang,X., Wohlschlegel,J.A., Staley,J.P. and Sontheimer,E.J. (2008) A role for ubiquitin in the spliceosome assembly pathway. Nat. Struct. Mol. Biol., 15, 444–451.
23. Pena,V., Liu,S., Bujnicki,J.M., Luhrmann,R. and Wahl,M.C. (2007) Structure of a multipartite protein-protein interaction domain in splicing factor prp8 and its link to retinitis pigmentosa. Mol. Cell, 25, 615–624.
24. Song,E.J., Werner,S.L., Neubauer,J., Stegmeier,F., Aspden,J., Rio,D., Harper,J.W., Elledge,S.J., Kirschner,M.W. and Rape,M. (2010) The Prp19 complex and the Usp4Sart3 deubiquitinating enzyme control reversible ubiquitination at the spliceosome. Genes Dev., 24, 1434–1447.
25. Mathew,R., Hartmuth,K., Mohlmann,S., Urlaub,H., Ficner,R. and Luhrmann,R. (2008) Phosphorylation of human PRP28 by SRPK2 is required for integration of the U4/U6-U5 tri-snRNP into the spliceosome. Nat. Struct. Mol. Biol., 15, 435–443.
26. Laskowski,R.A. and Thornton,J.M. (2008) Understanding the molecular machinery of genetics through 3D structures. Nat. Rev. Genet., 9, 141–151.
27. Stark,H. and Luhrmann,R. (2006) Cryo-electron microscopy of spliceosomal components. Annu. Rev. Biophys. Biomol. Struct., 35, 435–457.
28. Jurica,M.S. (2008) Detailed close-ups and the big picture of spliceosomes. Curr. Opin. Struct. Biol., 18, 315–320.
29. Magrane,M. and Consortium,U. (2011) UniProt Knowledgebase: a hub of integrated protein data. Database, 2011, bar009.
30. Finn,R.D., Mistry,J., Tate,J., Coggill,P., Heger,A., Pollington,J.E., Gavin,O.L., Gunasekaran,P., Ceric,G., Forslund,K. et al. (2010) The Pfam protein families database. Nucleic Acids Res., 38, D211–D222.
31. Pomeranz Krummel,D.A., Oubridge,C., Leung,A.K., Li,J. and Nagai,K. (2009) Crystal structure of human spliceosomal U1 snRNP at 5.5 Å resolution. Nature, 458, 475–480.
32. Leung,A.K., Nagai,K. and Li,J. (2011) Structure of the spliceosomal U4 snRNP core domain and its implication for snRNP biogenesis. Nature, 473, 536–539.
33. Bono,F., Ebert,J., Lorentzen,E. and Conti,E. (2006) The crystal structure of the exon junction complex reveals how it maintains a stable grip on mRNA. Cell, 126, 713–725.
34. Mazza,C., Segref,A., Mattaj,I.W. and Cusack,S. (2002) Large-scale induced fit recognition of an m(7)GpppG cap analogue by the human nuclear cap-binding complex. EMBO J., 21, 5548–5557.
35. Schellenberg,M.J., Edwards,R.A., Ritchie,D.B., Kent,O.A., Golas,M.M., Stark,H., Luhrmann,R., Glover,J.N. and MacMillan,A.M. (2006) Crystal structure of a core spliceosomal protein interface. Proc. Natl Acad. Sci. USA, 103, 1266–1271.
36. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
37. Makarov,E.M., Makarova,O.V., Urlaub,H., Gentzel,M., Will,C.L., Wilm,M. and Luhrmann,R. (2002) Small nuclear ribonucleoprotein remodeling during catalytic activation of the spliceosome. Science, 298, 2205–2208.
38. Behzadnia,N., Golas,M.M., Hartmuth,K., Sander,B., Kastner,B., Deckert,J., Dube,P., Will,C.L., Urlaub,H., Stark,H. et al. (2007) Composition and three-dimensional EM structure of double affinity-purified, human prespliceosomal A complexes. EMBO J., 26, 1737–1748.
39. Deckert,J., Hartmuth,K., Boehringer,D., Behzadnia,N., Will,C.L., Kastner,B., Stark,H., Urlaub,H. and Luhrmann,R. (2006) Protein composition and electron microscopy structure of affinity-purified human spliceosomal B complexes isolated under physiological conditions. Mol. Cell Biol., 26, 5528–5543.
40. Bessonov,S., Anokhina,M., Will,C.L., Urlaub,H. and Luhrmann,R. (2008) Isolation of an active step I spliceosome and composition of its RNP core. Nature, 452, 846–850.
41. Fabrizio,P., Dannenberg,J., Dube,P., Kastner,B., Stark,H., Urlaub,H. and Luhrmann,R. (2009) The evolutionarily conserved core design of the catalytic activation step of the yeast spliceosome. Mol. Cell, 36, 593–608.
42. Will,C.L., Schneider,C., Hossbach,M., Urlaub,H., Rauhut,R., Elbashir,S., Tuschl,T. and Luhrmann,R. (2004) The human 18S U11/U12 snRNP contains a set of novel proteins not found in the U2-dependent spliceosome. RNA, 10, 929–941.
43. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
44. Katoh,K., Kuma,K., Toh,H. and Miyata,T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518.
45. Frickey,T. and Lupas,A. (2004) CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics, 20, 3702–3704.
46. Kurowski,M.A. and Bujnicki,J.M. (2003) GeneSilico protein structure prediction meta-server. Nucleic Acids Res., 31, 3305–3307.
47. Lundstrom,J., Rychlewski,L., Bujnicki,J. and Elofsson,A. (2001) Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci., 10, 2354–2362.
48. Soding,J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics, 21, 951–960.
49. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
50. Tung,C.H. and Yang,J.M. (2007) fastSCOP: a fast web server for recognizing protein structural domains and SCOP superfamilies. Nucleic Acids Res., 35, W438–W443.
51. Holm,L. and Rosenstrom,P. (2010) Dali server: conservation mapping in 3D. Nucleic Acids Res., 38, W545–W549.
52. Sali,A., Potterton,L., Yuan,F., van Vlijmen,H. and Karplus,M. (1995) Evaluation of comparative protein modeling by MODELLER. Proteins, 23, 318–326.
53. Roy,A., Kucukural,A. and Zhang,Y. (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc., 5, 725–738.
54. Das,R. and Baker,D. (2008) Macromolecular modeling with Rosetta. Annu. Rev. Biochem., 77, 363–382.
55. Kaufmann,K.W., Lemmon,G.H., Deluca,S.L., Sheehan,J.H. and Meiler,J. (2010) Practically useful: what the Rosetta protein modeling suite can do for you. Biochemistry, 49, 2987–2998.
56. Pettersen,E.F., Goddard,T.D., Huang,C.C., Couch,G.S., Greenblatt,D.M., Meng,E.C. and Ferrin,T.E. (2004) UCSF Chimera - a visualization system for exploratory research and analysis. J. Comput. Chem., 25, 1605–1612.
57. Guex,N. and Peitsch,M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis, 18, 2714–2723.
58. Pawlowski,M., Gajda,M.J., Matlak,R. and Bujnicki,J.M. (2008) MetaMQAP: a meta-server for the quality assessment of protein models. BMC Bioinformatics, 9, 403.
59. Benkert,P., Kunzli,M. and Schwede,T. (2009) QMEAN server for protein model quality estimation. Nucleic Acids Res., 37, W510–W514.
60. Zemla,A., Venclovas,C., Moult,J. and Fidelis,K. (2001) Processing and evaluation of predictions in CASP4. Proteins, (Suppl 5), 13–21.
61. Waterhouse,A.M., Procter,J.B., Martin,D.M., Clamp,M. and Barton,G.J. (2009) Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics, 25, 1189–1191.
62. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.
63. Maris,C., Dominguez,C. and Allain,F.H. (2005) The RNA recognition motif, a plastic RNA-binding platform to regulate post-transcriptional gene expression. FEBS J., 272, 2118–2131.
64. Clery,A., Blatter,M. and Allain,F.H. (2008) RNA recognition motifs: boring? Not quite. Curr. Opin. Struct. Biol., 18, 290–298.
65. Bellare,P., Kutach,A.K., Rines,A.K., Guthrie,C. and Sontheimer,E.J. (2006) Ubiquitin binding by a variant Jab1/MPN domain in the essential pre-mRNA splicing factor Prp8p. RNA, 12, 292–302.
66. Kielkopf,C.L., Lucke,S. and Green,M.R. (2004) U2AF homology motifs: protein recognition in the RRM world. Genes Dev., 18, 1513–1526.
67. Benkert,P., Biasini,M. and Schwede,T. (2011) Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics, 27, 343–350.
68. Lin,P.C. and Xu,R.M. (2012) Structure and assembly of the SF3a splicing factor complex of U2 snRNP. EMBO J., 31, 1579–1590.
69. Kramer,A., Ferfoglia,F., Huang,C.J., Mulhaupt,F., Nesic,D. and Tanackovic,G. (2005) Structure-function analysis of the U2 snRNP-associated splicing factor SF3a. Biochem. Soc. Trans., 33, 439–442.
70. Yuan,H., Anderson,S., Masuda,S., Dragnea,V., Moffat,K. and Bauer,C. (2006) Crystal structures of the Synechocystis photoreceptor Slr1694 reveal distinct structural states related to signaling. Biochemistry, 45, 12687–12694.
71. Liu,S., Rauhut,R., Vornlocher,H.P. and Luhrmann,R. (2006) The network of protein-protein interactions within the human U4/U6.U5 tri-snRNP. RNA, 12, 1418–1430.
72. Andersen,K.M., Hofmann,K. and Hartmann-Petersen,R. (2005) Ubiquitin-binding proteins: similar, but different. Essays Biochem., 41, 49–67.
73. Blencowe,B.J. and Ouzounis,C.A. (1999) The PWI motif: a new protein domain in splicing factors. Trends Biochem. Sci., 24, 179–180.
74. Szymczyna,B.R., Bowman,J., McCracken,S., Pineda-Lucena,A., Lu,Y., Cox,B., Lambermon,M., Graveley,B.R., Arrowsmith,C.H. and Blencowe,B.J. (2003) Structure and function of the PWI motif: a novel nucleic acid-binding domain that facilitates pre-mRNA processing. Genes Dev., 17, 461–475.
75. Edwalds-Gilbert,G., Kim,D.H., Silverman,E. and Lin,R.J. (2004) Definition of a spliceosome interaction domain in yeast Prp2 ATPase. RNA, 10, 210–220.
76. Schneider,S. and Schwer,B. (2001) Functional domains of the yeast splicing factor Prp22p. J. Biol. Chem., 276, 21184–21191.
77. Dlakic,M. and Mushegian,A. (2011) Prp8, the pivotal protein of the spliceosomal catalytic center, evolved from a retroelement-encoded reverse transcriptase. RNA, 17, 799–808.
78. Ideue,T., Sasaki,Y.T., Hagiwara,M. and Hirose,T. (2007) Introns play an essential role in splicing-dependent formation of the exon junction complex. Genes Dev., 21, 1993–1998.
79. Chamieh,H., Ballut,L., Bonneau,F. and Le Hir,H. (2008) NMD factors UPF2 and UPF3 bridge UPF1 to the exon junction complex and stimulate its RNA helicase activity. Nat. Struct. Mol. Biol., 15, 85–93.
80. Roy,S.W. and Gilbert,W. (2006) The evolution of spliceosomal introns: patterns, puzzles and progress. Nat. Rev. Genet., 7, 211–221.
81. Nixon,J.E., Wang,A., Morrison,H.G., McArthur,A.G., Sogin,M.L., Loftus,B.J. and Samuelson,J. (2002) A spliceosomal intron in Giardia lamblia. Proc. Natl Acad. Sci. USA, 99, 3701–3705.
82. Morrison,H.G., McArthur,A.G., Gillin,F.D., Aley,S.B., Adam,R.D., Olsen,G.J., Best,A.A., Cande,W.Z., Chen,F., Cipriano,M.J. et al. (2007) Genomic minimalism in the early diverging intestinal parasite Giardia lamblia. Science, 317, 1921–1926.
83. Liu,S., Li,P., Dybkov,O., Nottrott,S., Hartmuth,K., Luhrmann,R., Carlomagno,T. and Wahl,M.C. (2007) Binding of the human Prp31 Nop domain to a composite RNA-protein platform in U4 snRNP. Science, 316, 115–120.
84. Grainger,R.J. and Beggs,J.D. (2005) Prp8 protein: at the heart of the spliceosome. RNA, 11, 533–557.
85. Alber,F., Dokudovskaya,S., Veenhoff,L.M., Zhang,W., Kipper,J., Devos,D., Suprapto,A., Karni-Schmidt,O., Williams,R., Chait,B.T. et al. (2007) Determining the architectures of macromolecular assemblies. Nature, 450, 683–694.
86. Alber,F., Dokudovskaya,S., Veenhoff,L.M., Zhang,W., Kipper,J., Devos,D., Suprapto,A., Karni-Schmidt,O., Williams,R., Chait,B.T. et al. (2007) The molecular architecture of the nuclear pore complex. Nature, 450, 695–701.

Intrinsic Disorder in the Human Spliceosomal Proteome


Iga Korneta1, Janusz M. Bujnicki1,2*
1 Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland, 2 Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland

Abstract
The spliceosome is a molecular machine that performs the excision of introns from eukaryotic pre-mRNAs. In human cells, this macromolecular complex comprises five RNAs and over one hundred proteins. In recent years, many spliceosomal proteins have been found to exhibit intrinsic disorder, that is, to lack a stable native three-dimensional structure in solution. Building on the previous body of proteomic, structural and functional data, we have carried out a systematic bioinformatics analysis of intrinsic disorder in the proteome of the human spliceosome. We discovered that almost half of the combined sequence of proteins abundant in the spliceosome is predicted to be intrinsically disordered, at least when the individual proteins are considered in isolation. The distribution of intrinsic order and disorder throughout the spliceosome is uneven, and is related to the various functions performed by the intrinsic disorder of the spliceosomal proteins in the complex. In particular, proteins involved in the secondary functions of the spliceosome, such as mRNA recognition, intron/exon definition and spliceosomal assembly and dynamics, are more disordered than proteins directly involved in assisting splicing catalysis. Conserved disordered regions in spliceosomal proteins are evolutionarily younger and less widespread than ordered domains of essential spliceosomal proteins at the core of the spliceosome, suggesting that disordered regions were added to a preexistent ordered functional core. Finally, the spliceosomal proteome contains a much higher amount of intrinsic disorder predicted to lack secondary structure than the proteome of the ribosome, another large RNP machine. This result agrees with the currently recognized different functions of proteins in these two complexes.
Citation: Korneta I, Bujnicki JM (2012) Intrinsic Disorder in the Human Spliceosomal Proteome. PLoS Comput Biol 8(8): e1002641. doi:10.1371/journal.pcbi.1002641
Editor: Lilia M. Iakoucheva, University of California San Diego, United States of America
Received December 29, 2011; Accepted June 16, 2012; Published August 9, 2012
Copyright: © 2012 Korneta, Bujnicki. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work has been supported by the EU 6th Framework Programme Network of Excellence EURASNET (EU FP6 contract no LSHG-CT-2005-518238). J.M.B. has been additionally supported by the 7th Framework Programme of the European Commission (EC FP7, grant HEALTHPROT, contract number 229676), by the European Research Council (ERC, StG grant RNA+P = 123D) and by the Ideas for Poland fellowship from the Foundation for Polish Science (FNP). Computing power has been provided in part by the Interdisciplinary Centre for Mathematical and Computational Modeling of the University of Warsaw [grant number G27-4]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: iamb@genesilico.pl

Introduction
In eukaryotic cells and certain viruses that infect them, the coding sequences (exons) of most protein-coding genes are interrupted by noncoding regions (introns). Following the transcription of an entire gene into a precursor messenger RNA (pre-mRNA), the introns are excised and the exons are spliced together to form a functional mRNA. The splicing reaction is catalyzed by a large macromolecular ribonucleoprotein (RNP) machine termed the spliceosome. The most common form of the spliceosome is composed primarily of five small nuclear RNA (snRNA) molecules (U1, U2, U4, U5 and U6) and 45 proteins, arranged into snRNP particles. Seven mutually related Sm proteins are common to all spliceosomal snRNPs apart from U6, which contains a set of related like-Sm (Lsm) proteins [1]. The Sm or Lsm proteins form a ring structure that acts as a platform to support the snRNA [2]. Apart from the Sm and Lsm heptamers, all other proteins in the human snRNP subunits are unique (review: [3]). In addition to the snRNP proteins, approximately 80 proteins are abundant in the human spliceosome and reported to be essential to the process of spliceosome-dependent splicing [4], while proteomics analyses [4–7] identify over 200 proteins in total. Non-snRNP splicing factors are divided into independent protein splicing factors and proteins that combine into multiprotein
PLoS Computational Biology | www.ploscompbiol.org 1

complexes auxiliary to the spliceosome: the hPrp19/CDC5L (NTC) complex, the exon-junction complex (EJC), the cap-binding complex (CBP), the retention-and-splicing complex (RES), and the transcription-export complex (TREX). Spliceosomal proteins are richly phosphorylated and also undergo other types of post-translational modifications (review: [8]). A rare class of introns (<1% of all introns in human) is excised by the so-called minor spliceosome [9]. This low-abundance spliceosome variant contains a U5 snRNP identical to the one from the major spliceosome and four snRNPs with the snRNAs U11, U12, U4atac, and U6atac, which are distinct from, but structurally and functionally analogous to, the U1, U2, U4, and U6 snRNAs, respectively. Some proteins specific to the minor spliceosome have been found [10]. The primary activity of the spliceosome, i.e. the excision of introns and ligation of exons, requires the correct working of several additional functionalities of the spliceosomal machinery: recognition of the 5′ and 3′ splice sites (intron/exon definition), mutual recognition of spliceosome subunits and correct spliceosome assembly, spliceosome remodeling and regulation (review: [11]). In the course of the splicing reaction, the snRNP subunits combine and detach from one another and from the pre-mRNA, forming in turn the so-called E (entry), A, B, B* (B-activated), and C complexes. For the major spliceosome, the U1 and the U2 snRNPs perform the initial scanning of the pre-mRNA for intron
August 2012 | Volume 8 | Issue 8 | e1002641

Disorder in the Spliceosomal Proteome

Author Summary
In eukaryotic cells, introns are spliced out of protein-coding mRNAs by a highly dynamic and extraordinarily plastic molecular machine called the spliceosome. In recent years, multiple regions of intrinsic structural disorder were found in spliceosomal proteins. Intrinsically disordered regions lack a stable native three-dimensional structure in solution, which makes them structurally flexible and/or able to switch between different conformations. Hence, intrinsically disordered regions are ideal candidates to account for the spliceosome's plasticity. Intrinsically disordered regions are also frequently the sites of post-translational modifications, which have likewise been shown to be important in spliceosome dynamics. In this article, we describe the results of a structural bioinformatics analysis focused on intrinsic disorder in the spliceosomal proteome. We systematically analyzed all known human spliceosomal proteins with regard to the presence and type of intrinsic disorder. Almost half of the combined sequence of these spliceosomal proteins is predicted to be intrinsically disordered, and the type of intrinsic disorder in a protein varies with its function and its location in the spliceosome. The parts of the spliceosome that act earlier in the process are more disordered, which corresponds to their role in establishing a network of interactions, while the parts that act later are more ordered.

in large complexes [20]. Among RNP complexes, the ribosome in particular illustrates an RNA-related structural function for disordered proteins. Many ribosomal proteins contain long disordered extensions attached to ordered globular bodies [21] that, upon the formation of the ribosome complex, become ordered and penetrate into the macromolecule core formed by the rRNA [22,23]. In other words, the long disordered extensions become the mortar of the macromolecule that fills in gaps in the rRNA and stabilizes it. The subject of intrinsic disorder of the spliceosome has not yet been systematically analyzed for the entirety of the spliceosomal proteome. As an essential step towards broadening our understanding of the functioning of the spliceosome, we have carried out a bioinformatics analysis of intrinsic disorder within the human spliceosomal proteome. We discovered that almost half of the residues within the human spliceosomal proteins are disordered, and that the distribution of intrinsic disorder is uneven across the spliceosome. The spliceosome is divided into three layers: a rigid inner core that performs the precise operations required to effect splicing catalysis, a middle layer of disorder that acquires structure in spliceosome-bound proteins, and a fluid outer layer of disordered regions that do not acquire structure and that are responsible for the establishment of a matrix of weak interactions in the initial stages of the splicing process.

Results/Discussion

The human spliceosome is highly disordered


Initially, we predicted the average intrinsic disorder content of 122 core proteins of the major human spliceosome, including all abundant proteins sensu Agafonov et al. [4] (Table S1). This prediction was carried out in two stages. The initial fully automated analysis, carried out via the GeneSilico MetaDisorder server [24], estimated the intrinsic protein disorder content in the 122 human spliceosomal proteins at 53.5%, and at 45.2% for 45 proteins of the snRNP subunits of the major spliceosome (each Sm protein counted once). Subsequently, we manually adjusted the predictions of order/disorder boundaries of IDRs based on structural predictions yielded by the GeneSilico MetaServer [25]. This manual correction shifted the disorder estimate downwards, in some cases by as much as 10%, to an intrinsic disorder content estimate of 44.0% for all the 122 proteins of the major spliceosome, and 34.1% for the snRNP proteins. Nevertheless, even after the correction, at least 98 out of the 122 core spliceosomal proteins (80.3%) were predicted to contain at least one IDR ≥30 residues. An intrinsic disorder content estimate of 44.0% is twice the average value for all human proteins as calculated on the basis of genome-based predictions, which is 21.6% [26]. The predicted fraction of 80.3% of proteins with at least one IDR ≥30 residues contrasts with the calculated fraction of 35.2% for the entire human proteome [26]. Although different methods of prediction of intrinsic disorder content differ in their estimates, altogether the human spliceosomal proteome contains a high amount of intrinsic disorder. This finding will have a significant impact on further studies involving spliceosomal proteins.
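The two quantities used in this section, the disorder content of a protein and the presence of at least one IDR of 30 or more residues, can be illustrated with a minimal sketch. It assumes that a prediction is available as a per-residue string in which "D" marks a disordered residue and "O" an ordered one. This encoding, the function names and the toy mask are assumptions made for illustration and are not the output format of the MetaDisorder server.

```python
from itertools import groupby


def disordered_regions(mask):
    """Return (start, end) pairs, 0-based and end-exclusive, for each run of 'D'."""
    regions = []
    position = 0
    for state, run in groupby(mask):
        length = len(list(run))
        if state == "D":
            regions.append((position, position + length))
        position += length
    return regions


def disorder_fraction(mask):
    """Fraction of residues marked as disordered."""
    return mask.count("D") / len(mask) if mask else 0.0


def has_long_idr(mask, min_length=30):
    """True if at least one disordered run is min_length residues or longer."""
    return any(end - start >= min_length for start, end in disordered_regions(mask))


# Hypothetical 100-residue protein: one 35-residue IDR and one 5-residue IDR.
toy_mask = "O" * 50 + "D" * 35 + "O" * 10 + "D" * 5
toy_regions = disordered_regions(toy_mask)
toy_fraction = disorder_fraction(toy_mask)
```

With such a mask per protein, the fraction of proteins with at least one long IDR is simply the share of masks for which `has_long_idr` returns true.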

sites, while the actual two-step splicing reaction occurs after the addition of a U4/U6.U5 tri-snRNP entity and the elimination of the U1 and U4 snRNPs from the complex, at the assembled interface of the pre-mRNA substrate and U2, U5, and U6 snRNAs (complex C). For the minor spliceosome, the U11/U12 di-snRNP performs the role of the U1 and U2 snRNPs, while the U4atac/U6atac di-snRNP performs the role of the U4/U6 di-snRNP (review: [12]). The early recognition and assembly of the splicing reaction (E/A complex formation) rely on the use of multiple weak binary interactions to ensure flexibility. On the other hand, later stages of the splicing reaction (B, B-act, C complexes) involve enzymatic catalysis [11]. Each of the stages of the splicing reaction has its own set of associated non-snRNP proteins [4]. Splicing has been associated with intrinsic protein disorder [13]. Intrinsically disordered regions (IDRs) lack stable, well-defined three-dimensional structure (review: [14]). IDRs frequently contain low-complexity regions and repeats, although they may also contain conserved linear motifs embedded in the less conserved regions (ELMs; [15]). IDRs are not necessarily completely unfolded. In particular, some IDRs may contain stable preformed secondary structure elements in isolation [16], while others may switch from disorder to order (i.e. exhibit "dual personality") depending on the environment, for instance upon binding to other proteins [17,18]. As they lack tertiary structure under many or all conditions, IDRs are more flexible and plastic than the rigid structures of globular domains. Disorder may increase the speed of intermolecular binding and unbinding and make interactions weaker [14]. As a result of these properties, IDRs are found in a variety of molecular functions, which include forming linkers between structured domains, being sites of post-translational modifications, and being sites of protein-protein and protein-RNA recognition [19].
The large interaction capacity of IDRs predisposes them to organizing the assembly of complexes; disorder is a characteristic feature of hub proteins that interact with many partners, and, notably for spliceosome research, disordered proteins are common
PLoS Computational Biology | www.ploscompbiol.org 2

Early human spliceosomal proteins are more disordered than late proteins
To determine whether there was any variation of disorder content throughout the complexes forming the spliceosome at different stages of the splicing reaction, we analyzed the fraction of predicted intrinsic disorder for different groups of proteins of the spliceosome complex. For this analysis, we divided the spliceosome proteins in our dataset into several groups based on proteomics

data, and also included eight proteins of the U11/U12 di-snRNP of the minor spliceosome (Table S1). As most of the U11/U12 proteins are structurally and functionally related to proteins of the U1 and U2 snRNPs [10], we expected that they would have a similar IDR content to the U1 and U2 snRNP subunit proteins. Different groups of spliceosome proteins differ in their predicted disorder content (Figure 1). In particular, proteins of the U1 snRNP, U2 SF3A, U11/U12 di-snRNP, U2-related and U4/U6.U5 tri-snRNP-specific proteins are predicted to be more disordered than average spliceosome proteins (>44.0% disorder content). Of these groups of proteins, all apart from the U4/U6.U5 tri-snRNP-specific proteins are "early" proteins associated with the early stages of splicing. On the other hand, U2 SF3B, U4/U6 di-snRNP, U5 snRNP, Sm and Lsm proteins are predicted to be more ordered than average (<44.0% disorder content). The Sm and Lsm proteins comprise scaffolds for snRNA, and especially proteins of the U4/U6 di-snRNP and U5 snRNP may be responsible for assisting in splicing catalysis. Among auxiliary protein complexes, the retention-and-splicing (RES) complex, whose function is the retention of unspliced pre-mRNAs in the nucleus [27], is predicted to be extremely disordered (80.6%), while the cap-binding complex (CBC) is more ordered than average (28.0%). Two other complexes, hPrp19/CDC5L and EJC, both of which have multiple functions, fall in between (40.5% and 53.6% disorder content, respectively). Finally, while all the groups of transiently binding non-snRNP spliceosomal proteins are predicted to be more disordered than average for all spliceosomal proteins, the early A-complex proteins are predicted to be the most disordered in this group, followed by B-complex proteins, B-act complex proteins, and C-complex proteins.

Early human spliceosomal proteins contain more compositionally biased disorder than late proteins
As no external standardized annotation scheme was available for IDRs in the spliceosomal proteins, we developed a classification based on their predicted primary and secondary structure features. We divided the spliceosomal IDRs into three classes: regions with consistently predicted secondary structure (SS) elements (henceforth "disorder with SS" or "IDR with SS"), long (≥25 residues) compositionally biased IDRs without predicted secondary structure elements (henceforth "compositionally biased disorder/IDR"), and other IDRs, which we omitted from further analyses (Figure S1). Several types of compositionally biased regions without predicted SS elements that frequently appear throughout the spliceosomal proteome had been previously described in the literature. For these compositionally biased IDR types, we sought to define relevant standard IDR subclasses within our classification (RS-like, poly-P/Q, G-rich; see Methods for details). Having annotated the IDRs, we analyzed the distribution of different types of disorder across different groups of human spliceosome proteins. Different groups of spliceosome proteins are predicted to differ in the type of disorder they contain (Figure 2, Figure S2). The heptameric complexes of Sm and Lsm proteins are predicted to contain mainly compositionally biased disorder without secondary structure elements (69.9% of all disorder). Correspondingly, crystal structures of the Sm complex lack most of the predicted disordered regions (example PDB ID: 2Y9A, [28]) and show a stable ungapped platform, which suggests that disorder in Sm and Lsm proteins is located outside of the ordered torus. Protein groups that are present earlier in the course of the splicing process and that are in general highly disordered (U1, U2 SF3A, U11/U12, U2-related, SR, hnRNP, A-complex proteins) are predicted to contain more disorder with predicted compositional bias and less disorder with SS than late proteins. Similarly to

2Y9A, the majority of predicted disorder of the U1 snRNP-specific proteins included in the crystal structure of the U1 snRNP (PDB ID: 3CW1; [29]) is missing from the crystal structure. Also similarly to 2Y9A, almost all compositionally biased disorder is missing from the structure, while almost all predicted disorder with SS is present. Notably, the EJC, whose post-splicing functions in exon ligation and mRNA transport involve mRNA binding, also exhibits a high content of compositionally biased disorder (62.9%). The RES complex also contains long regions of disorder with very little predicted secondary structure, but we could not unambiguously divide these regions into subregions with different compositional bias. Among different types of compositionally biased disorder, RS-like IDRs are found in all groups of early proteins, while poly-P/Q and miscellaneous noncharged IDRs are predicted to be concentrated mainly in the U1, U2, U11/U12 and U2-related proteins. Domain-length (≥100 residues) hnRNP-type G-rich regions are found only in hnRNP proteins, but short (<100 residues) hnRNP-like G-rich regions are found, in addition to SR and Sm proteins, in A-complex and U2-related proteins (Table S2). Based on the widespread distribution of compositionally biased IDRs in spliceosomal proteins, we speculate that interactions mediated by these IDRs may in fact be more common and important than suggested by the particular cases studied before. In particular, the role of glycine-rich regions in many spliceosomal proteins is unknown and requires further study. Based on the fact that RS-like and glycine-rich disordered regions frequently appear in the same proteins (e.g. SF2/ASF, TRAP150) and in proteins that interact with each other and/or interact with the same RNA (SR, hnRNP), we also suggest that these two types of regions may interact with each other directly. If so, RS-like and glycine-rich regions from other proteins may also interact with one another.
This interaction may be important for the regulation of splicing and definition of intron/exon boundaries, and, by extension, for the regulation of alternative splicing. In contrast to early proteins, proteins of the later stages of splicing are often predicted to contain high amounts of disorder with SS. These proteins include proteins of the U5 snRNP and U4/U6 di-snRNP, proteins specific to the U4/U6.U5 tri-snRNP entity, hPrp19/CDC5L, step 2 catalytic factors, as well as B, B-act and C-complex proteins. Most of these protein groups are also predicted to be relatively ordered. In particular, for the isolated proteins of the U5 snRNP, which is predicted to be the least disordered of all the snRNP subunits, over half of the disordered residues are predicted to be in IDRs with SS. We suggest that, in the case of proteins of larger complexes, disorder with SS may acquire structure as the individual proteins of the complex come together. If so, the U5 snRNP may be almost completely ordered when the proteins come together in the complex. For the highly disordered U4/U6.U5 tri-snRNP-specific proteins, high disorder content coupled with a high content of disorder with SS suggests a high potential for structure variability. We suggest that this potential is exercised upon the assembly and disassembly of the tri-snRNP. Among compositionally biased IDRs, only RS-like domains are commonly found in the late proteins. Between proteins of the U4/U6.U5 tri-snRNP, step 2 catalytic factors and the abundant B, B-act and C complex stage-specific proteins, we identified 12 RS-like IDRs, including a single RS-like IDR in the central part of the U4/U6 di-snRNP protein U4/U6-90K and the RS-like IDR on the N terminus of the U5 snRNP protein U5-100K [30].
The broad distribution of the RS-like IDRs leads us to propose that RS-like IDRs may be, in fact, a major driving force behind spliceosome dynamics in addition to fulfilling their role in the process of pre-mRNA recognition and intron/exon definition.

Figure 1. Intrinsic disorder content of the various groups of core spliceosome proteins. Values for all proteins of the snRNP subunits of the major spliceosome (snRNP proteins, major spl.) and for all the proteins of the major spliceosome (all proteins, major spl.) are marked in deeper shades. The orange line indicates means calculated per-protein (the disorder fraction was calculated for each protein first, and the mean of these fractions was then taken), while the green line indicates means calculated per-residue (the number of all disordered residues in a protein group divided by the total length of proteins in the group). Per-residue means are indicated above the line. Spliceosome protein groups are ordered according to per-residue means. doi:10.1371/journal.pcbi.1002641.g001
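The difference between the per-protein and per-residue means described in the caption of Figure 1 can be shown with a short sketch. Each protein is represented here as a pair (number of disordered residues, length). The function names and the two toy proteins are hypothetical and serve only to illustrate how the two averages diverge when protein lengths differ.

```python
def per_protein_mean(proteins):
    """Mean of the per-protein disorder fractions (every protein weighs the same)."""
    fractions = [disordered / length for disordered, length in proteins]
    return sum(fractions) / len(fractions)


def per_residue_mean(proteins):
    """All disordered residues divided by the total length (long proteins weigh more)."""
    total_disordered = sum(disordered for disordered, _ in proteins)
    total_length = sum(length for _, length in proteins)
    return total_disordered / total_length


# Hypothetical group: a short, mostly disordered protein and a long, mostly ordered one.
toy_group = [(90, 100), (100, 1000)]
toy_per_protein = per_protein_mean(toy_group)   # (0.9 + 0.1) / 2
toy_per_residue = per_residue_mean(toy_group)   # 190 / 1100
```

In this toy group the per-protein mean is much higher than the per-residue mean because the short protein contributes as much to the former as the long one.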


Figure 2. Types of disorder in core spliceosomal proteins. Compositionally biased disorder (Y-axis) vs. disorder with SS (X-axis). Datapoints are colored according to predicted total per-residue disorder content. Groups of all proteins of the major spliceosome and all proteins of the snRNP subunits of the major spliceosome are indicated in bold. doi:10.1371/journal.pcbi.1002641.g002

Non-abundant proteins contain more compositionally biased disorder than core spliceosomal proteins
We repeated our IDR analysis for 122 additional proteins consistently found in the results of proteomics analyses of the major spliceosome (Table S1). The addition of these proteins increased the overall predicted disorder content of the major spliceosome proteome to 52.3%. Hence, the overall disorder content of the auxiliary spliceosomal proteins is even higher than that of the core proteins. For most protein groups, adding non-abundant proteins changed IDR content values by less than 10% of the respective lengths of proteins involved (Figure 3). In particular, non-abundant early (A-complex and B-complex-associated) proteins are, like abundant early proteins, estimated to be more disordered than B-act proteins and C-complex proteins (59.5% and 58.4% disorder content vs. 52.5% and 51.2%). Compared to abundant proteins, non-abundant proteins are predicted to contain a larger amount of long regions of compositional disorder (Table S2). RS-like IDRs are again present in multiple proteins, including non-SR proteins. In the case of the EJC, three non-abundant proteins, acinus, pinin and RNPS1, supply the RS-like IDRs that are missing from the EJC as defined only by abundant proteins. We also found poly-P/Q regions, mainly in early (A-complex, U2 snRNP-related, pre-mRNA/mRNA-binding proteins and miscellaneous proteins) and hnRNP proteins. Short hnRNP-like G-rich regions are found predominantly in SR, A-complex, pre-mRNA/mRNA-binding proteins and miscellaneous proteins, as well as the EJC protein Aly/Ref. Most of the proteins that contain hnRNP-like G-rich IDRs have been confirmed to bind RNA. In short, the distribution of the non-hnRNP G-rich IDRs is similar to the distribution of other compositionally biased IDRs, and the distribution of compositionally biased IDRs in non-abundant proteins is similar to their distribution in abundant proteins. Some auxiliary proteins, such as the two RS-like IDR-rich splicing coactivators SRm160/300, are both extremely long and extremely disordered (SRm300: 2752 residues, predicted 98.1% disorder content). In this particular case, the SRm160/300 proteins are thought to form a matrix promoting interactions between splicing factors [31].

Compositionally biased disorder of spliceosome proteins (RS-like and glycine-rich) is associated with posttranslational modifications (serine phosphorylation and arginine methylation)
We next considered the association of post-translational modifications (PTMs) of human spliceosomal proteins with intrinsic disorder. To do so, we compared our data on IDR distribution throughout the human spliceosomal proteome with

Figure 3. Disorder in core vs. non-abundant spliceosome proteins. Blue bars indicate values of intrinsic disorder content for core proteins, green bars for both core and additional spliceosome proteins. The blue and green lines indicate means for given protein groups, calculated per-residue. Values for all core (blue) and all (green) proteins associated with the major spliceosome are shown in a deeper shade. doi:10.1371/journal.pcbi.1002641.g003

PTM data from UniProt [32]. Four distinct PTMs are found in UniProt data in large enough numbers to warrant numerical analysis: phosphorylations (on various residues), lysine N-acetylations, other N-terminal acetylations and arginine methylations (various types). Of these, N-terminal acetylation is a ubiquitous cellular process not connected to splicing: 80–90% of human proteins are acetylated on the N terminus [33]. Phosphorylations account for 82.6% of all PTMs of spliceosomal proteins found in UniProt (Table 1); phosphorylation on a serine is the most common (78.9% of all phosphorylations), followed by threonine (15.2%) and tyrosine (5.9%) phosphorylation. 32.2% of all phosphorylations are mapped to RS-like IDRs, even though such regions comprise only 7.1% of the combined length of the 252 spliceosome proteins. In the 122 core proteins of the major spliceosome, which include fewer SR proteins, RS-like IDRs comprise 3.2% of their combined length, but they encompass as many as 23.0% of all phosphorylation sites. This result suggests that the known cases of recorded functional importance of phosphorylation of RS-like IDRs in non-SR proteins may not be isolated, and that phosphorylation may be as important a control mechanism for the function of these sites as

it is for the RS domains of SR proteins. 9.7% of PTMs are lysine N-acetylations, which map to ordered and disordered regions in proportions similar to the total amounts of order vs. disorder for both the core 122 and all 252 proteins (0.6:0.4 order vs. disorder), and therefore do not appear to be associated with either order or disorder. Finally, UniProt registers 74 cases of arginine methylations in the 252 spliceosome proteins (3.4% of all PTMs). Almost all sites of arginine methylation are located in hnRNP protein G-rich regions and shorter hnRNP-like G-rich regions in Sm proteins, SR proteins and A-complex, pre-mRNA-binding and miscellaneous RNA-binding proteins. Note that UniProt does not list any arginine methylations for some proteins, such as Sm-D3, that have been shown to contain methylated arginines [34] and where we found a G-rich region (Table S2). Hence, arginine methylations may be more widespread than indicated by database data. The consideration of arginine methylation has so far been overshadowed by that of the far more widespread phosphorylation (see e.g. [8]). We suggest that the importance of arginine methylation for spliceosomal proteins should be considered in greater detail. In particular, the possibility exists that, if RS-like IDRs (of SR and other proteins) interact with

Table 1. Post-translational modifications in 252 spliceosome proteins.

Region / Modification | Phosphorylation (*) | Lysine N-acetylation | Other N-acetylation (**) | Arginine methylations (***) | Lysine methylations (****) | Cysteine methyl ester
Structural order      | 158                 | 127                  | 14                       | 5                           | 3                          | 0
Disorder with SS      | 326                 | 30                   | 20                       | 2                           | 0                          | 1
RS-like               | 572                 | 12                   | 1                        | 13                          | 2                          | 0
Poly-P/Q              | 137                 | 4                    | 0                        | 4                           | 0                          | 0
hnRNP-like G-rich     | 82                  | 6                    | 1                        | 42                          | 0                          | 0
Noncharged            | 43                  | 0                    | 2                        | 2                           | 0                          | 0
Charged               | 49                  | 3                    | 2                        | 0                           | 0                          | 0
Other disorder        | 412                 | 27                   | 44                       | 6                           | 1                          | 0
Total                 | 1779                | 209                  | 84                       | 74                          | 6                          | 1
Percent               | 82.6%               | 9.7%                 | 3.9%                     | 3.4%                        | 0.3%                       | 0.0%

(*) S, T and Y phosphorylation.
(**) N-terminal acetylation of MGASTV.
(***) Includes the keywords dimethylarginine, asymmetric dimethylarginine, omega-N-methylarginine.
(****) Includes the keywords N6-methyllysine, N6,N6-dimethyllysine, N6,N6,N6-trimethyllysine.
doi:10.1371/journal.pcbi.1002641.t001
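The percentages quoted in the text can be re-derived from the counts in Table 1. The sketch below is an illustrative check rather than part of the original analysis: the counts are copied from Table 1, the 7.1% length share of RS-like IDRs is taken from the text, and the fold-enrichment ratio and all names are assumptions introduced here.

```python
# Phosphorylation sites per region type, copied from Table 1.
phosphorylation_by_region = {
    "structural order": 158,
    "disorder with SS": 326,
    "RS-like": 572,
    "poly-P/Q": 137,
    "hnRNP-like G-rich": 82,
    "noncharged": 43,
    "charged": 49,
    "other disorder": 412,
}

# Total number of sites per modification type, copied from Table 1.
ptm_totals = {
    "phosphorylation": 1779,
    "lysine N-acetylation": 209,
    "other N-acetylation": 84,
    "arginine methylation": 74,
    "lysine methylation": 6,
    "cysteine methyl ester": 1,
}

all_ptms = sum(ptm_totals.values())
phospho_total = sum(phosphorylation_by_region.values())

# Share of phosphorylation among all PTMs, and share of phosphosites in RS-like IDRs.
phospho_share = ptm_totals["phosphorylation"] / all_ptms
rs_like_share = phosphorylation_by_region["RS-like"] / phospho_total

# Assumed illustration: site share divided by the length share quoted in the text (7.1%).
rs_like_length_share = 0.071
rs_like_fold_enrichment = rs_like_share / rs_like_length_share
```

Under these assumptions, RS-like IDRs carry roughly four to five times more phosphorylation sites than expected from their length alone.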

the hnRNP-like G-rich regions (of hnRNP and other proteins), these interactions may be modulated by phosphorylation and by methylation. UniProt also registers six cases of lysine methylation at five unique residues, two of them in disordered regions and three in ordered regions. Five of the six cases occur in proteins with methylated arginines.

ULMs are associated with early proteins, while other disordered recognition motifs are found throughout splicing complexes and candidate hub proteins are associated with later stages of splicing
To further analyze the possible roles of disorder that may acquire structure in the human spliceosome, we considered three sources of information: data from experimentally determined structures available in the Protein Data Bank (PDB) [35], predictions of disordered PFAM [36] domains and predictions of the most disordered proteins of the human spliceosome. We browsed the experimentally determined structures of spliceosomal protein complexes to find out which regions predicted to be disordered in isolation were found to be ordered in a complex. Short disordered ligand peptides (<30 residues) that acquire structure upon binding larger partners are called Molecular Recognition Features (MoRFs) [37], while larger sequence features of this kind are called domain-length disordered recognition motifs [16]. In the structures of spliceosomal protein complexes, we found eight distinct regions that fit either definition (Table 2, Figure S3). Three of these regions were the previously defined ULMs (UHM Ligand Motifs), that is, ligands for U2AF Homology Motif domains [38] (ELM database: LIG_ULM_U2AF65_1). Experimental structures containing ULMs represented U2 snRNP, U2 snRNP-related and A-complex proteins. Via a pattern recognition search, we found additional candidate regions for ULMs, mainly in low-abundance U2 snRNP-related proteins and A-complex proteins (Table S3). The majority of these tentative ULMs were predicted to be disordered. Although the presence of an individual ULM in a sequence may not be significant, we suggest that the concentration of sequences with ULM patterns at the early stage of the spliceosome action may be functionally relevant, and that the additional candidate ULMs may represent actual functional ULMs. If so, these additional ULMs could represent a non-essential extension of the essential UHM-ULM interactions, and UHM-ULM interactions may form an accessory network to the network created by compositionally biased IDRs (and their partners).
Notably, a list of candidate UHM partners for ULMs also contains mainly early spliceosomal proteins [39].
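A pattern search of the kind mentioned above can be sketched with a regular expression. The sketch assumes that the ULM pattern quoted in the footnote to Table 2, [KR]{1,4}[KR]-x{0,1}-[KR]W-x{0,1}, translates to the expression below, with x read as "any residue". The function name and the toy sequence are hypothetical; this is not necessarily the search procedure used for Table S3.

```python
import re

# Assumed translation of the ULM pattern from the Table 2 footnote:
# 2-5 basic residues, an optional residue, a basic residue, Trp, an optional residue.
ULM_PATTERN = re.compile(r"[KR]{1,4}[KR].?[KR]W.?")


def find_ulm_candidates(sequence):
    """Return (0-based start, matched text) for each non-overlapping pattern match."""
    return [(match.start(), match.group()) for match in ULM_PATTERN.finditer(sequence)]


# Invented toy sequence containing one stretch that fits the pattern.
toy_sequence = "AAKRKSRWDAA"
toy_hits = find_ulm_candidates(toy_sequence)
```

A match only marks a candidate: as noted in the text, an individual pattern hit may not be significant, so hits would still need to be checked against disorder predictions and protein context.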

Other recognition regions (U1snRNP70_N, SF3a60_bindingd, SF3b1, PRP4, Btz, all of which we labeled after PFAM regions) are found in complexes present at various stages of the splicing reaction. Notably, the U1snRNP70_N region encompasses two subregions, the C-terminal one of which is the only predicted disordered region shown through an experimental structure to bind RNA. Via a profile search, we found two additional candidate regions for the Btz motif and one additional candidate PRP4 region. The candidate Btz regions are found in TRAP150, an abundant A-complex protein, and its paralog BCLAF1, a low-abundance pre-mRNA/mRNA-binding protein that has been implicated in a wide range of processes [40]. The candidate PRP4 region is found in the U2 snRNP SF3A protein SF3a66. Unlike the ULMs, which appear to be widespread and function in multiple contexts at the early stage of splicing, non-ULM motifs appear to have specific functions and bind specific partners. To find other potential domain-length recognition motifs in spliceosomal proteins, we considered the PFAM domains that mapped to predicted IDRs. We found 51 such PFAM domains (Table S4), which included both conserved disordered regions in otherwise ordered proteins and the only conserved regions of almost completely disordered proteins. We propose these domains as targets for experimental structural analyses. Notably, when we compared the list of disordered PFAM domains with the list of the most disordered proteins in the spliceosomal proteome, we found that this group includes two out of three U4/U6.U5 tri-snRNP-specific proteins (U4/U6.U5-27K and 110K), as well as several conserved proteins associated with the B, B-act and C complex (e.g. MFAP1, RED, GCIP p29) that are also abundant in the human spliceosomal proteome [4] (Table 3; Figure S4). We suggest that the presence of conserved motifs comprising disordered PFAM domains in these abundant conserved highly disordered proteins may allow them to act as hub proteins.
If so, these proteins may be crucial to spliceosome dynamics. Targeted deletions of the conserved motifs within these proteins may help elucidate their role.

Conserved disordered regions in spliceosomal proteins are less widespread and evolutionarily younger than essential ordered domains in the core of the spliceosome
As spliceosomal proteins found in human are typically conserved throughout eukaryotes [41], we used the set of proteins found in the human spliceosomal proteome to determine the evolutionary path for the accumulation of order and disorder in the spliceosomal proteome. We investigated whether conserved

Table 2. Regions predicted to be disordered, found to be ordered in experimentally solved complexes of spliceosomal proteins.

(*) Domain names in brackets.
(**) ULMs correspond to the ELM motif LIG_ULM_U2AF65_1, defined by the pattern [KR]{1,4}[KR]-x{0,1}-[KR]W-x{0,1}.
(***) Non-abundant A-complex protein.
(****) The PRP4 region of Prp18 is ordered and its structure in isolation was solved. It is included in the table since the PRP4 region of U4/U6-60K is predicted to be partially disordered.
doi:10.1371/journal.pcbi.1002641.t002

U2 snRNP-related

Protein group

U4/U6 di-snRNP

A-complex (***)

step 2 factors

ordered and disordered PFAM domains present in human spliceosomal proteins were present in the last eukaryotic common ancestor species (LECA), according to [42], and whether they are currently ubiquitous outside of eukaryotes. The majority of both ordered and disordered PFAM domains were present in LECA (Table 4). However, while almost none of the disordered domains are currently widespread in prokaryotes, at least one-third of the ordered domains are. This suggests that, unlike disordered domains, these ordered domains may have been transferred to eukaryotes from prokaryotes, and may be, in fact, older than LECA. Notably, the contribution of these evolutionarily old domains is much higher in the ordered regions of the snRNP proteins than in the general group of abundant proteins. As many as 19 out of 29 (distinct) domains of the U4/U6.U5 tri-snRNP are old domains. Furthermore, the majority of the proteins of the U4/U6.U5 tri-snRNP, including the Sm/Lsm proteins but not the U4/U6.U5 tri-snRNP-specific proteins, either possess homologs among bacterial and non-splicing-related eukaryotic proteins or are composed of ubiquitous domains [1,43] (Table S5). The U5 snRNP contains ordered domains similar to those present in maturase proteins of modern bacterial group II introns [44], from which the spliceosome snRNAs and introns are predicted to have evolved [45]. In consequence, this group of proteins/domains has a strong potential to evolutionarily predate the eukaryotes. Likewise, the C-terminal region of the splicing helicases hPrp2/22/16/43 is also found in some bacterial helicases such as the Escherichia coli HrpA and therefore is likely to be ancient [46]. We suggest that the spliceosome likely accrued piecewise, and that these evolutionarily old regions, which are also the most ordered regions of the spliceosome, were recruited into the system first and formed the structural and functional core of the spliceosome.
Disordered regions, as well as ordered domains only found in eukaryotes, would in this scenario appear in the spliceosome later.

Reference

[29]

[87]

[38]

[88]

[89]

[29]

Structure

[90]

1MZW

3CW1

3CW1

1O0P

[91]

1JMT

2DT7

2F9D

2DK4 ordered

2PEH

Predicted ordered/disordered status in isolation

disordered, next to ordered helix

partially ordered

SF3b14a/p14 (RRM)

SF3a120 (Surp)

U1-C (zf-U1)

U2AF35 (UHM)

U2AF65 (UHM)

SPF45 (UHM)

Partner (*)

partially ordered

U4/U6-20K

U1 snRNA

partially ordered

U1 snRNP

U1 snRNP

EIF4A3

disordered, next to ordered helix

disordered

disordered

disordered

disordered

2J0S

[92]

The spliceosomal and the ribosomal proteomes have a similar fraction of disordered residues, but different types of intrinsic disorder
As the final step of our analysis, we compared the fractions and distributions of intrinsic disorder in the proteomes of the subunits of the human major spliceosome and the human and the Escherichia coli ribosomes. The bacterial ribosome was chosen to supplement structural information on disorder-to-order transition, as no crystal structure of the human ribosome is presently available. Our comparison revealed a number of similarities and differences between the proteins of the human snRNP subunits and both ribosomes (Table 5). The percentage of residues predicted to be disordered is slightly higher in the ribosomal proteins than in the proteins of the spliceosomal snRNP subunits. The human ribosome contains more intrinsic disorder than the E. coli one, in keeping with the overall higher disorder content in eukaryotic proteins [47]. However, the types of the predicted disorder in the ribosomes and in the spliceosome are different. IDRs in ribosomal proteins are much shorter. While the number of proteins with at least one IDR ≥30 residues is similar between the human ribosome and the human spliceosome, the spliceosome subunits contain twice as many proteins with at least one IDR ≥70 residues as the human ribosome (Figure S5). Furthermore, the majority of intrinsic disorder in ribosomal proteins is predicted to contain SS elements, while the majority of intrinsic disorder in spliceosomal snRNP proteins is predicted not to contain secondary structure. There are 15 distinct non-SS IDRs ≥70 residues in the subunits of the human spliceosome, but only three such regions in the human ribosome and none in the bacterial ribosome. Disordered regions ≥70 residues without
8 August 2012 | Volume 8 | Issue 8 | e1002641

U2, SF3A

U2, SF3B

U2, SF3B

107137

71106

U1-70K

SF3a60

U4/U6-60K

SF3b155

SF3b155

Protein

U2AF65

U1-70K

77115 Prp18 PRP4 (****) Domain-length

822

short, RNA-binding

SF3a60_bindingd

N-U1snRNP70_N

C-U1snRNP70_N

ULM (**)

Region

SF3b1

Domain-length

MoRF

PRP4

ULM

ULM

Domain-length

MoRF

PLoS Computational Biology | www.ploscompbiol.org

Btz

Domain-length

MoRF

MoRF

MoRF

Type

MLN51

SF1

169196, 215230

333342

377415

Region

90112

6389

1325

EJC

Disorder in the Spliceosomal Proteome

Table 3. Most highly disordered proteins in the spliceosomal proteome.

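Abundance | Protein | Disorder fraction | PFAM domains | Group
Abundant | SPF30 | 80.3% | SMN | U2 snRNP-related
Abundant | U4/U6.U5-110K | 87.9% | SART-1 | U4/U6.U5 tri-snRNP
Abundant | U4/U6.U5-27K | 76.8% | DUF1777 | U4/U6.U5 tri-snRNP
Abundant | CCAP2 | 78.2% | Cwf_Cwc_15 | hPrp19/CDC5L
Abundant | TRAP150 | 100.0% | - | A-complex
Abundant | MFAP1 | 79.3% | MFAP1_C | B-complex
Abundant | RED | 79.5% | RED_N, RED_C | B-complex
Abundant | MGC23918 | 100.0% | cwf18 | B-act complex
Abundant | HSPC220 | 84.8% | Hep_59 | C-complex
Abundant | GCIP p29 | 93.0% | SYF2 | C-complex
Non-abundant | U11/U12-59K | 91.1% | - | U11/U12
Non-abundant | Npw38BP | 93.8% | Wbp11 | hPrp19/CDC5L
Non-abundant | MLN51 | 100.0% | Btz | EJC
Non-abundant | pinin | 92.3% | Pinin_SDK_N, Pinin_SDK_memA | EJC
Non-abundant | MGC13125 | 93.5% | Bud13 | RES
Non-abundant | C19orf43 | 88.6% | - | A-complex
Non-abundant | FLJ10154 | 100.0% | - | A-complex
Non-abundant | CCDC55 | 100.0% | DUF2040 | B-complex
Non-abundant | CCDC49 | 100.0% | CWC25 | B-complex
Non-abundant | PRCC | 100.0% | PRCC_Cterm | B-act complex
Non-abundant | DGCR14 | 86.1% | Es2 | C-complex
Non-abundant | DKFZP586O0120 | 100.0% | DUF1754 | C-complex
Non-abundant | FLJ22626 | 100.0% | SynMuv_product | C-complex
Non-abundant | LENG1 | 100.0% | Cir_N | C-complex
Non-abundant | BCLAF1 | 100.0% | - | pre-mRNA/mRNA-binding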

Entries in this table fulfill simultaneously two conditions: they have a predicted disorder content >75%, and do not contain any PFAM domains that correspond to ordered structural domains. doi:10.1371/journal.pcbi.1002641.t003

secondary structure comprise 8.3% of the total mass of the snRNP subunits of the major human spliceosome, but only 0.4% in the human ribosome (Figure S6). Hence, intrinsic disorder in the ribosomes is considerably more structured than the disorder in the spliceosome. Both in the E. coli and in the human ribosomes, the large subunit is predicted to contain a higher percentage of disorder than the small subunit. However, the differences in the fraction and type of disorder are less pronounced between the ribosomal subunits than between the various subunits of the spliceosome. The ribosome is therefore more homogeneous with respect to the distribution of the intrinsic disorder of its proteins than the spliceosome.

The inspection of crystal structures confirms the predicted differences. 98.9% of predicted disordered residues of 51 E. coli ribosomal proteins are found ordered in one or more crystal structures of this ribosome. Only three proteins, L10, L7/L12 and S1, are missing from all crystal structures of ribosomes deposited in the PDB. Of these proteins, only L7/L12 contains an interdomain linker that is confirmed not to acquire structure in a complex [48], while only S1 contains a C-terminal disordered extension whose

Table 4. Statistics of conserved ordered and disordered PFAM domains.

Domain class | Protein set | All domains | Domains found in LECA | Domains found in prokaryotes (**)
Ordered domains | All proteins | 124 | 121 | 47 (37.9%)
Ordered domains | Abundant proteins | 86 | 86 | 34 (39.5%)
Ordered domains | U4/U6.U5 tri-snRNP (*) | 29 | 29 | 19 (65.5%)
Disordered domains | All proteins | 46 | 36 | 1 (0.0%)
Disordered domains | Abundant proteins | 24 | 22 | 0 (0.0%)
Disordered domains | U4/U6.U5 tri-snRNP | 5 | 5 | 0 (0.0%)

(*) Including the LSM domain present in Sm and Lsm proteins. (**) In >100 copies. doi:10.1371/journal.pcbi.1002641.t004


Table 5. Features of intrinsic disorder in E. coli and human ribosomes and human major spliceosome snRNP subunits.

Feature | Ribosome, E. coli | Ribosome, human | Major spliceosome, snRNP subunits, human
Number of proteins | 54 | 80 | 45
Maximum protein length (aa) | 557 (S1) | 427 (L4) | 2335 (U5-220K/hPrp8)
Mean protein length (aa) | 132 | 170 | 453
Fraction of predicted disorder (% of the combined lengths of proteins) | 37.7% | 47.0% | 34.1%
Number of proteins with at least one IDR ≥30 residues | 28 | 61 | 28
Number of proteins with at least one IDR ≥70 residues | 1 | 19 | 23
Mean IDR length (aa) | 28 | 39 | 93
Fraction of predicted disordered residues with secondary structure (% predicted disorder) | 66.6% | 64.0% | 41.9%
Number of non-PSE IDRs ≥70 residues | 0 | 3 | 15
Fraction of predicted disordered residues found in the crystal structure of the complex (% of predicted disorder) | 98.9% | - | <10% (U1 snRNP)
Minimal and maximal fractions of predicted disordered residues for individual subunits | 34.8% (small subunit) - 40.0% (large subunit) | 39.1% (small subunit) - 52.2% (large subunit) | 20.1% (U5 snRNP) - 65.5% (U1 snRNP)
Maximum RNA length (nt) | 2904 (23S) | 5070 (28S) | 188 (U2 snRNA)(*)
RNA fraction of total weight (% total weight) | 65.2% | 60.3% | 8.2%

(*) Saccharomyces cerevisiae U1 snRNA is 570 nts long, while the U2 snRNA is 1172 nts long. Such exceptional lengths are restricted to the genus Saccharomyces. doi:10.1371/journal.pcbi.1002641.t005
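
The length-based features in Table 5 can be derived from per-residue disorder predictions. The following is a minimal sketch under stated assumptions, not the procedure used in the paper: it assumes each protein is represented by a boolean mask (True = predicted disordered), treats every maximal run of disordered residues as one IDR, and uses made-up toy masks. The function names are hypothetical.

```python
from itertools import groupby


def idr_lengths(mask):
    """Return the lengths of all maximal runs of disordered residues.

    `mask` is a per-residue sequence of booleans (True = predicted disordered).
    """
    return [sum(1 for _ in run) for is_disordered, run in groupby(mask) if is_disordered]


def proteome_disorder_stats(masks, thresholds=(30, 70)):
    """Summarize disorder over a dict {protein name: per-residue mask}."""
    total_residues = sum(len(m) for m in masks.values())
    disordered_residues = sum(sum(m) for m in masks.values())
    per_protein = {name: idr_lengths(m) for name, m in masks.items()}
    all_idrs = [n for lengths in per_protein.values() for n in lengths]

    stats = {
        "n_proteins": len(masks),
        "disorder_fraction": disordered_residues / total_residues,
        # Assumption: the mean is taken over every IDR, with no minimum length.
        "mean_idr_length": sum(all_idrs) / len(all_idrs) if all_idrs else 0.0,
    }
    for t in thresholds:
        # Count proteins that have at least one IDR of length >= t.
        stats[f"proteins_with_idr_ge_{t}"] = sum(
            any(n >= t for n in lengths) for lengths in per_protein.values()
        )
    return stats


# Toy input (hypothetical proteins, not data from the paper).
toy = {
    "P1": [True] * 80 + [False] * 20,
    "P2": [False] * 50 + [True] * 35 + [False] * 15,
    "P3": [False] * 100,
}
print(proteome_disorder_stats(toy))
```

Counting proteins rather than regions for the ≥30 and ≥70 thresholds mirrors the wording of the table rows; the non-PSE row would instead count regions.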

fate in a ribosome-bound form is unknown. This contrasts with the experimentally determined structure of the U1 snRNP, which reveals order for less than 10% of residues predicted to be disordered in isolated U1 proteins. As described in the Introduction, the main function fulfilled by IDRs in the ribosome is to be the mortar that fills in the gaps in the rRNAs, while the RNA forms the bulk of the macromolecular structure of the ribosome and defines its shape and catalytic center [23,49]. Only in few cases is a different function realized. For instance, the flexible interdomain linker of protein L7/L12 interfaces the ribosome with ribosome-acting GTPases [48]. We suggest that the prominence of the mortar function is the reason both for the greater homogeneity of disorder types and their spatial distribution in the ribosomes, and the prevalence of disorder with SS in the ribosomes. Although, in percentages, both the ribosomes and the spliceosome contain a similar amount of SS disorder, so far, there is very little structural evidence for the mortar function of the proteins of the spliceosome. We found only one predicted disordered region confirmed to bind RNA in all experimental structures of the spliceosome (C-terminal part of the U1snRNP70_N region, Table 2). Most experimental structures of splicing-related complexes feature ordered domains on the protein side. It is possible that novel structures will reveal binding interfaces wherein protein disorder supports the RNA in a mortar-like manner. However, the mortar role of intrinsic disorder may be simply less important in the spliceosome. The ribosomal RNA is longer in residues than any given ribosomal protein, occupies more space and has a higher molecular mass than all ribosomal proteins combined (Figure S6). In comparison, the snRNAs are much shorter than the rRNAs. Being shorter, they may be more likely to form a catalytically active form unaided by proteins and thus be in less need of mortar.

Summary and conclusions

The spliceosome has been called a molecular machine [11]. While useful, this metaphor may also be misleading, as it brings to mind the image of a precise, assiduously controlled and operated mechanism proceeding to perform the splicing reaction according to discrete and precise steps. This mechanistic view of spliceosome action leaves very little room for uncertainty, randomness, and fuzziness. In this work, we made multiple predictions regarding individual regions of human spliceosomal proteins as well as systematically analyzed the fraction, distribution and types of disorder across the various spliceosomal components. Summarizing, we found that the spliceosome, far from being a uniformly ordered machine, can be divided into three layers:

An inner layer, which best fits the definition of a machine. It includes the ordered cores of U2 snRNP SF3B, U4/U6 di-snRNP and U5 snRNP, as well as the Sm proteins of U1 snRNP and ordered C termini of the catalytic helicases. This layer also includes snRNAs. Proteins from this layer mainly assist the catalysis of the splicing reaction, and publications regarding this layer stress relatively precise mechanisms, such as kinetic proofreading [50]. Sm proteins, ordered proteins of the U4/U6 di-snRNP and U5 snRNP, as well as the C termini of catalytic helicases, are most likely the evolutionarily oldest peptide elements of the spliceosome.

A middle layer, which is associated mostly with structured disorder (disorder with SS). It contains an abundance of domain-length disordered recognition motifs, disorder with predicted secondary structure that can act as, e.g., preformed structural elements and/or dual personality disorder, and long, highly disordered proteins with conserved disordered regions. Spatiotemporally, this layer is associated with U4/U6.U5 tri-snRNP-specific proteins, and B, B-act and C-complex non-snRNP proteins. Functionally, this layer is associated with spliceosome assembly, catalytic activation and dynamics. Many of these regions are phosphorylated. In addition to disorder with SS, this layer is also associated with some RS-like IDRs that function in splicing dynamics, such as [30]. This layer is also associated with ubiquitin-dependent systems. Ubiquitin has been shown to control the dynamics of the spliceosome in several cases [51]. Proteins of the spliceosome contain many ubiquitin-related domains, and the majority of these domains are found in the proteins associated with the later stages of splicing [52].

An outer layer, which is associated with mostly unstructured disorder. It is enriched in regions of long, compositionally biased disorder that may function as sensors that the spliceosome extends to the surrounding environment. These regions contain interaction sites such as RS-like IDRs, hnRNP-like G-rich regions, polyproline regions and ULMs. They may interact with each other, or with small ordered structural domains such as the Tudor domain (bound by hnRNP-like G-rich regions) and GYF domain (bound by polyproline regions). On the other hand, small RNA-binding domains present in this layer, such as RRM (RNA Recognition Motif) and PWI, may aid in the binding of the substrate pre-mRNA. The function of this layer is regulated by phosphorylation (e.g. in RS-like IDRs) and methylation (e.g. in hnRNP-like G-rich regions). Spatiotemporally, this layer is associated with early (A-complex, U1, U2 SF3A, U11/U12, U2-related) proteins, with SR, hnRNP proteins, and SRm160/300 proteins, and with RES complex proteins. Functionally, this layer is associated with early recognition, intron/exon definition, and alternative splicing regulation processes.

Full understanding of spliceosome activity requires information about each of its elements, at different functional stages [11]. Our predictions provide a number of testable functional hypotheses:

We provide the proteins and positions of all types of compositionally biased disordered regions in spliceosomal proteins. Based on the colocation of two types of disordered regions (RS-like and G-rich), we suggest that these regions may interact with each other. As these two types of disordered regions are found in multiple proteins throughout the human spliceosomal proteome, we also suggest the possibility that many more human spliceosomal proteins interact nonspecifically with each other and the RNAs than previously suggested. Large-scale deletions of compositionally biased regions may suggest essential subsystems of this interaction network;

We found that arginine methylation in spliceosomal proteins is associated with intrinsically disordered regions. We also suggest that arginine methylation and serine phosphorylation act in step to regulate the interaction network based on compositionally biased disordered regions. The elucidation of the effect of posttranslational modifications, such as conformational transitions and molecular interactions that depend on the introduction or removal of particular modifications, can also lead to an improved understanding of regulatory mechanisms;

We provide candidate ULM sequences that can bind known and predicted UHM domains throughout the early stages of splicing. These sequences may participate in the regulation of particular instances of splicing;

We suggest several abundant conserved proteins found in the later stages of splicing that may function as hub proteins (e.g. MFAP1, GCIP p29, U4/U6.U5 tri-snRNP proteins). Targeted deletions of ordered motifs within these proteins may reveal regions responsible for the formation of particular spliceosomal complexes, their rearrangements, and interactions with regulatory factors.

Our prediction that more than one-third of the residues of the snRNPs are disordered has significant implications for the structural studies of the spliceosome. While much progress has been achieved in the determination of global shapes of various spliceosomal assemblies by cryoEM [53], experimental structural information is missing for many regions of spliceosomal proteins. Intrinsic disorder in the spliceosome explains why: the functional importance of disordered regions notwithstanding, their physicochemical properties make them notorious spoilers of crystallization experiments [54]. Our predictions of disorder may guide the preparation of protein variants for crystallization that should be limited to regions that are intrinsically ordered or at least predicted to become ordered upon complex formation. For long disordered regions without secondary structure, stable conformations may not be obtained even in complexes. However, the structural characterization of intrinsically disordered elements of the spliceosome may require the application of completely different methods, such as small angle X-ray or neutron scattering (SAXS or SANS) experiments (review: [55]) and modeling with computational tools such as the Ensemble Optimization Method [56]. The results of our analyses will hopefully aid these efforts.

Methods

Data
Spliceosome proteins with GI identifiers supplied in Table S1 were downloaded from the NCBI Protein database. Protein names and identifiers were acquired from [4,6,7,57–61]. Division into abundant and non-abundant proteins was based on [4]. Assignment into protein groups was based mainly on [4], aided by information from: [6,58–60]. Miscellaneous proteins were classified in primary sources, variably, as miscellaneous proteins, miscellaneous splicing factors, additional proteins, proteins not reproducibly detected, proteins not previously detected.

Prediction of intrinsic disorder and binding disorder
Initial predictions of intrinsic disorder were carried out using the GeneSilico MetaDisorder server (http://iimcb.genesilico.pl/metadisorder/; [24]). Subsequently, disorder boundaries yielded by MetaDisorder were corrected manually based on predictions of secondary structure and solvent accessibility yielded by the GeneSilico MetaServer gateway (https://genesilico.pl/meta2/; [25]). In particular, sequence regions predicted to exhibit stable secondary structure and high fraction of solvent inaccessible residues, and confidently aligned to experimentally determined globular protein structures, were considered ordered regardless of the primary disorder prediction. Prediction of binding disorder was carried out using the ANCHOR server [62].

Assignment of disorder with predicted secondary structure
In disorder with SS, the disordered region is predicted to contain one or both types of canonical α and β SS elements. The predicted secondary structure may be either pre-formed in the disordered state or appear only upon the formation of a stable structure, e.g. upon binding to another molecule. This type of disorder also at times contains short ordered regions (Table 6, Figure S7). We defined regions of disorder with SS (predicted intrinsic disorder with predicted secondary structure elements) as regions for which simultaneously the majority of intrinsic disorder prediction methods on the MetaServer gateway yielded predictions of disorder and the majority of secondary structure prediction methods yielded predictions of secondary structure elements. Multiple closely spaced secondary structure elements (connected by loops <20 residues) in a predicted disordered region were treated as elements of a single IDR with SS. If an IDR was predicted to contain α-helical elements and coiled-coil prediction methods aggregated on the MetaServer also yielded a prediction, the IDR was classified into the special class of disorder with coiled coils.
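
The definition of disorder with SS given above (majority of disorder predictors and majority of secondary structure predictors agree, with elements joined across loops shorter than 20 residues) can be illustrated with a short sketch. This is an assumption-laden illustration rather than the authors' pipeline: predictor outputs are modeled as per-residue boolean lists, "majority" is taken as strictly more than half, loop residues are required to be predicted disordered for two elements to be merged, and all function names are hypothetical.

```python
def majority(votes):
    """Per-residue strict majority over several boolean prediction tracks."""
    n = len(votes)
    return [2 * sum(column) > n for column in zip(*votes)]


def runs(mask):
    """Half-open (start, end) intervals of consecutive True values."""
    out, start = [], None
    for i, value in enumerate(mask):
        if value and start is None:
            start = i
        elif not value and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(mask)))
    return out


def idrs_with_ss(disorder_votes, ss_votes, max_loop=20):
    """Regions where both consensus tracks agree, merged across short loops."""
    disordered = majority(disorder_votes)
    structured = majority(ss_votes)
    both = [d and s for d, s in zip(disordered, structured)]

    merged = []
    for start, end in runs(both):
        if merged:
            prev_start, prev_end = merged[-1]
            loop_is_short = start - prev_end < max_loop
            # Assumption: the connecting loop must itself be predicted disordered.
            loop_is_disordered = all(disordered[prev_end:start])
            if loop_is_short and loop_is_disordered:
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    return merged


# Toy example: three disorder predictors and three SS predictors over 60 residues.
n = 60
dis_track = [i < 50 for i in range(n)]
ss_track = [5 <= i < 15 or 25 <= i < 35 for i in range(n)]
disorder_votes = [dis_track, dis_track, [False] * n]
ss_votes = [ss_track, ss_track, [False] * n]
print(idrs_with_ss(disorder_votes, ss_votes))
```

With the default threshold the two predicted elements are separated by a 10-residue loop and are reported as a single region; lowering `max_loop` keeps them apart.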

Assignment of disorder with compositional bias


In compositionally biased disorder, the amino acid composition of the region deviates highly from the usual. We estimated compositional bias based on the absolute frequencies of occurrence of residues, compared to their usual frequency in vertebrates, as reported on the website http://www.tiem.utk.edu/~gross/bioed/webmodules/aminoacid.htm (information from [63,64]). A residue was considered overrepresented if (a) the region under consideration displayed considerable compositional bias (at least one kind of residue occurred with a frequency >20% or five times higher than its usual frequency of occurrence in vertebrates) and (b) this particular residue occurred in the region with a frequency >20% or three times higher than the usual frequency of occurrence in vertebrates. For several types of compositionally biased IDRs with a previous description in literature, we sought to define relevant standard IDR subclasses within our classification (Table 6):

RS-like: IDRs that are rich in arginine and serine residues. These regions were shown to be intrinsically disordered [65]. They are predicted to have high solvent accessibility (Figure S7). They may be phosphorylated on the serines [66]. RS-like regions were found in splicing factors from the SR family (RS domains) and in other spliceosomal proteins [67]. RS domains of SR proteins bind other RS-like IDRs as well as (pre-m)RNA and are crucial for the establishment of a network of weak contacts at the initial stages of splicing and intron/exon definition [66]. Phosphorylation of some RS domains enhances their binding [68,69]. Phosphorylation of the RS-like IDR of the U5 snRNP protein DDX23 is also required for its stable association (with the U4/U6.U5 tri-snRNP) [30].

polyP/Q: IDRs that contain repeats of proline or glutamine residues. polyP/Q regions are capable of generating type II poly-P or poly-Q helices [70] and may contain short linear motifs involved in nonspecific binding of GYF and WW-type domains [41]. They are predicted to have high solvent accessibility (Figure S7). Several spliceosomal proteins, such as the Sm protein SmB/B′, were shown to contain polyP/Q regions that interact with GYF and WW-type domains. Collectively, these regions are necessary for the formation of complex A [71].

hnRNP-like G-rich: IDRs that contain RGG and related repeats ([RSY]GG, R[AGT][AGTFIVR]) that can be classified as short (≤100 residues) and long ones. These regions are predicted to have low solvent accessibility (Figure S7), but do not contain canonical higher order structures [72]. Repeats that contain arginines may be methylated on these residues [73]. Long G-rich IDRs were found in hnRNP proteins [74], while shorter G-rich IDRs are found in other splicing proteins, such as SmB/B′, SF2/ASF and U1-70K ([73], [75], [76]). The G-rich region of hnRNP A1 has been shown to bind in vitro itself and other hnRNP proteins [77], to be necessary for the binding of hnRNP A1 to the U2 and U4 snRNPs [78], and to silence splicing [79]. Arginine-methylated G-rich regions may interact with the Tudor domain of the SMN protein [80,81]. Arginine methylation of yeast U1-70K homolog decreases binding of this protein by protein Npl3 [76].

We also developed two additional subclasses of compositionally biased IDRs to complement these classes of compositionally disordered IDRs:

noncharged disorder, which is rich in noncharged residues (PQMGVWA);

charged disorder, which is rich in charged residues (RKDE). The charged compositionally biased disorder is similar to a type of disorder with SS that has predictions for coiled-coil secondary structure.
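
The two-step overrepresentation criterion described at the start of this subsection can be written down compactly. The sketch below is only one reading of that rule: the background frequencies are a uniform 5% placeholder (the vertebrate frequencies cited in the text are not reproduced here), "times higher" is interpreted as a strict fold threshold, and the function name is hypothetical.

```python
from collections import Counter

# Placeholder background: uniform 5% per residue. This is NOT the vertebrate
# frequency table cited in the text; substitute the real values before use.
UNIFORM_BACKGROUND = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}


def overrepresented_residues(region, background, abs_cut=0.20,
                             region_fold=5.0, residue_fold=3.0):
    """Return the set of residues considered overrepresented in `region`.

    (a) The region must show considerable bias: at least one residue type with
        frequency > abs_cut or > region_fold times its background frequency.
    (b) A residue is then reported if its own frequency is > abs_cut or
        > residue_fold times its background frequency.
    """
    counts = Counter(region)
    freq = {aa: n / len(region) for aa, n in counts.items()}

    def exceeds(aa, fold):
        return freq[aa] > abs_cut or freq[aa] > fold * background[aa]

    if not any(exceeds(aa, region_fold) for aa in freq):
        return set()
    return {aa for aa in freq if exceeds(aa, residue_fold)}


# Toy regions (hypothetical sequences).
print(sorted(overrepresented_residues("RSRSRSRSGA", UNIFORM_BACKGROUND)))
print(overrepresented_residues("ACDEFGHIKLMNPQRSTVWY", UNIFORM_BACKGROUND))
```

Keeping the region-level test (a) separate from the residue-level test (b) means that a residue moderately enriched in an otherwise unbiased region is not reported.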

PTM data
Site identifiers of 2153 known or possible post-translational modifications, including 720 modifications of the 122 core proteins, were downloaded from UniProt [32]. The following post-translational modifications were included: serine, threonine and tyrosine phosphorylations, lysine N-acetylations, N-alpha-terminal N-acetylations of non-lysine residues (MGASTV), various arginine methylations and various lysine methylations. All available site identifiers were used in the analysis (i.e. including sites with a status note "By similarity" and sites identified as "Potential" or "Probable"). 132 modification sites had a status note "Status = By similarity" and 8 had a status note "Status = Potential" or "Status = Probable". Removing these sites did not impact overall statistics. In the listing, different modifications at the same residue are considered separately (e.g. different possible arginine methylations), and the paper follows this model.

Table 6. Features of different IDR classes in the 130 spliceosomal proteins.

IDR class | Description | Number of regions | Mean length | Compositional bias
disorder with SS | contains secondary structure | 95 (predicted to contain coiled coils), 115 (other types) | 64 aa (predicted to contain coiled coils), 55 aa (other types) | RKDE with additional MQW (predicted to contain coiled coils), no rule (other types)
compositionally biased, RS-like | biased towards arginine and serine residues | 35 | 65 aa | RS
compositionally biased, polyP/Q | noncharged with poly P/Q (P/Q(n), n≥3) repeats | 17 | 138 aa | PQMGVWA
compositionally biased, hnRNP G-rich | contains RGG and related repeats ([RSY]GG, R[AGT][AGTFIVR]) (*) | 4 (hnRNP proteins), 10 (other proteins) | 145 aa (hnRNP proteins), 56 aa (other proteins) | GRY
compositionally biased, noncharged | biased towards noncharged residues | 16 | 45 aa | PQMGVWA
compositionally biased, charged | biased towards charged residues | 9 | 57 aa | RKDE

(*) [72]: XGG, where X aromatic or long aliphatic; arginine methylation data: R[AGT][AGTFIVR]. doi:10.1371/journal.pcbi.1002641.t006


Pattern recognition and motif search


Assignment of boundaries for hnRNP-like G-rich regions and for positions of candidate ULMs was based on pattern analysis. For hnRNP-like G-rich regions, the following patterns were used: [RSY]GG-x{1,50}-[RSY]GG-x{1,50}-[RSY]GG; R[AGT][AGTFIVR]-x{1,25}-RGG-x{1,25}-R[AGT][AGTFIVR]. For ULMs, the following pattern was used: [RK]{1,}-[RK]-x{0,1}-[RK]{1,}-x{0,1}-W-x{0,2}-[DE]{1,}. The ULM consensus pattern was based on the sequences of known ULMs found in experimentally determined structures of ULM complexes. This stringent pattern does not retrieve all of the bona fide ULMs in protein SF3b155 that display a weaker binding affinity to the U2AF65 partner than the ULM found in the experimentally determined structure [82]. We decided to use a stringent pattern in order to reduce the number of possible false positives compared to the more lenient pattern described in literature [39]. Search for domain-length disordered recognition motifs was carried out with HHSEARCH [83].
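
The patterns above are written in a PROSITE-like notation. As a hedged illustration of how such a search could be run, the sketch below translates them into Python regular expressions, reading `x{m,n}` as m to n arbitrary residues and the hyphens as plain separators. The translation, the helper name `find_motifs` and the toy sequences are assumptions made for this example; the paper does not state which tool performed the pattern search.

```python
import re

# Regex translations of the G-rich patterns (x{m,n} -> .{m,n}).
G_RICH_PATTERNS = [
    re.compile(r"[RSY]GG.{1,50}[RSY]GG.{1,50}[RSY]GG"),
    re.compile(r"R[AGT][AGTFIVR].{1,25}RGG.{1,25}R[AGT][AGTFIVR]"),
]

# Regex translation of the stringent ULM pattern ({1,} -> +, x{0,1} -> .?).
ULM_PATTERN = re.compile(r"[RK]+[RK].?[RK]+.?W.{0,2}[DE]+")


def find_motifs(sequence, pattern):
    """Return (start, end, matched text) for non-overlapping matches.

    Coordinates are 0-based and half-open, as in Python slicing.
    """
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]


# Toy sequences invented for illustration, not taken from any real protein.
print(find_motifs("GGSKKKVRKYWDVPP", ULM_PATTERN))
print(find_motifs("RGG" + "A" * 5 + "SGG" + "A" * 5 + "YGG", G_RICH_PATTERNS[0]))
```

Because `finditer` reports non-overlapping matches and the quantifiers are greedy, region boundaries obtained this way may need manual inspection when repeats are densely packed.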

Assignment of PFAM domains in disordered regions and LECA presence for disordered PFAM domains
PFAM IDs were assigned on the PFAM website [36]. The list of disordered domains present in LECA was established based on a list of predicted LECA domains kindly provided by Prof. Adam Godzik and Dr. Christian M. Zmasek [42].

Analysis of disorder and disorder-to-order transition in E. coli and human ribosome
E. coli and human ribosomal proteins were extracted from the Ribosomal Protein Gene database (RPG) [84]. The following crystal structures of E. coli ribosomes and ribosomal proteins were used to determine disorder-to-order transitions: majority of proteins: PDB ID: 2QAM (subunit 50S, resolution 3.21 Å) and 2QAN (subunit 30S, resolution 3.21 Å); protein L31: ribosomal structure 2AW4; protein L1: ribosomal structure 3FIK. For protein L7/L12, a dimer structure was used (PDB ID: 1RQU), while for protein S1 only the one available structure of a single domain was used (PDB ID: 2KHI). Although a crystal structure of a eukaryotic ribosome has been recently determined, many amino acid residues within this structure are unassigned [85]. Hence, this structure is unsuitable for the examination of sequences that alter their state between order and disorder.

Visualization
Disorder and binding disorder plots were generated using the ANCHOR server (http://anchor.enzim.hu) [62]. Molecular structure graphics were produced with UCSF Chimera [86].

Supporting Information
Figure S1 The hierarchy of classification of intrinsic disorder in the spliceosomal proteome. Compositionally biased disorder includes only disorder predicted not to contain any secondary structure elements. (TIF)

Figure S2 Types of disorder in core spliceosomal proteins. This figure shows the fractions of all types of disorder with SS (left) and compositionally biased disorder (right) in various groups of core spliceosomal proteins. Values are given as fractions of total disorder. In this figure, disorder with SS is divided based on the presence or absence of coiled coils and types of secondary structure. (TIF)

Figure S3 MoRFs in the structures of spliceosome proteins. A: N-U1snRNP70_N (in yellow) and C-U1snRNP70_N (in red) (protein U1-70K in the structure of U1 snRNP with removed Sm proteins, PDB ID: 3CW1). B: ULM (protein SF3b155 in complex with SPF45, PDB ID: 2PEH). C: ULM (protein U2AF65 in complex with U2AF35, PDB ID: 1JMT). D: SF3b1 (protein SF3b155 in complex with SF3b14a/p14, PDB ID: 2F9D). E: SF3a60_bindingd (protein SF3a60 in complex with SF3a120, PDB ID: 2DT7). F: Btz (protein MLN51 in the structure of the exon-junction complex, PDB ID: 2J0S). (TIF)

Figure S4 Disorder plots for highly disordered spliceosome proteins. Example disorder plots created by the ANCHOR server, http://anchor.enzim.hu. Red line: disorder probability; blue line: probability of binding another molecule at the residue; blue line at the bottom: another representation of the binding probability (the darker the blue, the higher the probability). A. MLN51 (EJC protein). The region corresponding to the Btz MoRF lies between residues 169–230. B. U4/U6.U5-110K. C. U4/U6.U5-27K. (TIF)

Figure S5 IDR lengths in E. coli and human ribosome and human major spliceosome snRNP subunits. This graph shows the fraction of proteins in the proteomes of the E. coli (orange) and human ribosome (green) and the snRNP subunits of the major spliceosome (blue) that contain at least one IDR of a given length. (TIF)

Figure S6 Structural regions in E. coli and human ribosome and human major spliceosome snRNP subunits. This graph shows the fractions of the total weight of the three complexes taken up by different types of structural regions. The Sm proteins were calculated four times each towards the weight of the spliceosome. (TIF)

Figure S7 Disorder plots for various types of IDRs found in spliceosome proteins. Example disorder plots created by the ANCHOR server, http://anchor.enzim.hu. Red line: disorder probability; blue line: probability of binding another molecule at the residue; blue line at the bottom: another representation of the binding probability (the darker the blue, the higher the probability). A. IDR with SS: SF3b145, residues 738–818; B. RS-like IDR: protein 9G8, residues 121–215; C. polyP/Q IDR: SF3a66, residues 216–307; D. hnRNP G-rich IDR: hnRNPA1, residues 200–285. Interpretation of the plots: A is predicted to contain short regions of order in regions of disorder, B and C are predicted to be almost completely unfolded in isolation and D is largely insoluble. A, B and C contain regions predicted to be binding. In the case of the RS region, this encompassed almost its entire length. (TIF)

Table S1 Proteins of the human spliceosomes divided into groups. (XLSX)

Table S2 Compositionally biased regions of spliceosome proteins. (XLSX)

Table S3 Candidate ULMs, Btz and PRP4 regions in spliceosomal proteins. (XLSX)

Table S4 PFAM domains that map to disordered regions in human spliceosomal proteins. (XLSX)

Table S5 Conserved ordered regions in the core of the human spliceosome. (XLSX)

Acknowledgments
We thank Łukasz Kozłowski for help with his software, Adam Godzik and Christian Zmasek for the list of LECA domains, Ben Blencowe and Christos Ouzonis for help with RS domains. IK thanks Peter Tompa for the kind gift of his book on protein disorder. We thank Reinhard Lührmann, Elżbieta Purta, Anna Czerwoniec, Łukasz Kozłowski, Joanna Kasprzak, and Marcin Magnus for critical reading of the manuscript, useful comments and suggestions.

Author Contributions
Conceived and designed the experiments: IK JMB. Performed the experiments: IK. Analyzed the data: IK JMB. Contributed reagents/materials/analysis tools: IK JMB. Wrote the paper: IK JMB.

References
1. Veretnik S, Wills C, Youkharibache P, Valas RE, Bourne PE (2009) Sm/Lsm genes provide a glimpse into the early evolution of the spliceosome. PLoS Comput Biol 5: e1000315.
2. Kambach C, Walke S, Young R, Avis JM, de la Fortelle E, et al. (1999) Crystal structures of two Sm protein complexes and their implications for the assembly of the spliceosomal snRNPs. Cell 96: 375–387.
3. Valadkhan S, Jaladat Y (2010) The spliceosomal proteome: at the heart of the largest cellular ribonucleoprotein machine. Proteomics 10: 4128–4141.
4. Agafonov DE, Deckert J, Wolf E, Odenwälder P, Bessonov S, et al. (2011) Semiquantitative proteomic analysis of the human spliceosome via a novel two-dimensional gel electrophoresis method. Mol Cell Biol 31: 2667–2682.
5. Zhou Z, Licklider LJ, Gygi SP, Reed R (2002) Comprehensive proteomic analysis of the human spliceosome. Nature 419: 182–185.
6. Jurica MS, Moore MJ (2003) Pre-mRNA splicing: awash in a sea of proteins. Mol Cell 12: 5–14.
7. Bessonov S, Anokhina M, Krasauskas A, Golas MM, Sander B, et al. (2010) Characterization of purified human Bact spliceosomal complexes reveals compositional and morphological changes during spliceosome activation and first step catalysis. RNA 16: 2384–2403.
8. McKay SL, Johnson TL (2010) A bird's-eye view of post-translational modifications in the spliceosome and their roles in spliceosome dynamics. Mol Biosyst 6: 2093–2102.
9. Tarn WY, Steitz JA (1996) A novel spliceosome containing U11, U12, and U5 snRNPs excises a minor class (AT-AC) intron in vitro. Cell 84: 801–811.
10. Will CL, Schneider C, Hossbach M, Urlaub H, Rauhut R, et al. (2004) The human 18S U11/U12 snRNP contains a set of novel proteins not found in the U2-dependent spliceosome. RNA 10: 929–941.
11. Wahl MC, Will CL, Lührmann R (2009) The spliceosome: design principles of a dynamic RNP machine. Cell 136: 701–718.
12. Will CL, Lührmann R (2005) Splicing of a rare class of introns by the U12-dependent spliceosome. Biol Chem 386: 713–724.
13. Xie H, Vucetic S, Iakoucheva LM, Oldfield CJ, Dunker AK, et al. (2007) Functional anthology of intrinsic disorder. 1. Biological processes and functions of proteins with long disordered regions. J Proteome Res 6: 1882–1898.
14. Tompa P (2009) Structure and Function of Intrinsically Disordered Proteins. Chapman & Hall.
15. Puntervoll P, Linding R, Gemünd C, Chabanis-Davidson S, Mattingsdal M, et al. (2003) ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 31: 3625–3630.
16. Tompa P, Fuxreiter M, Oldfield CJ, Simon I, Dunker AK, et al. (2009) Close encounters of the third kind: disordered domains and the interactions of proteins. Bioessays 31: 328–335.
17. Zhang Y, Stec B, Godzik A (2007) Between order and disorder in protein structures: analysis of dual personality fragments in proteins. Structure 15: 1141–1147.
18. Dunker AK (2007) Another window into disordered protein function. Structure 15: 1026–1028.
19. Radivojac P, Iakoucheva LM, Oldfield CJ, Obradovic Z, Uversky VN, et al. (2007) Intrinsic disorder and functional proteomics. Biophys J 92: 1439–1456.
20. Hegyi H, Schad E, Tompa P (2007) Structural disorder promotes assembly of protein complexes. BMC Struct Biol 7: 65.
21. Helgstrand M, Rak AV, Allard P, Davydova N, Garber MB, et al. (1999) Solution structure of the ribosomal protein S19 from Thermus thermophilus. J Mol Biol 292: 1071–1081.
22. Wimberly BT, Brodersen DE, Clemons WM Jr, Morgan-Warren RJ, Carter AP, et al. (2000) Structure of the 30S ribosomal subunit. Nature 407: 327–339.
23. Ban N, Nissen P, Hansen J, Moore PB, Steitz TA (2000) The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science 289: 905–920.
24. Kozlowski LP, Bujnicki JM (2012) MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinformatics 13: 111.
25. Kurowski MA, Bujnicki JM (2003) GeneSilico protein structure prediction metaserver. Nucleic Acids Res 31: 3305–3307.
26. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337: 635–645.
27. Dziembowski A, Ventura AP, Rutz B, Caspary F, Faux C, et al. (2004) Proteomic analysis identifies a new complex required for nuclear pre-mRNA retention and splicing. Embo J 23: 4847–4856.
28. Leung AK, Nagai K, Li J (2011) Structure of the spliceosomal U4 snRNP core domain and its implication for snRNP biogenesis. Nature 473: 536–539.
29. Pomeranz Krummel DA, Oubridge C, Leung AK, Li J, Nagai K (2009) Crystal structure of human spliceosomal U1 snRNP at 5.5 Å resolution. Nature 458: 475–480.
30. Mathew R, Hartmuth K, Möhlmann S, Urlaub H, Ficner R, et al. (2008) Phosphorylation of human PRP28 by SRPK2 is required for integration of the U4/U6-U5 tri-snRNP into the spliceosome. Nat Struct Mol Biol 15: 435–443.
31. Blencowe BJ, Baurén G, Eldridge AG, Issner R, Nickerson JA, et al. (2000) The SRm160/300 splicing coactivator subunits. RNA 6: 111–120.
32. Magrane M, Consortium U (2011) UniProt Knowledgebase: a hub of integrated protein data. Database 2011: bar009.
33. Hwang CS, Shemorry A, Varshavsky A (2010) N-terminal acetylation of cellular proteins creates specific degradation signals. Science 327: 973–977.
34. Liu Q, Dreyfuss G (1995) In vivo and in vitro arginine methylation of RNA-binding proteins. Mol Cell Biol 15: 2800–2808.
35. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28: 235–242.
36. Finn RD, Mistry J, Tate J, Coggill P, Heger A, et al. (2010) The Pfam protein families database. Nucleic Acids Res 38: D211–222.
37. Vacic V, Oldfield CJ, Mohan A, Radivojac P, Cortese MS, et al. (2007) Characterization of molecular recognition features, MoRFs, and their binding partners. J Proteome Res 6: 2351–2366.
38. Kielkopf CL, Rodionova NA, Green MR, Burley SK (2001) A novel peptide recognition mode revealed by the X-ray structure of a core U2AF35/U2AF65 heterodimer. Cell 106: 595–605.
39. Kielkopf CL, Lucke S, Green MR (2004) U2AF homology motifs: protein recognition in the RRM world. Genes Dev 18: 1513–1526.
40. Sarras H, Alizadeh Azami S, McPherson JP (2010) In search of a function for BCLAF1. ScientificWorldJournal 10: 1450–1461.
41. Collins L, Penny D (2005) Complex spliceosomal organization ancestral to extant eukaryotes. Mol Biol Evol 22: 1053–1066.
42. Zmasek CM, Godzik A (2011) Strong functional patterns in the evolution of eukaryotic genomes revealed by the reconstruction of ancestral protein domain repertoires. Genome Biol 12: R4.
43. Staley JP, Woolford JL Jr (2009) Assembly of ribosomes and spliceosomes: complex ribonucleoprotein machines. Curr Opin Cell Biol 21: 109–118.
44. Dlakic M, Mushegian A (2011) Prp8, the pivotal protein of the spliceosomal catalytic center, evolved from a retroelement-encoded reverse transcriptase. RNA 17: 799–808.
45. Michel F, Costa M, Westhof E (2009) The ribozyme core of group II introns: a structure in want of partners. Trends Biochem Sci 34: 189–199.
46. Moriya H, Kasai H, Isono K (1995) Cloning and characterization of the hrpA gene in the terC region of Escherichia coli that is highly similar to the DEAH family RNA helicase genes of Saccharomyces cerevisiae. Nucleic Acids Res 23: 595–598.
47. Dunker AK, Obradovic Z, Romero P, Garner EC, Brown CJ (2000) Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform 11: 161–171.
48. Mulder FA, Bouakaz L, Lundell A, Venkataramana M, Liljas A, et al. (2004) Conformation and dynamics of ribosomal stalk protein L12 in solution and on the ribosome. Biochemistry 43: 5930–5936.
49. Brodersen DE, Nissen P (2005) The social life of ribosomal proteins. FEBS J 272: 2098–2108.
50. Valadkhan S (2007) The spliceosome: caught in a web of shifting interactions.
Curr Opin Struct Biol 17: 310315. 51. Bellare P, Small EC, Huang X, Wohlschlegel JA, Staley JP, et al. (2008) A role for ubiquitin in the spliceosome assembly pathway. Nat Struct Mol Biol 15: 444 451. 52. Korneta I, Magnus M, Bujnicki JM (2012) Structural bioinformatics of the human spliceosomal proteome. Nucleic Acids Res. E-pub ahead of print. doi: 10.1093/nar/gks347

PLoS Computational Biology | www.ploscompbiol.org

14

August 2012 | Volume 8 | Issue 8 | e1002641

Disorder in the Spliceosomal Proteome

53. Stark H, Luhrmann R (2006) Cryo-electron microscopy of spliceosomal components. Annu Rev Biophys Biomol Struct 35: 435457. 54. Quevillon-Cheruel S, Leulliot N, Gentils L, van Tilbeurgh H, Poupon A (2007) Production and crystallization of protein domains: how useful are disorder predictions ? Curr Protein Pept Sci 8: 151160. 55. Bernado P, Svergun DI (2012) Structural analysis of intrinsically disordered proteins by small-angle X-ray scattering. Mol Biosyst 8: 151167. 56. Bernado P, Mylonas E, Petoukhov MV, Blackledge M, Svergun DI (2007) Structural characterization of flexible proteins using small-angle X-ray scattering. J Am Chem Soc 129: 56565664. 57. Makarov EM, Makarova OV, Urlaub H, Gentzel M, Will CL, et al. (2002) Small nuclear ribonucleoprotein remodeling during catalytic activation of the spliceosome. Science 298: 22052208. 58. Behzadnia N, Golas MM, Hartmuth K, Sander B, Kastner B, et al. (2007) Composition and three-dimensional EM structure of double affinity-purified, human prespliceosomal A complexes. EMBO J 26: 17371748. 59. Deckert J, Hartmuth K, Boehringer D, Behzadnia N, Will CL, et al. (2006) Protein composition and electron microscopy structure of affinity-purified human spliceosomal B complexes isolated under physiological conditions. Mol Cell Biol 26: 55285543. 60. Bessonov S, Anokhina M, Will CL, Urlaub H, Luhrmann R (2008) Isolation of an active step I spliceosome and composition of its RNP core. Nature 452: 846 850. 61. Fabrizio P, Dannenberg J, Dube P, Kastner B, Stark H, et al. (2009) The evolutionarily conserved core design of the catalytic activation step of the yeast spliceosome. Mol Cell 36: 593608. 62. Dosztanyi Z, Meszaros B, Simon I (2009) ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics 25: 27452746. 63. King JL, Jukes TH (1969) Non-Darwinian evolution. Science 164: 788798. 64. Dyer KF (1971) The quiet revolution: A new synthesis of biological knowledge. J Biol Edu 5: 1524. 65. 
Haynes C, Iakoucheva LM (2006) Serine/arginine-rich splicing factors belong to a class of intrinsically disordered proteins. Nucleic Acids Res 34: 305312. 66. Long JC, Caceres JF (2009) The SR protein family of splicing factors: master regulators of gene expression. Biochem J 417: 1527. 67. Calarco JA, Superina S, OHanlon D, Gabut M, Raj B, et al. (2009) Regulation of vertebrate nervous system alternative splicing and development by an SRrelated protein. Cell 138: 898910. 68. Roscigno RF, Garcia-Blanco MA (1995) SR proteins escort the U4/U6.U5 trisnRNP to the spliceosome. RNA 1: 692706. 69. Xiao SH, Manley JL (1997) Phosphorylation of the ASF/SF2 RS domain affects both protein-protein and protein-RNA interactions and is necessary for splicing. Genes Dev 11: 334344. 70. Cubellis MV, Caillez F, Blundell TL, Lovell SC (2005) Properties of polyproline II, a secondary structure element implicated in protein-protein interactions. Proteins 58: 880892. 71. Kofler M, Schuemann M, Merz C, Kosslick D, Schlundt A, et al. (2009) Prolinerich sequence recognition: I. Marking GYF and WW domain assembly sites in early spliceosomal complexes. Mol Cell Proteomics 8: 24612473. 72. Steinert PM, Mack JW, Korge BP, Gan SQ, Haynes SR, et al. (1991) Glycine loops in proteins: their occurrence in certain intermediate filament chains, loricrins and single-stranded RNA binding proteins. Int J Biol Macromol 13: 130139. 73. Bedford MT, Richard S (2005) Arginine methylation an emerging regulator of protein function. Mol Cell 18: 263272.

74. Han SP, Tang YH, Smith R (2010) Functional diversity of the hnRNPs: past, present and perspectives. Biochem J 430: 379392. 75. Sinha R, Allemand E, Zhang Z, Karni R, Myers MP, et al. (2010) Arginine methylation controls the subcellular localization and functions of the oncoprotein splicing factor SF2/ASF. Mol Cell Biol 30: 27622774. 76. Chen YC, Milliman EJ, Goulet I, Cote J, Jackson CA, et al. (2010) Protein arginine methylation facilitates cotranscriptional recruitment of pre-mRNA splicing factors. Mol Cell Biol 30: 52455256. 77. Cartegni L, Maconi M, Morandi E, Cobianchi F, Riva S, et al. (1996) hnRNP A1 selectively interacts through its Gly-rich domain with different RNA-binding proteins. J Mol Biol 259: 337348. 78. Buvoli M, Cobianchi F, Riva S (1992) Interaction of hnRNP A1 with snRNPs and pre-mRNAs: evidence for a possible role of A1 RNA annealing activity in the first steps of spliceosome assembly. Nucleic Acids Res 20: 50175025. 79. Del Gatto-Konczak F, Olive M, Gesnel MC, Breathnach R (1999) hnRNP A1 recruited to an exon in vivo can function as an exon splicing silencer. Mol Cell Biol 19: 251260. 80. Brahms H, Meheus L, de Brabandere V, Fischer U, Luhrmann R (2001) Symmetrical dimethylation of arginine residues in spliceosomal Sm protein B/B and the Sm-like protein LSm4, and their interaction with the SMN protein. RNA 7: 15311542. 81. Friesen WJ, Massenet S, Paushkin S, Wyce A, Dreyfuss G (2001) SMN, the product of the spinal muscular atrophy gene, binds preferentially to dimethylarginine-containing protein targets. Mol Cell 7: 11111117. 82. Thickman KR, Swenson MC, Kabogo JM, Gryczynski Z, Kielkopf CL (2006) Multiple U2AF65 binding sites within SF3b155: thermodynamic and spectroscopic characterization of protein-protein interactions among pre-mRNA splicing factors. J Mol Biol 356: 664683. 83. Soding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21: 951960. 84. 
Nakao A, Yoshihama M, Kenmochi N (2004) RPG: the Ribosomal Protein Gene database. Nucleic Acids Res 32: D168170. 85. Ben-Shem A, Jenner L, Yusupova G, Yusupov M (2010) Crystal structure of the eukaryotic ribosome. Science 330: 12031209. 86. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, et al. (2004) UCSF Chimeraa visualization system for exploratory research and analysis. J Comput Chem 25: 16051612. 87. Corsini L, Bonnal S, Basquin J, Hothorn M, Scheffzek K, et al. (2007) U2AFhomology motif interactions are required for alternative splicing regulation by SPF45. Nat Struct Mol Biol 14: 620629. 88. Selenko P, Gregorovic G, Sprangers R, Stier G, Rhani Z, et al. (2003) Structural basis for the molecular recognition between human splicing factors U2AF65 and SF1/mBBP. Mol Cell 11: 965976. 89. Schellenberg MJ, Edwards RA, Ritchie DB, Kent OA, Golas MM, et al. (2006) Crystal structure of a core spliceosomal protein interface. Proc Natl Acad Sci U S A 103: 12661271. 90. Kuwasako K, He F, Inoue M, Tanaka A, Sugano S, et al. (2006) Solution structures of the SURP domains and the subunit-assembly mechanism within the splicing factor SF3a complex in 17S U2 snRNP. Structure 14: 16771689. 91. Reidt U, Wahl MC, Fasshauer D, Horowitz DS, Luhrmann R, et al. (2003) Crystal structure of a complex between human spliceosomal cyclophilin H and a U4/U6 snRNP-60K peptide. J Mol Biol 331: 4556. 92. Bono F, Ebert J, Lorentzen E, Conti E (2006) The crystal structure of the exon junction complex reveals how it maintains a stable grip on mRNA. Cell 126: 713725.

PLoS Computational Biology | www.ploscompbiol.org

15

August 2012 | Volume 8 | Issue 8 | e1002641

Declarations

Summary in English

Introduction: The spliceosome

The spliceosome is a large molecular machine that carries out splicing in eukaryotic cells: the removal of introns (noncoding sequences) from the precursor mRNA (pre-mRNA) and the joining of its exons (coding sequences). The human spliceosome comprises 45 different proteins bound in protein-RNA complexes called the subunits of the spliceosome, approximately 70-80 additional proteins that are abundant in the spliceosome, and over 100 additional non-abundant proteins. Non-subunit spliceosomal proteins may be essential to its operation, may participate in its function only in specific instances, or may mediate between splicing and other mRNA processing pathways. They may be components of stable protein complexes or function as independent splicing factors. The protein complexes functionally associated with the spliceosome include the hPrp19/CDC5L complex as well as the EJC, CBP, TREX and RES complexes.

Introduction: Research project

My research project focused on the structural analysis and modeling of 252 human spliceosomal proteins, including all proteins of the spliceosome subunits and all abundant non-subunit proteins. The project was initiated by Professor Janusz M. Bujnicki, who is also the supervisor of this dissertation. The work was supported by the EU 6th Framework Programme Network of Excellence EURASNET (grant number LSHG-CT-2005-518238). Computing power was provided in part by the Interdisciplinary Centre for Mathematical and Computational Modeling of the University of Warsaw (grant number G27-4).

Although an exhaustive structural representation of the protein part of the human spliceosome has value of its own, the project was largely motivated by the vision of creating a structural model of the entire spliceosome. When the project started, no high-resolution experimental structures were available for larger regions of the spliceosome. By combining structural models of individual fragments of the complex with the results of experimental analyses such as mass spectrometry and electron cryomicroscopy, it would be possible to obtain a structural model of the spliceosome (which could later, in turn, aid experimental work). Work on a structural model of the entire spliceosome is being continued by other members of Professor Bujnicki's research team.

First stage of the project

The first task I performed in the project was to systematically analyze the structured regions of the proteins of the human spliceosome, to review the existing high-resolution experimental structures of the human splicing proteins, and to construct high-resolution models for protein regions without experimental representation. In the 252 proteins, I detected 465 autonomous ordered structural domains that can be assigned to known classes of structural domains. Furthermore, I discovered 25 ordered regions that could not be attributed to known classes, but whose properties (such as coherence, length, and the prediction of structural order and secondary structure elements) indicate that they might constitute autonomous domains.

Among the domains classified into a known type, several, including domains of proteins of the spliceosome subunits and of other abundant proteins, had not been characterized prior to this work (e.g. PWI-type domains found in proteins hBrr2, hPrp22 and hPrp2). Through a systematic characterization of ubiquitin-related domains found in the human splicing factors, I concluded that these domains are common among the human spliceosomal proteins and are specific mainly to proteins found at the later stages of the splicing process.


On the basis of the available experimental structures, I created a standardized library of 104 unique non-overlapping structures. Taken together, these structures cover 20.6% of the total protein sequence predicted to be ordered (14.3% of the total protein sequence). In addition, I constructed 255 comparative models and 43 de novo models, which together cover three times as much of the total protein sequence. Overall, the available experimental structures and the comparative and de novo models I constructed cover more than 90% of the total length of the predicted ordered protein sequence (48.7% of the total protein sequence). For the majority of disordered regions and the remaining ordered regions, I constructed pro forma structures that will enable subsequent structural analysis of these fragments.

Domain detection and model construction were for me the hardest part of the project, for several reasons. First, this work was time-consuming and took up the lion's share of the project's time. Second, it required more tenacity than genuine intellectual creativity. Finally, the ultimate determinant of the value of the models, namely the creation of a model of the entire spliceosome (or, at the minimum, of sufficiently large parts of it), lay outside the scope of my part of the project. Taken together, these three circumstances vastly decreased my motivation at this stage of the project and made it very hard for me to finish it. Nevertheless, there were some exciting moments, the most satisfying of which was, of course, the discovery of new structural domains in some of the most important proteins of the spliceosome, some of which had been analyzed multiple times before (e.g. hBrr2). It is extremely satisfying to find something that others before you have missed.
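Coverage figures of the kind quoted above relate the residues covered by structures or models to a chosen denominator, either the length predicted to be ordered or the total sequence length. The sketch below illustrates that arithmetic under the assumption that each structure is recorded as an inclusive residue range; the function names and the toy numbers are hypothetical and do not reproduce the thesis data or pipeline.

```python
def covered_length(intervals):
    """Count residues covered by a list of (start, end) ranges.

    Ranges are 1-based and inclusive. Overlapping or adjacent ranges are
    merged first, so no residue is counted twice.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged range (if any) and open a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Extend the current merged range.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def coverage_percent(intervals, reference_length):
    """Covered residues as a percentage of a reference length."""
    return 100.0 * covered_length(intervals) / reference_length


# Hypothetical protein: 500 residues, 300 of them predicted to be ordered.
structures = [(10, 109), (100, 159), (300, 349)]
print(covered_length(structures))                   # residues covered
print(round(coverage_percent(structures, 300), 1))  # % of ordered length
print(round(coverage_percent(structures, 500), 1))  # % of total length
```

The same set of covered residues yields a higher percentage against the ordered length than against the total length, which is why such figures are usually reported in pairs.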

Second stage of the project

The second stage of the project consisted of an analysis of intrinsic structural disorder in the spliceosomal proteins. Intrinsic protein disorder is defined as the lack of a stable tertiary structure in a given region of a protein when it is in solution and in isolation, although secondary structure elements may still form and the region may acquire structure under certain conditions (for example, when the protein is bound in a complex). The analysis of intrinsic disorder was not a part of the original project. It became necessary only after an initial analysis, when I realized that over a third of the total length of the proteins of the subunits of the human spliceosome, and more than half of the total length of all human splicing proteins, was predicted to be intrinsically disordered.

Structural disorder in the proteins of the spliceosome had not been systematically examined prior to my analysis. Hence, the first step was to collect as much previously published information as possible on the various putative functional forms of structural disorder in splicing proteins. Only after that could a systematic analysis of the human splicing proteins themselves follow. As a result of this analysis, I discovered that the proteins of the spliceosome subunits, as well as abundant and non-abundant proteins specific to different stages of the splicing process, differ in the content and type of structural disorder. Proteins specific to the initial stages of the reaction, whose role is to create a network of weak contacts between the subunits of the spliceosome and pre-mRNA, as well as between essential and instance-specific splicing factors, contain a significant amount of structural disorder that has no predicted secondary structure elements but exhibits one of several characteristic types of amino acid compositional bias. In contrast, proteins responsible for the dynamics of splicing and for the interaction of the spliceosome subunits with one another contain a significant amount of structural disorder with predicted elements of secondary structure.

During this stage of the project, I also performed several additional analyses, focusing on: the correlation of the sites of post-translational modifications in human spliceosomal proteins with regions of structural order or disorder; proteins with extremely high disorder content (>75%); and the comparative evolutionary history of conserved ordered and disordered regions. I also compared intrinsic disorder in the proteins of the human spliceosome with intrinsic disorder in the proteins of the human and bacterial (Escherichia coli) ribosomes.
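The notion of disorder content used above can be made concrete as the fraction of residues predicted to be disordered. The following is a minimal sketch, assuming per-residue predictions are available as a string of 'D' (disordered) and 'O' (ordered) calls. This encoding, the function names and the example proteins are illustrative assumptions, not the output format of any particular predictor used in the thesis.

```python
def disorder_fraction(calls):
    """Fraction of residues marked 'D' in a per-residue call string."""
    if not calls:
        raise ValueError("empty prediction string")
    return calls.count("D") / len(calls)


def highly_disordered(predictions, threshold=0.75):
    """Names of proteins whose disorder fraction exceeds the threshold."""
    return sorted(
        name
        for name, calls in predictions.items()
        if disorder_fraction(calls) > threshold
    )


# Hypothetical proteins, each 100 residues long.
predictions = {
    "protA": "D" * 80 + "O" * 20,  # 80% disordered
    "protB": "D" * 75 + "O" * 25,  # exactly 75%, not above the threshold
    "protC": "O" * 100,            # fully ordered
}
print(highly_disordered(predictions))
```

The comparison is strict, so a protein with exactly 75% predicted disorder is not reported; whether the boundary should be inclusive is a choice this sketch leaves open.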


This stage of the project was much more interesting for me than the first, because by combining the results of the various analyses I was ultimately able to formulate a single coherent model (conceptual, not structural) for a phenomenon that had not been described at all up to that point: the presence and function of intrinsic structural disorder in the human spliceosome. My model assumes the existence of three layers: a hard, ordered, potentially evolutionarily ancient core comprising the ordered domains of proteins that directly assist the process of splicing performed by the RNA of the spliceosome; a plastic mantle containing a large amount of intrinsic disorder with secondary structure elements, which can acquire or lose structure depending on circumstances and which is responsible for the control of spliceosome dynamics (putatively similar in this respect to the ubiquitin-related domains mentioned earlier); and a relatively loose external atmosphere composed of intrinsic disorder without predicted secondary structure and of small ordered domains that bind RNA and disordered protein regions, active mainly in the early stages of the splicing process, that is, partner recognition and definition. These three layers reflect the heterogeneity and complexity of the splicing reaction in humans. My model can be used as a framework for further research into intrinsic disorder in the spliceosome. At the same time, the specific results concerning particular protein regions that I presented together with the general model can be verified experimentally (and, if necessary, used to correct the model).

Third stage of the project

In the third stage of the project, I compared the complement of proteins and protein domains found in the human spliceosomal proteome with the known complement of proteins and protein domains found in the spliceosomal proteome of the diplomonad Giardia lamblia. This species is characterized by genomic minimalism, including a minimal number of introns in its genome. By comparing the G. lamblia spliceosomal proteome with the human one, I was able to determine that the G. lamblia spliceosomal proteome is missing most of the ubiquitin-related proteins and/or domains found in the human spliceosomal proteome, as well as the majority of the structural disorder predicted to possess an independent function. On the other hand, the G. lamblia proteome contains the majority of the conserved domains of the hard core that directly assist the RNA during splicing catalysis and that had probably been adapted into the spliceosome from pre-existing systems.

I find this analysis to be the most interesting part of my project: although very short, it produced results confirming the existence of one coherent functionality of the spliceosomal machine based on ubiquitin-related domains and another based on intrinsically disordered regions of proteins. The result of this analysis may help set priorities in the modeling of the much more complicated human spliceosome, because regions common to both the human and the G. lamblia proteomes should probably be prioritized in the modeling process.
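At its simplest, a comparison of this kind reduces to set operations on two inventories of domain families. The sketch below assumes each proteome has already been reduced to a set of domain-family labels; the labels are placeholders rather than actual human or G. lamblia data, and the function name is hypothetical.

```python
def compare_complements(human, giardia):
    """Split two sets of domain-family labels into shared and unique parts."""
    return {
        "shared": sorted(human & giardia),
        "human_only": sorted(human - giardia),
        "giardia_only": sorted(giardia - human),
    }


# Placeholder labels standing in for real domain families.
human_domains = {"family_A", "family_B", "family_C", "family_D"}
giardia_domains = {"family_A", "family_B", "family_E"}

result = compare_complements(human_domains, giardia_domains)
for category, families in result.items():
    print(category, families)
```

Under the prioritization argument made above, the "shared" category would be the natural first target for modeling.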

Publication of data

The structural models I have created (except for the pro forma structures) are of sufficient quality for use in further research on the spliceosome, including the possibility of combining them with the results of electron cryomicroscopy analyses of the spliceosome subunits in order to understand the structure of the complex. The catalogue of structures and models is available at http://iimcb.genesilico.pl/SpliProt3D. The website was developed by Marcin Magnus, M.Sc. Although I did not participate in the programming of the website, I was one of its designers.

From my point of view, an interesting challenge at this stage was the need to create a clear visualization that combines the sequence alignment of homologs of a human protein with a description of some of the properties of that protein, such as the predicted intrinsic disorder and secondary structure or the positions of known or predicted sites of post-translational modifications. Pre-existing tools that aggregate various types of data on protein properties (e.g. the GeneSilico metaserver) are powerful, but are usually designed to take in as much data as possible at the expense of the esthetics of the message, and so are not well suited to data visualization. However, both the website and the publication of the results required data from alignments and predictions to be integrated in a compact form. It is my belief that the final result of my work, which I obtained using the Jalview program, meets this challenge well.

Publication of results

The results of the project were published in two articles, "Structural Bioinformatics of the Human Spliceosomal Proteome" (Korneta I., Magnus M., Bujnicki JM., 2012, doi: 10.1093/nar/gks347, PMID: 22573172) and "Intrinsic Disorder in the Human Spliceosomal Proteome" (Korneta I., Bujnicki JM., 2012, doi: 10.1371/journal.pcbi.1002641, PMID: 22912569), which comprise the dissertation.

