Neural networks and the prediction of protein structure
Rita Casadio, Interdepartmental Centre for Biotechnological Research, University of Bologna, Italy
The "omic" era: complete genomes
Archaea: 16 species / 33 in progress; Bacteria: 83 species; Eukaryotes: 17 species (242 chromosomes)
www.ncbi.nlm.nih.gov
Draft of the human genome
Nature (2/15/01) Human Genome Issue: http://www.ncbi.nlm.nih.gov/genome/guide/human http://www.ensembl.org/
Science (2/16/01) Human Genome Issue: http://public.celera.com/index.cfm
From sequence to function: functional genomics, proteomics and interactomics
Genes -> Protein sequences -> Protein structures -> Function
> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
BASIC PRINCIPLES OF PROTEIN STRUCTURE
Levels of structural organisation: primary, secondary, tertiary, quaternary
BASIC PRINCIPLES OF PROTEIN STRUCTURE
Elements of secondary structure: β-sheet and α-helix (chain termini labelled N and C)
Predicting protein folding
The folding process: from the chain to the native protein
Folding kinetics: the initiation sites
Databases of biological sequences and structures
NCBI: 18,197,119 sequences; 22,616,937,182 nucleotides
Swiss-Prot: 113,470 sequences; 41,413,223 residues
PDB: 17,510 structures
August 2002
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
About 1500 chains of known tertiary structure can be extracted from the PDB, in order to derive non-redundant information on the relationship between sequence and:
Secondary structure
Structural and functional motifs
Tertiary (3D) structure
Protein folding
TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN
Features of structure prediction for protein sequences
A large set of data for which the solution of the problem is known
An analytical solution of the problem is difficult (impossible) to formulate
The databases are updated continuously (large volume of data, need to operate in real time)
General non-linear functional mapping
X = (x1, x2, ..., xn) in X space -> Y = (y1, y2, ..., yn) in Y space
Tools derived from machine learning: neural networks
Training: data set from the database, known mapping -> general rules
Prediction: new sequence -> general rules -> prediction
The input window
The properties of residue R depend on both local interactions (window W) and non-local interactions (context C)
Diagram: window W, centred on residue R within context C, is the input of the neural network, which has two outputs, O(α) and O(non-α)
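The window idea can be illustrated with a minimal sketch. It assumes a one-hot code over the 20 standard amino acids and zero padding at the chain termini; neither choice is stated on the slide, and the function name `encode_windows` is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues


def encode_windows(sequence, window=7):
    """Return one input vector per residue R, built from the window W centred on R.

    Each window position is one-hot encoded (20 values). Positions that fall
    outside the chain, or residues outside the alphabet, stay all-zero
    (an assumption of this sketch).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    vectors = []
    for centre in range(len(sequence)):
        vector = []
        for pos in range(centre - half, centre + half + 1):
            one_hot = [0.0] * len(AMINO_ACIDS)
            if 0 <= pos < len(sequence):
                idx = AMINO_ACIDS.find(sequence[pos])
                if idx >= 0:
                    one_hot[idx] = 1.0
            vector.extend(one_hot)
        vectors.append(vector)
    return vectors


vectors = encode_windows("TTCCPSIVARS", window=5)
print(len(vectors), len(vectors[0]))  # one vector per residue, 5 * 20 values each
```

Each vector would then be presented to the network together with the known class of the central residue. The window width is one of the tunable parameters of the method.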
Input based on evolutionary information
Multiple Sequence Alignment (MSA): aligned sequences, one per row; columns are positions along the sequence; the input window spans a set of adjacent columns
 1 MVKGPGLYTDIGKKARDLLYKDYHS--DKKFTISTYSPTGVAITSSGTKKGEL--FLGDV
 2 MAKGPGLYTDIGKKARDLLYRDYQT--DQKFSITTYSPTGVAITSSGTKKGDL--FLADV
 3 MVKGPGLYSDIGKRARDLLYRDYQS--DHKFTLTTYTANGVAITSTGTKKGEL--FLADV
 4 MVKGPGLYSDIGKKARDLLYRDYVS--DHKFTVTTYSTTGVAITASGLKKGEL--FLADV
 5 MVKGPGLYTEIGKKARDLLYRDYQG--DQKFSVTTYSSTGVAITTTGTNKGSL--FLGDV
 6 MVVAVGLYTDIGKKTRDLLYKDYNT--HQKFCLTTSSPNGVAITAAGTRKNES--IFGEL
 7 -MGGPGLYSGIGKKAKDLLYRDYQT--DHKFTLTTYTANGPAITATSTKKADL--TVGEI
 8 AVVRPYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVTGSLEI
 9 --AVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVTGSL
10 -MAVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVNGSL
11 --AVPPSYADLGKSARDIFNKGYGFG-LVKLDVKTKSATGVEFTTSGTSNTDSGKVNGSL
12 --MAPPSYSDLGKQARDIFSKGYNFG-LWKLDLKTKTSSGIEFNTAGHSNQESGKVFGSL
13 --MAVPAFSDIAKSANDLLNKDFYHLAAGTIEVKSNTPNNVAFKVTGKSTHDK-VTSGAL
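One common way to turn an alignment into network input is a per-column residue frequency profile. The sketch below assumes that gaps are ignored and that frequencies are unweighted; the slide does not specify the encoding actually used, and `alignment_profile` is a hypothetical name.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues


def alignment_profile(aligned_sequences):
    """Return, for every alignment column, the frequency of each amino acid.

    Gaps ('-') and non-standard symbols are skipped (an assumption of this
    sketch); a column made only of gaps yields an all-zero vector.
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in range(length):
        counts = {aa: 0 for aa in AMINO_ACIDS}
        total = 0
        for seq in aligned_sequences:
            residue = seq[column]
            if residue in counts:
                counts[residue] += 1
                total += 1
        profile.append([counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS])
    return profile


# First four columns of rows 1, 2 and 7 of the alignment shown on the slide.
profile = alignment_profile(["MVKG", "MAKG", "-MGG"])
print(profile[0][AMINO_ACIDS.index("M")])  # frequency of M in the first column
```

With this encoding, each window position contributes 20 frequencies instead of a single one-hot vector, so conserved and variable positions look different to the network.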
Artificial Neural Networks
Single-layer perceptron: inputs $x_0$ (bias), $x_1, \dots, x_d$; outputs $z_1, \dots, z_m$
$a = \sum_{i=0}^{d} w_i x_i$, $z = g(a)$
The error function: $Y_i(X_q)$ = output of the network, $D_{iq}$ = expected value
The training algorithm: back-propagation (gradient descent; Rumelhart et al. 1986)
Correction to the weights: $\mu$ = learning rate, $\eta$ = momentum term
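The training loop can be sketched for a single output unit. The assumptions here are a sigmoid for g, a sum-of-squares error E = 1/2 Σ_q (Y(X_q) − D_q)^2, and the usual momentum update Δw(t) = −μ ∂E/∂w + η Δw(t−1). The slide names the learning rate and the momentum term but does not show these formulas, so this is an illustration rather than the authors' exact setup.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def train_perceptron(samples, targets, learning_rate=1.0, momentum=0.5,
                     epochs=5000, seed=0):
    """Batch gradient descent with momentum for one sigmoid unit.

    weights[0] multiplies the constant bias input x_0 = 1.
    """
    rng = random.Random(seed)
    d = len(samples[0])
    weights = [rng.uniform(-0.1, 0.1) for _ in range(d + 1)]
    previous_step = [0.0] * (d + 1)
    for _ in range(epochs):
        gradient = [0.0] * (d + 1)
        for x, target in zip(samples, targets):
            inputs = [1.0] + list(x)
            a = sum(w * xi for w, xi in zip(weights, inputs))
            z = sigmoid(a)
            # dE/da for the assumed error E = 1/2 (z - target)^2
            delta = (z - target) * z * (1.0 - z)
            for i, xi in enumerate(inputs):
                gradient[i] += delta * xi
        for i in range(d + 1):
            step = -learning_rate * gradient[i] + momentum * previous_step[i]
            weights[i] += step
            previous_step[i] = step
    return weights


def predict(weights, x):
    inputs = [1.0] + list(x)
    return sigmoid(sum(w * xi for w, xi in zip(weights, inputs)))


# Toy problem (logical OR), used only to show the mechanics.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]
weights = train_perceptron(samples, targets)
print([round(predict(weights, x), 2) for x in samples])
```

Networks with hidden layers apply the same update to every layer, with the error term propagated backwards from the outputs.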
Tunable parameters of neural networks
The input code
The width of the sliding window
The architecture: the number of nodes (neurons) and of neuron layers
The learning rate
The neural networks in Bologna predict:
Protein secondary structure
Protein folding initiation sites
The topology of all-alpha and all-beta membrane proteins (ISMB BEST PAPER AWARD 2002)
The presence of signal peptides
The bonding state of cysteines and the topology of disulfide bridges
Protein contact maps (BEST PREDICTOR of the CATEGORY at CASP4)
Protein-protein interaction surfaces
www.biocomp.unibo.it
General scheme of the predictors available on our web site
Predictors based on neural networks
Towards the prediction of 3D structure: predicting contact maps
Predicting contacts between residues
Contacts in proteins (residues labelled in the figure: F 297, F 156, V 299, V 271, I 240, V 238, I 269)
Computation of contact maps
From the 3D structure (residues labelled in the figure: F 297, F 156, V 299, V 271, I 240, V 238, I 269) to the contact map
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
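Going from coordinates to a contact map amounts to thresholding pairwise distances. The sketch below assumes one representative atom per residue and an 8 Å cut-off; both are illustrative choices, since the slide does not state the contact definition used.

```python
import math


def contact_map(coordinates, threshold=8.0, min_separation=1):
    """Return a symmetric 0/1 matrix: 1 when two residues are in contact.

    coordinates: one (x, y, z) tuple per residue, in Å (assumed to be a
    single representative atom, e.g. C-alpha).
    threshold: distance cut-off in Å (assumed value).
    min_separation: minimum sequence separation for a pair to be considered.
    """
    n = len(coordinates)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coordinates[i], coordinates[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


# Four made-up points on a line, just to exercise the function.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(toy):
    print(row)
```

The map is symmetric, so only the upper triangle needs to be predicted and evaluated.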
MARC: 3-D modelling through contact maps
Bacteriorhodopsin: structure 1QHJ (1.9 Å) -> contact map -> MARC -> model; RMSD = 2.5 Å (N and C termini labelled in both structures)
Machine learning tools: prediction of the contact map
Neural networks learn the mapping from the sequence to the contact map
Training: data set from the database, known mapping -> general rules
Prediction: sequence TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN -> general rules -> predicted contact map
T0087: 310 residues, A = 20% (FR/NF); N and C termini labelled
T0110: 128 residues, A = 30% (NF); N and C termini labelled
Predictors based on neural networks
Towards the prediction of 3D structure: predicting disulfide bridges
Protein folding
RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA
Disulfide bonds between cysteines in proteins
2 -SH -> -SS- + 2H+ + 2e-
S-S distance: 2.2 Å
Torsion angle C-S-S-C: 90°
Bond energy: 3 kcal/mol
Neural networks for the prediction of the disulfide-bonding state of cysteines in proteins
Non bonding
 1 MVKGPGLYTDIGKKARDLLYKDYHS--DKKFTISTYSCTGVAITSSGTKKGEL--FLGDV
 2 SAKGPGLYTDIGKKARDLLYRDYQT--DQKFSITTYSCTGVAITSSGTKKGDL--FLADV
 3 MVKGPGLYSDIGKRARDLLYRDYQS--DHKFTLTTYTCNGVAITSTGTKKGEL--FLADV
 4 MVKGPGLYSDIGKKARDLLYRDYVS--DHKFTVTTYSCTGVAITASGLKKGEL--FLADV
 5 MVKGPGLYTEIGKKARDLLYRDYQG--DQKFSVTTYSCTGVAITTTGTNKGSL--FLGDV
 6 MVVAVGLYTDIGKKTRDLLYKDYNT--HQKFCLTTSSCNGVAITAAGTRKNES--IFGEL
 7 -MGGPGLYSGIGKKAKDLLYRDYQT--DHKFTLTTYTCNGPAITATSTKKADL--TVGEI
 8 AVVRPYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVTGSLEI
 9 --AVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSGNGLEFTSSGSANTETTKVTGSL
10 -MAVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSGNGLEFTSSGSANTETTKVNGSL
11 --AVPPSYADLGKSARDIFNKGYGFG-LVKLDVKTKSCTGVEFTTSGTSNTDSGKVNGSL
12 --MAPPSYSDLGKQARDIFSKGYNFG-LWKLDLKTKTCSGIEFNTAGHSNQESGKVFGSL
13 --MAVPAFSDIAKSANDLLNKDFYHLAAGTIEVKSNTCNNVAFKVTGKSTHDK-VTSGAL
Most probable path through the states
Windows W1, W2, W3 along the sequence:
MYSFPNSFRFGWSQAGFQCEMSTPGSEDPNTDWYKWVHDPENMAAGLCSGDLPENGPGYWGNYKTFHDNAQKMCLKIARLNVEWSRIFPNP...
Probabilities for each window: P(B|W1), P(F|W1); P(B|W2), P(F|W2); P(B|W3), P(F|W3)
States of the model: Begin, cysteine bonding states, cysteine free states, End
Prediction of the bonding and non-bonding states of all the cysteines of the sequence
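A most probable path of this kind is typically found with the Viterbi algorithm. The sketch below is generic: it uses only two states, B (bonding) and F (free), takes the per-window probabilities P(B|W) and P(F|W) as emission scores, and uses made-up transition probabilities. The actual state topology and parameters of the model on the slide are not given, so every number here is hypothetical.

```python
import math


def viterbi(emissions, transitions, start):
    """Return the most probable state sequence.

    emissions: one dict per cysteine, state -> P(state | window)
    transitions: dict (previous_state, next_state) -> probability
    start: dict state -> initial probability
    All probabilities must be strictly positive (logs are taken).
    """
    states = list(start)
    score = {s: math.log(start[s]) + math.log(emissions[0][s]) for s in states}
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointer = {}, {}
        for s in states:
            best_prev = max(
                states, key=lambda p: score[p] + math.log(transitions[(p, s)])
            )
            new_score[s] = (score[best_prev]
                            + math.log(transitions[(best_prev, s)])
                            + math.log(emission[s]))
            pointer[s] = best_prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


# Hypothetical network outputs for three cysteines (windows W1, W2, W3).
emissions = [{"B": 0.9, "F": 0.1}, {"B": 0.45, "F": 0.55}, {"B": 0.8, "F": 0.2}]
transitions = {("B", "B"): 0.7, ("B", "F"): 0.3, ("F", "B"): 0.3, ("F", "F"): 0.7}
start = {"B": 0.5, "F": 0.5}
print(viterbi(emissions, transitions, start))  # ['B', 'B', 'B']
```

In this toy case the second cysteine, taken alone, would be called free (0.55 against 0.45), but the path through the whole sequence assigns it to the bonding state. This is the kind of global correction that a path-based decoder adds on top of per-window predictions.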
The hybrid system
Accuracy per cysteine: 88%; per protein: 84%
Figure: correctly predicted proteins (%) and number of proteins versus number of cysteines per protein, for the NN-based and the HNN-based predictors
Protein Science, in press
Input:
VGDKLIPLKITYDYYVCNNHMDTDTSYERWPALGTYRPLNGRDCVMNNHKLAASDRWECDQREPLYTCMCNKDLPTKAAGPLMNTRPILNLSREEWLLPLLTHMNVVAGLCKLP
Output (www.prion.biocomp.unibo.it/cyspred.html): the same sequence, with each cysteine marked as disulfide-bonding cysteine or free cysteine
CAN THE PREDICTORS BE USED TO DISCOVER NEW PROTEINS?
Escherichia coli K12, complete genome
Completed: Oct 13, 1998. Total bases: 4,639,221 bp
NCBI (www.ncbi.nlm.nih.gov): protein coding genes: 4,289; structural RNAs: 115
EcoGene/EcoProt (bmb.med.miami.edu/EcoGene): protein coding genes: 4,173; structural RNAs: 120
EcoGene/SwissProt functional annotation
Keywords of SwissProt entries (where they exist) are extracted:
2160 ANNOTATED PROTEINS (52%)
  421 inner membrane proteins
  35 outer membrane proteins
  1704 globular proteins
760 PARTIALLY ANNOTATED PROTEINS (18%): proteins annotated as "Hypothetical proteins" and with other functional annotations
  352 inner membrane proteins
  18 outer membrane proteins
  390 globular proteins
1253 NON ANNOTATED PROTEINS (30%)
  137 proteins have no SwissProt entry
  1116 proteins have no functional annotation in SwissProt
Outer membrane proteins (all β-transmembrane proteins)
Inner membrane proteins (all α-transmembrane proteins)
HUNTER
Decision scheme applied to the proteome: a signal peptide test (Yes/No), followed by all-α TM and all-β TM tests; the resulting classes are all-α TM, all-β TM and globular
Predicting globular, inner and outer membrane proteins in genomes of Gram-negative bacteria with Hunter * the number of new proteins predicted in the class with Hunter, out of the non-annotated region
Collaborations
Italy: L.Masotti, Biochemistry, Bologna; P.Mariani, Physics, Ancona; M.Rossi, IBPE/CNR, Napoli; G.Campadelli-Fiume, Pathology, Bologna; G.Mita, IIGB/CNR, Napoli; S.Prosperi, Veterinary, Bologna; G.Irace, Biochemistry, Napoli; F.Bernardi, Chemistry, Bologna; D.Boraschi, CNR, Pisa; S.Ciurli, Agricultural Chemistry, Bologna; P.Arrigo, ICE/CNR, Genova; C.Bergamini, Biochemistry, Ferrara
Abroad: B.Rost, Columbia University, New York; A.Valencia, Protein Design Group, Cantoblanco, Madrid; P.Baldi, Genomics and Bioinformatics, Irvine, California; A.Krogh, University of Copenhagen, Copenhagen; N.Ben Tal, Israel Institute of Technology, Tel Aviv
The cross-validation procedure
The protein set is split into a training set and a testing set
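A k-fold split is one standard way to organise such a procedure. The sketch below assumes that every protein is used for testing exactly once and for training in all the other rounds; the number of folds and the way proteins are assigned to folds are not given on the slide, and `cross_validation_splits` is a hypothetical name.

```python
def cross_validation_splits(proteins, n_folds):
    """Yield (training_set, testing_set) pairs, one per fold.

    Proteins are dealt to folds in round-robin order (an arbitrary choice
    for this sketch); each fold serves as the testing set once.
    """
    if not 2 <= n_folds <= len(proteins):
        raise ValueError("n_folds must be between 2 and the number of proteins")
    folds = [proteins[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        testing = folds[k]
        training = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield training, testing


# Made-up protein identifiers.
for training, testing in cross_validation_splits(list("ABCDEF"), 3):
    print("train:", training, "test:", testing)
```

The scores measured on the testing sets are then averaged over the folds, so that no protein contributes to both the training and the evaluation of the same network.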
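Evaluation of the performance
$Q_2 = \frac{\text{correct predictions}}{\text{total predictions}} = \frac{p+n}{N}$
$Q(x) = \frac{\text{correct predictions in class } x}{\text{total observations in class } x} = \frac{p}{p+u}$
$P(x) = \frac{\text{correct predictions in class } x}{\text{total predictions in class } x} = \frac{p}{p+o}$
Correlation index: $C = \frac{p \cdot n - o \cdot u}{[(p+o)(p+u)(n+o)(n+u)]^{1/2}}$
Legend (predicted versus observed): p and n are the correct predictions inside and outside class x; o are the over-predictions (predicted in x, not observed in x); u are the under-predictions (observed in x, not predicted in x); N is the total number of predictions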
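These indices can be computed directly from paired lists of predicted and observed labels. In the sketch below p, n, o and u are read off the formulas: p and n count correct predictions inside and outside the class, o counts over-predictions and u counts under-predictions. The function name and the convention of returning 0 for an empty denominator are choices made for this example.

```python
import math


def binary_scores(predicted, observed, positive):
    """Compute Q2, Q(x), P(x) and the correlation index C for class `positive`."""
    p = n = o = u = 0
    for pred, obs in zip(predicted, observed):
        if pred == positive and obs == positive:
            p += 1  # correctly predicted in the class
        elif pred != positive and obs != positive:
            n += 1  # correctly predicted outside the class
        elif pred == positive:
            o += 1  # over-prediction
        else:
            u += 1  # under-prediction
    total = p + n + o + u
    denominator = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    return {
        "Q2": (p + n) / total,
        "Q": p / (p + u) if p + u else 0.0,
        "P": p / (p + o) if p + o else 0.0,
        "C": (p * n - o * u) / denominator if denominator else 0.0,
    }


# Made-up labels for six cysteines: B = bonding, F = free.
scores = binary_scores("BBFFBF", "BFFFBB", positive="B")
print(scores)
```

Here p = 2, n = 2, o = 1 and u = 1, so Q2 = 4/6, Q(B) = P(B) = 2/3 and C = 3/9.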
Evaluation of the efficiency of contact map predictions
1) Accuracy: $A = N_{cp}^{*} / N_{cp}$, where $N_{cp}^{*}$ and $N_{cp}$ are the number of correctly assigned contacts and the total number of predicted contacts, respectively.
2) Improvement over a random predictor: $R = A / (N_c / N_p)$, where $N_c / N_p$ is the accuracy of a random predictor; $N_c$ is the number of real contacts in the protein of length $L_p$, and $N_p$ is the number of all possible contacts.
3) Difference in the distribution of the inter-residue distances in the 3D structure for predicted pairs compared with all pair distances in the structure (Pazos et al., 1997): $X_d = \sum_{i=1}^{n} (P_{ic} - P_{ia}) / (n \, d_i)$, where $n$ is the number of bins of the distance distribution (15 equally distributed bins from 4 to 60 Å cluster all the possible distances of residue pairs observed in the protein structure); $d_i$ is the upper limit (normalised to 60 Å) for each bin, e.g. 8 Å for the 4 to 8 Å bin; $P_{ic}$ and $P_{ia}$ are the percentage of predicted contact pairs (with distance between $d_{i-1}$ and $d_i$) and that of all possible pairs, respectively.
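The three indices can be sketched as follows. The assumptions are that contacts are given as (i, j) tuples with i < j, that the 15 bins are 4 Å wide with upper limits 4, 8, ..., 60 Å, and that distances beyond 60 Å fall in the last bin; the function names are hypothetical.

```python
def accuracy(predicted_contacts, real_contacts):
    """A = correctly assigned contacts / total predicted contacts."""
    predicted = set(predicted_contacts)
    if not predicted:
        raise ValueError("at least one predicted contact is required")
    return len(predicted & set(real_contacts)) / len(predicted)


def improvement_over_random(a, n_real_contacts, n_possible_contacts):
    """R = A / (Nc / Np)."""
    return a / (n_real_contacts / n_possible_contacts)


def bin_percentages(distances, n_bins, max_distance):
    """Percentage of distances falling in each of n_bins equal-width bins."""
    width = max_distance / n_bins
    counts = [0] * n_bins
    for d in distances:
        counts[min(int(d // width), n_bins - 1)] += 1
    return [100.0 * c / len(distances) for c in counts]


def xd(predicted_distances, all_distances, n_bins=15, max_distance=60.0):
    """Xd = sum_i (Pic - Pia) / (n * d_i), with d_i normalised to max_distance."""
    p_c = bin_percentages(predicted_distances, n_bins, max_distance)
    p_a = bin_percentages(all_distances, n_bins, max_distance)
    total = 0.0
    for i in range(n_bins):
        d_i = (i + 1) / n_bins  # upper limit of bin i divided by max_distance
        total += (p_c[i] - p_a[i]) / (n_bins * d_i)
    return total


# Made-up contacts and distances (in Å), only to exercise the functions.
a = accuracy([(1, 5), (2, 9), (3, 7), (4, 8)], [(1, 5), (3, 7), (10, 20)])
print(a, improvement_over_random(a, n_real_contacts=3, n_possible_contacts=30))
print(xd([5.0, 6.0], [5.0, 6.0, 30.0, 50.0]))
```

A positive Xd means that the predicted pairs are shifted towards shorter distances than the population of all pairs, which is the desired behaviour even when a predicted pair is not a true contact.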
The cross-validation procedure
The protein set is split into training set 1 and testing set 1
BASIC PRINCIPLES OF PROTEIN STRUCTURE
Building blocks of the primary structure: amino acids and the protein backbone