G. Paolella, Napoli, 18/12/ High performance computing for the annotation and mining of whole genomes
DG-CST: 1022 genes related to genetically transmitted diseases
CST: identification and characterization, in other species, of nucleotide sequences conserved between human and mouse (CSTs). (Diagram: H. sapiens, M. musculus, CSTs.) CSTs identified in disease-associated genes: analysis to be carried out with BLAST against other genomes (rat, dog, monkey, chicken, etc.).
KinWeb: 500 genes coding for human protein kinases
KinWeb DB (figure panels a to e)
Pipeline units
High throughput sequencing => Assemble … Contigs, Scaffolds … => Annotation (diagram labels: geneA, tRNA, prom, oprA, oprB, geneCluster A)
Sequencing: at CEINGE, the Nonomuraea genome sequencing project by 454 FLX is in progress
Annotation
Annotation steps
Identification of genes and other genetic elements: identification of ORFs, tRNAs, rRNAs; scanning for signals, such as promoters and microRNAs; identification of operons and gene clusters.
Protein functional annotation: comparison with known genomes/proteins; identification of orthologs and paralogs; characterization of protein domains.
Cellular process annotation: reconstruction of complete metabolic pathways; …
SLS in Bacteria: Stem Loop Structure (SLS). (Figure: E. coli K12 map with protein and coding genes on the forward and reverse strands; elements shown: ERIC, Rib (BIME family).)
Identification of SLSs in bacterial genomes (plot: real vs. shuffled sequences)
Clustering by sequence similarity: X SLSs => BLAST all vs all (BLAST e-value) => Markov clustering (MCL)
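The hand-off from the all-vs-all BLAST to MCL can be sketched as follows. This is a minimal illustration, assuming BLAST tabular output (12 tab-separated columns, e-value in the 11th) and MCL's three-column label ("abc") input. The function name, the -log10 weighting, the e-value cutoff and the cap of 200 are illustrative assumptions, not the settings used in this work.

```python
import math

def blast_to_abc(blast_lines, max_evalue=1e-5, cap=200.0):
    """Turn BLAST tabular hits into (seqA, seqB, weight) edges for MCL.

    Hypothetical helper: cutoff, weighting and cap are illustrative choices.
    """
    edges = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue > max_evalue:
            continue  # drop self-hits and weak hits
        # Smaller e-value -> larger edge weight; an e-value of 0 gets the cap.
        weight = cap if evalue == 0.0 else min(cap, -math.log10(evalue))
        key = tuple(sorted((query, subject)))  # treat the graph as undirected
        edges[key] = max(weight, edges.get(key, 0.0))  # keep the best hit
    return [(a, b, w) for (a, b), w in sorted(edges.items())]
```

Each returned triple can be written as one tab-separated line; a file in that form is what MCL reads when run with its `--abc` option.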
RESULTS: Folding probability (plot: clustered SLSs, SLSs, random sequences). p = probability that the Minimum Free Energy (MFE) of a given sequence belongs to the distribution of MFE values computed with random sequences (RANDFOLD).
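The idea behind this p-value can be sketched as a shuffling test. The code below is only an illustration: `toy_energy` is a made-up stand-in that counts complementary pairs in a single hairpin and is not a real MFE calculation, and the (R + 1)/(N + 1) estimate is the usual Monte Carlo form, which may differ in detail from what RANDFOLD computes.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def toy_energy(seq):
    """Stand-in score, NOT a real MFE: minus the number of complementary
    pairs obtained when the sequence is folded in half as one hairpin."""
    half = len(seq) // 2
    pairs = sum(1 for i in range(half) if COMPLEMENT.get(seq[i]) == seq[-1 - i])
    return -pairs

def shuffled_pvalue(seq, energy=toy_energy, n_random=999, seed=0):
    """Fraction of shuffled sequences whose energy is as low as the real one."""
    rng = random.Random(seed)
    observed = energy(seq)
    letters = list(seq)
    as_low = 0
    for _ in range(n_random):
        rng.shuffle(letters)  # keeps base composition, destroys the order
        if energy("".join(letters)) <= observed:
            as_low += 1
    # Monte Carlo estimate; the real sequence counts as one observation.
    return (as_low + 1) / (n_random + 1)
```

A low p means that shuffled sequences with the same composition rarely fold as well as the real one. In practice `energy` would be replaced by a real folding program.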
RESULTS: grouping clusters = 98 clusters; manual refinement: 92 families
Genome search: identification of all family members by HMM. (Flowchart boxes: sequence alignment, HMM build, HMM, matches, align & extend, new alignment, final elements.)
An example of the elongation process: Myg-1 (M. genitalium)
Examples of complex structures: Pae-1 (P. aeruginosa), Efa-1 (E. faecalis)
Bacterial SLSs: Pae-1 (Pseudomonas aeruginosa), ERIC (Escherichia coli)
Secondary structure prediction analysis of families (RNAz test; chart axis: % of genic families)
Predicted to be structured (57): known (20), novel (37)
Predicted to be not structured (35): known (5), novel (30)
Contain known motifs: (14) and (12)
Cluster: 4 x 14 x 2 = 112 processors at 2.8 GHz; 4 x 14 x 2 = 112 GB RAM; 2 GB/s per card, 4 GB/s aggregate. Each node: 2.8 GHz biprocessor, 2 GB RAM, 160 GB HD.
Processing time
The procedure requires high performance computing. Identification: Blast + MCL, Pcma, then HMMbuild, HMMcalibrate, HMMsearch, Pcma (repeated n times). Characterization: Randfold, RNAz, Infernal. SCoPE GRID computing.
Medicine site. HD attached to the system: 1 Cluster Element (CE); 5 biprocessor Worker Nodes (WN), expandable up to 40; 1 Storage Element (SE) with 50 Gb; 1 User Interface (UI).
BLAST. Executable submitted from a local program repository; genomic data libraries stored on the SE. Example: BLAST of the CSTs against the dog, chicken, monkey and rat genomes. Jobs submitted: 67. Input sequence group: 1000 sequences. Total execution time for the 67 jobs: 4 hours. Mean time per job: 18 minutes (2 spent downloading the dataset). CPU time: search of 1 sequence in the mouse genome => 5 sec; … sequences => 3.75 days; 10 genomes => 37.5 days.
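The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only numbers from the slide. The derived values (summed job time, implied concurrency, implied number of sequences) are inferences and are not stated in the source.

```python
SECONDS_PER_DAY = 86_400

# Grid run reported on the slide.
jobs = 67
mean_job_minutes = 18
wall_clock_hours = 4

summed_job_hours = jobs * mean_job_minutes / 60             # 20.1 hours
implied_concurrency = summed_job_hours / wall_clock_hours   # about 5 jobs at a time

# Single-CPU estimate reported on the slide.
seconds_per_sequence = 5
days_one_genome = 3.75

# Number of sequences that 3.75 days at 5 s each would correspond to (inferred).
implied_sequences = days_one_genome * SECONDS_PER_DAY / seconds_per_sequence  # 64800.0
days_ten_genomes = days_one_genome * 10                                       # 37.5
```

Summing the mean job time over 67 jobs gives about 20 hours of work done in 4 hours of wall-clock time, so roughly five jobs ran concurrently. That is in line with the five worker nodes listed for the site, though the slide does not state it.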
Secondary structure search. Identification and characterization, in bacterial genomes, of families of repeated sequences that share a conserved secondary structure. Analysis to be run on over 300 bacterial genomes. Example: search of one family in one genome =====> 6 hours; search of 50 families in one genome =====> 12.5 days; search of 50 families in 300 genomes =====> 10 years.
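The scaling on this slide is a plain product of families, genomes and hours per search. The sketch below reproduces it and adds an idealized estimate for the 112-processor cluster described earlier. That estimate assumes fully independent searches and no scheduling or I/O overhead, so it is a lower bound and not a measured figure.

```python
HOURS_PER_DAY = 24
HOURS_PER_SEARCH = 6  # one family in one genome, from the slide

def serial_days(n_families, n_genomes, hours_each=HOURS_PER_SEARCH):
    """Days needed if every family/genome search runs one after another."""
    return n_families * n_genomes * hours_each / HOURS_PER_DAY

def ideal_parallel_days(total_days, n_procs):
    """Lower bound: independent searches spread evenly, zero overhead."""
    return total_days / n_procs

one_genome = serial_days(50, 1)        # 12.5 days
all_genomes = serial_days(50, 300)     # 3750.0 days, a little over 10 years
on_cluster = ideal_parallel_days(all_genomes, 112)  # roughly 33.5 days
```

Because each family/genome search is independent of the others, the workload splits naturally into separate jobs, which is what makes cluster or grid execution attractive here.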
RandFold. Identification of potentially structured sequences in the human transcriptome. Analysis to be run with RANDFOLD on genes fragmented, with 50-base windows, into 150-base sequences. Example: genes: 408, equal to about 14 million bases; 150-base sequences generated: …; analysis of 1 sequence =====> 45 sec; analysis of … sequences =====> 152 days.
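The fragmentation step can be sketched as a sliding window. This sketch assumes that the 50-base figure is the step between consecutive 150-base fragments, which is one reading of the slide. The function names are hypothetical.

```python
def windows(seq, size=150, step=50):
    """Yield (start, fragment) for each full-length window along seq."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]

def window_count(length, size=150, step=50):
    """Number of full windows produced for a sequence of the given length."""
    if length < size:
        return 0
    return (length - size) // step + 1
```

Under that assumption, and ignoring gene boundaries, 14 million bases at a 50-base step give about 280,000 fragments. At 45 seconds each that is about 146 days, the same order as the 152 days quoted on the slide.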
What about more interactive uses?
CAPRI
Requests distribution on the cluster (diagram): web requests arrive at the access servers, each with its scheduler; an http request is sent to the Broker (getNode) for an available node; the job is launched on the node via rsh; results are displayed on the web. The cluster and the DBs sit on a private network.
Cluster activity (diagram): Cluster Status is kept in the StatusDB, which receives sql updates from the node agents; the Cluster Manager and the Display server access it via sql...
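The broker's role of returning an available node can be illustrated with a small sketch. Everything here is hypothetical: the `NodeStatus` fields, the `get_node` name and the least-loaded selection policy are assumptions made for illustration. The slide only states that the broker answers a request for an available node using the cluster status.

```python
from dataclasses import dataclass

@dataclass
class NodeStatus:
    """Hypothetical row of the status table kept up to date by node agents."""
    name: str
    alive: bool
    running_jobs: int
    max_jobs: int

def get_node(status_table):
    """Return the name of the least-loaded available node, or None."""
    candidates = [
        n for n in status_table
        if n.alive and n.running_jobs < n.max_jobs
    ]
    if not candidates:
        return None  # every node is down or full
    # Assumed policy: lowest relative load first, name as a tie-breaker.
    best = min(candidates, key=lambda n: (n.running_jobs / n.max_jobs, n.name))
    return best.name
```

An access server would call something like `get_node` before launching a job on the returned node. Keeping the selection inside the broker means the access servers do not need to know the cluster layout.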
Hierarchical node organization (diagram): Broker, virtual nodes, DB, Grid node
Integrated image storage and processing environment. Digital images are produced by a variety of microscope devices. Managing large numbers of images requires the use of databases (DB). Processing of the acquired images is often necessary to enhance the visibility of cell features that would otherwise be hidden.
PROJECT metadata: project title; experiment name, author, group, group leader, etc.; cell line; culture conditions; fixation and inclusion methods, stainings, etc.; objective; focus position; stage position x/y; exposure time; resolution, etc.
The image processing system: IPROC
File format. Data input: text description
Version number 1; features tab-delimited
Name    filename
Depth   size 16bit
wdim    size 4    where files    (well1, well2, well3, well4)
cdim    size 3    where files    (Channel 1, Channel 2, Channel 3)
pdim    size n    where files    (Position 1 ... Position n)
tdim    size n    unit min    scale 10    where files    (Time 1, Time 2 ... Time n)
ldim    size n    unit µm    scale 0.4    where layers    (l1 ... ln)
IPROC image processing (screenshot labels): acquisition parameters; buttons to slide through the acquisition; image processing menus; info panel for each frame; hide/show control command
Image processing modules: image in => adapter => iProcStep => adapter => image out. Step types and the packages they wrap: iProcStep ImageMagick (ImageMagick package), iProcStep PHP (PHP package), iProcStep Perl (Perl package), commandLine program (command-line packages).
IPROC architecture (diagram): iPage (image area, data + images page), iPane (proc-steps), Gateways, HPC on cluster nodes
IPROC parallel processing (diagram): three Access Servers, CLUSTER, Cluster Nodes
The group: Angelo Boccia, Gianluca Busiello, Mauro Petrillo, Concita Cantarella, Luca Cozzuto, Leandra Sepe, Vittorio Lucignano, Marisa Passaro