Italian production GRID infrastructure: status and organization of support and operations. Cristina Vistoli, INFN CNAF
Production Grid infrastructure. 37 'resource centers': all sites are accessible through the Grid services (Resource Broker). 25 sites are part of the EGEE/LCG infrastructure and are registered in the GOCDB. 12 additional sites are accessible only through the Italian services. http://grid-it.cnaf.infn.it
Production Infrastructure: Resources
INFNGRID-2.6.0 deployment status: resources
INFNGRID-2.6.0 deployment status: core services. INFNGRID 2.6 on Scientific Linux 3.05: RLS, LFC, GridICE, RB, VOMS, MyProxy, GIS
The production grid. Main goals of the organization supporting the GRID infrastructure: provide a stable, certified, documented middleware release with automatic installation procedures suited to the complexity of the farms; check the installations, the configurations and the efficiency of the Grid services; collaborate with the site managers; provide user support; guarantee support to experiments and applications, and promote the integration of experiment-specific services into the common services.
EGEE/LCG. EGEE SA1 guarantees the operation of the European grid infrastructure. Italian participation in EGEE/SA1 consists of: ROC (Regional Operation Center); CIC (Core Infrastructure Center); management of the general Grid services (resource broker, file catalogue management, usage monitoring and accounting, VOMS…); development or adaptation of monitoring tools; production of the middleware release and its documentation; checking and certification of site configurations and functionality; user support; support to experiments for Grid integration; porting and interoperability.
Sites included in the production Grid. The nodes of the production grid must: provide system administration support for middleware installation and configuration; respond promptly to operational problems and provide a 'security' contact; provide user support; take part in the shifts that monitor the Grid services of the infrastructure as a whole. They contribute in order to: ensure that expertise is distributed and kept up to date; achieve greater reliability and quality of support.
INFNGRID-2.6.0 features. It is essentially LCG-2.6.0 with some additional features: new Network Monitor profile; improved support for LSF and MPI; support for additional VOs managed via LDAP VO server (babar, zeus); support for additional VOs managed via VOMS server (infngrid, cdf, gridit, compchem, planck, bio, enea, theophys, ingv, inaf, virgo, argo); support for MPI jobs via home synchronisation with scp and host-based authentication; DGAS (DataGrid Accounting System).
Release and documentation. Documentation: site installation guide, release notes…. Software repository. Site management guide. FRY is a tool developed by the Release and Documentation group of the SA1 Italian ROC to quickly run a set of basic tests on all the grid elements (CE, SE, RB, WN, ...). The idea is to increase the speed and reliability of the release certification phase by running a "standard" set of tests that automatically detect configuration/setup problems (daemons, permissions and ownership of some directories, ...). http://grid-it.cnaf.infn.it/index.php?sitetest&type=1 DGAS checklist [new]: the DGAS developers produced this document to check whether the DGAS configuration is correct. UiPNP. Installation of LCG 2.6 on IA64 http://www.spaci.it/egee/content.php?loc=docs&pg=default.php http://grid-it.cnaf.infn.it/index.php?siteman&type=1
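As an illustration of the kind of automated check described for the certification phase (permissions and ownership of directories), here is a minimal Python sketch. It is not FRY and does not reproduce its tests: the function name `check_directory`, its parameters and the directory used in the demo are all hypothetical.

```python
import os
import stat
import tempfile


def check_directory(path, expected_mode, expected_uid=None):
    """Return a list of problems found for `path` (an empty list means OK).

    Hypothetical helper, not part of FRY: it only compares the permission
    bits and, optionally, the owner uid of a directory with expected values.
    """
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return [f"{path}: missing"]

    problems = []
    if not stat.S_ISDIR(info.st_mode):
        problems.append(f"{path}: not a directory")

    actual_mode = stat.S_IMODE(info.st_mode)  # permission bits only
    if actual_mode != expected_mode:
        problems.append(
            f"{path}: mode {oct(actual_mode)}, expected {oct(expected_mode)}"
        )

    if expected_uid is not None and info.st_uid != expected_uid:
        problems.append(
            f"{path}: owner uid {info.st_uid}, expected {expected_uid}"
        )
    return problems


def demo():
    """Run the check on a scratch directory (a stand-in for a real one)."""
    with tempfile.TemporaryDirectory() as scratch:
        os.chmod(scratch, 0o700)
        owner = os.stat(scratch).st_uid
        ok = check_directory(scratch, 0o700, expected_uid=owner)
        bad = check_directory(scratch, 0o755)
        missing = check_directory(os.path.join(scratch, "absent"), 0o755)
    return ok, bad, missing
```

Returning a list of problems rather than a single boolean lets one run report every mismatch on a node, which suits a checklist-style certification pass.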
Release and documentation
Supported VOs
Central Management Team: site certification. The CMT is responsible for certification: checking the functionality of a site before it joins the production grid. In particular it checks: consistency of the GIIS information; local job submission (LRMS); Grid submission with Globus (globus-job-run); Grid submission through the Resource Broker; ReplicaManager functionality. To certify a site the CMT uses dedicated grid services: RB: gridit-cert-rb.cnaf.infn.it; BDII: gridit-cert-rb.cnaf.infn.it. This keeps uncertified sites out of the production grid. The same grid services should be used for test activities.
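The certification checks listed above can be read as an ordered checklist. The following Python sketch shows one possible way to drive such a checklist. It is an assumption, not the CMT's actual procedure: the function `certify_site`, the stop-at-first-failure policy and the site name `EXAMPLE-SITE` are hypothetical, and the real checks (for example a globus-job-run submission) are represented only by placeholder callables.

```python
# Check names taken from the slide, in the order they are listed there.
CHECK_ORDER = [
    "GIIS information consistency",
    "local job submission (LRMS)",
    "Grid submission with Globus (globus-job-run)",
    "Grid submission through the Resource Broker",
    "ReplicaManager functionality",
]


def certify_site(site, checks):
    """Run the checks in order and report whether the site is certified.

    `checks` maps each name in CHECK_ORDER to a callable taking the site
    name and returning True on success. Stopping at the first failure is
    an assumption made for this sketch.
    """
    results = []
    for name in CHECK_ORDER:
        passed = bool(checks[name](site))
        results.append((name, passed))
        if not passed:
            break
    certified = len(results) == len(CHECK_ORDER) and all(
        passed for _, passed in results
    )
    return certified, results


if __name__ == "__main__":
    # Placeholder checks: every step succeeds except the Globus submission.
    stub_checks = {name: (lambda site: True) for name in CHECK_ORDER}
    stub_checks[CHECK_ORDER[2]] = lambda site: False

    certified, results = certify_site("EXAMPLE-SITE", stub_checks)
    for name, passed in results:
        print(f"{'PASS' if passed else 'FAIL'}  {name}")
    print("certified" if certified else "not certified")
```

A site counts as certified only when every check has run and passed, which matches the goal of keeping uncertified sites out of the production grid.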
Support: regional operation center. First level support: Italian ROC shifts. The Italian ROC provides geographically based local front-line support to Virtual Organizations, users and Resource Centres. Provided through daily shifts; a checklist to be covered during each shift; periodic (every 15 days) phone conference between the ROC/CIC teams and the site managers; ROC report to GDA. Shift example, weekly based. Second level support: CIC on Duty; weekly shift; CIC tools.
Support system. Problems. Communication: ROC on Duty and site managers; site managers to Central Management Team and vice versa. Site certification during installation/upgrade. GGUS to ROC.
Ticket statistics. Starting date: August 2005. 530 total, 131 from GGUS.
VOMS proxy
Columns: VO | 28-30 Apr | May | June | July | Aug | Sep | 01-09 Oct | total
(rows with fewer than eight values had empty cells in the original table; values are listed in source order)
argo: 3 3
bio: 81 62 8 53 9 213
cdf: 31 808 1029 868 867 777 243 4623
compchem: 35 2 4 78 7 135
enea: 1 37 139 43 229
gridit: 41 48 45 110 24 268
inaf: 6 5 20 34
infngrid: 298 274 177 151 409 69 1387
ingv: 13 18 12 59
planck: 11
theophys: 22
virgo: 10
Totals: 60 1141 1493 1241 1108 1627 406 7064
Job status 1/2 6/dec/2005 10.33
Job status 2/2 6/dec/2005 10.33
Jobs per site from 21/11 to 04/12. Total jobs = 121406
N.B.: T1 excluded so that the percentages are easier to read
Jobs per VO from 21/11 to 04/12
Job report 21/11 - 05/12 (N.B.: INFN T1 not included)
Job report 21/11 - 05/12
Supported hardware and platforms. LCG is officially supported on the following platforms. i386: standard PCs and clones based on the Intel i386 architecture and compatible processors. IA64: Itanium / Itanium 2 (Spaci for Italy, openlab for CERN). amd64: AMD64-based systems (32 bit). Official support means that the release install media is known to work, that the architecture can compile itself,
Minimum hardware requirements. CPU: for all nodes, a Pentium faster than 500 MHz. Memory: more than 256 MB. Disk: the middleware uses about 1 GB; in addition, the RBs require at least 20 GB for the storage of the sandboxes. The WNs need adequate scratch space, at least 5 GB for each job running at the same time. The storage size of the SEs depends on the applications. Shared filesystem: a small shared file system is currently required for the storage of the experiments' software. Network: a network card of at least 10 Mbit is required.
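As a worked example of the disk figures above, the following Python sketch computes a lower bound on disk space per node type. The function name `min_disk_gb` is hypothetical, and adding the 1 GB middleware footprint to the WN scratch space is an assumption made here for illustration. SEs are left out because their size depends on the applications.

```python
MIDDLEWARE_GB = 1           # "the middleware uses about 1 GB"
RB_SANDBOX_GB = 20          # minimum sandbox storage on a Resource Broker
WN_SCRATCH_PER_JOB_GB = 5   # minimum scratch space per concurrent job


def min_disk_gb(node_type, concurrent_jobs=0):
    """Lower bound on disk space (GB) for an RB or a WN.

    Hypothetical helper: it only adds up the minimum figures quoted on
    the slide and is not an official sizing rule.
    """
    if node_type == "RB":
        return MIDDLEWARE_GB + RB_SANDBOX_GB
    if node_type == "WN":
        if concurrent_jobs < 1:
            raise ValueError("a WN needs at least one concurrent job slot")
        return MIDDLEWARE_GB + WN_SCRATCH_PER_JOB_GB * concurrent_jobs
    raise ValueError(f"no fixed minimum for node type {node_type!r}")


if __name__ == "__main__":
    print(min_disk_gb("RB"))                      # 21
    print(min_disk_gb("WN", concurrent_jobs=4))   # 21
```

With these figures, a WN running four jobs at once needs at least as much disk as a Resource Broker.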
Software requirements. The LCG middleware officially supports Scientific Linux version 3. The base SL distribution is basically Enterprise Linux, recompiled from source. Porting is in progress to MacOS X, Solaris, EMT64, FC4, AIX, IRIX… More precisely… (continued)
Porting in progress… CERN/UVienna/Apple: MacOS X port available (focus on UI: WMS, …). Grid-Ireland: WN ports available for CentOS 4.1, Suse 9.3, RedHat 7.3/9. Work in progress on MacOS X, Solaris, EMT64, FC4, AIX, IRIX. GSI (Germany): Debian port (UI and WN?). IRB (Croatia): Debian: tar fixes (UI), chroot (CE+WN), converting RPMs to DEBs (ongoing); FreeBSD: tar (UI). HPC2N Umea (Sweden): porting gLite to Ubuntu (Debian). EGRID (Italy): LiveCD with all service nodes, UI-only relocatable installation.
Pre-production activities. The CNAF site is already part of the PPS. Two more sites (Bari and Padova) will join the PPS infrastructure soon.
Certification services. INFN Grid Certification Testbed: to test and certify the Grid software developed inside INFN (gLite and LCG), and to certify the installation of new INFN-GRID releases. Five sites: INFN-TORINO, INFN-PADOVA, INFN-CNAF, INFN-ROMA1 and INFN-BARI. The activity is carried out in close collaboration with the INFN-LCG-EGEE development teams, the EGEE Pre Production Service, ECGI and the Experiment task forces. http://grid-it.cnaf.infn.it/certification/
CNAF certification / pre-production (diagram). Hosts: cert-rb-01 (1.2 WMS+LB); cert-rb-02 (WMS+LB); cert-rb-03 (gLite 1.4 WMS+LB); glite-rb-00 (1.4 WMS+LB); cert-rls-01 (gLite 1.2 FireMan Cat.); cert-mon (gLite 1.2 R-GMA Server); cert-mon-it (1.2 R-GMA server with Registry/Schema); cert-pbox-01 (PBOX server); cert-bdii-01 (LCG-2.6.0 BDII); cert-voms-01 (gLite 1.3 VOMS Server); cert-voms-02 (gLite 1.1 VOMS Server); cert-ui-01 (gLite 1.2 with bulk UI); pre-ui-01 (gLite 1.1 UI); devrb (rb); devui (ui); +3 servers dedicated to STORM tests. Other diagram labels: Cert Sites; LCG-2.6.0 Site; gLite-1.3 Site; gLite-1.2 Site; EGEE Production BDII; Services for PBOX tests; Release Creation/Test; APT Repository; ALL PPS.
Conclusions. The national production infrastructure now provides resources and support to national applications. Next steps: extend the infrastructure to new applications; improve the organization so that other resources (project resources, universities, etc.) can be included easily; guarantee sharing while also providing priorities and rules that respect the wishes of the resource owners; provide adequate room for large-scale experimentation with new middleware developments.