Introduction to the INFNGRID project
Alessandro Paolini (INFN-CNAF)
3rd INFN training course for GRID site administrators
INFN-CATANIA, 2-6 November 2009
Primary components of the production grid
The primary components of the Italian Production Grid are:
- Computing and storage resources
- Access point to the grid
- Services
Other elements are equally fundamental for the operation, management and monitoring of the grid:
- Middleware
- Monitoring tools
- Accounting tools
- Management and control infrastructure
- Users
GRID Management
Grid management is performed by the Italian Regional Operation Centre (ROC). The main activities are:
- Production and testing of the INFNGRID release
- Deployment of the release to the sites, support to local administrators and site certification
- Periodic checks of the status of resources and services
- Support at the Italian level
- Support at the European level
- Introduction of new Italian sites into the grid
- Introduction of new regional VOs into the grid
The Italian Regional Operation Center (ROC)
- Operations Coordination Centre (OCC): management and oversight of all operational and support activities
- Regional Operations Centres (ROC): provide the core of the support infrastructure, each supporting a number of resource centres within its region
- Grid Operator on Duty
- Grid User Support (GGUS): at FZK, coordination and management of user support, single point of contact for users
The Italian ROC is one of 10 existing ROCs in EGEE.
ROC Shifts
About 15 supporters take part in a checking activity organized in shifts of 1 week, with 3 people per shift:
- 2 people provide first support to sites and users (1st line supporters team)
- 1 person (2nd line supporter) steps in for more complex problems
The main activities are:
- Checking the grid status and reporting problems, following them up until their solution when possible
- Certifying sites during the deployment phases
- Checking the tickets still open and urging the experts or the site managers to answer and solve them
Users and sites support
EGEE makes use of the GGUS (Global Grid User Support) ticketing system. Each ROC uses its own tools, interfaced to GGUS in a bidirectional way. By means of web services it is possible to:
- Transfer tickets from the global to the regional system
- Transfer tickets from the regional to the global system
The support groups to which tickets will be addressed are defined both in GGUS and in the regional systems. In the Italian Regional Operation Centre the ticketing system in use is based on XOOPS/xHelp.
Italian support system
Interface to GGUS
Tickets coming from GGUS that concern a specific site are automatically assigned to that site in our system. For tickets coming from GGUS it is possible to choose whether to send the answer back to GGUS or to keep it internal to our system.
Service Availability Monitoring (SAM)
SAM jobs are launched every hour and make it possible to detect submission problems, such as batch system errors, outdated CAs and replica errors. There are also more specific tests for SRMv2, SE, LFC, CREAMCE, WMS, BDII, etc.
Service Availability Monitoring (SAM) CE Tests
Service Availability Monitoring (SAM) RGMA (MONBOX) Tests
Tests whether the service host certificate is valid. On the MONBOX the host certificate is also present in the rgma and tomcat (hidden) directories, which usually are:
/etc/tomcat5/
/opt/glite/var/rgma/.certs/
So when you change the host certificate you also have to restart:
/etc/rc.d/init.d/rgma-servicetool
/etc/rc.d/init.d/tomcat5
SAM Admin
SAM jobs are available for both EGEE production and preproduction sites:
- each site manager can submit new SAM tests on his own site
- each ROC can submit new test jobs on the sites of its own region
GSTAT
GSTAT queries the Information System every 5 minutes. The sites and nodes checked are those registered in the GOC DB. Inconsistencies in the published information, as well as the absence of a service that a site should publish, are reported as errors.
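The kind of consistency check GSTAT performs can be sketched as a set comparison. This is a minimal illustration with invented function and variable names, not GSTAT's actual implementation:

```python
# Hypothetical sketch: compare the services a site registered in the GOC DB
# against what the site actually publishes in the Information System.

def find_missing_services(registered, published):
    """Return services registered in the GOC DB but absent from the BDII."""
    return sorted(set(registered) - set(published))

# Example: a site registered a CE, an SE and a site-BDII, but the SE
# has disappeared from the published information.
registered = ["CE", "SE", "sBDII"]
published = ["CE", "sBDII"]
print(find_missing_services(registered, published))  # ['SE']
```

A missing service in the result would be reported by GSTAT as an error for that site.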
Introducing a new site
Before entering the grid, each site has to accept several norms of behaviour, described in a Memorandum of Understanding (MoU). The COLG (Grid Local Coordinator) reads and signs it, and faxes the document to INFN-CNAF. Moreover, all sites must provide this email alias: grid-prod@. This alias will be used to report problems and will be added to the site managers' mailing list; of course it should include all site managers of the grid site. At this point the IT-ROC registers the site and the site managers in the GOC-DB and creates a supporter-operative group in the XOOPS ticketing system. Site managers have to register in XOOPS, so they can be assigned to their supporter-operative groups; each site manager also has to register in the test VOs infngrid and dteam.
Introducing a new site
Site managers install the middleware, following the instructions distributed by the Release Team (http://www.italiangrid.org/grid_operations/site_manager/IG_repository). When finished, they run some preliminary tests and then request certification by their own ROC. The IT-ROC opens a ticket to communicate with the site managers during the certification.
Memorandum of Understanding
Every site has to:
- Provide computing and storage resources. Farm size (at least 10 CPUs) and storage capacity will be agreed with each site
- Guarantee sufficient manpower to manage the site: at least 2 people
- Manage the site resources efficiently: middleware installation and upgrades, patch application, configuration changes as requested by the CMT, within the maximum time stated for each operation
- Answer tickets within 24 hours (T2 sites) or 48 hours (other sites), Monday to Friday
- Periodically check their own status
- Guarantee continuity of site management and support, also during holiday periods
- Participate in the SA1/Production-Grid phone conferences and meetings and compile the weekly report
- Keep the information in the GOC DB up to date
- Enable the test VOs (ops, dteam and infngrid) with a higher priority than the other VOs
Any non-fulfilment noticed by the ROC will be reported to the biweekly INFNGRID phone conferences, then to the COLG, and eventually to the EB.
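The 24/48-hour response rule counts working days only (Monday to Friday). A minimal sketch of one way to compute the resulting deadline, assuming the hours are counted hour by hour and weekend hours are simply skipped (the exact counting convention is an assumption, not stated in the MoU):

```python
from datetime import datetime, timedelta

def ticket_deadline(opened, hours):
    """Deadline after `hours` working hours, counting Mon-Fri only.
    Sketch of the 24h (T2) / 48h (other sites) MoU rule."""
    t = opened
    remaining = hours
    while remaining > 0:
        t += timedelta(hours=1)
        if t.weekday() < 5:  # Mon=0 .. Fri=4: weekend hours do not count
            remaining -= 1
    return t

# A T2 ticket opened on Friday 6 Nov 2009 at 10:00 is due Monday at 10:00.
print(ticket_deadline(datetime(2009, 11, 6, 10, 0), 24))  # 2009-11-09 10:00:00
```

For a non-T2 site the same ticket would be due Tuesday at 10:00 (48 working hours).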
Registration in GOC DB
3 phases:
- candidate: monitoring is OFF
- uncertified: monitoring can be turned on
- certified: monitoring is ON
Availability & Reliability (I)
- The status of the CE, SE, SRM and sBDII services, as resulting from the SAM tests, is taken into account
- A logical AND is applied across these services, and a logical OR among services of the same type when a site has more than one instance of the same service
- A site must be available at least 70% of the time per month (daily availability is measured over 24 hours)
- The reliability of the site must be at least 75% per month (Reliability = Availability / (Availability + Unscheduled Downtime))
- Sites that do not reach the minimum Availability or Reliability threshold by the end of the month must provide explanations
- Sites that stay below the 50% threshold for 3 consecutive months will be suspended
Availability & Reliability (II)
- Scheduled downtime periods must be declared in advance in the GOC-DB
- Scheduled downtimes affect availability negatively, but not reliability
- Downtimes longer than one month are granted only in exceptional cases and must first be approved by the ROC
GRID Services
They allow you to use the grid resources:
- Resource Broker (RB) / Workload Management System (WMS): responsible for accepting submitted jobs and sending them to the appropriate resources
- Information System (IS): provides information about the grid resources and their status
- Virtual Organization Management System (VOMS): database for the authentication and authorization of users
- GridICE: monitoring of resources, services and jobs
- Home Location Register (HLR): database for the accounting information on resource usage
- LCG File Catalog (LFC): file catalog
- File Transfer Service (FTS): moves files in an efficient and reliable way
- MonBox: collector for the local data of R-GMA
Access to the GRID
Access is by means of a User Interface (UI). It can be:
- A dedicated PC, installed in a similar way to the other grid elements
- A web portal: https://genius.ct.infn.it/
To access the GRID you need a personal certificate issued by a Certification Authority trusted by the EGEE/LCG infrastructure: user authentication is performed through X.509 certificates.
To be authorized to submit jobs you have to belong to a Virtual Organisation (VO). A VO is a group of users usually working on the same project and using the same application software on the grid.
General Purpose Services
(Service instances grouped by type; scopes: test sites, Italian sites, EGEE sites)
- MYPROXY: myproxy.cnaf.infn.it
- BDII: gridit-bdii-01.cnaf.infn.it, prod-bdii-01.pd.infn.it, top-bdii01.cnaf.infn.it, top-bdii02.cnaf.infn.it, top-bdii03.cnaf.infn.it, egee-bdii.cnaf.infn.it, gridbdii.fe.infn.it
- LFC: lfcserver.cnaf.infn.it
- RB: gridit-cert-rb.cnaf.infn.it, glite-rb-00.cnaf.infn.it, glite-rb-01.cnaf.infn.it
- WMS: gridit-wms-01.cnaf.infn.it, prod-wms-01.pd.infn.it, egee-wms-01.cnaf.infn.it, albalonga.cnaf.infn.it
- LB: lb009.cnaf.infn.it, prod-lb-01.pd.infn.it
- WMS+LB+BDII: wms-lb.ct.infn.it
- VOMS (with replicas): voms.cnaf.infn.it, voms2.cnaf.infn.it, voms-01.pd.infn.it, voms-02.pd.infn.it
- SAM ADMIN (test sites scope)
Services monitored by NAGIOS
The CNAF machines are monitored by Nagios: in case of failures, email or SMS notifications are enabled. If a HOST goes down, an SMS is sent for the following grid services: FTS, LFC, WMSMON, VOMS (master AND replica), CE, NTP.
The services running on the machines are checked every 5 minutes. If a check turns CRITICAL (an SMS is sent), 4 more checks are performed, 1 minute apart from each other. At the second of these checks, when the check state is CRITICAL SOFT 2, a restart of the service is attempted.
HOST / CHECKS / RESTART:
- FTS: FTS AGENTS, FTS BDII; restart: TO DO
- LFC: lfc-dli, lfc-daemon, globus-mds, mysql; restart: YES
- VOMS: voms ports
- BDII: bdii; restart: NO
- CE: site-bdii
- NTP: ntpd
- WMS: WMS-POOL
Dynamic DNS update: these services are checked every minute. If a check turns CRITICAL, 3 more checks are performed, 1 minute apart. If the CRITICAL state persists, the IP of the host running the critical service is removed from the alias top-bdii.grid.cnaf.infn.it. Hosts involved: top-bdii01.cnaf.infn.it, top-bdii02.cnaf.infn.it, top-bdii03.cnaf.infn.it, prod-bdii-01.pd.infn.it, gridbdii.fe.infn.it
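The soft-state retry-and-restart behaviour described above can be simulated in a few lines. This is a toy model with invented names, not Nagios configuration: after a CRITICAL result, the service is re-checked up to 4 times, and at the second soft CRITICAL a restart is attempted:

```python
# Toy simulation of the Nagios soft-state logic: re-check up to
# max_retries times; at the second soft CRITICAL, attempt a restart.

def handle_critical(check, restart, max_retries=4):
    """Return the retry number at which the check recovered, or None."""
    for attempt in range(1, max_retries + 1):
        if attempt == 2:      # state CRITICAL SOFT 2: try restarting
            restart()
        if check() == "OK":
            return attempt
    return None

# A service that recovers once restarted:
state = {"up": False}
recovered_at = handle_critical(
    check=lambda: "OK" if state["up"] else "CRITICAL",
    restart=lambda: state.update(up=True),
)
print(recovered_at)  # 2
```

If the service never recovers, the function returns None, which in the real setup corresponds to the state going HARD CRITICAL (and, for the BDII alias, to the host's IP being removed from the DNS alias).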
Accounting using DGAS
DGAS (Distributed Grid Accounting System) is fully deployed in INFNGrid (15 site HLRs + 1 second-level HLR, in testing). The site HLR is a service designed to manage a set of 'accounts' for the Computing Elements of a given computing site. For each job executed on a Computing Element (or on a local queue), the Usage Record for that job is stored in the database of the site HLR. Each site HLR can:
- Receive Usage Records from the registered Computing Elements
- Answer site manager queries such as: detailed job list queries (with many search keys: per user, VO, FQAN, CEId, ...) and aggregate usage reports (per hour, day, month, ...) with flexible search criteria
- Optionally forward Usage Records to the APEL database
- Optionally forward Usage Records to a VO-specific HLR
Architecture: at the resource layer, Usage Metering collects job-level info; at the site layer, the site HLR holds detailed resource usage info, aggregate site info and VO (with role/group) usage on the site, and reports to the GOC.
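The aggregate queries a site HLR answers amount to grouping per-job Usage Records by a key. A minimal sketch with invented record fields (the real DGAS schema differs):

```python
# Sketch of an HLR-style aggregate report: sum CPU hours per VO
# from per-job Usage Records.
from collections import defaultdict

def usage_per_vo(records):
    totals = defaultdict(float)
    for rec in records:
        totals[rec["vo"]] += rec["cpu_hours"]
    return dict(totals)

records = [
    {"user": "alice", "vo": "atlas", "ce": "ce01", "cpu_hours": 3.5},
    {"user": "bob",   "vo": "cms",   "ce": "ce01", "cpu_hours": 1.0},
    {"user": "carol", "vo": "atlas", "ce": "ce02", "cpu_hours": 2.5},
]
print(usage_per_vo(records))  # {'atlas': 6.0, 'cms': 1.0}
```

Grouping by "user", "ce" or a time bucket instead of "vo" gives the other report types mentioned above (per user, per CEId, per day, ...).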
Tier1 & Tier2 HLRs
11 site Home Location Registers for Tier1 and Tier2 sites, plus 4 HLRs for the small and medium sites.
host / site:
- hlr-t1.cr.cnaf.infn.it: INFN-T1
- prod-hlr-02.ct.infn.it: INFN-CATANIA
- prod-hlr-01.pd.infn.it: INFN-PADOVA
- prod-hlr-01.ba.infn.it: INFN-BARI
- atlashlr.lnf.infn.it: INFN-FRASCATI
- t2-hlr-01.lnl.infn.it: INFN-LEGNARO
- prod-hlr-01.mi.infn.it: INFN-MILANO
- t2-hlr-01.na.infn.it: INFN-NAPOLI, INFN-NAPOLI-ATLAS
- gridhlr.pi.infn.it: INFN-PISA
- t2-hlr-01.roma1.infn.it: INFN-ROMA1, INFN-ROMA1-CMS, INFN-ROMA1-VIRGO
- grid005.to.infn.it: INFN-TORINO
Reference HLRs:
- prod-hlr-01.pd.infn.it (INFN-PADOVA): reference for central-northern area sites
- prod-hlr-01.ct.infn.it (INFN-CATANIA): reference for central-southern area sites
- infn-hlr-01.ct.pi2s2.it (INFN-CATANIA): reference for COMETA Project sites
- dgas-enmr.cerm.unifi.it: reference for the CIRMMP site
VO Dedicated Services
- CDF: 2 WMS, 2 LB
- CMS: 8 WMS, 3 LB
- ALICE:
- ATLAS:
- LHCB: 3 WMS, 3 LB
Experimental Services
- Tests on some components released by the developers, in parallel with SA3
- Application of the latest patches, as soon as they are released, on some WMS instances in production, to let the VOs test their compatibility with their tools
- CreamCE: in collaboration with some sites where several instances have been installed
Deployment Status (I)
56 production sites and 6 sites dedicated to training; 57 active sites: 40 INFN sites and 22 sites of other institutions (CNR, ENEA, ESA, INAF, SISSA, SPACI, uniCT, uniPA, uniPG).
CERTIFIED: GRISU-COMETA-INFN-LNS, GRISU-COMETA-ING-MESSINA, GRISU-COMETA-UNICT-DIIT, GRISU-COMETA-UNICT-DMI, GRISU-COMETA-UNIPA, GRISU-CYBERSAR-CAGLIARI, GRISU-CYBERSAR-PORTOCONTE, GRISU-ENEA-GRID, GRISU-SPACI-NAPOLI, GRISU-SPACI-LECCE, GRISU-UNINA, INFN-BARI, INFN-BOLOGNA, INFN-CAGLIARI, INFN-CATANIA, INFN-CNAF, INFN-CNAF-LHCB, INFN-FERRARA, INFN-FRASCATI, INFN-GENOVA, INFN-LNL-2, INFN-LNS, INFN-MILANO-ATLASC, INFN-NAPOLI, INFN-NAPOLI-ARGO, INFN-NAPOLI-ATLAS, INFN-NAPOLI-CMS, INFN-NAPOLI-PAMELA, INFN-PADOVA, INFN-PADOVA-CMS, INFN-PARMA, INFN-PERUGIA, INFN-PISA, INFN-ROMA1, INFN-ROMA1-CMS, INFN-ROMA1-TEO, INFN-ROMA1-VIRGO, INFN-ROMA2, INFN-ROMA3, INFN-T1, INFN-TORINO, INFN-TRIESTE, SISSA-TRIESTE, SNS-PISA, UNI-PERUGIA, UNINA-EGEE
Also CERTIFIED: GILDA-INFN-CATANIA, GILDA-PADOVA, GILDA-SIRIUS, GILDA-VEGA, ICEAGE-CATANIA, CIRMMP, CNR-ILC-PISA, CNR-PROD-PISA, ESA-ESRIN, GRISU-COMETA-INAF-CT, GRISU-COMETA-INFN-CT
Other statuses: INFN-CS (long downtime), INFN-LECCE (farm migration to SL4), ITB-BARI (farm reinstallation), SPACI-CS-IA64 (SUSPENDED)
Release INFNGRID
Based on gLite 3. The INFNGRID 3.1 release is available for SL4 (32bit & 64bit) and SL5 (64bit). Several customizations:
- additional VOs (~20)
- Secure Storage System
- CreamCE
- accounting (DGAS): new profile (HLR server) + additional packages on CE and WN
- monitoring (GridICE)
- Quattor (collaboration with CNAF-T1)
- Dynamic Information Providers for LSF: corrected configuration, new vomaxjobs (3.1/SL4)
- Preconfigured/improved support for MPI (MPICH, MPICH2)
- GRelC (Grid Relational Catalog)
- StoRM (Storage Resource Manager)
- GFAL Java API & NTP
- GRIDFTP server for SL4/i386, x86_64
- HYDRA (Encrypted File Storage Solution)
Work in progress:
- AMGA Web Interface
- GSAF (Grid Storage Access Framework)
- gLite for Windows with torque/maui support
Regional VOs
2837 registered users. Users per VO:
argo 29, bio 70, compassit 11, compchem 89, cyclops 21, egrid 28, enea 15, enmr.eu 126, euchina 62, euindia 69, eumed 104, eticsproject.eu 5, glast.org 9, gridit 148, inaf 27, infngrid 226, ingv 13, libi 18, lights.infn.it 24, pacs.infn.it 4, pamela, planck 39, superbvo.org, theophys 85, tps.infn.it 6, virgo
Introducing a new VO
When an experiment asks to enter the grid and form a new VO, a formal request is needed, followed by some technical steps.
Formal part:
- Agree on the needed resources and the economic contribution between the experiment and the grid Executive Board (EB)
- Pick out the software that will be used and verify that it works
- Verify the possibility of support in the various INFN-GRID production sites
- Communicate to the IT-ROC the names of the VO managers, the software managers, and the people responsible for the resources and for the support of the experiment software for the users at every site
- Specify the software requirements, the kind of jobs and the final storage destination (CASTOR, SE, experiment disk server)
Introducing a new VO
Once the Executive Board (EB) has approved the experiment request, the technical part begins:
- IT-ROC creates the VO on its VOMS server (if one does not already exist)
- IT-ROC creates the VO support group on the ticketing system
- The VO manager fills in the VO identity card on the CIC portal
- IT-ROC announces the existence of the new VO and informs the sites about how to enable it
CIC Portal and VO Identity Card
The CIC Portal has been created as part of the SA1 activity. It is dedicated to:
- being a management and operations tool
- being an entry point for all EGEE actors for their operational needs
- managing the available information about EGEE VOs and related VOs
- monitoring and ensuring day-to-day operations on grid resources and services
Every VO supported in EGEE has to be registered in the VO identity card:
- General information
- VOMS information
- Contacts
Freedom of Choice for Resources
Freedom of Choice for Resources (FCR) is a VO policy enforcement tool used to manipulate top-level BDIIs. It is fully integrated with the SAM (Service Availability Monitoring) framework. FCR allows the VOs to define a preference on grid resources, optionally taking the SAM test results into account as well. Only VO responsibles (VO software managers, etc.) can access the FCR admin pages, where they can modify their VO's FCR profile. They can select the:
- set of critical tests for all services
- set of site resources (CEs and SEs) to be used by the VO
- set of central service nodes (note: this will be used in the future)
Changes are written to the database (shared with SAM), and an LDAP ldif file is created, which the top-level BDIIs download every 2 minutes in order to apply the site resource changes.
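The selection FCR applies can be sketched as a filter: a resource stays visible in the VO's top-level BDII view only if the VO has selected it and none of the VO's critical SAM tests failed on it. This is an illustrative sketch with invented names, not the FCR implementation:

```python
# Sketch of FCR-style filtering: keep only resources the VO selected
# that have no failing critical SAM test.

def fcr_filter(resources, selected, critical_failures):
    return [r for r in resources
            if r in selected and r not in critical_failures]

resources = ["ce01.example.org", "ce02.example.org", "se01.example.org"]
selected = {"ce01.example.org", "se01.example.org"}   # VO's chosen resources
critical_failures = {"se01.example.org"}              # failed critical tests
print(fcr_filter(resources, selected, critical_failures))  # ['ce01.example.org']
```

The surviving list corresponds to what would end up in the ldif file the top-level BDIIs download every 2 minutes.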
HLRmon
News on Tier1 & Tier2 data
New section dedicated to Tier1 and Tier2 sites: charts on the usage of computing and storage resources (LHC and non-LHC VOs). This news was presented to the INFN board on 30 April and was much appreciated.
WMS MONITOR (I)
WMS MONITOR (II)
Useful links Italian production grid: http://www.italiangrid.org/ Ticketing system: https://ticketing.cnaf.infn.it/checklist-new/ SAM: https://lcg-sam.cern.ch:8443/sam/sam.py HLR MON: https://dgas.cnaf.infn.it/hlrmon/report/charts.php WMS MON: https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php gLite Middleware: http://glite.web.cern.ch/glite/default.asp