Status of the Production Grid


1 Status of the Production Grid
Alessandro Paolini (INFN-CNAF), Workshop CCR ed INFNGRID 2009, Porto Palau, 11–15 May 2009

2 Primary components of the production grid
The primary components of the Italian Production Grid are:
- Computing and storage resources
- Access points to the grid
- Services
Other elements are just as fundamental for operating, managing and monitoring the grid:
- Middleware
- Monitoring tools
- Accounting tools
- Management and control infrastructure
- Users

3 GRID Management
Grid management is performed by the Italian Regional Operation Center (ROC). The main activities are:
- Production and testing of the INFNGRID release
- Deployment of the release to the sites, support to local administrators and site certification
- Periodic checks of resource and service status
- Support at the Italian level
- Support at the European level
- Introduction of new Italian sites into the grid
- Introduction of new regional VOs into the grid

4 The Italian Regional Operation Center (ROC)
- Operations Coordination Centre (OCC): management and oversight of all operational and support activities
- Regional Operations Centres (ROC): provide the core of the support infrastructure, each supporting a number of resource centres within its region
- Grid Operator on Duty
- Grid User Support (GGUS): at FZK, coordination and management of user support, single point of contact for users
One of the 10 existing ROCs in EGEE

5 Central Management Team
Central Management Team (CMT) shifts:
- Running for 4 years
- About 20 supporters; daily shifts from Monday to Friday, 2 people per shift; a report is compiled at the end of each shift
- Grid supervision, ticket follow-up until resolution
- Site certification
- Chasing site managers and experts for quick problem resolution
- Duplication of tickets (COD tickets and CMT tickets)

6 ROC Shifts: New Model
Supporters are divided into two groups:
IT ROC 1st line
- 2 people on shift
- No relationship with C-COD; this is a responsibility of the 2nd line
- Opens tickets
- Relies on 2nd line support for complex tickets
IT ROC 2nd line
- Smaller team of experts (about 3 FTE, including the INFN Grid release team)
- Works on tickets opened by the 1st line
- 1 person on shift
- IT ROC C-COD representative and interface to C-COD
- Responsible for suspensions in the region
Note: highlight the change in the shift system

7 Users and sites support
EGEE uses the GGUS (Global Grid User Support) ticketing system. Each ROC uses its own tools, interfaced to GGUS bidirectionally. By means of Web services, it is possible to:
- Transfer tickets from the global to the regional system
- Transfer tickets from the regional to the global system
The support groups to which tickets are addressed are defined both in GGUS and in the regional systems.
In the Italian Regional Operation Centre, the ticketing system is based on XOOPS/xHelp.

8 Italian support system

9 GRID Services
These services allow you to use the grid resources:
- Resource Broker (RB) / Workload Management System (WMS): accept submitted jobs and send them to the appropriate resources
- Information System (IS): provides information about the grid resources and their status
- Virtual Organization Management System (VOMS): database for the authentication and authorization of users
- Gridice: monitoring of resources, services and jobs
- Home Location Register (HLR): database for accounting information on resource usage
- LCG File Catalog (LFC): file catalog
- File Transfer Service (FTS): efficient and reliable file movement
- MonBox: collector for local R-GMA data

10 General Purpose Services
[Diagram: general-purpose service instances, grouped by scope (test sites, Italian sites, EGEE sites). Service types shown: MYPROXY, BDII, LFC, WMS, LB, RB, VOMS (with replicas), WMS+LB+BDII.]
Hosts shown:
- myproxy.cnaf.infn.it
- gridit-cert-rb.cnaf.infn.it
- gridit-bdii-01.cnaf.infn.it
- lfcserver.cnaf.infn.it
- gridit-wms-01.cnaf.infn.it
- lb009.cnaf.infn.it
- prod-bdii-01.pd.infn.it
- top-bdii03.cnaf.infn.it
- top-bdii01.cnaf.infn.it
- top-bdii02.cnaf.infn.it
- egee-bdii.cnaf.infn.it
- glite-rb-00.cnaf.infn.it
- prod-lb-01.pd.infn.it
- prod-wms-01.pd.infn.it
- albalonga.cnaf.infn.it
- egee-wms-01.cnaf.infn.it
- voms.cnaf.infn.it
- egee-rb-01.cnaf.infn.it
- voms-01.pd.infn.it
- voms2.cnaf.infn.it
- voms-02.pd.infn.it
- wms-lb.ct.infn.it

11 Services Monitored by NAGIOS
The CNAF machines are monitored by Nagios: in case of failures, notifications, including SMS, are enabled.
When a HOST goes down, an SMS is sent for the following grid services: FTS, LFC, WMSMON, VOMS (master and replica), CE, NTP.
The processes running on the machines are checked every 5 minutes. If a process is found CRITICAL (an SMS is sent), 4 more checks are made, 1 minute apart. At the second of these checks, when the check state is CRITICAL SOFT 2, a restart of the service is attempted.
HOST: CHECK (RESTART)
- FTS: FTS AGENTS, FTS BDII (restart: TO DO)
- LFC: lfc-dli, lfc-deamon, globus-mds, mysql (restart: YES)
- VOMS: voms, ports
- BDII: bdii (restart: NO)
- CE: site-bdii
- NTP: ntpd
- WMS: WMS-POOL
Dynamic DNS update: the services are checked every minute. If a service is found CRITICAL, 3 more checks are made, 1 minute apart. If the CRITICAL state persists, the IP of the host running the critical service is removed from the alias top-bdii.grid.cnaf.infn.it.
Hosts involved:
- top-bdii01.cnaf.infn.it
- top-bdii02.cnaf.infn.it
- top-bdii03.cnaf.infn.it
- prod-bdii-01.pd.infn.it
The reasons
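The retry logic on this slide can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the actual Nagios configuration used at CNAF: the `RetryPolicy` class, the `evaluate` and `update_alias` functions, the action names and the IP addresses are all hypothetical. It assumes the first result in a sequence is the regular scheduled check and the following ones are the one-minute rechecks.

```python
from dataclasses import dataclass
from typing import Optional

OK, CRITICAL = "OK", "CRITICAL"


@dataclass
class RetryPolicy:
    """Hypothetical description of one monitoring rule from the slide."""
    max_rechecks: int                          # extra checks after the first CRITICAL
    restart_at_recheck: Optional[int] = None   # recheck number that triggers a restart
    notify: bool = False                       # send a notification on the first CRITICAL


# Process monitoring: 4 rechecks, restart attempted at the 2nd one, SMS sent.
PROCESS_POLICY = RetryPolicy(max_rechecks=4, restart_at_recheck=2, notify=True)
# Dynamic DNS update: 3 rechecks, no restart.
DNS_POLICY = RetryPolicy(max_rechecks=3)


def evaluate(results, policy):
    """Return the actions taken for a sequence of check results."""
    actions = []
    if not results or results[0] == OK:
        return actions
    if policy.notify:
        actions.append("notify")
    rechecks = results[1:1 + policy.max_rechecks]
    for n, result in enumerate(rechecks, start=1):
        if result == OK:
            actions.append("recovered")
            return actions
        if n == policy.restart_at_recheck:
            actions.append("restart")
    # Only declare a persistent failure once every recheck has been seen.
    if len(rechecks) == policy.max_rechecks:
        actions.append("hard_critical")
    return actions


def update_alias(alias_ips, host_ip, actions):
    """Drop host_ip from the alias if its CRITICAL state persisted."""
    if "hard_critical" in actions:
        return [ip for ip in alias_ips if ip != host_ip]
    return list(alias_ips)
```

For example, a process that stays CRITICAL through two rechecks and then recovers yields `["notify", "restart", "recovered"]`, while a BDII that stays CRITICAL through all three DNS rechecks is removed from the alias list.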

12 Accounting using DGAS
DGAS (Distributed Grid Accounting System) is fully deployed in INFNGrid (13 site HLRs + 1 second-level HLR, in testing).
The site HLR is a service designed to manage a set of 'accounts' for the Computing Elements of a given computing site. For each job executed on a Computing Element (or on a local queue), the Usage Record for that job is stored in the database of the site HLR.
Each site HLR can:
- Receive Usage Records from the registered Computing Elements.
- Answer site manager queries such as:
  - Detailed job list queries (with many search keys: per user, VO, FQAN, CEId…)
  - Aggregate usage reports, such as per hour, day, month…, with flexible search criteria.
- Optionally forward Usage Records to the APEL database.
- Optionally forward Usage Records to a VO-specific HLR.
[Diagram labels: Resource layer (Usage Metering); Site layer (Site HLR); GOC. Annotations: detailed resource usage info, job-level info; aggregate site info, VO (with role/group) usage on the site.]
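The two kinds of site manager query mentioned here (detailed job lists and aggregate reports) can be illustrated with a small Python sketch. This is not the DGAS schema or API: the `UsageRecord` fields, the `job_list` and `aggregate_cpu` functions and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical per-job record; real DGAS records carry more fields."""
    job_id: str
    user: str
    vo: str
    ce_id: str
    cpu_seconds: int
    end_time: datetime


def job_list(records, **keys):
    """Detailed job list: keep records matching every given search key."""
    return [r for r in records
            if all(getattr(r, k) == v for k, v in keys.items())]


def aggregate_cpu(records, period="%Y-%m-%d"):
    """Aggregate report: total CPU seconds per (VO, period).

    `period` is a strftime pattern, e.g. "%Y-%m-%d %H" for hourly,
    "%Y-%m-%d" for daily or "%Y-%m" for monthly totals.
    """
    totals = defaultdict(int)
    for r in records:
        totals[(r.vo, r.end_time.strftime(period))] += r.cpu_seconds
    return dict(totals)


# Made-up sample data, for illustration only.
SAMPLE = [
    UsageRecord("j1", "alice", "cms", "ce01", 120, datetime(2009, 5, 11, 10)),
    UsageRecord("j2", "bob", "cms", "ce01", 60, datetime(2009, 5, 11, 23)),
    UsageRecord("j3", "alice", "atlas", "ce02", 30, datetime(2009, 5, 12, 1)),
]
```

Calling `job_list(SAMPLE, user="alice")` returns the two jobs owned by that user, and `aggregate_cpu(SAMPLE, "%Y-%m")` collapses the same records into monthly totals per VO.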

13 Tier1 & Tier2 HLRs
11 site Home Location Registers for Tier1 and Tier2:
host: site
- hlr-t1.cr.cnaf.infn.it: INFN-T1
- prod-hlr-02.ct.infn.it: INFN-CATANIA
- prod-hlr-01.pd.infn.it: INFN-PADOVA
- prod-hlr-01.ba.infn.it: INFN-BARI
- atlashlr.lnf.infn.it: INFN-FRASCATI
- t2-hlr-01.lnl.infn.it: INFN-LEGNARO
- prod-hlr-01.mi.infn.it: INFN-MILANO
- t2-hlr-01.na.infn.it: INFN-NAPOLI, INFN-NAPOLI-ATLAS
- gridhlr.pi.infn.it: INFN-PISA
- t2-hlr-01.roma1.infn.it: INFN-ROMA1, INFN-ROMA1-CMS, INFN-ROMA1-VIRGO
- grid005.to.infn.it: INFN-TORINO
2 HLRs for the small and medium-sized sites:
- HLR prod-hlr-01.ct.infn.it (INFN-CATANIA), reference for central-southern area sites: ENEA-INFO, INFN-ROMA3, INFN-CAGLIARI, CYBERSAR-CAGLIARI, INFN-LECCE, SPACI-CS-IA64, INFN-LNS, UNINA-EGEE, INFN-NAPOLI-ARGO, SPACI-LECCE, INFN-NAPOLI-CMS, SPACI-NAPOLI
- HLR prod-hlr-01.pd.infn.it (INFN-PADOVA), reference for central-northern area sites: CNR-ILC-PISA, INFN-GENOVA, CNR-PROD-PISA, INFN-PARMA, INAF-TRIESTE, INFN-PERUGIA, INFN-CNAF, INFN-TRIESTE, INFN-BOLOGNA, SNS-PISA, INFN-FERRARA, UNIV-PERUGIA, INFN-FIRENZE

14 VO Dedicated Services
- CDF: 2 WMS, 2 LB
- CMS: 8 WMS, 3 LB
- ALICE:
- ATLAS:
- LHCB: 3 WMS, 3 LB

15 Experimental Services
- Tests on some components released by the developers, in parallel with SA3
- Application of the latest released patches to some production WMS instances, so that VOs can test compatibility with their tools
- CreamCE: in collaboration with some sites where several instances have been installed

16 Deployment Status (I)
50 sites in total:
- 41 active sites
- 3 sites in the certification phase
- 35 INFN sites
- 15 sites from other institutions (CNR, ENEA, ESA, INAF, SPACI, Univ. PG)
- 1 site with IA64 architecture
SITE: STATUS
CERTIFIED (41): CIRMMP, CNR-ILC-PISA, CNR-PROD-PISA, CYBERSAR-CAGLIARI, ENEA-INFO, ESA-ESRIN, INFN-BARI, INFN-BOLOGNA, INFN-CAGLIARI, INFN-CATANIA, INFN-CNAF, INFN-CNAF-LHCB, INFN-FRASCATI, INFN-GENOVA, INFN-LNL-2, INFN-LNS, INFN-MILANO, INFN-NAPOLI, INFN-NAPOLI-ARGO, INFN-NAPOLI-ATLAS, INFN-NAPOLI-CMS, INFN-NAPOLI-PAMELA, INFN-PADOVA, INFN-PARMA, INFN-PERUGIA, INFN-PISA, INFN-ROMA1, INFN-ROMA1-CMS, INFN-ROMA1-TEO, INFN-ROMA1-VIRGO, INFN-ROMA3, INFN-T1, INFN-TORINO, INFN-TRIESTE, SISSA-TRIESTE, SNS-PISA, SPACI-NAPOLI, SPACI-LECCE, SPACI-CS-IA64, UNI-PERUGIA, UNINA-EGEE
Other sites:
- INFN-CS: tests ongoing
- INFN-FERRARA
- INFN-LECCE: farm migration to SL4
- INFN-MILANO-ATLASC
- INFN-ROMA2
- INAF-TRIESTE: hardware problems
- INFN-CASCINA
- INFN-FIRENZE: support unavailable
- ITB-BARI: farm reinstallation

17 Availability Statistics
- Slight improvement compared to the 2008 average
- We are among the worst (second to last in March and April)
- N.B. ROC averages are weighted by site resources
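As a reminder of how a resource-weighted ROC average behaves, here is a minimal Python sketch. The slide does not say which resource measure is used as the weight, so the weights below (and the function name `weighted_availability`) are purely hypothetical; the point is only that a large site with low availability pulls the regional figure down more than a small one.

```python
def weighted_availability(sites):
    """Weighted mean of site availabilities.

    `sites` is a sequence of (availability, weight) pairs, where the weight
    stands for the site's resources in some assumed unit.
    """
    sites = list(sites)
    total_weight = sum(weight for _, weight in sites)
    if total_weight == 0:
        raise ValueError("at least one site must have a non-zero weight")
    return sum(avail * weight for avail, weight in sites) / total_weight


# Made-up numbers: a small site at 90% and a large site at 50%.
# The plain mean would be 0.7; the weighted mean is lower.
example = weighted_availability([(0.9, 100), (0.5, 300)])
```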

18 Availability Statistics
Sites with low values in the last month include:
- UNINA-EGEE: in production since 30 April, first tests failed because of the firewall
- INFN-ROMA1-CMS: problem with site support
- INFN-CNAF-LHCB: Tier1 downtime and low reliability
A working group helps sites improve their statistics.
Every month, explanations must be provided for the sites that obtained low values.

19 Release INFNGRID
Based on gLite3
The O.S. transition phase is finished: the INFNGRID 3.0 release for SL3 is no longer supported.
Several customizations:
- Additional VOs (~20)
- Secure Storage System
- CreamCE
- Accounting (DGAS): new profile (HLR server) + additional packages on CE and WN
- Monitoring (GRIDICE)
- Quattor (collaboration with CNAF-T1)
- Dynamic Information-Providers for LSF: corrected configuration, new vomaxjobs (3.1/SL4)
- Preconfigured support for MPI
- GRelC (Grid Relational Catalog)
- StoRM (Storage Resource Manager)
- GFAL Java API & NTP
- GRIDFTP server, versions for SL4 i386 and x86_64
Work in progress:
- Patched MyProxy (long-lived proxy delegation with VOMS extensions)
- AMGA Web Interface
- GSAF (Grid Storage Access Framework)
- gLite for Windows with torque/maui support
StoRM 1.4: general improvement of the directory structure; many new features in BE; more flexibility in FE configuration
GridICE: support for the new lemon-agent & lemon-sensor-linux
x86_64 profiles: direct testing of the latest deployed profiles; more x86_64 host installations on our certification testbed
SL5 test: first tests with OS installation and repository mirroring (CREAM is on the go with the new SL5 build)

20 Regional VOs
2405 users registered in CDF
VO: users
- argo: 29
- bio: 67
- compassit: 8
- compchem: 82
- cyclops: 15
- egrid: 28
- enea: 14
- enmr.eu: 65
- euchina: 60
- euindia: 68
- eumed: 104
- eticsproject.eu: 4
- glast.org:
- gridit: 144
- inaf: 27
- infngrid: 220
- ingv: 13
- libi: 17
- lights.infn.it: 22
- pacs.infn.it: 3
- pamela: 21
- planck: 38
- superbvo.org:
- theophys: 80
- tps.infn.it: 6
- virgo: 39

21 Some accounting…
[Charts: number of jobs per VO (labels shown: theophys, enmr.eu, argo, glast.org, pamela); number of jobs per site (labels shown: INFN-PISA, INFN-PADOVA, INFN-NAPOLI-PAMELA, INFN-NAPOLI-ARGO, INFN-ROMA1-VIRGO)]

22 HLRmon
- Developed by INFN
- At the executive board meeting of 30 April, the new portal was well received, with its new pages for Tier1 and Tier2 and the storage accounting
- New charts

23 News on Tier1 & Tier2 data
New section dedicated to Tier1 and Tier2: charts on the use of computing and storage resources (LHC and non-LHC VOs). This new feature was presented to the INFN executive board on 30 April and was very well received.

24 WMS MONITOR (I)

25 WMS MONITOR (II)

26 Useful links
- Italian grid project: http://grid.infn.it/
- Italian production grid:
- HLR MON:
- WMS MON:
- gLite Middleware:

