Introduction to the INFNGRID project

Presentation on: "Introduction to the INFNGRID project". Presentation transcript:

1 Introduction to the INFNGRID project
Alessandro Paolini (INFN-CNAF)
III INFN training course for GRID site administrators
INFN-CATANIA, 2-6 November 2009

2 Primary components of the production grid
The primary components of the Italian Production Grid are:
- Computing and storage resources
- Access point to the grid
- Services
Other elements are equally fundamental for the operation, management and monitoring of the grid:
- Middleware
- Monitoring tool
- Accounting tool
- Management and control infrastructure
- Users

3 GRID Management
Grid management is performed by the Italian Regional Operation Center (ROC). The main activities are:
- Production and testing of the INFNGRID release
- Deployment of the release to the sites, support for local administrators and site certification
- Periodic checks of the status of resources and services
- Support at the Italian level
- Support at the European level
- Introduction of new Italian sites into the grid
- Introduction of new regional VOs into the grid

4 The Italian Regional Operation Center (ROC)
- Operations Coordination Centre (OCC): management and oversight of all operational and support activities
- Regional Operations Centres (ROC): provide the core of the support infrastructure, each supporting a number of resource centres within its region
- Grid Operator on Duty
- Grid User Support (GGUS): at FZK, coordination and management of user support; single point of contact for users
One of the 10 existing ROCs in EGEE

5 ROC Shifts
About 15 supporters carry out a checking activity organized as 1 shift per week, with 3 people per shift:
- 2 people provide first-level support to sites and users (1st line supporters team)
- 1 person (2nd line supporter) steps in for more complex problems
The main activities are:
- Checking the grid status and reporting problems, following them until they are solved where possible
- Certifying sites during the deployment phases
- Checking the tickets that are still open and pressing the experts or the site managers to answer and solve them
Highlight the change in the shift system

6 Users and sites support
EGEE makes use of the GGUS (Global Grid User Support) ticketing system. Each ROC uses different tools interfaced to GGUS in a bidirectional way. By means of Web services, it is possible to:
- Transfer tickets from the global to the regional system
- Transfer tickets from the regional to the global system
The support groups to which tickets are addressed are defined both in GGUS and in the regional systems. In the Italian Regional Operation Centre the ticketing system used is based on XOOPS/xHelp.

7 Italian support system

8 Interface to GGUS
Tickets arriving from GGUS that concern a particular site are automatically assigned to that site in our system. For tickets that arrived from GGUS, it is possible to choose whether to send the reply to GGUS or to keep it internal to our system.
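The routing rule on this slide can be illustrated with a minimal sketch. It is not the XOOPS/xHelp or GGUS implementation: the `Ticket` class, its field names and the "GGUS"/"regional" labels are hypothetical and exist only to show the two decisions (automatic assignment to a site, and where a reply is sent).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticket:
    ticket_id: int
    origin: str                          # "GGUS" or "regional" (hypothetical labels)
    site: Optional[str] = None           # site the ticket concerns, if any
    assigned_group: Optional[str] = None


def route_incoming(ticket: Ticket, known_sites: set[str]) -> Ticket:
    """Assign a ticket coming from GGUS to its site group when it names a known site."""
    if ticket.origin == "GGUS" and ticket.site in known_sites:
        ticket.assigned_group = ticket.site
    return ticket


def reply_target(ticket: Ticket, send_to_ggus: bool) -> str:
    """Choose where a reply goes: back to GGUS, or kept in the regional system."""
    if ticket.origin == "GGUS" and send_to_ggus:
        return "GGUS"
    return "regional"
```

A ticket without a site, or naming a site unknown to the regional system, is left unassigned so that a supporter can handle it by hand.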

9 Service Availability Monitoring (SAM)
SAM jobs are launched every hour and make it possible to detect job submission problems, including batch system errors, outdated CAs and replica errors. There are also more specific tests for SRMv2, SE, LFC, CREAMCE, WMS, BDII, etc.

10 Service Availability Monitoring (SAM)
CE Tests

11 Service Availability Monitoring (SAM)
RGMA (MONBOX) Tests
Test whether the service host certificate is valid.
On the MONBOX, the host certificate is also present in the rgma and tomcat (hidden) directories, which usually are:
- /etc/tomcat5/
- /opt/glite/var/rgma/.certs/
So when you change the host certificate you also have to restart:
- /etc/rc.d/init.d/rgma-servicetool
- /etc/rc.d/init.d/tomcat5

12 SAM Admin
- SAM jobs are available for both EGEE production and preproduction sites
- Each site manager can submit new SAM tests on their own site
- Each ROC can submit new test jobs on the sites of its own region

13 GSTAT
GSTAT queries the Information System every 5 minutes. The sites and nodes checked are those registered in the GOC DB. Inconsistencies in the published information, and any missing service that a site should publish, are reported as errors.
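As an illustration of the "missing service" check, the sketch below compares what a site is registered to provide with what it actually publishes. It does not query the real GOC DB or Information System: both inputs are plain dictionaries, and the function name and message format are assumptions made for this example.

```python
def find_publication_errors(registered: dict[str, set[str]],
                            published: dict[str, set[str]]) -> list[str]:
    """Report services that are registered for a site but not published by it."""
    errors = []
    for site, expected in sorted(registered.items()):
        found = published.get(site, set())
        for service in sorted(expected - found):
            errors.append(f"{site}: missing service {service}")
    return errors


# Hypothetical data: SITE-A does not publish its SE, SITE-B publishes nothing.
registered = {"SITE-A": {"CE", "SE", "sBDII"}, "SITE-B": {"CE"}}
published = {"SITE-A": {"CE", "sBDII"}}
for line in find_publication_errors(registered, published):
    print(line)
```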

14 Introducing a new site
Before entering the grid, each site has to accept several rules of conduct, described in a Memorandum of Understanding (MoU). The COLG (Grid Local Coordinator) reads and signs it, and faxes the document to INFN-CNAF. Moreover, all sites must provide this alias: This alias will be used to report problems and will be added to the site managers' mailing list. Of course, it should include all site managers of your grid site. At this point, IT-ROC registers the site and the site managers in the GOC-DB, and creates a supporter-operative group in the XOOPS ticketing system. Site managers have to register in XOOPS, so that they can be assigned to their supporter-operative groups; each site manager also has to register in the test VOs infngrid and dteam.

15 Introducing a new site
Site managers install the middleware, following the instructions distributed by the Release Team. When finished, they run some preliminary tests and then ask to be certified by their own ROC. IT-ROC opens a ticket to communicate with the site managers during the certification.

16 Memorandum of Understanding
Every site has to:
- Provide computing and storage resources. Farm dimensions (at least 10 CPUs) and storage capacity will be agreed with each site
- Guarantee sufficient manpower to manage the site: at least 2 people
- Manage the site resources efficiently: middleware installation and upgrade, patch application, and configuration changes as requested by CMT, within the maximum time stated for each operation
- Answer tickets within 24 hours (T2) or 48 hours (other sites), from Monday to Friday
- Check its own status from time to time
- Guarantee continuity of site management and support, including during holiday periods
- Participate in SA1/Production-Grid phone conferences and meetings and compile the weekly pre-report
- Keep the information in the GOC DB up to date
- Enable the test VOs (ops, dteam and infngrid), with a higher priority than other VOs
Any non-fulfilment noticed by the ROC will be referred to the biweekly INFNGRID phone conferences, then to the COLG and, if necessary, to the EB.
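The ticket response times can be turned into a concrete deadline with a small sketch. It rests on one assumption that the slide does not spell out: "from Monday to Friday" is read here as "hours on Saturday and Sunday do not count". The function names and the "T2" key are hypothetical.

```python
from datetime import datetime, timedelta

RESPONSE_HOURS = {"T2": 24}   # from the MoU slide
DEFAULT_HOURS = 48            # all other sites


def _next_midnight(t: datetime) -> datetime:
    return (t + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)


def _skip_weekend(t: datetime) -> datetime:
    """Move a Saturday or Sunday timestamp to the following Monday at 00:00."""
    while t.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        t = _next_midnight(t)
    return t


def response_deadline(opened: datetime, site_type: str) -> datetime:
    """Add the allowed response time, counting only Monday-to-Friday hours."""
    remaining = timedelta(hours=RESPONSE_HOURS.get(site_type, DEFAULT_HOURS))
    t = _skip_weekend(opened)
    while remaining > timedelta(0):
        step = min(remaining, _next_midnight(t) - t)  # consume up to end of day
        t += step
        remaining -= step
        if remaining > timedelta(0):
            t = _skip_weekend(t)
    return t
```

Under this reading, a ticket opened for a T2 site on Friday at 12:00 is due on the following Monday at 12:00.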

17 Registration in GOC DB
3 phases:
- candidate: monitoring is OFF
- uncertified: monitoring can be turned on
- certified: monitoring is ON

18 Availability & Reliability (I)
- The status of the CE, SE, SRM and sBDII services is taken into account, as given by the results of the SAM tests
- A logical AND is applied across these services, and a logical OR across services of the same type when a site has several instances of the same service
- A site must be available at least 70% of the time each month (daily availability is measured over 24 hours)
- The site's reliability must be at least 75% per month (Reliability = Availability / (Availability + Unscheduled Downtime))
- Sites that do not reach the minimum Availability or Reliability threshold at the end of the month must provide an explanation
- Sites that do not exceed the 50% threshold for 3 consecutive months will be suspended
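The AND/OR rule and the reliability formula from this slide can be sketched as follows. This is only an illustration, not the actual SAM algorithm: the input format (a list of pass/fail results per service type) and the function names are assumptions.

```python
REQUIRED_TYPES = ("CE", "SE", "SRM", "sBDII")


def site_is_available(instances: dict[str, list[bool]]) -> bool:
    """AND across service types, OR across instances of the same type."""
    return all(any(instances.get(service, [])) for service in REQUIRED_TYPES)


def reliability(availability: float, unscheduled_downtime: float) -> float:
    """Reliability = Availability / (Availability + Unscheduled Downtime)."""
    total = availability + unscheduled_downtime
    return availability / total if total > 0 else 0.0


# A site with two CEs, one of which is failing, still counts as available.
status = {"CE": [False, True], "SE": [True], "SRM": [True], "sBDII": [True]}
print(site_is_available(status))
```

A service type with no instance at all is treated as failing, which is a choice made for this sketch.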

19 Availability & Reliability (II)
- Scheduled downtime periods must be declared in advance in the GOC-DB
- Scheduled downtimes negatively affect availability, but not reliability
- Downtimes longer than one month are granted only in exceptional cases, and must be approved beforehand by the ROC
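A short worked example may help to see why declaring a downtime matters. It applies the formula Reliability = Availability / (Availability + Unscheduled Downtime) to a hypothetical 30-day month (720 hours) with 72 hours of downtime, once declared and once not. The function name and the figures are illustrative only.

```python
def monthly_metrics(total_hours: float, scheduled_down: float,
                    unscheduled_down: float) -> tuple[float, float]:
    """Return (availability, reliability) as fractions of the month."""
    up = total_hours - scheduled_down - unscheduled_down
    availability = up / total_hours
    unscheduled = unscheduled_down / total_hours
    denominator = availability + unscheduled
    reliability = availability / denominator if denominator > 0 else 0.0
    return availability, reliability


# Same 72 hours of downtime, declared in advance versus not declared.
declared = monthly_metrics(720, scheduled_down=72, unscheduled_down=0)
undeclared = monthly_metrics(720, scheduled_down=0, unscheduled_down=72)
print(f"declared:   availability {declared[0]:.2f}, reliability {declared[1]:.2f}")
print(f"undeclared: availability {undeclared[0]:.2f}, reliability {undeclared[1]:.2f}")
```

In both cases availability is 90%, but reliability is 100% when the downtime was declared and 90% when it was not.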

20 GRID Services
They allow you to use the grid resources:
- Resource Broker (RB) / Workload Management System (WMS): responsible for accepting submitted jobs and sending them to the appropriate resources
- Information System (IS): provides information about the grid resources and their status
- Virtual Organization Management System (VOMS): database for the authentication and authorization of users
- GridICE: monitoring of resources, services and jobs
- Home Location Register (HLR): database for the accounting information on resource usage
- LCG File Catalog (LFC): file catalog
- File Transfer Service (FTS): efficient and reliable file movement
- MonBox: collector for local R-GMA data

21 Access to the GRID
Access is by means of a User Interface (UI). It can be:
- A dedicated PC, installed in a similar way to the other grid elements
- A web portal:
To access the GRID you need a personal certificate issued by a Certification Authority trusted by the EGEE/LCG infrastructure: user authentication is performed through X.509 certificates. To be authorized to submit jobs you have to belong to a Virtual Organisation (VO). A VO is a kind of user group, usually working on the same project and using the same application software on the grid.

22 General Purpose Services
[Diagram of the general purpose services, grouped by scope: test sites, Italian sites, EGEE sites. Service labels: MYPROXY, BDII, LFC, WMS, LB, VOMS (with replicas), WMS+LB+BDII, SAM ADMIN.]
Hosts shown: myproxy.cnaf.infn.it, gridit-cert-rb.cnaf.infn.it, gridit-bdii-01.cnaf.infn.it, lfcserver.cnaf.infn.it, prod-bdii-01.pd.infn.it, top-bdii03.cnaf.infn.it, top-bdii01.cnaf.infn.it, top-bdii02.cnaf.infn.it, egee-bdii.cnaf.infn.it, gridbdii.fe.infn.it, gridit-wms-01.cnaf.infn.it, lb009.cnaf.infn.it, glite-rb-00.cnaf.infn.it, prod-lb-01.pd.infn.it, prod-wms-01.pd.infn.it, albalonga.cnaf.infn.it, egee-wms-01.cnaf.infn.it, voms.cnaf.infn.it, wms-lb.ct.infn.it, voms-01.pd.infn.it, glite-rb-01.cnaf.infn.it, voms2.cnaf.infn.it, voms-02.pd.infn.it

23 Services monitored by NAGIOS
The CNAF machines are monitored by Nagios: in case of failures, notifications can be sent, including by SMS.
If a HOST goes down, an SMS is sent for the following grid services: FTS, LFC, WMSMON, VOMS (master and replica), CE, NTP.
The services running on the machines are checked every 5 minutes. If they turn out to be CRITICAL (an SMS is sent), 4 more checks are made, 1 minute apart. At the second of these checks, when the check state is CRITICAL SOFT 2, a restart of the service is attempted.
HOST | CHECK | RESTART
FTS | FTS AGENTS | TO DO
FTS | FTS BDII |
LFC | lfc-dli | YES
LFC | lfc-deamon |
LFC | globus-mds |
LFC | mysql |
VOMS | voms ports |
BDII | bdii | NO
CE | site-bdii |
NTP | ntpd |
WMS | WMS-POOL |
Dynamic DNS update: the services are checked every minute. If they turn out to be CRITICAL, 3 more checks are made, 1 minute apart. If the CRITICAL state persists, the IP of the host running the critical service is removed from the alias top-bdii.grid.cnaf.infn.it.
Hosts involved: top-bdii01.cnaf.infn.it, top-bdii02.cnaf.infn.it, top-bdii03.cnaf.infn.it, prod-bdii-01.pd.infn.it, gridbdii.fe.infn.it
The reasons
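The recheck-and-restart sequence described on this slide can be simulated with a minimal sketch. It does not use Nagios itself or its configuration syntax: `check` and `restart` are hypothetical callables supplied by the caller, and the one-minute wait between rechecks is left out so that the logic can be followed (and tested) without sleeping.

```python
from typing import Callable, List


def handle_critical(check: Callable[[], bool],
                    restart: Callable[[], None],
                    max_rechecks: int = 4,
                    restart_at: int = 2) -> List[str]:
    """Replay the rechecks that follow a first CRITICAL result.

    `check` returns True when the service is OK again. A restart is attempted
    once, at the recheck number given by `restart_at`, if the service is still
    CRITICAL at that point.
    """
    log = ["CRITICAL detected"]
    for attempt in range(1, max_rechecks + 1):
        if check():
            log.append(f"recheck {attempt}: OK, recovered")
            return log
        log.append(f"recheck {attempt}: CRITICAL SOFT {attempt}")
        if attempt == restart_at:
            restart()
            log.append("restart attempted")
    log.append("still CRITICAL after all rechecks")
    return log


# Hypothetical service that comes back only after the restart attempt.
results = iter([False, False, True])
for line in handle_critical(lambda: next(results), lambda: None):
    print(line)
```

The same loop with `max_rechecks=3` and a `restart` callable that removes the host from the alias would mirror the dynamic DNS update case, under the same assumptions.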

24 Accounting using DGAS
DGAS (Distributed Grid Accounting System) is fully deployed in INFNGrid (15 site HLRs + 1 second-level HLR, in testing). The site HLR is a service designed to manage a set of 'accounts' for the Computing Elements of a given computing site. For each job executed on a Computing Element (or on a local queue), the Usage Record for that job is stored in the database of the site HLR. Each site HLR can:
- Receive Usage Records from the registered Computing Elements
- Answer site manager queries such as:
  - Detailed job list queries (with many search keys: per user, VO, FQAN, CEId…)
  - Aggregate usage reports, such as per hour, day, month…, with flexible search criteria
- Optionally forward Usage Records to the APEL database
- Optionally forward Usage Records to a VO-specific HLR
[Diagram labels: Resource's layer (Usage Metering); Site layer (Site HLR); GOC. Annotations: aggregate site info; VO (with role/group) usage on the site; detailed resource usage info; job level info]
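The two kinds of site manager query (detailed job lists and aggregate reports) can be illustrated with an in-memory sketch. This is not the DGAS schema or its query interface: the `UsageRecord` fields and both function names are hypothetical, chosen only to mirror the search keys named on the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    job_id: str
    user: str
    vo: str
    ce_id: str
    cpu_seconds: float


def jobs_matching(records, **keys):
    """Detailed job list: keep records whose fields equal the given search keys."""
    return [r for r in records
            if all(getattr(r, name) == value for name, value in keys.items())]


def cpu_by(records, field):
    """Aggregate report: total CPU seconds grouped by one record field."""
    totals = defaultdict(float)
    for r in records:
        totals[getattr(r, field)] += r.cpu_seconds
    return dict(totals)


# Hypothetical usage records stored by a site HLR.
records = [
    UsageRecord("j1", "alice", "cms", "ce01", 100.0),
    UsageRecord("j2", "bob", "cms", "ce02", 200.0),
    UsageRecord("j3", "carol", "atlas", "ce01", 50.0),
]
print(jobs_matching(records, vo="cms", ce_id="ce01"))
print(cpu_by(records, "vo"))
```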

25 Tier1 & Tier2 HLRs
11 site Home Location Registers for Tier1 and Tier2:
host | site
hlr-t1.cr.cnaf.infn.it | INFN-T1
prod-hlr-02.ct.infn.it | INFN-CATANIA
prod-hlr-01.pd.infn.it | INFN-PADOVA
prod-hlr-01.ba.infn.it | INFN-BARI
atlashlr.lnf.infn.it | INFN-FRASCATI
t2-hlr-01.lnl.infn.it | INFN-LEGNARO
prod-hlr-01.mi.infn.it | INFN-MILANO
t2-hlr-01.na.infn.it | INFN-NAPOLI, INFN-NAPOLI-ATLAS
gridhlr.pi.infn.it | INFN-PISA
t2-hlr-01.roma1.infn.it | INFN-ROMA1, INFN-ROMA1-CMS, INFN-ROMA1-VIRGO
grid005.to.infn.it | INFN-TORINO
4 HLRs for small and medium-sized sites:
- HLR prod-hlr-01.pd.infn.it (INFN-PADOVA): reference for central-northern area sites
- HLR prod-hlr-01.ct.infn.it (INFN-CATANIA): reference for central-southern area sites
- HLR infn-hlr-01.ct.pi2s2.it (INFN-CATANIA): reference for COMETA Project sites
- HLR dgas-enmr.cerm.unifi.it: reference for CIRMMP site

26 VO Dedicated Services
- CDF: 2 WMS, 2 LB
- CMS: 8 WMS, 3 LB
- ALICE:
- ATLAS:
- LHCB: 3 WMS, 3 LB

27 Experimental Services
- Tests on some components released by the developers, in parallel with SA3
- Application of the most recently released patches on some production WMSs, to let the VOs test their compatibility with their tools
- CreamCE: in collaboration with some sites where several instances have been installed

28 Deployment Status (I)
56 production sites and 6 sites dedicated to training
57 active sites
40 INFN sites
22 sites of other institutions (cnr, enea, esa, inaf, sissa, spaci, uniCT, uniPA, uniPG)
SITE | STATUS
GILDA-INFN-CATANIA, GILDA-PADOVA, GILDA-SIRIUS, GILDA-VEGA, ICEAGE-CATANIA, CIRMMP, CNR-ILC-PISA, CNR-PROD-PISA, ESA-ESRIN, GRISU-COMETA-INAF-CT, GRISU-COMETA-INFN-CT, GRISU-COMETA-INFN-LNS, GRISU-COMETA-ING-MESSINA, GRISU-COMETA-UNICT-DIIT, GRISU-COMETA-UNICT-DMI, GRISU-COMETA-UNIPA, GRISU-CYBERSAR-CAGLIARI, GRISU-CYBERSAR-PORTOCONTE, GRISU-ENEA-GRID, GRISU-SPACI-NAPOLI, GRISU-SPACI-LECCE, GRISU-UNINA, INFN-BARI, INFN-BOLOGNA, INFN-CAGLIARI, INFN-CATANIA, INFN-CNAF, INFN-CNAF-LHCB, INFN-FERRARA, INFN-FRASCATI, INFN-GENOVA, INFN-LNL-2, INFN-LNS, INFN-MILANO-ATLASC, INFN-NAPOLI, INFN-NAPOLI-ARGO, INFN-NAPOLI-ATLAS, INFN-NAPOLI-CMS, INFN-NAPOLI-PAMELA, INFN-PADOVA, INFN-PADOVA-CMS, INFN-PARMA, INFN-PERUGIA, INFN-PISA, INFN-ROMA1, INFN-ROMA1-CMS, INFN-ROMA1-TEO, INFN-ROMA1-VIRGO, INFN-ROMA3, INFN-T1, INFN-TORINO, INFN-TRIESTE, SISSA-TRIESTE, SNS-PISA, UNI-PERUGIA, UNINA-EGEE | CERTIFIED
INFN-CS | long downtime
INFN-LECCE | farm migration to sl4
INFN-ROMA2 |
ITB-BARI | Farm reinstallation
SPACI-CS-IA64 | SUSPENDED

29 Release INFNGRID
Based on gLite3. The INFNGRID 3.1 release is available for SL4 (32-bit & 64-bit) and SL5 (64-bit).
Several customizations:
- additional VOs (~20)
- Secure Storage System
- CreamCE
- accounting (DGAS): new profile (HLR server) + additional packages on CE and WN
- monitoring (GRIDICE)
- Quattor (collaboration with CNAF-T1)
- Dynamic Information-Providers for LSF: corrected configuration, new vomaxjobs (3.1/SL4)
- Preconfigured/improved support for MPI (MPICH, MPICH2)
- GRelC (Grid Relational Catalog)
- StoRM (Storage Resource Manager)
- GFAL Java API & NTP
- GRIDFTP server vers sl4/i386, x86_64
- HYDRA (Encrypted File Storage Solution)
Work in progress:
- AMGA Web Interface
- GSAF (Grid Storage Access Framework)
- gLite for Windows with torque/maui support

30 Regional VOs
2837 users registered in CDF
VO | users
argo | 29
bio | 70
compassit | 11
compchem | 89
cyclops | 21
egrid | 28
enea | 15
enmr.eu | 126
euchina | 62
euindia | 69
eumed | 104
eticsproject.eu | 5
glast.org | 9
gridit | 148
inaf | 27
infngrid | 226
ingv | 13
libi | 18
lights.infn.it | 24
pacs.infn.it | 4
pamela |
planck | 39
superbvo.org |
theophys | 85
tps.infn.it | 6
virgo |

31 Introducing a new VO
When an experiment asks to enter the grid and form a new VO, a formal request is required, followed by some technical steps.
Formal part:
- Agree the needed resources and the economic contribution between the experiment and the grid Executive Board (EB)
- Identify the software that will be used and verify that it works
- Verify that support is possible in the various INFN-GRID production sites
- Communicate to IT-ROC the names of the VO managers, the software managers, and the people responsible for the resources and for user support for the experiment software at every site
- Software requirements, kind of jobs and final storage destination (CASTOR, SE, experiment disk server)

32 Introducing a new VO
Once the Executive Board (EB) has approved the experiment's request, the technical part begins:
- IT-ROC creates the VO on its VOMS server (if one does not exist already)
- IT-ROC creates the VO support group in the ticketing system
- The VO manager fills in the VO identity card on the CIC portal
- IT-ROC announces the new VO and informs the sites how to enable it

33 CIC Portal and VO Identity Card
The CIC Portal was created as part of the SA1 activity. It is meant to:
- be a management and operations tool
- be an entry point for all EGEE actors for their operational needs
- manage the available information about EGEE VOs and related VOs
- monitor and ensure day-to-day operations on grid resources and services
Every VO supported in EGEE has to be registered in the VO identity card:
- General information
- VOMS information
- Contacts

34 Freedom of Choice for Resources
The Freedom of Choice for Resources (FCR) is a VO policy enforcement tool used to manipulate top-level BDIIs. It is fully integrated with the SAM (Service Availability Monitoring) framework. FCR allows the VOs to define a preference on Grid resources, optionally taking the SAM test results into account as well. Only VO responsibles (VO Software Managers, etc.) can access the FCR Admin Pages, where they can modify their VO's FCR profile. They can select the:
- set of Critical Tests for all services
- set of Site Resources (CEs and SEs) to be used by the VO
- set of Central Service Nodes (note: this will be used in the future)
Changes are written to the database (which is shared with SAM), and an LDAP LDIF file is created, which the top-level BDIIs download every 2 minutes in order to apply Site Resources changes.
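The effect of an FCR profile can be illustrated with a small filter: out of the resources a VO has selected, keep those that pass every test the VO marked as critical. This is a sketch of the idea only, not the FCR code or its LDIF output; the function name, the test names and the input format are assumptions.

```python
def usable_resources(selected: set[str],
                     critical_tests: set[str],
                     sam_results: dict[str, dict[str, bool]]) -> list[str]:
    """Keep VO-selected resources that pass every critical test.

    A resource with no recorded result for a critical test is treated as
    failing that test (an assumption made for this sketch).
    """
    usable = []
    for resource in sorted(selected):
        results = sam_results.get(resource, {})
        if all(results.get(test, False) for test in critical_tests):
            usable.append(resource)
    return usable


# Hypothetical profile: two critical tests, three selected resources.
selected = {"ce1", "ce2", "se1"}
critical = {"job-submit", "ca-version"}
sam = {
    "ce1": {"job-submit": True, "ca-version": True},
    "ce2": {"job-submit": False, "ca-version": True},
    "se1": {"job-submit": True},
}
print(usable_resources(selected, critical, sam))
```

With an empty set of critical tests every selected resource is kept, which corresponds to a VO that does not take SAM results into account.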

35 HLRmon

36 News on the data for Tier1 & Tier2
New section dedicated to Tier1 and Tier2: graphs on the usage of computing and storage resources (LHC and non-LHC VOs). The new feature was presented to the INFN board on 30 April and was much appreciated.

37 WMS MONITOR (I)

38 WMS MONITOR (II)

39 Useful links
- Italian production grid: http://www.italiangrid.org/
- Ticketing system:
- SAM:
- HLR MON:
- WMS MON:
- gLite Middleware:

