Laura Perini: ATLAS Computing 22 Settembre 2004 1 ATLAS Computing Update e richieste 2005 22 Settembre 2004 Laura Perini (Milano)

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 1 ATLAS Computing Update e richieste 2005 22 Settembre 2004 Laura Perini (Milano)

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 2 Outline Update di quanto mostrato il 22 Giugno l Le principali novità da giugno sono nell’area del DC2 nIn corso da luglio nSvolto tutto con GRID nFase di simulazione vicina alla fine con > 8 M eventi da Geant4 nCon rilevante impegno INFN La presentazione quindi si concentra sullo svolgimento del DC2, con parte finale su previsioni DC3 che motivano le principali richieste. nmolto in inglese, riprendendo da recente (21 settembre) presentazione ATLAS a sw week ma integrando INFN specifico l Milestones, dettaglio delle richieste, preparazione del MoU per il calcolo e sw sono stati discussi coi referees e verranno presentati da loro

3 ATLAS Data challenges  DC1 (2002-2003) o Put in place full software chain  Simulation of the data  Reconstruction o Production system  Tools (bookkeeping; monitoring; …)  Intensive use of Grid  DC2 (Summer 2004) o New software o New “automated” production system o Full use of Grids o Test of Computing Model  DC3 (End 2006) o Final test before data taking

4 ATLAS-DC2 operation  Consider DC2 as a three-part operation: o part I: production of simulated data (July-September 2004)  running on “Grid”  Worldwide o part II: test of Tier-0 operation (October 2004)  Do in 10 days what “should” be done in 1 day when real data-taking start  Input is “Raw Data” like  output (ESD+AOD) will be distributed to Tier-1s in real time for analysis o part III: test of distributed analysis on the Grid (Oct.-Dec. 2004)  access to event and non-event data from anywhere in the world both in organized and chaotic ways  Requests o ~30 Physics channels ( 10 Millions of events) o Several millions of events for calibration (single particles and physics samples)

5 More on Phase I: Data preparation  DC2 Phase I o Part 1: Event generation  Physics processes --> 4-momentum of particles  Several Event generators (Pythia; Herwig; …) o Part 2: Detector simulation (Geant4)  Tracking of particles through the detector  Records interaction of particle with sensitive elements of the detector o Part 3: Pile-up and digitization  Pile-up: superposition of “background” events with the “signal” event  Digitization: response of the sensitive elements of the detector  Output, called byte-stream data, “looks-like” “Raw Data” o Part 4: Data transfer (to CERN Tier-0)  ~35 TB in 4 weeks o Part 5: Event mixing  Physics events are “mixed” in “ad-hoc” proportion

6 ATLAS Production System  Automated version of previous ATLAS DC1 production system  Components o Supervisor: Windmill (US) o Executors (one per Grid or “legacy batch”) :  Capone (Grid3) (US)  Dulcinea (NorduGrid) (Scandinavia)  Lexor (LCG) (Italy)  “Legacy systems” (Germany-FZK; France-Lyon) o Data Management System (DMS): Don Quijote (CERN) o Bookkeeping: AMI (LPSC-Grenoble) o Production Data base (Oracle)  Definition and status of the jobs

7 ATLAS Production system LCGNGGrid3LSF LCG exe LCG exe NG exe G3 exe LSF exe super prodDB dms RLS jabber soap jabber Don Quijote Windmill Lexor AMI Capone Dulcinea

8 Supervisor -Executors Windmill numJobsWanted executeJobs getExecutorData getStatus fixJob killJob Jabber communication pathway executors Don Quijote (file catalog) Prod DB (jobs database) execution sites (grid) 1. lexor 2. dulcinea 3. capone 4. legacy supervisors execution sites (grid)

9 ATLAS DC2 Phase I  Started beginning of July and still running  On 3 Grids o LCG  Including some non-ATLAS sites (Legnaro, Torino)  Using in production mode the LCG-Grid-Canada interface 3 sites are accessible through this interface(TRIUMF) –Uni. Victoria, Uni. Alberta and WestGrid(SFU/TRIUMF) o NorduGrid  Several Scandinavian super-computer resources o Grid3  Harnessing opportunistic computing resources that are not dedicated to ATLAS (e.g. US CMS sites)

10 Current Grid3 Status (3/1/04) (http://www.ivdgl.org/grid2003) 28 sites, multi-VO shared resources ~2000 CPUs dynamic – roll in/out

11 NorduGrid & Co. Resources:  7 countries:  Sites for ATLAS: 22 o Dedicated: 3, the rest is shared  CPUs for ATLAS: ~3280 o Effectively available: ~800  Storage Elements for ATLAS: 10 o Capacity: ~14 TB, all shared

Current LCG-2 sites: 7/9/04 73 Sites 7700 CPU 26 sites at 2_2_0 33 sites at 2_1_1 others at ?? 29 pass all tests

13 ATLAS DC2 Phase I  Main difficulties at the initial phase o For all Grids  Debugging the Production System  On LCG and Grid3 several instances of the Supervisor have to be run for better coping with the instability of the system. As a consequence the Production System was more difficult to handle. o LCG  Mis-configuration of sites; Information system (wrong or missing information); Job submission and Resource Broker; Jobs ranking.  Data management(copy & register); Stage in/out problems o NorduGrid  Replica Location Service (Globus) hanging several times per day  Mis-configuration of sites  Access to the conditions database o Grid3  Data Management - RLS interactions  Software distribution problems  Load on gatekeepers  Some problems with certificates (causing jobs to abort) o Good collaboration with Grid teams to solve the problems

14 ATLAS DC2 Phase I  Not all problems solved o NorduGrid  RLS; Access to the conditions database; Storage elements died … o Grid3  Try to avoid single points of failure (adding new servers)  Lack of storage management in some sites o LCG  Still some problems with resource broker and information system  And data management (copy and register) and stage in/out problems o For all  Slowness of the response of the Production Database (Oracle) Problem that appears after ~6 weeks of running and which is still not fully understood (mix software and hardware problems? being worked with IT-DB). Has been solved!  Consequences: we did not succeed (yet) to run as many jobs as expected per day o In “good time-slots” the rate of about 2000 jobs running at the same time on LCG was sustained for 5-10 days  Nevertheless should be completed by end-September and is “Grid” only

17 ATLAS DC2 status (CPU usage for simulation)

19 Statistiche e problemi LCG  8 M eventi prodotti con Geant4 o 100 k jobs da 24 ore circa o 30TB di output e 1470 kSpI2k*months Da qui in poi ci concentriamo su LCG o Breve sommario quantitativo problemi trovati da 1-8 a 7-9 (prima nostra categorizzazione era troppo approssimativa) o 750 jobs falliti per misconfigurazione siti (E1) o 1985 “ per WLMS e servizi collegati (E2) o 4350 “ per Data Man. e servizi collegati (E3) Jobs finiti bene nello stesso periodo 29303 (OK)  Efficienza LCG = OK/(OK+E1+E2+E3)= 81%  Ricalcolata da 1-8 a 20-9 effic. non cambia significativamente, ma la prossima release LCG dovrebbe ridurre a <1/3 errori WLMS Ma l’efficienza globale è più bassa, ci sono stati problemi anche nella parte ATLAS (circa 7000 non molto diverso da LCG) e circa 5000 di difficile assegnazione  Efficienza DC2(parte LCG)=OK/(OK+FAILED)= 62%

20 Problemi in DC2  Vale la pena notare che il peggiore singolo problema incontrato non è stato né di LCG né del production system  Dal 7 di agosto fino al 25 siamo stati limitati a meno di 1000 jobs al giorno da un rallentamento imprevisto e non capito del server ORACLE del CERN che gestisce la nostra production DB  NO comment….

21 Jobs distribution on LCG Preliminary

22 Jobs distribution on Grid3 Preliminary

23 Jobs distribution on NorduGrid Preliminary

24 Sommario DC2  La fase di simulazione è quasi conclusa con successo!  L’uso quasi esclusivo e diretto delle 3 grid è un enorme passo avanti o Per ATLAS e per LCG che ha superato un test realistico e fatto un debugging decisivo vedi ad es. da EIS https://edms.cern.ch/file/495809//Broker- Requirements.pdf https://edms.cern.ch/file/495809//Broker- Requirements.pdf  Il production system sarà la base su cui costruiremo l’evoluzione delle attività di produzione ( continua, DC3 e se possibile Test Beam Combinato TBC) e analisi  Importanza del TBC o Ora in presa dati, ma poi intenzione di produrre con ATHENA e analizzare in modo il più “standard DC” possibile.  Bridge fra comunità rivelatori e calcolo

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 25 Risorse disponibili in Italia per DC2 l Le risorse “garantite” per il DC2 di ATLAS in Italia sono l 280 kSI2K (metà al CNAF e metà nei 4 Tier2) n60 MI, 50 Roma1, 20 NA, 8 LNF l 24 TB disco Raid (8 al CNAF e 16 nei Tier2) n7 MI, 5 Roma1, 3.2 NA, 0.8 LNF  attenzione: una parte considerevole (20-25%) è già occupata da dati ATLAS e non sarà disponibile per DC2 l Da qui in poi una serie di plots con le informazioni CPU e storage utilizzato dai siti, Milano Roma1, Napoli, LNF e CNAF

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 26 Mi-CPU (62 CPUs)

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 27

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 30 LNF: 6 box = 12 CPU

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 31 CNAF

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 32 Esempio Disco: Milano ma uso disco principale non per fase 1

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 33 ATLAS Computing Timeline POOL/SEAL release (done) ATLAS release 7 (with POOL persistency) (done) LCG-1 deployment (done) ATLAS complete Geant4 validation (done) ATLAS release 8 (done) DC2 Phase 1: simulation production (in progress) DC2 Phase 2: intensive reconstruction (the real challenge!) Combined test beams (barrel wedge) Computing Model paper Computing Memorandum of Understanding ATLAS Computing TDR and LCG TDR DC3: produce data for PRR and test LCG-n Physics Readiness Report Start commissioning run GO! 2003 2004 2005 2006 2007 NOW

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 34 Final prototype: DC3 l We should consider DC3 as the “final” prototype, for both software and computing infrastructure ntentative schedule is Q4-2005 to end Q1-2006  cosmic run will be later in 2006 l This means that on that timescale (in fact, earlier than that, if we have learned anything from DC1 and DC2) we need: na complete s/w chain for “simulated” and for “real” data  including aspects missing from DC2: trigger, alignment etc. na deployed Grid infrastructure capable of dealing with our data nenough resources to run at ~50% of the final data rate for a sizable amount of time (one month) l After DC3 surely we will be forced to sort out problems day-by- day, as the need arises, for real, imperfect data coming from the DAQ: no time for more big developments!

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 35 Preparazione in Italia per DC3 (1) l DC3 prevede la produzione dei dati equivalenti ad un mese di presa dati di ATLAS al 50% del rate finale (1.5x10 8 eventi) l Infrastruttura hardware (dovremmo avere a disposizione di ATLAS in Italia almeno il 10% delle risorse totali necessarie per DC3): nCPU e mass storage per produzione della simulazione nel Tier-1 e nei Tier-2 (1.5x10 7 eventi simulati): ~2000 kSI2k.mesi, ~75 TB  ~600 kSI2k per 4 mesi fra Tier-1 e tutti i Tier-2 (effic. 80%) nrate da Tier-0 a Tier-1: 16 MB/s (RAW) + 10 MB/s (ESD) + 1 MB/s (AOD)  ~240 Mbit/s per un mese consecutivo nrate da Tier-1 ad ogni Tier-2: 1 MB/s (AOD)  ~ 8 Mbit/s per ogni Tier-2, anche questo per un mese consecutivo

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 36 Preparazione in Italia per DC3 (2) nmass storage nel Tier-1 per RAW, ESD, AOD e TAG:  (16+10+1 MB/s)x(30 giorni)x(8x10 4 sec/giorno) ~ 65 TB nspazio disco nel Tier-1 per ESD, AOD e TAG:  (10+1 MB/s)x(30 giorni)x(8x10 4 sec/giorno) ~ 25 TB nspazio disco in ogni Tier-2 per AOD e TAG:  (1 MB/s)x(30 giorni)x(8x10 4 sec/giorno) ~ 2.5 TB nCPU e spazio disco nel Tier-1 e nei Tier-2 per gli utenti che fanno analisi  considerevole soprattutto per i Tier-2 l NB: a tutti questi numeri vanno aggiunte le efficienze di utilizzo di CPU, reti, dischi e nastri  (fattori importanti soprattutto per le reti)

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 37 Preparazione in Italia (3) l Tier-3: nogni istituto (Tier-3) deve essere collegato al resto del sistema Grid con banda passante sufficiente per poter lavorare ovunque con la stessa efficienza e lo stesso accesso alle risorse locali e globali  questo non è particolarmente difficile ma bisogna pensarci per tempo l Infrastruttura software: ndata le competenze esistenti, la comunità italiana si deve coinvolgere più direttamante nello sviluppo del sistema di “Distributed Analysis”  che ha ancora bisogno in ogni caso di parecchio lavoro... nbisognerebbe anche organizzare un “help desk” virtuale italiano per aiutare chi comincia un’analisi a partire senza perdere tempo nanche il sistema di job e Grid monitoring è un campo di sviluppo utile ad ATLAS e anche a EGEE

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 38 Richieste 2005 : Tier1&2 perDC3 l Il modello che si persegue e’ n2/3 CPU al CNAF e 1/3 Tier2 (indipendente da numero Tier2) nNastro tutto al CNAF nDisco per ora come CPU, in futuro AOD e formati piu’ ridotti in Tier2 il resto in Tier1  Distribuzione Tier1-Tier2 per DC3: T1 350k SI2K con 45 TB somma T2 250kSI2K con 30 TB nRichieste AGGIUNTIVE per 2005 T1 210 kSI2K 37 TB disco RAID 150 TB nastro  somma T2 110kSI2K e 14 TB proposta LNF 16 kSI2K e 2 TB per ciascuno di altri 3 (MI, Roma1,Na) 30 kSI2K e 4 TB disco RAID = 5 TB RAW  Costi 2.2 kEuro per 1 biprocessore rack mounted da 2kSI2k, 3 kEuro per 1TB disco RAW.

Laura Perini: ATLAS Computing CSN1@Assisi- 22 Settembre 2004 39 FINE DELLA MIA ULTIMA PRESENTAZIONE da rap.naz.calc. l Seguono slides di backup…

Laura Perini: ATLAS Computing Referees- 27 luglio 2004 40 Richieste 2005 (3)

Laura Perini: ATLAS Computing 22 Settembre 2004 1 ATLAS Computing Update e richieste 2005 22 Settembre 2004 Laura Perini (Milano)

Presentazioni simili

Presentazione sul tema: "Laura Perini: ATLAS Computing 22 Settembre 2004 1 ATLAS Computing Update e richieste 2005 22 Settembre 2004 Laura Perini (Milano)"— Transcript della presentazione:

Presentazioni simili

Sul progetto

Feed-back

Entrare

Autorizzarsi attraverso i social network:

Laura Perini: ATLAS Computing 22 Settembre 2004 1 ATLAS Computing Update e richieste 2005 22 Settembre 2004 Laura Perini (Milano)

Presentazioni simili

Presentazione sul tema: "Laura Perini: ATLAS Computing 22 Settembre 2004 1 ATLAS Computing Update e richieste 2005 22 Settembre 2004 Laura Perini (Milano)"— Transcript della presentazione:

Presentazioni simili

Sul progetto

Feed-back