ALICE Computing: status and requests
Domenico Elia
Meeting with the LHC Computing Referees, Bologna, 25 May 2015
Outline
  ALICE Computing status:
    resource usage in 2014
    performance of the Italian sites, R&D activity on the VAF
    evolution of the CM for Run2
  Financial requests:
    CPU and storage situation at the Tier-2s, decommissioning
    2016 requests for Tier-1 and Tier-2
ALICE Computing status: resource usage in 2014
  Overall CPU/DISK/TAPE usage (CERN-RRB):
    T1, T2 over pledge (opportunistic, extra-WLCG)
    DISK usage ~70% (87% full wherever there is a good network connection)
    TAPE usage ~90% (but T1, will improve with Run2)
ALICE Computing status: resource usage in 2014
  Main activity 2014 / May 2015:
    Run1 data reprocessing and associated MC:
      pp 2010 (pass4), pp and pPb 2012, pPb 2013 (pass3)
      full detector recalibration + improved software, all with the same code
      pp 2011 reprocessing being evaluated (overlap with Run2)
    further MC productions (~120 cycles):
      requests from PWGs (68% pp, 18% pPb, 14% PbPb)
      first large-scale production for Run2 (new detector setup)
      ~4% of generations dedicated to upgrade studies (Run3)
    analysis (user and organized trains)
    ALICE recommissioning for Run2:
      test of upgraded detector readout, trigger, DAQ, recording chain
      cosmics trigger data taking with Offline processing
ALICE Computing status: resource usage in 2014
  Average: ~45K concurrent jobs
  ~99.5% availability
  85% CPU (T0, T1); 79% CPU (T2)
ALICE Computing status: resource usage in 2014
  MC productions: all centres
  RAW data processing: T0/T1 only
  User analysis: all centres
  Organized analysis: all centres
  individual analysis decreased by 50% in the period
  still ample room to increase the share of organized analysis
  reducing individual analysis by a factor 2 could still give a ~2-5% gain in efficiency
ALICE Computing status: resource usage in 2014
  New and upcoming WLCG sites:
    KR-KISTI (Korea), T1: in production in 2014
    NRC-KI (Russia), T1: in production in 2014
    UNAM (Mexico), T2: ramping up, MoU signed in Nov 2014
    COMSATS (Pakistan), T2: MoU signed in March 2015
    CHPC (South Africa), T2: MoU signed in April 2015
ALICE Computing status: performance of the Italian sites
  [Plot not recovered; surviving annotation: ~15%]
ALICE Computing status: performance of the Italian sites
  Storage availability: [plot not recovered]
ALICE Computing status: performance of the Italian sites
  Resource T1:
    CPU ~150% of pledge (2015)
    DISK ~85% of pledge (2014)
    TAPE largely underused (~700 TB)
ALICE Computing status: performance of the Italian sites
  Resource T2:
    largely benefits from strict internal coordination
      monthly meetings (performance recording) + annual workshop
    overall ~35% increase in total WCT from 2013 to 2014
    large upgrade in 2 sites (ReCaS):
      BARI (almost ready, in production beginning of June)
      CATANIA (in production since April, ~1500 cores, 1 PB: Catania-VF)
ALICE Computing status: performance of the Italian sites
  [Plots not recovered. Panels: Bari, Torino, PD-LNL, Catania; legend labels: Pledge, Catania-VF]
ALICE Computing status: performance of the Italian sites
  Monitoring of T2 data from APEL (Andrea Guarise):
    2014 and 2015 pledge values in place
    HS06-to-SI2K conversion factor from sites (BDII) checked/updated
    WCT cross-checked against experiment (MonALISA) and local monitoring
ALICE Computing status: performance of the Italian sites
  [Content not recovered]
  (*) In April Catania-VF was monitored only through MonALISA; on EGI since May
ALICE Computing status: performance of the Italian sites
  Tier-2 coordination
  Conversion of the EGI data from hours to Wall_h_kSi2k
  (*) In April Catania-VF was monitored only through MonALISA; on EGI since May
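The conversion of EGI wall-clock hours into Wall_h_kSi2k can be sketched as follows. This is a minimal sketch, not the script actually used by the Tier-2 coordination: it assumes the commonly used WLCG equivalence 1 kSI2K = 4 HS06, and the function name and the per-core HS06 rating are hypothetical (in practice the per-site factor would be the one each site publishes in the BDII).

```python
# Assumed WLCG convention (not stated in the slides): 1 kSI2K = 4 HS06.
HS06_PER_KSI2K = 4.0


def wall_hours_to_ksi2k_hours(wall_hours, hs06_per_core):
    """Convert raw wall-clock hours into kSI2K-normalised wall hours.

    hs06_per_core is the per-core HS06 rating of the site (hypothetical input).
    """
    return wall_hours * hs06_per_core / HS06_PER_KSI2K


# Hypothetical example: 1000 wall hours on cores rated 10 HS06 each.
example = wall_hours_to_ksi2k_hours(1000.0, 10.0)
print(example)
```

Keeping the HS06 rating as an explicit argument makes it easy to re-run the normalisation whenever a site updates its published factor.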
ALICE Computing status: performance of the Italian sites
  Bari (-16%), Torino (-15%), PD-LNL (-7‰), Catania (+2%)
  Tier-2 coordination
  (*) In April Catania-VF was monitored only through MonALISA; on EGI since May
ALICE Computing status: R&D activity on the VAF
  In the framework of the STOA-LHC PRIN:
    Torino VAF going to be accounted for ALICE
    similar (test) cloud infrastructures deployed in 2014: Bari, Cagliari, Legnaro, Trieste (OpenStack)
    Catania could join (new T2 infrastructure)
  many activities ongoing, reported at CHEP'15 (April):
    Interoperating Cloud-based Virtual Farms for the ALICE experiment at the LHC (Trieste)
    Monitoring of IaaS and scientific applications on the Cloud using Elasticsearch ecosystem (Torino)
    Managing competing elastic grid and cloud computing applications using OpenNebula (Torino)
    Local storage federation through XRootD architecture for interactive distributed analysis (Bari)
  white paper to be finalized by end of 2015
  further (connected) activities: parallel computing (TS, BA), dashboard for the Italian sites (BA)
ALICE Computing status: evolution of CM for Run2
  Targeting an integrated luminosity of 1 nb^-1 for PbPb:
    by combination of Run1 and Run2 statistics
    consistent with the ALICE approved programme
  4-fold increase in instantaneous luminosity for PbPb
  Detector upgrades: complete TRD/PHOS, new DCAL
  Double event rate of TPC/TRD: consolidation of TPC and TRD readout electronics
  Increased capacity of HLT and DAQ: rate up to 8 GB/s to T0 (for heavy-ion data taking)
ALICE Computing status: evolution of CM for Run2
  ALICE Grid model largely unchanged in Run2:
    integration of every new computing centre
    average 2 replicas of analysis objects:
      dependency on resource stability, 1 copy for least popular data
    low differentiation of tasks:
      T0/T1 still raw data keepers/producers
      all other tasks (MC + analysis) performed everywhere
    tasks generally sent to data, but data can go to tasks if needed:
      jobs go to data, in case of failure read from closest replica (<5%)
    ALICE global data distribution by exclusive use of the xrootd protocol
    analysis input mostly on AODs (limited use of ESDs)
    push analyzers to organized trains (LEGO framework)
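The "jobs go to data, in case of failure read from closest replica" policy can be illustrated with a small sketch. It is only an illustration under stated assumptions and not the ALICE Grid implementation: the function name, the distance table and its values are hypothetical, and the real notion of "closest" is not specified in the slides.

```python
def read_order(job_site, replica_sites, distance):
    """Order replicas for reading: the local one first, then by increasing distance.

    distance maps (job_site, other_site) pairs to a hypothetical closeness metric;
    sites missing from the table are tried last.
    """
    def sort_key(site):
        is_remote = site != job_site
        return (is_remote, distance.get((job_site, site), float("inf")))

    return sorted(replica_sites, key=sort_key)


# Hypothetical distances as seen from a job running in Torino.
distances = {("Torino", "Bari"): 2.0, ("Torino", "Catania"): 3.0}

# A local replica exists: it is read first, remote ones are fallbacks.
order_local = read_order("Torino", ["Catania", "Torino", "Bari"], distances)

# No local replica: the job falls back to the closest remote copy.
order_remote = read_order("Torino", ["Catania", "Bari"], distances)
print(order_local, order_remote)
```

Returning the full ordered list rather than a single site lets a caller keep trying the next replica if a read fails.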
ALICE Computing status: evolution of CM for Run2
  Main software and process improvements:
    new version of the software framework (AliRoot 5.x):
      effort to improve performance of the ALICE reconstruction software
      use TRD points in the fit (improve high-momentum resolution)
      reduce memory requirements during calibration and reconstruction
    use of HLT for online raw data compression (factor 4):
      already tested in Run1, implies reduction of tape at Tier-0/1
    use of HLT for calibration: move first calibration iteration to online
    use of HLT track seeds for offline reconstruction
    improve performance of GEANT4 simulation for ALICE
    further development of fast and parametrized simulation
ALICE Computing status: evolution of CM for Run2
  Main software and process improvements, additional improvements:
    start adapting ALICE distributed computing to the Cloud
    use of the HLT farm for offline processing: corresponds to an additional 3% of CPU resources
    improving performance of the organized analysis trains
    speeding up and improving the efficiency of the analysis activity by active data management
    explore contributed resources, i.e. spare CPU cycles on supercomputers:
      collaborating with other experiments on this issue
ALICE Computing status: evolution of CM for Run2
  Basic assumptions for Run2 resource estimate:
    same CPU power needed for reconstruction
    25% larger raw event size:
      additional detectors, detector coverage
      higher track multiplicity with increased beam energy and pileup
    MC productions: 100% pp, pPb; [value not recovered]% PbPb events
  T1/2 2016: +25%; T1/2 2016: +17% [labels of a table that was not recovered]
Summary
  Run2:
    data volume in the period expected to be ~3x Run1
    focus of the Grid development will be on improving the analysis efficiency and decreasing the turnaround time of organized trains
    several other software and process improvements
    site performance and stability will continue to be a key factor for the success of the ALICE offline computing
    planned resource increase expected to meet the demands, working on data popularity monitoring and replica limitation
  Run3:
    TDR submitted to LHCC, final discussion first week of June
Financial requests: CPU/storage situation in Italy
  In production at the Tier-1:
    CPU: 22800 HS06 (pledge 2015)
    DISK: 1920 TB (pledge 2014*)
    (*) At the 2015 pledge (3380 TB) in October
  In production at the Tier-2s (+ Cagliari):
    [Table values not recovered. Columns: Bari, Catania, Padova-LNL, Torino, Cagliari, Total; rows in HS06 and TB: available in May 2015 (including obsolete resources not yet decommissioned); pledge 2015]
  2015 resources: ReCaS CT in production, the rest between June and September (next slide)
Financial requests: CPU/storage situation at the Tier-2s
  2015 allocation, ReCaS part:
    Bari (in production in June):
      replacements BA: 1568 HS06 [TB value not recovered]
    Catania (in production since early April):
      replacements CT: 1075 HS06, replacements CA: 840 HS06
      share of the total ALICE net growth: 1550 HS06 [TB value not recovered]
Financial requests: CPU/storage situation at the Tier-2s
  2015 allocation, ReCaS part:
    Bari (in production in June): 1568 HS06 [TB value not recovered]
    Catania (already in production): 3465 HS06 [TB value not recovered]
  2015 allocation, CSN3 part (224.5 k€):
    CPU quota (32.5 k€), reallocated to PD-LNL and TO:
      replacements PD-LNL: 1280 HS06, replacements TO: 1400 HS06
    storage quota (192 k€): completion of the total net growth + replacements for PD-LNL and TO
      tender managed in Bari (specifications almost ready, to the GE in June)
      expected: 6 x 180 TB (1 BA, 2 TO, 3 PD-LNL) = 1.08 PB
      substantial savings with expansions only (BA and PD-LNL)
      ~15 k€ more might be needed (2 servers for TO)
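The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only re-adds numbers quoted in the deck; the variable names are illustrative, 1 PB is taken as 1000 TB, and the Catania breakdown (1075 + 840 + 1550 HS06) is read from the previous slide.

```python
# CSN3 allocation: CPU quota plus storage quota, in k€.
cpu_quota_keur = 32.5
storage_quota_keur = 192.0
csn3_total_keur = cpu_quota_keur + storage_quota_keur

# Expected storage from the tender: 180 TB units split across three sites.
units_per_site = {"BA": 1, "TO": 2, "PD-LNL": 3}
unit_size_tb = 180
expected_tb = sum(units_per_site.values()) * unit_size_tb
expected_pb = expected_tb / 1000  # assuming 1 PB = 1000 TB

# Catania ReCaS CPU: replacements CT + replacements CA + net growth, in HS06.
catania_hs06 = 1075 + 840 + 1550

print(csn3_total_keur, expected_tb, expected_pb, catania_hs06)
```

The three results agree with the totals quoted on the slide: 224.5 k€, 1.08 PB and 3465 HS06.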
Financial requests: CPU/storage situation at the Tier-2s
  Updated with 2015 resources:
    CPU: 39216 HS06, in excess of the pledge: 616 HS06
    DISK: 4423 TB, in excess of the pledge: 42 TB
  [Table values not recovered. Columns: Bari, Catania, Padova-LNL, Torino, Cagliari, Total; rows in HS06 and TB: available at end of 2015 (all decommissionings excluded, 2015 resources); pledge 2015]
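Since the pledge row of the table did not survive, the 2015 Tier-2 pledge implied by this slide can be derived from the quoted totals and excesses. This is a derived value under the assumption that "in excess of the pledge" means available minus pledge; the dictionary keys are illustrative names.

```python
# Totals available at the Tier-2s once the 2015 resources are in place.
available = {"cpu_hs06": 39216, "disk_tb": 4423}

# Amounts quoted as being in excess of the 2015 pledge.
excess = {"cpu_hs06": 616, "disk_tb": 42}

# Implied pledge, assuming excess = available - pledge.
implied_pledge = {key: available[key] - excess[key] for key in available}
print(implied_pledge)
```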
Financial requests: decommissioning
  [Table values not recovered. Rows: year of decommissioning; columns: Bari, Catania, LNL-Padova, Torino, Cagliari, Total, in HS06 and TB]
Financial requests: decommissioning
  [Table values not recovered. Rows: year of decommissioning; columns: Bari, Catania, LNL-Padova, Torino, Cagliari, Total, in HS06 and TB]
  Row labels recovered: second half of 2016; 2017
Financial requests: decommissioning
  [Table values not recovered. Rows: year of decommissioning; columns: Bari, Catania, LNL-Padova, Torino, Cagliari, Total, in HS06 and TB]
  Overall Tier-2 situation in 2016:
    CPU: 39216 - 9768 = 29448 HS06
    DISK: 4423 - 567 = 3856 TB
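The overall 2016 situation follows from subtracting the planned decommissioning from the end-of-2015 totals. The short sketch below redoes that subtraction and also expresses the decommissioned share as a fraction; the variable names are illustrative and the fractions are derived here, not quoted in the slides.

```python
# End-of-2015 Tier-2 totals and planned decommissioning, as quoted in the deck.
cpu_available_hs06 = 39216
cpu_decommissioned_hs06 = 9768
disk_available_tb = 4423
disk_decommissioned_tb = 567

# Resources left in 2016 before any new purchase.
cpu_2016_hs06 = cpu_available_hs06 - cpu_decommissioned_hs06
disk_2016_tb = disk_available_tb - disk_decommissioned_tb

# Share of the installed capacity being retired (roughly 25% CPU, 13% disk).
cpu_fraction = cpu_decommissioned_hs06 / cpu_available_hs06
disk_fraction = disk_decommissioned_tb / disk_available_tb

print(cpu_2016_hs06, disk_2016_tb)
```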
Financial requests: outcome of the April 2015 RRB
  INFN share for 2016:
    CPU, DISK for Tier-1 and Tier-2: 18.5% (19.3% in 2015)
    TAPE for Tier-1: 35.2% (41.1% in 2015)
Financial requests: 2016 requests, Tier-1
  [Table values not recovered. Columns: CPU Tier-1 (HS06), DISK Tier-1 (TBn); rows: pledged, available minus decommissioned, scrutinized ALICE, delta, estimated cost (k€), total (k€)]
  Cost estimate T2 (T1): 12 (14) € / HS06 and 220 (240) € / TBn
  T1: pledge share from 4192 TB (2015) to 5491 TB (2016)
  Tier-1 decommissioning: not included
Financial requests: 2016 requests, Tier-1 and Tier-2
  [Table values not recovered. Columns: CPU Tier-1 (HS06), DISK Tier-1 (TBn), CPU Tier-2 (HS06), DISK Tier-2 (TBn); rows: pledged, available minus decommissioned, scrutinized ALICE, delta, estimated cost (k€), total (k€)]
  Overhead T2 (k€): 48.1
  Cost estimate T2 (T1): 12 (14) € / HS06 and 220 (240) € / TBn
  T1: pledge share from 4192 TB (2015) to 5491 TB (2016)
  Tier-1 decommissioning: not included
  Tier-2 overhead: 6% CPU + 5% DISK (network) + 7% total (additional servers)
Financial requests: 2016 requests per Tier-2 site (full request)
  Decommissioning per site (cells marked ? were lost or truncated in extraction):
    Site         CPU HS06   CPU k€   DISK TB   DISK k€   Total k€
    Bari         1568       18.8     0         0.0       18.8
    Catania      0          0.?      ?         ?         ?.6
    LNL-Padova   5496       66.?     ?         ?.2       123.2
    Torino       1584       19.?     ?         ?.5       53.5
    Cagliari     1120       13.4     20        4.4       17.8
  Totals (k€):
    Decommissioning total: 242.0
    Net growth: 144.8
    Decommissioning + growth: 386.7
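The CPU cost cells that were truncated in the table can be recomputed from the per-site HS06 figures and the Tier-2 unit costs quoted in the deck (12 € / HS06 and 220 € / TBn). The sketch below is a consistency check under those assumptions, not the author's spreadsheet; the per-site HS06 values are read from the partially garbled table, and rounding to one decimal is an assumption that happens to reproduce the surviving cells.

```python
# Tier-2 unit costs quoted in the deck.
EUR_PER_HS06_T2 = 12
EUR_PER_TB_T2 = 220

# HS06 to be decommissioned per site, as read from the table.
decommissioned_hs06 = {
    "Bari": 1568,
    "Catania": 0,
    "LNL-Padova": 5496,
    "Torino": 1584,
    "Cagliari": 1120,
}

# Replacement cost of the decommissioned CPU per site, in k€ (one decimal).
cpu_cost_keur = {
    site: round(hs06 * EUR_PER_HS06_T2 / 1000, 1)
    for site, hs06 in decommissioned_hs06.items()
}

# Totals over all sites.
total_hs06 = sum(decommissioned_hs06.values())
total_cpu_cost_keur = round(total_hs06 * EUR_PER_HS06_T2 / 1000, 1)

# Cagliari is the only site whose disk cells survived intact: 20 TB.
cagliari_disk_cost_keur = round(20 * EUR_PER_TB_T2 / 1000, 1)

print(cpu_cost_keur, total_hs06, total_cpu_cost_keur, cagliari_disk_cost_keur)
```

The recomputed values match the cells that did survive (18.8 k€ for Bari, 13.4 k€ and 4.4 k€ for Cagliari), and the HS06 sum matches the 9768 HS06 of decommissioning quoted for the overall 2016 situation.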
Financial requests: 2016 requests per Tier-2 site (two scenarios)
  Full request, totals (k€):
    Decommissioning total: 242.0
    Net growth: 144.8
    Decommissioning + growth: 386.7
  With storage decommissioning postponed (per-site rows only partially recovered):
    Bari: DISK 0 TB, 0.0 k€; total 18.8 k€
    Cagliari: DISK 20 TB, 4.4 k€; total 17.8 k€
    Decommissioning total: DISK 20 TB, 4.4 k€; total 121.6 k€
    Net growth: 144.8 k€
    Decommissioning + growth: 266.4 k€
BACKUP
ALICE CM: focus on Run2, physics programme and upgrades
  Targeting an integrated luminosity of 1 nb^-1 for PbPb:
    by combination of Run1 and Run2 statistics
    consistent with the ALICE approved programme
  4-fold increase in instantaneous luminosity for PbPb
ALICE CM: focus on Run2, infrastructure improvements
  Focus on SE stability:
    major factor for successful analysis and high CPU efficiency
    goal for all SEs: > 98% availability
ALICE CM: focus on Run2, infrastructure improvements
  Focus on SE stability
  LHCone programme
  Network use will increase
  IPv6 adoption
  Refurbishment of SAM/SUM tests:
    WLCG monitoring consolidation project, advanced status
    Site tests will reflect more and more the VO tests: in the ALICE case provided by MonALISA
ALICE CM: focus on Run2, infrastructure improvements
  Focus on SE stability
  LHCone programme:
    brings substantial improvement in inter-site connectivity
    allows for further dilution of boundaries between sites and tasks
    Europe largely covered, focus on South America and Asia
  Network use will increase:
    large data volumes, more to transfer between sites
    remote access to storage
ALICE CM: focus on Run2, infrastructure improvements
  Focus on SE stability
  LHCone programme
  Network use will increase
  IPv6 adoption:
    IPv4 address depletion is already a fact for new sites
    ALICE services are IPv6 ready
    xrootd v.4 should be IPv6 ready (release end of May)
    other services are being brought into compliance
ALICE Computing status: new virtual infrastructure at CT
  [Figure not recovered]