NaNet: a flexible and configurable low-latency NIC for real-time GPU-based systems. Alessandro Lonardo, Andrea Biagioni, Francesca Lo Cicero, Piero Vicini (INFN).


People Involved
Alessandro Lonardo, Pier Stanislao Paolucci, Davide Rossetti, Piero Vicini, Roberto Ammendola, Andrea Biagioni, Ottorino Frezza, Francesca Lo Cicero, Francesco Simula, Laura Tosoratto, Felice Pantaleo, Marco Sozzi, Gianluca Lamanna, Riccardo Fantechi.

The NA62 Experiment at CERN
■ The goal is the precision measurement of the ultra-rare decay K+ → π+ ν ν̄, which is extremely sensitive to any unknown new particle.
■ Many (uninteresting) events: 10^7 decays/s.
■ Ultra-rare: roughly one target event per ~10^10 particle decays; the experiment expects to collect ~100 events in 2-3 years of operation.
■ Need to select the interesting events, filtering the data stream coming from the detectors efficiently and quickly.

Multi-level Trigger Architecture in HEP Experiments
Multi-level triggering reduces the amount of data to a manageable level, operating in a set of successive stages between event detection and the storage of the associated data.
■ Lower level: custom hardware looking for simple patterns in reduced information from a few fast detectors; if the event passes, the full detector data are moved from buffers to PC memories.
■ Higher levels: a hierarchical chain of simplified reconstruction algorithms on some detectors, looking for specific features of interesting events and running on switched computing farms; if the event passes, the detector data are stored to permanent storage.
■ Offline analysis: full reconstruction algorithms, producing reduced data samples for further physics analysis.

The NA62 Trigger and Data Acquisition System
(Block diagram: the detectors RICH, MUV, CEDAR, LKR, STRAWS, LAV feed the L0TP; 10 MHz into L0, 1 MHz into the L1/L2 PC farm through a GigaEth switch, ~100 kHz after L1, O(kHz) to CDR.)
■ L0 trigger: hardware, synchronous, FPGA custom design; 10 MHz to 1 MHz; max latency 1 ms.
■ L1/L2 trigger: software, asynchronous, switched PC farm; 1 MHz to some kHz.

Using GPUs in the NA62 L1/L2 Trigger
(Same TDAQ diagram as before, highlighting the L1/L2 PC farm.)
Exploit GPU computing power to scale down the number of nodes in the farm: event analysis parallelization on many-core architectures. Not the topic of this talk.

Using GPUs in the NA62 L0 Trigger
(Same TDAQ diagram as before, highlighting the L0 trigger processor.)
Replace the custom hardware with a GPU-based system performing the same task, but:
■ Programmable
■ Upgradable
■ Scalable
■ Cost effective
■ Able to increase the selection efficiency for interesting events by implementing more demanding algorithms.
This is the topic of this talk.

Using GPUs in the NA62 L0 Trigger: the RICH Detector Case Study
Ring-imaging Čerenkov detector: charged relativistic particles produce a cone-shaped light flash in a Neon-filled vessel; the light is collected on two spots around the beam pipe, each equipped with ~1000 photo-detectors of 18 mm diameter.
■ 100 ps time resolution
■ 10 MHz event rate
■ ~20 photons detected on average per event (hits on the photo-detectors)
■ 48 bytes per event (max 32 hits per event)

Using GPUs in the NA62 L0 Trigger: the RICH Detector Case Study
■ The matching of the partial circular hit patterns on the photo-detectors is done on the GPU.

NA62 RICH L0 Trigger Processor Requirements
■ Network protocol/topology: UDP over point-to-point (no switches) GbE.
■ Throughput:
– input event primitive data rate < 400 MB/s (on 4 GbE links)
– output of trigger results < 50 MB/s (on 1 GbE link)
– output trigger signal on an LVDS link.
■ System response latency < 1 ms, determined by the size of the Readout System memory buffer storing the event data that are candidates to be passed to the higher trigger levels.
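As a quick consistency check of these figures (using the nominal GbE line rate of 125 MB/s per link): four input links give a raw ceiling of roughly 500 MB/s, which accommodates the < 400 MB/s of event primitive data, while the < 50 MB/s of trigger results fits comfortably on the single output link.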

GPU-Based RICH Level 0 Trigger Processor (commodity NIC)
(Diagram: Readout System, GbE links 0/1/3, commodity NIC, chipset, CPU, system memory, PCIe, GPU memory, TTC.)
■ lat_L0TP-GPU = lat_proc + lat_comm: the total system latency is dominated by the communication latency.
■ Aggregated input link bandwidth ≅ 400 MB/s: the throughput is dominated by the link bandwidth.

GPU-Based RICH Level 0 Trigger Processor – Processing Latency
■ lat_proc: time needed to perform the ring pattern-matching on the GPU, with input and output data in device memory.
■ Test setup: Supermicro X8DTG-DF motherboard (Intel Tylersburg chipset), dual Intel Xeon E5620, Nvidia S2050, Intel Gigabit Network Connection, CentOS 5.9, CUDA 4.2, Nvidia driver.
■ Measurement:
    for (int sample = 0; sample < MAXSAMPLES; sample++) {
        starting_cycles[sample] = get_cycles();
        math<<<...>>>(data_gpu, r_data_gpu, utility);   /* kernel launch; configuration not shown */
        cudaDeviceSynchronize();
        stopping_cycles[sample] = get_cycles();
    }
■ Cycles are read from the x86 Time Stamp Counter (a sketch of such a reader follows below).
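The slides do not show how get_cycles() is implemented; since the text says the cycle counts come from the x86 Time Stamp Counter, a minimal sketch (the function name and inline assembly are assumptions, not the authors' exact code) could be:

    #include <stdint.h>

    /* Hypothetical TSC reader: RDTSC returns the 64-bit cycle count in EDX:EAX. */
    static inline uint64_t get_cycles(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

Converting the recorded cycle counts into time then only requires the nominal TSC frequency of the host CPU.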

GPU-Based RICH Level 0 TP Processing Latency Distribution (histogram shown on the slide).

GPU-Based RICH Level 0 TP Processing Latency vs. Input Events
■ lat_proc is stable.
■ It takes at most ~1/10 of the available time budget.
■ What about moving the event data to be processed from the RICH readout board to the GPU memory?
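Given the < 1 ms response budget quoted in the requirements, 1/10 of the budget means the measured lat_proc stays below roughly 100 μs even at the largest event counts tested.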

GPU-Based RICH Level 0 Trigger Processor – Communication Latency
(Diagram: Readout board, GbE, chipset, CPU, system memory, PCIe, GPU memory.)
■ lat_comm: time needed to receive the input event data from the GbE NIC into GPU memory and to send the results back from GPU memory to host memory.

Communication latency timeline:
■ T = 0 μs: start of the send operation on the Readout Board. Data for 20 events (1404 bytes) are sent from the Readout board to the GbE NIC and stored in a receiving kernel buffer on the host.

■ T = 99 μs: data are copied from the kernel buffer to a user-space buffer.

■ T = 104 μs: data are copied from system memory to GPU memory.

■ T = 134 μs: the ring pattern-matching GPU kernel has executed; results are stored in device memory.

■ T = 139 μs: results (322 bytes, 20 results) are copied from GPU memory to system memory.

■ lat_comm ≅ 110 μs ≅ 4 × lat_proc (about 109 μs of the 139 μs total is spent in communication).
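Putting the measured timeline together (all numbers from the preceding slides): data reach user space at about T = 99 μs, GPU memory at T = 104 μs, the pattern-matching kernel finishes at T = 134 μs and the results are back in host memory at T = 139 μs. Roughly 30 μs of the total is GPU processing, so the remaining ~110 μs is communication, which is where lat_comm ≅ 4 × lat_proc comes from.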

■ Fluctuations on the GbE component of lat_comm may hinder the real-time requisite, even at low event counts.
(sockperf ping-pong latency summary with per-percentile observations shown on the slide.)

NaNet
■ Problem: lower the communication latency and its fluctuations.
■ How?
1. Inject data from the NIC directly into GPU memory, with no intermediate buffering.
2. Offload the network protocol stack management from the CPU, avoiding OS jitter effects.
■ NaNet solution:
– use the APEnet+ FPGA-based NIC, which implements GPUDirect RDMA;
– add a network protocol stack offloading engine to the logic (UDP Offloading Engine).

APEnet+
3D Torus Network:
■ Scalable (today up to 32K nodes)
■ Direct network: no external switches.
APEnet+ Card:
■ FPGA based (Altera EP4SGX290)
■ PCI Express x16 slot, signaling capabilities for up to dual x8 Gen2 (peak 4+4 GB/s)
■ Single I/O slot width, 4 torus links, 2-D (secondary piggy-back double slot width, 6 links, 3-D torus topology)
■ Fully bidirectional torus links, 34 Gbps
■ Industry standard QSFP
■ A DDR3 SODIMM bank
APEnet+ Logic:
■ Torus Link
■ Router
■ Network Interface: NIOS II 32-bit microcontroller, RDMA engine, GPU I/O accelerator
■ PCIe x8 Gen2 core

APEnet+ GPUDirect P2P Support
■ PCIe P2P protocol between Nvidia Fermi/Kepler GPUs and APEnet+:
– first non-Nvidia device supporting it;
– joint development with Nvidia;
– the APEnet+ board acts as a peer.
■ No bounce buffers on the host: APEnet+ can target GPU memory with no CPU involvement.
■ GPUDirect allows direct data exchange on the PCIe bus.
■ Real zero copy, inter-node GPU-to-host, host-to-GPU and GPU-to-GPU.
■ Latency reduction for small messages.
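To make the zero-copy receive path concrete, here is a minimal host-side sketch; the CUDA calls are standard, while nanet_register_buffer() and nanet_wait_buffer_event() are hypothetical placeholders standing in for the NaNet/APEnet+ driver API, which is not detailed in the slides.

    #include <cuda_runtime.h>
    #include <stdio.h>

    #define BUF_SIZE (32 * 1472)   /* room for 32 max-size UDP payloads (illustrative) */

    int main(void)
    {
        void *gpu_buf = NULL;

        /* Allocate the receive buffer directly in GPU memory. */
        if (cudaMalloc(&gpu_buf, BUF_SIZE) != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed\n");
            return 1;
        }

        /* Hypothetical driver calls: register the GPU buffer so the NIC can write
         * into it over PCIe peer-to-peer, then block until it has been filled.    */
        /* nanet_register_buffer(gpu_buf, BUF_SIZE); */
        /* nanet_wait_buffer_event();                */

        /* Here the event data would already sit in GPU memory, ready for a
         * processing kernel, with no bounce buffer in host memory.          */

        cudaFree(gpu_buf);
        return 0;
    }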

NaNet Architecture
■ NaNet is based on APEnet+.
■ Multiple link technologies supported:
– apelink, bidirectional, 34 Gbps
– 1 GbE
– 10 GbE SFP+
■ New features:
– UDP offload: extracts the payload from UDP packets;
– NaNet Controller: encapsulates the UDP payload in a newly forged APEnet+ packet and sends it to the RX Network Interface logic.

NaNet UDP Offloader
NiosII UDP Offload:
■ Based on an open IP (P_Offload_Example).
■ It implements a method for offloading the UDP packet traffic from the Nios II.
■ It collects data coming from the Avalon Streaming Interface of the 1/10 GbE MAC.
■ It redirects UDP packets into a hardware processing data path.
■ The current implementation provides a single 32-bit wide channel: 6.4 Gbps (a 64-bit / 12.8 Gbps version for 10 GbE is being worked on).
■ The output of the UDP Offload is the PRBS packet (Size + Payload).
■ Ported to the Altera Stratix IV EP4SGX230 (the original project was based on the Stratix II 2SGX90), with a faster clock (instead of the original 35 MHz).
(Block diagram: 1/10 GbE ports, UDP offload, NaNet CTRL, 32-bit microcontroller, TX/RX block, GPU I/O accelerator, memory controller with on-board memory, PCIe x8 Gen2 core.)
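In software terms, the header stripping performed by the offload engine in hardware corresponds roughly to the following sketch (field offsets assume a plain Ethernet II / IPv4 / UDP frame with no VLAN tag; this is an illustration, not the FPGA implementation):

    #include <stdint.h>
    #include <stddef.h>

    /* Return the UDP payload length and a pointer to it, or -1 if the frame
     * is not a well-formed IPv4/UDP Ethernet frame.                          */
    static int udp_payload(const uint8_t *frame, size_t frame_len,
                           const uint8_t **payload)
    {
        if (frame_len < 14 + 20 + 8) return -1;
        if (frame[12] != 0x08 || frame[13] != 0x00) return -1;     /* EtherType != IPv4 */

        const uint8_t *ip = frame + 14;
        size_t ihl = (size_t)(ip[0] & 0x0f) * 4;                   /* IP header length  */
        if (ip[9] != 17) return -1;                                 /* protocol != UDP   */
        if (14 + ihl + 8 > frame_len) return -1;

        const uint8_t *udp = ip + ihl;
        uint16_t udp_len = (uint16_t)((udp[4] << 8) | udp[5]);      /* header + payload  */

        *payload = udp + 8;
        return (int)udp_len - 8;                                    /* the "size + payload" output */
    }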

NaNet Controller
NaNet Controller:
■ It manages the 1/10 GbE flow by encapsulating packets in the APEnet+ packet protocol (Header, Payload, Footer).
■ It implements an Avalon Streaming Interface.
■ It generates the header for the incoming data, analyzing the PRBS packet and several configuration registers.
■ It parallelizes the 32-bit data words coming from the Nios II subsystem into 128-bit APEnet+ data words.
■ It redirects data packets towards the corresponding FIFO (one for the Header/Footer and another for the Payload).

NaNet–1 (1 GbE)
■ Implemented on the Altera Stratix IV dev board.
■ PHY: Marvell 88E1111.
■ HAL-based Nios II microcontroller firmware (no OS) handling:
– configuration
– GbE driving
– management of the GPU destination memory buffers.

NaNet-10 (dual 10 GbE), work in progress
■ Implemented on the Altera Stratix IV dev board plus a Terasic HSMC Dual XAUI to SFP+ daughtercard.
■ Broadcom BCM8727, a dual-channel 10-GbE SFI-to-XAUI transceiver.

NaNet-10 for NA62 RICH L0 GPU Trigger Processor
(System diagram: RICH Readout System, 1/10 GbE switch, 1 GbE and SFP+ 10 GbE links, TTC card on PCIe 2.0, FPGA-based L0 TP and GPU-based L0 TP.)

NaNet–Apelink (up to 6 34-Gbps channels)
■ Altera Stratix IV dev board + 3-link daughter card.
■ APEnet+ 4-link board (+ 2-link daughter card).

NaNet-1 Data Flow
■ The TSE MAC, UDP OFFLOAD and NaNet CTRL blocks manage the GbE data flow and encapsulate the data in the APEnet+ protocol.
■ RX DMA CTRL:
– manages the CPU/GPU memory write process, providing hardware support for the Remote Direct Memory Access (RDMA) protocol;
– communicates with the μC: it generates the requests needed by the Nios to perform the destination buffer address generation (FIFO REQ) and exploits the instructions collected in the FIFO RESP to instantiate the memory write process.
■ The Nios II handles all the details pertaining to the buffers registered by the application, implementing the zero-copy approach of the RDMA protocol (it stays out of the data stream).
■ EQ DMA CTRL generates a DMA write transfer to communicate the completion of the CPU/GPU memory write process.
■ A performance counter is used to analyze the latency of the GbE data flow.
(Block diagram: MAC, UDP OFFLOAD, NaNet CTRL, FIFO HEADER, FIFO DATA, FIFO REQ, FIFO RESP, RX DMA CTRL, EQ DMA CTRL, Nios II subsystem, PCI.)
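The FIFO REQ/RESP handshake can be pictured with the following toy program; every name in it (types, FIFO accessors, addresses) is an illustrative assumption, since the real logic is Nios II firmware talking to memory-mapped hardware FIFOs.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative stand-ins for the hardware FIFO entries. */
    typedef struct { uint32_t len; } req_t;
    typedef struct { uint64_t dst_addr; uint32_t len; } resp_t;

    static uint64_t buf_base = 0x100000;   /* assumed base of an application-registered buffer */
    static uint64_t buf_off  = 0;

    /* Stub FIFO accessors; the real ones read/write hardware FIFOs. */
    static req_t fifo_req_pop(void)       { req_t r = { 1404 }; return r; }
    static void  fifo_resp_push(resp_t r) { printf("DMA write: %u bytes at 0x%llx\n",
                                                   r.len, (unsigned long long)r.dst_addr); }

    int main(void)
    {
        for (int i = 0; i < 3; i++) {              /* a few iterations for illustration        */
            req_t  req  = fifo_req_pop();          /* raised by RX DMA CTRL when data arrive   */
            resp_t resp = { buf_base + buf_off, req.len };
            fifo_resp_push(resp);                  /* RX DMA CTRL then writes CPU/GPU memory   */
            buf_off += req.len;                    /* advance inside the registered buffer     */
        }
        return 0;
    }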

UDP Packet Latency inside NaNet-1
■ Total latency of a 1472-byte UDP packet through the hardware path:
– from the UDP OFFLOAD Start of Packet signal
– to the EQ DMA CTRL Completion End signal.
■ Appreciable latency variability (7.3 μs ÷ 8.6 μs).
■ Sustained bandwidth ~119.7 MB/s.
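For scale (assuming standard Ethernet framing): a 1472-byte UDP payload occupies 1538 bytes on the wire once preamble, Ethernet/IP/UDP headers, FCS and inter-frame gap are added, so the best achievable payload rate on GbE is about 125 MB/s × 1472/1538 ≈ 119.6 MB/s; the quoted ~119.7 MB/s is therefore essentially wire speed.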

UDP Packet Latency Inside NaNet–1: Detailed Analysis
■ Fine-grained analysis of the UDP packet hardware path to understand the latency variability.
■ UDP TX (3.9 μs, stable): from the UDP offload Start of Packet to the NaNet CTRL End of Packet.
■ Nios wake-up (0.6 μs ÷ 2.0 μs, the culprit): from the FIFO REQ Write Enable to the FIFO REQ Read Enable.
■ Nios processing (4.0 μs, stable): from the FIFO REQ Read Enable to the FIFO RESP Write Enable.
■ GPU memory write (0.5 μs, stable): from the FIFO REQ Read Enable to the FIFO RESP Write Enable.

NaNet-1 Test & Benchmark Setup
■ Supermicro SuperServer 6016GT-TF:
– X8DTG-DF motherboard (Intel Tylersburg chipset)
– dual Intel Xeon E5620
– Intel Gigabit Network Connection
– Nvidia Fermi S2050
– CentOS 5.9, CUDA 4.2, Nvidia driver.
■ NaNet board in a x16 PCIe 2.0 slot.
■ NaNet GbE interface directly connected to one host GbE interface.
■ Common time reference between sender and receiver (they are the same host!), which also eases data integrity tests.

NaNet-1 Latency Benchmark
A single host program (see the sketch below):
■ allocates and registers (pins) N GPU receive buffers of size P × 1472 bytes (the max UDP payload size);
■ in the main loop:
– reads the TSC cycles (cycles_bs)
– sends P UDP packets with 1472-byte payload size over the host GbE interface
– waits (no timeout) for a received-buffer event
– reads the TSC cycles (cycles_ar)
– records latency_cycles = cycles_ar - cycles_bs;
■ dumps the recorded latencies.
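A condensed, self-contained sketch of such a loopback benchmark is given below for P = 1; the socket part is standard, while the nanet_* calls, port and destination address are hypothetical placeholders for the NaNet buffer registration and event wait, which the slides do not spell out.

    #include <cuda_runtime.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define PAYLOAD 1472                       /* max UDP payload size */

    static inline uint64_t get_cycles(void)    /* TSC reader, as sketched earlier */
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        char pkt[PAYLOAD];
        memset(pkt, 0xab, sizeof(pkt));

        /* GPU receive buffer, registered with the (hypothetical) NaNet driver. */
        void *gpu_buf = NULL;
        cudaMalloc(&gpu_buf, PAYLOAD);
        /* nanet_register_buffer(gpu_buf, PAYLOAD); */

        /* UDP socket on the host GbE interface cabled to the NaNet port. */
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = { 0 };
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(6000);                      /* illustrative port    */
        inet_pton(AF_INET, "192.168.1.2", &dst.sin_addr);  /* illustrative address */

        for (int i = 0; i < 1000; i++) {
            uint64_t cycles_bs = get_cycles();             /* TSC before send */
            sendto(s, pkt, sizeof(pkt), 0, (struct sockaddr *)&dst, sizeof(dst));
            /* nanet_wait_buffer_event(); */               /* buffer landed in GPU memory */
            uint64_t cycles_ar = get_cycles();             /* TSC after the receive event */
            printf("%llu\n", (unsigned long long)(cycles_ar - cycles_bs));
        }

        cudaFree(gpu_buf);
        return 0;
    }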

NaNet-1 Latency Benchmark (results plot).


NaNet-1 Bandwidth Benchmark (results plot).

NaNet-Apelink Latency Benchmark
■ The latency is estimated as half the round-trip time in a ping-pong test.
■ ~8-10 μs in the GPU-to-GPU test, a world-record result for GPU-to-GPU latency.
■ The NaNet case is represented by the H-G (host-to-GPU) curve.
■ APEnet+ GPU-to-GPU latency is lower than InfiniBand up to 128 KB message size.
■ APEnet+ P2P latency ~8.2 μs; APEnet+ staging latency ~16.8 μs; MVAPICH/IB latency ~17.4 μs.

NaNet-Apelink Bandwidth Benchmark
■ Virtual-to-physical address translation implemented in the μC:
– Host RX ~1.6 GB/s
– GPU RX ~1.4 GB/s (switching the GPU P2P window before writing)
– limited by the RX processing.
■ GPU TX curves: P2P read protocol overhead.
■ Virtual-to-physical address translation implemented in HW:
– max bandwidth ~2.2 GB/s (physical link limit; loopback test ~2.4 GB/s)
– strong impact on FPGA memory resources! For now limited up to 128 KB
– first implementation, host RX only.

UNICA (Unified Network Interface Card Adaptor)
■ FPGA:
– implementation of the transmission protocol stack control;
– data pre-processing, i.e. reorganization of the data into streams suitable for processing on the GPU.
■ Management of the synchronization of the trigger response with the NA62 experiment clock (TTC interface):
– a daughter card (connected via HSMC) hosting the TTCrq chip: feasible, but it uses many FPGA pins if all the control signals are used;
– alternative (preferred, but to be verified): use 4 SMA connectors to transmit the TTC stream in LVDS and decode it in the FPGA (TTC protocol + sync signals).
■ Reception of the event time-stamp.
■ The response of the UNICA board can consist either of complex primitives passed to a subsequent processing level, or of the final trigger decision to be sent, after proper compensation of the response latency, to the read-out modules of the experiment.



NaNet Data Flow
(Block diagram: Phys Layer, protocol stack, NaNet CTRL, RX CTRL, FLOW CTRL, TX CTRL, custom logic, APEnet protocol encoder/decoder, 128-bit datapath, PCI.)
■ The NaNet CTRL manages the GbE flow by encapsulating packets in the APEnet+ packet protocol (Header, Payload, Footer):
– 128-bit words
– max payload size 4 KB.
■ FLOW CTRL is a simple switch able to redirect the GbE flow towards the custom data pre-processing logic:
– re-use of VHDL code developed in the framework of the NaNet project;
– immediate analysis of the benefits obtained by pre-processing the data for the GPU (custom logic);
– no additional Avalon Streaming Interface.
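The APEnet+ header/footer layout itself is only shown as a figure in the deck, so the following struct is purely illustrative of the framing described above (128-bit words, header + up-to-4-KB payload + footer); the field names and their contents are assumptions.

    #include <stdint.h>

    /* One 128-bit APEnet+ protocol word (illustrative view). */
    typedef struct { uint32_t w[4]; } apenet_word_t;

    /* Illustrative packet framing: header, payload (max 4 KB), footer. */
    typedef struct {
        apenet_word_t header;               /* e.g. destination, command, length  */
        apenet_word_t payload[4096 / 16];   /* up to 256 words = 4 KB of payload  */
        apenet_word_t footer;               /* e.g. completion / integrity info   */
    } apenet_packet_t;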

APEnet+ HEADER/FOOTER (field layout shown as a figure on the slide).


UNICA (Unified Network Interface Card Adaptor)
■ TTC:
– FPGA → TTC: what does the FPGA have to communicate to the TTC?
– TTC → FPGA: what does the FPGA receive from the TTC?
– How does the TTC interface with the custom pre-processing logic?
■ TRIGGER
■ TRIGGER / GPU: what is transferred from the GPU to the FPGA?

THANK YOU