Published by Vitale Fois. Modified 9 years ago.
1
Tuesday, 8 November 2005 · Consorzio COMETA “Progetto PI2S2” FESR · Giuseppe Platania - Tutorial for Site Administrators, Progetto PI2S2, Messina 9-11.07.2007
Management of the Resource Broker
Giuseppe Platania, INFN Catania
Tutorial for Site Administrators, Progetto PI2S2
Messina, 9-11.07.2007
2
Outline
– Components
  ● Network Server
  ● Workload Manager
  ● Job Controller
  ● Logging & Bookkeeping
  ● Log Monitor
– Management of RB
– Troubleshooting
3
RB components: NS
● Authentication
  – the user's CA is authorized
  – the user's DN is in the grid-mapfile
  – the user's certificate has not been revoked by its CA
● Authorization
  – Pool account mapping (LCMAPS)
  – Sandbox disk space
  – Size of the input sandbox (< MAX_INPUT_SB_SIZE)
  – Creation of the sandbox directory with the input files and the user proxy
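The input sandbox size check can be illustrated with a minimal bash sketch. It is not the Network Server's actual code: the function name `sandbox_within_limit` and its arguments are hypothetical, and the limit is passed in explicitly instead of being read from the RB configuration.

```shell
# Hypothetical helper: succeeds when the regular files under a directory
# add up to fewer bytes than the given limit (the role that
# MAX_INPUT_SB_SIZE plays on the slide).
sandbox_within_limit() {
    local dir=$1 max_bytes=$2 total
    # Concatenate every regular file and count the bytes; the arithmetic
    # expansion strips any padding that wc may print.
    total=$(( $(find "$dir" -type f -exec cat {} + | wc -c) ))
    [ "$total" -lt "$max_bytes" ]
}
```

A call such as `sandbox_within_limit ./my_input_files 10000000 || echo "input sandbox too large"` would flag an oversized sandbox before submission; both the path and the limit here are placeholders.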
4
RB components: WM and JC
● Workload Manager
  – Receives the job submission command and puts the user request in the WM queue
  – Matchmaking: choice of the CE
  – Creation of the job file (job wrapper) that is sent to the JC
● Job Controller
  – Job submission to the chosen CE via Condor-G
  – Copy of the input sandbox to the chosen CE
5
RB components: LB, Locallogger and LM
● Logging & Bookkeeping
  – Logs all job events in its database
● Locallogger
  – Acts as the LB proxy: it stores all job events even if the LB is not working
● Log Monitor
  – Parses the Condor log and writes to the LB database
  – If needed, resubmits the job to the WM queue
6
Logs
● Log files can be found in /var/edgwl/
    logging
    proxyrenewal
    logmonitor
    SandboxDir
    jobcontrol
    networkserver
    workload_manager
● Init scripts can be found in /etc/init.d/edg-wl-*
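A quick way to confirm that the directories listed above are in place is sketched below. The function name `check_edgwl_dirs` is hypothetical; the directory names are the ones from the slide, and the base directory can be overridden for testing.

```shell
# Report which of the directories named on the slide exist under a base
# directory (default /var/edgwl). Prints one "present" or "missing" line
# per directory.
check_edgwl_dirs() {
    local base=${1:-/var/edgwl} d
    for d in logging proxyrenewal logmonitor SandboxDir jobcontrol \
             networkserver workload_manager; do
        if [ -d "$base/$d" ]; then
            echo "present $base/$d"
        else
            echo "missing $base/$d"
        fi
    done
}
```

Running `check_edgwl_dirs | grep '^missing'` on the RB would show only the directories that are absent.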
7
EGEE Resource Broker: possible job flags

Flag        Meaning
SUBMITTED   submission logged in the LB
WAIT        job matchmaking for resources
READY       job being sent to the executing CE
SCHEDULED   job scheduled in the CE queue manager
RUNNING     job executing on a WN of the selected CE queue
DONE        job terminated without grid errors
CLEARED     job output retrieved
ABORT       job aborted by the middleware, check the reason
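The flag table can be kept at hand as a small lookup function, for example in a monitoring script. This is only a sketch: `job_flag_meaning` is a hypothetical name, and the flag spellings are the ones used on the slide, which may differ from the strings the command-line tools print.

```shell
# Map a job flag (as spelled on the slide) to its meaning.
# Returns 1 and prints a message on stderr for an unknown flag.
job_flag_meaning() {
    case $1 in
        SUBMITTED) echo "submission logged in the LB" ;;
        WAIT)      echo "job match making for resources" ;;
        READY)     echo "job being sent to executing CE" ;;
        SCHEDULED) echo "job scheduled in the CE queue manager" ;;
        RUNNING)   echo "job executing on a WN of the selected CE queue" ;;
        DONE)      echo "job terminated without grid errors" ;;
        CLEARED)   echo "job output retrieved" ;;
        ABORT)     echo "job aborted by middleware, check reason" ;;
        *)         echo "unknown flag: $1" >&2; return 1 ;;
    esac
}
```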
8
Management of RB
9
● Checks to do
  – CA updates and CRL fetching (fetch-crl cron job)
  – VOMS servers' certificates
  – Date (NTP synchronization)
  – Size of SandboxDir
  – MySQL status
  – Daemon status:
      for daemon in `ls /etc/init.d | grep edg-wl-` ; do /etc/init.d/$daemon status ; done
  – Configuration file /opt/edg/etc/edg_wl.conf
    ● Check that the II_Contact string points to the right top BDII
  – The reasons reported for jobs are logged in /var/edgwl/logmonitor/log/events.log
  – GRIS: globus-mds status (test it by running ldapsearch)
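The II_Contact check can be scripted roughly as follows. The sketch assumes that the configuration file contains an assignment of the form `II_Contact = "hostname";`; that layout is an assumption, and `ii_contact` is a hypothetical helper name.

```shell
# Print the value assigned to II_Contact in the RB configuration file.
# Assumes a line of the form:  II_Contact = "some.host.name";
ii_contact() {
    local conf=${1:-/opt/edg/etc/edg_wl.conf}
    sed -n 's/^[[:space:]]*II_Contact[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$conf"
}
```

The printed host name can then be compared by eye, or in a script, with the top BDII that the site is expected to use.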
10
● Suggestions
  – Create a separate partition for the /var/edgwl directory
  – Back up the lbserver20 database every day and store the file on a log server, e.g.
      mysqldump --databases lbserver20 --password=(your password) > `hostname -s`_databases_`date +%y-%m-%d`.sql
  – If needed, remove by hand the old job directories stored under /var/edgwl/SandboxDir (the purge cron job often does not work well)
  – To browse the whole content of the lbserver20 database, you can install phpMyAdmin and restrict access to trusted machines (and users)
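Before removing old job directories by hand, it helps to list the candidates first. The sketch below only prints directories and deletes nothing. It assumes that the first level under SandboxDir is what should be inspected, which the slides do not specify; `list_stale_sandboxes` and the 30-day default are hypothetical choices.

```shell
# List first-level directories under the sandbox area that have not been
# modified for more than a given number of days. Nothing is removed.
list_stale_sandboxes() {
    local base=${1:-/var/edgwl/SandboxDir} days=${2:-30}
    find "$base" -mindepth 1 -maxdepth 1 -type d -mtime +"$days"
}
```

After reviewing the output, an administrator can remove the listed directories manually.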
11
Troubleshooting
12
Troubleshooting /1
If the edg-job-submit or edg-job-list-match command returns the following error message:

    **** Error: API_NATIVE_ERROR ****
    Error while calling the "NSClient::multi" native api
    AuthenticationException: Failed to establish security context...
    **** Error: UI_NO_NS_CONTACT ****
    Unable to contact any Network Server

it means that there are authentication problems between the UI and the Network Server.
13
Solution (I)
Check your proxy: you may not have a valid one. Remember to initialize the proxy with the VOMS extensions.

    $ voms-proxy-info --all
    subject   : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/Email=giuseppe.platania@ct.infn.it/CN=proxy
    issuer    : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/Email=giuseppe.platania@ct.infn.it
    identity  : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/Email=giuseppe.platania@ct.infn.it
    type      : proxy
    strength  : 512 bits
    path      : /tmp/x509up_u512
    timeleft  : 11:59:55

No VOMS extensions!
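The remaining proxy lifetime can be read from the `timeleft` field of the output shown above. The following bash sketch converts it to seconds. It assumes the `timeleft : HH:MM:SS` layout from the slide, with each field on its own line as the command prints it. `proxy_timeleft_seconds` is a hypothetical helper name, and the sketch does not check for VOMS extensions.

```shell
# Read voms-proxy-info output on stdin and print the proxy lifetime in
# seconds, based on a line such as "timeleft : 11:59:55".
# Returns 1 if no timeleft line is found.
proxy_timeleft_seconds() {
    local t h m s
    t=$(sed -n 's/^timeleft[[:space:]]*:[[:space:]]*//p' | head -n 1)
    [ -n "$t" ] || return 1
    IFS=: read -r h m s <<< "$t"
    # 10# forces base 10, so that values like "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
```

For example, `voms-proxy-info --all | proxy_timeleft_seconds` prints 43195 for the proxy shown on the slide (11 h 59 min 55 s).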
14
Solution (II)
Verify the time synchronization between the UI and the WMS. Check that ntpd is running

    /etc/init.d/ntpd status
    ntpd (pid 1742) is running...

and that the date is correct!
15
Inspect the log file /var/edgwl/networkserver/log/events.log

    05 Sep, 17:00:43 -F- "NS2WM::convertProtocol": Converted String: [ arguments = [ ad = [
        requirements = ( other.GlueCEStateStatus == "Production" ) && ( other.GlueCEStateStatus == "Production" );
        RetryCount = 3;
        Arguments = "-f";
        JobType = "normal ";
        Executable = "/bin/hostname";
        CertificateSubject = "/C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/Email=giuseppe.platania@ct.infn.it";
        StdOutput = "hostname.out";
        X509UserProxy = "/tmp/user.proxy.0xb74f6768.20060905170043677437";
        OutputSandbox = { "hostname.err","hostname.out" };
        VirtualOrganisation = "gilda";
        rank = - other.GlueCEStateEstimatedResponseTime;
        Type = "job";
        StdError = "hostname.err";
    05 Sep, 17:01:49 -F- "Manager::run": Exception Caught during Client Authentication.

Cause: no CRL installed, or the user's CA is not supported.
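When the log is long, the authentication messages can be isolated by filtering on the component name. The sketch below assumes that each log entry starts on its own line and carries the quoted component name, as in the excerpt above. `ns_manager_errors` is a hypothetical helper name and the default of 20 lines is arbitrary.

```shell
# Print the most recent entries written by "Manager::run" in the
# Network Server log (default: the last 20 such entries).
ns_manager_errors() {
    local log=${1:-/var/edgwl/networkserver/log/events.log} n=${2:-20}
    grep -F '"Manager::run":' "$log" | tail -n "$n"
}
```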
16
Inspect the log file /var/edgwl/networkserver/log/events.log

    05 Sep, 17:00:43 -F- "NS2WM::convertProtocol": Converted String: [ arguments = [ ad = [
        requirements = ( other.GlueCEStateStatus == "Production" ) && ( other.GlueCEStateStatus == "Production" );
        RetryCount = 3;
        Arguments = "-f";
        JobType = "normal ";
        Executable = "/bin/hostname";
        CertificateSubject = "/C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/Email=giuseppe.platania@ct.infn.it";
        StdOutput = "hostname.out";
        X509UserProxy = "/tmp/user.proxy.0xb74f6768.20060905170043677437";
        OutputSandbox = { "hostname.err","hostname.out" };
        VirtualOrganisation = "gilda";
        rank = - other.GlueCEStateEstimatedResponseTime;
        Type = "job";
        StdError = "hostname.err";
    05 Sep, 17:01:49 -F- "Manager::run": Can't authorize /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/Email=giuseppe.platania@ct.infn.it.

Cause: the user DN is not in the grid-mapfile.
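Whether a DN is listed in the grid-mapfile can be checked with a fixed-string search. The sketch assumes the usual grid-mapfile layout, in which each line holds the DN in double quotes followed by the local account, and the common default location /etc/grid-security/grid-mapfile. `dn_in_gridmap` is a hypothetical helper name.

```shell
# Succeed if the given DN appears, in double quotes, in the grid-mapfile.
dn_in_gridmap() {
    local dn=$1 mapfile=${2:-/etc/grid-security/grid-mapfile}
    # -F treats the DN as a literal string, so "/" and "=" need no escaping.
    grep -qF "\"$dn\"" "$mapfile"
}
```

Called with the DN from the "Can't authorize" log entry, the function tells the administrator whether the mapping is really missing or whether the problem lies elsewhere.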
17
Troubleshooting /2
If the edg-job-status command returns

    Aborted reason: Cannot read JobWrapper output, both from Condor and from Maradona

The job did not start:
– batch system submission problem (e.g. batch system in a crazy state)
– WN disk full
– home directory absent or unwritable
– time not synchronized between CE and WN
– mismatch between forward and reverse DNS for the CE name/IP address
– WN cannot globus-url-copy from/to CE
– WN cannot scp to/from CE
18
Troubleshooting /2
The job did finish:
– the WN could not do a globus-url-copy to the RB
– Globus could not send back the job wrapper stdout, e.g. because it was not copied back from the WN to the CE, or because globus-url-copy does not work from the CE to the RB
This combined set of problems can still have a single cause:
– a firewall limiting outgoing connections (to ports 20000-25000)
– some CRLs out of date on both CE and WN
– some CA files absent
– wrong time (zone) on CE and WN
19
Troubleshooting /2
If the edg-job-status command returns

    *************************************************************
    BOOKKEEPING INFORMATION:
    Printing status info for the Job : https://wn-02-32-a.cr.cnaf.infn.it:9000/LHrUgJsLYN4q0VHnJNuz0Q
    Current Status: Aborted
    Status Reason: Cannot plan (a helper failed)
    reached on: Fri Sep 19 10:51:48 2003
    *************************************************************

it means that matchmaking failed because no suitable resources were found for the job.
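The status and the reason can be pulled out of the edg-job-status output with two small filters. The sketch assumes the `Current Status:` and `Status Reason:` labels shown in the excerpt above; the function names are hypothetical.

```shell
# Read edg-job-status output on stdin and print the current status.
job_current_status() {
    sed -n 's/^[[:space:]]*Current Status:[[:space:]]*//p'
}

# Read edg-job-status output on stdin and print the status reason.
job_status_reason() {
    sed -n 's/^[[:space:]]*Status Reason:[[:space:]]*//p'
}
```

A script could, for instance, save the output of edg-job-status to a file, run both filters on it, and flag jobs whose reason is "Cannot plan (a helper failed)" for a matchmaking check.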
20
Troubleshooting /2
● Middleware failure, due to Information Service problems:
  – the service is down
  – the information database is not updated and does not contain all the required information
● Application software unavailable:
  – the JDL requires a wrong/unsupported software version
  – the site does not support the requested software
  – the required version is new and the site has not yet updated the application software area
● Wrong user request, when the user asks for:
  – an unsupported CPU type
  – an unsupported operating system
  – unavailable memory
  – an unsupported VO
21
Link for RB troubleshooting
http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq#head-2c6e726a9368ae7ac0e052ce15fd52c7d3f600ef
22
Questions…