Resource Broker Management
Giuseppe Platania, INFN Catania
Tutorial for Site Administrators, Progetto PI2S2 (Consorzio COMETA, FESR)
Messina, Tuesday 8 November 2005
Outline
– Components
  ● Network Server
  ● Workload Manager
  ● Job Controller
  ● Logging & Bookkeeping
  ● Log Monitor
– Management of RB
– Troubleshooting
RB components: NS
● Authentication
  – the user's CA is authorized
  – the user's DN is in the grid-mapfile
  – the user's certificate has not been revoked by its CA
● Authorization
  – Pool account mapping (LCMAPS)
  – Sandbox disk space
  – Size of the input sandbox (< MAX_INPUT_SB_SIZE)
  – Creation of the sandbox directory with the input files and the user proxy
RB components: WM and JC
● Workload Manager
  – Receives the job submission command and puts the user request in the WM queue
  – Matchmaking: choice of the CE
  – Creates the job file (job wrapper) that is sent to the JC
● Job Controller
  – Submits the job to the chosen CE via Condor-G
  – Copies the input sandbox to the chosen CE
RB components: LB, Locallogger and LM
● Logging & Bookkeeping
  – Logs all job events in its database
● Locallogger
  – Acts as a proxy for the LB: it stores all job events even when the LB is not working
● Log Monitor
  – Parses the Condor logs and writes the events to the LB database
  – If needed, resubmits the job to the WM queue
Logs
● Log files can be found in /var/edgwl/, which contains:
  – logging
  – proxyrenewal
  – logmonitor
  – SandboxDir
  – jobcontrol
  – networkserver
  – workload_manager
● Init scripts can be found in /etc/init.d/edg-wl-*
Possible job flags

Flag        Meaning
SUBMITTED   submission logged in the LB
WAIT        job matchmaking for resources
READY       job being sent to the executing CE
SCHEDULED   job scheduled in the CE queue manager
RUNNING     job executing on a WN of the selected CE queue
DONE        job terminated without grid errors
CLEARED     job output retrieved
ABORT       job aborted by the middleware; check the reason
Management of RB
● Checks to do
  – CA updates and CRL fetching (fetch-crl cron job)
  – VOMS servers' certificates
  – Date (NTP synchronization)
  – Size of SandboxDir
  – MySQL status
  – Daemon status ( for daemon in /etc/init.d/edg-wl-* ; do "$daemon" status ; done )
  – Configuration file /opt/edg/etc/edg_wl.conf
    ● Check that the II_Contact string points to the right top BDII
  – The reasons reported for jobs are all logged in /var/edgwl/logmonitor/log/events.log
  – GRIS: globus-mds status (test it by running ldapsearch)
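Two of the checks above lend themselves to a small script. The following is a minimal sketch, not an official tool: the function names check_daemons and show_ii_contact are hypothetical, the init-script directory and configuration path are the ones named on the slide, and nothing is assumed about the layout of edg_wl.conf beyond the presence of the II_Contact key.

```shell
#!/bin/sh
# Sketch of two RB health checks. Function names are hypothetical.

# Run "status" on every edg-wl-* init script found in a directory.
# The directory defaults to /etc/init.d and can be overridden.
check_daemons() {
    local init_dir="${1:-/etc/init.d}"
    local script
    for script in "$init_dir"/edg-wl-*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -x "$script" ] || continue
        printf '== %s\n' "${script##*/}"
        "$script" status
    done
}

# Print the lines of the configuration file that mention II_Contact,
# so the configured top BDII can be compared with the expected one.
show_ii_contact() {
    local conf="${1:-/opt/edg/etc/edg_wl.conf}"
    grep 'II_Contact' "$conf"
}
```

Calling check_daemons and show_ii_contact without arguments uses the default paths; passing a directory or file makes it possible to try the functions on a copy first.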
● Suggestions
  – Create a separate partition for the /var/edgwl directory
  – Back up the lbserver20 database every day and store the file on a log server (e.g. mysqldump --databases lbserver20 --password=(your password) > `hostname -s`_databases_`date +%y-%m-%d`.sql )
  – If needed, remove by hand the old job directories stored under /var/edgwl/SandboxDir (the purge cron job often does not work well)
  – To browse the whole lbserver20 database, you can install the phpMyAdmin tool and restrict access to trusted machines (and users)
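Before removing old job directories by hand, it helps to list the candidates first. This is a minimal sketch under stated assumptions: the function name list_old_sandboxes is hypothetical, the 30-day default is an arbitrary illustrative threshold, and the directory modification time is assumed to be a reasonable proxy for the job's age. It only inspects the first level under SandboxDir; if sandboxes are nested deeper on your RB, adjust the depth options.

```shell
#!/bin/sh
# List directories under the sandbox root that have not been modified
# for more than a given number of days. Listing only: nothing is deleted.
# The 30-day default is an assumption, not a recommended value.
list_old_sandboxes() {
    local sandbox_root="${1:-/var/edgwl/SandboxDir}"
    local max_age_days="${2:-30}"
    find "$sandbox_root" -mindepth 1 -maxdepth 1 -type d -mtime +"$max_age_days"
}
```

For example, list_old_sandboxes /var/edgwl/SandboxDir 60 prints the first-level directories older than 60 days; review the list before deleting anything.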
Troubleshooting
Troubleshooting /1
If the edg-job-submit or edg-job-list-match command returns the following error message:

    **** Error: API_NATIVE_ERROR ****
    Error while calling the "NSClient::multi" native api
    AuthenticationException: Failed to establish security context...
    **** Error: UI_NO_NS_CONTACT ****
    Unable to contact any Network Server

it means that there is an authentication problem between the UI and the Network Server.
Solution (I)
Check your proxy: you may not have a valid one. Remember to initialize the proxy with the VOMS extensions.

    $ voms-proxy-info --all
    subject  : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe
    issuer   : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe
    identity : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe
    type     : proxy
    strength : 512 bits
    path     : /tmp/x509up_u512
    timeleft : 11:59:55

Note: the proxy shown here has no VOMS extensions!
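The proxy check can be scripted by reading the voms-proxy-info output. The sketch below is illustrative only: the function names are hypothetical, the "timeleft" line format is taken from the sample output above, and the idea that VOMS attributes show up on lines starting with "attribute" is an assumption about the client's output that should be verified on your UI.

```shell
#!/bin/sh
# Helpers that read "voms-proxy-info --all" output on standard input.

# Print the seconds left on the proxy, taken from the first
# "timeleft" line (format HH:MM:SS, as in the sample output).
proxy_seconds_left() {
    awk -F'[: ]+' '/^timeleft/ { print $2 * 3600 + $3 * 60 + $4; exit }'
}

# Succeed if an "attribute" line is present. ASSUMPTION: this is how
# VOMS attributes are printed when the proxy carries VOMS extensions.
has_voms_attributes() {
    grep -q '^attribute'
}
```

A typical use would be voms-proxy-info --all | proxy_seconds_left, treating a result of 0 (or no output at all) as an expired or missing proxy.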
Solution (II)
Verify the time synchronization between the UI and the WMS. Check that ntpd is running:

    /etc/init.d/ntpd status
    ntpd (pid 1742) is running...

and that the date is correct!
Inspect the log file /var/edgwl/networkserver/log/events.log

    05 Sep, 17:00:43 -F- "NS2WM::convertProtocol": Converted String: [ arguments = [ ad = [
        requirements = ( other.GlueCEStateStatus == "Production" ) && ( other.GlueCEStateStatus == "Production" );
        RetryCount = 3;
        Arguments = "-f";
        JobType = "normal ";
        Executable = "/bin/hostname";
        CertificateSubject ="/C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/ =giuseppe.plataniact.infn.it";
        StdOutput = "hostname.out";
        X509UserProxy = "/tmp/user.proxy.0xb74f ";
        OutputSandbox = { "hostname.err","hostname.out" };
        VirtualOrganisation = "gilda";
        rank = - other.GlueCEStateEstimatedResponseTime;
        Type = "job";
        StdError = "hostname.err";
    05 Sep, 17:01:49 -F- "Manager::run": Exception Caught during Client Authentication.

Cause: no CRL installed, or the user's CA is not supported.
Inspect the log file /var/edgwl/networkserver/log/events.log

    05 Sep, 17:00:43 -F- "NS2WM::convertProtocol": Converted String: [ arguments = [ ad = [
        requirements = ( other.GlueCEStateStatus == "Production" ) && ( other.GlueCEStateStatus == "Production" );
        RetryCount = 3;
        Arguments = "-f";
        JobType = "normal ";
        Executable = "/bin/hostname";
        CertificateSubject ="/C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe
        StdOutput = "hostname.out";
        X509UserProxy = "/tmp/user.proxy.0xb74f ";
        OutputSandbox = { "hostname.err","hostname.out" };
        VirtualOrganisation = "gilda";
        rank = - other.GlueCEStateEstimatedResponseTime;
        Type = "job";
        StdError = "hostname.err";
    05 Sep, 17:01:49 -F- "Manager::run": Can’t authorize /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe

Cause: the user DN is not in the grid-mapfile.
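The two messages shown in these log excerpts can be pulled out of a long events.log with a simple filter. This is a minimal sketch: the function name ns_auth_errors is hypothetical, and the patterns cover only the two messages quoted above, so other authentication failures may be logged with different wording.

```shell
#!/bin/sh
# Print Network Server log lines that report a failed client
# authentication or authorization. Only the two messages quoted in the
# excerpts are matched; ".{1,3}" tolerates a plain or typographic apostrophe.
ns_auth_errors() {
    local log_file="${1:-/var/edgwl/networkserver/log/events.log}"
    grep -E 'Exception Caught during Client Authentication|Can.{1,3}t authorize' "$log_file"
}
```

Run without arguments it reads the default log path; an empty result means neither message occurs in the file.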
Troubleshooting /2
If the edg-job-status command returns the abort reason:

    Cannot read JobWrapper output, both from Condor and from Maradona

and the job did not start, possible causes are:
  – batch system submission problem (e.g. batch system in an inconsistent state)
  – WN disk full
  – home directory absent or unwritable
  – time not synchronized between CE and WN
  – mismatch between forward and reverse DNS for the CE name/IP address
  – WN cannot globus-url-copy from/to CE
  – WN cannot scp to/from CE
Troubleshooting /2
If the job did finish, possible causes are:
  – the WN could not do a globus-url-copy to the RB
  – Globus could not send back the job wrapper stdout, e.g. because it was not copied back from the WN to the CE, or because globus-url-copy does not work from the CE to the RB
This combined set of problems can still have a single cause:
  – a firewall limiting outgoing connections (to ports )
  – some CRLs out of date on both CE and WN
  – some CA files absent
  – wrong time (zone) on CE and WN
Troubleshooting /2
If the edg-job-status command returns the following abort reason:

    *************************************************************
    BOOKKEEPING INFORMATION:
    Printing status info for the Job : a.cr.cnaf.infn.it:9000/LHrUgJsLYN4q0VHnJNuz0Q
    Current Status: Aborted
    Status Reason: Cannot plan (a helper failed)
    reached on: Fri Sep 19 10:51:
    *************************************************************

it means that the matchmaking failed because no suitable resources were found for the job.
Troubleshooting /2
Possible reasons for a matchmaking failure:
● Middleware failure, due to Information Service problems:
  – the service is down
  – the information database is not updated and does not contain all the required information
● Application software unavailable:
  – the JDL requires a wrong or unsupported software version
  – the site does not support the requested software
  – the required version is new and the site has not yet updated the application software area
● Wrong user request, when the user asks for:
  – an unsupported CPU type
  – an unsupported operating system
  – unavailable memory
  – an unsupported VO
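The two abort reasons discussed in this section can be mapped to a first place to look. The sketch below is only an illustration: the function names status_reason and abort_hint are hypothetical, it assumes the reason appears on a line starting with "Status Reason:" as in the sample output, and the hints merely summarize the causes listed on the slides.

```shell
#!/bin/sh
# Extract the "Status Reason" text from edg-job-status output on stdin.
status_reason() {
    sed -n 's/^Status Reason:[[:space:]]*//p'
}

# Map an abort reason to the area to investigate first.
# Only the two reasons covered in this tutorial are recognized.
abort_hint() {
    case "$1" in
        *"Cannot plan"*)
            echo "matchmaking: check the Information Service and the JDL requirements" ;;
        *"Cannot read JobWrapper output"*)
            echo "job wrapper output: check batch system, WN disk, time, DNS, CRLs and firewall on CE/WN" ;;
        *)
            echo "unknown: inspect the RB log files" ;;
    esac
}
```

For instance, the output of edg-job-status for a given job can be piped into status_reason and the result passed to abort_hint.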
Link to the RB troubleshooting FAQ
Faq#head-2c6e726a9368ae7ac0e052ce15fd52c7d3f600ef
Questions…