HOWTO get your jobs to run on the Grid


Note on this page

The contents of this page were last updated in 2010. As of May 2014, the contents of Using the Grid supersede this page.

Introduction

What I'm going to outline below is how I got my jobs to run on the Grid and what I needed to do it. I'll try to include documentation where I can and fixes for other OSes. My OSG client machine is a Debian Lenny distro here at the University of Regina. It was a random machine I had available, but the OSG software works fine on it. (It didn't on the Mandriva 2010 distro on my desktop.) This took a month's worth of trial and error fixing bugs and firewall issues on the Grid with Richard Jones, but we have most of them worked out and everything has been running fine for the last week or so.

Updates to Blake's HOWTO are being made by Jake.

Step 1: Getting your Grid Certificate

The security for the grid is quite robust, and as such a signed certificate from a known signing authority is needed. As I am working in Canada, I used WestGrid to get my certificate from Grid Canada under a project already registered at the UofR. You will require a sponsor who will verify that you are part of their project if you are not the project leader. It took 2-3 weeks to get my certificate since people at WestGrid were on holidays at the time; normally it should take a few days. REMEMBER THE PASSWORD YOU SUBMITTED!

OSG user certificates are obtained through the CIlogon CA, operated by NCSA on behalf of the InCommon Federation. You can scan the CIlogon FAQ for instructions on how to request a certificate for use on the OSG.

The instructions on the page noted above involve completing the steps in your browser. It is also possible to complete everything from a unix shell and command line. The only difference will be the order of the steps; the final result will be the same.

As noted in the instructions on the CIlogon FAQ page, you must download your certificate with the same browser, as the same user, on the same machine from which you requested the certificate. You will also need to download the CA certificate before you will be able to download your personal certificate. Following the steps on this page, you will end up with a .p12 file from which you will extract the cert.pem and key.pem files mentioned below. Once these files are created, you will no longer need the .p12 file and should remove it. When you import your certificate, take note of its dates of validity. It is useful to include this information in the name of your final .p12 file (not the one I suggested removing above!), for example uname-osg-7-2017.p12.
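
A minimal sketch of the extraction with OpenSSL, assuming the downloaded file is named downloaded.p12 (use whatever name your browser saved it under; the -nodes flag leaves key.pem unencrypted, as the next section assumes, so protect the files accordingly):

 bash$ openssl pkcs12 -in downloaded.p12 -clcerts -nokeys -out cert.pem
 bash$ openssl pkcs12 -in downloaded.p12 -nocerts -nodes -out key.pem
 bash$ chmod 600 cert.pem key.pem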

When my WestGrid account was finally set up, I was given a key pair (two files, the certificate and the private key for that certificate: cert.pem and key.pem). Keep these somewhere no one else can access them, as they are not encrypted. For encryption security and use on the Grid, I converted them into a PKCS12 file (usercred.p12) on my client machine using OpenSSL. (OpenSSL must be installed on the client; it is generally a distro package, e.g. "apt-get install openssl".) "bash$" indicates the shell prompt. The following command will convert your certificate and key to a PKCS12 file:

 bash$ openssl pkcs12 -export -in cert.pem -inkey key.pem -out usercred.p12

You will be prompted for an export password. This is the password you provided when you applied for or created your certificate.

In your home directory on your client create a directory called ".globus" and move your usercred.p12 file there.

bash$ mkdir -p ~/.globus
bash$ mv usercred.p12 ~/.globus/.

You will need to change the permissions on the files in your .globus directory to user only in order to generate a proxy.

bash$ chmod u=rw,go= ~/.globus/*

Step 2: Registering for the Gluex VO

To access the GlueX VO page, you must install your security certificate in your browser. It'll reject you otherwise.

For Firefox: Under Preferences > Advanced > Encryption > View Certificates > Import you can add your newly created personal certificate.

Now go to the Gluex VO Registration page and register as a user. I selected the simulation, production and software roles. There are a couple of phases with approval processes in between; this will take a day or so to complete. Further information on the registration process can be found in the VOM Registration Service User and Admin Guide.

Step 3: Installing OSG Client software

I installed the software as root, though the OSG instructions imply that this can be done for a single user. I am also using the bash shell. Richard Jones installed it under his cue account at Jefferson Lab, under the directory ~jonesrt/osg-client. Total installed size was 1.0GB, probably a substantial part of your user quota, so installing as root is recommended whenever possible.

To install as root, change to super user. "bash$" indicates the shell prompt. Create a directory where the software will be installed:

bash$ su
bash# mkdir -p /usr/local/osg
bash# cd /usr/local/osg

Pacman Install

To install the OSG Client software, you will require the installer Pacman. I followed the instructions on the Open Science Grid Pacman Install site. Be sure to follow the instructions for OSG 1.2 only. (The Pacman install did not work with my Mandriva distro, which is why I switched to the Debian machine.)

bash# wget http://atlas.bu.edu/~youssef/pacman/sample_cache/tarballs/pacman-3.28.tar.gz
bash# tar --no-same-owner -xzvf pacman-3.28.tar.gz
bash# cd pacman-3.28

For sh and bash shells:

bash# source setup.sh

For csh and tcsh shells:

tcsh# source setup.csh

Installing the OSG client

I am following the OSG 1.2 instructions from here.

bash# cd /usr/local/osg
bash# pacman -get http://software.grid.iu.edu/osg-1.2:client
 Do you want to add http://software.grid.iu.edu/osg-1.2 to trusted.caches? (y/n/yall):  yall

You may be prompted with other questions. Follow the example here.

Now you have to install the CA certificates.

bash# . ./setup.sh
bash# vdt-ca-manage setupca --location local --url osg

Finally, you must turn on the services that you want to run on the client. Normally the cron job that updates the CA certificates from the OSG repository (vdt-update-certs), the cron job that updates the CRLs (fetch-crl), and the condor client (condor) need to be enabled.

bash# vdt-control --list
bash# vdt-control --enable vdt-update-certs
bash# vdt-control --enable fetch-crl
bash# vdt-control --enable condor
bash# vdt-control --on condor

Now everything should be set up fine. Let's check:

bash# source /usr/local/osg/setup.sh
bash# vdt-version

Everything but "CA Certificates" should have an "OK" beside it. I added the "source setup.sh" line to the bash profile so I would not have to manually source it every time. There are instructions for testing the OSG client software here: Validate Clients.

Setting up the OSG client

I've edited the system wide bash profile ("/etc/profile" in Debian) to source the OSG setup.sh:

VDT_LOCATION=/usr/local/osg
export VDT_LOCATION
if [ -r ${VDT_LOCATION}/setup.sh ]; then
    . ${VDT_LOCATION}/setup.sh
fi

You should replace /usr/local/osg with the directory in which you performed the original pacman -get command above.

You must define your client's IP address or the grid will report an error that it doesn't know where the client is. Because there are firewalls involved, we must also define some port ranges. These are the current ports set for the Gluex voms. I have edited /usr/local/osg/vdt/etc/vdt-local-setup.sh and put them there.

bash# emacs  $VDT_LOCATION/vdt/etc/vdt-local-setup.sh

Your vdt-local-setup.sh will look like:

   # This file is sourced by setup.sh.  Use it for any custom setup for this site.
   # This file will be preserved across VDT installations if OLD_VDT_LOCATION is set.
   export GLOBUS_HOSTNAME=your.ip.add.ress
   export GLOBUS_TCP_PORT_RANGE=45000,49999
   export GLOBUS_TCP_SOURCE_RANGE=45000,49999


You must also specify the port range for use by Condor-G in the $CONDOR_CONFIG file (/usr/local/osg/condor/etc/condor_config). I put this at the end of Part 1 in $CONDOR_CONFIG. (Note: condor must be restarted.)

HIGHPORT = 49999
LOWPORT = 45000 


Then restart condor

bash# /etc/init.d/condor stop
bash# /etc/init.d/condor start

Note: if your client machine is also behind a firewall, you must open those ports and a few others. See the OSG Firewall Documentation for help.
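
As a rough sketch for an iptables-based client firewall, the rule below only opens the 45000-49999 range configured above; the full list of required ports is in the OSG firewall documentation:

bash# iptables -A INPUT -p tcp --dport 45000:49999 -j ACCEPT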

Step 4: Running Test Jobs and Practice

Now, to see if it all actually worked... or what didn't! You can exit super-user and perform tasks as a normal Linux user now.

Now that the OSG Client has been installed, you will have to configure it with your grid user certificate in order to start submitting jobs. Make sure you are logged in as a normal user and we will set up a proxy certificate:

bash$ voms-proxy-init -hours 24 -cert ~/.globus/usercred.p12 -voms Gluex

If you have installed your certificate in ~/.globus/usercred.p12, then the -cert option is not required. This command will generate a proxy certificate that is valid for 24 hours. Problems arise if the proxy expires before your jobs complete, so be sure to make this long enough. Use -help with the proxy-init commands to find out other options. You will be prompted to enter your certificate password. The proxy can be renewed at any time (a job submitted to condor retains the proxy it was submitted with and will not pick up a new one afterwards).

To see the proxy information:

bash$ voms-proxy-info -all
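
To check only the remaining lifetime of the proxy (handy before submitting a long job), voms-proxy-info also accepts:

bash$ voms-proxy-info -timeleft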


Now, to try to run something on the grid:

bash$ globus-job-run grendl.phys.uconn.edu /bin/hostname -f

This should return "grendl.phys.uconn.edu". Full path names to the executable must be used, as no environment variables or PATH are defined on the grid this way. The following can be used to discover the full path to an executable if it is unknown:

bash$ globus-job-run grendl.phys.uconn.edu /usr/bin/which hostname


Something more useful would be to look at the contents of a folder where I have built my software:

bash$ globus-job-run grendl.phys.uconn.edu /bin/bash -c 'ls -ltr $OSG_APP/Gluex'

If I wanted to copy a file from my client that I am working on to the grid I would use:

bash$ globus-url-copy file:////home/leverin/condor-tutorial/submit/README \
gsiftp://grendl.phys.uconn.edu/nfs/direct/app/Gluex/test/README_GRID


Copying a file from somewhere on the web would look like:

bash$ globus-url-copy http://www.jlab.org/Hall-D/datatables/hd_res_photon.root \
gsiftp://grendl.phys.uconn.edu/nfs/direct/app/Gluex/test/hd_res_photon.root

SRM

SRM (Storage Resource Manager) is a protocol for Grid access to mass storage systems. The protocol itself is a collaboration (http://sdm.lbl.gov/srm-wg/) between Lawrence Berkeley (LBNL), Fermilab (FNAL), Jefferson Lab (JLab), CERN, and RAL. This is the management tool used for storing the large amounts of data produced by my simulation jobs.

The following will show the contents of the Gluex storage folder where my results are stored. I've saved my HDGEANT output, as this is the most time-intensive part of the job and will not likely need to be redone. The reconstruction/analysis output is saved here so I can move it later for analysis.

bash$ srmls srm://grinch.phys.uconn.edu/Gluex/eta-pi0
bash$ srmls srm://grinch.phys.uconn.edu/Gluex/eta-pi0/hdgeant_output
bash$ srmls srm://grinch.phys.uconn.edu/Gluex/eta-pi0/analysis_output

Other handy commands are srmcp for copying files from the job directory, srmrm for removing files in storage, and srmmkdir for making new folders in storage for organization. They behave like the normal Linux commands with srm prepended. Again, it is necessary to specify the full path to the file locations.

bash$ srmcp file:///$JOB_HOME/hdgeant_cut.hddm srm://grinch.phys.uconn.edu/Gluex/testfolder/HDGEANT_OUTFILE
bash$ srmrm srm://grinch.phys.uconn.edu/Gluex/eta-pi0/testfolder/HDGEANT_OUTFILE
bash$ srmmkdir srm://grinch.phys.uconn.edu/Gluex/testfolder

Step 5: Job Management and Condor-G

The Grid currently uses condor as the job manager. This is the tutorial I followed, and you should too: Job Management with Condor. It is fairly extensive. The only differences when submitting to the GlueX Grid are that we use a different executable and grid-resource.

This is a test script provided to me by Richard: download condor-g0. Untar this and use the submit file there instead of the submit file the tutorial has you create. We just want a simple but non-trivial program to execute, and primetest does not exist on the Gluex grid.
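
To unpack the tarball (assuming it was saved as condor-g0.tgz in your working directory):

bash$ tar -xzvf condor-g0.tgz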

Look at condor-g0.sub in the directory where you untar'd condor-g0.tgz

bash$ more condor-g0.sub

It should look something like:

executable=condor-g0.d/myscript.sh
arguments=TestJob 10
output=condor-g0.d/results.output
error=condor-g0.d/results.error
log=condor-g0.log
notification=never
universe=grid
grid_resource=gt2 gluskap.phys.uconn.edu/jobmanager-condor
#grid_resource=gt4 https://gluskap.phys.uconn.edu:9443 Condor
globusrsl = (condorsubmit=(requirements 'Arch == \"Intel\"'))
queue

Arguments to pass to your script can be put here; the example executable script is told to sleep for 10 seconds. The gt4 grid resource isn't functioning at the moment, but gt2 works well enough for now, so we use that. The cluster is a mix of 64-bit and 32-bit machines; the globusrsl line restricts the job to a 32-bit (Intel) machine.


Now to submit the job and continue with the rest of the tutorial:

bash$ condor_submit condor-g0.sub

A trivial job will take a few minutes from submission to completion due to overhead in the condor process.
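
While waiting, the standard Condor commands (generic Condor, not specific to this tutorial) can be used to watch or remove the job: condor_q lists your queued and running jobs, and condor_rm removes a job by the cluster id that condor_q shows.

bash$ condor_q
bash$ condor_rm <cluster_id>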

Step 6: Building your executables from source

What follows is what I had to do for my executables to work on the grid, due to the dynamic linking of the HallD libraries and the other libraries needed for building and running the software. As such, I will link the scripts used to build HDDS, the HALLD software, a plugin, and my custom executables based on the HallD software. (It may be possible to submit an entirely statically built binary without having to compile source code on the cluster, but that did not seem to work or be possible for what I needed.)

Building the HALLD source code

A setup script, setup.sh, that defines the environment variables is found in $OSG_APP/Gluex/test. This is where a custom $HALLD_HOME should be defined; it is the location that any libraries and executables will be moved to, and is currently set as:

HALLD_HOME=/nfs/direct/app/Gluex/test

Change this to something new if you want your code to be put there, place the new setup.sh in that folder, and source it in your script submitted to condor.
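
As a rough sketch, the start of a script you submit to condor would then look something like the following (the folder name is hypothetical, standing in for wherever you placed your modified setup.sh):

# source the custom environment before any build or run commands
source /nfs/direct/app/Gluex/mytest/setup.sh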


HDDS

Now that HDDS is built separately from the HALLD source code, we need to build it first. The build files are here: download build_hdds.sh and build_hdds.sub. Remember to change the location of setup.sh in build_hdds.sh. The script will download HDDS from the SVN repository, build it in the job directory, then move it to HALLD_HOME and fix the permissions.

Then submit the build job to condor:

bash$ condor_submit build_hdds.sub

Check the log, error and output files for clues to success or failure. Look at HALLD_HOME to see if everything is where it should be and all the permissions are set properly.

bash$ more build_hdds.log
bash$ more build_hdds.error
bash$ more build_hdds.output
bash$ globus-job-run grendl.phys.uconn.edu /bin/ls -ltr /nfs/direct/app/Gluex/test


HALLD SRC

Once HDDS is built, we can now build the HallD source. I needed the bggen and hdgeant executables for background simulation.

The HALLD build files can be taken from here: build_src.sh and build_src.sub. Remember to change the location of setup.sh in build_src.sh. The script will download the src from the SVN trunk, build it, and move it.

Then submit the build job to condor:

bash$ condor_submit build_src.sub

Results can be checked like before.
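
For example, to check that the bggen and hdgeant binaries are in place (the bin subdirectory name below is taken from the hddmcp step further down this page and may differ for your build):

bash$ globus-job-run grendl.phys.uconn.edu /bin/ls -ltr \
/nfs/direct/app/Gluex/test/bin/Linux_CentOS5-i686-gcc4.1.2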


HDParSim

For my reconstruction I need the HDParSim plugin, which does not build with the standard source code. The submission files are here: build_hdparsim.sh and build_hdparsim.sub

Step 7: Building custom source code

filterHighE

For doing background simulations, I needed to filter out a lot of events that wouldn't pass some simple energy and multiplicity cuts, to keep the amount of disc space and CPU time reasonable. For this I used a JANA-based program called filterHighE.

The source code for filterHighE is found here: filterHighE.tgz The submit files are here: build_filterHighE.sh and build_filterHighE.sub

Notice that the transfer parameters in the submission file are uncommented now, as we need to upload filterHighE.tgz to the job to be unpacked and built.
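
For reference, these are the standard Condor file-transfer settings and look something like the sketch below; the build_filterHighE.sub from the link above is authoritative:

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = filterHighE.tgz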

hddmcp

To reduce overall disc space I cut out the unnecessary data from the hdgeant.hddm files. I used hddmcp to do this. The source code and executable are here: cuthddm.tgz. The README file in the tarball explains how to modify and build your own. It does not need to be compiled on the grid and can just be moved to the HALLD_HOME/bin folder. The permissions must be set on the grid, as they don't transfer from the Linux client.

bash$ globus-url-copy file:////home/leverin/gluex/my_src/cuthddm/hddmcp \
gsiftp://grendl.phys.uconn.edu/nfs/direct/app/Gluex/test/bin/Linux_CentOS5-i686-gcc4.1.2/hddmcp
bash$ globus-job-run grendl.phys.uconn.edu /usr/bin/chmod a+x \
/nfs/direct/app/Gluex/test/bin/Linux_CentOS5-i686-gcc4.1.2/hddmcp

fcalTree4

My reconstruction code is here: fcalTree4.tgz. The submission files are here: build_fcaltree4.sh and build_fcaltree4.sub

I am interested in eta-pi0 reconstruction. I use HDParSim to handle the protons and identify DPhoton showers due to charged particles. The output is a ROOT file.

HDParSim needs the resolution tables found here: HOWTO run the semi-parametric Monte Carlo. I moved the root files to /nfs/direct/app/Gluex/eta-pi0/lib and then linked to them in each job rather than moving them each time. This saves some time and bandwidth.
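
A sketch of the linking step inside the job script, assuming the tables live in the lib folder quoted above:

# link the resolution tables into the job directory instead of copying them
ln -s /nfs/direct/app/Gluex/eta-pi0/lib/*.root .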

A Background Simulation and Analysis

My jobs require a longer proxy lifetime, and I submit them under the simulation role.

bash$ voms-proxy-init -hours 72 -cert ~/.globus/usercred.p12 -voms Gluex:/Gluex/simulation

I've written a Perl script that creates the job directories, the run.ffr file needed by bggen, the control.in file needed by hdgeant, and the submission files needed by condor. Each job has unique random number seeds so that each job is statistically independent of the others. The Perl script uses templates and then writes out the unique job files to the job folder.

A tarball of my job submission script is found here: jobsub.tgz

There is a loop in runfullanalysis.pl that determines the job numbers to submit; edit this to vary the job number range. The bggen input card fort.15 has a line that controls the number of events to generate. The script moves the hdgeant_cut.hddm and fcaltree4.root files to storage and then deletes all the files in the job directory, to clean things up and make sure nothing large gets sent back to the client machine.
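
A rough sketch of that end-of-job step (the destination file names and the ${JOBNUM} variable are illustrative; the templates in jobsub.tgz are authoritative):

# copy the large outputs to SRM storage, then remove them from the job directory
srmcp file:///${PWD}/hdgeant_cut.hddm \
srm://grinch.phys.uconn.edu/Gluex/eta-pi0/hdgeant_output/hdgeant_cut_${JOBNUM}.hddm
srmcp file:///${PWD}/fcaltree4.root \
srm://grinch.phys.uconn.edu/Gluex/eta-pi0/analysis_output/fcaltree4_${JOBNUM}.root
rm -f ./*.hddm ./*.root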