CRAB at Purdue Tutorial
- Introduction
- Setup local Environment and prepare user analysis code
- CRAB setup
- Data selection
- CRAB schedulers
- CRAB configuration
- Validate a CMSSW config file
- Run Crab
Introduction
First, log in to cms.rcac.purdue.edu:
ssh -l username cms.rcac.purdue.edu
or
ssh -l username hep.rcac.purdue.edu
Setup local Environment and prepare user analysis code
UI initialization
When you want to use CRAB, set up an LCG User Interface (UI). It allows you to access WLCG-affiliated resources in a transparent way.
source /group/cms/tools/glite/setup.sh
Proxy setup
Before you can use CRAB to submit jobs, you need a grid certificate. If you don't have one, please refer here to get it.
voms-proxy-init -voms cms -valid 168:0
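To check that the proxy was created successfully and see its remaining lifetime, you can run:

```shell
# Print the attributes and remaining lifetime of your current grid proxy
voms-proxy-info -all
```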
CMS software initialization
CMSSW releases can be initialized on your user interface by executing the following line:
source /cvmfs/cms.cern.ch/cmsset_default.sh
Prepare user analysis code
Install a CMSSW project in a directory where you have enough user space.
scram project CMSSW CMSSW_6_0_0
cd CMSSW_6_0_0/src
cmsenv
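To verify that the project environment is active, you can check the variables that cmsenv sets, for example:

```shell
# cmsenv defines CMSSW_BASE and CMSSW_VERSION for the current project area
echo $CMSSW_BASE
echo $CMSSW_VERSION
```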
CRAB setup
At Purdue, CRAB is installed in the following folder:
/group/cms/crab/CRAB
To find the latest release, check the CRAB web page or the CRAB development forum.
Setup on cms.rcac.purdue.edu:
To set up CRAB, source the script crab.(c)sh located in '/group/cms/crab/' from your CMSSW working directory.
source /group/cms/crab/crab.sh
Data selection
To select the data you want to access, use the DBS Data Discovery web page, where available datasets are listed, or the Purdue site data page. For this tutorial we'll use:
/WJets_TuneD6T_matchingup_7TeV-madgraph-tauola/Fall10-START38_V12-v1/GEN-SIM-RECO
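As a sketch of a command-line alternative to the web page (assuming the DAS client, das_client.py, is available in your environment), you could query for the dataset directly:

```shell
# Query DAS for the tutorial dataset; das_client.py availability is an assumption
das_client.py --query="dataset=/WJets_TuneD6T_matchingup_7TeV-madgraph-tauola/Fall10-START38_V12-v1/GEN-SIM-RECO"
```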
CRAB schedulers
The following table lists the CRAB schedulers supported at the site, with links to sample crab configuration files.
Scheduler | Pros | Cons | crab cfg file
glite | | | glite_cfg
glidein | | | glidein_cfg
remoteGlidein (recommended) | | let us know! | remoteGlidein_cfg
condor_g | | | condorg_cfg
Should I use the CRAB server or not?
We currently recommend using remoteGlidein in non-server mode with
scheduler=remoteGlidein
use_server=0
Based on our experience, remoteGlidein offers the best overall experience, but it is not possible to give a recommendation that fits every case. The CRAB server offers advantages but also some drawbacks with respect to direct submission with use_server=0. The main differences are listed below. Both modes of submitting jobs are supported; choose whichever best fits your needs, experience, and time constraints.
- Advantages of crab server
- Job status is tracked automatically and available on server web page and dashboard
- Faster submission from the user interface
- Failed jobs will be automatically resubmitted when useful, picking a different site when this makes sense
- No limit on number of jobs per task (non-server submission currently limited at 500 jobs per task)
- Larger ISB size allowed (100MB vs. 10MB for non-server)
- Access to both glite and glidein schedulers for grid submission
- Advantages of direct submission
- One less layer between the user and the grid, so fewer things can go wrong
- Problems are easier to debug
- A crab -status command talks directly to the middleware and reports status in real time, also sending it to the dashboard
- Disadvantages of crab server
- At times the server loses track of some jobs/tasks; you need to resubmit those, even if they completed OK
- Status updates may lag behind, e.g. the dashboard or a direct grid query, and there is no way to force an update
- Some failures in job submission are not as easy to understand
- Disadvantages of direct submission
- You need to resubmit failed (e.g. aborted) jobs by hand
- Unless you issue crab -status, job status may not be fully updated in the dashboard
- Can only submit up to 500 jobs in one go
The bottom line in most cases:
- With the server, you may need to resubmit successful jobs that the server lost track of
- With the client, you will need to resubmit failed jobs and pay more attention to the list of available sites
- In any case, if jobs fail because of cmsRun problems, you have to debug and resubmit them yourself
CRAB configuration
Modify the CRAB configuration file 'crab.cfg' according to your needs: a fully documented template is available at '$CRABDIR/python/crab.cfg'. For more information, see the crab configuration parameters. For this tutorial, the only relevant sections of the file are [CRAB], [CMSSW], [USER], and [GRID]. The configuration file should be in the same directory as the CMSSW parameter set to be used by CRAB. Save the crab configuration file:
crab.cfg
with the following content:
[CRAB]
scheduler = remoteGlidein
use_server = 0
[CMSSW]
datasetpath = /WJets_TuneD6T_matchingup_7TeV-madgraph-tauola/Fall10-START38_V12-v1/GEN-SIM-RECO
pset = tutorial.py
total_number_of_events = 100
number_of_jobs = 1
output_file = outfile.root
[USER]
return_data = 0
copy_data = 1
storage_element= T2_US_Purdue
user_remote_dir = MyFirstTest
[GRID]
rb = CERN
se_white_list = xrootd.rcac.purdue.edu
Download crab.cfg or tutorial.py here.
Validate a CMSSW config file
Before submitting the created jobs, you can validate the CMSSW config file by running
crab -validateCfg tutorial.py
In this way, your CMSSW config file will be checked and validated by the corresponding Python API.
Run Crab
Once your crab.cfg is ready and the whole underlying environment is set up, you can start running CRAB. CRAB provides command-line help, which can be useful the first time. You can get it via:
crab -h
Job Creation
Job creation checks the availability of the selected dataset and prepares all the jobs for submission according to the job splitting specified in crab.cfg.
The creation process creates a CRAB project directory (default: crab_0_<date>_<time>) in the current working directory, where the related crab configuration file is cached for later use, avoiding interference with other (already created) projects.
CRAB also allows the user to choose a project name, so that it can be used later to distinguish multiple CRAB projects in the same directory.
crab -create
Job Submission
With the submission command you can specify a combination of jobs and job ranges separated by commas (e.g. 1,2,3-4); the default is all.
To submit all jobs of the last created project with the default name, it's enough to execute the following command:
crab -submit
to submit a specific project:
crab -submit -c <dir name>
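Creation and submission can also typically be chained in a single invocation:

```shell
# Create the task and submit all its jobs in one step
crab -create -submit
```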
Job Status Check
Check the status of the jobs in the latest CRAB project with the following command:
crab -status
to check a specific project:
crab -status -c <dir name>
Job Output Retrieval
For jobs in status done, it is possible to retrieve their output back to the UI. The following command retrieves the output of all done jobs of the last created CRAB project:
crab -getoutput all
to get the output of a specific project:
crab -getoutput all -c <dir name>
This can be repeated as long as there are jobs in status done.
Job Aborted Retrieval
For jobs in status aborted, the output cannot be retrieved back to the UI. The following command instead retrieves the error information of all jobs:
crab -postMortem all -c <dir name>
Final plot
Each job produces a histogram output file; the files can be combined using ROOT in the res directory:
hadd dummy.root dummy_*.root
Copying output back to your desktop
To copy data from HDFS back to your local machine, use gfal-copy:
gfal-copy -vf davs://xrootd.rcac.purdue.edu/store/user/<username>/<userdir>/test.txt file:///tmp/test.txt
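To find the exact file names before copying, you can list your output directory on the storage element with gfal-ls:

```shell
# List files in your user output directory on the Purdue storage element
gfal-ls davs://xrootd.rcac.purdue.edu/store/user/<username>/<userdir>/
```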
Inspecting output file using xrootd
Local users can inspect a ROOT file from within a ROOT session via xrootd by running the macro below with: .x roottest.C
roottest.C
{
  // Add the CMSSW include path and load the FWLite libraries
  gInterpreter->AddIncludePath("/cvmfs/cms.cern.ch/slc5_amd64_gcc462/cms/cmssw/CMSSW_6_0_0/src/");
  gSystem->Load("libFWCoreFWLite");
  AutoLibraryLoader::enable();
  // Open the file remotely via xrootd
  TFile *f = new TXNetFile("root://xrootd.rcac.purdue.edu//store/mc/Summer08/DiPion_E300_Eta5/GEN-SIM-RAW/IDEAL_V9_v1/0027/BCF0E000-B87F-DD11-8E24-001EC9AAA021.root", "READ");
  TTree *tree = (TTree*)f->Get("Events");
  cout << " Events: " << tree->GetEntries() << endl;
  f->Close();
}