CRAB at Purdue Tutorial

Introduction
Setup local Environment and prepare user analysis code
CRAB setup
- Setup on cms.rcac.purdue.edu:
Data selection
CRAB schedulers
CRAB configuration
Validate a CMSSW config file
Run Crab
- Job Creation

Introduction

First, log in to cms.rcac.purdue.edu:

ssh -l username cms.rcac.purdue.edu

ssh -l username hep.rcac.purdue.edu

Setup local Environment and prepare user analysis code

UI initialization

To use CRAB, first set up an LCG User Interface (UI). It gives you transparent access to WLCG-affiliated resources.

source /group/cms/tools/glite/setup.sh

Proxy setup

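Before you can submit jobs with CRAB, you need a grid certificate. If you do not have one, please refer here to get it. Once the certificate is installed, create a VOMS proxy for the cms virtual organization; the -valid option takes hours:minutes, so 168:0 requests a proxy valid for one week: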

voms-proxy-init -voms cms -valid 168:0
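
Long tasks can outlive a proxy, so it can help to check the remaining lifetime before submitting. The following is a minimal Python sketch, not part of CRAB: the helper names (parse_valid, needs_renewal, proxy_timeleft) and the 24-hour threshold are assumptions made for illustration. It relies on voms-proxy-info -timeleft printing the remaining lifetime in seconds.

```python
import subprocess


def parse_valid(spec):
    """Convert an 'hours:minutes' string such as '168:0' into seconds."""
    hours, minutes = (int(x) for x in spec.split(":"))
    return hours * 3600 + minutes * 60


def needs_renewal(timeleft_seconds, min_hours=24):
    """True if fewer than min_hours remain (the threshold is an assumption)."""
    return timeleft_seconds < min_hours * 3600


def proxy_timeleft():
    """Seconds left on the current proxy, or 0 if it cannot be determined.

    Calls `voms-proxy-info -timeleft`, so it only works on a host where
    the grid UI has been set up.
    """
    result = subprocess.run(
        ["voms-proxy-info", "-timeleft"], capture_output=True, text=True
    )
    try:
        return int(result.stdout.strip())
    except ValueError:
        return 0
```

On the user interface, needs_renewal(proxy_timeleft()) could be evaluated before each submission, and voms-proxy-init rerun when it returns True.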

CMS software initialization

Initialize the CMSSW environment on your user interface by executing the following line:

source /cvmfs/cms.cern.ch/cmsset_default.sh

Prepare user analysis code

Create a CMSSW project area in a directory where you have enough disk space, then enter it and set up the runtime environment:

scram project CMSSW CMSSW_6_0_0 
cd CMSSW_6_0_0/src
cmsenv

CRAB setup

At Purdue, CRAB is installed in the following folder:

/group/cms/crab/CRAB

To find the latest release, check the CRAB web page or the CRAB development forum.

Setup on cms.rcac.purdue.edu:

To set up CRAB, source the script crab.sh (or crab.csh for csh-family shells) located in '/group/cms/crab/' from your CMSSW working directory.

source /group/cms/crab/crab.sh

Data selection

To select the data you want to access, use the DBS web page, where available datasets are listed under DBS Data Discovery or Purdue site data. For this tutorial we will use:

/WJets_TuneD6T_matchingup_7TeV-madgraph-tauola/Fall10-START38_V12-v1/GEN-SIM-RECO

CRAB schedulers

The following table lists the CRAB schedulers supported at the site, with a link to a CRAB configuration file for each.

Scheduler	Pros	Cons	CRAB cfg file
glite	Jobs are sent immediately to the site, so you can see them pending in the local queue.	The submission site is chosen from information provided by the site's information system, which may lead to suboptimal and possibly poor decisions. Your jobs may run later. You may need to resubmit jobs more often when using glite.	glite_cfg
glidein	Glidein picks sites based on where jobs run. Your jobs may run sooner. Glidein protects your job from failing due to site instability.	Jobs stay in the global Condor queue until they start, and you cannot tell whether this is because the site is busy, because no site matches your job requirements, or because of an unknown problem.	glidein_cfg
remoteGlidein (recommended)	Equivalent to direct submission to gLite. Protects your job from failing due to site instability. Gives all the benefits of the glidein scheduler without any of the CRAB server drawbacks: fast command execution, fast submission, real-time status updates, and job status available before the dashboard rather than after. More features are coming in the near future.	Let us know!	remoteGlidein_cfg
condor_g	Has one less intermediate layer than glite and can be more reliable. You may have higher priority at your local site than other grid users.	Uses OSG tools to reach OSG sites. Cannot be used with use_server=1. No protection against job failure due to site instability.	condorg_cfg

Should I use the CRAB server or not?

We currently recommend using remoteGlidein in non-server mode, with:

scheduler=remoteGlidein
use_server=0

Based on our experience, we believe remoteGlidein will give the best results for most users. Beyond that, it is not possible to give a general recommendation: the CRAB server offers advantages but also some drawbacks with respect to submission with use_server=0. The main differences are listed below. Both modes of submitting jobs are supported, and you should choose the one that best fits your needs, your experience, and your time constraints.

Advantages of crab server
- Job status is tracked automatically and available on server web page and dashboard
- Faster submission from the user interface
- Failed jobs will be automatically resubmitted when useful, picking a different site when this makes sense
- No limit on the number of jobs per task (non-server submission is currently limited to 500 jobs per task)
- Larger ISB size allowed (100MB vs. 10MB for non-server)
- Access to both glite and glidein schedulers for grid submission
Advantages of direct submission
- One less layer between user and grid, so fewer things can go wrong
- Problems are easier to debug
- The crab -status command talks directly to the middleware and reports status in real time, also sending it to the dashboard

Disadvantages of crab server
- At times the server loses track of some jobs or tasks, and you need to resubmit them even if they completed OK
- Status updates may lag behind, e.g., the dashboard or a direct query to the grid, and there is no way to force an update
- Some failures in job submission are not as easy to understand

Disadvantages of direct submission
- You need to resubmit failed (e.g. aborted) jobs by hand
- Unless you issue crab -status, job status may not be fully updated in the dashboard
- Can only submit up to 500 jobs in one go

In most cases, the bottom line is:

- With the server, you will need to resubmit successful jobs that the server lost
- With the client (direct submission), you will need to resubmit failed jobs and pay more attention to the list of available sites
- In either case, if jobs fail because of cmsRun problems, you have to debug and resubmit them yourself
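
Because direct submission handles at most 500 jobs in one go, a larger task has to be submitted in several passes. The sketch below is a hypothetical Python helper (submission_chunks is not part of CRAB) that splits a task into consecutive job ranges in the first-last form described under Job Submission. It assumes jobs are numbered from 1.

```python
def submission_chunks(n_jobs, limit=500):
    """Return job-range strings covering jobs 1..n_jobs, at most `limit` each."""
    if n_jobs < 0 or limit < 1:
        raise ValueError("n_jobs must be >= 0 and limit must be >= 1")
    chunks = []
    for first in range(1, n_jobs + 1, limit):
        # The last chunk may be shorter than the limit.
        last = min(first + limit - 1, n_jobs)
        chunks.append(f"{first}-{last}")
    return chunks
```

For a 1200-job task this yields three ranges, each of which could be passed to a separate crab -submit call.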

CRAB configuration

Modify the CRAB configuration file 'crab.cfg' according to your needs; a fully documented template is available at '$CRABDIR/python/crab.cfg'. For more information, see the CRAB configuration parameters. For this tutorial, the relevant sections of the file are [CRAB], [CMSSW], [USER], and [GRID]. The configuration file should be placed in the same directory as the CMSSW parameter set (here tutorial.py) to be used by CRAB. Save the CRAB configuration file:

crab.cfg

with the following content:

[CRAB]
scheduler = remoteGlidein 
use_server = 0 
[CMSSW] 
datasetpath = /WJets_TuneD6T_matchingup_7TeV-madgraph-tauola/Fall10-START38_V12-v1/GEN-SIM-RECO
pset = tutorial.py 
total_number_of_events = 100 
number_of_jobs = 1 
output_file = outfile.root 
[USER] 
return_data = 0
copy_data = 1
storage_element = T2_US_Purdue
user_remote_dir = MyFirstTest
[GRID]
rb = CERN
se_white_list = xrootd.rcac.purdue.edu

Download crab.cfg or tutorial.py here.
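
Since crab.cfg uses an INI-style layout, it can be sanity-checked before running CRAB. The following is a minimal Python sketch using the standard configparser module. The function check_crab_cfg is hypothetical and is not a CRAB feature. It only encodes constraints stated on this page: the four schedulers listed above, the fact that condor_g cannot be used with use_server=1, and the 500-job limit for non-server submission.

```python
import configparser

KNOWN_SCHEDULERS = {"glite", "glidein", "remoteGlidein", "condor_g"}

TUTORIAL_CFG = """
[CRAB]
scheduler = remoteGlidein
use_server = 0
[CMSSW]
datasetpath = /WJets_TuneD6T_matchingup_7TeV-madgraph-tauola/Fall10-START38_V12-v1/GEN-SIM-RECO
pset = tutorial.py
total_number_of_events = 100
number_of_jobs = 1
output_file = outfile.root
[USER]
return_data = 0
copy_data = 1
storage_element = T2_US_Purdue
user_remote_dir = MyFirstTest
[GRID]
rb = CERN
se_white_list = xrootd.rcac.purdue.edu
"""


def check_crab_cfg(text):
    """Return a list of problems found in crab.cfg text (empty if none)."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(text)
    problems = [
        f"missing section [{name}]"
        for name in ("CRAB", "CMSSW", "USER")
        if not cfg.has_section(name)
    ]
    if problems:
        return problems
    scheduler = cfg.get("CRAB", "scheduler", fallback="")
    use_server = cfg.getint("CRAB", "use_server", fallback=0)
    n_jobs = cfg.getint("CMSSW", "number_of_jobs", fallback=0)
    if scheduler not in KNOWN_SCHEDULERS:
        problems.append(f"unknown scheduler: {scheduler!r}")
    if scheduler == "condor_g" and use_server == 1:
        problems.append("condor_g cannot be used with use_server=1")
    if use_server == 0 and n_jobs > 500:
        problems.append("non-server submission is limited to 500 jobs per task")
    return problems
```

Calling check_crab_cfg(open("crab.cfg").read()) returns an empty list for the tutorial configuration. CRAB itself performs its own, more complete checks at job creation.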

Validate a CMSSW config file

Before submitting the created jobs, you can validate the CMSSW config file by running:

crab -validateCfg tutorial.py

This checks and validates your CMSSW config file using the corresponding Python API.

Run Crab

Once your crab.cfg is ready and the whole underlying environment is set up, you can start running CRAB. CRAB provides command-line help, which is useful when you run it for the first time. You can get it via:

crab -h

Job Creation

Job creation checks the availability of the selected dataset and prepares all the jobs for submission according to the job splitting specified in crab.cfg.

The creation process creates a CRAB project directory (default: crab_0__) in the current working directory, where the related CRAB configuration file is cached for further use, avoiding interference with other (already created) projects.

CRAB also allows the user to choose a project name, which can be used later to distinguish multiple CRAB projects in the same directory.

crab -create

Job Submission

The submission command accepts a combination of job numbers and job ranges separated by commas (e.g. 1,2,3-4); the default is all jobs.

To submit all jobs of the last created project with the default name, it's enough to execute the following command:

crab -submit

To submit a specific project:

crab -submit -c <dir name>
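
To double-check which jobs a range string such as 1,2,3-4 selects before submitting, the string can be expanded into individual job numbers. This is a minimal Python sketch of that expansion. The function expand_job_range is hypothetical, and CRAB's own parser may accept or reject inputs differently.

```python
def expand_job_range(spec):
    """Expand a string such as '1,2,3-4' into a sorted list of job numbers."""
    jobs = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range 'first-last' includes both end points.
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"descending range: {part!r}")
            jobs.update(range(first, last + 1))
        else:
            jobs.add(int(part))
    return sorted(jobs)
```

Duplicates are collapsed, so overlapping entries in the string select each job only once.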

Job Status Check

Check the status of the jobs in the latest CRAB project with the following command:

crab -status

To check a specific project:

crab -status -c <dir name>

Job Output Retrieval

For jobs in status done, the output can be retrieved back to the UI. The following command retrieves the output of all jobs with status done in the last created CRAB project:

crab -getoutput all

To get the output of a specific project:

crab -getoutput all -c <dir name>

This command can be repeated as long as there are jobs in status done.

Job Aborted Retrieval

For jobs in status aborted, the output cannot be retrieved back to the UI. The following command retrieves the error information of all jobs:

crab -postMortem all -c <dir name>

Final plot

Each job produces a histogram output file. The files can be combined in the res directory using ROOT's hadd utility:

hadd dummy.root dummy_*.root

Copying output back to your desktop

To copy data from HDFS back to your local machine, use gfal-copy (the command must be entered on a single line):

gfal-copy -vf davs://xrootd.rcac.purdue.edu/store/user/<username>/<userdir>/test.txt file:////tmp/test.txt
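
When several files have to be copied, it can be convenient to build the gfal-copy argument list in a script. The sketch below is a hypothetical Python helper (gfal_copy_command is not part of gfal or CRAB) that assembles the same command as above from its parts. The host and the /store/user layout are taken from the example on this page, and the local path is assumed to be absolute.

```python
def gfal_copy_command(username, remote_path, local_path,
                      host="xrootd.rcac.purdue.edu"):
    """Build the argument list for copying one file from /store/user."""
    if not local_path.startswith("/"):
        raise ValueError("local_path must be an absolute path")
    source = f"davs://{host}/store/user/{username}/{remote_path.lstrip('/')}"
    destination = "file://" + local_path
    return ["gfal-copy", "-vf", source, destination]
```

The returned list can be handed to subprocess.run on a host where gfal-copy is installed and a valid proxy exists.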

Inspecting output file using xrootd

Local users can inspect a ROOT file through xrootd from within a ROOT session by executing a macro with the command: .x roottest.C

roottest.C
{
// Make the CMSSW headers visible to the interpreter and enable FWLite.
gInterpreter->AddIncludePath("/cvmfs/cms.cern.ch/slc5_amd64_gcc462/cms/cmssw/CMSSW_6_0_0/src/");
gSystem->Load("libFWCoreFWLite");
AutoLibraryLoader::enable();
// Open the file through xrootd; the URL must stay on a single line.
TFile *f = new TXNetFile("root://xrootd.rcac.purdue.edu//store/mc/Summer08/DiPion_E300_Eta5/GEN-SIM-RAW/IDEAL_V9_v1/0027/BCF0E000-B87F-DD11-8E24-001EC9AAA021.root","READ");
// Print the number of entries in the Events tree.
TTree* tree = (TTree*)f->Get("Events");
cout<<" Events:"<<tree->GetEntries()<<endl;
f->Close();
}