Purdue CMS Tier-2 Center

CRAB at Purdue Tutorial

Introduction

First, log in to cms.rcac.purdue.edu:

ssh -l username cms.rcac.purdue.edu

or

ssh -l username hep.rcac.purdue.edu

 

Setup local Environment and prepare user analysis code

UI initialization

Before using CRAB, set up an LCG User Interface. It lets you access WLCG-affiliated resources transparently.

source /group/cms/tools/glite/setup.sh 

Proxy setup

Before you can submit jobs with CRAB, you need a grid certificate. If you don't have one, please refer here to get it. Then create a proxy:

voms-proxy-init -voms cms -valid 168:0
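
The -valid argument above is given as hours:minutes. As a small worked example, the sketch below converts such a value to seconds; valid_to_seconds is a hypothetical helper introduced here, not part of the VOMS tools.

```shell
# Convert a lifetime written as "H:M" (the format passed to -valid) into seconds.
# valid_to_seconds is a hypothetical helper, not part of the VOMS client tools.
valid_to_seconds() {
    hours=${1%%:*}      # text before the first colon
    minutes=${1#*:}     # text after the first colon
    echo $(( hours * 3600 + minutes * 60 ))
}

valid_to_seconds 168:0    # the lifetime requested in this tutorial
```

For 168:0 this prints 604800, i.e. 168 hours or 7 days. If your UI provides voms-proxy-info, its -timeleft option should report the remaining lifetime in seconds, which can be compared against this number.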

 

CMS software initialization

CMSSW releases can be initialized on your user interface by executing the following line:

source /cvmfs/cms.cern.ch/cmsset_default.sh  

Prepare user analysis code

Install a CMSSW project in a directory where you have enough user space.

scram project CMSSW CMSSW_6_0_0 
cd CMSSW_6_0_0/src
cmsenv

 


CRAB setup 

At Purdue, CRAB is installed in the following folder:

/group/cms/crab/CRAB 


To find out the latest release, check the CRAB web page or the CRAB development forum.

 

Setup on cms.rcac.purdue.edu:

To set up CRAB, source the script crab.sh (or crab.csh) located in '/group/cms/crab/' from your CMSSW working directory.

source /group/cms/crab/crab.sh 
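
If you put the setup steps in a login script, it can help to guard each source call so that a missing file is reported clearly. The following is only a minimal sketch; source_if_present is a hypothetical helper name, and the paths in the comment are the ones used in this tutorial.

```shell
# Source a setup script only if it is readable; otherwise warn and return non-zero.
# source_if_present is a hypothetical helper introduced for this sketch.
source_if_present() {
    if [ -r "$1" ]; then
        . "$1"
    else
        echo "setup script not found: $1" >&2
        return 1
    fi
}

# Example (run from your CMSSW working directory, after cmsenv):
#   source_if_present /group/cms/crab/crab.sh
```

Because the script is sourced, not executed, any variables it sets remain available in your current shell.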

Data selection

To select the data you want to access, use the DBS web page, where available datasets are listed under DBS Data Discovery or Purdue site data. For this tutorial we'll use:

/WJets_TuneD6T_matchingup_7TeV-madgraph-tauola/Fall10-START38_V12-v1/GEN-SIM-RECO

CRAB schedulers

The CRAB schedulers supported at the site are listed below, each with its pros, its cons, and a link to its crab configuration file.

Scheduler: glite
  Pros:
    • jobs are sent immediately to the site, so you can see them pending in the local queue
  Cons:
    • the choice of submission site is based on information provided by the site's information system, which may lead to suboptimal and possibly poor decisions. Your jobs may run later.
    • jobs may need to be resubmitted by you more often when using glite
  crab cfg file: glite_cfg

Scheduler: glidein
  Pros:
    • glidein picks sites based on where jobs run. Your jobs may run sooner.
    • glidein protects your job from failing due to site instability.
  Cons:
    • jobs stay in the global condor queue until they start, and you do not know whether this is because:
      • the site is busy.
      • no site matches your job requirements.
      • there is an unknown problem.
  crab cfg file: glidein_cfg

Scheduler: remoteGlidein (recommended one)
  Pros:
    • it is the equivalent of direct submission to gLite.
    • it protects your job from failing due to site instability.
    • you get all the benefits of the glidein scheduler w/o any of the CrabServer drawbacks.
    • fast command execution, fast submission, real-time status update; it can tell the job status before the dashboard rather than after.
    • more features coming in the near future.
  Cons:
    • let us know!
  crab cfg file: remoteGlidein_cfg

Scheduler: condor_g
  Pros:
    • It has one less intermediate layer than glite and can be more reliable.
    • You may have higher priority at your local site than other grid users.
  Cons:
    • uses OSG tools to reach OSG sites.
    • cannot be used with use_server=1.
    • no protection against job failure due to site instability.
  crab cfg file: condorg_cfg

Should I use CRAB server or not?

We currently recommend using remoteGlidein in non-server mode with:


scheduler=remoteGlidein 
use_server=0 


Based on our experience, we believe remoteGlidein will give the best results for most users, but no single recommendation fits every case. The CRAB server offers advantages but also some drawbacks compared with submission with use_server=0; the main differences are listed below. Both modes of submitting jobs are supported, and you should choose whichever best fits your needs, your experience, time constraints, etc.

  • Advantages of crab server
    • Job status is tracked automatically and available on server web page and dashboard
    • Faster submission from the user interface
    • Failed jobs will be automatically resubmitted when useful, picking a different site when this makes sense
    • No limit on number of jobs per task (non-server submission currently limited at 500 jobs per task)
    • Larger ISB size allowed (100MB vs. 10MB for non-server)
    • Access to both glite and glidein schedulers for grid submission
  • Advantages of direct submission
    • One less layer between user and grid, so fewer things can go wrong
    • Problems are easier to debug
    • A crab -status command talks directly to the middleware and reports status in real time, also sending it to the dashboard
  • Disadvantages of crab server
    • At times the server loses track of some jobs/tasks; you need to resubmit those, even if they completed OK
    • Status updates may lag behind e.g. the dashboard or a direct query to the grid, and there is no way to force an update
    • Some failures in job submission are not as easy to understand
  • Disadvantages of direct submission
    • You need to resubmit failed (e.g. aborted) jobs by hand
    • Unless you issue crab -status, job status may not be fully updated in the dashboard
    • Can only submit up to 500 jobs in one go


The bottom line, in most cases:

  • With server, you will need to resubmit successful jobs that server lost
  • With client, you will need to resubmit failed jobs and pay more attention to the list of available sites
  • In any case, if jobs fail because of cmsRun problems you have to debug and resubmit yourself
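
As a quick worked example of the 500-jobs-per-task limit quoted above for non-server submission, the sketch below computes how many separate tasks a given number of jobs would need. The function name tasks_needed is hypothetical, and the limit is simply the figure stated in the list above.

```shell
# Number of non-server tasks needed for a given job count.
MAX_JOBS_PER_TASK=500   # limit quoted in the list above

# tasks_needed is a hypothetical helper introduced for this sketch.
tasks_needed() {
    # Integer ceiling division: (n + limit - 1) / limit
    echo $(( ($1 + MAX_JOBS_PER_TASK - 1) / MAX_JOBS_PER_TASK ))
}

tasks_needed 1200
```

For 1200 jobs this prints 3, so the work would have to be split over three tasks in non-server mode.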

CRAB configuration

Modify the CRAB configuration file 'crab.cfg' according to your needs; a fully documented template is available at '$CRABDIR/python/crab.cfg'. For more information, see the crab configuration parameters. For this tutorial, the relevant sections of the file are [CRAB], [CMSSW], [USER], and [GRID]. The configuration file should be placed in the same directory as the CMSSW parameter-set to be used by CRAB. Save the crab configuration file:

crab.cfg 

with the following content:

[CRAB]
scheduler = remoteGlidein
use_server = 0
[CMSSW]
datasetpath = /WJets_TuneD6T_matchingup_7TeV-madgraph-tauola/Fall10-START38_V12-v1/GEN-SIM-RECO
pset = tutorial.py
total_number_of_events = 100
number_of_jobs = 1
output_file = outfile.root
[USER]
return_data = 0
copy_data = 1
storage_element = T2_US_Purdue
user_remote_dir = MyFirstTest
[GRID]
rb = CERN
se_white_list = cms-gridftp.rcac.purdue.edu

Download crab.cfg or tutorial.py here.
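
Before creating jobs, you may want to double-check a few values in your crab.cfg from the command line. The sketch below is one possible way to do that with awk; cfg_get and check_recommended are hypothetical helpers, and the parsing is deliberately simple (it ignores section headers and returns the first matching key).

```shell
# Print the value of a key from a crab.cfg-style file (first match wins).
# cfg_get is a hypothetical helper; usage: cfg_get <file> <key>
cfg_get() {
    awk -F'=' -v key="$2" '
        {
            name = $1
            gsub(/[ \t]/, "", name)                  # strip blanks around the key
            if (name == key) {
                value = $2
                gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim the value
                print value
                exit
            }
        }' "$1"
}

# Succeeds only if the file uses the combination recommended in this tutorial.
check_recommended() {
    [ "$(cfg_get "$1" scheduler)" = "remoteGlidein" ] &&
    [ "$(cfg_get "$1" use_server)" = "0" ]
}
```

For example, `cfg_get crab.cfg storage_element` should print T2_US_Purdue for the file above, and `check_recommended crab.cfg && echo OK` confirms the scheduler settings.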

 

Validate a CMSSW config file

Before submitting the created jobs, you can validate the CMSSW config file by running:

crab -validateCfg tutorial.py 

This way, your CMSSW config file is checked and validated by the corresponding Python API.

Run Crab

Once your crab.cfg is ready and the whole underlying environment is set up, you can start running CRAB. CRAB provides command-line help, which is useful the first time you use it. You can get it via:

crab -h 

 

Job Creation

Job creation checks the availability of the selected dataset and prepares all the jobs for submission according to the job splitting specified in crab.cfg.

The creation process creates a CRAB project directory (default: crab_0_<date>_<time>) in the current working directory, where the related crab configuration file is cached for further usage, avoiding interference with other (already created) projects.

CRAB also allows the user to choose a project name, so that it can be used later to distinguish multiple CRAB projects in the same directory.

 

crab -create 
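
When several projects live in the same directory, a script may need to locate the most recent one, for example to find its res directory. The sketch below assumes that default project directories are named crab_0_<date>_<time> with a date and time that sort chronologically by name; both that naming assumption and the helper latest_project are illustrative.

```shell
# Print the most recent CRAB project directory under a given directory.
# ASSUMPTION: default names look like crab_0_<date>_<time>, so sorting by
# name gives the same order as sorting by creation time.
# latest_project is a hypothetical helper; usage: latest_project [dir]
latest_project() {
    ls -d "${1:-.}"/crab_0_* 2>/dev/null | sort | tail -n 1
}
```

Called without arguments it looks in the current directory and prints nothing if no project directory exists there.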

 

Job Submission

 

With the submission command it's possible to specify a combination of jobs and job ranges separated by commas (e.g. 1,2,3-4); the default is all.

 

To submit all jobs of the last created project with the default name, it's enough to execute the following command:

 

crab -submit 

 

To submit a specific project:

 

 crab -submit -c <dir name> 
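
To preview which jobs a selection such as 1,2,3-4 covers before submitting, you can expand it in the shell. This is only an illustrative sketch of the comma-and-range syntax described above; expand_jobs is a hypothetical helper and is not part of CRAB.

```shell
# Expand a job selection such as "1,2,3-4" into one job number per line.
# expand_jobs is a hypothetical helper, not a CRAB command.
expand_jobs() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        # A single number has no "-" part, so "last" is empty and
        # the range collapses to that one job.
        seq "$first" "${last:-$first}"
    done
}

expand_jobs 1,2,3-4
```

This prints the job numbers 1, 2, 3 and 4, one per line.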

 

Job Status Check

 

Check the status of the jobs in the latest CRAB project with the following command:

crab -status 

To check a specific project:

crab -status -c <dir name> 

 

Job Output Retrieval

 

For the jobs which are in status done it's possible to retrieve their output back to the UI. The following command retrieves the output of all jobs with status done of the last created CRAB project:

crab -getoutput all

To get the output of a specific project:

crab -getoutput all -c <dir name> 

This command can be repeated as long as there are jobs in status done.
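
The status and output commands above come in two forms, with and without -c <dir name>. If you script them, a small wrapper can add the -c option only when a project directory is given. This is a sketch under stated assumptions: crab_in_project and the CRAB_BIN variable are hypothetical names introduced here, and CRAB_BIN exists only so the wrapper can be dry-run.

```shell
# Command used to invoke CRAB; set CRAB_BIN="echo crab" for a dry run.
# CRAB_BIN is a hypothetical variable introduced for this sketch.
CRAB_BIN=${CRAB_BIN:-crab}

# Run a crab sub-command, appending "-c <dir>" only when a directory is given.
# Usage: crab_in_project <dir or ""> <crab arguments...>
crab_in_project() {
    project=$1
    shift
    if [ -n "$project" ]; then
        # $CRAB_BIN is left unquoted on purpose so "echo crab" splits into two words.
        $CRAB_BIN "$@" -c "$project"
    else
        $CRAB_BIN "$@"
    fi
}
```

With CRAB_BIN="echo crab", the call `crab_in_project MyProject -getoutput all` prints `crab -getoutput all -c MyProject` instead of running it, while `crab_in_project "" -status` prints `crab -status`.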

 

Job Aborted Retrieval

For jobs in status aborted, the output cannot be retrieved to the UI. The following command retrieves the error information for all jobs instead:

crab -postMortem all -c <dir name>

 

 

Final plot

Each job produces a histogram output file. When a task has more than one job, the files can be combined in the res directory using ROOT's hadd:

 

hadd dummy.root dummy_*.root 

 

 

Copying output back to your desktop 

To copy data from HDFS back to your local machine, use gfal-copy:

gfal-copy -vf gsiftp://cms-gridftp.rcac.purdue.edu/store/user/<username>/<userdir>/test.txt \
    file:///tmp/test.txt
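
If you copy many files, it can be convenient to build the source URL from its parts. The sketch below follows the URL pattern of the command above; gridftp_url is a hypothetical helper, and the username jdoe is a placeholder for illustration only.

```shell
# Build the gsiftp URL of a file under /store/user on the Purdue storage element.
GRIDFTP_HOST=cms-gridftp.rcac.purdue.edu   # host taken from the command above

# gridftp_url is a hypothetical helper.
# $1 = username, $2 = remote directory, $3 = file name
gridftp_url() {
    echo "gsiftp://$GRIDFTP_HOST/store/user/$1/$2/$3"
}

gridftp_url jdoe MyFirstTest test.txt   # jdoe is a placeholder username
```

The printed URL can then be passed to gfal-copy as the source argument, e.g. `gfal-copy -vf "$(gridftp_url jdoe MyFirstTest test.txt)" file:///tmp/test.txt`.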

Inspecting output file using xrootd

Local users can inspect a ROOT file over xrootd from a ROOT session by running the macro roottest.C:

.x roottest.C

roottest.C 
{
// Make the CMSSW headers visible to the interpreter
gInterpreter->AddIncludePath("/cvmfs/cms.cern.ch/slc5_amd64_gcc462/cms/cmssw/CMSSW_6_0_0/src/");
// Load FWLite so that CMS data formats can be read
gSystem->Load("libFWCoreFWLite");
AutoLibraryLoader::enable();
// Open the file remotely through the Purdue xrootd server
TFile *f = new TXNetFile("root://xrootd.rcac.purdue.edu//store/mc/Summer08/DiPion_E300_Eta5/GEN-SIM-RAW/IDEAL_V9_v1/0027/BCF0E000-B87F-DD11-8E24-001EC9AAA021.root","READ");
TTree* tree = (TTree*)f->Get("Events");
// Print the number of entries in the Events tree
cout<<" Events:"<<tree->GetEntries()<<endl;
f->Close();
}
