A PROOF interface to the AliEn file catalog

Dario Berzano

2013-01-16


Table of Contents

1. Overview
An interface to the AliEn file catalog
2. Configuration
PROOF
PROOF-Lite
3. Usage
Query string format
Staging requests

1. Overview

Datasets were introduced to give PROOF users cleaner access to sets of uniform data: each dataset has a name that helps identify the kind of data stored, plus some meta-information, such as:

  • default tree name

  • number of events in the default tree

  • file size

  • integrity information: is my file corrupted?

  • locality information: is my remote file available on a local storage?

Datasets are also used by the staging daemon afdsmgrd to trigger data staging, i.e. to request that some data be transferred from a remote storage to the local analysis facility disks.

PROOF datasets are handled by the dataset manager, a generic catalog of datasets that has historically been implemented by the class TDataSetManagerFile, which stores each dataset inside a ROOT file.

This dataset manager was conceived for a small number of datasets (hundreds) reflecting data stored on the local analysis facility disks. As the PROOF analysis model became popular in ALICE, the number of datasets grew, posing several problems:

  • To make it possible to process remote data, current datasets mimic file catalog functionality by also including lists of files not currently staged on the local analysis facility.

  • Since users can create their own datasets, in many cases containing duplicate data, it has become demanding to provide maintenance and support.

  • Locality information in datasets is static: this means that, if a file gets deleted from a disk, the corresponding dataset(s) must be synchronized manually.

An interface to the AliEn file catalog

The TDataSetManagerAliEn class is a new dataset manager that acts as an intermediate layer between PROOF datasets and the AliEn file catalog.

A dataset name no longer represents a static list of files: instead, it represents a query string to the AliEn file catalog that creates a dataset dynamically.

Locality information is also filled on the fly by contacting the local file server: for instance, if an xrootd pool of disks is used, fresh online information, along with the exact host (endpoint) where each file is located, is provided dynamically in a reasonable amount of time.

Both file catalog queries and locality information are cached in ROOT files: the cache is shared between users and its expiration time is configurable.

Since dataset information is now volatile, a separate and more straightforward method for issuing staging requests has also been provided.

2. Configuration

PROOF

Using the new dataset manager requires the xpd.datasetsrc directive in the xproofd configuration file:

xpd.datasetsrc alien cache:/path/to/dataset/cache urltemplate:http://myserver:1234/data<path> cacheexpiresecs:86400

alien

Tells PROOF that the dataset manager is the AliEn interface (as opposed to file).

cache

Specifies a path on the local filesystem of the host running the user's PROOF master.

[Important]Important

This path is not a URL but just a local path. Moreover, the path must be visible from the host that will run each user's master, since a separate dataset manager instance is created per user.

[Warning]Warning

If the cache directory does not exist, it is created, if possible, with open permissions (rwxrwxrwx). In a production environment it is advisable to create the cache directory manually beforehand with the same permissions.

urltemplate

Template used for translating between an alien:// URL and the local storage's URL.

<path> is written literally and will be substituted with the full AliEn path without the protocol.

[Tip]Example of a URL translation

With the template below, the full AliEn path (without the alien:// protocol) replaces <path>:

  • Template URL:

    root://alice-caf.cern.ch/<path>
  • Source URL:

    alien:///alice/data/2012/LHC12b/000178209/ESDs/pass1/12000178209061.17/AliESDs.root
  • Resulting URL:

    root://alice-caf.cern.ch//alice/data/2012/LHC12b/000178209/ESDs/pass1/12000178209061.17/AliESDs.root
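
The translation shown in this example can be sketched in a few lines of C++. This is only an illustration of the substitution rule, not the code used by the dataset manager: the function name TranslateAlienUrl is hypothetical, and the sketch assumes that only the first occurrence of <path> is replaced.

```cpp
#include <string>

// Hypothetical helper (not part of ROOT): translates an alien:// URL into a
// local storage URL by replacing the literal "<path>" placeholder.
std::string TranslateAlienUrl(const std::string &urlTemplate,
                              const std::string &alienUrl) {
  const std::string proto = "alien://";
  const std::string placeholder = "<path>";

  // Strip the protocol: "alien:///alice/..." becomes "/alice/..."
  std::string path = alienUrl;
  if (path.compare(0, proto.size(), proto) == 0) path.erase(0, proto.size());

  // Assumption: only the first "<path>" in the template is substituted
  std::string result = urlTemplate;
  const std::size_t pos = result.find(placeholder);
  if (pos != std::string::npos) result.replace(pos, placeholder.size(), path);
  return result;
}
```

Since the AliEn path keeps its leading slash, a template ending in /<path> yields a double slash after the host name, as in the resulting URL above.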

cacheexpiresecs

Number of seconds before cached information is considered expired and refetched (e.g., 86400 for one day).

PROOF-Lite

By default, PROOF-Lite creates a file-based dataset manager in the client session, which also acts as the master. To enable the AliEn dataset manager in a PROOF-Lite session, run:

gEnv->SetValue("Proof.DataSetManager",
  "alien cache:/path/to/dataset/cache "
  "urltemplate:root://alice-caf.cern.ch/<path> "
  "cacheexpiresecs:86400");
TProof::Open("");

[Important]Important

Please note that the environment must be set before opening the PROOF-Lite session!

Parameters are the same as described in the PROOF server configuration.

3. Usage

The new dataset manager is backwards-compatible with the legacy interface: each time you want to process or obtain a dataset, instead of specifying a string containing a dataset name you will specify a query string to the file catalog.

Query string format

The query string is the string you will use in place of the dataset name. It does not correspond to a static dataset: instead it represents a virtual dataset whose information is filled in on the fly.

There are two different formats you can use: the official data and Monte Carlo format, and the AliEn find format.

In the query string it is also possible to specify whether you want to process data from AliEn, only staged data, or data from AliEn in "cache mode".

Official data and Monte Carlo format

These are the string formats to be used respectively for official data and official Monte Carlo productions:

Data;Period=<LHCPERIOD>;Variant=[ESDs|AODXXX];Run=<RUNLIST>;Pass=<PASS>
Sim;Period=<LHCPERIOD>;Variant=[ESDs|AODXXX];Run=<RUNLIST>;

Period

The LHC period.

Example of valid values: LHC10h, LHC11h_2, LHC11f_Technical

Variant

Data variant: ESDs (or ESD) for ESDs, or AODXXX for the AODs corresponding to the XXX set.

Example of valid values: ESDs, AOD073, AOD086

Run

Runs to be processed, in the form of a single run (130831), an inclusive range (130831-130833), or a list of runs and/or ranges (130831-130835,130840,130842).

Duplicate runs are automatically removed: if you specify 130831-130835,130833, run number 130833 will be processed only once.
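
The run list syntax can be illustrated with a short, self-contained C++ sketch that expands single runs and inclusive ranges into a set of unique run numbers. The function ExpandRunList is hypothetical and only mirrors the rules described above; it is not the parser used by the dataset manager, and it makes no attempt to report malformed input gracefully.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: expands a run list such as "130831-130835,130840"
// into a sorted set of unique run numbers.
// Note: std::stoi throws if a token is not a valid number.
std::set<int> ExpandRunList(const std::string &runList) {
  std::set<int> runs;  // a set silently discards duplicate runs
  std::stringstream ss(runList);
  std::string token;
  while (std::getline(ss, token, ',')) {
    if (token.empty()) continue;
    const std::size_t dash = token.find('-');
    if (dash == std::string::npos) {
      runs.insert(std::stoi(token));  // single run
    } else {
      int first = std::stoi(token.substr(0, dash));
      int last = std::stoi(token.substr(dash + 1));
      if (first > last) std::swap(first, last);
      for (int r = first; r <= last; ++r) runs.insert(r);  // inclusive range
    }
  }
  return runs;
}
```

With this sketch, both 130831-130835 and 130831-130835,130833 expand to the same five runs.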

Pass (only for data, not for Monte Carlo)

The pass number or name. In case you specify only a number X, it will be expanded to passX.

Example of valid values: 1, pass1, pass2, cpass1_muon

[Tip]Example of a valid string for official data

Data;Period=LHC10h;Variant=AOD086;Run=130831-130833;Pass=pass1
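
When query strings are assembled inside an analysis macro, a small helper can reduce typing mistakes. The following is a minimal sketch with hypothetical function names (NormalizePass, MakeDataQuery). The dataset manager expands a bare pass number on its own, so NormalizePass only mirrors the rule described above for illustration.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: expands a bare pass number ("1") to "pass1" and
// leaves pass names ("pass2", "cpass1_muon") untouched.
std::string NormalizePass(const std::string &pass) {
  if (pass.empty()) return pass;
  for (unsigned char c : pass)
    if (!std::isdigit(c)) return pass;  // not purely numeric: keep as-is
  return "pass" + pass;
}

// Hypothetical helper: assembles a query string for official data.
std::string MakeDataQuery(const std::string &period,
                          const std::string &variant,
                          const std::string &runs,
                          const std::string &pass) {
  return "Data;Period=" + period + ";Variant=" + variant +
         ";Run=" + runs + ";Pass=" + NormalizePass(pass);
}
```

Calling MakeDataQuery("LHC10h", "AOD086", "130831-130833", "1") produces the example string shown above.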

AliEn find format

Whenever a user would like to process data which has not been produced officially, or whose directory structure in the AliEn file catalog is non-standard, an interface to the AliEn shell's find command is provided.

This is the command format:

Find;BasePath=<BASEPATH>;FileName=<FILENAME>;Anchor=<ANCHOR>;TreeName=<TREENAME>;Regexp=<REGEXP>

Parameters BasePath and FileName are passed as-is to the AliEn find command, and are mandatory.

Parameters Anchor, TreeName and Regexp are optional.

Here's a detailed description of the parameters.

BasePath

Start search under the specified path on the AliEn file catalog.

Wildcard characters are supported: the asterisk (*) and the percent sign (%) are interchangeable.

Examples of valid values are: /alice/data/2010/LHC10h/000123456/*.*, /alice/cern.ch/user/d/dummy/my_pp_production/%.%.

FileName

File name to look for.

Examples of valid values are: root_archive.zip, aod_archive.zip, custom_archive.zip, AliAOD.root.

Anchor [optional]

In case FileName is a zip archive, the anchor is the name of a ROOT file inside the archive to point to.

Examples of valid values are: AliAOD.root, AliESDs.root, MyRootFile.root.

[Warning]Warning

Using the AliEn file catalog it is possible to point directly to a ROOT file stored in an archive without using the anchor.

There is however a substantial difference in how data is retrieved, especially during staging: auxiliary ROOT files (friends) are stored inside the archive along with the "main" file, so when you use the archive as FileName with the proper Anchor you still reference the same file, but you instruct the system to download the whole archive.

Using the ROOT file name directly should be done only in very special cases (e.g., to save space), and only when you are completely sure that no other files in the archive are required for the analysis.

TreeName [optional]

Name of each file's default tree.

Examples of valid values are: /aodTree, /esdTree, /myCustomTree, /TheDirectory/TheTree.

Regexp [optional]

Additional extended regular expression applied after the find command is run, to refine the search results.

Only alien:// paths matching the regular expression are considered, others are discarded.

Examples of valid values are: /[0-9]{6}/[0-9]{3,4}, \.root$.

[Note]Note

ROOT class TPMERegexp is used to perform regular expression matching.
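
The filtering step can be illustrated outside ROOT with the standard library. In the sketch below std::regex (ECMAScript syntax) stands in for TPMERegexp, so behavior may differ for patterns that rely on PCRE-specific features. The function name FilterPaths is hypothetical, and the sketch assumes that a path is kept when the pattern matches anywhere in it.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: keeps only the paths in which the pattern matches
// somewhere; all other paths are discarded.
// Assumption: std::regex replaces ROOT's TPMERegexp for this illustration.
std::vector<std::string> FilterPaths(const std::vector<std::string> &paths,
                                     const std::string &pattern) {
  const std::regex re(pattern);
  std::vector<std::string> kept;
  for (const std::string &p : paths)
    if (std::regex_search(p, re)) kept.push_back(p);
  return kept;
}
```

Note that inside a C++ string literal the backslash must be doubled, so the pattern \.root$ is written as "\\.root$".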

[Tip]Example of a valid string for an arbitrary find command

Find;BasePath=/alice/data/2010/LHC10h/000139505/ESDs/pass1/*.*;FileName=root_archive.zip;Anchor=AliESDs.root
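
A find query with optional parameters can also be assembled programmatically. The following is a minimal sketch: MakeFindQuery is a hypothetical helper, not part of ROOT, and it simply omits the optional parameters that are left empty.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: assembles a "Find" query string. BasePath and
// FileName are mandatory; Anchor, TreeName and Regexp are added only
// when they are non-empty.
std::string MakeFindQuery(const std::string &basePath,
                          const std::string &fileName,
                          const std::string &anchor = "",
                          const std::string &treeName = "",
                          const std::string &regexp = "") {
  if (basePath.empty() || fileName.empty())
    throw std::invalid_argument("BasePath and FileName are mandatory");
  std::string query = "Find;BasePath=" + basePath + ";FileName=" + fileName;
  if (!anchor.empty()) query += ";Anchor=" + anchor;
  if (!treeName.empty()) query += ";TreeName=" + treeName;
  if (!regexp.empty()) query += ";Regexp=" + regexp;
  return query;
}
```

Calling it with the base path, file name and anchor from the example above reproduces the same query string.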

Data access modes

The Mode specifier, which affects the way URLs are generated, can be appended to the query string.

Mode=[local|remote|cache]

This parameter is optional and defaults to local. Description of each possible value follows:

local

Local storage is checked for the presence of data you requested. Output URLs will be relative to your local storage. Also, locality information (i.e., is your file staged?) is filled.

If you run a PROOF analysis on a dataset with this mode specified, only data marked as "staged" will be processed.

This method is the preferred one, since it does not overload the remote storage, and it enables users to process partially-staged datasets, or partially-reconstructed runs, without the need to manually update static datasets.

[Important]Important

This is the default if no mode is specified, and it is also the most efficient one.

Although it might take some time (up to a couple of minutes to locate ~4000 files), the returned information is always reliable (because it is dynamic) and speeds up the analysis (because it will only run on files that have local copies).

Moreover this information is cached for a configurable period of time, so that subsequent calls to the same dataset will be faster.

remote

Only AliEn URLs are returned.

A PROOF analysis run on a dataset with this mode specified will always obtain data from a remote storage, according to the AliEn file catalog.

[Warning]Warning

Tasks run on remote data are usually much slower than using local storage.

cache

URLs pointing to local copies of files are returned, but real file presence is not checked.

If local storage is configured to retrieve from AliEn the files that are not available locally (which is the case for xrootd with vMSS), then data will be downloaded while the analysis is running.

It is called cache mode because it treats the local storage as a cache for the remote storage.

[Warning]Warning

This mode is usually very slow on a busy analysis facility since retrieving data in real time without any kind of scheduling is inefficient. It also conflicts with the preferred method, which is to stage data asynchronously using the stager daemon.
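
As an illustration of how the Mode specifier is attached to a query string, the sketch below appends it and rejects values other than the three documented ones. The helper WithMode is hypothetical, and it assumes that Mode is separated from the other parameters by a semicolon, like every other field of the query string.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: appends the Mode specifier to a query string.
// Assumption: fields are separated by ';', as in the formats shown earlier.
std::string WithMode(const std::string &query, const std::string &mode) {
  if (mode != "local" && mode != "remote" && mode != "cache")
    throw std::invalid_argument("unknown mode: " + mode);
  std::string out = query;
  // Avoid a double separator if the query already ends with ';'
  if (!out.empty() && out.back() != ';') out += ';';
  return out + "Mode=" + mode;
}
```

For instance, WithMode("Data;Period=LHC10h;Variant=AOD086;Run=130831-130833;Pass=pass1", "remote") yields the same query with ;Mode=remote appended.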

Staging requests

Issuing staging requests and keeping track of them requires an auxiliary database that can be read and updated by the data stager daemon.

Whenever a staging request is issued, a ROOT file containing the dataset is saved in a special directory on the master's filesystem, monitored by the file stager.

PROOF configuration

In the xproofd configuration file, there is a directive to specify the directory used as repository for staging requests:

xpd.stagereqrepo /path/to/local/directory

Permissions on this directory must be kept open.

[Warning]Warning

Versions of the stager daemon prior to v1.0.7 do not support open permissions and the staging repository directive.

Request and monitor staging

Staging requests and monitoring can be done from within a PROOF session.

gProof->RequestStagingDataSet("QueryString")

Requests staging of the dataset specified via the query string.

The staging request is honored if the stager daemon is running.

[Tip]Tip

In order to avoid requesting to stage undesired data, it is advisable to check in advance the results of your query string:

gProof->ShowDataSet("QueryString")

gProof->ShowStagingStatusDataSet("QueryString"[, "opts"])

Shows progress status of a previously given staging request with data specified by the query string.

The options argument is optional and is passed as-is to the TFileCollection::Print() method.

[Tip]Tip

It is possible to show all the files marked as corrupted by the daemon:

gProof->ShowStagingStatusDataSet("QueryString", "C")

Or all the files successfully staged and not corrupted:

gProof->ShowStagingStatusDataSet("QueryString", "Sc")

gProof->GetStagingStatusDataSet("QueryString")

Gets a TFileCollection containing information on the staging request specified by the query string.

Works exactly like ShowStagingStatusDataSet() but returns an object instead of displaying information on the screen.