Datasets were introduced to give PROOF users cleaner access to sets of uniform data: each dataset has a name that helps identify the kind of data stored, plus some meta-information, such as:
default tree name
number of events in the default tree
file size
integrity information: is my file corrupted?
locality information: is my remote file available on a local storage?
Datasets are also used by the staging daemon afdsmgrd to trigger data staging, i.e. to request that some data be transferred from a remote storage to the local analysis facility disks.
PROOF datasets are handled by the dataset manager, a generic catalog of datasets which has historically been implemented by the class TDataSetManagerFile, which stored each dataset inside a ROOT file.
This dataset manager was conceived for a small number of datasets (hundreds) reflecting data stored on the local analysis facility disks. As the PROOF analysis model became popular in ALICE, the number of datasets grew, posing many problems.
To make it possible to process remote data, current datasets mimic file catalog functionality by also including lists of files not currently staged on the local analysis facility.
Since users can create their own datasets, in many cases containing duplicate data, it has become demanding to provide maintenance and support.
Locality information in datasets is static: this means that, if a file gets deleted from a disk, the corresponding dataset(s) must be synchronized manually.
The TDataSetManagerAliEn class is a new dataset manager which acts as an intermediate layer between PROOF datasets and the AliEn file catalog.
A dataset name no longer represents a static list of files: instead, it is a query string to the AliEn file catalog that creates a dataset dynamically.
Locality information is also filled in on the fly by contacting the local file server: for instance, if an xrootd pool of disks is used, fresh online information, along with the exact host (endpoint) where each file is located, is provided dynamically in a reasonable amount of time.
Both file catalog queries and locality information are cached in ROOT files: the cache is shared between users and its expiration time is configurable.
Since dataset information is now volatile, a separate and more straightforward method for issuing staging requests has also been provided.
Using the new dataset manager requires the xpd.datasetsrc directive in the xproofd configuration file:
xpd.datasetsrc alien cache:/path/to/dataset/cache urltemplate:http://myserver:1234/data<path> cacheexpiresecs:86400
alien: tells PROOF that the dataset manager is the AliEn interface (as opposed to file).
cache: specifies a path on the local filesystem of the host running the user's PROOF master.
![]() | Important |
---|---|
This path is not a URL but just a local path. Moreover, the path must be visible from the host that will run each user's master, since a separate dataset manager instance is created per user. |
![]() | Warning |
---|---|
If the cache directory does not exist, it is created, if possible, with open permissions ( |
urltemplate: template used for translating between an alien:// URL and the local storage's URL. The <path> token is written literally and will be substituted with the full AliEn path without the protocol.
![]() | Example of a URL translation |
---|---|
This is normal text which can also, if desired, be spread over several lines. Now, since I no longer know what to say, I will stop writing and leave a line half finished.
|
cacheexpiresecs: number of seconds before cached information is considered expired and refetched (e.g., 86400 for one day).
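The urltemplate substitution described above can be illustrated with a minimal sketch. The function name `TranslateAlienUrl` is hypothetical and not part of ROOT; the sketch also assumes that AliEn URLs are written as `alien:///full/path`, so that removing the protocol leaves a path starting with a slash.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not a ROOT API): turns an alien:// URL into a
// local-storage URL by replacing the literal "<path>" token of the template
// with the full AliEn path stripped of its protocol.
std::string TranslateAlienUrl(const std::string &alienUrl,
                              const std::string &urlTemplate) {
  const std::string proto = "alien://";
  const std::string token = "<path>";

  // Assumption: the input always starts with the alien:// protocol.
  if (alienUrl.compare(0, proto.size(), proto) != 0)
    throw std::invalid_argument("not an alien:// URL: " + alienUrl);

  const std::string::size_type pos = urlTemplate.find(token);
  if (pos == std::string::npos)
    throw std::invalid_argument("template has no <path> token");

  // Full AliEn path without the protocol, e.g. "/alice/data/...".
  const std::string path = alienUrl.substr(proto.size());

  std::string out = urlTemplate;
  out.replace(pos, token.size(), path);
  return out;
}
```

Under these assumptions, the template `http://myserver:1234/data<path>` applied to `alien:///alice/sim/file.root` would yield `http://myserver:1234/data/alice/sim/file.root`.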
By default, PROOF-Lite creates a file-based dataset manager in the client session (which also acts as a master). To enable the AliEn dataset manager in a PROOF-Lite session, run:
gEnv->SetValue("Proof.DataSetManager",
               "alien cache:/path/to/dataset/cache "
               "urltemplate:root://alice-caf.cern.ch/<path> "
               "cacheexpiresecs:86400");
TProof::Open("");
![]() | Important |
---|---|
Please note that the environment must be set before opening the PROOF-Lite session! |
Parameters are the same as described in the PROOF server configuration.
The new dataset manager is backwards-compatible with the legacy interface: each time you want to process or obtain a dataset, instead of specifying a string containing a dataset name you will specify a query string to the file catalog.
The query string is the string you will use in place of the dataset name. It does not correspond to a static dataset: instead it represents a virtual dataset whose information is filled in on the fly.
There are two different formats you can use:
specify data features (such as period and run numbers) for official data or Monte Carlo
specify the AliEn find command parameters
In the query string it is also possible to specify if you want to process data from AliEn, only staged data or data from AliEn in "cache mode".
These are the string formats to be used respectively for official data and official Monte Carlo productions:
Data;Period=<LHCPERIOD>;Variant=[ESDs|AODXXX];Run=<RUNLIST>;Pass=<PASS>
Sim;Period=<LHCPERIOD>;Variant=[ESDs|AODXXX];Run=<RUNLIST>;
Period: the LHC period.
Examples of valid values: LHC10h, LHC11h_2, LHC11f_Technical
Variant: the data variant, which can be ESDs (or ESD) for ESDs and AODXXX for AODs corresponding to the XXX set.
Examples of valid values: ESDs, AOD073, AOD086
Run: the runs to be processed, in the form of a single run (130831), an inclusive range (130831-130833), or a list of runs and/or ranges (130831-130835,130840,130842).
Duplicate runs are automatically removed, so if you specify 130831-130835,130833, run number 130833 will be processed only once.
Pass: the pass number or name. If you specify only a number X, it will be expanded to passX.
Examples of valid values: 1, pass1, pass2, cpass1_muon
![]() | Example of a valid string for official data |
---|---|
Data;Period=LHC10h;Variant=AOD086;Run=130831-130833;Pass=pass1 |
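The rules for the run list and the pass name can be illustrated with a short sketch. The helpers `ExpandRunList` and `NormalizePass` are hypothetical names introduced here for illustration only; they are not ROOT functions and do not claim to reproduce how the dataset manager parses the string internally.

```cpp
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: expands a run list such as "130831-130835,130840"
// into individual run numbers. A std::set is used so that duplicate runs
// are kept only once.
std::set<int> ExpandRunList(const std::string &runList) {
  std::set<int> runs;
  std::stringstream ss(runList);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (item.empty()) continue;
    const std::string::size_type dash = item.find('-');
    if (dash == std::string::npos) {
      runs.insert(std::stoi(item));  // single run
    } else {
      const int first = std::stoi(item.substr(0, dash));
      const int last = std::stoi(item.substr(dash + 1));
      if (first > last) throw std::invalid_argument("bad range: " + item);
      for (int r = first; r <= last; ++r) runs.insert(r);  // inclusive range
    }
  }
  return runs;
}

// Hypothetical helper: a bare number X becomes "passX"; anything else is
// returned unchanged.
std::string NormalizePass(const std::string &pass) {
  const bool numeric =
      !pass.empty() && pass.find_first_not_of("0123456789") == std::string::npos;
  return numeric ? "pass" + pass : pass;
}
```

With these helpers, `ExpandRunList("130831-130835,130833")` contains five runs, with 130833 present once, and `NormalizePass("1")` gives `pass1`.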
Whenever a user would like to process data which has not been produced officially, or whose directory structure in the AliEn file catalog is non-standard, an interface to the AliEn shell's find command is provided.
This is the command format:
Find;BasePath=<BASEPATH>;FileName=<FILENAME>;Anchor=<ANCHOR>;TreeName=<TREENAME>;Regexp=<REGEXP>
Parameters BasePath and FileName are passed as-is to the AliEn find command, and are mandatory.
Parameters Anchor, TreeName and Regexp are optional.
Here's a detailed description of the parameters.
BasePath: the search starts under the specified path on the AliEn file catalog.
Wildcard characters are supported: the asterisk (*) and the percent sign (%) are interchangeable.
Examples of valid values are: /alice/data/2010/LHC10h/000123456/*.*, /alice/cern.ch/user/d/dummy/my_pp_production/%.%.
FileName: the file name to look for.
Examples of valid values are: root_archive.zip, aod_archive.zip, custom_archive.zip, AliAOD.root.
Anchor: if FileName is a zip archive, the anchor is the name of a ROOT file inside the archive to point to.
Examples of valid values are: AliAOD.root, AliESDs.root, MyRootFile.root.
![]() | Warning |
---|---|
Using the AliEn file catalog, it is possible to point directly to a ROOT file stored in an archive without using the anchor. There is, however, a substantial difference in how data is retrieved, especially during staging: auxiliary ROOT files (friends) are stored inside the archive along with the "main" file, so that when you use the archive as FileName with an anchor, they are retrieved together with it. Use the ROOT file name directly only in very special cases (e.g., to save space) and only when you are completely sure that no other files in the archive are required for analysis. |
TreeName: the name of each file's default tree.
Examples of valid values are: /aodTree, /esdTree, /myCustomTree, /TheDirectory/TheTree.
Regexp: an additional extended regular expression applied after the find command is run, to refine the search results.
Only alien:// paths matching the regular expression are considered; the others are discarded.
Examples of valid values are: /[0-9]{6}/[0-9]{3,4}, \.root$.
![]() | Note |
---|---|
ROOT class TPMERegexp is used to perform regular expression matching. |
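The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions: it uses `std::regex` from the C++ standard library as a stand-in for TPMERegexp (whose PCRE syntax is richer), it assumes unanchored "search" semantics, and `FilterByRegexp` is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: keeps only the URLs in which the pattern is found
// somewhere; all other URLs are discarded.
// Assumption: std::regex (ECMAScript grammar) stands in for TPMERegexp.
std::vector<std::string> FilterByRegexp(const std::vector<std::string> &urls,
                                        const std::string &pattern) {
  const std::regex re(pattern, std::regex::ECMAScript);
  std::vector<std::string> kept;
  for (const std::string &url : urls) {
    if (std::regex_search(url, re)) kept.push_back(url);
  }
  return kept;
}
```

For instance, the pattern `\.root$` (written `"\\.root$"` in a C++ string literal) would keep a URL ending in `AliESDs.root` and discard one ending in `root_archive.zip`.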
![]() | Example of a valid string for an arbitrary find command |
---|---|
Find;BasePath=/alice/data/2010/LHC10h/000139505/ESDs/pass1/*.*;FileName=root_archive.zip;Anchor=AliESDs.root
|
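As a rough illustration of how such a string is composed, the sketch below builds a Find query from its parameters and leaves out the optional ones when they are empty. The `FindQuery` structure and the `BuildFindQuery` function are hypothetical and not part of ROOT.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical container for the parameters of a Find query string.
struct FindQuery {
  std::string basePath;  // mandatory
  std::string fileName;  // mandatory
  std::string anchor;    // optional
  std::string treeName;  // optional
  std::string regexp;    // optional
};

// Hypothetical helper: composes the query string, appending each optional
// parameter only when it has been set.
std::string BuildFindQuery(const FindQuery &q) {
  if (q.basePath.empty() || q.fileName.empty())
    throw std::invalid_argument("BasePath and FileName are mandatory");

  std::string s = "Find;BasePath=" + q.basePath + ";FileName=" + q.fileName;
  if (!q.anchor.empty()) s += ";Anchor=" + q.anchor;
  if (!q.treeName.empty()) s += ";TreeName=" + q.treeName;
  if (!q.regexp.empty()) s += ";Regexp=" + q.regexp;
  return s;
}
```

Filling in only `basePath`, `fileName` and `anchor` with the values of the example above reproduces that example string; the result is what would be passed wherever a dataset name is expected.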
It is possible to append to the format string the Mode specifier that affects the way URLs are generated.
Mode=[local|remote|cache]
This parameter is optional and defaults to local. A description of each possible value follows:
local: the local storage is checked for the presence of the data you requested. Output URLs will be relative to your local storage. Locality information (i.e., is your file staged?) is also filled in.
If you run a PROOF analysis on a dataset with this mode specified, only data marked as "staged" will be processed.
This method is the preferred one, since it does not overload the remote storage, and it enables users to process partially-staged datasets, or partially-reconstructed runs, without the need to manually update static datasets.
![]() | Important |
---|---|
This is the default if no mode is specified, and it is also the most efficient one. Although it might take some time (up to a couple of minutes to locate ~4000 files), the returned information is always reliable (because it is dynamic) and speeds up analysis (because analysis always runs only on files having local copies). Moreover, this information is cached for a configurable period of time, so that subsequent calls to the same dataset will be faster. |
remote: only AliEn URLs are returned.
A PROOF analysis run on a dataset with this mode specified will always obtain data from a remote storage, according to the AliEn file catalog.
![]() | Warning |
---|---|
Tasks run on remote data are usually much slower than using local storage. |
cache: URLs pointing to local copies of files are returned, but the actual presence of the files is not checked.
If the local storage is configured to retrieve from AliEn the files that are not available locally (as is the case for xrootd with vMSS), data will be downloaded while the analysis is running.
It is called cache mode because it treats the local storage as a cache for the remote storage.
![]() | Warning |
---|---|
This mode is usually very slow on a busy analysis facility since retrieving data in real time without any kind of scheduling is inefficient. It also conflicts with the preferred method, which is to stage data asynchronously using the stager daemon. |
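The difference between the three modes can be summarized in a small sketch. It is only a model of the behavior described above, not the dataset manager's code: the `Mode` enumeration, the `FileEntry` structure and the `ResolveEntry` function are hypothetical, and the local URL is assumed to have already been obtained from the URL template.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical model of the three Mode values.
enum class Mode { Local, Remote, Cache };

// Hypothetical result: which URL is handed to the analysis, and whether the
// file would be processed at all.
struct FileEntry {
  std::string url;
  bool process;
};

// alienUrl:      URL according to the AliEn file catalog
// localUrl:      URL on the local storage (assumed already translated)
// stagedLocally: result of the locality check, used by local mode only
FileEntry ResolveEntry(Mode mode, const std::string &alienUrl,
                       const std::string &localUrl, bool stagedLocally) {
  switch (mode) {
    case Mode::Local:   // local URL; only staged files are processed
      return {localUrl, stagedLocally};
    case Mode::Remote:  // always read from the remote storage
      return {alienUrl, true};
    case Mode::Cache:   // local URL, real file presence is not checked
      return {localUrl, true};
  }
  throw std::logic_error("unknown mode");
}
```

In this model a file that is not staged is skipped in local mode, read through its AliEn URL in remote mode, and requested from the local storage anyway in cache mode, which is where an on-demand download would be triggered.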
Issuing staging requests and keeping track of them requires an auxiliary database that can be read and updated by the data stager daemon.
Whenever a staging request is issued, a ROOT file containing the dataset is saved in a special directory on the master's filesystem, monitored by the file stager.
In the xproofd configuration file, there is a directive to specify the directory used as repository for staging requests:
xpd.stagereqrepo /path/to/local/directory
Permissions on this directory must be kept open.
![]() | Warning |
---|---|
Versions of the stager daemon prior to v1.0.7 do not support open permissions and the staging repository directive. |
Staging requests and monitoring can be done from within a PROOF session.
gProof->RequestStagingDataSet("QueryString")
Requests staging of the dataset specified via the query string.
The staging request is honored if the stager daemon is running.
![]() | Tip |
---|---|
In order to avoid requesting to stage undesired data, it is advisable to check in advance the results of your query string:
|
gProof->ShowStagingStatusDataSet("QueryString"[, "opts"])
Shows the progress status of a previously issued staging request for the data specified by the query string.
Options are optional, and are passed as-is to the TFileCollection::Print() method.
![]() | Tip |
---|---|
It is possible to show all the files marked as corrupted by the daemon: gProof->ShowStagingStatusDataSet("QueryString", "C") Or all the files successfully staged and not corrupted: gProof->ShowStagingStatusDataSet("QueryString", "Sc") |
gProof->GetStagingStatusDataSet("QueryString")
Gets a TFileCollection containing information on the staging request specified by the query string.
Works exactly like ShowStagingStatusDataSet() but returns an object instead of displaying information on the screen.