Content:
- Introduction
- Addressed use-cases
- Client-side options
- Server-side options
- Summary of parameters and environment variables
1. Introduction
By default, the output of a PROOF query is kept in memory and made available via the output list. For large outputs this is known to be problematic. The solution PROOF proposes is to use files to swap objects out of memory; the files can either be merged at the end or accessed through a global common view via a dataset. This technology is a good solution to the problem; however, it turned out to be difficult for the average user to set up.
To simplify access to this technology in particular, and output handling more generally, a new set of options has been added to TProof::Process. These new options are the subject of these pages. They are available in the trunk (PROOF-Lite support starting from r45632) and in the 5.32 and 5.34 patch branches, starting from tags 5.34/02 and 5.32/05.
One of the most frequent PROOF user questions is how to save the results of a run to a file. This is not strictly connected to the technology used to handle the output, but rather to the fact that the Terminate() method is often used only to save the results to a file. A quick way to define the file where the results are saved, without having to re-implement the same code in each TSelector, would therefore certainly be welcomed by many users. Another missing functionality is the possibility to save partial results while processing, so that in case of a crash not everything is lost and the results can be partially recovered.
The options described in this page make it possible to control the following cases via the option field of TProof::Process:
- Define an output file where to save all the objects which are not already saved in other output files;
- Decide if the merging process to create the output file happens in memory or via file;
- Decide if partial results have to be saved after each packet;
- Give the cluster administrator the possibility to control file saving by setting a memory threshold above which object swapping to file is done regardless of the user's settings;
- As an alternative to file merging, give the possibility to create a dataset with the files created on the nodes; the user can then decide what to do with the dataset.
3.1. Enable save-to-file technology
The keywords 'stf' or 'savetofile' can be used in the option field to force merging via file. By default the final file is kept in the user data directory on the master. For example, this is what the 'ProofSimple' tutorial looks like when this option is passed:
root [1] p->SetParameter("ProofSimple_NHist", (Long_t) 16) // NB: ProofSimple_NHist is needed by the ProofSimple tutorial, not by the save-to-file functionality
root [2] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "stf")
Mst-0: merging output objects ... done
Output file: rootd://proofadm@cernvm24.cern.ch:1093//home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1343918756-15791/output-cernvm24-1343918756-15791.q1.root
Mst-0: grand total: sent 8 objects, size: 1405 bytes
ntuple opts: 0 0
(Long64_t)0
Internally, PROOF creates a TProofOutputFile object and adds it to the output list:
root [3] p->GetOutputList()->Print()
Collection name='TList', class='TList', size=1
Info in <:print>: -------------- output-cernvm24-1343918756-15791.q1.root : start (cernvm28.cern.ch:1093) ------------
Info in <:print>: dir: rootd://proofadm@cernvm24.cern.ch:1093//home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1343918756-15791/
Info in <:print>: raw dir: /home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1343918756-15791/
Info in <:print>: file name: output-cernvm24-1343918756-15791.q1.root
Info in <:print>: run type: create a merged file
Info in <:print>: merging option: keep remote
Info in <:print>: output file name: rootd://proofadm@cernvm24.cern.ch:1093//home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1343918756-15791/output-cernvm24-1343918756-15791.q1.root
Info in <:print>: ordinal: 0
Info in <:print>: -------------- output-cernvm24-1343918756-15791.q1.root : done -------------
The method TProof::GetOutput() can be used to transparently access the output objects: if the requested object is not found in the output list and the output list contains TProofOutputFile objects, these files are opened and searched for the object. This is what ProofSimple::Terminate does to keep the tutorial behaviour unchanged.
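The lookup order described above can be illustrated with a minimal sketch in plain C++. It does not use the ROOT API: `OutputList`, `ObjectMap` and `getOutput` are hypothetical stand-ins in which a "file" is modelled as a simple name-to-payload map, so the code only shows the fallback logic, not how PROOF actually opens files.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the contents of a file or of the in-memory list:
// object name -> payload.
using ObjectMap = std::map<std::string, std::string>;

// Hypothetical model of an output list: objects merged in memory, plus the
// contents of the output files referenced from the list.
struct OutputList {
    ObjectMap inMemory;
    std::vector<ObjectMap> outputFiles;
};

// Search the in-memory objects first; only if the name is not found there,
// search the referenced files. Returns nullptr when nothing matches.
const std::string* getOutput(const OutputList& out, const std::string& name) {
    ObjectMap::const_iterator it = out.inMemory.find(name);
    if (it != out.inMemory.end()) return &it->second;
    for (const ObjectMap& file : out.outputFiles) {
        it = file.find(name);
        if (it != file.end()) return &it->second;
    }
    return nullptr;
}
```

With this ordering, code that asks for an object by name behaves the same whether the object was merged in memory or saved to a file.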
3.2. Saving merged objects to an output file
It is possible to specify a file in which to save the output objects with the keywords 'of' or 'outfile'; for example:
root [1] p->SetParameter("ProofSimple_NHist", (Long_t) 16) // NB: ProofSimple_NHist is needed by the ProofSimple tutorial, not by the save-to-file functionality
root [2] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "of=test.root")
Mst-0: merging output objects ... done
Mst-0: grand total: sent 23 objects, size: 15506 bytes
ntuple opts: 0 0
Output saved to test.root
(Long64_t)0
The specified file path is interpreted from the client machine and can also be a full URL. If the path is prefixed with 'master:', it is interpreted from the master machine:
root [3] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "of=master:test.root")
Info in : unmodified script has already been compiled and loaded
Mst-0: merging output objects ... done
Output file: rootd://proofadm@cernvm24.cern.ch:1093//home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1344002021-23918/test.root
Mst-0: grand total: sent 8 objects, size: 1005 bytes
ntuple opts: 0 0
(Long64_t)0
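The rule for deciding where the path is interpreted can be sketched as follows. This is an illustration in plain C++ and not the PROOF implementation: `parseOutFile`, `OutFileTarget` and `Side` are hypothetical names, and the sketch assumes that only an exact 'master' or a 'master:' prefix selects the master side.

```cpp
#include <cassert>
#include <string>

// Where the path given to 'of=' / 'outfile=' is interpreted.
enum class Side { Client, Master };

struct OutFileTarget {
    Side side;
    std::string path;  // empty for a bare 'master'
};

// Hypothetical helper: split the value following 'of=' into side and path.
OutFileTarget parseOutFile(const std::string& value) {
    const std::string prefix = "master:";
    // 'master:<path>' -> path is interpreted on the master.
    if (value.compare(0, prefix.size(), prefix) == 0)
        return {Side::Master, value.substr(prefix.size())};
    // Bare 'master' -> master data directory, no explicit file name.
    if (value == "master")
        return {Side::Master, ""};
    // Anything else (relative path, absolute path, full URL) -> client side.
    return {Side::Client, value};
}
```

For instance, `parseOutFile("master:test.root")` yields the master side with path `test.root`, while `parseOutFile("test.root")` stays on the client side.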
By default files are created in the user data directory on the master; a TProofOutputFile is always sent back to the user with the location of the file and the URL to open it remotely. If the option 'stf' is specified (or if the server-side settings enforce it), merging goes via file and the location of the intermediate file is reported via TProofOutputFile:
root [5] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "of=test.root;stf")
Info in : unmodified script has already been compiled and loaded
Mst-0: merging output objects ... done
Output file: rootd://proofadm@cernvm24.cern.ch:1093//home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1344002021-23918/test.root
Mst-0: grand total: sent 8 objects, size: 1281 bytes
ntuple opts: 0 0
[TFile::Cp] Total 0.02 MB |====================| 100.00 % [1.9 MB/s]
Output successfully copied to test.root
(Long64_t)0
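The last example combines two keywords in one option string. A minimal sketch of how such a string can be split is given here in plain C++; `splitOptions` and `requestsSaveToFile` are hypothetical helpers, and the sketch assumes ';' is the only separator, which is the one used in the example above.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split an option string such as "of=test.root;stf" into its keywords.
// Assumption: ';' is the only separator; empty tokens are skipped.
std::vector<std::string> splitOptions(const std::string& opt) {
    std::vector<std::string> tokens;
    std::istringstream in(opt);
    std::string token;
    while (std::getline(in, token, ';')) {
        if (!token.empty()) tokens.push_back(token);
    }
    return tokens;
}

// True if the 'stf' / 'savetofile' keyword is present, with or without '=opt'.
bool requestsSaveToFile(const std::string& opt) {
    for (const std::string& t : splitOptions(opt)) {
        // Keep only the part before '=' (the whole token if there is no '=').
        const std::string key = t.substr(0, t.find('='));
        if (key == "stf" || key == "savetofile") return true;
    }
    return false;
}
```

Comparing the keyword rather than searching for the substring avoids false positives for file names that happen to contain 'stf'.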
3.3. Creating a dataset
To create a dataset use the keyword 'ds':
root [6] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "ds=testds")
Info in : unmodified script has already been compiled and loaded
Registering dataset 'testds' ... OK (1 workers still sending)
Mst-0: merging output objects ... done
Mst-0: grand total: sent 8 objects, size: 26539 bytes
ntuple opts: 0 0
(Long64_t)0
The 'savetofile' option is enforced in this case. By default the dataset is only registered; to force verification add '|V':
root [7] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "ds|V")
Info in : unmodified script has already been compiled and loaded
Registering dataset 'dataset_cernvm24-1344002021-23918_q6' ... OK
Mst-0: merging output objects ... done
Mst-0: grand total: sent 8 objects, size: 26569 bytes
ntuple opts: 0 0
Collection name='TList', class='TList', size=12
Collection name='FeedbackList', class='TList', size=0
TParameter ProofSimple_NHist = 16
OBJ: TNamed PROOF_DefaultOutputOption ds:dataset_
TParameter PROOF_SavePartialResults = 1
OBJ: TNamed PROOF_QueryTag session-cernvm24-1344002021-23918:q6
OBJ: TNamed PROOF_FilesToProcess dataset:dataset_cernvm24-1344002021-23918_q6
OBJ: TNamed PROOF_Packetizer TPacketizerFile
OBJ: TNamed PROOF_VerifyDataSet dataset_cernvm24-1344002021-23918_q6
OBJ: TNamed PROOF_VerifyDataSetOption
TParameter PROOF_IncludeFileInfoInPacket = 1
OBJ: TNamed PROOF_MSS
OBJ: TNamed PROOF_StageOption
Registering dataset 'dataset_cernvm24-1344002021-23918_q6' ... OK
Mst-0: merging output objects ... done
Mst-0: grand total: sent 102 objects, size: 29567 bytes
Info in <:verifydataset>: dataset_cernvm24-1344002021-23918_q6: changed? 1 (# files opened = 24, # files touched = 0, # missing files = 0)
(Long64_t)0
In this example we see that it is not necessary to specify a name for the dataset (here 'dataset_cernvm24-1344002021-23918_q6' was generated): if missing, the name will be in the form 'dataset_
3.4. Summary of option keywords
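The handling of the dataset keyword shown in this subsection (optional name, optional '|V' flag) can be sketched in plain C++ as follows. `DataSetRequest` and `parseDataSetOption` are hypothetical names used only for illustration; generating the default name is left to PROOF and is not modelled.

```cpp
#include <cassert>
#include <string>

struct DataSetRequest {
    std::string name;  // empty: a default name is to be generated
    bool verify;       // true if '|V' was given
};

// Hypothetical helper: parse one token such as "ds=testds", "ds|V" or
// "dataset=testds|V".
DataSetRequest parseDataSetOption(std::string token) {
    DataSetRequest req{"", false};
    // The '|V' sequence requests verification and is removed from the name.
    const std::string flag = "|V";
    const std::string::size_type pos = token.find(flag);
    if (pos != std::string::npos) {
        req.verify = true;
        token.erase(pos, flag.size());
    }
    // Whatever follows '=' is the requested dataset name.
    const std::string::size_type eq = token.find('=');
    if (eq != std::string::npos) req.name = token.substr(eq + 1);
    return req;
}
```

Under these assumptions, `"ds=testds"` gives the name `testds` without verification, and `"ds|V"` gives an empty name with verification requested.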
The client-side keywords described in this section are summarized in Table 1.
Long name | Short | Description | Subsection |
---|---|---|---|
savetofile[=opt] | stf[=opt] | Control saving of partial results to file; the optional opt field is in the form o1*10 + o0, with o0 = 0: save if required by admin; o0 = 1: force saving; o1 = 1: save after each packet; o1 = 0: save at query end. Default is opt = 1 when the keyword is specified (0 if not specified). | 3.1 |
outfile=fileout | of=fileout | Enables saving the query output to file fileout. The path (which could be a full URL) is interpreted from the client session unless it starts with 'master', in which case it is created from the master session. Using 'of=master' saves the results in the master data directory with name If not specified, this option is internally set to 'master' in the case the administrator forces saving to file. | 3.2 |
dataset[=name] | ds[=name] | Enables creation of a dataset out of the saved files. The list of files is also returned in the output list in the form of a TFileCollection. The dataset is registered under name if registration is allowed by the administrator. The dataset is also verified if the name field contains '|V'; the sequence '|V' is in such a case removed from the final dataset name. The name is set to 'dataset_ if not specified. | 3.3 |
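The numeric opt field of the 'stf' keyword packs two digits into one integer, opt = o1*10 + o0. A small plain C++ sketch of the encoding and decoding is given below; `SaveToFileOpt`, `decodeSaveToFile` and `encodeSaveToFile` are hypothetical names, not part of the PROOF API.

```cpp
#include <cassert>

struct SaveToFileOpt {
    bool force;            // o0 == 1: force saving; o0 == 0: save only if required by admin
    bool afterEachPacket;  // o1 == 1: save after each packet; o1 == 0: save at query end
};

// Decode opt = o1*10 + o0 into its two digits.
SaveToFileOpt decodeSaveToFile(int opt) {
    assert(opt >= 0);
    const int o0 = opt % 10;  // units digit
    const int o1 = opt / 10;  // tens digit
    return {o0 == 1, o1 == 1};
}

// Inverse operation: build the opt value from the two settings.
int encodeSaveToFile(bool force, bool afterEachPacket) {
    return (afterEachPacket ? 10 : 0) + (force ? 1 : 0);
}
```

With this encoding, 'stf=11' would mean "force saving, after each packet", while a bare 'stf' corresponds to opt = 1, that is, forced saving at query end.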
5. Summary of parameters and environment variables
The parameters and RC environment variables affecting the options defined in this section are summarized in Table 2.
Name | Type | Description | Remarks |
---|---|---|---|
PROOF_DefaultOutputOption | Param | String parameter used internally to pass the output file or the dataset name; in the first case it is in the form 'of:fileout', while for datasets it has the form 'ds:name' | |
PROOF_SavePartialResults | Param | Int_t parameter containing the 'savetofile' option | |
ProofPlayer.SavePartialResults | RC var | As PROOF_SavePartialResults | Server side only |
ProofPlayer.SaveMemThreshold | RC var | Float_t parameter defining the threshold to force file saving; it is expressed as a fraction of the physical memory per core | Server side only |
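To make the meaning of ProofPlayer.SaveMemThreshold more concrete, the following plain C++ sketch converts a fraction into an absolute threshold. It rests on assumptions that the table does not spell out: that "physical memory per core" means total physical memory divided by the number of cores, and that the comparison is made against the memory used by the process. The function names are hypothetical.

```cpp
#include <cassert>

// Assumed interpretation: threshold = fraction * (physical memory / number of cores).
double memThresholdBytes(double fraction, double physicalMemBytes, unsigned nCores) {
    assert(nCores > 0);
    return fraction * (physicalMemBytes / nCores);
}

// Assumed decision rule: swap objects to file once the used memory exceeds
// the threshold, whatever the user's setting is.
bool mustSwapToFile(double usedBytes, double fraction,
                    double physicalMemBytes, unsigned nCores) {
    return usedBytes > memThresholdBytes(fraction, physicalMemBytes, nCores);
}
```

Under these assumptions, a fraction of 0.5 on a node with 16e9 bytes of memory and 8 cores corresponds to a threshold of 1e9 bytes.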