Introduction
One of the ideas behind PROOF is that the outputs (a few histograms) are typically much smaller than the input data (large trees). However, the PROOF technology can be used for a wider range of analyses than those producing a few 1D histograms and, if the output is a tree or many large 3D histograms, memory problems may arise.
As a first solution to this problem, support for merging output objects via files was introduced in version 5.18.00. The basic idea is that output objects expected to be large are saved locally on the worker instead of being added to the output list that is sent back to the master. What is sent back to the master are the coordinates of the file containing the objects. At the end of processing, the master either merges the files directly into the final output file or creates (and registers) the file collection. A new class, TProofOutputFile (TProofFile for ROOT < 5.20), was introduced to steer the automatic merging of the files produced on each worker with the output objects.
In 5.25/02 the functionality of TProofOutputFile was extended to make it possible to automatically derive a dataset (TFileCollection) from the files produced on the workers. The dataset can optionally be registered (and verified) in the dataset database and can be used directly for further analysis.
TProofOutputFile
The TProofOutputFile class was introduced to help handle output files produced on the workers. The class has one main constructor taking up to four arguments:
TProofOutputFile(const char *path, ERunType type, UInt_t opt = kRemote, const char *dsname = 0)
- The file name (const char *); the user is allowed to create files either in the assigned data directory (available via TProofServ::GetDataDir()) or in the working area of the sandbox; if an absolute path to a different area is given, the path is treated as relative to the assigned directories; to guarantee uniqueness and avoid concurrency problems, a string of the form "<ord>/<session-tag>/<query-#>" is prepended to the filename. In merging mode the files are always created in the working area;
- A type argument (TProofOutputFile::ERunType) which can take two values:
enum ERunType { kMerge = 1, kDataset = 2};
- kMerge: merge the produced files
- kDataset: create a dataset instead of merging the files; the related TFileCollection object is returned in the output list;
- An option argument (unsigned int) containing an ORed combination of TProofOutputFile::ETypeOpt values:
enum ETypeOpt { kRemote = 1, kLocal = 2, kCreate = 4, kRegister = 8, kOverwrite = 16, kVerify = 32};
- kRemote: in merging mode, do not make local copies of the files to be merged;
- kLocal: in merging mode, make local copies of the files to be merged before merging (see the TFileMerger constructor);
- kCreate: in dataset mode, just create the TFileCollection and add it to the output list (i.e. do not register the dataset);
- kRegister: in dataset mode, register the created dataset;
- kOverwrite: in dataset register mode, force replacement of the dataset if another one with the same name exists already;
- kVerify: in dataset register mode, also verify the created dataset;
- A name for the dataset to be created (const char *); this argument is optional; by default the base name of the first argument (the path) is used as the dataset name when one is needed.
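Since the option argument is a bit mask, flags are combined with bitwise OR and tested with bitwise AND. The following standalone sketch illustrates this without including any ROOT header: the enum is a local copy of the values quoted above, and the helpers hasOption and describeOptions are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <string>

// Local copy of the values quoted above; this is NOT the ROOT declaration.
enum ETypeOpt : unsigned int {
   kRemote = 1, kLocal = 2, kCreate = 4,
   kRegister = 8, kOverwrite = 16, kVerify = 32
};

// Hypothetical helper: true if 'flag' is set in the ORed mask 'opt'.
inline bool hasOption(unsigned int opt, ETypeOpt flag)
{
   return (opt & flag) != 0;
}

// Pairs a flag with its printable name (illustration only).
struct OptName { ETypeOpt flag; const char *name; };

// Hypothetical helper: "|"-separated names of the flags set in 'opt'.
inline std::string describeOptions(unsigned int opt)
{
   static const OptName kTable[] = {
      {kRemote, "kRemote"}, {kLocal, "kLocal"}, {kCreate, "kCreate"},
      {kRegister, "kRegister"}, {kOverwrite, "kOverwrite"}, {kVerify, "kVerify"}
   };
   // Only the six documented bits are expected in the mask
   assert((opt & ~63u) == 0);
   std::string out;
   for (const OptName &e : kTable) {
      if (!hasOption(opt, e.flag)) continue;
      if (!out.empty()) out += "|";
      out += e.name;
   }
   return out;
}
```

With these definitions, the combination kRegister | kOverwrite | kVerify (the one used later in the ProofNtuple example) is a single unsigned integer in which exactly those three bits are set.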
A second constructor taking up to three arguments is also available (mainly to preserve the initial signature):
TProofOutputFile(const char *path, const char *option = "M", const char *dsname = 0)
- The file name (const char *); see above;
- An option argument (const char *) which can be a combination of the following:
- 'M': merge the produced files
- 'D': create a dataset instead of merging the files; the related TFileCollection object is returned in the output list;
- 'L': in merging mode, make local copies of the files to be merged before merging (see the TFileMerger constructor);
- 'H': in merging mode, merge histograms in one go; a bit faster but potentially much more memory consuming.
- 'R': in dataset mode, register the created dataset;
- 'O': in dataset register mode, force replacement of the dataset if another one with the same name exists already;
- 'V': in dataset register mode, also verify the created dataset
- A name for the dataset to be created (const char *); see above.
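The correspondence between the option letters and the flags of the first constructor can be pictured with the standalone sketch below. It is not ROOT code: ParsedOptions and parseOptions are hypothetical names, the enum values are copied from the listings above, and the mapping itself (dataset mode without 'R' taken as kCreate, merging mode without 'L' taken as kRemote, 'O' and 'V' honoured only together with 'R') is an assumption derived from the descriptions above, not from the actual implementation.

```cpp
#include <cassert>
#include <string>

// Local copies of the values quoted above; NOT the ROOT declarations.
enum ERunType { kMerge = 1, kDataset = 2 };
enum ETypeOpt : unsigned int {
   kRemote = 1, kLocal = 2, kCreate = 4,
   kRegister = 8, kOverwrite = 16, kVerify = 32
};

// Hypothetical result of decoding an option string.
struct ParsedOptions {
   ERunType type;       // merging or dataset mode
   unsigned int opt;    // ORed ETypeOpt flags
   bool histosOneGo;    // 'H': merge histograms in one go (merging mode)
};

// Hypothetical decoder for option strings such as "M", "LH" or "DROV".
inline ParsedOptions parseOptions(const std::string &s)
{
   const auto has = [&s](char c) { return s.find(c) != std::string::npos; };
   // Assumption: 'M' and 'D' are mutually exclusive
   assert(!(has('M') && has('D')));
   ParsedOptions p{kMerge, 0u, false};
   if (has('D')) {
      p.type = kDataset;
      p.opt = has('R') ? kRegister : kCreate;
      if (has('R') && has('O')) p.opt |= kOverwrite;
      if (has('R') && has('V')) p.opt |= kVerify;
   } else {
      p.opt = has('L') ? kLocal : kRemote;
      p.histosOneGo = has('H');
   }
   return p;
}
```

Under these assumptions, "DROV" decodes to dataset mode with the same flag combination that the ProofNtuple example builds explicitly with the first constructor.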
In merging mode the master must be able to read the files from the worker machines. This means that the working directories, where the files are created, must be exported to the master. If a data-serving xrootd system is running on the cluster, it can be used for this purpose. The administrator must make sure that the sandboxes are exported in the xrootd configuration file.
If the created file needs to be served by a local server, the URL of the server can be passed using the environment variable LOCALDATASERVER. Also, any 'localroot' path definition made via the RC-env 'Path.Localroot' is automatically trimmed out; environment settings of this kind can be set via the xrootd directives 'xpd.putenv' and 'xpd.putrc'. They can also be changed by the user as explained in the section on setting the environment. When the files are on a shared partition, LOCALDATASERVER must be set to 'file://'.
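The effect of LOCALDATASERVER and of the 'localroot' trimming can be illustrated with a small standalone sketch. The function workerFileUrl is hypothetical and only models the behaviour described above (strip the localroot prefix, then prepend the server URL); the host name and paths in the usage note are made up, and the real logic inside ROOT may differ in its details.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of ROOT: build the URL under which a file
// created on a worker would be made available.
//   server    : value of LOCALDATASERVER, e.g. "root://host:1094/" or "file://"
//   localroot : value of Path.Localroot (may be empty)
//   path      : absolute local path of the file on the worker
inline std::string workerFileUrl(const std::string &server,
                                 const std::string &localroot,
                                 std::string path)
{
   // An absolute local path is expected
   assert(!path.empty() && path[0] == '/');
   // Trim the localroot prefix if the path starts with it
   if (!localroot.empty() && path.compare(0, localroot.size(), localroot) == 0)
      path.erase(0, localroot.size());
   // Keep the remaining path absolute
   if (path.empty() || path[0] != '/') path.insert(0, "/");
   return server + path;
}
```

For instance, with a server "root://host.do.main:1094/" and a localroot "/pool/data", the local path "/pool/data/proofbox/out.root" becomes "root://host.do.main:1094//proofbox/out.root"; with "file://" and no localroot, "/shared/out.root" becomes "file:///shared/out.root".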
The URL for the output file is controlled by the RC-env 'Proof.OutputFile'; the default is the input file name, created in the master sandbox.
Both local and final filenames can contain placeholders, which are resolved in the constructor of TProofOutputFile. The following placeholders are recognized:
- <user>, the user name
- <u>, the user name initial
- <group>, the PROOF group name
- <stag>, the session tag
- <ord>, the worker ordinal number
- <file>, the base-name of the path used to initialize the TProofOutputFile
Note that <ord> is dropped from the file name in the output file.
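As an illustration of the placeholder mechanism, the standalone sketch below substitutes the placeholders listed above in a file name template. SessionInfo, replaceAll and resolvePlaceholders are hypothetical names, and the sample values in the usage note are invented; the sketch does not model the dropping of <ord> from the final output file name, and it is not the code used by TProofOutputFile.

```cpp
#include <cassert>
#include <string>

// Hypothetical container for the values the placeholders stand for.
struct SessionInfo {
   std::string user, group, stag, ord, file;
};

// Replace every occurrence of 'from' in 's' with 'to'.
inline void replaceAll(std::string &s, const std::string &from, const std::string &to)
{
   assert(!from.empty());   // an empty pattern would never terminate
   for (std::string::size_type pos = s.find(from); pos != std::string::npos;
        pos = s.find(from, pos + to.size()))
      s.replace(pos, from.size(), to);
}

// Hypothetical resolver for the placeholders listed above.
inline std::string resolvePlaceholders(std::string name, const SessionInfo &si)
{
   replaceAll(name, "<user>", si.user);
   replaceAll(name, "<u>", si.user.substr(0, 1));   // user name initial
   replaceAll(name, "<group>", si.group);
   replaceAll(name, "<stag>", si.stag);
   replaceAll(name, "<ord>", si.ord);
   replaceAll(name, "<file>", si.file);
   return name;
}
```

For a user "alice" in group "default", session tag "session-123" and worker ordinal "0.4", the template "/data/<u>/<user>/<group>/<stag>/<ord>/<file>" initialized with "SimpleNtuple.root" resolves to "/data/a/alice/default/session-123/0.4/SimpleNtuple.root".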
In the rest of this page we describe how all this works with the help of a selector that generates the tutorials demo ntuple and displays its content. The selector code can be found in ProofNtuple.C under the $ROOTSYS/tutorials/proof directory. An example of a steering macro can be found in runProof.C under the same directory.
Dissection of ProofNtuple
Creating the TProofOutputFile object, opening the file and creating the ntuple
The instance of the TProofOutputFile class is created in SlaveBegin:
    // We may be creating a dataset or a merge file: check it
    TNamed *nm = dynamic_cast<TNamed *>(fInput->FindObject("SimpleNtuple.root"));
    if (nm) {
       // Just create the object
       UInt_t opt = TProofOutputFile::kRegister | TProofOutputFile::kOverwrite | TProofOutputFile::kVerify;
       fProofFile = new TProofOutputFile("SimpleNtuple.root",
                                         TProofOutputFile::kDataset, opt, nm->GetTitle());
    } else {
       // For the ntuple, we use the automatic file merging facility
       // Check if an output URL has been given
       TNamed *out = (TNamed *) fInput->FindObject("PROOF_OUTPUTFILE_LOCATION");
       Info("SlaveBegin", "PROOF_OUTPUTFILE_LOCATION: %s", (out ? out->GetTitle() : "undef"));
       fProofFile = new TProofOutputFile("SimpleNtuple.root", (out ? out->GetTitle() : "M"));
       out = (TNamed *) fInput->FindObject("PROOF_OUTPUTFILE");
       if (out) fProofFile->SetOutputFileName(out->GetTitle());
    }
The way the object is created depends on the type of run and reflects the two ways this TSelector implementation is used in the tutorials.
The first conditional scope is used when creating a dataset ("dataset" tutorial in runProof.C). A named object called "SimpleNtuple.root" triggers the creation of a dataset whose name is the title of the named object. The dataset will be registered (forcing replacement of any existing dataset with the same name) and verified.
The second part of the conditional scope is used when merging the files ("ntuple" tutorial in runProof.C). In this case we use the named object "PROOF_OUTPUTFILE_LOCATION" to determine whether or not the files are copied to the master before merging.
The file must be opened using TProofOutputFile::OpenFile:
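The decision taken in SlaveBegin can be summarized by the standalone sketch below, which models the input list as a map from object name to object title. OutputPlan and planOutput are hypothetical names introduced for this illustration, the sketch contains no ROOT calls, and ignoring an empty "PROOF_OUTPUTFILE_LOCATION" title is an assumption of the sketch rather than something the selector does.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical summary of what SlaveBegin decides from the input list.
struct OutputPlan {
   bool dataset;            // true: dataset mode, false: merging mode
   std::string option;      // option string used in merging mode
   std::string dsname;      // dataset name used in dataset mode
   std::string finalName;   // final output file name, empty if not given
};

// 'input' maps the name of each named input object to its title.
inline OutputPlan planOutput(const std::map<std::string, std::string> &input)
{
   OutputPlan plan{false, "", "", ""};
   auto ds = input.find("SimpleNtuple.root");
   if (ds != input.end()) {
      // Dataset run: the title of the named object is the dataset name
      plan.dataset = true;
      plan.dsname = ds->second;
      return plan;
   }
   // Merging run: plain merge ("M") unless a location option is given
   plan.option = "M";
   auto loc = input.find("PROOF_OUTPUTFILE_LOCATION");
   if (loc != input.end() && !loc->second.empty()) plan.option = loc->second;
   auto out = input.find("PROOF_OUTPUTFILE");
   if (out != input.end()) plan.finalName = out->second;
   // In merging mode an option string is always defined
   assert(!plan.option.empty());
   return plan;
}
```

An empty input list therefore yields a plain merge with option "M", while an entry named "SimpleNtuple.root" switches the run to dataset creation regardless of the other inputs.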
    // Open the file
    TDirectory *savedir = gDirectory;
    if (!(fFile = fProofFile->OpenFile("RECREATE"))) {
       Warning("SlaveBegin", "problems opening file: %s/%s",
                             fProofFile->GetDir(), fProofFile->GetFileName());
    }
The TNtuple is created and attached to the file as usual in ROOT:
    // Now we create the ntuple
    fNtp = new TNtuple("ntuple","Demo ntuple","px:py:pz:random:i");
    // File resident
    fNtp->SetDirectory(fFile);
    fNtp->AutoSave();
Filling the ntuple
The TNtuple is filled in Process:
    Bool_t ProofNtuple::Process(Long64_t entry)
    {
       // Fill ntuple
       Float_t px, py;
       fRandom->Rannor(px,py);
       Float_t pz = px*px + py*py;
       Float_t random = fRandom->Rndm(1);
       Int_t i = (Int_t) entry;
       fNtp->Fill(px,py,pz,random,i);

       return kTRUE;
    }
Finalizing the file
The file must be finalized and closed in SlaveTerminate:
    // Write the ntuple to the file
    if (fFile) {
       Bool_t cleanup = kFALSE;
       TDirectory *savedir = gDirectory;
       if (fNtp->GetEntries() > 0) {
          fFile->cd();
          fNtp->Write();
          fProofFile->Print();
          fOutput->Add(fProofFile);
       } else {
          cleanup = kTRUE;
       }
       fNtp->SetDirectory(0);
       gDirectory = savedir;
       fFile->Close();
       // Cleanup, if needed
       if (cleanup) {
          TUrl uf(*(fFile->GetEndpointUrl()));
          SafeDelete(fFile);
          gSystem->Unlink(uf.GetFile());
          SafeDelete(fProofFile);
       }
    }
We add the TProofOutputFile to the output list only if the ntuple is not empty; otherwise we clean up by removing the file.
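The keep-or-remove policy can be shown in isolation with the following standalone sketch, which uses only the C++ standard library. The functions finalizeWorkerFile and fileExists are hypothetical, a vector of paths stands in for the output list, and nothing here reproduces the ROOT calls used by the selector.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: true if a file at 'path' can be opened for reading.
inline bool fileExists(const std::string &path)
{
   return std::ifstream(path).good();
}

// Hypothetical helper modelling the policy above: a worker file holding at
// least one entry is added to the output list, an empty one is deleted.
// Returns true if the file was kept.
inline bool finalizeWorkerFile(const std::string &path, long entries,
                               std::vector<std::string> &outputList)
{
   assert(entries >= 0);
   if (entries > 0) {
      outputList.push_back(path);   // to be merged or added to the dataset
      return true;
   }
   std::remove(path.c_str());       // nothing useful inside: remove the file
   return false;
}
```

Removing empty files on the worker means the master never receives coordinates of a file that contributes nothing to the merge or to the dataset.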
Accessing the output objects
For "ntuple" runs the output objects in Terminate are read from the output file and used for the final plot:
    // Do nothing if not requested (dataset creation run)
    if (!fPlotNtuple) return;

    // Get the ntuple from the file
    if ((fProofFile =
            dynamic_cast<TProofOutputFile*>(fOutput->FindObject("SimpleNtuple.root")))) {
       TString outputFile(fProofFile->GetOutputFileName());
       TString outputName(fProofFile->GetName());
       outputName += ".root";
       Printf("outputFile: %s", outputFile.Data());
       // Read the ntuple from the file
       fFile = TFile::Open(outputFile);
       if (fFile) {
          Printf("Managed to open file: %s", outputFile.Data());
          fNtp = (TNtuple *) fFile->Get("ntuple");
       } else {
          Error("Terminate", "could not open file: %s", outputFile.Data());
       }
       if (!fFile) return;
    } else {
       Error("Terminate", "TProofOutputFile not found");
       return;
    }
For "dataset" runs the graphical finalization is done outside the selector in runProof.C, providing an example of drawing functionality via PROOF.
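In "ntuple" runs the final output file name is taken from the named object "PROOF_OUTPUTFILE" when it is present in the input list (see SlaveBegin above); it can be added from the steering macro, for example:
    proof->AddInput(new TNamed("PROOF_OUTPUTFILE", "root://arthux.do.main:5151//data/out/MyOutput.root"));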