A cluster contains all the data of a given event range. As clusters are usually compressed and tied to event boundaries, an exact size cannot be enforced. Instead, RNTuple uses a target size for the compressed data as a guideline for when to flush a cluster.
The default cluster target size is 50MB of compressed data; it can be changed via the `RNTupleWriteOptions` and should work well in the majority of cases. In general, larger clusters provide room for more and larger pages and should improve compression ratio and speed. However, clusters also need to be buffered during write and (partially) during read, so larger clusters increase the memory footprint.
A second option in the `RNTupleWriteOptions` specifies the maximum uncompressed cluster size; the default is 512MiB. This setting acts as an "emergency brake" and prevents very compressible clusters from growing too large.
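As an illustration, both settings can be adjusted when creating a writer. The sketch below assumes the setter names `SetApproxZippedClusterSize` and `SetMaxUnzippedClusterSize` as found in recent ROOT releases; the exact header layout, namespace (`ROOT::Experimental` vs. `ROOT`), and setter names should be checked against the ROOT version in use.

```cpp
#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleWriteOptions.hxx>
#include <ROOT/RNTupleWriter.hxx>

#include <utility>

void WriteWithClusterSettings()
{
   using ROOT::Experimental::RNTupleModel;
   using ROOT::Experimental::RNTupleWriteOptions;
   using ROOT::Experimental::RNTupleWriter;

   auto model = RNTupleModel::Create();
   auto ptrEnergy = model->MakeField<float>("energy");

   RNTupleWriteOptions options;
   options.SetApproxZippedClusterSize(100 * 1000 * 1000);   // target ~100 MB of compressed data per cluster
   options.SetMaxUnzippedClusterSize(1024UL * 1024 * 1024); // hard cap of 1 GiB uncompressed

   auto writer = RNTupleWriter::Recreate(std::move(model), "ntpl", "data.root", options);
   for (int i = 0; i < 1000; ++i) {
      *ptrEnergy = 0.1f * i;
      writer->Fill();
   }
   // The writer flushes the last cluster and finalizes the file on destruction.
}
```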
Given the two settings, writing works as follows: when the current cluster exceeds the maximum uncompressed size, it is flushed unconditionally. When the estimated compressed size of the current cluster reaches the target size, it is flushed, too. For the first cluster, the assumed compression ratio is 0.5 if compression is used and 1 otherwise; subsequent clusters use the average compression ratio of all clusters written so far as the estimate. See the notes below for a discussion of this approximation.
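The flush decision can be illustrated with a small standalone sketch. The struct and member names below are made up for illustration and do not correspond to RNTuple internals; the sketch only mirrors the rule described above.

```cpp
#include <cstddef>

// Illustration only: decide whether the currently buffered cluster should be
// flushed, following the rule described in the text.
struct ClusterFlushPolicy {
   std::size_t targetZippedSize; // e.g. 50 MB (compressed target)
   std::size_t maxUnzippedSize;  // e.g. 512 MiB (hard cap)
   double estimatedRatio = 0.5;  // 0.5 for the first cluster if compression is on, 1.0 otherwise

   bool ShouldFlush(std::size_t unzippedBytesInCluster) const
   {
      // Unconditional flush: the uncompressed data exceeds the hard cap.
      if (unzippedBytesInCluster >= maxUnzippedSize)
         return true;
      // Flush when the estimated compressed size reaches the target.
      return unzippedBytesInCluster * estimatedRatio >= targetZippedSize;
   }

   // After writing a cluster, refine the estimate with the average compression
   // ratio of all clusters written so far.
   void Update(std::size_t totalZippedWritten, std::size_t totalUnzippedWritten)
   {
      if (totalUnzippedWritten > 0)
         estimatedRatio = double(totalZippedWritten) / double(totalUnzippedWritten);
   }
};
```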
Pages contain consecutive elements of a given column. They are the unit of compression and of addressability on storage. RNTuple uses a target size for the uncompressed data as a guideline for when to flush a page.
The default page target size is 64KiB; it can be changed via the `RNTupleWriteOptions`. In general, larger pages give better compression ratios and smaller pages reduce the memory footprint. When reading, every active column requires at least one page buffer. The page size does not influence the number of read requests, because pages of the same column are written consecutively and therefore read in one go.
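A corresponding sketch for the page target size is shown below. The setter name `SetApproxUnzippedPageSize` is taken from the `RNTupleWriteOptions` interface of some ROOT releases and may differ in others, so treat it as an assumption and check the interface of the version in use.

```cpp
#include <ROOT/RNTupleWriteOptions.hxx>

ROOT::Experimental::RNTupleWriteOptions MakePageTuningOptions()
{
   ROOT::Experimental::RNTupleWriteOptions options;
   // Target size of an uncompressed page (default: 64 KiB). Larger pages tend to
   // compress better; smaller pages reduce the per-column buffer footprint.
   options.SetApproxUnzippedPageSize(256 * 1024); // 256 KiB
   return options;
}
```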
Given the target size, writing works as follows: in the beginning, the first page is filled up to the target size. Afterwards, a mechanism prevents undersized tail pages: writing uses two page buffers in turns and flushes the previously filled buffer only once the next buffer is at least 50% full. Writing then continues up to the target size, at which point it switches back to the other page buffer. If the cluster gets flushed with an undersized tail page, the small page is appended to the previous page before flushing. Therefore, tail page sizes are between [0.5 * target size .. 1.5 * target size] (unless the column doesn't have enough elements to fill 50% of the first page).
Concretely, writing will fill and flush two pages `A` and `B` as follows:

1. Writing starts and fills page `A`.
2. Page `A` reaches its target size: writing switches to page `B` while the contents of page `A` are kept in memory.
3. Page `B` is at least 50% full: page `A` is flushed.
4. Page `B` reaches its target size: writing switches back to page `A`.
5. Page `A` is at least 50% full: page `B` is flushed.
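The two-buffer scheme can be sketched as follows. This is a self-contained illustration of the rules above with made-up names and a fixed element type, not the actual RNTuple page sink.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustration only: two page buffers filled in turns, with the 50% rule for
// flushing and the tail-page merge on cluster flush.
class TwoPageBuffer {
   std::vector<float> fCurrent, fPrevious;
   std::size_t fTargetElems; // target page size expressed in elements

   void FlushPage(std::vector<float> &page)
   {
      // ... hand the page to compression and storage ...
      page.clear();
   }

public:
   explicit TwoPageBuffer(std::size_t targetElems) : fTargetElems(targetElems) {}

   void Append(float element)
   {
      fCurrent.push_back(element);
      // Flush the previously filled page only once the current one is at least
      // half full, so that no flushed page is smaller than 0.5 * target size.
      if (!fPrevious.empty() && fCurrent.size() >= fTargetElems / 2)
         FlushPage(fPrevious);
      // When the current page reaches its target size, switch buffers and keep
      // the full page in memory until the other buffer is half full.
      if (fCurrent.size() >= fTargetElems)
         std::swap(fCurrent, fPrevious);
   }

   void FlushCluster()
   {
      // An undersized tail page (less than half full) is appended to the
      // previously filled page, so flushed tail pages end up between 0.5x and
      // 1.5x the target size.
      if (!fPrevious.empty()) {
         fPrevious.insert(fPrevious.end(), fCurrent.begin(), fCurrent.end());
         fCurrent.clear();
         FlushPage(fPrevious);
      } else if (!fCurrent.empty()) {
         FlushPage(fCurrent);
      }
   }
};
```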
The estimator for the compressed cluster size uses the average compression factor of the clusters written so far. This has been chosen as a simple yet expectedly accurate enough estimator (to be validated). The following alternative strategies were discussed: