The Packetizer
The packetizer is responsible for balancing the load of a job across the workers assigned to it. It decides where each piece of work - called a packet - should be processed. An instance of the packetizer is created on the master node. In a multi-master configuration, one packetizer is created for each sub-master. Therefore, when looking at the packetizer, we can focus on the case of a single master without losing generality.
The performance of the workers can vary significantly, as can the transfer rates for accessing different files. To balance the work distribution dynamically, the packetizer uses a pull architecture: when a worker is ready for further processing, it asks the packetizer for the next packet.
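The pull model described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the actual packetizer code: the class name Packetizer, the method get_next_packet, the function simulate, the fixed packet size, and the per-worker cost seconds_per_event are all hypothetical names chosen for the example. Real packetizers may also size packets adaptively and take file locations into account, which this sketch ignores.

```python
import heapq
from collections import deque


class Packetizer:
    """Toy packetizer (hypothetical): hands out fixed-size packets on request."""

    def __init__(self, total_events, packet_size):
        # Each packet is a half-open range [first, last) of event numbers.
        self.pending = deque(
            (first, min(first + packet_size, total_events))
            for first in range(0, total_events, packet_size)
        )

    def get_next_packet(self, worker_id):
        # Called by a worker when it is idle; None means no work is left.
        return self.pending.popleft() if self.pending else None


def simulate(packetizer, seconds_per_event):
    """Simulate workers pulling packets; return events processed per worker.

    seconds_per_event maps a worker name to its (assumed) processing cost.
    """
    processed = {worker: 0 for worker in seconds_per_event}
    # All workers are idle at time 0. The heap is ordered by the time at
    # which each worker becomes free and asks for its next packet.
    ready = [(0.0, worker) for worker in sorted(seconds_per_event)]
    heapq.heapify(ready)
    while ready:
        now, worker = heapq.heappop(ready)
        packet = packetizer.get_next_packet(worker)
        if packet is None:
            continue  # nothing left for this worker; it simply stops
        first, last = packet
        n_events = last - first
        processed[worker] += n_events
        busy_until = now + n_events * seconds_per_event[worker]
        heapq.heappush(ready, (busy_until, worker))
    return processed
```

Calling simulate(Packetizer(1000, 100), {"fast": 1.0, "slow": 3.0}) assigns every event exactly once, and the faster worker ends up with more packets, even though the packetizer never needs to know the workers' speeds in advance. That is the point of the pull design: the distribution adapts to whatever rate each worker actually achieves.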
The Pull Architecture
The different packetizers and their strategies are described in this paper.