distrdf001_spark_connection.py
## \file
## \ingroup tutorial_dataframe
## \notebook -draw
## Configure a Spark connection and fill two histograms in a distributed fashion.
##
## This tutorial shows the ingredients needed to set up the connection to a Spark
## cluster, namely a SparkConf object holding configuration parameters and a
## SparkContext object created with the desired options. After this initial
## setup, an RDataFrame with distributed capabilities is created and connected
## to the SparkContext instance. Finally, a couple of histograms are drawn from
## the created columns in the dataset.
##
## \macro_code
## \macro_image
##
## \date March 2021
## \author Vincenzo Eduardo Padulano
import pyspark
import ROOT

# Point RDataFrame calls to the Spark RDataFrame object
RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame

# Set up the connection to Spark.
# First create a dictionary with keys representing Spark-specific configuration
# parameters. In this tutorial we use the following configuration parameters:
#
# 1. spark.app.name: The name of the Spark application.
# 2. spark.master: The Spark endpoint responsible for running the
#    application. The syntax "local[4]" tells Spark to run locally on the
#    same machine with 4 cores, each running a separate process. By
#    default, a Spark application runs locally with as many concurrent
#    processes as there are available cores, which can also be written
#    explicitly as "local[*]".
#
# If you have access to a remote cluster, substitute the endpoint URL of your
# Spark master in the form "spark://HOST:PORT" as the value of `spark.master`.
# Depending on the availability of your cluster you may request more computing
# nodes or cores per node with a similar configuration:
#
# sparkconf = pyspark.SparkConf().setAll(
#     {"spark.master": "spark://HOST:PORT",
#      "spark.executor.instances": <number_of_nodes>,
#      "spark.executor.cores": <cores_per_node>}.items())
#
# You can find all configuration options and more details in the official Spark
# documentation at https://spark.apache.org/docs/latest/configuration.html .

# Create a SparkConf object with all the desired Spark configuration parameters
sparkconf = pyspark.SparkConf().setAll(
    {"spark.app.name": "distrdf001_spark_connection",
     "spark.master": "local[4]"}.items())
# Create a SparkContext with the configuration stored in `sparkconf`
sparkcontext = pyspark.SparkContext(conf=sparkconf)
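A side note on the `setAll` call above: it accepts an iterable of key/value pairs, which is why the configuration dictionary is converted with `.items()`. A minimal plain-Python sketch of that conversion (no Spark needed; the dictionary contents are just the ones used in this tutorial):

```python
# Illustrative only: show what dict.items() hands to SparkConf.setAll.
conf_dict = {
    "spark.app.name": "distrdf001_spark_connection",  # application name
    "spark.master": "local[4]",                       # run locally on 4 cores
}

# dict.items() yields (key, value) pairs; since Python 3.7 dictionaries
# preserve insertion order, so the pairs come out in the order written above.
pairs = list(conf_dict.items())
print(pairs)
```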

# Create an RDataFrame that will use Spark as a backend for computations
df = RDataFrame(1000, sparkcontext=sparkcontext)

# Set the random seed and define two columns of the dataset with random numbers.
ROOT.gRandom.SetSeed(1)
df_1 = df.Define("gaus", "gRandom->Gaus(10, 1)").Define("exponential", "gRandom->Exp(10)")

# Book a histogram for each column
h_gaus = df_1.Histo1D(("gaus", "Normal distribution", 50, 0, 30), "gaus")
h_exp = df_1.Histo1D(("exponential", "Exponential distribution", 50, 0, 30), "exponential")

# Plot the histograms side by side on a canvas
c = ROOT.TCanvas("distrdf001", "distrdf001", 800, 400)
c.Divide(2, 1)
c.cd(1)
h_gaus.DrawCopy()
c.cd(2)
h_exp.DrawCopy()

# Save the canvas
c.SaveAs("distrdf001_spark_connection.png")
print("Saved figure to distrdf001_spark_connection.png")