distrdf001_spark_connection.py
## \file
## \ingroup tutorial_dataframe
## \notebook -draw
## Configure a Spark connection and fill two histograms distributedly.
##
## This tutorial shows the ingredients needed to set up the connection to a Spark
## cluster, namely a SparkConf object holding configuration parameters and a
## SparkContext object created with the desired options. After this initial
## setup, an RDataFrame with distributed capabilities is created and connected
## to the SparkContext instance. Finally, a couple of histograms are drawn from
## the created columns in the dataset.
##
## \macro_code
## \macro_image
##
## \date March 2021
## \author Vincenzo Eduardo Padulano
import pyspark
import ROOT

# Point RDataFrame calls to Spark RDataFrame object
RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame

# Set up the connection to Spark
# First create a dictionary with keys representing Spark specific configuration
# parameters. In this tutorial we use the following configuration parameters:
#
# 1. spark.app.name: The name of the Spark application.
# 2. spark.master: The Spark endpoint responsible for running the
#    application. With the syntax "local[2]" we signal Spark that we want to
#    run locally on the same machine with 2 cores, each running a separate
#    process. The default behaviour of a Spark application is to run locally
#    on the same machine with as many concurrent processes as available
#    cores, which could also be written as "local[*]" (a sketch of this is
#    given right after this comment block).
#
# If you have access to a remote cluster you should substitute the endpoint URL
# of your Spark master, in the form "spark://HOST:PORT", as the value of
# `spark.master`. Depending on the availability of your cluster you may request
# more computing nodes or cores per node with a similar configuration:
#
# sparkconf = pyspark.SparkConf().setAll(
#     {"spark.master": "spark://HOST:PORT",
#      "spark.executor.instances": <number_of_nodes>,
#      "spark.executor.cores": <cores_per_node>}.items())
#
# You can find all configuration options and more details in the official Spark
# documentation at https://spark.apache.org/docs/latest/configuration.html .

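# Purely as an illustrative sketch of the "local[*]" default mentioned above
# (not part of this tutorial's actual setup), the same connection could be
# configured to use all available local cores:
#
# sparkconf = pyspark.SparkConf().setAll(
#     {"spark.app.name": "distrdf001_spark_connection",
#      "spark.master": "local[*]"}.items())
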
# Create a SparkConf object with all the desired Spark configuration parameters
sparkconf = pyspark.SparkConf().setAll(
    {"spark.app.name": "distrdf001_spark_connection",
     "spark.master": "local[2]",
     "spark.driver.memory": "4g"}.items())
# Create a SparkContext with the configuration stored in `sparkconf`
sparkcontext = pyspark.SparkContext(conf=sparkconf)

# Create an RDataFrame that will use Spark as a backend for computations
df = RDataFrame(1000, sparkcontext=sparkcontext)

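# Depending on the ROOT version, the distributed RDataFrame constructor may
# also accept an optional `npartitions` keyword controlling how many tasks
# the dataset is split into; treat the line below as a sketch and check the
# DistRDF documentation of your release before relying on it.
#
# df = RDataFrame(1000, npartitions=2, sparkcontext=sparkcontext)
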
# Set the random seed and define two columns of the dataset with random numbers.
ROOT.gRandom.SetSeed(1)
df_1 = df.Define("gaus", "gRandom->Gaus(10, 1)").Define("exponential", "gRandom->Exp(10)")

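# The Define expressions above are strings of C++ code: RDataFrame
# just-in-time compiles them and evaluates them once per entry to fill the
# new "gaus" and "exponential" columns.
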
# Book a histogram for each column
h_gaus = df_1.Histo1D(("gaus", "Normal distribution", 50, 0, 30), "gaus")
h_exp = df_1.Histo1D(("exponential", "Exponential distribution", 50, 0, 30), "exponential")

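# Booking the histograms is lazy: no computation has run at this point. The
# distributed Spark jobs are launched when the results are first accessed,
# i.e. by the DrawCopy calls below.
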
# Plot the histograms side by side on a canvas
c = ROOT.TCanvas("distrdf001", "distrdf001", 800, 400)
c.Divide(2, 1)
c.cd(1)
h_gaus.DrawCopy()
c.cd(2)
h_exp.DrawCopy()

# Save the canvas
c.SaveAs("distrdf001_spark_connection.png")
print("Saved figure to distrdf001_spark_connection.png")
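
# Optionally, the Spark connection can be shut down once the results have
# been produced; SparkContext.stop() is the standard pyspark call for this.
# sparkcontext.stop()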