Spark on Kubernetes Python and R bindings
Version 2.4 of Spark for Kubernetes introduces Python and R bindings.
- spark-py: the Spark image with Python bindings (including Python 2 and 3 executables)
- spark-r: the Spark image with R bindings (including the R executable)
Databricks has published an article dedicated to the Spark 2.4 features for Kubernetes.
It's exactly the same principle as already explained in my [previous article][lk-2]. But this time we are using:
- A different image: spark-py
- Another example: local:///opt/spark/examples/src/main/python/pi.py, once again a Pi computation :-|
- A dedicated Spark namespace: spark.kubernetes.namespace=spark
The namespace must be created first:
$ k create namespace spark
namespace "spark" created
It isolates the Spark pods from the rest of the cluster and could later be used to cap the available resources.
$ cd $SPARK_HOME
$ ./bin/spark-submit \
--master k8s://https://localhost:6443 \
--deploy-mode cluster \
--name spark-pi \
--conf spark.executor.instances=2 \
--conf spark.driver.memory=512m \
--conf spark.executor.memory=512m \
--conf spark.kubernetes.container.image=spark-py:v2.4.0 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.namespace=spark \
local:///opt/spark/examples/src/main/python/pi.py
spark.kubernetes.pyspark.pythonVersion is an additional (and optional) property that can be used to select the major Python version (it's 2 by default). From the documentation:
This sets the major Python version of the docker image used to run the driver and executor containers. Can either be 2 or 3.
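As a side note, pi.py estimates Pi with a Monte Carlo method: sample random points and count the fraction that lands inside the circle. Stripped of the Spark distribution logic, the core computation looks roughly like this (a plain-Python sketch, not the actual example file):

```python
import random

def estimate_pi(num_samples=100_000, seed=42):
    """Monte Carlo estimate of Pi: the fraction of random points in the
    unit square falling inside the quarter circle tends to pi/4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi())  # roughly 3.14
```

In the Spark version, the sampling loop is distributed across the executors (in pi.py the points are generated by parallelized map tasks and the counts are then reduced on the driver), which is why running it exercises both the driver and executor pods.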
Labels
An interesting feature that has nothing to do with Python is that Spark defines labels that are applied to the pods. They make it easy to identify the role of each pod.
$ k get po -L spark-app-selector,spark-role -n spark
NAME READY STATUS RESTARTS AGE SPARK-APP-SELECTOR SPARK-ROLE
spark-pi-1545987715677-driver 1/1 Running 0 12s spark-c4e28a2ef3d14cfda16c007383318c79 driver
spark-pi-1545987715677-exec-1 0/1 ContainerCreating 0 1s spark-application-1545987726694 executor
spark-pi-1545987715677-exec-2 0/1 ContainerCreating 0 1s spark-application-1545987726694 executor
You can, for example, use the spark-role label to delete all the terminated driver pods.
# Can also decide to switch from the default to the spark namespace
# $ k config set-context $(kubectl config current-context) --namespace spark
$ k delete po -l spark-role=driver -n spark
# Or you can delete the whole namespace
$ k delete ns spark
[lk-2]: {{< ref "spark-on-k8s-first-run.md" >}}