Romain's blog

Spark on Kubernetes Python and R bindings

Version 2.4 of Spark introduces Python and R bindings for Kubernetes.

Databricks has published an article dedicated to the Spark 2.4 features for Kubernetes.

The principle is exactly the same as in my [previous article][lk-2], but this time we are using the Python bindings.

The namespace must be created first (`k` is an alias for `kubectl`):

$ k create namespace spark
namespace "spark" created

It isolates the Spark pods from the rest of the cluster and could later be used to cap the resources available to them.
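As a sketch of that capping, a ResourceQuota could be attached to the namespace. The name `spark-quota` and the limits below are illustrative assumptions, not recommendations:

```shell
# Hypothetical quota: caps pod count, CPU and memory in the spark namespace
$ cat <<EOF | k apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: spark
spec:
  hard:
    pods: "10"
    requests.cpu: "4"
    requests.memory: 4Gi
    limits.cpu: "8"
    limits.memory: 8Gi
EOF
```

Note that once a CPU/memory quota is in place, every pod in the namespace must declare resource requests, which Spark does through its driver and executor memory/cores settings.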

$ cd $SPARK_HOME
$ ./bin/spark-submit \
    --master k8s://https://localhost:6443 \
    --deploy-mode cluster \
    --name spark-pi \
    --conf spark.executor.instances=2 \
    --conf spark.driver.memory=512m \
    --conf spark.executor.memory=512m \
    --conf spark.kubernetes.container.image=spark-py:v2.4.0 \
    --conf spark.kubernetes.pyspark.pythonVersion=3 \
    --conf spark.kubernetes.namespace=spark \
    local:///opt/spark/examples/src/main/python/pi.py

spark.kubernetes.pyspark.pythonVersion is an additional (and optional) property that selects the major Python version to use (it's 2 by default):

This sets the major Python version of the docker image used to run the driver and executor containers. Can either be 2 or 3.
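Once the application finishes, the computed value can be read from the driver logs. The pod name below is just an example taken from one run; yours will carry a different timestamp suffix:

```shell
# Watch the pods until the driver reaches the Completed state
$ k get po -n spark -w
# Then read the result from the driver logs (adapt the pod name to your run)
$ k logs spark-pi-1545987715677-driver -n spark | grep -i "pi is roughly"
```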

Labels

An interesting detail that has nothing to do with Python is that Spark applies labels to the pods it creates. They make it easy to identify the role of each pod.

$ k get po -L spark-app-selector,spark-role -n spark

NAME                            READY     STATUS              RESTARTS   AGE       SPARK-APP-SELECTOR                       SPARK-ROLE
spark-pi-1545987715677-driver   1/1       Running             0          12s       spark-c4e28a2ef3d14cfda16c007383318c79   driver
spark-pi-1545987715677-exec-1   0/1       ContainerCreating   0          1s        spark-application-1545987726694          executor
spark-pi-1545987715677-exec-2   0/1       ContainerCreating   0          1s        spark-application-1545987726694          executor

You can, for example, use these labels to delete all the terminated driver pods.

# You can also switch from the default to the spark namespace
# $ k config set-context $(kubectl config current-context) --namespace spark
$ k delete po -l spark-role=driver -n spark
# Or you can delete the whole namespace
$ k delete ns spark

[lk-2]: {{< ref "spark-on-k8s-first-run.md" >}}

#R #python