Kubeflow in K3D with GPU Support


It's easy to install Kubeflow with K3D. For GPU support, you first need to build a K3S image with CUDA support. This article shows how to set up Kubeflow 1.7.0 with K3D v5.4.9 and K3S 1.25.6.
 
But what do those names mean?

Kubeflow is a Kubernetes-based MLOps tool. It lets you manage the lifecycle of ML models.

K3S is a lightweight Kubernetes distribution, and K3D runs K3S clusters in Docker, which is neat.

If GPU support is needed, you have to build a K3S image with CUDA support. There is a K3D manual page to help build this image, but the manual is currently out of date, and this GitHub issue nailed the process. In the end you'll have a local K3S image with CUDA support.
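For reference, here's a rough sketch of what the build looks like. The repository path and the K3S_TAG build argument below are assumptions based on the K3D CUDA guide, so verify them against the current repository before running:

$ git clone https://github.com/k3d-io/k3d
$ cd k3d/docs/usage/advanced/cuda
$ docker build -t rancher/k3s:v1.25.6-k3s1-cuda --build-arg K3S_TAG=v1.25.6-k3s1 .

The image tag matches the --image flag we'll pass to K3D below, so the cluster picks up the local CUDA-enabled build.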

Once the image is built, you just need to create a K3D cluster and install Kubeflow. First, install the k3d CLI.
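One way to do that is the official install script; pinning a release with the TAG variable is documented in the K3D install guide:

$ curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | TAG=v5.4.9 bash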

The command below creates a k8s cluster with 3 worker nodes (-a 3), with the load balancer listening on port 8080 (-p 8080:80@loadbalancer), and one GPU (--gpus=1). The --image flag tells K3D to use the freshly built K3S CUDA image.

$ k3d cluster create kubeflow-gpu --api-port 6550 -p "8080:80@loadbalancer" -a 3 --image=rancher/k3s:v1.25.6-k3s1-cuda --gpus=1

After the cluster is created, you can check its status:

$ k3d cluster list
NAME                    SERVERS   AGENTS   LOADBALANCER
kubeflow-gpu            1/1       3/3      true

$ kubectl get nodes
NAME                                 STATUS   ROLES                  AGE     VERSION
k3d-kubeflow-gpu-agent-1    Ready    <none>                 3h26m   v1.25.6+k3s1
k3d-kubeflow-gpu-agent-0    Ready    <none>                 3h26m   v1.25.6+k3s1
k3d-kubeflow-gpu-server-0   Ready    control-plane,master   3h26m   v1.25.6+k3s1
k3d-kubeflow-gpu-agent-2    Ready    <none>                 3h26m   v1.25.6+k3s1
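If the NVIDIA device plugin shipped with the CUDA image guide is running, the GPU should also appear as an allocatable node resource under the device plugin's nvidia.com/gpu resource name:

$ kubectl describe nodes | grep nvidia.com/gpu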


With the cluster up and running, it's time to install Kubeflow, following these instructions. First, you need to install the kustomize CLI.
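One way to get kustomize is the official install script, which downloads the binary to the current directory (check the Kubeflow manifests README for the exact kustomize version this release requires):

$ curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
$ sudo mv kustomize /usr/local/bin/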

$ git clone https://github.com/kubeflow/manifests
$ cd manifests
$ git checkout v1.7.0
$ while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Check that there are no pods crashing:

$ kubectl get pods -A | grep -v Running

I had to run the while loop twice before no Pods were crashing. This happens because the resources are applied in an arbitrary order, so some of them can fail until the CRDs they depend on exist; you may need to execute the installation more than once.

After installation is complete, you need to configure access to the Kubeflow Dashboard. This can be done with kubectl port-forward, which is ephemeral (it dies as soon as you close the shell session), or through an Ingress rule, which is persistent.

The port-forward is straightforward:

$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Then point your browser to http://localhost:8080.

The ingress command requires a URL that can be resolved by your workstation:

$ kubectl create ingress -n istio-system kubeflow-central-dashboard --rule="kubeflow.your.domain/*=istio-ingressgateway:80"

Then point your browser to http://kubeflow.your.domain:8080. The port number matters: when we created the cluster, we told K3D to have the load balancer listen on port 8080.
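If kubeflow.your.domain doesn't resolve yet, a quick local workaround is an /etc/hosts entry pointing it at the machine running K3D (assumed here to be your own workstation):

$ echo "127.0.0.1 kubeflow.your.domain" | sudo tee -a /etc/hosts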

The default username and password are documented in the kubeflow/manifests README.

There is one final caveat. As of K3D version 5.4.9, the pods inside the cluster were not resolving DNS, and I needed to patch the CoreDNS ConfigMap with the correct DNS servers. Create a file named coredns-patch.yaml:

apiVersion: v1
data:
  Corefile: |
    .:53 {
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . /etc/resolv.conf your-dns1 your-dns2 your-dns3
        cache 30
    }

Note that the patch replaces the entire Corefile, so the file above keeps a minimal set of the default K3S plugins alongside the forward change. Replace your-dns1, your-dns2 and your-dns3 with your DNS servers' addresses, then apply the patch:

$ kubectl -n kube-system patch configmaps/coredns --patch-file coredns-patch.yaml
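CoreDNS won't necessarily reload the patched ConfigMap right away, so restart it and verify resolution from a throwaway pod (busybox here is just an arbitrary small image that ships nslookup):

$ kubectl -n kube-system rollout restart deployment coredns
$ kubectl run dnstest --rm -it --image=busybox:1.36 --restart=Never -- nslookup kubeflow.org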


And that's it. Hopefully you now have a Kubeflow instance with GPU support ready to rock.

Now you can check that the GPU is working in your Kubeflow Notebooks. Create a Jupyter Notebook with GPU support and run:

import torch
torch.cuda.is_available()

Expect "True" as the response.

Alternatively, open a console inside the Jupyter Notebook and run:

$ nvidia-smi
 
And you should see the GPU usage information.

Good modelling!


