Set Up Continuous Profiling for GPUs using Kubernetes
Prerequisites:
- A Kubernetes cluster.
- Kubernetes nodes running Linux kernel 5.4 or newer.
- An NVIDIA GPU.
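To confirm the kernel prerequisite, one option is to read each node's kernel version from the Kubernetes API; for example (assuming kubectl is already configured against your cluster):
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion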
Get an API token
To be able to send any data to Polar Signals Cloud, we need an API token so the collection agents can authenticate with the Polar Signals API.
Please refer to the Generating Tokens documentation for more details.
Instructions
The Kubernetes manifests below will deploy the Polar Signals Agent and the Polar Signals GPU Metrics Agent as DaemonSets to a Kubernetes cluster. Here's what they do in summary:
- Create a namespace called polarsignals to deploy the agents into.
- Create a secret containing the Polar Signals API token (which you should have created above) for authentication.
- Define a ClusterRole and ClusterRoleBinding to grant the agent permissions to list pods and get node info across the cluster.
- Deploy the agent and gpu-metrics-agent as DaemonSets. This deploys two pods to each node in the cluster. The agent container runs with some privileged settings to enable profiling via eBPF.
The agent will then profile all applications running on the nodes, the gpu-metrics-agent will collect GPU metrics, and both send their data to Polar Signals Cloud for querying and analysis.
Make sure to paste in your generated API token, copy the manifest below into a file called polarsignals-agent.yaml, and then apply it to your Kubernetes cluster using the command below.
Polar Signals Agent
kubectl apply -f polarsignals-agent.yaml
You can also use the command below to apply the manifest directly from the Polar Signals API.
kubectl apply -f "https://api.polarsignals.com/api/manifests.yaml?token=<your-token>"
apiVersion: v1
kind: Namespace
metadata:
labels:
pod-security.kubernetes.io/audit: privileged
pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/warn: privileged
name: polarsignals
---
apiVersion: v1
kind: Secret
metadata:
name: polarsignals-agent
namespace: polarsignals
labels:
app.kubernetes.io/component: continuous-profiler
app.kubernetes.io/instance: polarsignals-agent
app.kubernetes.io/name: polarsignals-agent
stringData:
token: <your-token-here>
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/component: continuous-profiler
app.kubernetes.io/instance: polarsignals-agent
app.kubernetes.io/name: polarsignals-agent
name: polarsignals-agent
rules:
- apiGroups:
- ""
resources:
- pods
verbs:
- list
- watch
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
app.kubernetes.io/component: continuous-profiler
app.kubernetes.io/instance: polarsignals-agent
app.kubernetes.io/name: polarsignals-agent
name: polarsignals-agent
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: polarsignals-agent
subjects:
- kind: ServiceAccount
name: polarsignals-agent
namespace: polarsignals
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
labels:
app.kubernetes.io/component: continuous-profiler
app.kubernetes.io/instance: polarsignals-agent
app.kubernetes.io/name: polarsignals-agent
name: polarsignals-agent
namespace: polarsignals
spec:
selector:
matchLabels:
app.kubernetes.io/component: continuous-profiler
app.kubernetes.io/instance: polarsignals-agent
app.kubernetes.io/name: polarsignals-agent
template:
metadata:
labels:
app.kubernetes.io/component: continuous-profiler
app.kubernetes.io/instance: polarsignals-agent
app.kubernetes.io/name: polarsignals-agent
app.kubernetes.io/version: v0.36.0
spec:
containers:
- args:
- /bin/parca-agent
- --log-level=info
- --node=$(NODE_NAME)
- --http-address=:7071
- --remote-store-address=grpc.polarsignals.com:443
- --remote-store-bearer-token-file=/var/polarsignals-agent/token
- --debuginfo-strip
- --debuginfo-temp-dir=/tmp
- --debuginfo-upload-cache-duration=5m
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
image: ghcr.io/parca-dev/parca-agent:v0.36.0
name: polarsignals-agent
ports:
- containerPort: 7071
name: http
readinessProbe:
httpGet:
path: /ready
port: http
resources: {}
securityContext:
privileged: true
readOnlyRootFilesystem: true
volumeMounts:
- mountPath: /tmp
name: tmp
- mountPath: /run
name: run
- mountPath: /boot
name: boot
readOnly: true
- mountPath: /lib/modules
name: modules
- mountPath: /sys/kernel/debug
name: debugfs
- mountPath: /sys/fs/cgroup
name: cgroup
- mountPath: /sys/fs/bpf
name: bpffs
- mountPath: /var/run/dbus/system_bus_socket
name: dbus-system
- mountPath: /var/polarsignals-agent
name: token
hostPID: true
nodeSelector:
kubernetes.io/os: linux
serviceAccountName: polarsignals-agent
tolerations:
- effect: NoSchedule
operator: Exists
- effect: NoExecute
operator: Exists
volumes:
- emptyDir: {}
name: tmp
- hostPath:
path: /run
name: run
- hostPath:
path: /boot
name: boot
- hostPath:
path: /sys/fs/cgroup
name: cgroup
- hostPath:
path: /lib/modules
name: modules
- hostPath:
path: /sys/fs/bpf
name: bpffs
- hostPath:
path: /sys/kernel/debug
name: debugfs
- hostPath:
path: /var/run/dbus/system_bus_socket
name: dbus-system
- secret:
secretName: polarsignals-agent
name: token
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
app.kubernetes.io/component: continuous-profiler
app.kubernetes.io/instance: polarsignals-agent
app.kubernetes.io/name: polarsignals-agent
name: polarsignals-agent
namespace: polarsignals
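Once the manifest is applied, you can verify that the agent pods are running on every node; for example:
kubectl -n polarsignals rollout status daemonset/polarsignals-agent
kubectl -n polarsignals get pods -o wide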
Polar Signals GPU Metrics Agent
Additionally, we're going to deploy the gpu-metrics-agent, which is scheduled only onto GPU nodes via the node affinity below.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: gpu-metrics-agent
namespace: polarsignals
labels:
app: gpu-metrics-agent
spec:
selector:
matchLabels:
app: gpu-metrics-agent
template:
metadata:
labels:
app: gpu-metrics-agent
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-gpu # Change this if you're not on GKE
operator: In
values:
- 'true'
tolerations:
- key: nvidia.com/gpu
operator: Equal
value: "present"
effect: NoSchedule
containers:
- name: gpu-metrics-agent
image: ghcr.io/polarsignals/gpu-metrics-agent:main-e8d881f
command:
- /gpu-metrics-agent
- --log-level=debug
- --metrics-producer-nvidia-gpu
- --collection-interval=2s
- --node=$(NODE_NAME)
- --remote-store-address=grpc.polarsignals.com:443
- --remote-store-bearer-token-file=/etc/gpu-metrics-server/token
- --remote-store-insecure-skip-verify
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
resources:
requests:
memory: 32Mi
cpu: 10m
limits:
memory: 128Mi
cpu: 100m
securityContext:
readOnlyRootFilesystem: true
privileged: true
volumeMounts:
- mountPath: /usr/local
name: kubernetes-bin
- mountPath: /etc/gpu-metrics-server
name: token
readOnly: false
volumes:
- name: token
secret:
secretName: polarsignals-agent
- hostPath:
path: /home/kubernetes/bin
type: Directory
name: kubernetes-bin
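After saving the manifest above to a file (for example gpu-metrics-agent.yaml, a name chosen here for illustration) and applying it, you can check that the gpu-metrics-agent pods landed on your GPU nodes and inspect their logs; for example:
kubectl apply -f gpu-metrics-agent.yaml
kubectl -n polarsignals get pods -l app=gpu-metrics-agent -o wide
kubectl -n polarsignals logs daemonset/gpu-metrics-agent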