Efficient AI Model Serving with KServe ModelMesh and Kubernetes Persistent Volumes
By Fractz - 25 Sep 2024
Efficiently serving AI models at scale is essential for maintaining high-performance applications in today’s data-driven world. KServe’s ModelMesh Serving framework, when paired with Kubernetes Persistent Volumes (PVs), offers a scalable and resource-efficient way to deploy AI models. This blog provides a step-by-step guide to configuring ModelMesh Serving to use PVs, reducing deployment latency and removing the dependency on cloud object storage.
1. Introduction
As AI models grow more complex and require frequent updates, the infrastructure supporting these models must evolve to keep pace. Traditional methods of storing model files on cloud object storage can introduce delays during deployment, particularly when fetching large model files. KServe’s ModelMesh Serving addresses these challenges by providing a scalable model serving framework that can dynamically load and unload models as needed. By leveraging Kubernetes Persistent Volumes, this solution further optimizes model serving by reducing latency and improving overall efficiency.
2. Step-by-Step Guide
This section provides a detailed guide on configuring KServe ModelMesh Serving with Kubernetes Persistent Volumes. Follow these steps to set up an efficient AI model serving environment.
2.1. Prerequisites
Before starting, ensure you have the following:
• A Kubernetes cluster on which you have admin privileges (Minikube works too), with at least 4 CPUs and 8 GB of memory.
• kubectl and kustomize (v4.0.0+) installed.
• A “Quickstart” installation of ModelMesh Serving (a sketch of the install commands follows this list).
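If you still need the Quickstart installation, the modelmesh-serving repository provides an install script. A minimal sketch, assuming the release-0.11 branch; check the project’s releases for the current one:
RELEASE="release-0.11"   # assumption: substitute the latest release branch
git clone -b $RELEASE --depth 1 --single-branch https://github.com/kserve/modelmesh-serving.git
cd modelmesh-serving
kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace modelmesh-serving --quickstart   # --quickstart deploys a local etcd and MinIO for development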
2.2. Create a Persistent Volume Claim (PVC)
To begin, create a Persistent Volume Claim (PVC) within your Kubernetes cluster to allocate storage that ModelMesh can use to store model files.
Apply the following configuration using kubectl:
kubectl apply -f - <<EOF
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "my-models-pvc"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF
Verify that the PVC is created and bound to a persistent volume:
kubectl get pvc
# Output example:
# NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
# my-models-pvc   Bound    pvc-783726ab-9fd3-47f3-8c7d-bf7822d6d7f8   15Gi       RWX            retain-file-gold   2m
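Note that the bound capacity and storage class depend on your cluster’s provisioner; the 15Gi shown here likely reflects the storage class’s minimum volume size rather than the 1Gi request. Because the volume may be mounted by multiple serving pods, a ReadWriteMany-capable class is needed; if your default class only supports ReadWriteOnce, name an RWX-capable one explicitly. A minimal sketch, with my-rwx-class as a hypothetical name (list real options with kubectl get storageclass):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "my-models-pvc"
spec:
  storageClassName: my-rwx-class   # hypothetical; pick an RWX-capable class from your cluster
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi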
2.3. Create a Pod to Access the PVC
Next, create a pod that will mount the PVC as a volume. This pod will be used to upload your model files to the persistent volume.
Apply the following configuration using kubectl:
kubectl apply -f - <<EOF
---
apiVersion: v1
kind: Pod
metadata:
  name: "pvc-access"
spec:
  containers:
    - name: main
      image: ubuntu
      command: ["/bin/sh", "-ec", "sleep 10000"]
      volumeMounts:
        - name: "my-pvc"
          mountPath: "/mnt/models"
  volumes:
    - name: "my-pvc"
      persistentVolumeClaim:
        claimName: "my-models-pvc"
EOF
Confirm that the pod is running:
kubectl get pods | grep 'pvc\|STATUS'
# Output example:
# NAME         READY   STATUS    RESTARTS   AGE
# pvc-access   1/1     Running   0          2m12s
2.4. Store the Model on the Persistent Volume
Now, add your AI model to the persistent volume. In this tutorial, we use the MNIST handwritten digit recognition model trained with scikit-learn.
First, download the model file:
curl -sOL https://github.com/kserve/modelmesh-minio-examples/raw/main/sklearn/mnist-svm.joblib
Next, copy the model file onto the pvc-access pod:
kubectl cp mnist-svm.joblib pvc-access:/mnt/models/
Verify that the model has been successfully uploaded to the persistent volume:
kubectl exec -it pvc-access -- ls -alr /mnt/models/
# Expected output:
# total 356
# -rw-r--r-- 1 501    staff      344917 Sep 17 09:20 mnist-svm.joblib
# drwxr-xr-x 3 nobody 4294967294   4096 Sep 17 09:20 ..
# drwxr-xr-x 2 nobody 4294967294   4096 Sep 17 09:20 .
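If you plan to store several models on the same PVC, subdirectories keep things tidy; the InferenceService path field (used in step 2.6) then points at the relative path inside the volume. A hypothetical layout:
# optional: group models by framework (hypothetical layout)
kubectl exec pvc-access -- mkdir -p /mnt/models/sklearn
kubectl cp mnist-svm.joblib pvc-access:/mnt/models/sklearn/
# the corresponding InferenceService would then use path: sklearn/mnist-svm.joblib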
2.5. Configure ModelMesh Serving to Use the Persistent Volume Claim
To enable ModelMesh Serving to dynamically mount any PVC, set the allowAnyPVC configuration flag to true. ModelMesh Serving reads user configuration from a ConfigMap named model-serving-config in the namespace where it is installed.
Apply the following configuration using kubectl:
kubectl apply -f - <<EOF
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    allowAnyPVC: true
EOF
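The ModelMesh controller watches this ConfigMap, so the change should be picked up dynamically without restarting any pods. To confirm what is currently set:
kubectl get configmap model-serving-config -o jsonpath='{.data.config\.yaml}'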
2.6. Deploy a New Inference Service
With the PVC configured, you can now deploy a new inference service that references the persistent volume and the mnist-svm.joblib model stored on it.
Apply the following InferenceService configuration using kubectl:
kubectl apply -f - <<EOF
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-mnist
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storage:
        parameters:
          type: pvc
          name: my-models-pvc
        path: mnist-svm.joblib
EOF
After a few seconds, your inference service should be ready:
kubectl get isvc
# Output example:
# NAME            URL                                               READY   PREV   LATEST   AGE
# sklearn-mnist   grpc://modelmesh-serving.modelmesh-serving:8033   True                    35s
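If READY stays False, the predictor status and events usually point at the cause, such as a PVC that could not be mounted:
kubectl describe isvc sklearn-mnist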
2.7. Run an Inference Request
To test the setup, run an inference request against the deployed model. First, set up port-forwarding:
kubectl port-forward --address 0.0.0.0 service/modelmesh-serving 8008 &
Next, use curl to send an inference request. The data array represents the grayscale values of a 64-pixel image scan of the digit to be classified:
MODEL_NAME="sklearn-mnist"
curl -X POST -k "http://localhost:8008/v2/models/${MODEL_NAME}/infer" -d '{"inputs": [{ "name": "predict", "shape": [1, 64], "datatype": "FP32", "data": [0.0, 0.0, 3.0, 10.0, 15.0, 16.0, 2.0, 0.0, 0.0, 2.0, 14.0, 16.0, 11.0, 15.0, 7.0, 0.0, 0.0, 7.0, 16.0, 3.0, 5.0, 15.0, 4.0, 0.0, 0.0, 4.0, 14.0, 10.0, 12.0, 13.0, 0.0, 0.0, 0.0, 0.0, 3.0, 13.0, 15.0, 12.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.0, 15.0, 5.0, 0.0, 0.0, 0.0, 0.0, 15.0, 15.0, 15.0, 6.0, 0.0, 0.0, 0.0, 0.0, 10.0, 12.0, 11.0, 0.0, 0.0]}]}'
The JSON response should look like this, indicating that the model correctly identified the digit "7":
{
  "model_name": "sklearn-mnist__isvc-2d5cba6382",
  "outputs": [
    {"name": "predict", "datatype": "INT64", "shape": [1], "data": [7]}
  ]
}
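When you are finished experimenting, you can remove the resources created in this tutorial:
kill %1   # stop the background port-forward (job %1 in this shell)
kubectl delete isvc sklearn-mnist
kubectl delete pod pvc-access
kubectl delete pvc my-models-pvc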
3. Conclusion
By configuring KServe ModelMesh Serving to utilize Kubernetes Persistent Volumes, you can significantly improve the efficiency and performance of your AI model deployments. This approach not only reduces the latency associated with fetching models from remote storage but also enhances the overall scalability of your AI infrastructure.