ModelMesh

This content is originally from https://github.com/kserve/modelmesh-serving/blob/main/docs/quickstart.md

Getting started

To quickly get started using ModelMesh Serving, here is a brief guide.

Prerequisites

You will need a Kubernetes cluster with cluster-admin authority and kubectl configured to access it; see the Installation prerequisites later in this document for more detail.

1. Install ModelMesh Serving

Get the latest release

RELEASE=release-0.10
git clone -b $RELEASE --depth 1 --single-branch https://github.com/kserve/modelmesh-serving.git
cd modelmesh-serving

Run install script

kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace-scope-mode --namespace modelmesh-serving --quickstart

This will install ModelMesh Serving in the modelmesh-serving namespace, along with etcd and MinIO instances. After the script completes, you should see a Successfully installed ModelMesh Serving! message.

[!Note] These etcd and MinIO deployments are intended for development/experimentation and not for production.

Verify installation

Check that the pods are running:

kubectl get pods

NAME                                        READY   STATUS    RESTARTS   AGE
pod/etcd                                    1/1     Running   0          5m
pod/minio                                   1/1     Running   0          5m
pod/modelmesh-controller-547bfb64dc-mrgrq   1/1     Running   0          5m

Check that the ServingRuntimes are available:

kubectl get servingruntimes

NAME           DISABLED   MODELTYPE    CONTAINERS   AGE
mlserver-0.x              sklearn      mlserver     5m
ovms-1.x                  openvino_ir  ovms         5m
torchserve-0.x            pytorch-mar  torchserve   5m
triton-2.x                tensorflow   triton       5m

ServingRuntimes are automatically provisioned based on the framework of the model deployed. Four ServingRuntimes are included with ModelMesh Serving by default. The current mappings for these are:

| ServingRuntime | Supported Frameworks |
| -------------- | -------------------- |
| mlserver-0.x   | sklearn, xgboost, lightgbm |
| ovms-1.x       | openvino_ir, onnx |
| torchserve-0.x | pytorch-mar |
| triton-2.x     | tensorflow, pytorch, onnx, tensorrt |

2. Deploy a model

With ModelMesh Serving now installed, try deploying a model using the KServe InferenceService CRD.

[!Note] While both the KServe controller and ModelMesh controller will reconcile InferenceService resources, the ModelMesh controller will only handle those InferenceServices with the serving.kserve.io/deploymentMode: ModelMesh annotation. Otherwise, the KServe controller will handle reconciliation. Likewise, the KServe controller will not reconcile an InferenceService with the serving.kserve.io/deploymentMode: ModelMesh annotation, and will defer under the assumption that the ModelMesh controller will handle it.

Here, we deploy a scikit-learn MNIST model which is served from the local MinIO container:

kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-sklearn-isvc
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib
EOF

Note: the above YAML uses the InferenceService predictor storage spec. You can also continue using the storageUri field in lieu of the storage spec:

kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-sklearn-isvc
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
    serving.kserve.io/secretKey: localMinIO
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://modelmesh-example-models/sklearn/mnist-svm.joblib
EOF

After applying this InferenceService, you should see that it is likely not yet ready.

kubectl get isvc

NAME                    URL   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
example-sklearn-isvc          False                                                                 3s

Eventually, you should see the ServingRuntime pods that will hold the SKLearn model become Running.

kubectl get pods

...
modelmesh-serving-mlserver-0.x-7db675f677-twrwd   3/3     Running   0          2m
modelmesh-serving-mlserver-0.x-7db675f677-xvd8q   3/3     Running   0          2m

Then, checking on the InferenceService again, you should see that the one we deployed is now ready with a provided URL:

kubectl get isvc

NAME                    URL                                               READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
example-sklearn-isvc    grpc://modelmesh-serving.modelmesh-serving:8033   True                                                                  97s

You can describe the InferenceService to get more status information:

kubectl describe isvc example-sklearn-isvc

Name:         example-sklearn-isvc
...
Status:
  Components:
    Predictor:
      Grpc URL:  grpc://modelmesh-serving.modelmesh-serving:8033
      Rest URL:  http://modelmesh-serving.modelmesh-serving:8008
      URL:       grpc://modelmesh-serving.modelmesh-serving:8033
  Conditions:
    Last Transition Time:  2022-07-18T18:01:54Z
    Status:                True
    Type:                  PredictorReady
    Last Transition Time:  2022-07-18T18:01:54Z
    Status:                True
    Type:                  Ready
  Model Status:
    Copies:
      Failed Copies:  0
      Total Copies:   2
    States:
      Active Model State:  Loaded
      Target Model State:
    Transition Status:     UpToDate
  URL:                     grpc://modelmesh-serving.modelmesh-serving:8033
...

3. Perform an inference request

Now that a model is loaded and available, you can then perform inference. Currently, only gRPC inference requests are supported by ModelMesh, but REST support is enabled via a REST proxy container. By default, ModelMesh Serving uses a headless Service since a normal Service has issues load balancing gRPC requests. See more info here.

gRPC request

To test out gRPC inference requests, you can port-forward the headless service in a separate terminal window:

kubectl port-forward --address 0.0.0.0 service/modelmesh-serving 8033 -n modelmesh-serving

Then a gRPC client generated from the KServe grpc_predict_v2.proto file can be used with localhost:8033. A ready-to-use Python example of this can be found here.
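
For illustration, here is a minimal Python client sketch for the same endpoint. It assumes you have generated Python stubs from grpc_predict_v2.proto with grpcio-tools (the module names grpc_predict_v2_pb2 and grpc_predict_v2_pb2_grpc are the protoc defaults); the all-zero input is a placeholder, so substitute the 64 pixel values shown in the grpcurl example below for a meaningful prediction.

import grpc
import grpc_predict_v2_pb2 as pb
import grpc_predict_v2_pb2_grpc as pb_grpc

# Connect through the port-forwarded headless service
channel = grpc.insecure_channel("localhost:8033")
stub = pb_grpc.GRPCInferenceServiceStub(channel)

request = pb.ModelInferRequest(
    model_name="example-sklearn-isvc",
    inputs=[
        pb.ModelInferRequest.InferInputTensor(
            name="predict",
            datatype="FP32",
            shape=[1, 64],
            # Placeholder input; replace with real 8x8 digit pixel values
            contents=pb.InferTensorContents(fp32_contents=[0.0] * 64),
        )
    ],
)
response = stub.ModelInfer(request)
print(response)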

Alternatively, you can test inference with grpcurl. This can easily be installed with brew install grpcurl if on macOS.

With grpcurl, a request can be sent to the SKLearn MNIST model like the following. Make sure that the MODEL_NAME variable below is set to the name of your InferenceService.

MODEL_NAME=example-sklearn-isvc
grpcurl \
  -plaintext \
  -proto fvt/proto/kfs_inference_v2.proto \
  -d '{ "model_name": "'"${MODEL_NAME}"'", "inputs": [{ "name": "predict", "shape": [1, 64], "datatype": "FP32", "contents": { "fp32_contents": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0] }}]}' \
  localhost:8033 \
  inference.GRPCInferenceService.ModelInfer

This should give you output like the following:

{
  "modelName": "example-sklearn-isvc__isvc-3642375d03",
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": ["1"],
      "contents": {
        "int64Contents": ["8"]
      }
    }
  ]
}

REST request

[!Note] The REST proxy is currently in an alpha state and may still have issues with certain usage scenarios.

You will need to port-forward a different port for REST.

kubectl port-forward --address 0.0.0.0 service/modelmesh-serving 8008 -n modelmesh-serving

With curl, a request can be sent to the SKLearn MNIST model like the following. Make sure that the MODEL_NAME variable below is set to the name of your InferenceService.

MODEL_NAME=example-sklearn-isvc
curl -X POST -k http://localhost:8008/v2/models/${MODEL_NAME}/infer -d '{"inputs": [{ "name": "predict", "shape": [1, 64], "datatype": "FP32", "data": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0]}]}'
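
The same request can also be sent from Python with the requests library; this is just a sketch mirroring the curl command above (the data array is a placeholder to keep it short):

import requests

MODEL_NAME = "example-sklearn-isvc"
payload = {
    "inputs": [
        {
            "name": "predict",
            "shape": [1, 64],
            "datatype": "FP32",
            # Placeholder input; use the 64 pixel values from the curl example
            "data": [0.0] * 64,
        }
    ]
}
resp = requests.post(f"http://localhost:8008/v2/models/{MODEL_NAME}/infer", json=payload)
print(resp.json())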

This should give you a response like the following:

{
  "model_name": "example-sklearn-isvc__ksp-7702c1b55a",
  "outputs": [
    {
      "name": "predict",
      "datatype": "FP32",
      "shape": [1],
      "data": [8]
    }
  ]
}

4. (Optional) Deleting your ModelMesh Serving installation

To delete all ModelMesh Serving resources that were installed, run the following from the root of the project:

./scripts/delete.sh --namespace modelmesh-serving

Implementing a Custom Serving Runtime

ModelMesh Serving serves different kinds of models via different Serving Runtime implementations. A Serving Runtime is one or more containers which:

  • Can dynamically load and unload models from disk into memory on demand
  • Expose a gRPC service endpoint to serve inferencing requests for loaded models

More specifically, the container(s) must:

  1. Implement the simple model management gRPC SPI which comprises RPC methods to load/unload models, report their size, and report the runtime’s total capacity
  2. Implement one or more other arbitrary gRPC services to serve inferencing requests for already-loaded models

These gRPC services for (2) must all be served from the same server endpoint. The management service SPI may be served by that same endpoint or a different one. Each of these endpoints may listen on a localhost port, or a unix domain socket. For best performance, a domain socket is preferred for the inferencing endpoint, and the corresponding file should be created in an empty dir within one of the containers. This dir will become a mount in all of the runtime containers when they are run.

Model Server Management SPI

Below is a description of how to implement the mmesh.ModelRuntime gRPC service, specified in model-runtime.proto. Note that this is currently subject to change, but we will try to ensure that any changes are backwards-compatible or at least will require minimal change on the runtime side.

Model sizing

So that ModelMesh Serving can decide when/where models should be loaded and unloaded, a given serving runtime implementation must communicate details of how much capacity it has to hold loaded models in memory, as well as how much each loaded model consumes.

Model sizes are communicated in a few different ways:

  • A rough “global” default/average size for all models must be provided in the defaultModelSizeInBytes field in the response to the runtimeStatus rpc method. This should be a very conservative estimate.

  • A predicted size can optionally be provided by implementing the predictModelSize rpc method. This will be called prior to loadModel and if implemented should return immediately (for example it should not make remote calls which could be delayed).

  • The more precise size of an already-loaded model can be provided by either:

    1. Including it in the sizeInBytes field of the response to the loadModel rpc method
    2. Not setting it in the loadModel response, and instead implementing the separate modelSize method to return the size. This will be called immediately after loadModel returns, and isn’t required to be implemented if the first option is used.

    The second of these last two options is preferred when a separate step is required to determine the size after the model has already been loaded. This is so that the model can start to be used for inferencing immediately, while the sizing operation is still in progress.

Capacity is indicated once via the capacityInBytes field in the response to the runtimeStatus rpc method and assumed to be constant.

Ideally, the value of capacityInBytes should be calculated dynamically as a function of your model server container’s allocated memory. One way to arrange this is via Kubernetes’ Downward API - mapping the container’s requests.memory property to an environment variable. Of course some amount of fixed overhead should likely be subtracted from this value:

env:
  - name: MODEL_SERVER_MEM_REQ_BYTES
    valueFrom:
      resourceFieldRef:
        containerName: my-model-server
        resource: requests.memory
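
For example, a Python-based runtime could derive capacityInBytes from that environment variable at startup. This is only a sketch, and the fixed overhead value is an assumption you would tune for your own model server:

import os

# Assumed overhead for the server process itself (tune for your runtime)
FIXED_OVERHEAD_BYTES = 256 * 1024 * 1024

def capacity_in_bytes() -> int:
    # MODEL_SERVER_MEM_REQ_BYTES is populated via the Downward API mapping above
    mem_request = int(os.environ["MODEL_SERVER_MEM_REQ_BYTES"])
    return max(mem_request - FIXED_OVERHEAD_BYTES, 0)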

runtimeStatus

message RuntimeStatusRequest {}

This is polled at the point that the main model-mesh container starts to check that the runtime is ready. You should return a response with status set to STARTING until the runtime is ready to accept other requests and load/serve models at which point status should be set to READY.

The other fields in the response only need to be set in the READY response (and will be ignored prior to that). Once READY is returned, no further calls will be made unless the model-mesh container unexpectedly restarts.

Currently, to ensure overall consistency of the system, it is required that runtimes purge any loaded/loading models when receiving a runtimeStatus call, and do not return READY until this is complete. Typically, it’s only called during initialization prior to any load/unloadModel calls and hence this “purge” will be a no-op. But runtimes should also handle the case where there are models loaded. It is likely that this requirement will be removed in a future update, but ModelMesh Serving will remain compatible with runtimes that still include the logic.

message RuntimeStatusResponse {
    enum Status {
        STARTING = 0;
        READY = 1;
        FAILING = 2; //not used yet
    }
    Status status = 1;
    // memory capacity for static loaded models, in bytes
    uint64 capacityInBytes = 2;
    // maximum number of model loads that can be in-flight at the same time
    uint32 maxLoadingConcurrency = 3;
    // timeout for model loads in milliseconds
    uint32 modelLoadingTimeoutMs = 4;
    // conservative "default" model size,
    // such that "most" models are smaller than this
    uint64 defaultModelSizeInBytes = 5;
    // version string for this model server code
    string runtimeVersion = 6;

    message MethodInfo {
        // "path" of protobuf field numbers leading to the string
        // field within the request method corresponding to the
        // model name or id
        repeated uint32 idInjectionPath = 1;
    }

    // optional map of per-gRPC rpc method configuration
    // keys should be fully-qualified gRPC method name
    // (including package/service prefix)
    map<string,MethodInfo> methodInfos = 8;

    // EXPERIMENTAL - Set to true to enable the mode where
    // each loaded model reports a maximum inferencing
    // concurrency via the maxConcurrency field of
    // the LoadModelResponse message. Additional requests
    // are queued in the modelmesh framework. Turning this
    // on will also enable latency-based autoscaling for
    // the models, which attempts to minimize request
    // queueing time and requires no other configuration.
    bool limitModelConcurrency = 9;
}

loadModel

message LoadModelRequest {
    string modelId = 1;

    string modelType = 2;
    string modelPath = 3;
    string modelKey = 4;
}

The runtime should load a model with name/id specified by the modelId field into memory ready for serving, from the path specified by the modelPath field. At this time, the modelType field value should be ignored.

The modelKey field will contain a JSON string with the following contents:

{
  "model_type": {
    "name": "mytype",
    "version": "2"
  }
}

Where model_type corresponds to the modelFormat section from the originating InferenceService predictor. Note that version is optional and may not be present. In the future, additional attributes might be present in the outer JSON object, so your implementation should ignore them gracefully.

The response shouldn’t be returned until the model has loaded successfully and is ready to use.

message LoadModelResponse {
    // OPTIONAL - If nontrivial cost is involved in
    // determining the size, return 0 here and
    // do the sizing in the modelSize function
    uint64 sizeInBytes = 1;

    // EXPERIMENTAL - Applies only if limitModelConcurrency = true
    // was returned from runtimeStatus rpc.
    // See RuntimeStatusResponse.limitModelConcurrency for more detail
    uint32 maxConcurrency = 2;
}

unloadModel

message UnloadModelRequest {
    string modelId = 1;
}

The runtime should unload the previously loaded (or failed) model specified by modelId, and not return a response until the unload is complete and corresponding resources have been freed. If the specified model is not found/loaded, the runtime should return immediately (without error).

message UnloadModelResponse {}
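
To tie these RPCs together, below is a minimal Python sketch of the management SPI described above. It assumes stubs have been generated from model-runtime.proto with grpcio-tools (the module names model_runtime_pb2 and model_runtime_pb2_grpc are assumptions), and it “loads” a model by simply reading the file at modelPath into memory; a real runtime would load it into its serving engine and also handle directory paths.

import json
from concurrent import futures

import grpc
import model_runtime_pb2 as mr
import model_runtime_pb2_grpc as mr_grpc

class ModelRuntimeServicer(mr_grpc.ModelRuntimeServicer):
    def __init__(self):
        self.loaded_models = {}  # modelId -> in-memory model data

    def runtimeStatus(self, request, context):
        # Purge any loaded models before reporting READY (see note above)
        self.loaded_models.clear()
        return mr.RuntimeStatusResponse(
            status=mr.RuntimeStatusResponse.READY,
            capacityInBytes=2 * 1024**3,           # illustrative fixed capacity
            maxLoadingConcurrency=2,
            modelLoadingTimeoutMs=90000,
            defaultModelSizeInBytes=64 * 1024**2,  # conservative default size
            runtimeVersion="0.1.0",
        )

    def loadModel(self, request, context):
        meta = json.loads(request.modelKey)        # e.g. {"model_type": {...}, ...}
        with open(request.modelPath, "rb") as f:   # sketch: treat modelPath as a file
            data = f.read()
        self.loaded_models[request.modelId] = data
        return mr.LoadModelResponse(sizeInBytes=len(data))

    def unloadModel(self, request, context):
        # No error if the model was never loaded or already unloaded
        self.loaded_models.pop(request.modelId, None)
        return mr.UnloadModelResponse()

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
mr_grpc.add_ModelRuntimeServicer_to_server(ModelRuntimeServicer(), server)
server.add_insecure_port("0.0.0.0:8085")  # corresponds to grpcEndpoint "port:8085"
server.start()
server.wait_for_termination()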

Inferencing

The model runtime server can expose any number of protobuf-based gRPC services on the grpcDataEndpoint to use for inferencing requests. ModelMesh Serving is agnostic to specific service definitions (request/response message content), but for tensor-in/tensor-out based services it is recommended to conform to the KServe V2 dataplane API spec.

A given model runtime server will be guaranteed to only receive model inferencing requests for models that had previously completed loading successfully (via a loadModel request), and to have not been unloaded since.

Though generally agnostic to the specific API methods, ModelMesh Serving does need to be able to set/override the model name/id used in a given request. There are two options for obtaining the model name/id within the request (which will correspond to the same id previously passed to loadModel):

  1. Obtain it from one of the mm-model-id or mm-model-id-bin gRPC metadata headers (the latter is required for non-ASCII UTF-8 ids). Precisely how this is done depends on the implementation language - see the gRPC documentation for more information; a sketch is shown below this list.
  2. Provide the location of a specific string field within your request protobuf message (per RPC method) which will be replaced with the target model id. This is done via the methodInfos map in the runtime’s response to the runtimeStatus RPC method. Each applicable inferencing method should have an entry whose idInjectionPath field is set to a list of field numbers corresponding to the hierarchy of nested messages within the request message, the last of which is the number of the string field to replace. For example, if the id is a string field in the top-level request message with number 1 (as is the case in the KServe V2 ModelInferRequest), this list would be set to just [1].

Option 2 is particularly applicable when integrating with an existing gRPC-based model server.
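
For option 1, a Python-based inferencing servicer might resolve the model id from the gRPC metadata like this (a sketch only; the surrounding service definition is whatever inference API your runtime exposes):

def model_id_from_context(context, fallback=""):
    # mm-model-id / mm-model-id-bin headers are set by the model-mesh container
    metadata = dict(context.invocation_metadata())
    model_id = metadata.get("mm-model-id")
    if model_id is None:
        raw = metadata.get("mm-model-id-bin")  # bytes value, for non-ASCII UTF-8 ids
        model_id = raw.decode("utf-8") if raw is not None else fallback
    return model_id

# Example use inside a KServe V2-style ModelInfer handler:
#   model_id = model_id_from_context(context, fallback=request.model_name)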

Deploying a Runtime

Each Serving Runtime implementation is defined using the custom resource type ServingRuntime, which defines information about the runtime such as which container images need to be loaded and the local gRPC endpoints on which they will listen. When the resource is applied to the Kubernetes cluster, the ModelMesh Serving controller will deploy the runtime-specific containers, which will then enable support for the corresponding model types.

The following is an example of a ServingRuntime custom resource:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-runtime
spec:
  supportedModelFormats:
    - name: new-modelformat
      version: "1"
      autoSelect: true
  containers:
    - name: model-server
      image: samplemodelserver:latest
  multiModel: true
  grpcEndpoint: "port:8085"
  grpcDataEndpoint: "port:8090"

In each entry of the supportedModelFormats list, autoSelect: true can optionally be specified to indicate that the given ServingRuntime can be considered for automatic placement of InferenceServices with the corresponding model type/format if no runtime is explicitly specified. For example, if a user applies an InferenceService with predictor.model.modelFormat.name: new-modelformat and no runtime value, the above ServingRuntime will be used since it contains an “auto-selectable” supported model format that matches new-modelformat. If autoSelect were false or unspecified, the InferenceService would fail to load with the message “No ServingRuntime supports specified model type and/or protocol” unless the runtime example-runtime was specified directly in the YAML.

Runtime container resource allocations

TODO more detail coming here

Integrating with existing model servers

The ability to specify multiple containers provides a nice way to integrate with existing model servers via an adapter pattern, as long as they provide the required capability of dynamically loading and unloading models.

Note: only the adapter and model server containers are explicitly specified in the ServingRuntime CR; the other containers are included automatically.

The built-in runtimes based on NVIDIA’s Triton Inference Server and Seldon’s MLServer, together with their corresponding adapters, serve as good examples of this and can be used as a reference.

Reference

Spec Attributes

Available attributes in the ServingRuntime spec:

| Attribute | Description |
| --------- | ----------- |
| multiModel | Whether this ServingRuntime is ModelMesh-compatible and intended for multi-model usage (as opposed to KServe single-model serving) |
| disabled | Disables this runtime |
| containers | List of containers associated with the runtime |
| containers[ ].image | The container image for the current container |
| containers[ ].command | Executable command found in the provided image |
| containers[ ].args | List of command line arguments as strings |
| containers[ ].resources | Kubernetes limits or requests |
| containers[ ].imagePullPolicy | The container image pull policy |
| containers[ ].workingDir | The working directory for current container |
| grpcEndpoint | The port for model management requests |
| grpcDataEndpoint | The port or unix socket for inferencing requests arriving to the model server over the gRPC protocol. May be set to the same value as grpcEndpoint |
| supportedModelFormats | List of model types supported by the current runtime |
| supportedModelFormats[ ].name | Name of the model type |
| supportedModelFormats[ ].version | Version of the model type. It is recommended to include only the major version here, for example “1” rather than “1.15.4” |
| storageHelper.disabled | Disables the storage helper |
| nodeSelector | Influence Kubernetes scheduling to assign pods to nodes |
| affinity | Influence Kubernetes scheduling to assign pods to nodes |
| tolerations | Allow pods to be scheduled onto nodes with matching taints |
| replicas | The number of replicas of the runtime to create. This overrides the podsPerRuntime configuration |

Endpoint formats

Several of the attributes (grpcEndpoint, grpcDataEndpoint) support either Unix Domain Sockets or TCP. The endpoint should be formatted as either port:<number> or unix:<path>. The provided container must be listening either on the specified TCP port or at the provided socket path.


[!Warning] If a unix domain socket is specified for both grpcEndpoint and grpcDataEndpoint then it must either be the same socket (identical path) or reside in the same directory.
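
As an illustration of these formats, a runtime written in Python could translate the endpoint strings into gRPC bind addresses roughly as follows (a sketch of the convention described above, not an API provided by ModelMesh):

import grpc
from concurrent import futures

def bind_endpoint(server, endpoint):
    # "port:<number>" -> TCP port, "unix:<path>" -> Unix domain socket
    if endpoint.startswith("port:"):
        server.add_insecure_port("0.0.0.0:" + endpoint[len("port:"):])
    elif endpoint.startswith("unix:"):
        server.add_insecure_port("unix:" + endpoint[len("unix:"):])
    else:
        raise ValueError("unsupported endpoint format: " + endpoint)

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
bind_endpoint(server, "port:8085")  # grpcEndpoint
bind_endpoint(server, "port:8090")  # grpcDataEndpoint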


Full Example

The following example demonstrates all of the possible attributes that can be set in the model serving runtime spec:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-runtime
spec:
  supportedModelFormats:
    - name: my_model_format # name of the model format
      version: "1"
      autoSelect: true
  containers:
    - args:
        - arg1
        - arg2
      command:
        - command
        - command2
      env:
        - name: name
          value: value
        - name: fromSecret
          valueFrom:
            secretKeyRef:
              name: mysecret # the referenced Secret must already exist
              key: mykey
      image: image
      name: name
      resources:
        limits:
          memory: 200Mi
      imagePullPolicy: IfNotPresent
      workingDir: "/container/working/dir"
  multiModel: true
  disabled: false
  storageHelper:
    disabled: true
  grpcEndpoint: port:1234 # or unix:/path
  grpcDataEndpoint: port:1234 # or unix:/path
  # To influence pod scheduling, one or more of the following can be used
  nodeSelector: # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
    kubernetes.io/arch: "amd64"
  affinity: # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "kubernetes.io/arch"
                operator: In
                values:
                  - "amd64"
  tolerations: # https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
    - key: "example-key"
      operator: "Exists"
      effect: "NoSchedule"

Storage Helper

The storage helper downloads the model from the S3 bucket using the storage-config secret and places it at a local path. By default, the storage helper is enabled in the serving runtime. It can be disabled by setting storageHelper.disabled to true in the ServingRuntime spec. If the storage helper is disabled, the custom runtime needs to handle access to and pulling model data from storage itself. Configuration can be passed to the runtime’s pods through environment variables.

Example

Consider the custom runtime defined above with the following InferenceService:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-mnist-isvc
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: my_model_format
        version: "1"
      storage:
        key: my_storage
        path: my_models/mnist-model
        parameters:
          bucket: my_bucket

If the storage helper is enabled, the model serving container will receive the below model metadata in the loadModel call where modelPath will contain the path of the model in the local file system.

{
  "modelId": "my-mnist-isvc-<suffix>",
  "modelType": "my_model_format",
  "modelPath": "/models/my-mnist-isvc-<suffix>/",
  "modelKey": "<serialized metadata as JSON, see below>"
}

The following metadata for the InferenceService predictor is serialized to a string and embedded as the modelKey field:

{
  "bucket": "my_bucket",
  "disk_size_bytes": 2415,
  "model_type": {
    "name": "my_model_format",
    "version": "1"
  },
  "storage_key": "my_storage"
}

If the storage helper is disabled, the model serving container will receive the below model metadata in the loadModel call, where modelPath is the same as the path provided in the predictor storage spec.

{
  "modelId": "my-mnist-isvc-<suffix>",
  "modelType": "my_model_format",
  "modelPath": "my_models/mnist-model",
  "modelKey": "<serialized metadata as JSON, see below>"
}

The following metadata for the InferenceService predictor is serialized to a string and embedded as the modelKey field:

{
  "bucket": "my_bucket",
  "model_type": {
    "name": "my_model_format",
    "version": "1"
  },
  "storage_key": "my_storage"
}
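
Since the runtime must pull the model data itself in this case, here is a hedged sketch of what that could look like in Python with boto3. The credential and endpoint environment variable names are assumptions (you must supply them to the runtime pod yourself, for example via the ServingRuntime container env); they are not injected by ModelMesh Serving.

import json
import os

import boto3

def fetch_model(load_request, local_dir="/tmp/models"):
    meta = json.loads(load_request.modelKey)
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("S3_ENDPOINT_URL"),            # assumed variable name
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],         # assumed variable name
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"], # assumed variable name
    )
    local_path = os.path.join(local_dir, load_request.modelId)
    os.makedirs(local_dir, exist_ok=True)
    # With the storage helper disabled, modelPath is the object key from the
    # predictor storage spec (e.g. "my_models/mnist-model")
    s3.download_file(meta["bucket"], load_request.modelPath, local_path)
    return local_path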

Installation

Prerequisites

  • Kubernetes cluster - A Kubernetes cluster is required. You will need cluster-admin authority in order to complete all of the prescribed steps.

  • Kubectl and Kustomize - The installation will occur via the terminal using kubectl and kustomize.

  • etcd - ModelMesh Serving requires an etcd server in order to coordinate internal state; this etcd server can be either dedicated or shared. More on this later.

  • S3-compatible object storage - Before models can be deployed, a remote S3-compatible datastore is needed from which to pull the model data. This could be for example an IBM Cloud Object Storage instance, or a locally running MinIO deployment. Note that this is not required to be in place prior to the initial installation.

We provide an install script to quickly run ModelMesh Serving with a provisioned etcd server. This may be useful for experimentation or development but should not be used in production.

The install script has a --quickstart option for setting up a self-contained ModelMesh Serving instance. This will deploy and configure local etcd and MinIO servers in the same Kubernetes namespace. Note that this is only for experimentation and/or development use - in particular the connections to these datastores are not secure and the etcd cluster is a single member which is not highly available. Use of --quickstart also configures the storage-config secret to be able to pull from the ModelMesh Serving example models bucket which contains the model data for the sample Predictors. For complete details on the manifests applied with --quickstart see config/dependencies/quickstart.yaml.

Set up the etcd connection information

If the --quickstart install option is not being used, details of an existing etcd cluster must be specified prior to installation. Otherwise, please skip this step and proceed to Installation.

Create a file named etcd-config.json, populating the values based upon your etcd server. The same etcd server can be shared between environments and/or namespaces, but in this case the root_prefix must be set differently in each namespace’s respective secret. The complete json schema for this configuration is documented here.

{
  "endpoints": "https://etcd-service-hostame:2379",
  "userid": "userid",
  "password": "password",
  "root_prefix": "unique-chroot-prefix"
}

Then create the secret using the file (note that the key name within the secret must be etcd_connection):

kubectl create secret generic model-serving-etcd --from-file=etcd_connection=etcd-config.json

A secret named model-serving-etcd will be created and passed to the controller.

Installation

Install the latest release of modelmesh-serving by first cloning the corresponding release branch:

RELEASE=release-0.10
git clone -b $RELEASE --depth 1 --single-branch https://github.com/kserve/modelmesh-serving.git
cd modelmesh-serving

Run the script to install ModelMesh Serving CRDs, controller, and built-in runtimes into the specified Kubernetes namespaces, after reviewing the command line flags below.

A Kubernetes --namespace is required, which must already exist. You must also have cluster-admin authority and cluster access must be configured prior to running the install script.

A list of Kubernetes namespaces can optionally be passed via --user-namespaces to enable those namespaces for ModelMesh Serving. The script will skip any namespaces that don’t already exist.

The --quickstart option can be specified to install and configure supporting datastores in the same namespace (etcd and MinIO) for experimental/development use. If this is not chosen, the namespace provided must have an Etcd secret named model-serving-etcd created which provides access to the Etcd cluster. See the instructions above on this step.

kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace modelmesh-serving --quickstart

See the installation help below for detail:

./scripts/install.sh --help
usage: ./scripts/install.sh [flags]

Flags:
  -n, --namespace                (required) Kubernetes namespace to deploy ModelMesh Serving to.
  -p, --install-config-path      Path to local model serve installation configs. Can be ModelMesh Serving tarfile or directory.
  -d, --delete                   Delete any existing instances of ModelMesh Serving in Kube namespace before running install, including CRDs, RBACs, controller, older CRD with serving.kserve.io api group name, etc.
  -u, --user-namespaces          Kubernetes namespaces to enable for ModelMesh Serving
  --quickstart                   Install and configure required supporting datastores in the same namespace (etcd and MinIO) - for experimentation/development
  --fvt                          Install and configure required supporting datastores in the same namespace (etcd and MinIO) - for development with fvt enabled
  -dev, --dev-mode-logging       Enable dev mode logging (stacktraces on warning and no sampling)
  --namespace-scope-mode         Run ModelMesh Serving in namespace scope mode

Installs ModelMesh Serving CRDs, controller, and built-in runtimes into specified
Kubernetes namespaces.

Expects cluster-admin authority and Kube cluster access to be configured prior to running.
Also requires Etcd secret 'model-serving-etcd' to be created in namespace already.

You can optionally provide a local --install-config-path that points to a local ModelMesh Serving tar file or directory containing ModelMesh Serving configs to deploy. If not specified, the config directory from the root of the project will be used.

You can also optionally use --delete to delete any existing instances of ModelMesh Serving in the designated Kube namespace before running the install.

The installation will create a secret named storage-config if it does not already exist. If the --quickstart option was chosen, this will be populated with the connection details for the example models bucket in IBM Cloud Object Storage and the local MinIO; otherwise, it will be empty and ready for you to add your own entries.

Setup additional namespaces

To enable additional namespaces for ModelMesh after the initial installation, you need to add a label named modelmesh-enabled to the user namespaces, and optionally set up the storage secret storage-config and built-in runtimes in them.

The following command will add the label to “your_namespace”.

kubectl label namespace your_namespace modelmesh-enabled="true" --overwrite

You can also run a script to set up multiple user namespaces. See the setup help below for detail:

./scripts/setup_user_namespaces.sh --help
Run this script to enable user namespaces for ModelMesh Serving, and optionally add the storage secret
for example models and built-in serving runtimes to the target namespaces.

usage: ./scripts/setup_user_namespaces.sh [flags]
  Flags:
    -u, --user-namespaces         (required) Kubernetes user namespaces to enable for ModelMesh
    -c, --controller-namespace    Kubernetes ModelMesh controller namespace, default is modelmesh-serving
    --create-storage-secret       Create storage secret for example models
    --deploy-serving-runtimes     Deploy built-in serving runtimes
    --dev-mode                    Run in development mode meaning the configs are local, not release based
    -h, --help                    Display this help

The following command will set up two namespaces with the required label, optional storage secret, and built-in runtimes, so you can deploy sample predictors into any of them immediately.

./scripts/setup_user_namespaces.sh -u "ns1 ns2" --create-storage-secret --deploy-serving-runtimes

Delete the installation

To wipe out the ModelMesh Serving CRDs, controller, and built-in runtimes from the specified Kubernetes namespaces:

./scripts/delete.sh --namespace modelmesh-serving

(Optional) Delete the specified namespace modelmesh-serving

kubectl delete ns modelmesh-serving
