How to use Trial Templates

Trial template parameters overview and how to use CRDs with Katib Trials

This guide describes how to configure Trial template parameters and use a custom Kubernetes CRD in Katib Trials. You will learn how to change the Trial template specification, how to use Kubernetes ConfigMaps to store templates, and how to modify the Katib controller to support your Kubernetes CRD in Katib Experiments.

Katib dynamically supports any kind of Kubernetes CRD as a Trial’s Worker. The Katib examples include Experiments for several kinds of Trial Workers.

To use your own Kubernetes resource, follow the steps below.

How to use Trial Template

To run a Katib Experiment, you have to specify a Trial template for the Worker job where the actual model training runs.

Configure Trial Template Specification

The Trial template specification is located under .spec.trialTemplate of your Experiment. To define a Trial, specify these parameters in .spec.trialTemplate:

  • trialParameters - list of the parameters which are used in the Trial template during Experiment execution.

    Note: Your Trial template must contain each parameter from the trialParameters. You can set these parameters in any field of your template, except .metadata.name and .metadata.namespace. For example, your training container can receive hyperparameters as command-line arguments or as environment variables.

    Your Experiment’s Suggestion produces trialParameters before running the Trial. Each trialParameter has this structure:

    • name - the parameter name that is replaced in your template.

    • description (optional) - the description of the parameter.

    • reference - the parameter name that the Experiment’s Suggestion returns. For hyperparameter tuning, the references usually match the Experiment search space. For example, in the grid example the search space has two parameters (lr, momentum) and trialParameters contains each of them in reference.

  • Define your Trial template in one of two sources: trialSpec or configMap.

    Note: Your template must omit .metadata.name and .metadata.namespace.

    To set the parameters from the trialParameters, you need to use this expression: ${trialParameters.<parameter-name>} in your template. Katib automatically replaces it with the appropriate values from the Suggestion.

    For example, --lr=${trialParameters.learningRate} passes the value of the learningRate parameter to the --lr argument.

    • trialSpec - the Trial template in unstructured format. The template must be valid YAML.

    • configMap - Kubernetes ConfigMap specification where the Trial template is located. This ConfigMap must have the label katib.kubeflow.org/component: trial-templates and contain key-value pairs, where the key is <template-name> and the value is <template-yaml>. Check the example of the ConfigMap with Trial templates.

      The configMap specification should have:

      1. configMapName - the ConfigMap name with the Trial templates.

      2. configMapNamespace - the ConfigMap namespace with the Trial templates.

      3. templatePath - the ConfigMap’s data path to the template.
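As a rough illustration of how these pieces fit together, the fragment below sketches a .spec.trialTemplate that uses the inline trialSpec source with a Kubernetes Job. The container image, the script path, and the search-space parameter name lr are placeholders assumed for this sketch, not values taken from a Katib example. The rest of the Experiment (objective, algorithm, search space) is omitted.

```yaml
# Fragment of an Experiment: only .spec.trialTemplate is shown.
spec:
  trialTemplate:
    # Must match the container name used in the template below.
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate            # name referenced inside the template
        description: Learning rate for the training model
        reference: lr                 # assumed search-space parameter name
    trialSpec:
      # The template omits metadata.name and metadata.namespace.
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: example.com/my-training-image:latest   # placeholder image
                command:
                  - "python3"
                  - "/opt/train.py"                            # placeholder script
                  - "--lr=${trialParameters.learningRate}"
            restartPolicy: Never
```

With a template like this, Katib would replace ${trialParameters.learningRate} with the lr value that the Suggestion proposes for each Trial.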

The .spec.trialTemplate parameters below control Trial behavior. If a parameter has a default value, it can be omitted from the Experiment YAML.

  • retain - indicates that the Trial’s resources are not cleaned up after the Trial is complete. Check the example with retain: true parameter.

    The default value is false.

  • primaryPodLabels - the labels of the Trial Worker’s Pod or Pods. The Katib metrics collector is injected into these Pods.

    Note: If primaryPodLabels is omitted, the Katib metrics collector wraps all of the Worker’s Pods. Check the example with primaryPodLabels.

    The default value for Kubeflow TFJob, PyTorchJob, MXJob, and XGBoostJob is job-role: master

    The primaryPodLabels default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source you have to manually set primaryPodLabels.

  • primaryContainerName - the name of the training container where the actual model training runs. The Katib metrics collector wraps this container to collect the required metrics for a single Experiment optimization step.

  • successCondition - the Trial Worker’s object status in which the Trial’s job has succeeded. This condition must be in GJSON format. Check the example with successCondition.

    The default value for Kubernetes Job is:

    status.conditions.#(type=="Complete")#|#(status=="True")#
    

    The default value for Kubeflow TFJob, PyTorchJob, MXJob, and XGBoostJob is:

    status.conditions.#(type=="Succeeded")#|#(status=="True")#
    

    The successCondition default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source you have to manually set successCondition.

  • failureCondition - the Trial Worker’s object status in which the Trial’s job has failed. This condition must be in GJSON format. Check the example with failureCondition.

    The default value for Kubernetes Job and Kubeflow TFJob, PyTorchJob, MXJob, and XGBoostJob is:

    status.conditions.#(type=="Failed")#|#(status=="True")#
    

    The failureCondition default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source you have to manually set failureCondition.
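The sketch below shows one way the configMap source and the control parameters above might be combined. Because the defaults do not apply to the configMap source, the conditions are set explicitly, using the Kubernetes Job values quoted above. The ConfigMap name, namespace, template key, image, script path, and the my-role label are all hypothetical names chosen for this illustration.

```yaml
# ConfigMap that stores one Trial template under the key "job-template.yaml".
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-trial-templates          # hypothetical name
  namespace: kubeflow               # assumed namespace
  labels:
    katib.kubeflow.org/component: trial-templates
data:
  job-template.yaml: |-
    apiVersion: batch/v1
    kind: Job
    spec:
      template:
        metadata:
          labels:
            my-role: primary        # hypothetical label, matched by primaryPodLabels
        spec:
          containers:
            - name: training-container
              image: example.com/my-training-image:latest
              command:
                - "python3"
                - "/opt/train.py"
                - "--lr=${trialParameters.learningRate}"
          restartPolicy: Never
---
# Fragment of an Experiment that reads its template from the ConfigMap above.
spec:
  trialTemplate:
    retain: true                    # keep the Trial's resources after completion
    primaryPodLabels:
      my-role: primary
    primaryContainerName: training-container
    successCondition: 'status.conditions.#(type=="Complete")#|#(status=="True")#'
    failureCondition: 'status.conditions.#(type=="Failed")#|#(status=="True")#'
    trialParameters:
      - name: learningRate
        reference: lr               # assumed search-space parameter name
    configMap:
      configMapName: my-trial-templates
      configMapNamespace: kubeflow
      templatePath: job-template.yaml
```

The conditions are wrapped in single quotes so that the double quotes inside the GJSON expression need no escaping in YAML.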

Use Metadata in Trial Template

You can’t specify .metadata.name and .metadata.namespace in your Trial template, but you can get this data during the Experiment run. For example, you might want to append the Trial’s name to your model storage.

To do this, point .trialParameters[x].reference to the appropriate metadata parameter and use .trialParameters[x].name in your Trial template.

The table below shows the connection between .trialParameters[x].reference value and Trial metadata.

Reference                             Trial metadata
${trialSpec.Name}                     Trial name
${trialSpec.Namespace}                Trial namespace
${trialSpec.Kind}                     Kubernetes resource kind for the Trial's worker
${trialSpec.APIVersion}               Kubernetes resource APIVersion for the Trial's worker
${trialSpec.Labels[custom-key]}       Trial's worker label with custom-key key
${trialSpec.Annotations[custom-key]}  Trial's worker annotation with custom-key key

Check the example of using Trial metadata.
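A minimal sketch of this mapping is shown below, assuming a Kubernetes Job Worker. The parameter names trialName and trialNamespace, the image, the script path, the --model-dir flag, and the TRIAL_NAMESPACE variable are hypothetical and serve only to show where the metadata ends up.

```yaml
# Fragment of an Experiment: Trial metadata exposed through trialParameters.
spec:
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        reference: lr                          # assumed search-space parameter
      - name: trialName
        description: Name of the current Trial
        reference: "${trialSpec.Name}"         # resolved from Trial metadata
      - name: trialNamespace
        description: Namespace of the current Trial
        reference: "${trialSpec.Namespace}"
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: example.com/my-training-image:latest   # placeholder image
                command:
                  - "python3"
                  - "/opt/train.py"                            # placeholder script
                  - "--lr=${trialParameters.learningRate}"
                  # Hypothetical flag: one model directory per Trial.
                  - "--model-dir=/models/${trialParameters.trialName}"
                env:
                  - name: TRIAL_NAMESPACE
                    value: "${trialParameters.trialNamespace}"
            restartPolicy: Never
```

In the template, the metadata is always referenced through ${trialParameters.<parameter-name>}; the ${trialSpec.*} expressions appear only in the reference fields.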

Use CRDs with Trial Template

It is possible to use your own Kubernetes CRD or another Kubernetes resource (e.g. Kubernetes CronJob) as a Trial Worker without modifying the Katib controller source code or building a new image. As long as your CRD creates Kubernetes Pods, allows a sidecar container to be injected into these Pods, and reports succeeded and failed statuses, you can use it in Katib.

To do that, you need to modify the Katib components before installing Katib on your Kubernetes cluster. You therefore need to know your CRD’s API group and version and the CRD object’s kind. You also need to know which resources your custom object creates. Check the Kubernetes guide to learn more about CRDs.

Follow these two steps to integrate your custom CRD into Katib:

  1. Add a new rule to the Katib controller ClusterRole’s rules to give Katib access to all resources that are created by the Trial. To learn more about ClusterRole, check the Kubernetes guide.

    In the case of Tekton Pipelines, the Trial creates a Tekton PipelineRun, and the Tekton PipelineRun then creates a Tekton TaskRun. Therefore, the Katib controller ClusterRole should have access to pipelineruns and taskruns:

    - apiGroups:
        - tekton.dev
      resources:
        - pipelineruns
        - taskruns
      verbs:
        - "get"
        - "list"
        - "watch"
        - "create"
        - "delete"
    
  2. Add the new entity to the controller parameters in the Katib Config:

    trialResources:
     - <object-kind>.<object-API-version>.<object-API-group>
    

    For example, to support Tekton Pipelines:

    trialResources:
      - PipelineRun.v1beta1.tekton.dev
    

After these changes, deploy Katib as described in the installation guide and wait until the katib-controller Pod is created. You can check logs from the Katib controller to verify your resource integration:

$ kubectl logs $(kubectl get pods -n kubeflow -o name | grep katib-controller) -n kubeflow | grep '"CRD Kind":"PipelineRun"'

{"level":"info","ts":1628032648.6285546,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"tekton.dev","CRD Version":"v1beta1","CRD Kind":"PipelineRun"}

If you ran the above steps successfully, you should be able to use your custom object YAML in the Experiment’s Trial template source spec.
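For the Tekton case used above, a Trial template might look roughly like the sketch below. It relies on two Tekton conventions that should be verified against your Tekton version: Pods created for a pipeline task carry the label tekton.dev/pipelineTask, and step containers are named step-<step-name>. The image, script path, and parameter names are placeholders for this illustration.

```yaml
# Fragment of an Experiment that uses a Tekton PipelineRun as the Trial Worker.
spec:
  trialTemplate:
    # Assumed Tekton conventions: task label on the Pod, "step-" container prefix.
    primaryPodLabels:
      tekton.dev/pipelineTask: model-training
    primaryContainerName: step-model-training
    # A PipelineRun reports a single "Succeeded" condition whose status
    # is "True" on success and "False" on failure.
    successCondition: 'status.conditions.#(type=="Succeeded")#|#(status=="True")#'
    failureCondition: 'status.conditions.#(type=="Succeeded")#|#(status=="False")#'
    trialParameters:
      - name: learningRate
        reference: lr                    # assumed search-space parameter name
    trialSpec:
      apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      spec:
        params:
          - name: lr
            value: "${trialParameters.learningRate}"
        pipelineSpec:
          params:
            - name: lr
          tasks:
            - name: model-training
              params:
                - name: lr
                  value: "$(params.lr)"
              taskSpec:
                params:
                  - name: lr
                steps:
                  - name: model-training
                    image: example.com/my-training-image:latest   # placeholder image
                    command:
                      - "python3"
                      - "/opt/train.py"                            # placeholder script
                      - "--lr=$(params.lr)"
```

The conditions are set explicitly here because the defaults described earlier cover only Kubernetes Job and the listed Kubeflow job kinds.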

We appreciate your feedback on using various CRDs in Katib. It would be great if you could let us know about your Experiments. The developer guide is a good starting point to learn how to contribute to the project.


Last modified April 26, 2024: Fix links in other pages (7333160)