Using custom system prompts with LMEval

A technical guide on configuring custom system prompts in LMEvalJob with unitxt, including detailed explanations of the Custom Resource features, enhanced task recipes, and a practical example using the google/flan-t5-base model.
Published: February 22, 2025

To run this example:

  1. Install the TrustyAI operator.

  2. Create a test namespace.

  3. Apply the LMEvalJob Custom Resource (shown in full at the end of this post) to that namespace, as sketched in the commands below.
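A minimal sketch of steps 2 and 3, assuming the CR below has been saved as lmevaljob.yaml (a placeholder file name). Note that the CR's metadata targets a namespace named tas, so the test namespace is created under that name here:

# create the namespace referenced in the CR's metadata
kubectl create namespace tas

# apply the LMEvalJob CR to that namespace
kubectl apply -f lmevaljob.yaml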

Explanation of the Custom Resource (CR)

The Custom Resource (CR) introduces enhancements to the LMEvalJob configuration designed to work seamlessly with unitxt, a library for preparing and evaluating text generation tasks. The CR focuses on custom system prompts and task recipes to keep the evaluation process both efficient and effective.

Key Features of the CR:

  1. Custom System Prompts: The CR supports custom system prompts tailored to specific evaluation scenarios. This lets users set the context and expectations for the model's responses, improving the relevance and accuracy of the evaluation.

  2. Enhanced Task Recipes: Task recipes now carry more detailed instructions and formats that align with unitxt's requirements, so input and output formats stay compatible and data handling runs smoothly.

  3. Improved Evaluation Metrics: The CR introduces metrics for evaluating model performance on the outputs generated in response to the custom prompts, designed to give deeper insight into the model's capabilities and areas for improvement.

  4. Integration with unitxt: The changes in the CR are aimed specifically at unitxt compatibility, letting users leverage its features for more comprehensive evaluations. This integration streamlines the workflow, making it easier to assess model performance in real-world applications.

By implementing this CR, users get a more robust evaluation framework that meets the needs of their specific use case while improving the overall effectiveness of LMEvalJob in generating and assessing text outputs. The complete CR looks like this:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: custom-card-template
  namespace: tas
spec:
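  # allow the job to go online and to execute dataset-provided code,
  # which it needs in order to fetch and run the unitxt card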
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
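    # a task recipe pairing the wnli unitxt card with the custom
    # template (tp_0) and system prompt (sp_0) defined under 'custom' below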
    taskRecipes:
    - template:
        ref: tp_0
      systemPrompt:
        ref: sp_0
      card:
        name: "cards.wnli"
    custom:
      templates:
      - name: tp_0
        value: |
          {
              "__type__": "input_output_template",
              "input_format": "{text_a_type}: {text_a}\n{text_b_type}: {text_b}",
              "output_format": "{label}",
              "target_prefix": "The {type_of_relation} class is ",
              "instruction": "Given a {text_a_type} and {text_b_type} classify the {type_of_relation} of the {text_b_type} to one of {classes}.",
              "postprocessors": [
                  "processors.take_first_non_empty_line",
                  "processors.lower_case_till_punc"
              ]
          }
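      # the custom system prompt referenced as sp_0 in the task recipe above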
      systemPrompts:
      - name: sp_0
        value: "Be concise. At every point give the shortest acceptable answer."
  logSamples: true
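Once applied, the job's progress and results can be read back from the CR's status. A quick sketch, assuming kubectl access to the tas namespace; with logSamples set to true, the results also include the individual prompt and response samples:

# check the job's state (for example: Running, Complete)
kubectl get lmevaljob custom-card-template -n tas -o jsonpath='{.status.state}'

# once complete, retrieve the evaluation results as JSON
kubectl get lmevaljob custom-card-template -n tas -o jsonpath='{.status.results}'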