Configuring the Model
APIs for setting inference-time and load-time parameters for your model
You can customize both inference-time and load-time parameters for your model. Inference parameters can be set on a per-request basis, while load parameters are set when loading the model.
Inference Parameters
Set inference-time parameters such as temperature, maxTokens, topP and more.
```typescript
const prediction = model.respond(chat, {
  temperature: 0.6,
  maxTokens: 50,
});
```

See LLMPredictionConfigInput for all configurable fields.
Another useful inference-time configuration parameter is structured, which allows you to rigorously enforce the structure of the output using a JSON or zod schema.
Load Parameters
Set load-time parameters such as the context length, GPU offload ratio, and more.
Set Load Parameters with .model()
The .model() method retrieves a handle to a model that has already been loaded, or loads a new one on demand (JIT loading).
Note: if the model is already loaded, the configuration is ignored; it only takes effect when .model() has to load the model itself.
```typescript
const model = await client.llm.model("qwen2.5-7b-instruct", {
  config: {
    contextLength: 8192,
    gpu: {
      ratio: 0.5,
    },
  },
});
```

See LLMLoadModelConfig for all configurable fields.
Set Load Parameters with .load()
The .load() method creates a new model instance and loads it with the specified configuration.
```typescript
const model = await client.llm.load("qwen2.5-7b-instruct", {
  config: {
    contextLength: 8192,
    gpu: {
      ratio: 0.5,
    },
  },
});
```

See LLMLoadModelConfig for all configurable fields.