Configuring the Model
APIs for setting inference-time and load-time parameters for your model
You can customize both inference-time and load-time parameters for your model. Inference parameters can be set on a per-request basis, while load parameters are set when loading the model.
Inference Parameters
Set inference-time parameters such as temperature, maxTokens, topP, and more.
result = model.respond(chat, config={
    "temperature": 0.6,
    "maxTokens": 50,
})
See LLMPredictionConfigInput in the TypeScript SDK documentation for all configurable fields.
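Because inference parameters are passed with each request, the same model handle can serve requests with different settings. The following is a minimal sketch of that idea, not part of the SDK: BASE_CONFIG, with_overrides, and ask_twice are hypothetical names, and model is assumed to be an LLM handle such as the one returned by lms.llm().

```python
# Hypothetical shared defaults; the key names follow the example above.
BASE_CONFIG = {"temperature": 0.6, "maxTokens": 50}


def with_overrides(base, **overrides):
    """Return a copy of `base` with per-request overrides applied."""
    config = dict(base)  # copy, so the shared defaults are never mutated
    config.update(overrides)
    return config


def ask_twice(model, chat):
    """Send the same chat twice with different inference settings.

    Assumes `model` exposes respond(chat, config=...) as shown above.
    """
    precise = model.respond(
        chat, config=with_overrides(BASE_CONFIG, temperature=0.0)
    )
    creative = model.respond(
        chat, config=with_overrides(BASE_CONFIG, temperature=0.9, maxTokens=200)
    )
    return precise, creative
```

Copying the defaults before applying overrides keeps one request's settings from leaking into the next.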
Note that structured can be set to a JSON schema definition as an inference-time configuration parameter
(Zod schemas are not supported in the Python SDK). However, the preferred approach is to set the
dedicated response_format parameter, which enforces the structure of the output more rigorously
using a JSON or class-based schema definition.
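As a rough illustration of the response_format approach, the sketch below passes a JSON schema with the request. BOOK_SCHEMA and describe_book are hypothetical names chosen for this example, and the code assumes that the prediction result exposes the decoded structure as result.parsed; check the SDK documentation for the exact result fields.

```python
# JSON schema describing the expected output (illustrative fields only).
BOOK_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "author", "year"],
}


def describe_book(model, prompt):
    """Request a reply constrained to BOOK_SCHEMA.

    Assumes `model` is an LLM handle (for example from lms.llm()) and that
    the result exposes the decoded structure as `result.parsed`.
    """
    result = model.respond(prompt, response_format=BOOK_SCHEMA)
    return result.parsed
```

A call such as describe_book(model, "Tell me about The Hobbit") would then be expected to return a dictionary with the three required keys.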
Load Parameters
Set load-time parameters such as the context length, GPU offload ratio, and more.
Set Load Parameters with .model()
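The .model() method retrieves a handle to a model that has already been loaded, or loads a new one on demand (JIT loading). The example below uses the lms.llm() convenience function, which does the same through the default client.
Note: if the model is already loaded, the given configuration is ignored.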
import lmstudio as lms
model = lms.llm("qwen2.5-7b-instruct", config={
    "contextLength": 8192,
    "gpu": {
        "ratio": 0.5,
    }
})
See LLMLoadModelConfig in the TypeScript SDK documentation for all configurable fields.
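Since a load configuration is a plain dictionary, it can be assembled and sanity-checked before the model is requested. The helper below is a hypothetical sketch, not an SDK function. It assumes that a numeric GPU offload ratio is a fraction between 0 and 1; confirm this in the LLMLoadModelConfig reference.

```python
def build_load_config(context_length, gpu_ratio):
    """Build a load-time config dict shaped like the example above."""
    if context_length <= 0:
        raise ValueError("context_length must be a positive number of tokens")
    # Assumption: a numeric offload ratio is a fraction of the model's layers.
    if not 0.0 <= gpu_ratio <= 1.0:
        raise ValueError("gpu_ratio must be between 0 and 1")
    return {
        "contextLength": context_length,
        "gpu": {"ratio": gpu_ratio},
    }


# Intended usage (requires a running LM Studio instance):
#   import lmstudio as lms
#   model = lms.llm("qwen2.5-7b-instruct", config=build_load_config(8192, 0.5))
```

Validating up front gives a clear error before any model loading starts. Keep in mind that the resulting configuration is still ignored if the model is already loaded.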
Set Load Parameters with .load_new_instance()
The .load_new_instance() method creates a new model instance and loads it with the specified configuration.
import lmstudio as lms
client = lms.get_default_client()
model = client.llm.load_new_instance("qwen2.5-7b-instruct", config={
    "contextLength": 8192,
    "gpu": {
        "ratio": 0.5,
    }
})
See LLMLoadModelConfig in the TypeScript SDK documentation for all configurable fields.
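The practical difference between the two methods is whether the load configuration is guaranteed to take effect. The sketch below wraps that choice in a hypothetical helper, get_llm, which is not part of the SDK. It assumes client is an object like the one returned by lms.get_default_client(), with the .model() and .load_new_instance() methods described above.

```python
def get_llm(client, model_key, config, force_new_instance=False):
    """Return an LLM handle, optionally forcing a fresh instance.

    With force_new_instance=False, an already loaded model is reused and
    `config` is ignored in that case. With force_new_instance=True, a new
    instance is always loaded with `config`.
    """
    if force_new_instance:
        return client.llm.load_new_instance(model_key, config=config)
    return client.llm.model(model_key, config=config)
```

Use force_new_instance=True when the load parameters must apply, for example when testing a different context length, and accept the extra memory cost of a second loaded instance.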