# model.yaml

Cortex uses a `model.yaml` file to specify the desired configuration for each model. Models can be downloaded from the Cortex Model Hub or from Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` directory.
## Structure of model.yaml

Here is an example of the `model.yaml` format:
```yaml
# BEGIN GENERAL METADATA
model: gemma-2-9b-it-Q8_0 ## Model ID used when constructing requests - should be unique between models (author / quantization)
name: Llama 3.1           ## metadata.general.name
version: 1                ## metadata.version
sources: ## can be universal protocol (models://) OR absolute local file path (file://) OR https remote URL (https://)
  - models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf # for a model downloaded from HF
  - files://C:/Users/user/Downloads/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf # for an imported model
# END GENERAL METADATA

# BEGIN INFERENCE PARAMETERS
## BEGIN REQUIRED
stop: ## tokenizer.ggml.eos_token_id
  - <|end_of_text|>
  - <|eot_id|>
  - <|eom_id|>
## END REQUIRED

## BEGIN OPTIONAL
stream: true          # Default: true
top_p: 0.9            # Range: 0 to 1
temperature: 0.6      # Range: 0 to 1
frequency_penalty: 0  # Range: 0 to 1
presence_penalty: 0   # Range: 0 to 1
max_tokens: 8192      # Should default to the context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
n_parallels: 1
min_keep: 0
## END OPTIONAL
# END INFERENCE PARAMETERS

# BEGIN MODEL LOAD PARAMETERS
## BEGIN REQUIRED
prompt_template: |+ # tokenizer.chat_template
  <|begin_of_text|><|start_header_id|>system<|end_header_id|>

  {system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

  {prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
## END REQUIRED

## BEGIN OPTIONAL
ctx_len: 0 # llama.context_length | 0 or undefined = loaded from model
ngl: 33    # Undefined = loaded from model
engine: llama-cpp
## END OPTIONAL
# END MODEL LOAD PARAMETERS
```
The `model.yaml` file is composed of three high-level sections: model metadata, inference parameters, and model load parameters. Each section contains a set of parameters that define the model's behavior and configuration; some parameters are optional.
The `model.yaml` file contains sensible defaults for each parameter, but there are instances where you may need to override these default values to get your model to work as intended. For example, if you train or fine-tune a highly bespoke model with a custom template and less common parameters, you can specify these in the `model.yaml` file.
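A minimal sketch of such an override might look like the following; the model ID, file path, template tokens, and parameter values here are hypothetical and should be replaced with the ones your fine-tune actually uses:

```yaml
# Hypothetical model.yaml for a bespoke fine-tune
model: my-custom-llama-Q4_K_M   # hypothetical model ID
name: My Custom Llama
version: 1
sources:
  - files://C:/Users/user/models/my-custom-llama-Q4_K_M.gguf
prompt_template: |+             # the custom template the fine-tune was trained on
  <|system|>{system_message}<|end|>
  <|user|>{prompt}<|end|>
  <|assistant|>
stop:
  - <|end|>
temperature: 0.3                # an uncommon default this hypothetical model needs
ctx_len: 4096
engine: llama-cpp
```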
## Model Metadata
```yaml
model: gemma-2-9b-it-Q8_0
name: Llama 3.1
version: 1
sources:
  - models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
  - files://C:/Users/user/Downloads/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
```
A Cortex model consists of essential metadata that identifies it within the server and on the local filesystem. The required parameters include:
| Parameter | Description | Required |
|---|---|---|
| `name` | The identifier name of the model, used as the `model_id`. | Yes |
| `model` | Details specifying the variant of the model, including size or quantization. | Yes |
| `version` | The version number of the model. | Yes |
| `sources` | The source file(s) of the model. | Yes |
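As the comments in the full example above indicate, a `sources` entry can use the `models://` protocol, an absolute local file path, or an HTTPS remote URL. A short sketch mixing the protocols; the HTTPS URL below is hypothetical:

```yaml
sources:
  # models:// protocol for a model hosted on Hugging Face
  - models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
  # hypothetical direct HTTPS URL to a GGUF file
  - https://example.com/models/Mixtral-8x22B-v0.1-IQ3_M.gguf
```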
## Inference Parameters
```yaml
stop:
  - <|end_of_text|>
  - <|eot_id|>
  - <|eom_id|>
stream: true
top_p: 0.9
temperature: 0.6
frequency_penalty: 0
presence_penalty: 0
max_tokens: 8192
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
n_parallels: 1
min_keep: 0
```
Inference parameters affect the results of the model's predictions. While not all of them are required, any of the following can be used to tweak the model's output.
| Parameter | Description | Required |
|---|---|---|
| `stream` | Enables or disables streaming mode for the output (`true` or `false`). | No |
| `top_p` | The cumulative probability threshold for token sampling. Ranges from 0 to 1. | No |
| `temperature` | Controls the randomness of predictions by scaling logits before applying softmax. Ranges from 0 to 1. | No |
| `frequency_penalty` | Penalizes new tokens based on their existing frequency in the sequence so far. Ranges from 0 to 1. | No |
| `presence_penalty` | Penalizes new tokens based on whether they appear in the sequence so far. Ranges from 0 to 1. | No |
| `max_tokens` | Maximum number of tokens in the output for one turn. | No |
| `seed` | Seed for the random number generator. `-1` means no seed. | No |
| `dynatemp_range` | Dynamic temperature range. | No |
| `dynatemp_exponent` | Dynamic temperature exponent. | No |
| `top_k` | The number of most likely tokens to consider at each step. | No |
| `min_p` | Minimum probability threshold for token sampling. | No |
| `tfs_z` | The z parameter used for tail-free sampling. | No |
| `typ_p` | The cumulative probability threshold used for typical sampling. | No |
| `repeat_last_n` | Number of previous tokens to consider when penalizing repetition. | No |
| `repeat_penalty` | Penalty for repeating tokens. | No |
| `mirostat` | Enables or disables Mirostat sampling (`true` or `false`). | No |
| `mirostat_tau` | Target entropy value for Mirostat sampling. | No |
| `mirostat_eta` | Learning rate for Mirostat sampling. | No |
| `penalize_nl` | Penalizes newline tokens (`true` or `false`). | No |
| `ignore_eos` | Ignores the end-of-sequence token (`true` or `false`). | No |
| `n_probs` | Number of token probabilities to return. | No |
| `min_keep` | Minimum number of tokens to keep. | No |
| `n_parallels` | Number of parallel sequences to decode, which allows multiple chat sessions at the same time. Note that `ctx_len` must be scaled along with `n_parallels` (e.g., going from `n_parallels: 1, ctx_len: 2048` to `n_parallels: 2` requires `ctx_len: 4096`); see the sketch after this table. | No |
| `stop` | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |
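To make the `n_parallels`/`ctx_len` relationship concrete, here is a minimal sketch following the table's example of 2048 tokens of context per sequence; the values are illustrative, assuming the total context is divided evenly across the parallel sequences:

```yaml
# Serve up to 4 chat sessions concurrently.
# ctx_len is the total context, shared across parallel sequences:
# 4 sequences x 2048 tokens each = 8192.
n_parallels: 4
ctx_len: 8192
```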
## Model Load Parameters
The model load parameters give you options that control how Cortex runs the model and can be crucial for the model's performance.
```yaml
prompt_template: |+
  <|begin_of_text|><|start_header_id|>system<|end_header_id|>

  {system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

  {prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
ctx_len: 4096
ngl: 33
engine: llama-cpp
```
Not all parameters are required.
| Parameter | Description | Required |
|---|---|---|
| `ngl` | Number of model layers to offload to the GPU. | No |
| `ctx_len` | Context length (maximum number of tokens). | No |
| `prompt_template` | Template for formatting the prompt, including system messages and instructions. | Yes |
| `engine` | The engine that runs the model; defaults to `llama-cpp` for local models in GGUF format. | Yes |
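For example, a model fine-tuned on the ChatML format would need a different `prompt_template` than the Llama 3 template shown above. A sketch, assuming standard ChatML tokens; always verify the exact tokens against your model's `tokenizer.chat_template`:

```yaml
prompt_template: |+
  <|im_start|>system
  {system_message}<|im_end|>
  <|im_start|>user
  {prompt}<|im_end|>
  <|im_start|>assistant
```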
These parameters also act as defaults when starting the model through Cortex's model start API. If you change any of them, in particular `prompt_template` and `engine` (the others can be changed at runtime), make sure to reload the model.
## Runtime parameters

In addition to the parameters predefined in `model.yaml`, Cortex supports runtime parameters that override these settings when starting a model through the model start API.
### Model start params

Cortex supports the following parameters when starting a model via the model start API with the `llama-cpp` engine:
```yaml
cache_enabled: bool
ngl: int
n_parallel: int
cache_type: string
ctx_len: int
## Support for vision models
mmproj: string
llama_model_path: string
model_path: string
```
| Parameter | Description | Required |
|---|---|---|
| `cache_type` | Data type of the KV cache in llama.cpp models. Supported types are `f16`, `q8_0`, and `q4_0`; the default is `f16`. | No |
| `cache_enabled` | Enables caching of conversation history for reuse in subsequent requests. Default is `false`. | No |
| `mmproj` | Path to the mmproj GGUF model, used to support LLaVA vision models. | No |
| `llama_model_path` | Path to the LLM GGUF model file. | No |
These parameters override the `model.yaml` parameters when a model is started through the API.
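As an illustration, the overrides sent when starting a LLaVA-style vision model might look like the following; the file paths and values here are hypothetical:

```yaml
cache_enabled: true
cache_type: q8_0   # quantized KV cache to reduce memory use
ngl: 33
n_parallel: 1
ctx_len: 4096
# Vision model support (hypothetical paths)
mmproj: /models/llava/mmproj-model-f16.gguf
llama_model_path: /models/llava/llava-v1.6-vicuna-7b.Q4_K_M.gguf
```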
### Chat completion API parameters

The API is accessible at the `/v1/chat/completions` endpoint and accepts all parameters of the chat completion API as described in the API reference.

With the `llama-cpp` engine, Cortex accepts all parameters from the [Inference Parameters](#inference-parameters) section of `model.yaml` as well as the parameters of the chat completion API.