model.yaml

Cortex uses a model.yaml file to specify the desired configuration for each model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the models directory.

Structure of model.yaml

Here is an example of the model.yaml format:


# BEGIN GENERAL METADATA
model: gemma-2-9b-it-Q8_0 ## Model ID used when constructing requests - should be unique between models (author / quantization)
name: Llama 3.1 ## metadata.general.name
version: 1 ## metadata.version
sources: ## can be universal protocol (models://) OR absolute local file path (file://) OR https remote URL (https://)
- models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf # for downloaded model from HF
- files://C:/Users/user/Downloads/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf # for imported model
# END GENERAL METADATA
# BEGIN INFERENCE PARAMETERS
## BEGIN REQUIRED
stop: ## tokenizer.ggml.eos_token_id
  - <|end_of_text|>
  - <|eot_id|>
  - <|eom_id|>
## END REQUIRED
## BEGIN OPTIONAL
stream: true # Default true?
top_p: 0.9 # Ranges: 0 to 1
temperature: 0.6 # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0 # Ranges: 0 to 1
max_tokens: 8192 # Should default to the context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
n_parallels: 1
min_keep: 0
## END OPTIONAL
# END INFERENCE PARAMETERS
# BEGIN MODEL LOAD PARAMETERS
## BEGIN REQUIRED
prompt_template: |+ # tokenizer.chat_template
  <|begin_of_text|><|start_header_id|>system<|end_header_id|>
  {system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
  {prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
## END REQUIRED
## BEGIN OPTIONAL
ctx_len: 0 # llama.context_length | 0 or undefined = loaded from model
ngl: 33 # Undefined = loaded from model
engine: llama-cpp
## END OPTIONAL
# END MODEL LOAD PARAMETERS

The model.yaml file is composed of three high-level sections: model metadata, inference parameters, and model load parameters. Each section contains a set of parameters that define the model's behavior and configuration; some parameters are optional.

The model.yaml contains sensible defaults for each parameter, but there are instances where you may need to override these default values to get your model to work as intended. For example, if you train or fine-tune a highly bespoke model with a custom template and less common parameters, you can specify these in the model.yaml file.

Model Metadata


model: gemma-2-9b-it-Q8_0
name: Llama 3.1
version: 1
sources:
- models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
- files://C:/Users/user/Downloads/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf

A Cortex model consists of essential metadata that identifies it within the server and its local files. The required parameters include:

| Parameter | Description | Required |
|-----------|-------------|----------|
| name | The identifier name of the model, used as the model_id. | Yes |
| model | Details specifying the variant of the model, including size or quantization. | Yes |
| version | The version number of the model. | Yes |
| sources | The source file of the model. | Yes |

Inference Parameters


stop:
  - <|end_of_text|>
  - <|eot_id|>
  - <|eom_id|>
stream: true # Default true?
top_p: 0.9 # Ranges: 0 to 1
temperature: 0.6 # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0 # Ranges: 0 to 1
max_tokens: 8192 # Should default to the context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
n_parallels: 1
min_keep: 0

Inference parameters affect the model's predictions. While not all parameters are required, any of the following can be used to tweak the model's output.

| Parameter | Description | Required |
|-----------|-------------|----------|
| stream | Enables or disables streaming mode for the output (true or false). | No |
| top_p | The cumulative probability threshold for token sampling. Ranges from 0 to 1. | No |
| temperature | Controls the randomness of predictions by scaling logits before applying softmax. Ranges from 0 to 1. | No |
| frequency_penalty | Penalizes new tokens based on their existing frequency in the sequence so far. Ranges from 0 to 1. | No |
| presence_penalty | Penalizes new tokens based on whether they appear in the sequence so far. Ranges from 0 to 1. | No |
| max_tokens | Maximum number of tokens in the output for one turn. | No |
| seed | Seed for the random number generator. -1 means no seed. | No |
| dynatemp_range | Dynamic temperature range. | No |
| dynatemp_exponent | Dynamic temperature exponent. | No |
| top_k | The number of most likely tokens to consider at each step. | No |
| min_p | Minimum probability threshold for token sampling. | No |
| tfs_z | The z parameter used for tail-free sampling. | No |
| typ_p | The cumulative probability threshold used for typical sampling. | No |
| repeat_last_n | Number of previous tokens to penalize for repeating. | No |
| repeat_penalty | Penalty for repeating tokens. | No |
| mirostat | Enables or disables Mirostat sampling (true or false). | No |
| mirostat_tau | Target entropy value for Mirostat sampling. | No |
| mirostat_eta | Learning rate for Mirostat sampling. | No |
| penalize_nl | Penalizes newline tokens (true or false). | No |
| ignore_eos | Ignores the end-of-sequence token (true or false). | No |
| n_probs | Number of probabilities to return. | No |
| min_keep | Minimum number of tokens to keep. | No |
| n_parallels | Number of parallel streams to use, allowing multiple concurrent chat sessions. Note that ctx_len must be scaled in proportion to n_parallels (e.g. n_parallels=1, ctx_len=2048 -> n_parallels=2, ctx_len=4096). | No |
| stop | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |

Model Load Parameters

The model load parameters control how Cortex runs the model and can be crucial for the model's performance.


prompt_template: |+
  <|begin_of_text|><|start_header_id|>system<|end_header_id|>
  {system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
  {prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
ctx_len: 4096
ngl: 33
engine: llama-cpp

Not all parameters are required.

| Parameter | Description | Required |
|-----------|-------------|----------|
| ngl | Number of model layers to offload to the GPU. | No |
| ctx_len | Context length (maximum number of tokens). | No |
| prompt_template | Template for formatting the prompt, including system messages and instructions. | Yes |
| engine | The engine that runs the model; defaults to llama-cpp for local models in GGUF format. | Yes |

These parameters also act as defaults when starting a model through Cortex's model start API. If you change them, in particular prompt_template and engine (the others can be changed at runtime), make sure to reload the model.
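As a sketch of what that reload can look like over HTTP, the snippet below stops and restarts a model so that an edited model.yaml is picked up again. The host, port, and endpoint paths shown here are assumptions; check the API reference for the exact routes your Cortex version exposes.

import requests

BASE = "http://127.0.0.1:39281"  # assumed default Cortex server address
MODEL = "gemma-2-9b-it-Q8_0"     # the `model` field from model.yaml

# Stop the running instance so changes to prompt_template or engine take effect.
requests.post(f"{BASE}/v1/models/stop", json={"model": MODEL}).raise_for_status()

# Start it again; Cortex re-reads model.yaml for the load parameters.
requests.post(f"{BASE}/v1/models/start", json={"model": MODEL}).raise_for_status()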

Runtime parameters

In addition to the predefined parameters in model.yaml, Cortex supports runtime parameters that override these settings when using the model start API.

Model start params

Cortex supports the following parameters when starting a model via the model start API for the llama-cpp engine:


cache_enabled: bool
ngl: int
n_parallel: int
cache_type: string
ctx_len: int
## Support for vision models
mmproj: string
llama_model_path: string
model_path: string

| Parameter | Description | Required |
|-----------|-------------|----------|
| cache_type | Data type of the KV cache in llama.cpp models. Supported types are f16, q8_0, and q4_0; the default is f16. | No |
| cache_enabled | Enables caching of conversation history for reuse in subsequent requests. Default is false. | No |
| mmproj | Path to the mmproj GGUF model, used to support vision models such as LLaVA. | No |
| llama_model_path | Path to the LLM GGUF model. | No |

These parameters override the corresponding model.yaml values when starting a model through the API.
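For illustration, here is a minimal sketch of starting a model with runtime overrides for the llama-cpp engine. The server address and endpoint path are assumptions; any value not listed in the request body falls back to what model.yaml specifies.

import requests

resp = requests.post(
    "http://127.0.0.1:39281/v1/models/start",  # assumed default address and route
    json={
        "model": "gemma-2-9b-it-Q8_0",
        "ngl": 20,             # offload fewer layers to the GPU
        "ctx_len": 4096,       # smaller context window
        "cache_type": "q8_0",  # quantized KV cache instead of the default f16
        "cache_enabled": True, # reuse conversation history across requests
    },
)
resp.raise_for_status()
print(resp.json())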

Chat completion API parameters

The API is accessible at the /v1/chat/completions endpoint and accepts all parameters from the chat completion API, as described in the API reference.

With the llama-cpp engine, Cortex accepts all parameters from the [inference parameters](#inference-parameters) section of model.yaml as well as those of the chat completion API.
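As an example, the request below overrides a few inference parameters for a single call. The local address is an assumption; any OpenAI-compatible client pointed at the Cortex server works the same way.

import requests

resp = requests.post(
    "http://127.0.0.1:39281/v1/chat/completions",  # assumed default server address
    json={
        "model": "gemma-2-9b-it-Q8_0",
        "messages": [{"role": "user", "content": "Summarize model.yaml in one sentence."}],
        # These mirror the inference section of model.yaml and apply only to this request.
        "temperature": 0.2,
        "top_p": 0.9,
        "max_tokens": 256,
        "stop": ["<|end_of_text|>", "<|eot_id|>"],
        "stream": False,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])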