Llama.cpp
Cortex uses llama.cpp
as its default engine for GGUF models. The example model configuration shown
below illustrates how to configure a GGUF model (in this case DeepSeek-R1-Distill-Llama-8B) with both required and
optional parameters. The configuration covers general metadata, inference parameters, and model loading settings
that control everything from basic model identification to advanced generation behavior. When a model pulled from a
HuggingFace repository does not include a model.yaml file, Cortex can generate one automatically from the model's GGUF metadata.
```yaml
# BEGIN GENERAL GGUF METADATA
id: deepseek-r1-distill-llama-8b # Model ID unique between models (author / quantization)
model: deepseek-r1-distill-llama-8b:8b-gguf-q2-k # Model ID which is used for request construct - should be unique between models (author / quantization)
name: deepseek-r1-distill-llama-8b # metadata.general.name
version: 1
files: # Can be relative OR absolute local file path
  - models/cortex.so/deepseek-r1-distill-llama-8b/8b-gguf-q2-k/model.gguf
# END GENERAL GGUF METADATA

# BEGIN INFERENCE PARAMETERS
# BEGIN REQUIRED
stop: # tokenizer.ggml.eos_token_id
  - <|im_end|>
  - <|end▁of▁sentence|>
# END REQUIRED

# BEGIN OPTIONAL
size: 3179134413
stream: true # Default true?
top_p: 0.9 # Ranges: 0 to 1
temperature: 0.7 # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0 # Ranges: 0 to 1
max_tokens: 4096 # Should be default to context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
min_keep: 0
# END OPTIONAL
# END INFERENCE PARAMETERS

# BEGIN MODEL LOAD PARAMETERS
# BEGIN REQUIRED
engine: llama-cpp # engine to run model
prompt_template: <|start_of_text|>{system_message}<|User|>{prompt}<|Assistant|>
# END REQUIRED

# BEGIN OPTIONAL
ctx_len: 4096 # llama.context_length | 0 or undefined = loaded from model
n_parallel: 1
ngl: 34 # Undefined = loaded from model
# END OPTIONAL
# END MODEL LOAD PARAMETERS
```
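The `prompt_template` above is a plain string with `{system_message}` and `{prompt}` placeholders. The short Python sketch below only illustrates how that substitution produces the final prompt string passed to the model; it is not the engine's actual implementation, and the system message and user prompt are placeholder values:

```python
# Illustrative only: approximates how the placeholders in prompt_template
# are filled in before the text is tokenized by the engine.
template = "<|start_of_text|>{system_message}<|User|>{prompt}<|Assistant|>"

rendered = template.format(
    system_message="You are a helpful assistant.",   # example system message
    prompt="Explain what a GGUF file is.",            # example user prompt
)
print(rendered)
# <|start_of_text|>You are a helpful assistant.<|User|>Explain what a GGUF file is.<|Assistant|>
```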
Model Parameters
Parameter | Description | Required |
---|---|---|
id | Unique model identifier including author and quantization | Yes |
model | Model ID used for request construction | Yes |
name | General name metadata for the model | Yes |
version | Model version number | Yes |
files | Path(s) to the model GGUF file, relative or absolute | Yes |
stop | Array of stopping sequences for generation | Yes |
engine | Model execution engine (llama-cpp) | Yes |
prompt_template | Template for formatting the prompt with system message and user input | Yes |
size | Model file size in bytes | No |
stream | Enable streaming output (default: true) | No |
top_p | Nucleus sampling probability threshold (0-1) | No |
temperature | Output randomness control (0-1) | No |
frequency_penalty | Penalty for frequent token usage (0-1) | No |
presence_penalty | Penalty for token presence (0-1) | No |
max_tokens | Maximum number of tokens to generate (defaults to the context length) | No |
seed | Random seed for reproducibility | No |
dynatemp_range | Dynamic temperature range | No |
dynatemp_exponent | Dynamic temperature exponent | No |
top_k | Top-k sampling parameter | No |
min_p | Minimum probability threshold | No |
tfs_z | Tail-free sampling parameter | No |
typ_p | Typical sampling parameter | No |
repeat_last_n | Repetition penalty window | No |
repeat_penalty | Penalty for repeated tokens | No |
mirostat | Enable Mirostat sampling | No |
mirostat_tau | Mirostat target entropy | No |
mirostat_eta | Mirostat learning rate | No |
penalize_nl | Apply penalty to newlines | No |
ignore_eos | Ignore end-of-sequence token | No |
n_probs | Number of top token probabilities to return per generated token | No |
min_keep | Minimum tokens to retain | No |
ctx_len | Context window size in tokens (0 or unset = loaded from the model) | No |
n_parallel | Number of parallel instances | No |
ngl | Number of model layers to offload to the GPU (unset = loaded from the model) | No |
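Once the model is configured and loaded, the `model` field above is the identifier sent with each request. The sketch below shows one way to construct a chat-completion call against a locally running Cortex server; the base URL, port, and response shape are assumptions based on a typical OpenAI-compatible local setup and may differ in your installation, and the extra inference parameters are included only to illustrate fields from the table above:

```python
# A sketch of a chat-completion request to a locally running Cortex server.
# The base URL/port and the response structure are assumptions here; adjust
# them to match your installation.
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed default local address

payload = {
    # Must match the `model` field in model.yaml
    "model": "deepseek-r1-distill-llama-8b:8b-gguf-q2-k",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what the ngl setting does."},
    ],
    # Optional inference parameters from the table above
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512,
    "stream": False,
}

response = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```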