
Llama.cpp

Cortex uses llama.cpp as its default engine for GGUF models. The example configuration below shows how a GGUF model (here, DeepSeek's R1 Distill Llama 8B) is defined with both required and optional parameters. The configuration covers general metadata, inference parameters, and model-loading settings, controlling everything from basic model identification to advanced generation behavior. When a model pulled from a HuggingFace repository does not ship a model.yaml, Cortex generates one automatically from the GGUF metadata.


```yaml
# BEGIN GENERAL GGUF METADATA
id: deepseek-r1-distill-llama-8b # Unique model ID, distinguishing author and quantization
model: deepseek-r1-distill-llama-8b:8b-gguf-q2-k # Model ID used for request construction - must be unique across models (author / quantization)
name: deepseek-r1-distill-llama-8b # metadata.general.name
version: 1
files: # Can be a relative OR absolute local file path
  - models/cortex.so/deepseek-r1-distill-llama-8b/8b-gguf-q2-k/model.gguf
# END GENERAL GGUF METADATA

# BEGIN INFERENCE PARAMETERS
# BEGIN REQUIRED
stop: # tokenizer.ggml.eos_token_id
  - <|im_end|>
  - <|end▁of▁sentence|>
# END REQUIRED

# BEGIN OPTIONAL
size: 3179134413
stream: true # Defaults to true
top_p: 0.9 # Range: 0 to 1
temperature: 0.7 # Range: 0 to 1
frequency_penalty: 0 # Range: 0 to 1
presence_penalty: 0 # Range: 0 to 1
max_tokens: 4096 # Defaults to the context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
min_keep: 0
# END OPTIONAL
# END INFERENCE PARAMETERS

# BEGIN MODEL LOAD PARAMETERS
# BEGIN REQUIRED
engine: llama-cpp # Engine used to run the model
prompt_template: <|start_of_text|>{system_message}<|User|>{prompt}<|Assistant|>
# END REQUIRED

# BEGIN OPTIONAL
ctx_len: 4096 # llama.context_length | 0 or undefined = loaded from model
n_parallel: 1
ngl: 34 # Undefined = loaded from model
# END OPTIONAL
# END MODEL LOAD PARAMETERS
```
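The configuration splits fields into required and optional groups. As a minimal sketch of that distinction, the snippet below checks a model.yaml fragment for the keys the documentation marks as required. The line scanner is a deliberate simplification (not a full YAML parser), kept dependency-free for illustration.

```python
# Sketch: check a model.yaml fragment for the required top-level keys.
# The naive line scanner below is an illustration, not a real YAML parser.

REQUIRED_KEYS = {
    "id", "model", "name", "version", "files",  # general metadata
    "stop",                                     # inference parameters
    "engine", "prompt_template",                # model load parameters
}

def missing_required(yaml_text: str) -> set:
    """Return the required top-level keys absent from the fragment."""
    present = set()
    for line in yaml_text.splitlines():
        stripped = line.split("#", 1)[0].strip()  # drop comments
        if stripped and not stripped.startswith("-") and ":" in stripped:
            present.add(stripped.split(":", 1)[0].strip())
    return REQUIRED_KEYS - present

config = """\
id: deepseek-r1-distill-llama-8b
model: deepseek-r1-distill-llama-8b:8b-gguf-q2-k
name: deepseek-r1-distill-llama-8b
version: 1
files:
  - models/cortex.so/deepseek-r1-distill-llama-8b/8b-gguf-q2-k/model.gguf
stop:
  - <|im_end|>
engine: llama-cpp
prompt_template: "<|start_of_text|>{system_message}<|User|>{prompt}<|Assistant|>"
"""

print(missing_required(config))  # → set() when all required fields are present
```

A fragment missing any required field (for example, one without `engine`) would be rejected by a check like this before Cortex attempts to load the model.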

Model Parameters

| Parameter | Description | Required |
| --- | --- | --- |
| `id` | Unique model identifier including author and quantization | Yes |
| `model` | Model ID used for request construction | Yes |
| `name` | General name metadata for the model | Yes |
| `version` | Model version number | Yes |
| `files` | Path to the model GGUF file (relative or absolute) | Yes |
| `stop` | Array of stopping sequences for generation | Yes |
| `engine` | Model execution engine (`llama-cpp`) | Yes |
| `prompt_template` | Template for formatting the prompt with system message and user input | Yes |
| `size` | Model file size in bytes | No |
| `stream` | Enable streaming output (default: `true`) | No |
| `top_p` | Nucleus sampling probability threshold (0-1) | No |
| `temperature` | Output randomness control (0-1) | No |
| `frequency_penalty` | Penalty for frequent token usage (0-1) | No |
| `presence_penalty` | Penalty for token presence (0-1) | No |
| `max_tokens` | Maximum output length | No |
| `seed` | Random seed for reproducibility | No |
| `dynatemp_range` | Dynamic temperature range | No |
| `dynatemp_exponent` | Dynamic temperature exponent | No |
| `top_k` | Top-k sampling parameter | No |
| `min_p` | Minimum probability threshold | No |
| `tfs_z` | Tail-free sampling parameter | No |
| `typ_p` | Typical sampling parameter | No |
| `repeat_last_n` | Repetition penalty window | No |
| `repeat_penalty` | Penalty for repeated tokens | No |
| `mirostat` | Enable Mirostat sampling | No |
| `mirostat_tau` | Mirostat target entropy | No |
| `mirostat_eta` | Mirostat learning rate | No |
| `penalize_nl` | Apply penalty to newlines | No |
| `ignore_eos` | Ignore end-of-sequence token | No |
| `n_probs` | Number of probability outputs | No |
| `min_keep` | Minimum tokens to retain | No |
| `ctx_len` | Context window size | No |
| `n_parallel` | Number of parallel instances | No |
| `ngl` | Number of GPU layers | No |
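The `model` field (not `id`) is what clients send when constructing a request, and most optional inference parameters can be overridden per request. As a sketch, assuming an OpenAI-compatible chat endpoint (the exact endpoint and accepted fields depend on your Cortex server version), a request body might look like this:

```python
import json

# Sketch of a chat request body using the `model` ID from the configuration
# above. The endpoint shape is an assumption (OpenAI-compatible API); check
# your Cortex server's API reference for the authoritative field list.
request_body = {
    "model": "deepseek-r1-distill-llama-8b:8b-gguf-q2-k",  # the `model` field, not `id`
    "messages": [{"role": "user", "content": "Hello"}],
    # Per-request overrides of the model.yaml defaults:
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 4096,
    "stream": True,
}

print(json.dumps(request_body, indent=2))
```

Parameters omitted from the request fall back to the values declared in model.yaml, which in turn fall back to engine defaults.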