
Model Overview

Models in Cortex are used for inference (e.g., chat completion, embedding) once they have been downloaded locally. We currently support multiple engines, including llama.cpp for the GGUF model format and ONNX Runtime for edge and other deployment targets.

In the future, you will also be able to run remote models (like OpenAI GPT-4 and Claude 3.5 Sonnet) via Cortex. Support for OpenAI and Anthropic engines is under development and will be available soon.

When you run cortex start in the terminal, Cortex automatically starts an API server. (This functionality was inspired by the Docker CLI.) The Cortex server manages various model endpoints that facilitate the following operations, illustrated in the sketch after this list:

  • Model Operations: Run and stop models.
  • Model Management: Pull and manage your local models.
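
As a minimal sketch of this workflow, assuming a recent Cortex CLI release (the model ID is illustrative; substitute any model available to your installation, and note that exact subcommand names may vary by version):

    # Start the Cortex API server (as described above)
    cortex start

    # Pull a model to local storage
    cortex pull llama3.2

    # Run the model; this loads it and opens an interactive chat
    cortex run llama3.2

    # Stop the running model when you are done
    cortex models stop llama3.2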

Model Formats

Cortex supports the following model formats, and each format requires a specific engine to run:

  • GGUF - runs with the llama-cpp engine
  • ONNX - runs with the onnxruntime engine

Within the Python Engine (currently under development), you will be able to run models in other formats.
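
As a hedged sketch of how these formats map to engines on the command line, assuming the engines subcommands available in recent Cortex releases (names and availability may differ by version):

    # List the engines Cortex knows about and their install status
    cortex engines list

    # Install the engine required for GGUF models
    cortex engines install llama-cpp

    # Install the engine required for ONNX models
    cortex engines install onnxruntime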

info

For details on each format, see the Model Formats page.

Cortex Hub Models

To make it easy to run state-of-the-art open-source models, we quantize popular models and upload these versions to our own space on Hugging Face at Cortex's HuggingFace. These models are ready to download; check them out at the link above or in our Models section.

Model Variants

Built-in models are made available across the following variants, which combine into the model tags shown in the example after this list:

  • By format: gguf and onnx
  • By size: 7b, 13b, and more.
  • By quantization method: q4, q8, and more.
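
A hypothetical example of selecting a variant at pull time, assuming the model:tag convention used for Cortex Hub models (the tags shown here are illustrative; the actual tags for each model are listed on Cortex's HuggingFace):

    # Pull a 7b GGUF build quantized with q4 (tag is illustrative)
    cortex pull mistral:7b-gguf-q4

    # Pull an ONNX build of the same model (tag is illustrative)
    cortex pull mistral:7b-onnx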