# OpenAI Compatible Server

`llama-cpp-python` offers an OpenAI API compatible web server.
This web server can be used to serve local models and easily connect them to existing clients.
## Setup

### Installation
The server can be installed by running the following command:
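```bash
# Install with the "server" extra to pull in the web server dependencies.
pip install llama-cpp-python[server]
```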
### Running the server
The server can then be started by running the following command:
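```bash
# <model_path> is the path to a GGUF model file on disk.
python3 -m llama_cpp.server --model <model_path>
```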
### Server options
For a full list of options, run:
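```bash
python3 -m llama_cpp.server --help
```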
NOTE: All server options are also available as environment variables. For example, `--model` can be set by setting the `MODEL` environment variable.
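For example, the following two invocations are equivalent (the model path is a placeholder):

```bash
python3 -m llama_cpp.server --model <model_path>
MODEL=<model_path> python3 -m llama_cpp.server
```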
Check out the server options reference below for more information on the available settings.
CLI arguments and environment variables are available for all of the fields defined in `ServerSettings` and `ModelSettings`.
Additionally, the server supports configuration via a config file; check out the Configuration and Multi-Model Support section below for more information and examples.
## Guides

### Code Completion

`llama-cpp-python` supports code completion via GitHub Copilot.
NOTE: Without GPU acceleration this is unlikely to be fast enough to be usable.
You'll first need to download one of the available code completion models in GGUF format:
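- for example, `replit-code-v1_5-3b-GGUF` (the quantized Replit Code v1.5 3B model that also appears in the configuration example later on this page)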
Then you'll need to run the OpenAI compatible web server with a substantially increased context size for GitHub Copilot requests:
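```bash
# Illustrative invocation: Copilot sends large prompts, so raise n_ctx well above
# the default (the config example below uses n_ctx 9216 for its copilot-codex model).
python3 -m llama_cpp.server --model <model_path> --n_ctx 9216
```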
Then just update your settings in `.vscode/settings.json` to point to your code completion server:
```json
{
  // ...
  "github.copilot.advanced": {
    "debug.testOverrideProxyUrl": "http://<host>:<port>",
    "debug.overrideProxyUrl": "http://<host>:<port>"
  }
  // ...
}
```
### Function Calling

`llama-cpp-python` supports structured function calling based on a JSON schema.
Function calling is completely compatible with the OpenAI function calling API and can be used by connecting with the official OpenAI Python client.
You'll first need to download one of the available function calling models in GGUF format:
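- for example, the `functionary` v1/v2 GGUF models published by meetkai on Hugging Face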
Then when you run the server you'll need to also specify either the `functionary-v1` or `functionary-v2` chat format.

Note that since functionary requires an HF tokenizer, due to discrepancies between llama.cpp and HuggingFace's tokenizers as mentioned here, you will need to pass in the path to the tokenizer too. The tokenizer files are already included in the respective HF repositories hosting the GGUF files.
```bash
python3 -m llama_cpp.server --model <model_path_to_functionary_v2_model> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <model_path_to_functionary_v2_tokenizer>
```
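Once the server is running, function calls can be made through the official OpenAI Python client's tools API. A minimal sketch, assuming the server started with the command above; the host, port, API key, model name, and the `get_current_weather` schema are placeholders for illustration:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

# A JSON schema describing the function the model is allowed to call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="functionary",  # placeholder; matched against the server's model/alias
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

# The model responds with a structured tool call that conforms to the schema above.
print(response.choices[0].message.tool_calls)
```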
Check out this example notebook for a walkthrough of some interesting use cases for function calling.
### Multimodal Models

`llama-cpp-python` supports the llava1.5 family of multi-modal models, which allow the language model to read information from both text and images.
You'll first need to download one of the available multi-modal models in GGUF format:
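- for example, `ggml_llava-v1.5-7b` (the LLaVA 1.5 7B GGUF model and its CLIP projector, as used in the configuration example later on this page)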
Then when you run the server you'll need to also specify the path to the clip model used for image embedding and the `llava-1-5` chat format:

```bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```
Then you can just use the OpenAI API as normal:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "<image_url>"
                    },
                },
                {"type": "text", "text": "What does the image say"},
            ],
        }
    ],
)
print(response)
```
## Configuration and Multi-Model Support

The server supports configuration via a JSON config file that can be passed using the `--config_file` parameter or the `CONFIG_FILE` environment variable.
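```bash
python3 -m llama_cpp.server --config_file <config_file>
# or equivalently
CONFIG_FILE=<config_file> python3 -m llama_cpp.server
```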
Config files support all of the server and model options supported by the CLI and environment variables; however, instead of only a single model, the config file can specify multiple models.
The server supports routing requests to multiple models based on the `model` parameter in the request, which is matched against the `model_alias` in the config file.
At the moment only a single model is loaded into memory at a time; the server will automatically load and unload models as needed.
```json
{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-3.5-turbo",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-4",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf",
            "model_alias": "gpt-4-vision-preview",
            "chat_format": "llava-1-5",
            "clip_model_path": "models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf",
            "model_alias": "text-davinci-003",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
            "model_alias": "copilot-codex",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 1024,
            "n_ctx": 9216
        }
    ]
}
```
The config file format is defined by the `ConfigFileSettings` class.
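With a config file like the one above, a client selects a model by passing its alias as the `model` parameter. A minimal sketch using the official OpenAI Python client (host, port, and API key are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

# Routed to the OpenHermes chat model registered under the "gpt-3.5-turbo" alias.
chat = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)

# Routed to the Mistral base model registered under the "text-davinci-003" alias.
completion = client.completions.create(
    model="text-davinci-003",
    prompt="The capital of France is",
    max_tokens=16,
)
print(completion.choices[0].text)
```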
## Server Options Reference
### llama_cpp.server.settings.ConfigFileSettings

Bases: `ServerSettings`

Configuration file format settings.

Source code in `llama_cpp/server/settings.py`

- `models: List[ModelSettings] = Field(default=[], description='Model configs')`
### llama_cpp.server.settings.ServerSettings

Bases: `BaseSettings`

Server settings used to configure the FastAPI and Uvicorn server.

Source code in `llama_cpp/server/settings.py`

- `host: str = Field(default='localhost', description='Listen address')`
- `port: int = Field(default=8000, description='Listen port')`
- `ssl_keyfile: Optional[str] = Field(default=None, description='SSL key file for HTTPS')`
- `ssl_certfile: Optional[str] = Field(default=None, description='SSL certificate file for HTTPS')`
- `api_key: Optional[str] = Field(default=None, description='API key for authentication. If set all requests need to be authenticated.')`
- `interrupt_requests: bool = Field(default=True, description='Whether to interrupt requests when a new request is received.')`
- `disable_ping_events: bool = Field(default=False, description='Disable EventSource pings (may be needed for some clients).')`
- `root_path: str = Field(default='', description='The root path for the server. Useful when running behind a reverse proxy.')`
### llama_cpp.server.settings.ModelSettings

Bases: `BaseSettings`

Model settings used to load a Llama model.

Source code in `llama_cpp/server/settings.py`