batchSize (optional): Prompt processing batch size.
cache (optional)
callbackManager (optional)
callbacks (optional)
concurrency (optional)
contextSize (optional): Text context size.
embedding (optional): Embedding mode only.
f16Kv (optional): Use fp16 for the KV cache.
gbnf (optional): GBNF string to be used to format output. Also known as grammar.
gpuLayers (optional): Number of layers to store in VRAM.
jsonSchema (optional): JSON schema to be used to format output. Also known as grammar.
logitsAll (optional): The llama_eval() call computes all logits, not just the last one.
maxConcurrency (optional): The maximum number of concurrent calls that can be made. Defaults to Infinity, which means no limit.
maxRetries (optional): The maximum number of retries that can be made for a single call, with an exponential backoff between each attempt. Defaults to 6.
maxTokens (optional)
metadata (optional)
modelPath (required): Path to the model on the filesystem.
onFailedAttempt (optional): Custom handler to handle failed attempts. Takes the originally thrown error object as input, and should itself throw an error if the input error is not retryable.
prependBos (optional): Add the beginning-of-sentence token.
seed (optional): If null, a random seed will be used.
tags (optional)
temperature (optional): The randomness of the responses, e.g. 0.1 deterministic, 1.5 creative, 0.8 balanced; 0 disables.
threads (optional): Number of threads to use to evaluate tokens.
topK (optional): Consider the n most likely tokens, where n ranges from 1 to the vocabulary size; 0 disables (uses the full vocabulary). Note: only applies when temperature > 0.
topP (optional): Selects the smallest token set whose cumulative probability exceeds P, where P is between 0 and 1; 1 disables. Note: only applies when temperature > 0.
trimWhitespaceSuffix (optional): Trim whitespace from the end of the generated text. Disabled by default.
useMlock (optional): Force the system to keep the model in RAM.
useMmap (optional): Use mmap if possible.
verbose (optional)
vocabOnly (optional): Only load the vocabulary, no weights.
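As a minimal sketch of how these options are passed, the example below constructs a model with a few of them. It assumes the LangChain.js community wrapper (the "@langchain/community/llms/llama_cpp" entry point); note that the construction call varies by package version (recent versions use the static initialize() factory, older ones the constructor directly), and the model path, prompt, and option values are placeholders.

```typescript
import { LlamaCpp } from "@langchain/community/llms/llama_cpp";

// A sketch under the assumptions above, not a definitive recipe.
const model = await LlamaCpp.initialize({
  modelPath: "/path/to/your/model.gguf", // the only required option
  temperature: 0.8, // balanced randomness
  topP: 0.9,        // nucleus sampling; only applies when temperature > 0
  maxTokens: 256,
  gpuLayers: 32,    // layers to offload to VRAM, if a GPU is available
});

const response = await model.invoke("Where do llamas come from?");
console.log(response);
```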
Note that modelPath is the only required parameter. For testing, you can set it via the LLAMA_PATH environment variable.
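For instance, a test setup might read the path from the environment (same wrapper and version assumptions as in the sketch above):

```typescript
import { LlamaCpp } from "@langchain/community/llms/llama_cpp";

// Assumes LLAMA_PATH points at a local model file,
// e.g. export LLAMA_PATH=/models/llama-2-7b.Q4_0.gguf
const modelPath = process.env.LLAMA_PATH;
if (!modelPath) throw new Error("LLAMA_PATH is not set");

const model = await LlamaCpp.initialize({ modelPath });
```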