
llama

# Python and build toolchain packages for Alpine (apk)
apk add \
gcc g++ python3 py3-pip musl-dev cmake make pkgconf build-base \
git openssh-client binutils coreutils util-linux findutils sed grep tar wget curl neofetch \
rust cargo python3-dev openssl-dev linux-headers

# llama.cpp
# =========
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j

# plain text completion, predict up to 512 tokens
./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
# interactive instruct mode with the Alpaca prompt template, 2048-token context
./main -m ./models/7B/ggml-model-q4_0.bin --file prompts/alpaca.txt --instruct --ctx_size 2048 --keep -1

# Alpaca model, interactive mode with custom sampling settings and 7 threads
./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7

# https://github.com/ymcui/Chinese-LLaMA-Alpaca
# =========
apk add rust cargo python3-dev openssl-dev cmake linux-headers
pip install git+https://github.com/huggingface/transformers
pip install sentencepiece
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install peft

git clone https://github.com/huggingface/transformers

# note: musl libc may lack pthread_attr_setaffinity_np, which can cause build failures on Alpine
python ./transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir /ml/models/LLaMA \
--model_size 7B \
--output_dir /ml/models/LLaMA-hf
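
Once the conversion finishes, the HF-format weights can be loaded with transformers. A minimal sketch, assuming the --output_dir above and enough RAM to hold the model; the prompt just reuses the earlier llama.cpp example:

```python
# Minimal sketch: load the converted HF-format LLaMA and generate a few tokens.
# Paths reuse the --output_dir above; adjust to your own layout.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/ml/models/LLaMA-hf")
model = LlamaForCausalLM.from_pretrained(
    "/ml/models/LLaMA-hf",
    torch_dtype=torch.float16,  # use torch.float32 if fp16 is slow/unsupported on your CPU
)

inputs = tokenizer("Building a website can be done in 10 simple steps:", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```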



#->/lmg/ Model Links and Torrents <-

[TOC2]

Changelog (MDY)

[05-10-2023] - Added WizardLM 13B Uncensored
[05-07-2023] - Added Vicuna 13B Cocktail, bluemoonrp-13b & AlpacaDente2
[05-05-2023] - Added CPU quantization variation links
[05-02-2023] - Initial Rentry

4-bit GPU Model Requirements

!!! note VRAM Required takes full context (2048) into account. You may be able to load the model on GPUs with slightly lower VRAM, but you will not be able to run at full context. If you do not have enough RAM to load the model, it will load into swap. Groupsize models will increase VRAM usage, as will running a LoRA alongside the model.

| Model Parameters | VRAM Required | GPU Examples | RAM to Load |
| --- | --- | --- | --- |
| 7B | 8GB | GTX 1660, RTX 2060, AMD 5700xt, RTX 3050, RTX 3060, RTX 3070 | 6 GB |
| 13B | 12GB | AMD 6900xt, RTX 2060 12GB, 3060 12GB, 3080 12GB, A2000 | 12 GB |
| 30B | 24GB | RTX 3090, RTX 4090, A4500, A5000, 6000, Tesla V100 | 32 GB |
| 65B | 42GB | A100 80GB, NVIDIA Quadro RTX 8000, Quadro RTX A6000 | 64 GB |

4-bit CPU/llama.cpp RAM Requirements

!!! note 5-bit to 8-bit quantized models are becoming more common, and will obviously require more RAM. Will update these with the numbers when I have them.

| Model | 4-bit | 5-bit | 8-bit |
| --- | --- | --- | --- |
| 7B | 3.9 GB | | |
| 13B | 7.8 GB | | |
| 30B | 19.5 GB | | |
| 65B | 38.5 GB | | |
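
As a rough sanity check on both tables above: quantized weights take about params × effective bits-per-weight / 8 bytes (q4 formats store a per-block scale, so the effective rate sits a little above 4 bits), and running at the full 2048 context adds an fp16 KV cache of roughly 2 × layers × context × hidden size × 2 bytes. A hedged Python sketch using the published LLaMA layer/hidden sizes; the constants are approximations, and real usage is higher once activations and framework overhead are counted:

```python
# Back-of-the-envelope memory estimate for quantized LLaMA models.
# Approximation only: activations, framework overhead, groupsize metadata
# and any LoRA weights all add to the figures below.
LLAMA = {           # billions of params, n_layers, hidden size
    "7B":  (6.7,  32, 4096),
    "13B": (13.0, 40, 5120),
    "30B": (32.5, 60, 6656),
    "65B": (65.2, 80, 8192),
}

def weight_gb(model, bits_per_weight=4.5):
    """Approximate size of the quantized weights alone."""
    params_b, _, _ = LLAMA[model]
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(model, ctx=2048):
    """fp16 key/value cache at a given context length."""
    _, n_layers, hidden = LLAMA[model]
    return 2 * n_layers * ctx * hidden * 2 / 1e9   # K and V, 2 bytes per value

for m in LLAMA:
    print(f"{m}: ~{weight_gb(m):.1f} GB weights (4-bit) "
          f"+ ~{kv_cache_gb(m):.1f} GB KV cache at 2048 ctx")
```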

Original Weights

LLaMA 16-bit Weights

!!! info

The original LLaMA weights converted to Transformers @ 16bit. A torrent is available as well, but it uses outdated configuration files that will need to be updated. Note that these aren't for general use, as the VRAM requirements are beyond consumer scope.

>Filtering : None
| Model | Type | Download |
| --- | --- | --- |
| 7B 16bit | HF Format | HuggingFace |
| 13B 16bit | HF Format | HuggingFace |
| 30B 16bit | HF Format | HuggingFace |
| 65B 16bit | HF Format | HuggingFace |
| All the above | HF Format | Torrent Magnet |

LLaMA 4-bit Weights

!!! info

The original LLaMA weights quantized to 4-bit. The GPU CUDA versions have outdated tokenizer and configuration files. It is recommended to either update them with [this](https://rentry.org/544p2) or use the [universal LLaMA tokenizer.](https://github.com/oobabooga/text-generation-webui/blob/main/docs/LLaMA-model.md#option-1-pre-converted-weights)

>Filtering : None
| Model | Type | Download |
| --- | --- | --- |
| 7B, 13B, 30B, 65B | CPU | Torrent Magnet |
| 7B, 13B, 30B, 65B | GPU CUDA (no groupsize) | Torrent Magnet |
| 7B, 13B, 30B, 65B | GPU CUDA (128gs) | Torrent Magnet |
| 7B, 13B, 30B, 65B | GPU Triton | Neko Institute of Science HF page |

Models/Finetunes/LoRAs

WizardLM 13B Uncensored (05/10/2023)

!!! info

This is WizardLM trained with a subset of the dataset - responses that contained alignment / moralizing were removed. The intent is to train a WizardLM that doesn't have alignment built-in, so that alignment (of any sort) can be added separately, for example with an RLHF LoRA.

Note that despite being an "uncensored" model, several tests have demonstrated that the model will still refuse to comply with certain requests.

>Filtering : Light
| Model | Type | Download |
| --- | --- | --- |
| 13B GGML | CPU | Q5 |
| 13B | GPU | Q4 CUDA 128gs |

BluemoonRP 13B (05/07/2023)

!!! info

An RP/ERP focused finetune of LLaMA 13B finetuned on BluemoonRP logs. It is designed to simulate a 2-person RP session. Two versions are provided; a standard 13B with 2K context and an experimental 13B with 4K context. It has a non-standard format (LEAD/ASSOCIATE), so ensure that you read the model card and use the correct syntax.

>Filtering : None
| Model | Type | Download |
| --- | --- | --- |
| 13B | GPU & CPU | https://huggingface.co/reeducator/bluemoonrp-13b |

Vicuna 13B Cocktail (05/07/2023)

!!! info

Vicuna 1.1 13B finetune incorporating various datasets in addition to the unfiltered ShareGPT. This is an experiment attempting to enhance the creativity of Vicuna 1.1 while also reducing censorship as much as possible. All datasets have been cleaned. Additionally, only the "instruct" portion of GPTeacher has been used. It has a non-standard format (USER/ASSOCIATE), so ensure that you read the model card and use the correct syntax.

>Filtering : Light
| Model | Type | Download |
| --- | --- | --- |
| 13B | GPU & CPU | https://huggingface.co/reeducator/vicuna-13b-cocktail |

GPT4-x-AlpacaDente2-30B (05/05/2023)

!!! info

ChanSung's Alpaca-LoRA-30B-elina merged with Open Assistant's second Finetune. Testing in progress.

>Filtering : Medium
| Model | Type | Download |
| --- | --- | --- |
| 30B GGML | CPU | Q5 |
| 30B | GPU | Q4 CUDA |

https://huggingface.co/askmyteapot/GPT4-x-AlpacaDente2-30b-4bit

Vicuna 13B Free v1.1 (05/01/2023)

!!! info

A work-in-progress, community driven attempt to make an unfiltered version of Vicuna. It currently has an early stopping bug, and a partial workaround has been posted on the repo's model card.

>Filtering : Light
| Model | Type | Download |
| --- | --- | --- |
| 13B | GPU & CPU | https://huggingface.co/reeducator/vicuna-13b-free |

Pygmalion/Metharme 7B (04/30/2023)

!!! info

Pygmalion 7B is a dialogue model that uses LLaMA-7B as a base. The dataset includes RP/ERP content. Metharme 7B is an experimental instruct-tuned variation, which can be guided using natural language like other instruct models.

PygmalionAI intend to use the same dataset on the higher parameter LLaMA models. No ETA as of yet.

>Filtering : None
| Model | Type | Download |
| --- | --- | --- |
| 7B Pygmalion/Metharme | XOR | https://huggingface.co/PygmalionAI/ |
| 7B Pygmalion GGML | CPU | Q4, Q5, Q8 |
| 7B Metharme GGML | CPU | Q4, Q5 |
| 7B Pygmalion | GPU | Q4 Triton, Q4 CUDA 128gs |
| 7B Metharme | GPU | Q4 Triton, Q4 CUDA |

GPT4-X-Alpasta 30B (04/29/2023)

!!! info

An attempt at improving Open Assistant's performance as an instruct while retaining its excellent prose. The merge consists of Chansung's GPT4-Alpaca Lora and Open Assistant's native fine-tune.

It is an extremely coherent model for logic-based instruct outputs. While the prose is generally very good, it does suffer from the "Assistant" personality bleed-through that plagues the OpenAssistant dataset, which can give you dry dialogue for creative writing/chatbot purposes. However, several accounts claim it's nowhere near as bad as OA's finetunes, and that the prose and coherence gains make up for it.

>Filtering : Medium
| Model | Type | Download |
| --- | --- | --- |
| 30B 4bit | CPU & GPU CUDA | https://huggingface.co/MetaIX/GPT4-X-Alpasta-30b-4bit |

OpenAssistant LLaMa 30B SFT 6 (04/23/2023)

!!! info

An open-source alternative to OpenAI’s ChatGPT/GPT 3.5 Turbo. However, it seems to suffer from [overfitting](https://www.datarobot.com/wiki/overfitting/) and is heavily filtered. Not recommended for creative writing or chat bots, given the "assistant" personality constantly bleeds through, giving you dry dialogue.

>Filtering : Heavy
| Model | Type | Download |
| --- | --- | --- |
| 30B | XOR | https://huggingface.co/OpenAssistant/oasst-sft-6-llama-30b-xor |
| 30B GGML | CPU | Q4 |
| 30B | GPU | Q4 CUDA, Q4 CUDA 128gs |
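
The Pygmalion/Metharme and OpenAssistant weights above are published as XOR files: XOR-ing them against the original LLaMA weights recovers the finetune without redistributing the base weights. Each repository ships its own conversion script, which is what you should actually use; the sketch below (with placeholder file names) only illustrates the idea:

```python
# Illustration of XOR-delta weight distribution (placeholder file names).
# Use the conversion script from the model's own repository for real use;
# real releases also verify checksums and work per tensor file.
def xor_files(base_path, xor_path, out_path, chunk=1 << 20):
    with open(base_path, "rb") as base, open(xor_path, "rb") as delta, \
         open(out_path, "wb") as out:
        while True:
            a = base.read(chunk)
            b = delta.read(chunk)
            if not a or not b:
                break
            out.write(bytes(x ^ y for x, y in zip(a, b)))

# xor_files("llama-30b/consolidated.00.pth", "oasst-xor/xor_00.bin",
#           "oasst-30b/consolidated.00.pth")
```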

SuperCOT (04/22/2023)

!!! info

SuperCOT is a LoRA trained with the aim of making LLaMA follow prompts for LangChain better, by infusing chain-of-thought datasets, code explanations and instructions, snippets, logical deductions and Alpaca GPT-4 prompts.

Though designed to improve LangChain performance, it's quite versatile and works very well for other tasks like creative writing and chatbots. The author also pruned a number of filters from the datasets. As of early May 2023, it's the most recommended model on /lmg/.

>Filtering : Light
| Model | Type | Download |
| --- | --- | --- |
| Original LoRA | LoRA | https://huggingface.co/kaiokendev/SuperCOT-LoRA |
| 13B GGML | CPU | Q4, Q8 |
| 30B GGML | CPU | Q4, Q5, Q8 |
| 13B | GPU | Q4 CUDA 128gs |
| 30B | GPU | Q4 CUDA, Q4 CUDA 128gs |
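
Because SuperCOT is distributed as a LoRA, it can also be applied on the fly to an HF-format base model with the peft library installed earlier, instead of downloading a pre-merged quantization. A minimal sketch, assuming a local HF-format LLaMA 13B at a placeholder path and enough RAM for fp16 weights; check the model card for the repository's exact layout:

```python
# Sketch: apply the SuperCOT LoRA to an HF-format LLaMA base with peft.
# The base model path is a placeholder; the adapter repo is the one listed above.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained(
    "/ml/models/LLaMA-13B-hf", torch_dtype=torch.float16
)
tokenizer = LlamaTokenizer.from_pretrained("/ml/models/LLaMA-13B-hf")

# Loads the adapter weights on top of the frozen base model.
model = PeftModel.from_pretrained(base, "kaiokendev/SuperCOT-LoRA")
```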

Previous Model List

!!! info

The old rentry, retained for archiving purposes. Contains older and outdated models.

https://rentry.org/backupmdlist


Models for llama.cpp (ggml format)

LLaMA quantized 4-bit weights (ggml q4_0)

2023-03-31 torrent magnet

!!! info Tutorial link for llama.cpp

!!! info Tutorial link for koboldcpp

SHA256 checksums:

2dad53e70ca521fedcf9f9be5c26c15df602487a9c008bdafbb2bf8f946b6bf0  llama-7b-ggml-q4_0/ggml-model-q4_0.bin
9cd4d6c1f5f42d5abf529c51bde3303991fba912ab8ed452adfd7c97a4be77d7 llama-13b-ggml-q4_0/ggml-model-q4_0.bin
daefbc6b1b644a75be0286ef865253ab3786e96a2c1bca8b71216b1751eee63e llama-33b-ggml-q4_0/ggml-model-q4_0.bin
d58a29c8403ecbd14258bbce07d90894fc5a8be25b9d359463c18f9f2ef96eb6 llama-65b-ggml-q4_0/ggml-model-q4_0.bin

ggml model file magic: 0x67676a74 (ggjt in hex)
ggml model file version: 1
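
A short sketch for verifying a download against the checksums and the ggjt header quoted above: the first eight bytes of the file hold the magic and the version as little-endian uint32 values. The path is a placeholder; the same check applies to every ggml model on this page:

```python
# Verify a ggml model file: SHA256 checksum plus the ggjt magic/version above.
# Path is a placeholder; point it at your own download.
import hashlib
import struct

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def ggml_header(path):
    with open(path, "rb") as f:
        magic, version = struct.unpack("<II", f.read(8))
    return magic, version

path = "llama-7b-ggml-q4_0/ggml-model-q4_0.bin"
print(sha256(path))                 # compare against the checksum list above
magic, version = ggml_header(path)
print(hex(magic), version)          # expect 0x67676a74 (ggjt) and 1
```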

Alpaca quantized 4-bit weights (ggml q4_0)

| Model | Download |
| --- | --- |
| LLaMA 7B fine-tune from chavinlo/alpaca-native | 2023-03-31 torrent magnet |
| LLaMA 7B merged with tloen/alpaca-lora-7b LoRA | 2023-03-31 torrent magnet |
| LLaMA 13B merged with chansung/alpaca-lora-13b LoRA | 2023-03-31 torrent magnet |
| LLaMA 33B merged with chansung/alpaca-lora-30b LoRA | 2023-03-31 torrent magnet |

!!! info Tutorial link for llama.cpp
Example: ./main --model ggml-model-q4_0.bin --file prompts/alpaca.txt --instruct --ctx_size 2048 --keep -1

!!! info Tutorial link for koboldcpp

SHA256 checksums:

f5e264b10944c55a84810e8073dfdcd653fa8e47ff50ea043ec071051ac7821d  alpaca-7b-ggml-q4_0-native-finetune/ggml-model-q4_0.bin
d9777baad5cf6a5d196e70867338d8cc3c7af68c7744e68de839a522983860d7 alpaca-7b-ggml-q4_0-lora-merged/ggml-model-q4_0.bin
3838aa32651c65948e289374abd71f6feab1a62a4921a648e30d979df86a4af3 alpaca-13b-ggml-q4_0-lora-merged/ggml-model-q4_0.bin
2267ed1dc0bf0d6d300ba292c25083c7fa5395f3726c7c68a49b2be19a64b349 alpaca-33b-ggml-q4_0-lora-merged/ggml-model-q4_0.bin

ggml model file magic: 0x67676a74 (ggjt in hex)
ggml model file version: 1

GPT4All 7B quantized 4-bit weights (ggml q4_0)

2023-03-31 torrent magnet

!!! info Tutorial link for llama.cpp
GPT4All can be used with llama.cpp in the same way as the other ggml models.

!!! info Tutorial link for koboldcpp

SHA256 checksums:

9f6cd4830a3c45a86147c80a32888e7be8f8a489284c87cdb882a7cfe40940c1  gpt4all-unfiltered-7b-ggml-q4_0-lora-merged/ggml-model-q4_0.bin
de314c5ee155ac40a03ca3b3be85ba2b02aef9e9f083c411c0b4490689dd047e gpt4all-7b-ggml-q4_0-lora-merged/ggml-model-q4_0.bin

ggml model file magic: 0x67676a74 (ggjt in hex)
ggml model file version: 1

GPT4 x Alpaca 13B quantized 4-bit weights (ggml q4_0)

2023-04-01 torrent magnet

!!! info Tutorial link for llama.cpp
GPT4 x Alpaca can be used with llama.cpp in the same way as the other ggml models. Text generation with this version is faster compared to the GPTQ-quantized one.

!!! info Tutorial link for koboldcpp

SHA256 checksum:

e6b77ebf297946949b25b3c4b870f10cdc98fb9fcaa6d19cef4dda9021031580  gpt4-x-alpaca-13b-ggml-q4_0/ggml-model-q4_0.bin

ggml model file magic: 0x67676a74 (ggjt in hex)
ggml model file version: 1

Model source

GPT4 x Alpaca 13B quantized 4-bit weights (ggml q4_1 from GPTQ with groupsize 128)

2023-04-01 torrent magnet

!!! info Tutorial link for llama.cpp
GPT4 x Alpaca can be used with llama.cpp in the same way as the other ggml models.

!!! info Tutorial link for koboldcpp

SHA256 checksum:

d4a640a1ce33009c244a361c6f87733aacbc2bea90e84d3c304a4c8be2bdf22d  gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g/ggml-model-q4_1.bin

ggml model file magic: 0x67676a74 (ggjt in hex)
ggml model file version: 1

Model source

Vicuna 13B quantized 4-bit weights (ggml q4_0)

2023-04-03 torrent magnet

!!! info Tutorial link for llama.cpp
Vicuna can be used with llama.cpp in the same way as the other ggml models.

!!! info Tutorial link for koboldcpp

SHA256 checksum:

f96689a13c581f53b616887b2efe82bbfbc5321258dbcfdbe69a22076a7da461  vicuna-13b-ggml-q4_0-delta-merged/ggml-model-q4_0.bin

ggml model file magic: 0x67676a74 (ggjt in hex)
ggml model file version: 1

Model source

OpenAssistant LLaMA 13B quantized 4-bit weights (ggml q4_0 & q4_1)

!!! warning Note that this model is work-in-progress.

2023-04-07 torrent magnet | HuggingFace Hub direct download

!!! info Tutorial link for llama.cpp

!!! info Tutorial link for koboldcpp

SHA256 checksums:

fe77206c7890ecd0824c7b6b6a6deab92e471366b2e4271c05ece9a686474ef6  ggml-model-q4_0.bin
412da683b6ab0f710ce0adc8bc36db52bb92df96698558c5f2a1399af9bd0a78 ggml-model-q4_1.bin

ggml model file magic: 0x67676a74 (ggjt in hex)
ggml model file version: 1

Original model source | GPTQ-quantized model source | Torrent source


Models for HuggingFace 🤗

!!! danger Updated tokenizer and model configuration files can be found here. Ensure that your models have the appropriate JSON files within the same directory as the weights, otherwise text generation might be impacted by tokenization problems. The issues were addressed here and here, but a manual update of both the transformers library and your model configuration files is required.
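
One quick way to catch the problem described in the warning above is to check that the usual HF-format LLaMA files sit next to the weights before loading. A hedged sketch; the exact file set can vary between conversions, so treat the list as a reasonable default rather than a requirement (the path is a placeholder):

```python
# Check that a model directory contains the tokenizer/config files that
# HF-format LLaMA models normally ship with. The list is a common default,
# not an exhaustive or mandatory set.
from pathlib import Path

EXPECTED = [
    "config.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "tokenizer.model",
]

def check_model_dir(model_dir):
    model_dir = Path(model_dir)
    missing = [name for name in EXPECTED if not (model_dir / name).exists()]
    if missing:
        print(f"{model_dir}: missing {', '.join(missing)}")
    else:
        print(f"{model_dir}: all expected tokenizer/config files present")

check_model_dir("models/llama-13b-4bit-128g")  # placeholder path
```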

LLaMA float16 weights

2023-03-26 torrent magnet | HuggingFace Hub direct downloads

!!! info Tutorial link for Text generation web UI

Torrent source and SHA256 checksums

Vicuna 13B float16 weights

2023-04-03 torrent magnet

!!! info Tutorial link for Text generation web UI

Model source

LLaMA quantized 4-bit weights (GPTQ format without groupsize)

2023-03-26 torrent magnet

!!! info Tutorial link for Text generation web UI

SHA256 checksums:

09841a1c4895e1da3b05c1bdbfb8271c6d43812661e4348c862ff2ab1e6ff5b3  llama-7b-4bit/llama-7b-4bit.safetensors
edfa0b4060aae392b1e9df21fb60a97d78c9268ac6972e3888f6dc955ba0377b llama-13b-4bit/llama-13b-4bit.safetensors
4cb560746fe58796233159612d8d3c9dbdebdf6f0443b47be71643f2f91b8541 llama-30b-4bit/llama-30b-4bit.safetensors
886ce814ed54c4bd6850e2216d5f198c49475210f8690f45dc63365d9aff3177 llama-65b-4bit/llama-65b-4bit.safetensors

Torrent source and more information

LLaMA quantized 4-bit weights (GPTQ format with groupsize 128)

2023-03-26 torrent magnet

!!! info Tutorial link for Text generation web UI
Groupsize 128 is a better choice for the 13B, 33B and 65B models, according to this.

SHA256 checksums:

ed8ec9c9f0ebb83210157ad0e3c5148760a4e9fd2acfb02cf00f8f2054d2743b  llama-7b-4bit-128g/llama-7b-4bit-128g.safetensors
d3073ef1a2c0b441f95a5d4f8a5aa3b82884eef45d8997270619cb29bcc994b8 llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
8b7d75d562938823c4503b956cb4b8af6ac0a5afbce2278566cc787da0f8f682 llama-30b-4bit-128g/llama-30b-4bit-128g.safetensors
f1418091e3307611fb0a213e50a0f52c80841b9c4bcba67abc1f6c64c357c850 llama-65b-4bit-128g/llama-65b-4bit-128g.safetensors

Torrent source and more information
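
For context on what the groupsize in these releases means: GPTQ shares one quantization scale per group of weights, so a smaller group (e.g. 128) tracks the weight distribution more closely at the cost of a little extra storage, which is why it helps the larger models. A toy illustration of per-group symmetric 4-bit quantization (real GPTQ additionally applies error-compensating updates, so this is only a sketch):

```python
# Toy per-group 4-bit quantization to illustrate "groupsize 128".
# Real GPTQ also compensates quantization error column by column;
# this only shows the effect of the shared group scale.
import numpy as np

def quantize_4bit(row, group_size=128):
    assert row.size % group_size == 0
    groups = row.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0   # int4 range -8..7
    q = np.clip(np.round(groups / scales), -8, 7)
    return (q * scales).reshape(row.shape)                     # dequantized values

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
for gs in (1024, 128):
    err = np.abs(quantize_4bit(w, gs) - w).mean()
    print(f"groupsize {gs}: mean abs error {err:.4f}")   # smaller groups, smaller error
```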

Alpaca quantized 4-bit weights (GPTQ format with groupsize 128)

| Model | Download |
| --- | --- |
| LLaMA 7B fine-tune from ozcur/alpaca-native-4bit as safetensors | 2023-03-29 torrent magnet |
| LLaMA 33B merged with baseten/alpaca-30b LoRA by an anon | 2023-03-26 torrent magnet \| extra config files |

!!! info Tutorial link for Text generation web UI

SHA256 checksums:

17d6ba8f83be89f8dfa05cd4720cdd06b4d32c3baed79986e3ba1501b2305530  Alpaca-7B-GPTQ-4bit-128g-native-finetune_2023-03-29/alpaca-7b-4bit-128g-native-finetune.safetensors
a2f8d202ce61b1b612afe08c11f97133c1d56076d65391e738b1ab57c854ee05 Alpaca-30B-4bit-128g/alpaca-30b-hf-4bit.safetensors

Vicuna 13B quantized 4-bit & 8-bit weights (GPTQ format with groupsize 128)

2023-04-03 torrent magnet

!!! info Tutorial link for Text generation web UI

Torrent source | Extra config files