- jan: Jan is an open-source alternative to ChatGPT that runs 100% offline on your computer, with multi-engine support (llama.cpp, TensorRT-LLM).
- primeqa: The prime repository for state-of-the-art multilingual question answering research and development.
- GPU-Benchmarks-on-LLM-Inference: Multiple NVIDIA GPUs or Apple Silicon for large language model inference?
Seconded - IMHO Jan has the cleanest UI and the most straightforward setup of all the LLM frontends available right now.
https://jan.ai/
https://github.com/janhq/jan
I was able to successfully run Llama 3 8B, Mistral 7B, Phi, and other 7B models using Ollama [1] on my M1 MacBook Air.
[1] https://ollama.com
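If you want to script against it rather than use the CLI, here is a minimal sketch using the ollama Python client; the model tag and the exact response shape may differ slightly depending on the client version you have installed:

    # Minimal sketch: chat with a locally pulled model via the ollama
    # Python client (pip install ollama). Assumes the Ollama daemon is
    # running and `ollama pull llama3` has been done beforehand.
    import ollama

    response = ollama.chat(
        model="llama3",  # the 8B instruct variant by default
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response["message"]["content"])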
Here [1] is a reference for the tokens/sec of Llama 3 on different Apple hardware. You can evaluate whether that is acceptable performance for your agents. I would assume the tokens/sec would be much lower if the LLM agent is running alongside the game, since the game would also be using a portion of the CPU and GPU. I think this is something you will need to test yourself to determine its usability.
You can also look into lower-parameter models (3B, for example) to see whether the balance between accuracy and performance fits your use case.
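If it helps, here is a rough sketch of how you could measure tokens/sec on your own machine (ideally while the game is running alongside) using the ollama Python client; the model tags are just examples, and the eval_count field is taken from Ollama's API docs as I remember them:

    # Rough tokens/sec measurement while other workloads (e.g. the game)
    # are running. Compare a 3B-class model against a 7B/8B one.
    import time
    import ollama

    prompt = "Write a two-sentence description of a tavern keeper NPC."
    for model in ("phi3", "llama3"):  # example tags; pull them first
        start = time.time()
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        elapsed = time.time() - start
        generated = response["eval_count"]  # tokens generated (per API docs)
        print(f"{model}: {generated} tokens in {elapsed:.1f}s "
              f"-> {generated / elapsed:.1f} tok/s")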
>Is there a way to reliably package these models with existing games and make them run locally? This would virtually make inference free right?
I don't have any knowledge of game dev, so I can't comment on that part, but yes, packaging it locally would make the inference free.
[1] https://github.com/ggerganov/llama.cpp/discussions/4167
See https://github.com/Mozilla-Ocho/llamafile, a standalone packaging of llama.cpp that runs an LLM locally. It will use the GPU, but it also falls back on the CPU. CPU performance of small, quantized models is still pretty decent, and the page has estimated memory requirements for different models.
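To sketch how a game might actually bundle and talk to one of these (the flag names and default port here are from memory, so treat this as an illustration rather than the documented interface): the game launcher can start the llamafile as a local server and hit its OpenAI-compatible endpoint.

    # Illustrative sketch: a game ships "model.llamafile" (hypothetical name),
    # starts it as a background server, and queries the OpenAI-compatible
    # endpoint that llama.cpp's server exposes. Flags/port may vary by version.
    import json
    import subprocess
    import time
    import urllib.request

    server = subprocess.Popen(["./model.llamafile", "--server", "--nobrowser"])
    time.sleep(5)  # crude wait; a real launcher would poll the endpoint

    payload = {
        "messages": [{"role": "user", "content": "Greet the player in character."}],
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])

    server.terminate()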
There is actually a specific application of this concept for generating synthetic training data, called UDAPDR [0].
It, or something like it, could likely be applied to any form of generation, including what you are describing.
[0] - https://github.com/primeqa/primeqa/tree/4ae1b456dbe9f75276fe...
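As a toy illustration of the general idea (not UDAPDR itself), you can prompt a local model to produce synthetic question/answer pairs from your own passages and use those as training data; the model tag, passages, and prompt below are just placeholders:

    # Toy sketch: generate synthetic Q/A pairs from passages with a local
    # LLM via the ollama Python client. Passages and model are illustrative.
    import ollama

    passages = [
        "The inn's cellar was flooded during the spring thaw of 1843.",
        "Captain Mirelle charts smuggling routes through the northern fjords.",
    ]

    for passage in passages:
        prompt = (
            "Write one question that this passage answers, then the answer.\n\n"
            f"Passage: {passage}\n\nFormat:\nQ: ...\nA: ..."
        )
        reply = ollama.chat(
            model="llama3",
            messages=[{"role": "user", "content": prompt}],
        )
        print(reply["message"]["content"], "\n---")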
The LocalLLaMA subreddit usually has some interesting benchmarks and reports.
Here is one example, testing performance of different GPUs and Macs with various flavours of Llama:
https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...