TensorRT-LLM
fiftyone
Metric | TensorRT-LLM | fiftyone
---|---|---
Mentions | 14 | 21
Stars | 6,988 | 6,843
Growth | 6.2% | 2.5%
Activity | 8.4 | 10.0
Last commit | 3 days ago | about 14 hours ago
Language | C++ | Python
License | Apache License 2.0 | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
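As a rough illustration of the Growth figure, the sketch below computes month-over-month growth as a percentage. The page does not state the exact formula it uses, so this assumes the plain definition (change in stars divided by last month's stars); the function name and the star counts are hypothetical.

```python
def month_over_month_growth(stars_last_month: int, stars_now: int) -> float:
    """Return the change in stars as a percentage of last month's count.

    Assumption: plain month-over-month definition; the tracking site may
    compute its Growth column differently.
    """
    if stars_last_month <= 0:
        raise ValueError("stars_last_month must be positive")
    return (stars_now - stars_last_month) * 100.0 / stars_last_month


# Hypothetical counts, chosen only to illustrate the arithmetic.
growth = month_over_month_growth(stars_last_month=1000, stars_now=1050)
print(f"{growth:.1f}%")
```

Under this definition, a project that goes from 1,000 to 1,050 stars in a month has grown by 5%.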
TensorRT-LLM
- Ollama v0.1.33 with Llama 3, Phi 3, and Qwen 110B
Yes, we are also looking at integrating MLX [1] which is optimized for Apple Silicon and built by an incredible team of individuals, a few of which were behind the original Torch [2] project. There's also TensorRT-LLM [3] by Nvidia optimized for their recent hardware.
All of this of course acknowledging that llama.cpp is an incredible project with competitive performance and support for almost any platform.
[1] https://github.com/ml-explore/mlx
[2] https://en.wikipedia.org/wiki/Torch_(machine_learning)
[3] https://github.com/NVIDIA/TensorRT-LLM
- FLaNK AI for 11 March 2024
- FLaNK Stack 26 February 2024
NVIDIA GPU LLM https://github.com/NVIDIA/TensorRT-LLM
- FLaNK Stack Weekly 19 Feb 2024
- Nvidia Chat with RTX
https://github.com/NVIDIA/TensorRT-LLM
It's quite a thin wrapper around putting both projects into %LocalAppData%, along with a miniconda environment with the correct dependencies installed. Also, for some reason the download includes both LLaMA 13b (24.5GB) and Mistral 7b (13.6GB), but only Mistral was installed?
Mistral 7b runs about as accurately as I remember, but responses are faster than I can read. This seems to come at the cost of context and variance/temperature: although it's a chat interface, the implementation doesn't seem to take previous questions or answers into account. Asking it the same question also gives the same answer.
The RAG (llamaindex) is okay, but a little suspect. The installation comes with a default folder dataset containing text files of Nvidia marketing materials. When I tried asking questions about the files, it often cited the wrong file even when it gave the right answer.
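The comment's observation that the same question always yields the same answer is what one would expect from greedy or very low-temperature decoding. The sketch below is a generic illustration of temperature sampling, not Chat with RTX's actual implementation (its sampling settings are not stated here); the function name and the logits are hypothetical.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick an index from `logits`; a temperature of 0 means greedy (argmax)."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens
rng = random.Random(0)

# Greedy decoding always returns the highest-scoring token.
greedy = {sample_token(logits, 0.0, rng) for _ in range(100)}
# Temperature 1.0 samples from the softmax, so several tokens can appear.
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}

print(greedy)
print(len(sampled) > 1)
```

With temperature at zero every call picks the same token, so repeated prompts produce identical text; raising the temperature trades that repeatability for variety.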
- Nvidia's Chat with RTX is a promising AI chatbot that runs locally on your PC
Yeah, seems a bit odd because the TensorRT-LLM repo lists Turing as a supported architecture.
https://github.com/NVIDIA/TensorRT-LLM?tab=readme-ov-file#pr...
- MK1 Flywheel Unlocks the Full Potential of AMD Instinct for LLM Inference
I support any progress to erode the Nvidia monopoly.
That said, from what I'm seeing here, the free and open source (less other aspects of the CUDA stack, of course) TensorRT-LLM[0] almost certainly bests this implementation on the Nvidia hardware they reference for comparison.
I don't have an A6000, but as an example, with the tensorrt_llm backend for Nvidia Triton Inference Server (also free and open source) I get roughly 30 req/s with Mistral 7B on my RTX 4090, with significantly lower latency. Comparison benchmarks are tough, especially when published benchmarks like these are fairly scant on the real details.
TensorRT-LLM has only been public for a few months, and if you peruse the docs, PRs, etc., you'll see they have many more optimizations in the works.
In typical Nvidia fashion, TensorRT-LLM runs on any Nvidia card (from laptop to datacenter) going back to Turing (five-year-old cards), assuming you have the VRAM.
You can download and run this today, free and "open source" for these implementations at least. I'm extremely skeptical of the claim "MK1 Flywheel has the Best Throughput and Latency for LLM Inference on NVIDIA". You'll note they compare to vLLM, which is an excellent and incredible project, but if you look at vLLM vs Triton w/ TensorRT-LLM, the performance improvements are dramatic.
Of course it's the latest and greatest ($$$$$$ and unobtanium) but one look at H100/H200 performance[3] and you can see what happens when the vendor has a robust software ecosystem to help sell their hardware. Pay the Nvidia tax on the frontend for the hardware, get it back as a dividend on the software.
I feel like MK1 must be aware of TensorRT-LLM but of course those comparison benchmarks won't help sell their startup.
[0] - https://github.com/NVIDIA/TensorRT-LLM
[1] - https://github.com/triton-inference-server/tensorrtllm_backe...
[2] - https://mkone.ai/blog/mk1-flywheel-race-tuned-and-track-read...
[3] - https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source...
- FP8 quantized results are bad compared to int8 results
I have followed the instructions on https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama to convert the float16 Llama2 13B to FP8 and build a TensorRT-LLM engine.
- Optimum-NVIDIA - 28x faster inference in just 1 line of code !?
- Incoming: TensorRT-LLM version 0.6 with support for MoE, new models and more quantization
fiftyone
- Anomaly Detection with FiftyOne and Anomalib
pip install -U git+https://github.com/voxel51/fiftyone.git
- May 8, 2024 AI, Machine Learning and Computer Vision Meetup
In this brief walkthrough, I will illustrate how to leverage open-source FiftyOne and Anomalib to build deployment-ready anomaly detection models. First, we will load and visualize the MVTec AD dataset in the FiftyOne App. Next, we will use Albumentations to test out augmentation techniques. We will then train an anomaly detection model with Anomalib and evaluate the model with FiftyOne.
- Voxel51 Is Hiring AI Researchers and Scientists — What the New Open Science Positions Mean
My experience has been much like this. For twenty years, I’ve emphasized scientific and engineering discovery in my work as an academic researcher, publishing these findings at the top conferences in computer vision, AI, and related fields. Yet, at my company, we focus on infrastructure that enables others to unlock scientific discovery. We have built a software framework that enables its users to do better work when training models and curating datasets with large unstructured, visual data — it’s kind of like a PyTorch++ or a Snowflake for unstructured data. This software stack, called FiftyOne in its single-user open source incarnation and FiftyOne Teams in its collaborative enterprise version, has garnered millions of installations and a vibrant user community.
- How to Estimate Depth from a Single Image
We will use the Hugging Face transformers and diffusers libraries for inference, FiftyOne for data management and visualization, and scikit-image for evaluation metrics.
- How to Cluster Images
With all that background out of the way, let’s turn theory into practice and learn how to use clustering to structure our unstructured data. We’ll be leveraging two open-source machine learning libraries: scikit-learn, which comes pre-packaged with implementations of most common clustering algorithms, and fiftyone, which streamlines the management and visualization of unstructured data:
- Efficiently Managing and Querying Visual Data With MongoDB Atlas Vector Search and FiftyOne
FiftyOne is the leading open-source toolkit for the curation and visualization of unstructured data, built on top of MongoDB. It leverages the non-relational nature of MongoDB to provide an intuitive interface for working with datasets consisting of images, videos, point clouds, PDFs, and more.
- FiftyOne Computer Vision Tips and Tricks - March 15, 2024
Welcome to our weekly FiftyOne tips and tricks blog where we recap interesting questions and answers that have recently popped up on Slack, GitHub, Stack Overflow, and Reddit.
- FLaNK AI for 11 March 2024
- How to Build a Semantic Search Engine for Emojis
If you want to perform emoji searches locally with the same visual interface, you can do so with the Emoji Search plugin for FiftyOne.
- FLaNK Stack Weekly for 07August2023
What are some alternatives?
ChatRTX - A developer reference project for creating Retrieval Augmented Generation (RAG) chatbots on Windows using TensorRT-LLM
caer - High-performance Vision library in Python. Scale your research, not boilerplate.
gpt-fast - Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
pytorch-lightning - Build high-performance AI models with PyTorch Lightning (organized PyTorch). Deploy models with Lightning Apps (organized Python to build end-to-end ML systems). [Moved to: https://github.com/Lightning-AI/lightning]
optimum-nvidia
ZnTrack - Create, visualize, run & benchmark DVC pipelines in Python & Jupyter notebooks.
stable-fast - Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
Serpent.AI - Game Agent Framework. Helping you create AIs / Bots that learn to play any game you own!
tensorrtllm_backe
streamlit - Streamlit — A faster way to build and share data apps.
daytona - The Open Source Dev Environment Manager.
anomalib - An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.