Nvidia Introduces TensorRT-LLM for Accelerating LLM Inference on H100/A100 GPUs

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • TensorRT

    NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
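
    A minimal sketch of building a TensorRT engine from an ONNX export using the TensorRT Python API; the file paths are placeholders and the FP16 flag is an illustrative assumption that pays off on GPUs with fast half-precision support (e.g. A100/H100).

      import tensorrt as trt

      # Build a serialized TensorRT engine from an ONNX export of the model.
      logger = trt.Logger(trt.Logger.WARNING)
      builder = trt.Builder(logger)
      network = builder.create_network(
          1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
      )
      parser = trt.OnnxParser(network, logger)

      with open("model.onnx", "rb") as f:      # placeholder path
          if not parser.parse(f.read()):
              for i in range(parser.num_errors):
                  print(parser.get_error(i))
              raise RuntimeError("ONNX parse failed")

      config = builder.create_builder_config()
      config.set_flag(trt.BuilderFlag.FP16)    # assumption: FP16 is beneficial on the target GPU
      engine = builder.build_serialized_network(network, config)

      with open("model.plan", "wb") as f:      # serialized engine for later deployment
          f.write(engine)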

  • https://github.com/NVIDIA/TensorRT/issues/982

    Maybe? It looks like TensorRT does work, but I couldn't find much.

  • lmdeploy

    LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
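
    A minimal sketch of running a model through LMDeploy's Python pipeline API (introduced in later releases; the model ID is a placeholder and the exact API surface varies by version).

      from lmdeploy import pipeline

      # Spin up the TurboMind-backed pipeline and run a small batch of prompts.
      pipe = pipeline("internlm/internlm2-chat-7b")   # placeholder model ID
      responses = pipe(["What is TensorRT-LLM?", "Summarize AWQ in one sentence."])
      for r in responses:
          print(r.text)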

  • vLLM has healthy competition. Not affiliated but try lmdeploy:

    https://github.com/InternLM/lmdeploy

    In my testing it’s significantly faster and more memory efficient than vLLM when configured with AWQ int4 and int8 KV cache.

    If you look at the PRs, issues, etc., you’ll see there are many more optimizations in the works. That said, there are also PRs and issues for some of the lmdeploy tricks in vllm as well (AWQ, Triton Inference Server, etc.).

    I’m really excited to see where these projects go!
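
    For reference, a hedged sketch of the setup the commenter describes (AWQ int4 weights plus a quantized KV cache) via lmdeploy's TurbomindEngineConfig; the model path is a placeholder and the meaning of quant_policy has shifted across releases, so check the docs for your version.

      # Offline weight quantization (CLI), roughly:
      #   lmdeploy lite auto_awq <hf-model> --work-dir ./model-4bit
      from lmdeploy import pipeline, TurbomindEngineConfig

      engine_cfg = TurbomindEngineConfig(
          model_format="awq",   # load the AWQ int4 weights produced above
          quant_policy=4,       # enables KV-cache quantization; exact value is version-dependent
      )
      pipe = pipeline("./model-4bit", backend_config=engine_cfg)   # placeholder path
      print(pipe(["Hello"])[0].text)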

  • vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs
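
    A minimal sketch of vLLM's offline batch API (the model ID is a placeholder; vLLM also ships an OpenAI-compatible HTTP server via python -m vllm.entrypoints.openai.api_server).

      from vllm import LLM, SamplingParams

      # Continuous-batching engine; PagedAttention manages the KV cache under the hood.
      llm = LLM(model="meta-llama/Llama-2-7b-hf")      # placeholder model ID
      params = SamplingParams(temperature=0.8, max_tokens=64)

      outputs = llm.generate(["The capital of France is"], params)
      for out in outputs:
          print(out.prompt, "->", out.outputs[0].text)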

    lmdeploy and vllm have custom backends for Nvidia Triton Inference Server, which then actually serves up the models. lmdeploy is a little more mature, as it essentially uses Triton by default, but I expect vllm to come along quickly: Triton Inference Server has been the "go-to" for high-scale, high-performance model serving for years, for a variety of reasons.

    It appears that we'll see TensorRT-LLM integrated into Triton Inference Server directly:

    https://github.com/vllm-project/vllm/issues/541

    How all of this will interact (vllm/lmdeploy backends + Triton + TensorRT-LLM) isn't quite clear to me right now.
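
    To make the Triton serving path concrete, a hedged sketch of calling a text-generation model behind Triton Inference Server with the official HTTP client; the model and tensor names ("ensemble", "text_input", "text_output") depend entirely on the backend's config.pbtxt and are assumptions here.

      import numpy as np
      import tritonclient.http as httpclient

      client = httpclient.InferenceServerClient(url="localhost:8000")

      # BYTES tensors carry UTF-8 strings; names/shapes must match the model's config.pbtxt.
      prompt = httpclient.InferInput("text_input", [1], "BYTES")
      prompt.set_data_from_numpy(np.array(["Explain paged attention briefly."], dtype=object))

      result = client.infer(model_name="ensemble", inputs=[prompt])   # placeholder model name
      print(result.as_numpy("text_output"))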

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number generally means a more popular project.

Suggest a related project

Related posts

  • AMD MI300X 30% higher performance than Nvidia H100, even with optimized stack

    1 project | news.ycombinator.com | 17 Dec 2023
  • Getting SDXL-turbo running with tensorRT

    1 project | /r/StableDiffusion | 6 Dec 2023
  • [P] Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec

    4 projects | /r/MachineLearning | 23 Nov 2021
  • Hugging Face reverts the license back to Apache 2.0

    1 project | news.ycombinator.com | 8 Apr 2024
  • Experimental Mixtral MoE on vLLM!

    2 projects | /r/LocalLLaMA | 10 Dec 2023