Nvidia Introduces TensorRT-LLM for Accelerating LLM Inference on H100/A100 GPUs

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • TensorRT

    NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
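
    A minimal sketch of building a TensorRT engine from an ONNX export using the TensorRT Python API; the file paths are placeholders and the FP16 flag is an illustrative assumption that pays off on GPUs with fast half-precision support (e.g. A100/H100).

      import tensorrt as trt

      # Build a serialized TensorRT engine from an ONNX export of the model.
      logger = trt.Logger(trt.Logger.WARNING)
      builder = trt.Builder(logger)
      network = builder.create_network(
          1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
      )
      parser = trt.OnnxParser(network, logger)

      with open("model.onnx", "rb") as f:      # placeholder path
          if not parser.parse(f.read()):
              for i in range(parser.num_errors):
                  print(parser.get_error(i))
              raise RuntimeError("ONNX parse failed")

      config = builder.create_builder_config()
      config.set_flag(trt.BuilderFlag.FP16)    # assumption: FP16 is beneficial on the target GPU
      engine = builder.build_serialized_network(network, config)

      with open("model.plan", "wb") as f:      # serialized engine for later deployment
          f.write(engine)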

  • https://github.com/NVIDIA/TensorRT/issues/982

    Maybe? It looks like TensorRT does work, but I couldn't find much.

  • lmdeploy

    LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
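
    A minimal sketch of running a model through LMDeploy's Python pipeline API (introduced in later releases; the model ID is a placeholder and the exact API surface varies by version).

      from lmdeploy import pipeline

      # Spin up the TurboMind-backed pipeline and run a small batch of prompts.
      pipe = pipeline("internlm/internlm2-chat-7b")   # placeholder model ID
      responses = pipe(["What is TensorRT-LLM?", "Summarize AWQ in one sentence."])
      for r in responses:
          print(r.text)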

  • vLLM has healthy competition. Not affiliated but try lmdeploy:

    https://github.com/InternLM/lmdeploy

    In my testing it’s significantly faster and more memory efficient than vLLM when configured with AWQ int4 and int8 KV cache.

    If you look at the PRs, issues, etc., you’ll see there are many more optimizations in the works. That said, there are also PRs and issues for some of the lmdeploy tricks in vllm as well (AWQ, Triton Inference Server, etc.).

    I’m really excited to see where these projects go!
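
    For reference, a hedged sketch of the setup the commenter describes (AWQ int4 weights plus a quantized KV cache) via lmdeploy's TurbomindEngineConfig; the model path is a placeholder and the meaning of quant_policy has shifted across releases, so check the docs for your version.

      # Offline weight quantization (CLI), roughly:
      #   lmdeploy lite auto_awq <hf-model> --work-dir ./model-4bit
      from lmdeploy import pipeline, TurbomindEngineConfig

      engine_cfg = TurbomindEngineConfig(
          model_format="awq",   # load the AWQ int4 weights produced above
          quant_policy=4,       # enables KV-cache quantization; exact value is version-dependent
      )
      pipe = pipeline("./model-4bit", backend_config=engine_cfg)   # placeholder path
      print(pipe(["Hello"])[0].text)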

  • vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs
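
    A minimal sketch of vLLM's offline batch API (the model ID is a placeholder; vLLM also ships an OpenAI-compatible HTTP server via python -m vllm.entrypoints.openai.api_server).

      from vllm import LLM, SamplingParams

      # Continuous-batching engine; PagedAttention manages the KV cache under the hood.
      llm = LLM(model="meta-llama/Llama-2-7b-hf")      # placeholder model ID
      params = SamplingParams(temperature=0.8, max_tokens=64)

      outputs = llm.generate(["The capital of France is"], params)
      for out in outputs:
          print(out.prompt, "->", out.outputs[0].text)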

    lmdeploy and vllm have custom backends for Nvidia Triton Inference Server, which then actually serves up the models. lmdeploy is a little more mature, as it essentially uses Triton by default, but I expect vllm to come along quickly: Triton Inference Server has been the "go-to" for high-scale, high-performance model serving for years, for a variety of reasons.

    It appears that we'll see TensorRT-LLM integrated into Triton Inference Server directly:

    https://github.com/vllm-project/vllm/issues/541

    How all of this will interact (vllm/lmdeploy backends + Triton + TensorRT-LLM) isn't quite clear to me right now.
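
    To make the Triton serving path concrete, a hedged sketch of calling a text-generation model behind Triton Inference Server with the official HTTP client; the model and tensor names ("ensemble", "text_input", "text_output") depend entirely on the backend's config.pbtxt and are assumptions here.

      import numpy as np
      import tritonclient.http as httpclient

      client = httpclient.InferenceServerClient(url="localhost:8000")

      # BYTES tensors carry UTF-8 strings; names/shapes must match the model's config.pbtxt.
      prompt = httpclient.InferInput("text_input", [1], "BYTES")
      prompt.set_data_from_numpy(np.array(["Explain paged attention briefly."], dtype=object))

      result = client.infer(model_name="ensemble", inputs=[prompt])   # placeholder model name
      print(result.as_numpy("text_output"))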

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number generally means a more popular project.

Suggest a related project

Related posts

  • AMD MI300X 30% higher performance than Nvidia H100, even with optimized stack

    1 project | news.ycombinator.com | 17 Dec 2023
  • Getting SDXL-turbo running with tensorRT

    1 project | /r/StableDiffusion | 6 Dec 2023
  • [P] Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec

    4 projects | /r/MachineLearning | 23 Nov 2021
  • Hugging Face reverts the license back to Apache 2.0

    1 project | news.ycombinator.com | 8 Apr 2024
  • Experimental Mixtral MoE on vLLM!

    2 projects | /r/LocalLLaMA | 10 Dec 2023