Top 16 Python Retrieval Projects

mteb

2 1,448 9.8 Python

MTEB: Massive Text Embedding Benchmark

Project mention: AI for AWS Documentation | news.ycombinator.com | 2023-07-06

RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfere with context boundaries
- Content vectors can differ vastly from question vectors; for this you have to use hypothetical embeddings (they generate artificial questions and store them)
- Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical embedding questions, metadata)
- RAG will fail miserably with requests like "summarize the whole document"
- To my knowledge, OpenAI embeddings aren't performing well; use an embedding that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb
[1] https://github.com/underlines/awesome-marketing-datascience/...
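
The "several embeddings per chunk" idea from the comment above can be sketched roughly as follows. This is a minimal illustration, not code from any project on this list: `embed` is a toy bag-of-words stand-in for a real embedding model, `ChunkRecord` and `search` are hypothetical names, and the hypothetical questions are written by hand here, where a real pipeline would presumably generate them with an LLM.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class ChunkRecord:
    """One stored chunk: its text, hypothetical questions, and metadata."""
    text: str
    questions: list
    metadata: dict = field(default_factory=dict)
    vectors: list = field(default_factory=list, init=False)

    def __post_init__(self):
        # One vector for the chunk itself plus one per hypothetical question.
        self.vectors = [embed(self.text)] + [embed(q) for q in self.questions]


def search(query: str, records: list, top_k: int = 1) -> list:
    """Rank chunks by their best-matching vector (content or question)."""
    q = embed(query)
    ranked = sorted(
        records,
        key=lambda r: max(cosine(q, v) for v in r.vectors),
        reverse=True,
    )
    return ranked[:top_k]


records = [
    ChunkRecord(
        text="Objects in the bucket are encrypted at rest with SSE-S3 by default.",
        questions=["How do I turn on encryption for my storage?"],
        metadata={"source": "storage-guide", "chunk": 0},
    ),
    ChunkRecord(
        text="Instances can be stopped from the console or the command line.",
        questions=["How do I shut down a virtual machine?"],
        metadata={"source": "compute-guide", "chunk": 3},
    ),
]

query = "how to turn on encryption for storage"
best = search(query, records)[0]
print(best.metadata["source"])
```

In this toy setup the query shares no terms with the content of the storage chunk, so the match comes entirely from the stored question vector, which is the gap between content vectors and question vectors that the comment describes.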

beir

8 1,407 4.2 Python

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

Project mention: On building a semantic search engine | news.ycombinator.com | 2024-01-06

The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard

R2R

4 1,232 9.7 Python

The framework for fast development and deployment of RAG systems.

Project mention: Show HN: Ellipsis – Automated PR reviews and bug fixes | news.ycombinator.com | 2024-05-09

Hi HN, hunterbrooks and nbrad here from Ellipsis (https://www.ellipsis.dev). Ellipsis automatically reviews your PRs when opened and on each new commit. If you tag @ellipsis-dev in a comment, it can make changes to the PR (via direct commit or side PR) and answer questions, just like a human.
Demo video: https://www.youtube.com/watch?v=X61NGZpaNQA
So far, we have dozens of open source projects and companies using Ellipsis. We seem to have landed in a kind of sweet spot where there’s a good match between the current capabilities of AI tools and the actual needs of software engineers - this doesn’t replace human review, but it saves you time by catching/fixing lots of small silly stuff.
Here’s an example in the wild: https://github.com/relari-ai/continuous-eval/pull/38, where Ellipsis (1) adds a PR summary; (2) finds a bug and adds a review comment; (3) after a (human) user comments, generates a side PR with the fix; and (4) after a (human) user merges the side PR and adds another commit, re-reviews the PR and approves it.
Here’s another example: https://github.com/SciPhi-AI/R2R/pull/350#pullrequestreview-..., where Ellipsis adds several comments with inline suggestions that were directly merged by the developer.
You can configure Ellipsis in natural language to enforce custom rules, style guides, or conventions. For example, here’s how the `jxnl/instructor` repo uses natural language rules to make sure that docs are kept in sync: https://github.com/jxnl/instructor/blob/main/ellipsis.yaml#L..., and here’s an example PR that Ellipsis came up with based on those rules: https://github.com/jxnl/instructor/pull/346.
Don’t worry, your code is never stored or used to train models (https://docs.ellipsis.dev/security).
Installing into your repo takes 2 clicks at https://www.ellipsis.dev. We’d really appreciate your feedback, thoughts, and ideas!

RETRO-pytorch

2 832 2.8 Python

Implementation of RETRO, Deepmind's Retrieval based Attention net, in Pytorch
fastembed

4 822 9.5 Python

Fast, Accurate, Lightweight Python library to make State of the Art Embedding

Project mention: FastLLM by Qdrant – lightweight LLM tailored For RAG | news.ycombinator.com | 2024-04-01

NeumAI

2 785 8.7 Python

Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21

Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. Asks GPT for the python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...

memorizing-transformers-pytorch

5 614 2.6 Python

Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch

Project mention: What can LLMs never do? | news.ycombinator.com | 2024-04-27

At one point I experimented a little with transformers that had access to external memory searchable via KNN lookups https://github.com/lucidrains/memorizing-transformers-pytorc... or via routed queries with https://github.com/glassroom/heinsen_routing . Both approaches seemed to work for me, but I had to put that work on hold for reasons outside my control.
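
As a rough illustration of what "external memory searchable via KNN lookups" can mean, here is a minimal sketch. It is not the API of memorizing-transformers-pytorch: `KNNMemory` is a hypothetical class, and it uses exact brute-force search over plain Python lists where a real system would typically use an approximate nearest neighbor index over tensors.

```python
import math


class KNNMemory:
    """Hypothetical external memory holding (key, value) vector pairs."""

    def __init__(self):
        self.keys = []
        self.values = []

    def add(self, key, value):
        """Store one key vector together with its value vector."""
        self.keys.append(list(key))
        self.values.append(list(value))

    def lookup(self, query, k=2):
        """Softmax-weighted average of the values of the k best-matching keys."""
        if not self.keys:
            raise ValueError("memory is empty")
        # Exact dot-product scores against every key (brute force, not ANN).
        scores = [sum(q * x for q, x in zip(query, key)) for key in self.keys]
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        # Subtract the largest score before exponentiating for stability.
        peak = max(scores[i] for i in top)
        weights = [math.exp(scores[i] - peak) for i in top]
        total = sum(weights)
        dim = len(self.values[0])
        return [
            sum(w * self.values[i][d] for w, i in zip(weights, top)) / total
            for d in range(dim)
        ]


memory = KNNMemory()
memory.add([1.0, 0.0], [10.0, 0.0])
memory.add([0.0, 1.0], [0.0, 10.0])
memory.add([-1.0, 0.0], [-10.0, 0.0])

nearest = memory.lookup([1.0, 0.0], k=1)   # only the single closest memory
blended = memory.lookup([1.0, 0.0], k=2)   # mix of the two closest memories
print(nearest)
print(blended)
```

With `k=1` the lookup returns the value of the closest key unchanged; with larger `k` it blends the retrieved values, weighting closer keys more heavily.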

searchGPT

3 570 7.2 Python

Grounded search engine (i.e. with source reference) based on LLM / ChatGPT / OpenAI API. It supports web search, file content search etc.
raptor

3 491 6.6 Python

The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Project mention: Show HN: A phone number to text with questions about current events | news.ycombinator.com | 2024-05-10

Hi HN! For my senior thesis in CS, I built an SMS-based application to make journalism more accessible. It works like this:
1) You text the topics you're interested in to my phone number. Every day, you'll receive a text with 5 headlines from The Associated Press (https://apnews.com/) related to those topics.
2) If you have questions about any of the current events the headlines describe, you just text them back. A response is generated from the contents of the articles using the RAPTOR retrieval framework (https://github.com/parthsarthi03/raptor) and texted right back to you.
The repo can be found here: https://github.com/tdh15/pressText
I'd really appreciate any and all feedback. Whatever you got, I'd love to hear it :)

cherche

12 313 4.4 Python

Neural Search
ACT

5 127 8.8 Python

Atmospheric data Community Toolkit - A python based toolkit for exploring and analyzing time series atmospheric datasets (by ARM-DOE)
icl-ceil

1 81 1.7 Python

[ICML 2023] Code for our paper “Compositional Exemplars for In-context Learning”.
retomaton

1 64 0.0 Python

PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022)
ragswift

1 33 8.0 Python

🚀 Scale your RAG pipeline using Ragswift: A scalable centralized embeddings management platform

Project mention: Show HN: Ragswift – Scalable embeddings platform powered by distributed compute | news.ycombinator.com | 2024-01-22

SHREC2023-ANIMAR

1 6 6.6 Python

Source codes of team TikTorch (1st place solution) for track 2 and 3 of the SHREC2023 Challenge
FloridaPropertyData

1 2 4.3 Python

A Python-based tool for retrieving and processing property data for specific counties in Florida using Parcel ID numbers. Simplifies data retrieval and offers customization options for real estate agents, investors, and government officials.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Retrieval related posts

Show HN: A phone number to text with questions about current events

2 projects | news.ycombinator.com | 10 May 2024
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

1 project | news.ycombinator.com | 30 Apr 2024
[D] Any pre trained retrieval based language models available?

3 projects | /r/MachineLearning | 22 Oct 2022

Index

What are some of the best open-source Retrieval projects in Python? This list will help you:

#	Project	Stars
1	mteb	1,448
2	beir	1,407
3	R2R	1,232
4	RETRO-pytorch	832
5	fastembed	822
6	NeumAI	785
7	memorizing-transformers-pytorch	614
8	searchGPT	570
9	raptor	491
10	cherche	313
11	ACT	127
12	icl-ceil	81
13	retomaton	64
14	ragswift	33
15	SHREC2023-ANIMAR	6
16	FloridaPropertyData	2