Top 16 Python Retrieval Projects
-
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
-
NeumAI
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
-
memorizing-transformers-pytorch
Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch
-
searchGPT
Grounded search engine (i.e. with source reference) based on LLM / ChatGPT / OpenAI API. It supports web search, file content search etc.
-
raptor
The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
-
ACT
Atmospheric data Community Toolkit - A python based toolkit for exploring and analyzing time series atmospheric datasets (by ARM-DOE)
-
retomaton
PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022)
-
ragswift
🚀 Scale your RAG pipeline using Ragswift: A scalable centralized embeddings management platform
-
SHREC2023-ANIMAR
Source codes of team TikTorch (1st place solution) for track 2 and 3 of the SHREC2023 Challenge
-
FloridaPropertyData
A Python-based tool for retrieving and processing property data for specific counties in Florida using Parcel ID numbers. Simplifies data retrieval and offers customization options for real estate agents, investors, and government officials.
RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfere with context boundaries
- Content vectors can differ vastly from question vectors; for this you have to use hypothetical embeddings (generate artificial questions for each chunk and store their embeddings)
- Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical question embeddings, metadata)
- RAG will fail miserably with requests like "summarize the whole document"
- To my knowledge, OpenAI embeddings don't perform well here; use an embedding model that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb
[1] https://github.com/underlines/awesome-marketing-datascience/...
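The chunking and multi-vector points above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_questions` is a hypothetical stand-in for an LLM that writes the questions a chunk answers, and the document text and `faq.txt` source name are made up.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk_sentences(text, size=2, overlap=1):
    """Split on sentence boundaries and overlap neighbours so context is not cut mid-thought."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = max(1, size - overlap)
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i:i + size]))
        if i + size >= len(sentences):
            break
    return chunks


def build_index(chunks, questions_for, source):
    """Store several vectors per chunk: the chunk text plus each hypothetical question."""
    index = []
    for chunk_id, chunk in enumerate(chunks):
        meta = {"chunk_id": chunk_id, "source": source}
        index.append((embed(chunk), "text", chunk, meta))
        for question in questions_for(chunk):
            index.append((embed(question), "question", chunk, meta))
    return index


def search(index, query, k=1):
    """Rank every stored vector, then return the distinct chunks they point to."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[0]), reverse=True)
    seen, results = set(), []
    for _, kind, chunk, meta in ranked:
        if meta["chunk_id"] not in seen:
            seen.add(meta["chunk_id"])
            results.append((chunk, kind, meta))
        if len(results) == k:
            break
    return results


def fake_questions(chunk):
    # Hypothetical stand-in for an LLM call that generates questions per chunk.
    canned = {"warranty": "How long am I covered if the device breaks?",
              "Shipping": "When will my order arrive?"}
    return [q for key, q in canned.items() if key in chunk]


doc = ("The warranty lasts two years. It covers manufacturing defects. "
       "Shipping takes five days. Returns are free within a month.")
chunks = chunk_sentences(doc)
index = build_index(chunks, fake_questions, source="faq.txt")
hits = search(index, "When will my order arrive?")
```

The query shares no words with the chunk text itself, so in this toy setup only the stored question vector can match it, which is the gap between content vectors and question vectors that the comment describes.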
The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
Project mention: Show HN: Ellipsis – Automated PR reviews and bug fixes | news.ycombinator.com | 2024-05-09
Hi HN, hunterbrooks and nbrad here from Ellipsis (https://www.ellipsis.dev). Ellipsis automatically reviews your PRs when opened and on each new commit. If you tag @ellipsis-dev in a comment, it can make changes to the PR (via direct commit or side PR) and answer questions, just like a human.
Demo video: https://www.youtube.com/watch?v=X61NGZpaNQA
So far, we have dozens of open source projects and companies using Ellipsis. We seem to have landed in a kind of sweet spot where there’s a good match between the current capabilities of AI tools and the actual needs of software engineers - this doesn’t replace human review, but it saves you time by catching/fixing lots of small silly stuff.
Here’s an example in the wild: https://github.com/relari-ai/continuous-eval/pull/38, where Ellipsis (1) adds a PR summary; (2) finds a bug and adds a review comment; (3) after a (human) user comments, generates a side PR with the fix; and (4) after a (human) user merges the side PR and adds another commit, re-reviews the PR and approves it.
Here’s another example: https://github.com/SciPhi-AI/R2R/pull/350#pullrequestreview-..., where Ellipsis adds several comments with inline suggestions that were directly merged by the developer.
You can configure Ellipsis in natural language to enforce custom rules, style guides, or conventions. For example, here’s how the `jxnl/instructor` repo uses natural language rules to make sure that docs are kept in sync: https://github.com/jxnl/instructor/blob/main/ellipsis.yaml#L..., and here’s an example PR that Ellipsis came up with based on those rules: https://github.com/jxnl/instructor/pull/346.
Don’t worry, your code is never stored or used to train models (https://docs.ellipsis.dev/security).
Installing into your repo takes 2 clicks at https://www.ellipsis.dev. We’d really appreciate your feedback, thoughts, and ideas!
Project mention: FastLLM by Qdrant – lightweight LLM tailored For RAG | news.ycombinator.com | 2024-04-01
Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21
Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. It asks GPT for the Python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...
At one point I experimented a little with transformers that had access to external memory searchable via KNN lookups https://github.com/lucidrains/memorizing-transformers-pytorc... or via routed queries with https://github.com/glassroom/heinsen_routing . Both approaches seemed to work for me, but I had to put that work on hold for reasons outside my control.
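As a rough illustration of what a KNN lookup into external memory can look like, here is a minimal sketch. It is not the code from either linked repository: it uses exact nearest-neighbour search over a small list instead of an approximate index, and the function name `knn_memory_attention` and the example vectors are assumptions made for illustration.

```python
import math


def knn_memory_attention(query, keys, values, k=2):
    """Attend only over the k memory entries whose keys best match the query."""
    # Dot-product similarity between the query and every stored key.
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # Exact top-k selection (a real system would use an approximate index).
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Scaled softmax over the retrieved scores only.
    scale = math.sqrt(len(query))
    logits = [scores[i] / scale for i in top]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the retrieved values.
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return top, output


memory_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
memory_values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
retrieved, attended = knn_memory_attention([1.0, 0.0], memory_keys, memory_values, k=2)
```

Restricting the softmax to the retrieved entries keeps the cost of each lookup tied to `k` and not to the total memory size, which is the usual motivation for pairing attention with nearest-neighbour retrieval.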
Project mention: Show HN: A phone number to text with questions about current events | news.ycombinator.com | 2024-05-10
Hi HN! For my senior thesis in CS, I built an SMS-based application to make journalism more accessible. It works like this:
1) You text the topics you're interested in to my phone number. Every day, you'll receive a text with 5 headlines from The Associated Press (https://apnews.com/) related to those topics.
2) If you have questions about any of the current events the headlines describe, you just text them back. A response is generated from the contents of the articles using the RAPTOR retrieval framework (https://github.com/parthsarthi03/raptor) and texted right back to you.
The repo can be found here: https://github.com/tdh15/pressText
I'd really appreciate any and all feedback. Whatever you got, I'd love to hear it :)
Project mention: Show HN: Ragswift – Scalable embeddings platform powered by distributed compute | news.ycombinator.com | 2024-01-22
Index
What are some of the best open-source Retrieval projects in Python? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | mteb | 1,448 |
2 | beir | 1,407 |
3 | R2R | 1,232 |
4 | RETRO-pytorch | 832 |
5 | fastembed | 822 |
6 | NeumAI | 785 |
7 | memorizing-transformers-pytorch | 614 |
8 | searchGPT | 570 |
9 | raptor | 491 |
10 | cherche | 313 |
11 | ACT | 127 |
12 | icl-ceil | 81 |
13 | retomaton | 64 |
14 | ragswift | 33 |
15 | SHREC2023-ANIMAR | 6 |
16 | FloridaPropertyData | 2 |