Python information-retrieval

Open-source Python projects categorized as information-retrieval

Top 23 Python information-retrieval Projects

  • EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

  • Project mention: I built an online PDF management platform using open-source software | news.ycombinator.com | 2024-05-12

    Ok on cleaned aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly)

    [0] https://github.com/JaidedAI/EasyOCR

  • gensim

    Topic Modelling for Humans

  • Project mention: Aggregating news from different sources | /r/learnprogramming | 2023-07-08
  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • haystack

    :mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

  • Project mention: Haystack DB – 10x faster than FAISS with binary embeddings by default | news.ycombinator.com | 2024-04-28

    I was confused for a bit but there is no relation to https://haystack.deepset.ai/

  • txtai

    💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

  • Project mention: Show HN: FileKitty – Combine and label text files for LLM prompt contexts | news.ycombinator.com | 2024-05-01
  • ragflow

    RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

  • Project mention: DeepSeek-V2 integrated, RAGFlow v0.5.0 is released | news.ycombinator.com | 2024-05-07
  • marqo

    Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

  • Project mention: Are we at peak vector database? | news.ycombinator.com | 2024-01-25

    We (Marqo) are doing a lot on 1 and 2. There is a huge amount to be done on the ML side of vector search and we are investing heavily in it. I think it has not quite sunk in that vector search systems are ML systems and everything that comes with that. I would love to chat about 1 and 2 so feel free to email me (email is in my profile). What we have done so far is here -> https://github.com/marqo-ai/marqo

  • catalyst

    Accelerated deep learning R&D (by catalyst-team)

  • Project mention: Instance segmentation of small objects in grainy drone imagery | /r/computervision | 2023-12-09
  • SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

    SaaSHub logo
  • llmware

    Providing enterprise-grade LLM-based development framework, tools, and fine-tuned models.

  • Project mention: More Agents Is All You Need: LLMs performance scales with the number of agents | news.ycombinator.com | 2024-04-06

    I couldn't agree more. You should check out LLMWare's SLIM agents (https://github.com/llmware-ai/llmware/tree/main/examples/SLI...). It's focusing on pretty much exactly this and chaining multiple local LLMs together.

    A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit incorrect) depending on what the model is indended for. The LLMWare team did a good 2 part video on this here as well (https://www.youtube.com/watch?v=7oMTGhSKuNY)

    I think dedicated miniture LLMs are the way forward.

    Disclaimer - Not affiliated with them in any way, just think it's a really cool project.

  • ranking

    Learning to Rank in TensorFlow

  • InvoiceNet

    Deep neural network to extract intelligent information from invoice documents.

  • instructor-embedding

    [ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

  • Project mention: My experience on starting with fine tuning LLMs with custom data | /r/LocalLLaMA | 2023-07-10

    If you li embeddings and vector DB, you should look into this: https://github.com/HKUNLP/instructor-embedding

  • pke

    Python Keyphrase Extraction module

  • mteb

    MTEB: Massive Text Embedding Benchmark

  • Project mention: AI for AWS Documentation | news.ycombinator.com | 2023-07-06

    RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:

    - Chunking can interfer with context boundaries

    - Content vectors can differ vastly from question vectors, for this you have to use hypothetical embeddings (they generate artificial questions and store them)

    - Instead of saving just one embedding per text-chuck you should store various (text chunk, hypothetical embedding questions, meta data)

    - RAG will miserably fail with requests like "summarize the whole document"

    - to my knowledge, openAI embeddings aren't performing well, use a embedding that is optimized for question answering or information retrieval and supports multi language. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb

    1 https://github.com/underlines/awesome-marketing-datascience/...

  • beir

    A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

  • Project mention: On building a semantic search engine | news.ycombinator.com | 2024-01-06

    The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard

  • rank_bm25

    A Collection of BM25 Algorithms in Python

  • Project mention: Building an efficient sparse keyword index in Python | dev.to | 2023-08-17

    Rank-BM25 project, the top result when searching for python bm25.

  • splade

    SPLADE: sparse neural search (SIGIR21, SIGIR22)

  • Project mention: Splade: Sparse Neural Search | news.ycombinator.com | 2024-03-11
  • RankGPT

    Is ChatGPT Good at Search? LLMs as Re-Ranking Agent [EMNLP 2023 Outstanding Paper Award]

  • Project mention: Avoiding Cascading Failure in LLM Prompt Chains | dev.to | 2023-12-29

    Finally at the end of the pipeline, I added a more expensive filtering and ranking step inspired by RankGPT to do a final ordering over the remaining video clips, picking only the top 10-15 to recommend to the user.

  • ranx

    ⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

  • Project mention: Sparse Vectors in Qdrant: Pure Vector-based Hybrid Search | dev.to | 2024-02-19

    Ranx is a great library for mixing results from different sources.

  • megabots

    🤖 State-of-the-art, production ready LLM apps made mega-easy, so you don't have to build them from scratch 🤯 Create a bot, now 🫵

  • continuous-eval

    Open-Source Evaluation for GenAI Application Pipelines

  • Project mention: Show HN: Ellipsis – Automated PR reviews and bug fixes | news.ycombinator.com | 2024-05-09

    Hi HN, hunterbrooks and nbrad here from Ellipsis (https://www.ellipsis.dev). Ellipsis automatically reviews your PRs when opened and on each new commit. If you tag @ellipsis-dev in a comment, it can make changes to the PR (via direct commit or side PR) and answer questions, just like a human.

    Demo video: https://www.youtube.com/watch?v=X61NGZpaNQA

    So far, we have dozens of open source projects and companies using Ellipsis. We seem to have landed in a kind of sweet spot where there’s a good match between the current capabilities of AI tools and the actual needs of software engineers - this doesn’t replace human review, but it saves you time by catching/fixing lots of small silly stuff.

    Here’s an example in the wild: https://github.com/relari-ai/continuous-eval/pull/38, where Ellipsis (1) adds a PR summary; (2) finds a bug and adds a review comment; (3) after a [human] user comments, generates a side PR with the fix; and (4) after a (human) user merges the side PR and adds another commit, re-reviews the PR and approves it

    Here’s another example: https://github.com/SciPhi-AI/R2R/pull/350#pullrequestreview-..., where Ellipsis adds several comments with inline suggestions that were directly merged by the developer.

    You can configure Ellipsis in natural language to enforce custom rules, style guides, or conventions. For example, here’s how the `jxnl/instructor` repo uses natural language rules to make sure that docs are kept in sync: https://github.com/jxnl/instructor/blob/main/ellipsis.yaml#L..., and here’s an example PR that Ellipsis came up with based on those rules: https://github.com/jxnl/instructor/pull/346.

    Don’t worry, your code is never stored or used to train models (https://docs.ellipsis.dev/security).

    Installing into your repo takes 2 clicks at https://www.ellipsis.dev. We’d really appreciate your feedback, thoughts, and ideas!

  • cherche

    Neural Search

  • gpl

    Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577 (by UKPLab)

  • forte

    Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/

  • SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

    SaaSHub logo
NOTE: The open source projects on this list are ordered by number of github stars. The number of mentions indicates repo mentiontions in the last 12 Months or since we started tracking (Dec 2020).

Python information-retrieval related posts

Index

What are some of the best open-source information-retrieval projects in Python? This list will help you:

Project Stars
1 EasyOCR 22,237
2 gensim 15,311
3 haystack 13,883
4 txtai 7,111
5 ragflow 7,744
6 marqo 4,189
7 catalyst 3,234
8 llmware 3,839
9 ranking 2,716
10 InvoiceNet 2,398
11 instructor-embedding 1,728
12 pke 1,531
13 mteb 1,473
14 beir 1,413
15 rank_bm25 857
16 splade 661
17 RankGPT 436
18 ranx 354
19 megabots 334
20 continuous-eval 328
21 cherche 313
22 gpl 310
23 forte 236

Sponsored
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com