SemanticSlicer vs nlm-ingestor
Metric | SemanticSlicer | nlm-ingestor
---|---|---
Mentions | 1 | 3
Stars | 7 | 823
Growth | - | 12.0%
Activity | 7.5 | 7.1
Last commit | 5 months ago | 23 days ago
Language | C# | Python
License | MIT License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
SemanticSlicer
- Pg_vectorize: The simplest way to do vector search and RAG on Postgres
I wrote a C# library to do this, which is similar to other chunking approaches that are common, like the way langchain does it: https://github.com/drittich/SemanticSlicer
Given a list of separators (regexes), it goes through them in order and keeps splitting the text by them until the chunk fits within the desired size. By putting the higher-level separators first (e.g., for HTML, splitting on higher-level tags before lower-level ones), it's a pretty good proxy for maintaining context.
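The ordered-separator idea described in that comment can be sketched in a few lines of Python. This is only an illustration of the general technique, not SemanticSlicer's actual code or API: the function name `split_recursive`, the character-based size limit, and the hard-cut fallback are all assumptions made for the example.

```python
import re


def split_recursive(text, separators, max_len):
    """Split text using ordered regex separators until each chunk fits max_len.

    Hypothetical helper: size is measured in characters here, which is an
    assumption; a real chunker may well count tokens instead.
    """
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separators left: fall back to a hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in re.split(first, text):
        # Skip empty pieces (and None, if a pattern has capture groups).
        if piece and piece.strip():
            # Pieces that are still too long are split by the next separator.
            chunks.extend(split_recursive(piece, rest, max_len))
    return chunks


# Higher-level separator (blank line) first, then sentence boundaries.
separators = [r"\n\n", r"(?<=\.)\s+"]
text = "Intro para.\n\nSecond paragraph is longer. It has two sentences."
chunks = split_recursive(text, separators, 30)
```

Because the paragraph separator is tried before the sentence separator, a paragraph that already fits is kept whole and only oversized paragraphs are broken into sentences. The sketch does not merge small neighbouring pieces back together, which a production chunker would likely do.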
nlm-ingestor
- Pg_vectorize: The simplest way to do vector search and RAG on Postgres
>tree-based approach to organize and summarize text data, capturing both high-level and low-level details.
https://twitter.com/parthsarthi03/status/1753199233241674040
Processes documents, organizing content and improving readability by handling sections, paragraphs, links, tables, lists, and page continuations; removing redundancies and watermarks; and applying OCR. Additional support for HTML and other formats comes through Apache Tika:
https://github.com/nlmatics/nlm-ingestor
- Show HN: Open-source Rule-based PDF parser for RAG
Here's another notebook from the repo with examples: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...
What are some alternatives?
OpenAI-DotNet - A Non-Official OpenAI RESTful API Client for DotNet
llmsherpa - Developer APIs to Accelerate LLM Projects
txtai - 💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows