Show HN: Open-source Rule-based PDF parser for RAG

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • llmsherpa

    Developer APIs to Accelerate LLM Projects

  • I wrote about split points and the need for including section hierarchy in this post: https://ambikasukla.substack.com/p/efficient-rag-with-docume...

    All this is automated in the llmsherpa parser (https://github.com/nlmatics/llmsherpa), which you can use as an API over this library.

  • nlm-ingestor

    This repo provides the server-side code for the llmsherpa API to connect to. It includes parsers for various file formats.

  • Here's another notebook from the repo with examples: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...

  • txtai

    💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

  • Nice project! I've long used Tika for document parsing given its maturity and the wide range of formats it supports. The XHTML output helps with chunking documents for RAG.

    Here are a couple of examples:

    - https://neuml.hashnode.dev/build-rag-pipelines-with-txtai

    - https://neuml.hashnode.dev/extract-text-from-documents

    Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).

  • grobid

    A machine learning software for extracting information from scholarly documents

  • paperetl

    📄 ⚙️ ETL processes for medical and scientific papers
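
The comments above make two related points: chunks for RAG should carry their section hierarchy, and XHTML output (such as Tika produces) makes chunking easier. As a rough illustration of how those ideas combine, here is a minimal sketch using only the Python standard library. It is not the llmsherpa, nlm-ingestor, Tika, or txtai implementation. The names `SectionChunker` and `chunk_xhtml` are hypothetical, and the input is assumed to be simple XHTML with `<h1>` to `<h6>` headings and `<p>` paragraphs; real parser output may be structured differently.

```python
from html.parser import HTMLParser


class SectionChunker(HTMLParser):
    """Hypothetical helper: one chunk per <p>, prefixed with its heading path."""

    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.path = []     # (level, title) pairs for the currently open sections
        self.chunks = []   # finished chunks, in document order
        self._tag = None   # heading or paragraph tag currently being captured
        self._buf = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "p":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as </b> do not end the capture
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        self._tag = None
        if not text:
            return
        if tag in self.HEADINGS:
            level = self.HEADINGS[tag]
            # A new heading closes every section at the same or a deeper level.
            while self.path and self.path[-1][0] >= level:
                self.path.pop()
            self.path.append((level, text))
        else:
            breadcrumb = " > ".join(title for _, title in self.path)
            self.chunks.append(f"{breadcrumb}\n{text}" if breadcrumb else text)


def chunk_xhtml(xhtml):
    """Return a list of paragraph chunks, each carrying its section hierarchy."""
    parser = SectionChunker()
    parser.feed(xhtml)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = (
        "<html><body>"
        "<h1>Report</h1><p>Overview text.</p>"
        "<h2>Methods</h2><p>We parse <b>PDF</b> files.</p>"
        "</body></html>"
    )
    for chunk in chunk_xhtml(sample):
        print(chunk)
        print("---")
```

Prefixing each paragraph with its heading path is one simple way to keep context that would otherwise be lost when a document is split at paragraph boundaries. A real pipeline would likely also handle lists and tables and merge short paragraphs.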
