[D] What pdf parser do you use for paragraph parsing for huggingface models

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • Parsr

    Transforms PDF, Documents and Images into Enriched Structured Data

  • Parsing PDFs is a very non-trivial process. Google's and Amazon's parsers are largely based on OCR. There are some advanced state-of-the-art NN-based OCR approaches, but they are not very stable. The stable industry standard is Tesseract, and a nice all-in-one open source tool that brings a ton of tools together is https://github.com/axa-group/Parsr . Hope this helps.

  • grobid

    A machine learning software for extracting information from scholarly documents

  • A few years ago I evaluated a few open source tools and in the end focused on GROBID. As usual, whether it works well for your use case depends on the type of document. There is some focus on it being "fast" (if that is a concern).
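
Since the thread is about getting paragraphs out of PDFs for Hugging Face models, here is a minimal sketch of the step after GROBID has run. GROBID returns its results as TEI XML, so paragraph extraction can be as simple as collecting the `<p>` elements in the document body. The function name `tei_paragraphs` and the `SAMPLE_TEI` string are hypothetical and hand-written for illustration; real GROBID output has more structure and may need extra filtering (figures, formulas, footnotes).

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI XML documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hand-written stand-in for a TEI document (assumption: not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Example</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head>
      <p>Parsing PDFs is hard <ref type="bibr">[1]</ref>.</p>
      <p>Layout
         varies a lot.</p>
    </div>
  </body></text>
</TEI>"""


def tei_paragraphs(tei_xml):
    """Return the text of each <p> in the TEI body, whitespace-normalized."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//{%s}body" % TEI_NS)
    if body is None:
        return []
    paragraphs = []
    for p in body.iter("{%s}p" % TEI_NS):
        # itertext() also picks up text inside inline children such as <ref>.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    for paragraph in tei_paragraphs(SAMPLE_TEI):
        print(paragraph)
```

Each returned string is one paragraph, which can then be passed to a tokenizer or pipeline of your choice. Section headings (`<head>`) are skipped on purpose here; whether to keep them depends on the downstream model.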

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • Issue getting Parsr GUI up and running

    1 project | /r/docker | 13 Sep 2023
  • Grobid – ML software for extracting information from scholarly documents

    1 project | news.ycombinator.com | 21 Apr 2023
  • Converting PDF into HTML: is it possible?

    2 projects | /r/AskProgramming | 3 Feb 2023
  • How to create a web app that turns academic papers into text documents

    1 project | /r/webdev | 16 Jan 2023
  • Extract research paper's references

    1 project | /r/LanguageTechnology | 1 Jan 2023