[D] What pdf parser do you use for paragraph parsing for huggingface models

Parsr

7 5,679 4.6 JavaScript

Transforms PDF, Documents and Images into Enriched Structured Data

Parsing PDFs is a very non-trivial process. Google's and Amazon's parsers are largely based on OCR. There are some advanced state-of-the-art NN-based OCR approaches, but they are not very stable. The stable industry standard is Tesseract, and a nice all-in-one open-source tool that brings a ton of tools together is https://github.com/axa-group/Parsr. Hope this helps.
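Whichever OCR engine or parser produces the text, the output usually arrives as hard-wrapped lines that still have to be regrouped into paragraphs before they are fed to a model. The following is a minimal sketch of that regrouping step only, using the standard library. It assumes blank lines separate paragraphs and that a trailing hyphen marks a word split across lines. The function name `lines_to_paragraphs` is hypothetical and is not part of Tesseract or Parsr.

```python
import re


def lines_to_paragraphs(raw_text: str) -> list[str]:
    """Regroup hard-wrapped OCR text into one string per paragraph.

    Assumptions (not guaranteed for every OCR output):
    - paragraphs are separated by at least one blank line
    - a line ending in "-" is a word broken across lines
    """
    paragraphs = []
    # Split on blank lines (possibly containing stray whitespace).
    for block in re.split(r"\n\s*\n", raw_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        merged = lines[0]
        for ln in lines[1:]:
            if merged.endswith("-"):
                # Rejoin a word split across lines. Note: this also removes
                # the hyphen from genuinely hyphenated compounds.
                merged = merged[:-1] + ln
            else:
                merged += " " + ln
        paragraphs.append(merged)
    return paragraphs


sample = (
    "OCR output often breaks words, e.g. extrac-\n"
    "tion, across lines.\n"
    "\n"
    "Tesseract is a\n"
    "common OCR engine.\n"
)
paragraphs = lines_to_paragraphs(sample)
```

Each resulting string can then be tokenized on its own. Layout-aware tools may already return paragraph blocks, in which case this step is unnecessary.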

grobid

12 3,157 9.2 Java

A machine learning software for extracting information from scholarly documents

A few years ago I evaluated several open-source tools and in the end focused on GROBID. As usual, whether it works well for your use case depends on the type of document. The project also puts some focus on being "fast" (if that is a concern).
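GROBID returns its full-text results as TEI XML, so paragraph extraction largely comes down to walking the `<p>` elements inside the document `<body>`. The sketch below shows one way to do that with the standard library. It assumes the TEI namespace `http://www.tei-c.org/ns/1.0`, and the inline `SAMPLE_TEI` string is a hand-written stand-in for real GROBID output, not something produced by the tool. The function name `tei_paragraphs` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei_paragraphs(tei_xml: str) -> list[str]:
    """Return the text of each <p> inside the TEI <body>, whitespace-normalized."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI_NS}}}body")
    if body is None:
        return []
    paragraphs = []
    for p in body.iter(f"{{{TEI_NS}}}p"):
        # itertext() also collects text inside inline children such as <ref>.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


# Hand-written stand-in for GROBID output (assumption, for illustration only).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Demo</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Intro</head>
      <p>First paragraph with a <ref type="bibr">[1]</ref> citation.</p>
      <p>Second
         paragraph.</p>
    </div>
  </body></text>
</TEI>"""

paragraphs = tei_paragraphs(SAMPLE_TEI)
```

Restricting the search to `<body>` skips the abstract and header metadata. Whether that is desirable depends on what the downstream model should see.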

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

Issue getting Parsr GUI up and running
1 project | /r/docker | 13 Sep 2023

Grobid – ML software for extracting information from scholarly documents
1 project | news.ycombinator.com | 21 Apr 2023

Converting PDF into HTML: is it possble?
2 projects | /r/AskProgramming | 3 Feb 2023

How to create a web app that turns academic papers into text documents
1 project | /r/webdev | 16 Jan 2023

Extract research paper`s references
1 project | /r/LanguageTechnology | 1 Jan 2023

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning.
Tags: PDF, Machine Learning, parsr, scientific-articles, Images
Post date: 13 Jul 2021
