Marker: Convert PDF to Markdown quickly with high accuracy

marker

9 9,303 8.8 Python

Convert PDF to markdown quickly with high accuracy

tesseract-ocr

123 58,776 9.1 C++

Tesseract Open Source OCR Engine (main repository)

The last update was fairly recent, and the git repo lists Tesseract 5 as a dependency, so it has likely moved on a bit since you last tried it:
https://github.com/tesseract-ocr/tesseract/releases
I suppose it depends on your use case. For personal tasks like this it should be more than sufficient, and it doesn't require user details, a credit card, or anything like that to use.

libgen_to_txt

1 224 5.9 Python

Convert all of libgen to high quality markdown

Author here - this is one of the reasons I made this. Also see https://github.com/VikParuchuri/libgen_to_txt , although I haven't integrated marker with it yet (it uses naive text extraction).

pdfcpu

30 6,396 9.1 Go

A PDF processor written in Go.

I can report that the closest I've come before is with PDFMiner (https://pypi.org/project/pdfminer/) for Python. Its benefit is that it retains styling information, so italics and the like can be preserved, at least with some post-processing (I think one might need to convert certain CSS classes to actual HTML tags).
The other option I have started looking into is the pdfcpu library for Go. It is a bit more low-level than PDFMiner, but it yields very well-structured information that looks like it could be post-processed quite well for one's particular use case and PDF layouts: https://github.com/pdfcpu/pdfcpu
I have also now tried the Marker tool from the original post, and it seems to do a reasonable job. It did intermingle some columns, though, at least in some tricky cases, such as when a round image sat between the two columns. One note is that Marker doesn't seem to retain styling like italics.
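
The post-processing idea mentioned for PDFMiner output (turning styled spans into real emphasis tags) could look roughly like the sketch below. It is only an illustration under assumptions: the input format (a `<span>` whose inline style names a font containing "Italic" or "Oblique"), the helper name `spans_to_em`, and the choice of `<em>` as the target tag are all hypothetical, and real converter output may be structured differently.

```python
import re

# Assumption: italic text appears as <span> elements whose inline style
# names a font containing "Italic" or "Oblique". Real output may differ.
ITALIC_SPAN = re.compile(
    r'<span\s+style="[^"]*font-family:[^";]*(?:Italic|Oblique)[^"]*">(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)


def spans_to_em(html: str) -> str:
    """Replace italic-font spans with <em> tags; leave other spans untouched."""
    return ITALIC_SPAN.sub(r"<em>\1</em>", html)


# Hypothetical sample input: one regular span followed by one italic span.
sample = (
    '<span style="font-family: Times-Roman; font-size:10px">See </span>'
    '<span style="font-family: Times-Italic; font-size:10px">Nature</span>'
)
print(spans_to_em(sample))
```

A regular expression is enough for flat, non-nested spans like these; nested spans would need a proper HTML parser instead.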

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

Multimodal AI: Bridging the Gap Between Human and Machine Understanding
1 project | dev.to | 14 May 2024

Highlighting Image Text
1 project | dev.to | 30 Apr 2024

one of the Codia AI Design technologies: OCR Technology
1 project | dev.to | 14 Feb 2024

OCR text to speech for disability
1 project | /r/AskProgramming | 10 Dec 2023

How to Read Text From an Image with Python
1 project | dev.to | 23 Oct 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Image processing PDF Tesseract Go tesseract-ocr
Post date: 30 Nov 2023
