Top 23 OCR Open-Source Projects

tesseract-ocr

125 58,936 9.1 C++

Tesseract Open Source OCR Engine (main repository)

Project mention: OCR with tesseract, python and pytesseract | dev.to | 2024-06-04

If you want to learn more visit the complete tesseract documentation.

PaddleOCR

61 39,374 9.0 Python

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

Project mention: Ask HN: I have many PDFs – what is the best local way to leverage AI for search? | news.ycombinator.com | 2024-05-30

If you want to run locally you can look into this https://github.com/PaddlePaddle/PaddleOCR
https://andrejusb.blogspot.com/2024/03/optimizing-receipt-pr...
But I suggest that you just skip that and use gpt-4o. They aren't actually going to steal your data.
Sort through it ahead of time to find anything with a credit card number or anything similar.
Or you could look into InternVL.
Or a combination of PaddleOCR first and then use a strong LLM via API, like gpt-4o or llama3 70b via together.ai
If you truly must do it locally, then if you have two 3090s or 4090s it might work out. Otherwise the LLMs may not be smart enough to give good results.
Leaving out the details of your hardware makes it impossible to give good advice about running locally. Other than, it's not really necessary.

Tesseract.js

32 33,906 8.1 JavaScript

Pure Javascript OCR for more than 100 Languages 📖🎉🖥

Project mention: I am out of the loop. Is Next.js "the future" and something I should consider adding to my knowledge pool? | /r/webdev | 2023-07-05

What do you have against tesseract.js?

ShareX

579 28,156 9.3 C#

ShareX is a free and open source program that lets you capture or record any area of your screen and share it with a single press of a key. It also allows uploading images, text or other types of files to many supported destinations you can choose from.

Project mention: From Dull to Dazzling: 3 Methods to Elevate Your Writing with Visual Content | dev.to | 2024-05-02

For Windows: ShareX - https://github.com/ShareX/ShareX

EasyOCR

39 22,419 3.6 Python

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

Project mention: I built an online PDF management platform using open-source software | news.ycombinator.com | 2024-05-12

Ok on cleaned aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly)
[0] https://github.com/JaidedAI/EasyOCR

paperless-ngx

213 17,530 9.9 Python

A community-supported supercharged version of paperless: scan, index and archive all your physical documents

Project mention: Ask HN: I have many PDFs – what is the best local way to leverage AI for search? | news.ycombinator.com | 2024-05-30

Paperless supports OCR + full text indexing: https://docs.paperless-ngx.com/
As far as AI goes, not sure.

siyuan

20 16,628 10.0 TypeScript

A privacy-first, self-hosted, fully open source personal knowledge management software, written in typescript and golang.

Project mention: A structured note-taking app for personal use | news.ycombinator.com | 2023-12-21

Try SiYuan Note. It's a free and open source, local-first mix of Notion and Obsidian.
https://github.com/siyuan-note/siyuan

OCRmyPDF

77 12,353 9.5 Python

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Project mention: TextSnatcher: Copy text from images, for the Linux Desktop | news.ycombinator.com | 2024-03-14

Try https://github.com/ocrmypdf/OCRmyPDF - it uses Tesseract behind the scenes and it is absolutely brilliant.

LaTeX-OCR

22 11,163 2.4 Python

pix2tex: Using a ViT to convert images of equations into LaTeX code.

Project mention: Show HN: Synthesize TikZ Graphics Programs for Scientific Figures and Sketches | news.ycombinator.com | 2024-06-06

already claim to (at least partially) support this.
[1] https://github.com/lukas-blecher/LaTeX-OCR

Bob

3 8,161 -24.5

Bob is a translation and OCR app for macOS.

ailab

2 7,651 0.0 C#

Experience, Learn and Code the latest breakthrough innovations with Microsoft AI

Project mention: AI-Powered Developer Tools | news.ycombinator.com | 2023-08-06

Sorry about that! I should have checked before sharing that link.
It looks like Microsoft published the code on GitHub, so you might be able to deploy it via Azure. (I haven't tried it.)
https://github.com/Microsoft/ailab/blob/master/Sketch2Code/R...

unstructured

12 7,017 9.8 HTML

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Project mention: LlamaCloud and LlamaParse | news.ycombinator.com | 2024-02-20

Be careful with unstructured:
https://github.com/Unstructured-IO/unstructured/blob/d11c70c...
from: https://github.com/open-webui/open-webui/issues/687

ragflow

10 8,744 9.7 Python

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

Project mention: Better RAG Results with Reciprocal Rank Fusion and Hybrid Search | news.ycombinator.com | 2024-05-30

Within our open source RAG product RAGFlow (https://github.com/infiniflow/ragflow), Elasticsearch is currently used instead of general vector databases because it can provide hybrid search right now. By default, an embedding-based reranker is not required; RRF alone is enough. Even when a reranker is used, keyword-based retrieval MUST still be combined with embedding-based retrieval, which is what RAGFlow's latest 0.7 release provides.
On the other hand, let me introduce another database we developed, Infinity (https://github.com/infiniflow/infinity), which can provide the fastest hybrid search. You can see the performance here (https://github.com/infiniflow/infinity/blob/main/docs/refere...); both vector search and full-text search perform much faster than other open source alternatives.
From the next version (weeks away), Infinity will also provide more comprehensive hybrid search capabilities: the 3-way recall you mentioned (dense vector, sparse vector, keyword search) will be available within a single request.
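
The comment above relies on Reciprocal Rank Fusion (RRF) to merge keyword and embedding results. As a rough illustration only (this is not RAGFlow's or Infinity's code), a minimal Python sketch of RRF might look like the following. The function name, the sample document ids, and the constant k = 60 are assumptions chosen for the example; each document scores the sum of 1 / (k + rank) over the result lists it appears in.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    rankings: list of lists, each ordered from best to worst match.
    k: smoothing constant (60 is a commonly used value, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gains more score the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear near the top of both lists (here doc_a and doc_c) end up ahead of documents found by only one retriever, which is why no separate reranker is needed for a reasonable default ordering.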

Easydict

1 6,179 9.9 Objective-C

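A concise and elegant dictionary and translator app for macOS, for looking up words and translating text. It works out of the box and supports offline OCR, Youdao Dictionary, 🍎 Apple System Dictionary, 🍎 Apple System Translation, OpenAI, Gemini, DeepL, Google, Bing, Tencent, Baidu, Alibaba, Niutrans, Caiyun and Volcano translation.

tessdata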

11 5,998 2.8

Trained models with fast variant of the "best" LSTM models + legacy models

Project mention: OCR with tesseract, python and pytesseract | dev.to | 2024-06-04

Note that not all language files work with the original Tesseract engine (0 and 3), although the neural network engine is generally the one that gives the best results. You can find models compatible with both the original Tesseract engine and the neural network engine in the tesseract repository.
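
The "0 and 3" in the comment above most likely refers to Tesseract's OCR engine mode, selected on the command line with `--oem`. As a hedged sketch, the snippet below only builds the command line and does not run Tesseract; the helper name `tesseract_command` and the file names are hypothetical, while `-l` and `--oem` are standard Tesseract CLI options.

```python
import shlex

# Engine modes accepted by the Tesseract CLI's --oem option.
OEM_MODES = {
    0: "legacy engine only",
    1: "neural network (LSTM) engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what is available",
}


def tesseract_command(image, output_base, lang="eng", oem=3):
    """Build an argument list for the tesseract CLI (hypothetical helper)."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown engine mode: {oem}")
    return ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]


# Legacy mode (0) needs traineddata files that include the legacy model.
cmd = tesseract_command("scan.png", "scan", lang="eng", oem=0)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract and a matching language file are installed.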

Parsr

7 5,683 4.6 JavaScript

Transforms PDF, Documents and Images into Enriched Structured Data

Project mention: LlamaCloud and LlamaParse | news.ycombinator.com | 2024-02-20

I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF->Structured Text extractors (I built several in the past, including https://github.com/axa-group/Parsr).
For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), then uses a mixture of heuristics and machine learning models to reconstruct the document.
Once plugged into a recursive retrieval strategy, it allows you to get SOTA results on question answering over complex text (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).
AMA

pytesseract

11 5,588 7.4 Python

A Python wrapper for Google Tesseract

donut

19 5,434 2.1 Python

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

Project mention: Ask HN: Why are all OCR outputs so raw? | news.ycombinator.com | 2023-11-15

maybe this is better? https://github.com/clovaai/donut
I'm not sure

video-subtitle-extractor

1 5,070 7.6 Python

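A GUI tool for extracting hard-coded subtitles (hardsubs) from videos and generating srt files. No third-party API is required; text recognition runs locally. It is a deep-learning-based video subtitle extraction framework that includes subtitle region detection and subtitle content extraction.

SwiftOCR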

0 4,600 1.8 Swift

Fast and simple OCR library written in Swift

layout-parser

6 4,573 0.0 Python

A Unified Toolkit for Deep Learning Based Document Image Analysis

TNN

1 4,304 1.6 C++

TNN: developed by Tencent Youtu Lab and Guangying Lab, a uniform deep learning inference framework for mobile, desktop and server. TNN is distinguished by several outstanding features, including its cross-platform capability, high performance, model compression and code pruning. Based on ncnn and Rapidnet, TNN further strengthens the support and performance optimization for mobile devices, and also draws on the good extensibility and high performance of existing open source efforts.

manga-image-translator

12 4,447 9.2 Python

Translate manga/images: one-click translation of text inside all kinds of images. https://cotrans.touhou.ai/

Project mention: [DISC] - The angel who came to pick me up is a Gal (Oneshot by Shiraishi Kouhei) | /r/manga | 2023-09-06

OCR works pretty good. ocr.space, ocr.best and cotrans.touhou.ai/ are all pretty nice.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

OCR related posts

A Picture Is Worth 170 Tokens: How Does GPT-4o Encode Images?
5 projects | news.ycombinator.com | 7 Jun 2024

Microsoft Research chief scientist has no issue with Recall
2 projects | news.ycombinator.com | 6 Jun 2024

Microsoft has gone radio silent on Windows Recall
1 project | news.ycombinator.com | 5 Jun 2024

OCR with tesseract, python and pytesseract
2 projects | dev.to | 4 Jun 2024

TotalRecall: Windows 11 Recall Pwned
3 projects | news.ycombinator.com | 4 Jun 2024

Security researcher discovers Microsoft's Recall tool is woefully insecure
1 project | news.ycombinator.com | 4 Jun 2024

OCR Tools for Mac, iOS and Windows
1 project | news.ycombinator.com | 3 Jun 2024

Index

What are some of the best open-source OCR projects? This list will help you:

#	Project	Stars
1	tesseract-ocr	58,936
2	PaddleOCR	39,374
3	Tesseract.js	33,906
4	ShareX	28,156
5	EasyOCR	22,419
6	paperless-ngx	17,530
7	siyuan	16,628
8	OCRmyPDF	12,353
9	LaTeX-OCR	11,163
10	Bob	8,161
11	ailab	7,651
12	unstructured	7,017
13	ragflow	8,744
14	Easydict	6,179
15	tessdata	5,998
16	Parsr	5,683
17	pytesseract	5,588
18	donut	5,434
19	video-subtitle-extractor	5,070
20	SwiftOCR	4,600
21	layout-parser	4,573
22	TNN	4,304
23	manga-image-translator	4,447
