Extract text from a scanned PDF with tesseract.js — fully local.
Up to 200 MB. PDF never leaves your device — pages are OCRed locally.
⌘V paste PDF
⌘C copy text
Quick start
How to OCR a scanned PDF
Turn a scanned / image-only PDF into selectable text using pdf.js + tesseract.js, locally.
Step 1
Drop or pick a PDF
Drag a scanned PDF onto the drop zone, click to pick it, or paste from the clipboard. The file stays on your device.
Step 2
Pick language, DPI and pages
Choose the dominant language, render DPI (200 is a good default), and whether to OCR every page or just a range.
Step 3
Extract and download
Hit Extract text. Each page is rendered locally, OCRed by tesseract.js, and the combined text appears in an editable textarea — copy or download as .txt.
In-depth guide
Run OCR on a scanned PDF in your browser — full guide
This tool turns scanned / image-only PDFs into selectable text. Each page is rendered with Mozilla's pdf.js, then the rendered image is run through tesseract.js — a WebAssembly port of Google's Tesseract OCR engine. Everything happens inside your browser; the only network traffic is downloading the language model the first time you use it.
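The render-then-recognize flow described above can be sketched as a small loop. This is a minimal illustration, not the tool's actual source: `ocrPages`, `RenderPage`, and `RecognizeImage` are hypothetical names, and the two stages are passed in as functions so the sketch runs without either library. In a real page, the render stage would likely wrap pdf.js (`page.render({ canvasContext, viewport }).promise`) and the recognize stage would wrap a tesseract.js worker (`worker.recognize(canvas)`, reading `data.text`).

```typescript
// Hypothetical stage signatures. `Image` stands for whatever the renderer
// produces (in a browser, typically a canvas element).
type RenderPage<Image> = (pageNumber: number) => Promise<Image>;
type RecognizeImage<Image> = (image: Image) => Promise<string>;

// Renders and OCRs the requested pages one at a time, then joins the text.
// Joining pages with a blank line is an assumption of this sketch.
async function ocrPages<Image>(
  pageNumbers: number[],
  renderPage: RenderPage<Image>,
  recognizeImage: RecognizeImage<Image>,
  onProgress?: (done: number, total: number) => void
): Promise<string> {
  const pageTexts: string[] = [];
  for (let i = 0; i < pageNumbers.length; i++) {
    const image = await renderPage(pageNumbers[i]); // stage 1: rasterise the page
    const text = await recognizeImage(image);       // stage 2: OCR the raster
    pageTexts.push(text.trim());
    if (onProgress) onProgress(i + 1, pageNumbers.length);
  }
  return pageTexts.join("\n\n");
}
```

Processing pages sequentially means only one rendered page image needs to be alive at a time, which matters for large files on memory-constrained devices.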
When you need PDF OCR (and when you don't)
Test first: open the PDF in any reader and try to select and copy a sentence of body text.
If the text selects normally, the PDF already has a text layer. Don't OCR it; use PDF → DOCX or just copy the text. OCR would be slower and less accurate than reading the existing text layer.
If selection grabs a whole page block or selects nothing, the PDF is image-only (scanned, photographed, or exported from screenshot software). This is exactly what this tool is for.
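The manual select-and-copy test can also be approximated in code. The sketch below is a heuristic under stated assumptions, not something this tool is known to do: it takes objects shaped like the result of pdf.js `page.getTextContent()` (an `items` array whose entries may carry a `str` field) and flags the document as image-only when the sampled pages contain almost no text. The function names and the 20-character threshold are hypothetical.

```typescript
// The only part of a pdf.js text-content result this sketch relies on.
interface TextItemLike { str?: string }
interface TextContentLike { items: TextItemLike[] }

// Counts non-whitespace characters in the extracted text of one page.
function countTextChars(content: TextContentLike): number {
  let total = 0;
  for (const item of content.items) {
    total += (item.str || "").replace(/\s+/g, "").length;
  }
  return total;
}

// Heuristic: image-only when sampled pages average fewer than
// `minCharsPerPage` characters. The default of 20 is an arbitrary assumption.
// With no sampled pages there is no text to find, so the result is true.
function looksImageOnly(pages: TextContentLike[], minCharsPerPage = 20): boolean {
  if (pages.length === 0) return true;
  let total = 0;
  for (const page of pages) total += countTextChars(page);
  return total / pages.length < minCharsPerPage;
}
```

A scan that already carries a poor-quality hidden text layer would pass this check, so the manual test remains the more reliable one.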
Choosing DPI and language
DPI controls how detailed each rendered page is before OCR.
150 DPI — fastest. Use for clean modern scans with big text.
200 DPI — recommended default. Best speed / accuracy balance.
300 DPI — slowest but most accurate. Use for small print, faded scans, or anything that came in at a low original resolution.
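To make the DPI options concrete: PDF page sizes are measured in points, with 72 points per inch, so a render DPI translates into a scale factor of DPI / 72. The sketch below shows that conversion and the resulting image size. `dpiToScale` and `renderedSize` are hypothetical helper names; with pdf.js, the scale would typically be passed to `page.getViewport({ scale })`.

```typescript
// PDF user space defaults to 1/72 inch per unit.
const PDF_POINTS_PER_INCH = 72;

// Converts a render DPI into a scale factor relative to the page's point size.
function dpiToScale(dpi: number): number {
  if (!isFinite(dpi) || dpi <= 0) {
    throw new RangeError("DPI must be a positive, finite number");
  }
  return dpi / PDF_POINTS_PER_INCH;
}

// Pixel dimensions of a page (given in points) when rendered at `dpi`.
function renderedSize(widthPt: number, heightPt: number, dpi: number) {
  dpiToScale(dpi); // validates dpi
  const width = Math.round((widthPt * dpi) / PDF_POINTS_PER_INCH);
  const height = Math.round((heightPt * dpi) / PDF_POINTS_PER_INCH);
  return { width, height, megapixels: (width * height) / 1e6 };
}

// US Letter is 612 x 792 points:
// renderedSize(612, 792, 150) -> 1275 x 1650 px
// renderedSize(612, 792, 200) -> 1700 x 2200 px
// renderedSize(612, 792, 300) -> 2550 x 3300 px
```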
Language: pick the dominant language of the document. A mismatched language pack causes cascading misreads (plain digits are usually still recognised with any pack, but accented or non-Latin text needs the matching language). For CJK content, Simplified and Traditional Chinese are separate packs.
Picking specific pages
OCR work is O(pages × DPI²). Doubling DPI quadruples the work per page. A 50-page document at 300 DPI may take several minutes on mobile.
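The scaling rule can be turned into a quick comparison. This sketch only encodes the pages × DPI² model stated above; `relativeWork` is a hypothetical helper, and real timings also depend on the device and the page content.

```typescript
// Work of one job relative to a baseline job, under the pages × DPI² model.
function relativeWork(
  pages: number,
  dpi: number,
  baselinePages: number,
  baselineDpi: number
): number {
  return (pages * dpi * dpi) / (baselinePages * baselineDpi * baselineDpi);
}

// Same page count, 150 -> 300 DPI: 4 times the work.
const doubledDpi = relativeWork(1, 300, 1, 150);    // 4
// Same page count, 200 -> 300 DPI: 2.25 times the work.
const defaultToMax = relativeWork(1, 300, 1, 200);  // 2.25
// A 2-page sample of a 50-page document at the same DPI: 4% of the work.
const sampleRun = relativeWork(2, 200, 50, 200);    // 0.04
```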
For long documents — or to test settings before committing to a full pass — pick Selected pages and enter ranges:
1-5 → first five pages
3,7,11 → those three pages
1-3,9,15-18 → mixed ranges
This is how we recommend you test: run a 2-page sample first, check the text quality, then re-run on the full document with confidence.
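The range syntax shown above is simple to parse. The following is a minimal sketch of one way to do it, assuming 1-based page numbers; `parsePageRanges` is a hypothetical name, and the choice to reject reversed or out-of-bounds ranges (rather than clamp them) is an assumption, since the tool's exact behaviour is not documented here.

```typescript
// Parses a spec such as "1-3,9,15-18" into a sorted list of unique
// 1-based page numbers. Throws on malformed or out-of-range parts.
function parsePageRanges(spec: string, pageCount: number): number[] {
  const selected: boolean[] = [];
  for (const rawPart of spec.split(",")) {
    const part = rawPart.trim();
    if (part === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:\s*-\s*(\d+))?$/.exec(part);
    if (!match) throw new Error('Invalid page range: "' + part + '"');
    const start = parseInt(match[1], 10);
    const end = match[2] !== undefined ? parseInt(match[2], 10) : start;
    if (start < 1 || end < start || end > pageCount) {
      throw new Error('Page range out of bounds: "' + part + '"');
    }
    for (let page = start; page <= end; page++) selected[page] = true;
  }
  // Walk the pages in order so the result is sorted and free of duplicates.
  const pages: number[] = [];
  for (let page = 1; page <= pageCount; page++) {
    if (selected[page]) pages.push(page);
  }
  return pages;
}

// parsePageRanges("1-5", 20)         -> [1, 2, 3, 4, 5]
// parsePageRanges("3,7,11", 20)      -> [3, 7, 11]
// parsePageRanges("1-3,9,15-18", 20) -> [1, 2, 3, 9, 15, 16, 17, 18]
```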
Honest limits
What in-browser OCR is not good at:
Tables and forms. Output is plain text in reading order. Tables come out as flowing text with no cell structure.
Multi-column layouts. Tesseract attempts column detection but gets it wrong on dense newspaper-style pages.
Handwriting. Don't rely on it for handwritten content.
Math / chemistry notation. Specialised symbols are unreliable.
For everything else — printed body text, headings, captions, scanned receipts and invoices in any of the 14 supported languages — accuracy is solid.
Frequently asked questions
Is my PDF uploaded anywhere?
No. pdf.js opens the file in your browser, each page is rendered to a local canvas, and tesseract.js OCRs the canvas, all without a network round-trip for your data. Open DevTools → Network while extracting and you'll see zero outbound requests for your PDF. The only download you may see is the language model, fetched the first time you use a language.
When should I use PDF OCR vs the regular text-extraction in PDF tools?
If your PDF already has a real text layer (you can select + copy text inside a PDF reader), the PDF → DOCX tool is much faster — it extracts the existing text directly. Use OCR only for scanned / image-only PDFs where the page is essentially a picture of text with no underlying selectable text.
Why is OCR so slow compared to text extraction?
Each page has to be rasterised at 150–300 DPI and then run through a WebAssembly OCR engine. Expect 2–10 seconds per page on a modern laptop, longer on mobile. For a 50-page scanned report, that works out to roughly 2 to 8 minutes.
What DPI should I pick?
200 DPI is the sweet spot for most pages. 150 is faster but loses accuracy on small text; 300 is best for tiny fonts or noisy scans but takes roughly 2.25× the OCR time of 200, because the work scales with DPI². Rendering well above the DPI the source was scanned at doesn't recover detail that isn't there; it just creates bigger images.
Can I OCR only some pages?
Yes. Pick "Selected pages" and enter a range like 1-3,5,7-9. Only those pages get rendered + OCRed.
What languages are supported?
14 curated languages out of the box: English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Hindi, Arabic, Japanese, Korean, and both Simplified + Traditional Chinese. The matching trained-data file is fetched from the default Tesseract CDN the first time you use a language and then cached.
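Tesseract identifies each trained-data file by a short language code. The table below maps the 14 languages listed above to the standard Tesseract codes. How this tool stores the mapping internally is not documented here, so `LANGUAGE_CODES` and `languageCode` are hypothetical names for illustration. With tesseract.js, a code like this is what gets passed when creating a worker.

```typescript
// Display name -> standard Tesseract language code.
const LANGUAGE_CODES: { [name: string]: string } = {
  "English": "eng",
  "Spanish": "spa",
  "French": "fra",
  "German": "deu",
  "Portuguese": "por",
  "Italian": "ita",
  "Dutch": "nld",
  "Russian": "rus",
  "Hindi": "hin",
  "Arabic": "ara",
  "Japanese": "jpn",
  "Korean": "kor",
  "Chinese (Simplified)": "chi_sim",
  "Chinese (Traditional)": "chi_tra",
};

// Looks up the code for a display name, failing loudly on unknown names.
function languageCode(name: string): string {
  if (!Object.prototype.hasOwnProperty.call(LANGUAGE_CODES, name)) {
    throw new Error("Unsupported language: " + name);
  }
  return LANGUAGE_CODES[name];
}
```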
Can I get a searchable PDF instead of plain text?
Yes — use the dedicated /tools/pdf-searchable tool. It runs the same OCR but overlays the recognised text invisibly back onto the original PDF pages, so the result is a normal-looking PDF where text is selectable, copyable and searchable.
Will it preserve tables, columns or formatting?
No. OCR output is plain text in reading order — multi-column layouts may interleave, and tables come out as flowing text. For structured extraction you need a heavier pipeline (e.g. AWS Textract or Azure Document Intelligence), which can't run in a pure browser.
Can I OCR a password-protected PDF?
No. Decrypt the file first using /tools/pdf-unlock (or any PDF reader's Save-without-password), then OCR it here.