epitometool

PDF OCR

Extract text from a scanned PDF with tesseract.js — fully local.

Up to 200 MB. Your PDF never leaves your device; pages are OCRed locally.

  • V: paste PDF
  • C: copy text

Quick start

How to OCR a scanned PDF

Turn a scanned / image-only PDF into selectable text using pdf.js + tesseract.js, locally.

  1. Drop or pick a PDF

    Drag a scanned PDF onto the drop zone, click to pick it, or paste from the clipboard. The file stays on your device.

  2. Pick language, DPI and pages

    Choose the dominant language, render DPI (200 is a good default), and whether to OCR every page or just a range.

  3. Extract and download

    Hit Extract text. Each page is rendered locally, OCRed by tesseract.js, and the combined text appears in an editable textarea — copy or download as .txt.

In-depth guide

Run OCR on a scanned PDF in your browser — full guide

This tool turns scanned / image-only PDFs into selectable text. Each page is rendered with Mozilla's pdf.js, then the rendered image is run through tesseract.js, a WebAssembly port of Google's Tesseract OCR engine. Everything happens inside your browser; the only network traffic is the one-time download of a language model the first time you use that language.
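
The render-then-recognise flow described above can be sketched as a small loop. This is a minimal sketch, not the tool's actual code: `ocrPages`, `RenderPage` and `RecognizeImage` are hypothetical names, and both steps are passed in as functions so the example runs without pdf.js or tesseract.js. In a browser, the render step would typically wrap pdf.js and the recognise step would wrap a tesseract.js worker.

```typescript
// Hypothetical shapes for the two stages. `Raster` stands for whatever the
// renderer produces (in a browser, most likely a canvas).
type RenderPage<Raster> = (pageNumber: number, dpi: number) => Promise<Raster>;
type RecognizeImage<Raster> = (image: Raster) => Promise<string>;

async function ocrPages<Raster>(
  pageNumbers: number[],
  dpi: number,
  render: RenderPage<Raster>,
  recognize: RecognizeImage<Raster>
): Promise<string> {
  const pageTexts: string[] = [];
  // Pages are handled one at a time so only one rendered image is alive at
  // once. That is a choice made for this sketch, not a library requirement.
  for (const pageNumber of pageNumbers) {
    const image = await render(pageNumber, dpi);
    const text = await recognize(image);
    pageTexts.push(text.trim());
  }
  // Separate pages with a blank line to form the combined text.
  return pageTexts.join("\n\n");
}

// Stand-in stages so the sketch runs without a real PDF or OCR engine.
const fakeRender: RenderPage<string> = async (n, dpi) => "page " + n + " @ " + dpi;
const fakeRecognize: RecognizeImage<string> = async (img) => "text of " + img + "\n";

ocrPages([1, 2], 200, fakeRender, fakeRecognize).then((combined) => {
  console.log(combined);
});
```

Injecting the two stages keeps the control flow readable and lets either side be swapped out, for example to try a different render DPI without touching the OCR step.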

When you need PDF OCR (and when you don't)

Test first: open the PDF in any reader and try to select and copy a sentence of body text.

  • If the text selects normally the PDF already has a text layer. Don't OCR it — use PDF → DOCX or just copy the text. OCR will be slower and less accurate than the existing text.
  • If selection grabs the whole page as one block, or selects nothing at all, the PDF is image-only (scanned, photographed, exported from screenshot software). This is exactly what this tool is for.
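
The same check can be approximated in code once you have the extracted text of each page, for example from a PDF library's text extraction such as pdf.js `getTextContent()`. The sketch below is an illustration only: `looksImageOnly` and the threshold of 20 characters per page are hypothetical, and this page does not say the tool performs any such check.

```typescript
// Returns true when the pages carry so little extractable text that the PDF
// is probably image-only and therefore a candidate for OCR.
// `pageTexts` holds the already-extracted text of each page (assumed input).
function looksImageOnly(pageTexts: string[], minCharsPerPage: number): boolean {
  if (pageTexts.length === 0) {
    return false; // no pages at all, so nothing to OCR
  }
  let totalChars = 0;
  for (const text of pageTexts) {
    // Ignore whitespace so blank-but-present text layers do not count.
    totalChars += text.replace(/\s+/g, "").length;
  }
  return totalChars / pageTexts.length < minCharsPerPage;
}

console.log(looksImageOnly(["", "  \n"], 20));
console.log(looksImageOnly(["This page already has a real, selectable text layer."], 20));
```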

Choosing DPI and language

DPI controls how detailed each rendered page is before OCR.

  • 150 DPI — fastest. Use for clean modern scans with big text.
  • 200 DPI — recommended default. Best speed / accuracy balance.
  • 300 DPI — slowest but most accurate. Use for small print, faded scans, or anything that came in at a low original resolution.
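
To see what each setting means in pixels: PDF page sizes are measured in points (1 point = 1/72 inch), so a renderer that draws one pixel per point at scale 1 needs a scale of DPI / 72. The sketch below assumes that convention; `dpiToScale` and `renderedSize` are hypothetical helper names, not part of this tool or of pdf.js.

```typescript
// PDF user space uses 72 points per inch by default.
const POINTS_PER_INCH = 72;

interface PixelSize {
  width: number;
  height: number;
}

// Scale factor for a renderer that produces 72 pixels per inch at scale 1.
function dpiToScale(dpi: number): number {
  return dpi / POINTS_PER_INCH;
}

// Pixel dimensions of a page of the given size (in points) at a target DPI.
function renderedSize(widthPt: number, heightPt: number, dpi: number): PixelSize {
  const scale = dpiToScale(dpi);
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
for (const dpi of [150, 200, 300]) {
  const size = renderedSize(612, 792, dpi);
  console.log(dpi + " DPI -> " + size.width + " x " + size.height + " px");
}
// 150 DPI -> 1275 x 1650 px
// 200 DPI -> 1700 x 2200 px
// 300 DPI -> 2550 x 3300 px
```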

Language: pick the dominant language of the document. A mismatched language pack causes cascading misreads. Digits are still recognised by any pack, but accented or non-Latin text needs the matching language. For Chinese content, note that Simplified and Traditional Chinese are separate packs.

Picking specific pages

OCR work is O(pages × DPI²). Doubling DPI quadruples the work per page. A 50-page document at 300 DPI may take several minutes on mobile.
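
The scaling rule can be turned into a quick comparison. This sketch simply encodes the pages × DPI² model stated above; `relativeOcrWork` is a hypothetical helper, and the result is a relative figure (in "pages at the baseline DPI"), not a time measurement.

```typescript
// Relative OCR work under the pages x DPI^2 model, expressed as the number of
// baseline-DPI pages that would cost the same amount of pixel processing.
function relativeOcrWork(pages: number, dpi: number, baselineDpi: number): number {
  const ratio = dpi / baselineDpi;
  return pages * ratio * ratio;
}

console.log(relativeOcrWork(1, 300, 150));  // 4     (doubling DPI quadruples the work)
console.log(relativeOcrWork(1, 300, 200));  // 2.25
console.log(relativeOcrWork(50, 300, 200)); // 112.5 (50 pages at 300 DPI ~ 112.5 pages at 200 DPI)
```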

For long documents — or to test settings before committing to a full pass — pick Selected pages and enter ranges:

  • 1-5 → first five pages
  • 3,7,11 → those three pages
  • 1-3,9,15-18 → mixed ranges

This is how we recommend you test: run a 2-page sample first, check the text quality, then re-run on the full document with confidence.
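
For illustration, a range string in the format shown above could be parsed as follows. This is a sketch under assumptions: `parsePageSelection` is a hypothetical name, and the choices to sort, de-duplicate and reject out-of-range pages are this example's, since the page does not describe how the tool handles those cases.

```typescript
// Parses a selection such as "1-3,9,15-18" into a sorted list of unique
// 1-based page numbers. Throws on malformed or out-of-range input.
function parsePageSelection(spec: string, pageCount: number): number[] {
  const selected: boolean[] = [];
  for (const rawPart of spec.split(",")) {
    const part = rawPart.trim();
    if (part === "") {
      continue; // tolerate stray commas
    }
    // Either a single number ("7") or a closed range ("15-18").
    const match = /^(\d+)(?:\s*-\s*(\d+))?$/.exec(part);
    if (match === null) {
      throw new Error("Invalid page range: " + part);
    }
    const first = parseInt(match[1], 10);
    const last = match[2] !== undefined ? parseInt(match[2], 10) : first;
    if (first < 1 || last < first || last > pageCount) {
      throw new Error("Page range out of bounds: " + part);
    }
    for (let p = first; p <= last; p++) {
      selected[p] = true;
    }
  }
  // Collect in ascending order; duplicates collapse naturally.
  const pages: number[] = [];
  for (let p = 1; p <= pageCount; p++) {
    if (selected[p]) {
      pages.push(p);
    }
  }
  return pages;
}

console.log(parsePageSelection("1-5", 20));         // [1, 2, 3, 4, 5]
console.log(parsePageSelection("3,7,11", 20));      // [3, 7, 11]
console.log(parsePageSelection("1-3,9,15-18", 20)); // [1, 2, 3, 9, 15, 16, 17, 18]
```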

Honest limits

What in-browser OCR is not good at:

  • Tables and forms. Output is plain text in reading order. Tables come out as flowing text with no cell structure.
  • Multi-column layouts. Tesseract attempts column detection but gets it wrong on dense newspaper-style pages.
  • Handwriting. Don't rely on it for handwritten content.
  • Math / chemistry notation. Specialised symbols are unreliable.

For everything else — printed body text, headings, captions, scanned receipts and invoices in any of the 14 supported languages — accuracy is solid.

Frequently asked questions

Is my PDF uploaded anywhere?

No. pdf.js opens the file in your browser, each page is rendered to a local canvas, and tesseract.js OCRs the canvas — all without a network round-trip for your data. Open DevTools → Network while extracting and you'll see zero outbound requests for your PDF.

When should I use PDF OCR vs the regular text-extraction in PDF tools?

If your PDF already has a real text layer (you can select + copy text inside a PDF reader), the PDF → DOCX tool is much faster — it extracts the existing text directly. Use OCR only for scanned / image-only PDFs where the page is essentially a picture of text with no underlying selectable text.

Why is OCR so slow compared to text extraction?

Each page has to be rasterised at 150–300 DPI and then run through a WebAssembly OCR engine. Expect 2–10 seconds per page on a modern laptop, longer on mobile. For a 50-page scanned report, plan on a few minutes.
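
As a worked example of that arithmetic, the 2–10 seconds per page figure quoted above gives the following bounds for a 50-page report. `estimateOcrSeconds` is a hypothetical helper; real timings depend on the device, DPI and page content.

```typescript
interface SecondsRange {
  minSeconds: number;
  maxSeconds: number;
}

// Multiplies a per-page time range by the page count.
function estimateOcrSeconds(pages: number, minPerPage: number, maxPerPage: number): SecondsRange {
  return { minSeconds: pages * minPerPage, maxSeconds: pages * maxPerPage };
}

const reportEstimate = estimateOcrSeconds(50, 2, 10);
console.log(
  (reportEstimate.minSeconds / 60).toFixed(1) + " to " +
  (reportEstimate.maxSeconds / 60).toFixed(1) + " minutes"
);
// 1.7 to 8.3 minutes
```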

What DPI should I pick?

200 DPI is the sweet spot for most pages. 150 is faster but loses accuracy on small text; 300 is best for tiny fonts or noisy scans but roughly doubles the OCR time compared with 200 (2.25× the pixels per page). Rendering above the resolution the source was scanned at adds no detail; it just creates bigger images.

Can I OCR only some pages?

Yes. Pick "Selected pages" and enter a range like 1-3,5,7-9. Only those pages get rendered + OCRed.

What languages are supported?

14 curated languages out of the box: English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Hindi, Arabic, Japanese, Korean, and both Simplified + Traditional Chinese. The matching trained-data file is fetched from the default Tesseract CDN the first time you use a language and then cached.
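
Tesseract identifies each trained-data file by a short language code. The table below maps the 14 languages listed above to the standard Tesseract codes; the constant and function names are hypothetical, and how this tool stores the mapping internally is not stated on this page.

```typescript
// Display name -> Tesseract trained-data code.
const TESSERACT_LANG_CODES: { [language: string]: string } = {
  "English": "eng",
  "Spanish": "spa",
  "French": "fra",
  "German": "deu",
  "Portuguese": "por",
  "Italian": "ita",
  "Dutch": "nld",
  "Russian": "rus",
  "Hindi": "hin",
  "Arabic": "ara",
  "Japanese": "jpn",
  "Korean": "kor",
  "Chinese (Simplified)": "chi_sim",
  "Chinese (Traditional)": "chi_tra",
};

// Looks up the code for a display name, failing loudly on unknown languages.
function tesseractCodeFor(language: string): string {
  if (!Object.prototype.hasOwnProperty.call(TESSERACT_LANG_CODES, language)) {
    throw new Error("Unsupported language: " + language);
  }
  return TESSERACT_LANG_CODES[language];
}

console.log(tesseractCodeFor("German"));                // deu
console.log(tesseractCodeFor("Chinese (Traditional)")); // chi_tra
console.log(Object.keys(TESSERACT_LANG_CODES).length);  // 14
```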

Can I get a searchable PDF instead of plain text?

Yes — use the dedicated /tools/pdf-searchable tool. It runs the same OCR but overlays the recognised text invisibly back onto the original PDF pages, so the result is a normal-looking PDF where text is selectable, copyable and searchable.

Will it preserve tables, columns or formatting?

No. OCR output is plain text in reading order — multi-column layouts may interleave, and tables come out as flowing text. For structured extraction you need a heavier pipeline (e.g. AWS Textract or Azure Document Intelligence), which can't run in a pure browser.

Can I OCR a password-protected PDF?

No. Decrypt the file first using /tools/pdf-unlock (or any PDF reader's Save-without-password), then OCR it here.
