OCR a scanned PDF and embed an invisible text layer back in.
Up to 200 MB. PDF stays on your device — OCR runs locally.
Quick start
How to make a scanned PDF searchable
OCR each page of a scanned PDF and embed an invisible text layer back into the document, locally.
Step 1
Drop or pick a PDF
Drag a scanned PDF onto the drop zone, click to pick it, or paste from the clipboard. The file stays on your device.
Step 2
Pick language, DPI and pages
Choose the dominant language, render DPI (200 is a good default), and whether to process every page or just a range.
Step 3
Build and download
Hit Make searchable. Each page is rendered, OCRed, and reassembled as a page with an invisible text layer. The merged PDF downloads as <basename>-searchable.pdf.
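Two small pieces of the steps above can be sketched in TypeScript: expanding a page range and deriving the download name. The range syntax ("1-3,7") and both function names are assumptions for illustration, not the tool's actual code.

```typescript
// Hypothetical helpers: the tool's real range syntax and internals are not documented here.

/** Expand a range string such as "1-3,7" into sorted, unique 1-based page numbers. */
function parsePageRange(spec: string, pageCount: number): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const trimmed = part.trim();
    if (trimmed === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:\s*-\s*(\d+))?$/.exec(trimmed);
    if (!match) throw new Error(`Invalid page range: "${trimmed}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start || end > pageCount) {
      throw new Error(`Page range out of bounds: "${trimmed}"`);
    }
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

/** Derive the download name: "scan.pdf" becomes "scan-searchable.pdf". */
function searchableName(fileName: string): string {
  const base = fileName.replace(/\.pdf$/i, "");
  return `${base}-searchable.pdf`;
}
```

Rejecting out-of-bounds ranges up front avoids starting a long OCR run that would fail on a later page.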
In-depth guide
Make a scanned PDF searchable — full guide
Turn a scanned PDF, where each page is essentially a picture of text, into a normal PDF you can search, select and copy text from. The visible content stays the same as the original scan; an invisible OCR text layer is added to each page. Everything runs locally: pdf.js rasterises pages, tesseract.js OCRs them, and pdf-lib stitches the per-page PDFs into the final document. Your file never leaves your browser.
When to use this tool
Use it whenever you have a PDF where the text isn't actually text — typical examples:
PDFs exported from screenshot software where the source was an image
After processing, mainstream PDF readers (Adobe Acrobat, Preview, Foxit, Chrome, Firefox) will let you select text, search with Ctrl/Cmd-F, and copy content into other apps, much as if the document had been typed natively, subject to OCR accuracy.
How the text layer is added
This is the same "image-over-text" technique used by mainstream OCR pipelines, commercial and open source (ABBYY FineReader, Acrobat Pro, Tesseract's own PDF output):
Rasterise. Each page is rendered to a PNG at the chosen DPI.
OCR. Tesseract recognises the text and remembers where each word landed on the rasterised page.
Build a page. Tesseract writes a one-page PDF where the PNG is the visible page content and the recognised text is drawn on top in text rendering mode 3 (invisible but still selectable and searchable).
Merge. pdf-lib copies all the per-page PDFs into a single document.
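The four steps above can be sketched as a small orchestration function. The three stage signatures below are hypothetical stand-ins for pdf.js, tesseract.js and pdf-lib, injected as parameters so the control flow is visible without assuming any particular library API.

```typescript
// Hypothetical stage types: stand-ins for the real libraries, not their actual APIs.
type PageImage = { pageNumber: number; png: Uint8Array };

type Stages = {
  // Step 1: render one page to a PNG at the chosen DPI.
  rasterise: (pageNumber: number, dpi: number) => Promise<PageImage>;
  // Steps 2 and 3: OCR the image and return a one-page PDF with invisible text.
  ocrToPdf: (image: PageImage, language: string) => Promise<Uint8Array>;
  // Step 4: merge the per-page PDFs into a single document.
  merge: (pagePdfs: Uint8Array[]) => Promise<Uint8Array>;
};

async function makeSearchable(
  pageNumbers: number[],
  dpi: number,
  language: string,
  stages: Stages
): Promise<Uint8Array> {
  const pagePdfs: Uint8Array[] = [];
  // Pages are processed one at a time so only one rasterised image is held at once.
  for (const pageNumber of pageNumbers) {
    const image = await stages.rasterise(pageNumber, dpi);
    pagePdfs.push(await stages.ocrToPdf(image, language));
  }
  return stages.merge(pagePdfs);
}
```

Sequential processing is a design choice in this sketch: it trades speed for a bounded memory footprint, which matters when large scans are rendered at high DPI in a browser tab.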
The visible result looks the same as the original scan (re-rendered at the chosen DPI), but the document is now a real searchable PDF.
Picking DPI and language
DPI controls the resolution of the rasterised pages, and also affects how well OCR can read them:
150 DPI: smallest output, fastest. Use for clean modern scans.
200 DPI: recommended default. Best speed / accuracy balance.
300 DPI: best accuracy on small / faded text. Roughly doubles file size (2.25× the pixels of 200 DPI).
Language: pick the dominant language of the document. Tesseract tunes character recognition per language, so a mismatched pack will produce noticeably worse text. For mixed-language documents, pick the language that covers most of the text.
If size matters more than accuracy, generate the searchable PDF at 200 DPI and then run it through /tools/pdf-compress to shrink the embedded images back down. The invisible text layer survives compression.
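To make the DPI choice concrete, the sketch below converts a page size in PDF points (72 per inch) into pixel dimensions at a given DPI. The function names are illustrative; the only fact relied on is that one PDF point is 1/72 inch.

```typescript
const POINTS_PER_INCH = 72;

/** Scale factor from PDF points to pixels at the given DPI. */
function renderScale(dpi: number): number {
  return dpi / POINTS_PER_INCH;
}

/** Pixel dimensions of a page rendered at the given DPI. */
function renderSizePx(
  widthPt: number,
  heightPt: number,
  dpi: number
): { width: number; height: number } {
  const scale = renderScale(dpi);
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
const at150 = renderSizePx(612, 792, 150); // 1275 x 1650
const at200 = renderSizePx(612, 792, 200); // 1700 x 2200
const at300 = renderSizePx(612, 792, 300); // 2550 x 3300
```

Going from 200 to 300 DPI multiplies the pixel count by 2.25, which is why file size and OCR time grow faster than the DPI number suggests.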
Expect a bigger file
The output is almost always larger than the input, frequently 2×–4× depending on the source. Two costs add up:
Re-rasterisation. Each page is rendered to a fresh PNG at your chosen DPI and embedded back into the PDF. A "small" scan compressed with JPEG can balloon when it's re-encoded losslessly.
Invisible text layer. Tesseract embeds a glyphless font and one positioned text run per recognised word. On a dense page that's hundreds of objects.
Rules of thumb at the recommended 200 DPI:
Clean printed scan: ~1.5×–2× the input size.
Phone-camera "scan" with a photo background: ~2×–3×.
Heavy colour or grayscale photo content: can exceed 4×.
If the searchable PDF is too large, run it through /tools/pdf-compress afterwards. JPEG re-encoding of the page images recovers most of the size penalty, and the invisible text layer survives it.
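The two bounded rules of thumb can be turned into a rough estimator. This is only arithmetic on the multipliers quoted above for 200 DPI; the type and function names are made up for illustration, and the heavy-photo case is left out because the text gives it no upper bound.

```typescript
// Multipliers copied from the 200 DPI rules of thumb; the names are hypothetical.
type ScanKind = "cleanPrint" | "phonePhoto";

const SIZE_MULTIPLIERS: Record<ScanKind, { min: number; max: number }> = {
  cleanPrint: { min: 1.5, max: 2 },
  phonePhoto: { min: 2, max: 3 },
};

/** Rough output size range in bytes for a given input size and scan type. */
function estimateOutputBytes(
  inputBytes: number,
  kind: ScanKind
): { min: number; max: number } {
  const m = SIZE_MULTIPLIERS[kind];
  return {
    min: Math.round(inputBytes * m.min),
    max: Math.round(inputBytes * m.max),
  };
}
```

For a 10 MB clean printed scan this gives a range of roughly 15 MB to 20 MB, which is a planning figure rather than a guarantee.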
Honest limits
Things this tool doesn't try to do:
Reconstruct vector content. The output stores each page as a rasterised image — vector text, lines and shapes from the original become pixels. This is by design (it's how every "image-over-text" searchable PDF works) and is fine for scans; it's the wrong tool for clean vector PDFs.
Preserve form fields, annotations or bookmarks. The output is a fresh PDF built from rendered pages — any AcroForm / annotation / outline metadata from the source is dropped. Run /tools/pdf-fill or similar separately if you need those features.
Recognise handwriting. Tesseract is trained on printed text.
Detect tables. Cells are searchable individually, but the table structure isn't surfaced anywhere in the PDF.
For the canonical "I have a scan and I want to search inside it" workflow this tool is exactly right. For anything beyond that, treat the searchable PDF as an intermediate format and run downstream tools on it.
Frequently asked questions
Is my PDF uploaded anywhere?
No. pdf.js opens the file locally, each page is rasterised on a canvas, tesseract.js OCRs the canvas inside your browser, and pdf-lib merges the per-page PDFs into the final output — all without a network round-trip for your data. Only the language model itself is fetched (once, then cached).
What's the difference between this tool and PDF OCR?
PDF OCR gives you plain text (a .txt file). This tool produces a PDF that looks identical to the original scan but carries an invisible text layer, so you can highlight, copy, paste, and search inside any PDF reader as if the document had always been typed.
How is the text overlaid invisibly?
Tesseract emits a one-page PDF per source page where the visible content is the rasterised image and the OCR text sits on top in PDF text rendering mode 3, which neither fills nor strokes the glyphs: the text is invisible but still selectable with the mouse and findable with Ctrl-F. We then merge those per-page PDFs into a single document with pdf-lib. This is the same technique used by ABBYY FineReader, Adobe Acrobat Pro and other mainstream OCR pipelines.
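For readers curious what text rendering mode 3 looks like inside a page, here is a simplified sketch that builds PDF content-stream operators for a list of OCR words. It is not Tesseract's actual output (which uses a glyphless font and its own encoding); the `OcrWord` shape, the `/F1` font name and the ASCII-only handling are assumptions made for illustration.

```typescript
// Hypothetical word shape: a bounding box in pixels, origin at the top-left of the raster.
interface OcrWord {
  text: string;
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// Escape the characters that are special inside a PDF literal string: \ ( )
function escapePdfString(s: string): string {
  return s.replace(/([\\()])/g, "\\$1");
}

// Build operators that place each word invisibly (ASCII text only in this sketch).
function invisibleTextOps(words: OcrWord[], dpi: number, pageHeightPt: number): string {
  const pxToPt = 72 / dpi; // PDF user space defaults to 72 units per inch
  const ops: string[] = ["BT", "3 Tr"]; // begin text; mode 3 = neither fill nor stroke
  for (const w of words) {
    const sizePt = (w.y1 - w.y0) * pxToPt; // approximate font size from the box height
    const xPt = w.x0 * pxToPt;
    const yPt = pageHeightPt - w.y1 * pxToPt; // flip the y axis: PDF origin is bottom-left
    ops.push(`/F1 ${sizePt.toFixed(2)} Tf`); // assumes /F1 exists in the page's font resources
    ops.push(`1 0 0 1 ${xPt.toFixed(2)} ${yPt.toFixed(2)} Tm`); // absolute text position
    ops.push(`(${escapePdfString(w.text)}) Tj`); // show the word
  }
  ops.push("ET"); // end text
  return ops.join("\n");
}
```

Because the glyphs are never painted, the z-order relative to the page image does not affect what you see; the reader only uses the text positions for selection and search.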
Will it work on a non-scanned PDF?
Yes, but it's wasteful. If the PDF already has a real text layer, this tool will replace it with an OCRed text layer derived from rasterising the original — slower and slightly less accurate than just using the existing text. Run this only on scanned / image-only PDFs.
Will Ctrl-F search work on the output in every reader?
Yes — the output is a standards-compliant PDF with a text layer. Adobe Acrobat, Preview, Foxit, browser PDF viewers, and PDF.js all expose the text for selection and search.
Why is the output file larger than the input?
Because we re-encode each page as a PNG image at the chosen DPI (lossless), and the original may have used JPEG compression. If size matters, run the result through /tools/pdf-compress afterwards — it strips PNG bloat back down to JPEG-equivalent sizes.
What DPI should I pick?
200 is the sweet spot. 150 is faster and smaller but loses accuracy on small text. 300 is best for accuracy but roughly doubles the file size and processing time. Don't render above the original scan's native DPI; it doesn't add information.
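The advice about native DPI can be checked with simple arithmetic: a scan's native DPI is its pixel width divided by the page width in inches. The helper names below are hypothetical, and the sketch assumes the scanned image fills the whole page.

```typescript
/** Native DPI of a full-page scan: image pixel width over page width in inches. */
function estimateNativeDpi(imageWidthPx: number, pageWidthPt: number): number {
  const pageWidthInches = pageWidthPt / 72; // 72 PDF points per inch
  return imageWidthPx / pageWidthInches;
}

/** Never render above the scan's native resolution. */
function chooseRenderDpi(requestedDpi: number, nativeDpi: number): number {
  return Math.min(requestedDpi, Math.round(nativeDpi));
}

// A 1700 px wide image on a US Letter page (612 pt = 8.5 in) was scanned at 200 DPI,
// so a request for 300 DPI is capped to 200.
const nativeDpi = estimateNativeDpi(1700, 612);
const cappedDpi = chooseRenderDpi(300, nativeDpi);
```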
What languages are supported?
Same 14 curated packs as the other OCR tools: English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Hindi, Arabic, Japanese, Korean, Simplified Chinese, Traditional Chinese.
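Tesseract identifies language packs by short codes. The mapping below pairs the 14 languages listed above with the standard Tesseract traineddata names; whether this tool uses exactly these identifiers internally is an assumption.

```typescript
// Display name -> Tesseract traineddata code. The lookup helper is hypothetical.
const TESSERACT_CODES: Record<string, string> = {
  English: "eng",
  Spanish: "spa",
  French: "fra",
  German: "deu",
  Portuguese: "por",
  Italian: "ita",
  Dutch: "nld",
  Russian: "rus",
  Hindi: "hin",
  Arabic: "ara",
  Japanese: "jpn",
  Korean: "kor",
  "Simplified Chinese": "chi_sim",
  "Traditional Chinese": "chi_tra",
};

function languageCode(displayName: string): string {
  const code = TESSERACT_CODES[displayName];
  if (code === undefined) throw new Error(`Unsupported language: ${displayName}`);
  return code;
}
```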
Can I process a password-protected PDF?
No. Decrypt first with /tools/pdf-unlock (or any PDF reader's Save-without-password), then run this tool.