Someone emails you a PDF. You open it. You try to copy a sentence. Your cursor moves but nothing highlights. You zoom in to read something and the letters dissolve into pixels. The file is 40 MB for eight pages.
What happened: the PDF isn't a document. It's a stack of photographs of a document. Here's why that's different, and what you can do about it.
Two kinds of PDF
The PDF format covers two very different kinds of file:
- Text-based PDFs — generated from Word, a web page, a design tool. They store the letters. Selectable, searchable, sharp at any zoom, small file size.
- Image-based PDFs — from a scanner, phone camera, or "Print to PDF" of a screenshot. They store pixels. Not selectable, not searchable, fuzzy when zoomed, often huge.
Both are valid PDFs. They look similar at a quick glance. But they behave completely differently the moment you try to do anything with them.
Why this matters
An image-based PDF is, as far as the computer is concerned, a picture. You can't search it. You can't copy from it. If you convert it to DOCX with a naïve tool, you get a DOCX containing the picture, not editable text. Screen readers and other accessibility tools can't read it either.
How to tell which one you have
Open it and try to select a word with your cursor. If text highlights, you have a text-based PDF. If your cursor just sweeps over the page without selecting anything, you have an image-based PDF.
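If you have a folder full of PDFs, the same check can be scripted. The sketch below is one way to do it, not the only one: it assumes the pypdf package for text extraction, and the 20-character threshold is an arbitrary cutoff chosen for illustration. One caveat: a scan that has already been through OCR carries a hidden text layer, so it will be reported as text-based.

```python
from typing import Iterable


def classify_pages(page_texts: Iterable[str], min_chars: int = 20) -> str:
    """Classify a PDF from the text extracted from each of its pages.

    min_chars is an assumed cutoff: a page counts as "has text" when it
    yields at least that many non-whitespace characters.
    """
    has_text = [len("".join(text.split())) >= min_chars for text in page_texts]
    if not has_text:
        return "empty"          # no pages at all
    if all(has_text):
        return "text-based"     # every page has selectable text
    if not any(has_text):
        return "image-based"    # every page is just pixels
    return "mixed"              # some pages are scans, some are not


def classify_pdf(path: str) -> str:
    """Extract text with pypdf (assumed to be installed) and classify it."""
    from pypdf import PdfReader  # imported here so classify_pages works without it

    reader = PdfReader(path)
    return classify_pages((page.extract_text() or "") for page in reader.pages)
```

Calling classify_pdf("invoice.pdf") on a scanner output should come back as "image-based"; the file name here is only a placeholder.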
The fix: OCR
Optical character recognition (OCR) reads the pixels and extracts the text. After OCR, you have actual characters you can copy, search, and edit.
In Formatly, drop the PDF on the home page and pick OCR (Extract Text). You'll get back a .txt file containing the readable text. For a proper Word document, open the .txt in Word or Google Docs and format from there.
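For the curious, here is roughly what the same step looks like as a script. This is a minimal sketch and says nothing about how Formatly works internally. It assumes the open-source Tesseract engine is installed along with the pytesseract and Pillow packages, and that each page of the scan has already been saved as an image file. The ocr_pages helper and its ocr parameter are names invented for this example.

```python
from typing import Callable, Iterable, Optional


def _tesseract_ocr(image_path: str) -> str:
    # Assumption: Tesseract, pytesseract, and Pillow are all installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def ocr_pages(
    image_paths: Iterable[str],
    ocr: Optional[Callable[[str], str]] = None,
) -> str:
    """Run OCR on each page image and join the pages with blank lines.

    ocr is any function that maps an image path to recognized text;
    it defaults to Tesseract via pytesseract.
    """
    recognize = ocr or _tesseract_ocr
    pages = [recognize(path).strip() for path in image_paths]
    return "\n\n".join(pages)
```

Writing the result of ocr_pages(["page-1.png", "page-2.png"]) to a file gives you the same kind of .txt described above. Passing the recognizer in as a parameter also makes it easy to swap in a different OCR engine.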
Tips for better scans
If you're the one doing the scanning, a few things help OCR enormously:
- Scan at 300 DPI or higher. Below that, characters lose definition.
- Black-and-white for plain text, grayscale for pages that mix text and images. Color adds size without improving readability.
- Keep the page flat. Books curving near the spine distort the letters and confuse OCR.
- Crop tightly. Excess margin wastes bytes and can introduce noise.
- Good light. For phone captures, daylight or uniform overhead lighting beats the mix of shadows most rooms have.
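The 300 DPI tip is easy to check for a phone photo, since DPI is just pixels divided by inches. The sketch below assumes a US Letter page (8.5 by 11 inches) that fills the whole frame after cropping; both function names are made up for this example.

```python
def effective_dpi(width_px: int, height_px: int,
                  width_in: float, height_in: float) -> float:
    """Pixels per inch, limited by whichever axis is sampled more coarsely."""
    return min(width_px / width_in, height_px / height_in)


def sharp_enough(width_px: int, height_px: int,
                 width_in: float = 8.5, height_in: float = 11.0,
                 target_dpi: float = 300.0) -> bool:
    """True if the image reaches target_dpi for the given paper size.

    Defaults assume a US Letter page that fills the frame.
    """
    return effective_dpi(width_px, height_px, width_in, height_in) >= target_dpi


# A 3024 x 4032 phone photo of a Letter page works out to about 356 DPI.
print(round(effective_dpi(3024, 4032, 8.5, 11.0)))  # 356
# A 1700 x 2200 image of the same page is only 200 DPI.
print(sharp_enough(1700, 2200))  # False
```

If the page only fills half the frame, the effective DPI is halved too, which is one more reason to crop tightly and hold the phone close.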
Why the file is so big
Each page is a photo, often at high resolution, sometimes in full color. Eight photos easily add up to 40 MB. A text-based PDF of the same content might be 200 KB.
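A quick back-of-the-envelope calculation shows where the bytes come from. The sketch below assumes a US Letter page scanned at 300 DPI and counts uncompressed pixels only; real PDFs compress their images, so actual files are smaller than these raw figures.

```python
def raw_page_bytes(width_in: float, height_in: float,
                   dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of one scanned page, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page: US Letter (8.5 x 11 inches) at 300 DPI = 2550 x 3300 pixels.
color = raw_page_bytes(8.5, 11, 300, 24)    # full color, 24 bits per pixel
gray = raw_page_bytes(8.5, 11, 300, 8)      # grayscale, 8 bits per pixel
bitonal = raw_page_bytes(8.5, 11, 300, 1)   # black-and-white, 1 bit per pixel

print(color)      # 25245000  (about 25 MB per page)
print(gray)       # 8415000   (about 8.4 MB per page)
print(bitonal)    # 1051875   (about 1 MB per page)
print(8 * color)  # 201960000 (about 202 MB for eight color pages)
```

Eight color pages come to roughly 202 MB before compression, so a 40 MB file means the images were squeezed about five to one. The same arithmetic shows why scanning plain text in black-and-white pays off: each page starts out at one twenty-fourth the size of a color scan.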
If you need a smaller file, convert the scan's individual pages to JPG with moderate compression, or re-export after OCR as a text-based PDF. Either way, the size can drop by an order of magnitude or more.