OCR (Optical Character Recognition)
Tool ID: ocr-pdf
Make scanned PDFs searchable and selectable by recognizing text in images. Uses the Tesseract OCR engine.
When You Need OCR
- Scanned paper documents (no text layer)
- Photos of documents or whiteboards
- Image-only PDFs where you can't select or search text
How to Use
- Upload Your PDF - Select a scanned or image-based PDF
- Select Language(s) - Choose the language(s) in your document
- Configure Options - Adjust OCR mode and preprocessing (optional)
- Process - Run OCR
- Download - Get your searchable PDF
Options
| Option | Values | Description |
|---|---|---|
| Languages | Select from installed packs | Must match the languages in your document. Select multiple if needed |
| OCR Mode | Auto (default), Force, Strict | Auto skips pages that already have text. Force re-OCRs everything. Strict aborts if any text is found |
| Compatibility Mode | On/Off | Uses sandwich PDF format for better compatibility with older software (larger files) |
Advanced Options
| Option | Description |
|---|---|
| Deskew | Automatically straighten tilted/skewed pages |
| Clean Input | Preprocess by removing noise and enhancing contrast for better recognition |
| Clean Output | Post-process the final PDF to remove OCR artifacts |
| Create Text File | Generate a separate .txt file with the extracted text (output as ZIP) |
Advanced options require OCRmyPDF. With Tesseract only, they are ignored.
Language Packs
Available languages depend on which Tesseract language packs are installed. The default Docker image includes English, German, French, Portuguese, and Chinese Simplified. To add more languages, see the OCR Configuration Guide.
Limitations
- Handwritten text has limited accuracy
- Stylized/decorative fonts and very small text (< 8pt) are challenging
- For best results, use 300 DPI or higher scans with good contrast
API Usage
curl -X POST http://stirling-pdf:8080/api/v1/misc/ocr-pdf \
-F "[email protected]" \
-F "languages=eng" \
-F "languages=spa" \
-F "ocrType=skip-text" \
-F "ocrRenderType=hocr" \
-F "deskew=true" \
-F "clean=true" \
-F "cleanFinal=true" \
-F "sidecar=false" \
-o searchable.pdf
See API Documentation for complete endpoint reference.
Related Tools
- Convert - Convert OCR'd PDFs to Word, text, or other formats
- Compress - Reduce file size after OCR
- Auto-Rename - Rename files based on OCR'd content