OCR (Optical Character Recognition)
Tool ID: ocr-pdf
Make scanned PDFs searchable and editable by recognizing the text in their page images. Stirling-PDF's OCR tool uses the Tesseract OCR engine to extract text from image-based PDFs and convert them into searchable, selectable documents.
What is OCR?
OCR (Optical Character Recognition) is technology that recognizes text within images and converts it into actual, selectable text. This makes scanned documents searchable, editable, and accessible.
When You Need OCR
- Scanned paper documents (no text layer)
- Photos of documents or whiteboards
- Screenshots of text
- Image-only PDFs
- PDFs where you can't select or search text
How to Use OCR
- Upload Your PDF - Select a scanned or image-based PDF
- Select Language(s) - Choose the language(s) in your document
- Configure Options - Adjust OCR settings (optional)
- Process - Run OCR on the document
- Download - Get your searchable PDF
Language Support
Stirling-PDF supports OCR in 100+ languages including:
Common Languages
- English - eng
- Spanish - spa
- French - fra
- German - deu
- Italian - ita
- Portuguese - por
- Russian - rus
- Chinese (Simplified) - chi_sim
- Chinese (Traditional) - chi_tra
- Japanese - jpn
- Korean - kor
- Arabic - ara
- Hindi - hin
- Dutch - nld
- Polish - pol
- Turkish - tur
- Vietnamese - vie
- Thai - tha
Multiple Languages
If your document contains multiple languages, you can select multiple language packs for better accuracy.
Example: A document with English and Spanish text should have both eng and spa selected.
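When OCR is called programmatically, the `languages` parameter described under API Usage takes the selected codes as one `+`-separated string. The following is a minimal Python sketch of building that value; `join_languages` is a hypothetical helper name used for illustration and is not part of Stirling-PDF.

```python
def join_languages(codes):
    """Combine Tesseract language codes into one '+'-separated string.

    Surrounding whitespace and duplicate codes are dropped; order is kept.
    """
    unique = []
    for code in codes:
        code = code.strip()
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)


print(join_languages(["eng", "spa"]))  # eng+spa
```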
OCR Options
Layout Preservation
Options:
- Preserve Original Layout - Maintains original page structure, formatting, and layout
- Simple Text Layer - Adds searchable text without preserving complex formatting
- Clean Text Only - Extracts text without any layout preservation
Recommendation: Use "Preserve Original Layout" for documents where visual structure matters (forms, tables, multi-column layouts).
OCR Quality Settings
Options:
- Fast - Quick processing, good for clean scans
- Balanced - Good quality with reasonable speed (recommended)
- Best - Maximum accuracy, slower processing
Preprocessing Options
Improve OCR accuracy by preprocessing images:
- Auto-rotate - Automatically detect and correct page orientation
- Deskew - Fix slightly tilted/skewed scans
- Despeckle - Remove noise and artifacts from scans
- Remove Background - Clean up paper texture and shadows
- Enhance Contrast - Improve readability of faded text
Tip: For poor quality scans, enable multiple preprocessing options.
Best Practices
For Best OCR Results
- Use high-quality scans
  - 300 DPI or higher recommended
  - Higher resolution = better accuracy
  - Minimum 150 DPI for acceptable results
- Clean, clear images
  - High contrast between text and background
  - Minimal shadows or stains
  - Sharp focus, not blurry
- Correct orientation
  - Text should be right-side up
  - Use auto-rotate if unsure
- Select correct language
  - Choose all languages present in document
  - Wrong language = poor accuracy
- Preprocess poor scans
  - Enable deskew for tilted pages
  - Use despeckle for noisy scans
  - Enhance contrast for faded text
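To check a scan against the DPI guidance above, divide its pixel dimensions by the physical page size in inches. This is a small Python sketch of that arithmetic. The function names are hypothetical, and the thresholds are simply the 150 and 300 DPI figures from this guide.

```python
def effective_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Return the lower of the horizontal and vertical resolution in DPI."""
    if min(pixel_width, pixel_height, page_width_in, page_height_in) <= 0:
        raise ValueError("dimensions must be positive")
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def rate_scan(dpi):
    """Classify a resolution using the thresholds from this guide."""
    if dpi >= 300:
        return "recommended"
    if dpi >= 150:
        return "acceptable"
    return "too low"


# A US Letter page (8.5 x 11 in) scanned to 2550 x 3300 pixels.
dpi = effective_dpi(2550, 3300, 8.5, 11)
print(round(dpi), rate_scan(dpi))  # 300 recommended
```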
Document Types
Works Best With:
- ✅ Scanned documents (text documents, contracts, letters)
- ✅ Photos of documents taken with good lighting
- ✅ Clean screenshots of text
- ✅ Printed text (books, magazines, reports)
Challenging Cases:
- ⚠️ Handwritten text (limited accuracy)
- ⚠️ Stylized or decorative fonts
- ⚠️ Very small text (< 8pt font)
- ⚠️ Low resolution images
- ⚠️ Heavily compressed or artifacted images
Common Issues
"No text recognized"
Possible Causes:
- Wrong language selected
- Image quality too poor
- Text too small or blurry
- Extreme skew/rotation
Solutions:
- Verify correct language pack selected
- Use higher quality scan
- Enable preprocessing options
- Check document orientation
"Poor accuracy / Garbled text"
Possible Causes:
- Low quality scan
- Wrong language selected
- Unusual font or formatting
- Background interference
Solutions:
- Increase scan resolution
- Select multiple language packs
- Enable despeckle and contrast enhancement
- Clean up document before scanning
"Processing takes too long"
Possible Causes:
- Large document (many pages)
- High resolution images
- "Best" quality setting
- Multiple preprocessing options
Solutions:
- Process in smaller batches
- Use "Balanced" quality setting
- Reduce resolution if very high
- Disable unnecessary preprocessing
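One way to process in smaller batches is to split a long document into page ranges and run OCR on each part separately. The range arithmetic can be sketched in Python as follows; `page_batches` is a hypothetical helper, and the batch size of 50 is only an example value.

```python
def page_batches(total_pages, batch_size):
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be at least 1")
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]


print(page_batches(120, 50))  # [(1, 50), (51, 100), (101, 120)]
```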
Technical Details
OCR Engine
Stirling-PDF uses Tesseract OCR, a widely used open-source OCR engine originally developed at HP, later sponsored by Google, and now maintained as a community open-source project.
Key Features:
- Over 100 languages supported
- Multiple output formats
- Layout analysis and preservation
- Character and word confidence scores
Processing Steps
- Image Analysis - Detect page layout and text regions
- Preprocessing - Apply selected image enhancements
- Text Recognition - Recognize characters using language models
- Layout Reconstruction - Preserve original formatting
- PDF Generation - Create searchable PDF with text layer
Output Format
OCR produces a PDF with embedded text layer:
- Original image preserved (visual appearance unchanged)
- Invisible text layer added on top
- Text is searchable and selectable
- Layout matches original document
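For comparison, the Tesseract command-line tool can produce this kind of searchable PDF from a single page image on its own. The Python sketch below only builds the command. It assumes `tesseract` is installed and on the PATH, and it does not describe how Stirling-PDF invokes the engine internally. `build_tesseract_command` and the file names are hypothetical.

```python
import shlex


def build_tesseract_command(image_path, output_base, languages, psm=3):
    """Build a Tesseract CLI call that writes <output_base>.pdf.

    psm=3 is Tesseract's fully automatic page segmentation (its default).
    The trailing "pdf" selects Tesseract's searchable-PDF output.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-l", "+".join(languages),
        "--psm", str(psm),
        "pdf",
    ]


cmd = build_tesseract_command("page-001.png", "page-001", ["eng", "spa"])
print(shlex.join(cmd))
# tesseract page-001.png page-001 -l eng+spa --psm 3 pdf
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)`.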
Configuration
Installing Language Packs
By default, Stirling-PDF includes common language packs. To add additional languages:
Docker:
# Install additional language packs
RUN apt-get update && apt-get install -y \
    tesseract-ocr-ara \
    tesseract-ocr-chi-sim \
    tesseract-ocr-jpn
See: OCR Configuration Guide for detailed setup instructions.
Environment Variables
# OCR Settings
system:
  ocr:
    enabled: true
    languages: "eng,spa,fra,deu" # Default languages
    pageSegmentationMode: auto
Use with Other Tools
Common Workflows
OCR → Convert to Word
- Run OCR to make document searchable
- Use Convert to export to DOCX
- Edit document in Word
OCR → Search & Redact
- Run OCR to add text layer
- Use search to find sensitive information
- Use Redact to remove it
OCR → Extract Data
- Run OCR on scanned forms/invoices
- Use PDF to CSV to extract tables
- Import data into spreadsheet
Scan → OCR → Compress
- Scan documents to PDF
- Run OCR to make searchable
- Use Compress to reduce file size
API Usage
Perform OCR programmatically via API:
curl -X POST http://stirling-pdf:8080/api/v1/ocr/pdf \
  -F "fileInput=@scanned.pdf" \
  -F "languages=eng+spa" \
  -F "sidecar=false" \
  -F "deskew=true" \
  -F "clean=true" \
  -F "cleanFinal=true" \
  -F "ocrType=force" \
  -F "ocrRenderType=hocr" \
  -o searchable.pdf
Parameters:
- languages - Language codes (+ separated)
- sidecar - Generate separate text file
- deskew - Fix tilted pages
- clean - Remove noise
- cleanFinal - Final cleanup
- ocrType - skip, force, or auto
- ocrRenderType - Output format
See API Documentation for complete endpoint reference.
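The same request can be assembled from Python using only the standard library. The sketch below mirrors the curl example above: the URL, field names, and values are copied from it and should be checked against the API documentation of your own instance. `build_ocr_request` is a hypothetical helper, and the PDF bytes shown are a placeholder rather than a real document.

```python
import urllib.request
import uuid


def build_ocr_request(url, pdf_bytes, filename, fields):
    """Build a multipart/form-data POST mirroring the curl example.

    `fields` maps form field names to string values. The PDF is sent
    under the `fileInput` field, as in the curl command.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        text = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="fileInput"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8"))
    parts.append(pdf_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        url,
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


fields = {
    "languages": "eng+spa",
    "sidecar": "false",
    "deskew": "true",
    "clean": "true",
    "cleanFinal": "true",
    "ocrType": "force",
    "ocrRenderType": "hocr",
}
request = build_ocr_request(
    "http://stirling-pdf:8080/api/v1/ocr/pdf",  # URL taken from the curl example
    b"%PDF-1.7 placeholder",                     # replace with the real file bytes
    "scanned.pdf",
    fields,
)
print(request.get_method(), request.full_url)
```

Nothing is sent until the request is opened. With a reachable server, `urllib.request.urlopen(request)` returns a response whose body can be written to `searchable.pdf`.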
Related Tools
- Convert - Convert OCR'd PDFs to Word, text, or other formats
- Compress - Reduce file size after OCR
- Multi-Tool - Chain OCR with other operations
- Auto-Rename - Rename files based on OCR'd content
Summary
Stirling-PDF's OCR tool provides:
- ✅ 100+ language support - Recognize text in over 100 languages
- ✅ Layout preservation - Maintain original document formatting
- ✅ Batch processing - OCR multiple files at once
- ✅ Preprocessing options - Enhance poor quality scans
- ✅ Industry-standard engine - Tesseract OCR with proven accuracy
- ✅ API access - Automate OCR workflows
Perfect for digitizing scanned documents, making PDFs searchable, and extracting text from images!