Version: 2.0 (Beta)

OCR (Optical Character Recognition)

Tool ID: ocr-pdf

Make scanned PDFs searchable and editable by recognizing text in images. Stirling-PDF's OCR tool uses the Tesseract OCR engine to extract text from image-based PDFs and convert them into searchable, selectable documents.

What is OCR?

OCR (Optical Character Recognition) is a technology that recognizes text within images and converts it into actual, selectable text. This makes scanned documents searchable, editable, and accessible.

When You Need OCR

Scanned paper documents (no text layer)
Photos of documents or whiteboards
Screenshots of text
Image-only PDFs
PDFs where you can't select or search text

How to Use OCR

Upload Your PDF - Select a scanned or image-based PDF
Select Language(s) - Choose the language(s) in your document
Configure Options - Adjust OCR settings (optional)
Process - Run OCR on the document
Download - Get your searchable PDF

Language Support

Stirling-PDF supports OCR in 100+ languages including:

Common Languages

English - eng
Spanish - spa
French - fra
German - deu
Italian - ita
Portuguese - por
Russian - rus
Chinese (Simplified) - chi_sim
Chinese (Traditional) - chi_tra
Japanese - jpn
Korean - kor
Arabic - ara
Hindi - hin
Dutch - nld
Polish - pol
Turkish - tur
Vietnamese - vie
Thai - tha

Multiple Languages

If your document contains multiple languages, you can select multiple language packs for better accuracy.

Example: A document with English and Spanish text should have both eng and spa selected.
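
As a minimal sketch of combining several languages, the helper below joins language codes into the `+`-separated form used in the API example on this page (`eng+spa`). The function name `build_language_param` and the `KNOWN_CODES` set are illustrative assumptions, not part of Stirling-PDF; the set is copied from the Common Languages list above and does not reflect which packs a given installation actually has.

```python
# Codes copied from the "Common Languages" list; a real install may have more or fewer.
KNOWN_CODES = {
    "eng", "spa", "fra", "deu", "ita", "por", "rus", "chi_sim", "chi_tra",
    "jpn", "kor", "ara", "hin", "nld", "pol", "tur", "vie", "tha",
}


def build_language_param(codes):
    """Join language codes with '+', dropping duplicates but keeping order."""
    seen = []
    for code in codes:
        code = code.strip().lower()
        if code not in KNOWN_CODES:
            raise ValueError(f"unknown language code: {code!r}")
        if code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("select at least one language")
    return "+".join(seen)


print(build_language_param(["eng", "spa"]))  # eng+spa
```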

OCR Options

Layout Preservation

Options:

Preserve Original Layout - Maintains original page structure, formatting, and layout
Simple Text Layer - Adds searchable text without preserving complex formatting
Clean Text Only - Extracts text without any layout preservation

Recommendation: Use "Preserve Original Layout" for documents where visual structure matters (forms, tables, multi-column layouts).

OCR Quality Settings

Options:

Fast - Quick processing, good for clean scans
Balanced - Good quality with reasonable speed (recommended)
Best - Maximum accuracy, slower processing

Preprocessing Options

Improve OCR accuracy by preprocessing images:

Auto-rotate - Automatically detect and correct page orientation
Deskew - Fix slightly tilted/skewed scans
Despeckle - Remove noise and artifacts from scans
Remove Background - Clean up paper texture and shadows
Enhance Contrast - Improve readability of faded text

Tip: For poor quality scans, enable multiple preprocessing options.
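
One way to read the option list above is as a mapping from scan problems to fixes. The sketch below encodes that mapping. The problem labels (`"tilted"`, `"noisy"`, and so on) and the function name are hypothetical; only the option names come from the list above.

```python
# Scan problem -> preprocessing option, following the option descriptions above.
# The problem labels on the left are hypothetical names chosen for this sketch.
FIX_FOR_PROBLEM = {
    "wrong_orientation": "Auto-rotate",
    "tilted": "Deskew",
    "noisy": "Despeckle",
    "textured_background": "Remove Background",
    "faded": "Enhance Contrast",
}


def suggest_preprocessing(problems):
    """Return the options to enable for the observed problems, without duplicates."""
    options = []
    for problem in problems:
        option = FIX_FOR_PROBLEM.get(problem)
        if option is None:
            raise ValueError(f"no preprocessing option listed for {problem!r}")
        if option not in options:
            options.append(option)
    return options


print(suggest_preprocessing(["tilted", "faded"]))  # ['Deskew', 'Enhance Contrast']
```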

Best Practices

For Best OCR Results

Use high-quality scans
- 300 DPI or higher recommended
- Higher resolution generally means better accuracy, at the cost of slower processing
- Minimum 150 DPI for acceptable results
Clean, clear images
- High contrast between text and background
- Minimal shadows or stains
- Sharp focus, not blurry
Correct orientation
- Text should be right-side up
- Use auto-rotate if unsure
Select correct language
- Choose all languages present in document
- Wrong language = poor accuracy
Preprocess poor scans
- Enable deskew for tilted pages
- Use despeckle for noisy scans
- Enhance contrast for faded text
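
To check a scan against the resolution guidance above, the effective DPI can be estimated from the pixel width of the image and the physical width of the page. This is a rough sketch: the function names are illustrative, and the 300 and 150 DPI thresholds are the ones quoted in this section.

```python
def effective_dpi(pixel_width, page_width_inches):
    """Estimate scan resolution from image width in pixels and paper width in inches."""
    if pixel_width <= 0 or page_width_inches <= 0:
        raise ValueError("dimensions must be positive")
    return pixel_width / page_width_inches


def rate_resolution(dpi):
    """Classify a resolution using the thresholds from this section."""
    if dpi >= 300:
        return "recommended"
    if dpi >= 150:
        return "acceptable"
    return "too low"


# A US Letter page (8.5 in wide) scanned to an image 2550 pixels wide.
dpi = effective_dpi(2550, 8.5)
print(dpi, rate_resolution(dpi))  # 300.0 recommended
```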

Document Types

Works Best With:

✅ Scanned documents (text documents, contracts, letters)
✅ Photos of documents taken with good lighting
✅ Clean screenshots of text
✅ Printed text (books, magazines, reports)

Challenging Cases:

⚠️ Handwritten text (limited accuracy)
⚠️ Stylized or decorative fonts
⚠️ Very small text (< 8pt font)
⚠️ Low resolution images
⚠️ Heavily compressed or artifacted images

Common Issues

"No text recognized"

Possible Causes:

Wrong language selected
Image quality too poor
Text too small or blurry
Extreme skew/rotation

Solutions:

Verify correct language pack selected
Use higher quality scan
Enable preprocessing options
Check document orientation

"Poor accuracy / Garbled text"

Possible Causes:

Low quality scan
Wrong language selected
Unusual font or formatting
Background interference

Solutions:

Increase scan resolution
Select all language packs that match the document's languages
Enable despeckle and contrast enhancement
Clean up document before scanning

"Processing takes too long"

Possible Causes:

Large document (many pages)
High resolution images
"Best" quality setting
Multiple preprocessing options

Solutions:

Process in smaller batches
Use "Balanced" quality setting
Reduce resolution if very high
Disable unnecessary preprocessing
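
Processing in smaller batches means splitting the document into consecutive page ranges and running OCR on each one. A minimal sketch of the range calculation follows. The function name and the batch size of 25 are arbitrary choices for illustration, and splitting the PDF itself would be done with a separate tool.

```python
def page_batches(total_pages, batch_size):
    """Return inclusive, 1-based (first, last) page ranges covering the document."""
    if total_pages < 0 or batch_size <= 0:
        raise ValueError("total_pages must be >= 0 and batch_size > 0")
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]


print(page_batches(60, 25))  # [(1, 25), (26, 50), (51, 60)]
```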

Technical Details

OCR Engine

Stirling-PDF uses Tesseract OCR, an industry-standard open-source OCR engine originally developed by HP, later sponsored by Google, and now maintained by the open-source community.

Key Features:

Over 100 languages supported
Multiple output formats
Layout analysis and preservation
Character and word confidence scores

Processing Steps

Image Analysis - Detect page layout and text regions
Preprocessing - Apply selected image enhancements
Text Recognition - Recognize characters using language models
Layout Reconstruction - Preserve original formatting
PDF Generation - Create searchable PDF with text layer

Output Format

OCR produces a PDF with an embedded text layer:

Original image preserved (visual appearance unchanged)
Invisible text layer added on top
Text is searchable and selectable
Layout matches original document

Configuration

Installing Language Packs

By default, Stirling-PDF includes common language packs. To add more languages:

Docker:

# Install additional language packs
RUN apt-get update && apt-get install -y \
    tesseract-ocr-ara \
    tesseract-ocr-chi-sim \
    tesseract-ocr-jpn

See: OCR Configuration Guide for detailed setup instructions.

Environment Variables

# OCR Settings
system:
  ocr:
    enabled: true
    languages: "eng,spa,fra,deu"  # Default languages
    pageSegmentationMode: auto

Use with Other Tools

Common Workflows

OCR → Convert to Word

Run OCR to make document searchable
Use Convert to export to DOCX
Edit document in Word

OCR → Search & Redact

Run OCR to add text layer
Use search to find sensitive information
Use Redact to remove it

OCR → Extract Data

Run OCR on scanned forms/invoices
Use PDF to CSV to extract tables
Import data into spreadsheet

Scan → OCR → Compress

Scan documents to PDF
Run OCR to make searchable
Use Compress to reduce file size

API Usage

Perform OCR programmatically via the API:

curl -X POST http://stirling-pdf:8080/api/v1/ocr/pdf \
  -F "fileInput=@scanned.pdf" \
  -F "languages=eng+spa" \
  -F "sidecar=false" \
  -F "deskew=true" \
  -F "clean=true" \
  -F "cleanFinal=true" \
  -F "ocrType=force" \
  -F "ocrRenderType=hocr" \
  -o searchable.pdf

Parameters:

languages - Language codes (+ separated)
sidecar - Generate separate text file
deskew - Fix tilted pages
clean - Remove noise
cleanFinal - Final cleanup
ocrType - skip, force, or auto
ocrRenderType - Output format
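
For scripts that cannot shell out to curl, the same request can be sketched with only the Python standard library. The endpoint, field names, and values below are copied from the curl example above. The helper names `encode_multipart` and `ocr_pdf` are illustrative assumptions, and the code assumes the server returns the processed PDF in the response body, as the `-o searchable.pdf` flag in the curl example suggests.

```python
import os
import urllib.request
import uuid


def encode_multipart(fields, file_field, filename, file_bytes):
    """Build a multipart/form-data body; returns (body_bytes, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def ocr_pdf(pdf_path, out_path, base_url="http://stirling-pdf:8080"):
    """POST a PDF to the endpoint from the curl example and save the response."""
    fields = {
        "languages": "eng+spa",
        "sidecar": "false",
        "deskew": "true",
        "clean": "true",
        "cleanFinal": "true",
        "ocrType": "force",
        "ocrRenderType": "hocr",
    }
    with open(pdf_path, "rb") as f:
        body, content_type = encode_multipart(
            fields, "fileInput", os.path.basename(pdf_path), f.read()
        )
    request = urllib.request.Request(
        f"{base_url}/api/v1/ocr/pdf",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on a non-2xx status.
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

Calling `ocr_pdf("scanned.pdf", "searchable.pdf")` would mirror the curl command. Building the body by hand keeps the sketch dependency-free; a third-party HTTP client would shorten it considerably.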

See API Documentation for complete endpoint reference.

Related Tools

Convert - Convert OCR'd PDFs to Word, text, or other formats
Compress - Reduce file size after OCR
Multi-Tool - Chain OCR with other operations
Auto-Rename - Rename files based on OCR'd content

Summary

Stirling-PDF's OCR tool provides:

✅ 100+ language support - Recognize text in over 100 languages
✅ Layout preservation - Maintain original document formatting
✅ Batch processing - OCR multiple files at once
✅ Preprocessing options - Enhance poor quality scans
✅ Industry-standard engine - Tesseract OCR with proven accuracy
✅ API access - Automate OCR workflows

Perfect for digitizing scanned documents, making PDFs searchable, and extracting text from images!
