OCR Language Packs and Setup

This document explains how to add language packs for the OCR tab in Stirling-PDF, both inside and outside of Docker.

How the OCR Works

Stirling-PDF uses Tesseract for its text recognition. All credit goes to them for this awesome work!

Language Packs

Tesseract OCR supports a variety of languages. You can find additional language packs in the Tesseract GitHub repositories:

  • tessdata_fast: These language packs are smaller and faster to load but may provide lower recognition accuracy.
  • tessdata: These language packs are larger and provide better recognition accuracy, but may take longer to load.

Depending on your requirements, you can choose the appropriate language pack for your use case. By default, Stirling-PDF uses tessdata_fast for English, but this can be replaced.

Installing Language Packs Manually

  1. Download the desired language pack(s) by selecting the .traineddata file(s) for the language(s) you need.
  2. Place the .traineddata files in the Tesseract tessdata directory: /usr/share/tessdata (or equivalent)
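
The two steps above could be scripted roughly as follows. This is a minimal sketch, assuming the packs are fetched with curl from the `main` branch of the Tesseract GitHub repositories. The helper names `traineddata_url` and `install_pack`, the URL layout, and the `deu` language code are illustrative assumptions, not something the Stirling-PDF documentation prescribes.

```shell
#!/bin/sh
# Sketch: download a Tesseract language pack into the tessdata directory.
# Assumptions (not from the Stirling-PDF docs): helper names, URL layout,
# and the use of curl.

# Build the download URL for a language code and a repository
# (tessdata_fast or tessdata; defaults to tessdata_fast).
traineddata_url() {
    lang="$1"
    repo="${2:-tessdata_fast}"
    printf 'https://github.com/tesseract-ocr/%s/raw/main/%s.traineddata\n' "$repo" "$lang"
}

# Download one pack into the target directory
# (defaults to /usr/share/tessdata).
install_pack() {
    lang="$1"
    repo="${2:-tessdata_fast}"
    dest="${3:-/usr/share/tessdata}"
    curl -fL -o "$dest/$lang.traineddata" "$(traineddata_url "$lang" "$repo")"
}

# Example (needs network access and write permission on the directory):
# install_pack deu tessdata_fast /usr/share/tessdata
```

Passing `tessdata` instead of `tessdata_fast` as the second argument would fetch the larger, more accurate variant of the same language.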


Docker Setup

If you are using Docker, you need to expose the Tesseract tessdata directory as a volume in order to use the additional language packs.

Modify your docker-compose.yml file to include the following volume configuration:

image: your_docker_image_name
volumes:
  - /location/of/trainingData:/usr/share/tessdata
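
Before starting the container, it may help to check which packs the host directory actually holds. A bind mount hides whatever the image ships at that path, so the directory should contain every language you need, most likely including English. The sketch below lists the language codes found in a directory; the function name `list_packs` is hypothetical and the path is the placeholder from the configuration above.

```shell
#!/bin/sh
# Sketch: list the language codes available in a host directory of
# .traineddata files before mounting it into the container.
# The function name is hypothetical.
list_packs() {
    dir="$1"
    for f in "$dir"/*.traineddata; do
        # Skip the unexpanded pattern when the directory has no packs.
        [ -e "$f" ] || continue
        basename "$f" .traineddata
    done
}

# Example:
# list_packs /location/of/trainingData
```

Once the container is running, running `tesseract --list-langs` inside it (for example through `docker exec`) should report the same codes, assuming the volume is mounted as shown.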

Non-Docker Setup

For Debian-based systems, use the following commands to manage Tesseract languages:

sudo apt update

# Install all languages (optional):
# sudo apt install -y 'tesseract-ocr-*'

# Find available languages:
apt search tesseract-ocr-

# View installed languages:
dpkg-query -W 'tesseract-ocr-*' | sed 's/tesseract-ocr-//g'
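
To install a single language rather than all of them, the package name can be derived from the Tesseract language code. The sketch below assumes the Debian naming scheme `tesseract-ocr-<code>`, with underscores in the code written as hyphens (for example `chi_sim` becomes `tesseract-ocr-chi-sim`). The helper name `tess_pkg_name` is hypothetical, and it is worth confirming the package name with `apt search` before installing.

```shell
#!/bin/sh
# Sketch: map a Tesseract language code to its Debian package name.
# Assumption: packages are named tesseract-ocr-<code>, with underscores
# in the code replaced by hyphens. The helper name is hypothetical.
tess_pkg_name() {
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

# Example (requires root privileges):
# sudo apt install -y "$(tess_pkg_name deu)"
# tesseract --list-langs
```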