Fix Garbled PDF Text: Apache Tika Docker Tutorial

Why use Tika

Text copied from some PDFs comes out garbled because the fonts use a custom encoding that the viewer's copy function does not map back to Unicode.
Apache Tika extracts the PDF's text layer directly and, for many such files, returns clean Unicode text without OCR.


1. Install Docker (Ubuntu)

sudo apt update
sudo apt install -y docker.io curl
sudo systemctl enable --now docker

2. Start Tika Server

sudo docker pull apache/tika:latest-full
sudo docker run -d --name tika \
  -p 127.0.0.1:9998:9998 \
  apache/tika:latest-full

Check it:

curl -s http://127.0.0.1:9998 | head
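
The container may need a few seconds before it answers, so a script that starts Tika and converts files right away can hit a refused connection. A minimal sketch of a readiness check, assuming the address from step 2; the helper name wait_for_tika and the default retry count and delay are illustrative choices, not part of Tika:

```shell
# Poll the Tika server until it answers, or give up after a number of tries.
# wait_for_tika is a hypothetical helper; URL, tries and delay are assumptions.
wait_for_tika() {
    url="${1:-http://127.0.0.1:9998}"
    tries="${2:-30}"
    delay="${3:-1}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f: non-zero exit on HTTP errors; -s: no progress output
        if curl -fs "$url" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example: wait_for_tika || { echo "Tika server not reachable" >&2; exit 1; }
```

With the defaults this waits up to roughly 30 seconds before giving up.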

3. Convert a Single PDF (example: Japanese)

curl -sT sample-jp.pdf http://127.0.0.1:9998/tika \
     -H "Accept: text/plain" > sample-jp.txt

sample-jp.txt will contain proper Japanese text if the PDF has a text layer.
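
An image-only PDF typically yields an empty or whitespace-only result instead of an error, so it can help to check the output before relying on it. A small sketch of such a check; has_text and check_extraction are hypothetical helper names, and "empty output means no text layer" is a heuristic, not a guarantee:

```shell
# Succeed if the file contains at least one non-whitespace character.
# has_text is a hypothetical helper, not part of Tika.
has_text() {
    grep -q '[^[:space:]]' "$1" 2>/dev/null
}

# Report whether an extraction produced usable text.
# $1 = source PDF, $2 = extracted text file
check_extraction() {
    if has_text "$2"; then
        echo "ok: $2"
    else
        echo "empty: $1 may be image-only (no text layer)" >&2
        return 1
    fi
}

# Example: check_extraction sample-jp.pdf sample-jp.txt
```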


4. Batch Conversion Script

Save as tika_convert.sh in the folder with your PDFs:

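#!/usr/bin/env bash
# Convert every PDF in the current directory to plain text via a local Tika server.
TIKA_HOST="127.0.0.1"
TIKA_PORT="9998"
TIKA_URL="http://${TIKA_HOST}:${TIKA_PORT}"

# Abort early if the server does not answer.
if ! curl -fs "$TIKA_URL" >/dev/null; then
    echo "Tika server not reachable" >&2; exit 1
fi

for f in *.pdf *.PDF; do
    [ -f "$f" ] || continue          # skip glob patterns that matched nothing
    out="${f%.*}.txt"                # strip .pdf or .PDF, then add .txt
    echo "→ $f"
    # -f makes curl fail on HTTP errors instead of saving the error page
    if curl -fsT "$f" "${TIKA_URL}/tika" \
        -H "Accept: text/plain" > "$out"; then
        echo "   ✔ $out"
    else
        echo "   ✘ failed: $f" >&2
        rm -f "$out"
    fi
done

echo "Done."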

Run:

chmod +x tika_convert.sh
./tika_convert.sh

Each PDF will produce a matching .txt file.
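
After a large batch it can be useful to confirm that claim. The sketch below lists every PDF that has no non-empty .txt next to it, assuming each output is named by replacing the PDF's extension with .txt; list_missing_outputs is a hypothetical helper name:

```shell
# Print every PDF in a directory that lacks a non-empty .txt next to it.
# list_missing_outputs is a hypothetical helper; the naming rule
# (extension replaced by .txt) is an assumption about the batch script.
list_missing_outputs() {
    dir="${1:-.}"
    for f in "$dir"/*.pdf "$dir"/*.PDF; do
        [ -f "$f" ] || continue
        out="${f%.*}.txt"
        # -s is true only if the file exists and is larger than zero bytes
        [ -s "$out" ] || printf '%s\n' "$f"
    done
}

# Example: list_missing_outputs .
```

Files reported here either failed to convert or produced no text, which often points to an image-only PDF.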


Notes

  • Works for any language if the PDF contains real text.
  • No OCR or language packs needed.
  • If the PDF is image-only, you must enable OCR and install the corresponding Tesseract language data.
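
For the image-only case, Tika Server accepts per-request OCR settings as HTTP headers. A hedged sketch, assuming the X-Tika-PDFOcrStrategy and X-Tika-OCRLanguage headers described in the Tika Server documentation, and assuming the requested Tesseract language data (jpn in the example) is actually installed in the container; tika_ocr is a hypothetical helper name:

```shell
# Request OCR-based extraction for an image-only PDF.
# tika_ocr is a hypothetical helper; the language code must match
# Tesseract data that is available inside the Tika container.
tika_ocr() {
    pdf="$1"
    lang="${2:-eng}"
    curl -fsT "$pdf" "http://127.0.0.1:9998/tika" \
        -H "Accept: text/plain" \
        -H "X-Tika-PDFOcrStrategy: ocr_only" \
        -H "X-Tika-OCRLanguage: $lang"
}

# Example: tika_ocr scanned-jp.pdf jpn > scanned-jp.txt
```

OCR is much slower than reading a text layer, so it is best reserved for files where plain extraction comes back empty.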
