Fix Garbled PDF Text: Apache Tika Docker Tutorial
Why use Tika
Text copied and pasted from some PDFs comes out as garbled characters because the embedded fonts use a custom encoding.
Apache Tika reads the real text layer directly, so it returns clean Unicode text without OCR.
1. Install Docker (Ubuntu)
sudo apt update
sudo apt install -y docker.io curl
sudo systemctl enable --now docker
2. Start Tika Server
sudo docker pull apache/tika:latest-full
sudo docker run -d --name tika \
  -p 127.0.0.1:9998:9998 \
  apache/tika:latest-full
Check it:
curl http://127.0.0.1:9998 | head
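The container can need a few seconds before it accepts requests, so a check run immediately after `docker run` may fail. A minimal sketch of a polling helper follows, assuming the server's `/version` endpoint is available in your Tika version. The name `wait_for_tika` and its defaults (30 tries, one second apart) are illustrative choices, not part of Tika.

```shell
#!/usr/bin/env bash
# wait_for_tika [BASE_URL] [TRIES]
# Poll BASE_URL/version once per second until it answers or TRIES runs out.
# Hypothetical helper; the defaults below are assumptions for this tutorial's setup.
wait_for_tika() {
  local url="${1:-http://127.0.0.1:9998}"
  local tries="${2:-30}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: no progress output, -f: treat HTTP errors as failure
    curl -sf "$url/version" >/dev/null && return 0
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

Typical use would be `wait_for_tika || { echo "Tika did not start"; exit 1; }` before sending the first PDF.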
3. Convert a Single PDF (example: Japanese)
curl -sT sample-jp.pdf http://127.0.0.1:9998/tika \
  -H "Accept: text/plain" > sample-jp.txt
sample-jp.txt will contain proper Japanese text if the PDF has a text layer.
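A PDF without a text layer typically yields an empty or whitespace-only result, so it can help to test the output before relying on it. The sketch below assumes that "no non-whitespace characters" is a good enough signal. `has_text` is a hypothetical helper name, not a Tika feature.

```shell
#!/usr/bin/env bash
# has_text FILE
# Succeeds (exit 0) if FILE contains at least one non-whitespace character,
# fails (exit 1) if it is empty or whitespace-only.
has_text() {
  grep -q '[^[:space:]]' "$1"
}
```

For example, `has_text sample-jp.txt || echo "sample-jp.pdf may be image-only"` flags a file that probably needs OCR.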
4. Batch Conversion Script
Save as tika_convert.sh in the folder with your PDFs:
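#!/usr/bin/env bash
# Convert every PDF in the current folder to text via a local Tika server.
TIKA_HOST="127.0.0.1"
TIKA_PORT="9998"

# Stop early if the server does not answer.
if ! curl -s "http://${TIKA_HOST}:${TIKA_PORT}" >/dev/null; then
  echo "Tika server not reachable"
  exit 1
fi

for f in *.pdf *.PDF; do
  [ -f "$f" ] || continue      # skip a glob pattern that matched nothing
  out="${f%.*}.txt"            # strip .pdf or .PDF, then add .txt
  echo "→ $f"
  curl -sT "$f" "http://${TIKA_HOST}:${TIKA_PORT}/tika" \
    -H "Accept: text/plain" > "$out"
  echo "  ✔ $out"
done

echo "Done."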
Run:
chmod +x tika_convert.sh
./tika_convert.sh
Each PDF will produce a matching .txt file.
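The output name is the input name with its extension replaced by .txt. As a small sketch of that mapping, one that treats `.pdf` and `.PDF` alike, the hypothetical helper `txt_name` strips everything after the last dot using shell parameter expansion:

```shell
#!/usr/bin/env bash
# txt_name FILE
# Print FILE with its last extension replaced by .txt.
# Hypothetical helper for illustration; ${1%.*} removes the shortest
# suffix that starts with a dot, whatever its case.
txt_name() {
  printf '%s\n' "${1%.*}.txt"
}

txt_name "report.pdf"    # prints report.txt
txt_name "SCAN.PDF"      # prints SCAN.txt
```

Note that a pattern such as `${f%.pdf}` is case-sensitive, so it would leave `SCAN.PDF` unchanged and produce `SCAN.PDF.txt`.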
Notes
- Works for any language if the PDF contains real text.
- No OCR or language packs needed.
- If the PDF is image-only, you must enable OCR and install the corresponding Tesseract language data.
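For the image-only case, Tika Server can be asked to run OCR per request through HTTP headers. The sketch below assumes the `X-Tika-PDFOcrStrategy` and `X-Tika-OCRLanguage` headers are honored by the server version in use, and that the Tesseract data for the requested language (here `jpn`) is present inside the container. Both are assumptions to verify against your image. `ocr_extract` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# ocr_extract LANG FILE [BASE_URL]
# Send FILE to Tika and request OCR-only extraction in the given
# Tesseract language code (e.g. jpn, eng). Text is written to stdout.
# Assumes the server accepts the X-Tika-* OCR headers shown below.
ocr_extract() {
  local lang="$1"
  local pdf="$2"
  local base="${3:-http://127.0.0.1:9998}"
  curl -sfT "$pdf" "$base/tika" \
    -H "Accept: text/plain" \
    -H "X-Tika-PDFOcrStrategy: ocr_only" \
    -H "X-Tika-OCRLanguage: $lang"
}
```

A call such as `ocr_extract jpn scan-jp.pdf > scan-jp.txt` would then mirror the single-file conversion in step 3. OCR is much slower than reading a text layer, so it is worth using only for files that come back empty.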