Balochi NLP Ecosystem | Balochi Academy

Vision Lab (OCR)

Balochi script recognition engine — coming in Phase 3.

Phase 3: Applied Modules (Months 10-14)

Coming Soon

The OCR module will be built after the core NLP tools are ready. It requires a labeled Balochi character dataset for training and a working lexicon for post-OCR correction.

Fine-Tuning Plan

How we plan to build Balochi OCR by fine-tuning existing OCR models.

Base Models

Google Vision + Tesseract OCR

Transfer Learning From

Urdu / Persian OCR models

Training Data

Custom Balochi character dataset + scanned manuscripts from Academy archives

Correction Engine

Lexicon-powered post-OCR correction to fix misreads using dictionary lookup

Nastaliq Script Recognition

OCR fine-tuned specifically for Balochi written in the Nastaliq calligraphic style, handling the complex ligatures, dots, and diacritical marks characteristic of this writing style.

Naskh Script Recognition

Support for printed Naskh-style Balochi text found in modern publications, official documents, and digital media.

Post-OCR Correction

Automatic error correction using our growing Balochi lexicon database. Fixes common OCR misreads by cross-referencing against known Balochi words and morphological patterns.
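As a rough illustration of the dictionary-lookup idea, the sketch below replaces each out-of-lexicon token with the closest lexicon entry within a small edit distance. It is a minimal sketch under stated assumptions, not the Academy's correction engine: the function names, the `max_distance` threshold, and the toy lexicon entries are all hypothetical, and it ignores the morphological patterns a real system would also consult.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def correct_token(token: str, lexicon: Iterable[str], max_distance: int = 1) -> str:
    """Return the nearest lexicon word within max_distance, else the token unchanged."""
    words = set(lexicon)
    if token in words:
        return token
    best: Optional[str] = None
    best_distance = max_distance + 1
    # Sorting makes tie-breaking deterministic (first word in sorted order wins).
    for word in sorted(words):
        distance = edit_distance(token, word)
        if distance < best_distance:
            best, best_distance = word, distance
    return best if best is not None else token


def correct_text(text: str, lexicon: Iterable[str], max_distance: int = 1) -> str:
    """Correct whitespace-separated tokens one by one."""
    return " ".join(correct_token(t, lexicon, max_distance) for t in text.split())


# Toy lexicon: placeholder entries for illustration only,
# not drawn from the Academy's lexicon database.
LEXICON = {"کتاب", "آپ"}

# "کناب" simulates a dot misread (one dot read where two were written).
print(correct_text("کناب آپ", LEXICON))
```

A linear scan over the lexicon is fine for a demonstration but would be slow for a large dictionary; an index over candidate words, together with frequency and morphological information to rank ties, would be a natural next step.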

Prerequisites: Phase 1 corpus extraction must be complete, and the Phase 2 NLP tools (POS tagger, morphological analyzer) must be trained before the OCR correction engine can work effectively.