Vision Lab (OCR)
Balochi script recognition engine — coming in Phase 3.
Phase 3: Applied Modules (Months 10-14)
The OCR module will be built after the core NLP tools are ready. It requires a trained Balochi character dataset and a working lexicon for post-OCR correction.
Google Vision + Tesseract OCR
Urdu / Persian OCR models
Custom Balochi character dataset + scanned manuscripts from Academy archives
Lexicon-powered post-OCR correction to fix misreads using dictionary lookup
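As a rough illustration of the recognition step in the stack above, the sketch below calls the Tesseract command-line tool with its existing Urdu and Persian language data as a stand-in until a Balochi model is trained. The function names, the default page segmentation mode, and the choice of the CLI over a Python wrapper are assumptions made for this example, not decisions the project has announced.

```python
import subprocess

# Assumption: Tesseract is installed with the Urdu ("urd") and Persian ("fas")
# language data, used here as a stand-in for a future Balochi model.
DEFAULT_LANGS = ("urd", "fas")


def build_tesseract_command(image_path, langs=DEFAULT_LANGS, psm=6):
    """Build the Tesseract CLI call.

    "stdout" as the output base makes Tesseract print the text instead of
    writing a file; --psm 6 treats the page as one uniform block of text.
    """
    return ["tesseract", image_path, "stdout", "-l", "+".join(langs), "--psm", str(psm)]


def run_ocr(image_path, langs=DEFAULT_LANGS):
    """Run Tesseract on one scanned page and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, langs),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout
```

Calling `run_ocr("page.png")` would return the raw text for a hypothetical scan named `page.png`; that raw text is what the lexicon-based correction step would then clean up.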
Nastaliq Script Recognition
OCR fine-tuned for the Balochi Nastaliq calligraphic script, handling the complex ligatures, dot placement, and diacritical marks characteristic of this writing style.
Naskh Script Recognition
Support for printed Naskh-style Balochi text found in modern publications, official documents, and digital media.
Post-OCR Correction
Automatic error correction using our growing Balochi lexicon database. Fixes common OCR misreads by cross-referencing against known Balochi words and morphological patterns.
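The dictionary-lookup idea can be sketched as follows. This is a minimal illustration only: it uses Python's standard `difflib` for fuzzy matching, a tiny placeholder word list in place of the real lexicon database, and an arbitrary similarity cutoff. It does not use morphological patterns, which would need the Phase 2 tools.

```python
import difflib

# Hypothetical placeholder lexicon; the real one would be loaded from the
# project's Balochi lexicon database.
LEXICON = ["بلوچی", "زبان", "کتاب", "مکران"]


def correct_token(token, lexicon, cutoff=0.7):
    """Return the token unchanged if it is a known word, otherwise the
    closest lexicon entry above the similarity cutoff, otherwise the
    original token."""
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line, lexicon, cutoff=0.7):
    """Correct each whitespace-separated token of an OCR output line."""
    return " ".join(correct_token(tok, lexicon, cutoff) for tok in line.split())


# A final letter misread through a dot confusion is pulled back to the known word.
print(correct_line("بلوچی کتات", LEXICON))
```

Tokens with no sufficiently similar lexicon entry are left untouched, so rare words and proper nouns are not forced onto the nearest dictionary word. A production version would likely weight confusions between visually similar letters more heavily than plain string similarity does.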
Prerequisites: Phase 1 corpus extraction must be complete, and the Phase 2 NLP tools (POS tagger, morphological analyzer) must be trained, before the OCR correction engine can work effectively.