System Overview
Monitoring the state of the Balochi NLP Infrastructure.
Balochi Academy NLP Project
Building a complete NLP ecosystem for the Balochi language by fine-tuning established AI models on our own linguistic data. From Google Gemini to Whisper, from mBERT to OpenNMT, we take proven, state-of-the-art models and specialize them for Balochi, creating tools that Balochi Academy fully owns.
Resource Building: Months 1-5
Core NLP Tools: Months 6-9
Applied Modules: Months 10-14
Advanced AI: Months 15-18
Neural extraction active
Corpus growing
Dictionary growing
Vision (OCR)
Fine-tuning Google Vision and Gemini models to accurately read Balochi Nastaliq and Naskh scripts. We apply transfer learning from Urdu/Persian OCR models and combine it with a custom Balochi character dataset and a post-OCR correction engine powered by our growing lexicon.
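As a rough illustration of lexicon-backed post-OCR correction, the sketch below replaces an unrecognized token with its closest lexicon entry. It is not the project's actual engine: the function names, the similarity cutoff, and the placeholder lexicon entries are all assumptions, and fuzzy matching with Python's standard difflib stands in for whatever method the project uses.

```python
import difflib

def correct_token(token, lexicon, cutoff=0.75):
    """Return the token if it is in the lexicon, else its closest lexicon entry.

    `lexicon` is a list of known word forms (hypothetical here).
    `cutoff` is an assumed similarity threshold between 0 and 1.
    Tokens with no sufficiently close match are returned unchanged.
    """
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_line(line, lexicon, cutoff=0.75):
    """Apply correct_token to every whitespace-separated token in an OCR line."""
    return " ".join(correct_token(t, lexicon, cutoff) for t in line.split())

if __name__ == "__main__":
    # Placeholder entries, not real Balochi lexicon data.
    demo_lexicon = ["alpha", "beta", "gamma"]
    print(correct_line("alpba beta", demo_lexicon))
```

A real engine would likely weight candidates by corpus frequency and by known OCR confusions between similar glyphs; this sketch only shows where the lexicon enters the loop.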
Speech Studio
Fine-tuning Whisper (OpenAI) and Wav2Vec2 (Meta) for Balochi automatic speech recognition across Eastern, Western, and Southern dialects. Building a TTS prototype using Tacotron 2 / FastSpeech 2 with Balochi phoneme mapping from our lexicon.
Translation Hub
Fine-tuning transformer-based neural machine translation models on Balochi-English and Balochi-Urdu parallel corpora. Using transfer learning from multilingual models to overcome low-resource challenges, targeting 50,000-100,000 aligned sentence pairs.
Text Analytics
Building core NLP tools by fine-tuning multilingual BERT (mBERT) and XLM-RoBERTa for Balochi. Includes a custom POS tagger, morphological analyzer using finite-state transducers, named entity recognition, and a dependency parser following Universal Dependencies.
2,133 sentences indexed
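Universal Dependencies treebanks are exchanged in the CoNLL-U format, where each token line has ten tab-separated fields. The sketch below reads one sentence block into simple records. The class and function names are illustrative assumptions, and the token values in the usage example are placeholders, not Balochi data.

```python
from dataclasses import dataclass

@dataclass
class UDToken:
    # The ten CoNLL-U columns, kept as raw strings.
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str

def parse_conllu_sentence(block):
    """Parse one CoNLL-U sentence block into a list of UDToken records.

    Blank lines and comment lines (starting with '#') are skipped.
    Raises ValueError if a token line does not have exactly ten columns.
    """
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tokens.append(UDToken(*cols))
    return tokens

if __name__ == "__main__":
    sample = (
        "# sent_id = demo-1\n"
        "1\ttok1\ttok1\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
        "2\ttok2\ttok2\tVERB\t_\t_\t0\troot\t_\t_\n"
    )
    for tok in parse_conllu_sentence(sample):
        print(tok.id, tok.form, tok.upos, tok.head, tok.deprel)
```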
The Data Vault
The foundation of everything — a structured, dialect-balanced linguistic corpus targeting 50M+ tokens and 100,000+ lexicon entries. Houses digitized books, annotated datasets, speech recordings, and parallel corpora that power all model fine-tuning.
97 books available
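One way to track a dialect-balanced corpus against the 50M-token target is to count tokens per dialect and report each dialect's share. The sketch below assumes the three dialect labels named under Speech Studio (Eastern, Western, Southern), whitespace tokenization, and a hypothetical input of (dialect, text) pairs; none of these are confirmed details of the Data Vault.

```python
from collections import Counter

TARGET_TOKENS = 50_000_000  # corpus target stated in the text
DIALECTS = ("Eastern", "Western", "Southern")  # assumed label set

def dialect_report(sentences):
    """Summarize token counts per dialect.

    `sentences` is an iterable of (dialect, text) pairs (hypothetical schema).
    Tokens are counted by whitespace splitting, which is only an approximation.
    """
    counts = Counter({d: 0 for d in DIALECTS})
    for dialect, text in sentences:
        counts[dialect] += len(text.split())
    total = sum(counts.values())
    shares = {d: (n / total if total else 0.0) for d, n in counts.items()}
    return {
        "total_tokens": total,
        "progress": total / TARGET_TOKENS,
        "shares": shares,
    }

if __name__ == "__main__":
    # Placeholder text, not real corpus data.
    demo = [("Eastern", "w1 w2"), ("Western", "w3 w4"), ("Southern", "w5 w6")]
    print(dialect_report(demo))
```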
Extraction Pipeline
Automated AI-powered engine using Gemini 2.5 Flash to extract sentences, tokens, and dictionary entries from Balochi books. Performs dialect tagging, noise filtering, and POS analysis — building the training data that all future fine-tuned models depend on.
Ready to run
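The noise-filtering step can be pictured with a simple script-ratio check: keep a candidate sentence only if it has enough tokens and most of its letters fall in the Arabic-script Unicode ranges used for Nastaliq and Naskh text. This is a minimal sketch under assumed thresholds and hypothetical function names; it is independent of Gemini and is not the pipeline's actual filter.

```python
import re

# Arabic (U+0600-U+06FF) and Arabic Supplement (U+0750-U+077F) blocks.
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")

def script_ratio(text):
    """Fraction of alphabetic characters that are in the Arabic-script ranges."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if ARABIC_SCRIPT.match(ch))
    return hits / len(letters)

def is_noise(sentence, min_ratio=0.8, min_tokens=3):
    """Flag a sentence as noise if it is too short or mostly non-Arabic script.

    Both thresholds are assumed values chosen for illustration.
    """
    too_short = len(sentence.split()) < min_tokens
    return too_short or script_ratio(sentence) < min_ratio

def filter_sentences(sentences):
    """Return only the sentences that are not flagged as noise."""
    return [s for s in sentences if not is_noise(s)]
```

Running headers such as "Page 12 of 300" would be dropped by this check, while a multi-word Arabic-script sentence would pass. Mixed-script lines, such as citations, would need a more careful rule.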