System Overview

Monitoring the state of the Balochi NLP Infrastructure.

Balochi Academy NLP Project

Building a complete NLP ecosystem for the Balochi language by leveraging the world's best AI models and fine-tuning them with our own linguistic data. From Google Gemini to Whisper, from mBERT to OpenNMT — we take proven, state-of-the-art models and specialize them for Balochi, creating tools that Balochi Academy fully owns.

Phase 1

Resource Building

Months 1-5

Phase 2

Core NLP Tools

Months 6-9

Phase 3

Applied Modules

Months 10-14

Phase 4

Advanced AI

Months 15-18

Total Tokens Processed
41,810

Neural extraction active

Progress to Target: 0%
Corpus Sentences
2,133

Corpus growing

Progress to Target: 0%
Lexicon Entries
5,307

Dictionary growing

Progress to Target: 5%
NLP Toolkit — Fine-Tuned for Balochi
Each module takes a world-class AI model and fine-tunes it specifically for the Balochi language using our growing dataset of 41,810 tokens.

Vision (OCR)

Google Vision + Gemini · Phase 3

Fine-tuning Google Vision and Gemini models to accurately read Balochi Nastaliq and Naskh scripts. Transfer learning from Urdu/Persian OCR models, combined with a custom Balochi character dataset and post-OCR correction engine powered by our growing lexicon.

Nastaliq Recognition · Naskh Recognition · Post-OCR Correction
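The post-OCR correction step described above can be sketched with the standard library alone: snap each OCR token to its closest lexicon entry when the match is close enough. The mini-lexicon and the similarity cutoff below are hypothetical stand-ins, not the project's actual dictionary or tuning.

```python
import difflib

# Hypothetical romanized mini-lexicon standing in for the project's
# 5,307-entry dictionary; the real pipeline would load the full lexicon.
LEXICON = ["balochi", "gwadar", "zuban", "ketab", "washsh"]

def correct_token(token: str, cutoff: float = 0.75) -> str:
    """Replace an OCR token with its closest lexicon entry, if close enough."""
    if token in LEXICON:
        return token
    matches = difflib.get_close_matches(token, LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else token  # leave unknown tokens untouched

def correct_line(line: str) -> str:
    """Apply token-level correction across a whitespace-tokenized OCR line."""
    return " ".join(correct_token(t) for t in line.split())
```

A real engine would likely use a weighted edit distance informed by confusable Nastaliq/Naskh glyph pairs rather than `difflib`'s generic ratio, but the lookup-and-snap structure is the same.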

Speech Studio

Whisper + Wav2Vec2 · Phase 3-4

Fine-tuning Whisper (OpenAI) and Wav2Vec2 (Meta) for Balochi automatic speech recognition across Eastern, Western, and Southern dialects. Building a TTS prototype using Tacotron 2 / FastSpeech 2 with Balochi phoneme mapping from our lexicon.

ASR (Whisper Fine-tune) · TTS Prototype · Dialect-Aware
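The phoneme-mapping step for the TTS prototype amounts to a per-dialect grapheme-to-phoneme lookup with greedy longest-match. The tables below are illustrative placeholders (romanized input, invented dialect entries), not actual Balochi phonology; the real maps would be derived from the lexicon.

```python
# Hypothetical per-dialect G2P tables; real entries come from the lexicon.
PHONEME_MAPS = {
    "western": {"sh": "ʃ", "ch": "tʃ", "a": "a"},
    "eastern": {"sh": "ʂ", "ch": "tʃ", "a": "a"},
}

def to_phonemes(word: str, dialect: str = "western") -> list[str]:
    """Greedy longest-match G2P: try two-character graphemes before single ones."""
    table = PHONEME_MAPS[dialect]
    phonemes, i = [], 0
    while i < len(word):
        if word[i:i + 2] in table:       # digraph like "sh"
            phonemes.append(table[word[i:i + 2]])
            i += 2
        elif word[i] in table:           # single grapheme
            phonemes.append(table[word[i]])
            i += 1
        else:                            # pass unknown characters through
            phonemes.append(word[i])
            i += 1
    return phonemes
```

Keeping the table keyed by dialect is what makes the pipeline "dialect-aware": the same word can yield different phoneme sequences for Eastern versus Western input.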

Translation Hub

OpenNMT Transformer · Phase 4

Fine-tuning transformer-based neural machine translation models on Balochi-English and Balochi-Urdu parallel corpora. Using transfer learning from multilingual models to overcome low-resource challenges, targeting 50,000-100,000 aligned sentence pairs.

Balochi ↔ English MT · Balochi ↔ Urdu MT · Parallel Corpus
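Before the 50,000-100,000 aligned pairs can feed NMT training, misaligned or degenerate pairs need filtering. A minimal sketch of a standard length-based sanity filter (thresholds here are assumed, not the project's actual settings):

```python
def keep_pair(src: str, tgt: str,
              max_ratio: float = 2.0, max_len: int = 100) -> bool:
    """Basic sanity filter for an aligned sentence pair.

    Drops empty or overlong sides, and pairs whose token-count ratio
    suggests a misalignment rather than a translation.
    """
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0 or s > max_len or t > max_len:
        return False
    return max(s, t) / min(s, t) <= max_ratio

def filter_corpus(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only pairs that pass the sanity filter."""
    return [(s, t) for s, t in pairs if keep_pair(s, t)]
```

Production pipelines layer further checks on top (language ID, deduplication, alignment scores), but length-ratio filtering alone removes much of the noise that cripples low-resource NMT.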

Text Analytics

mBERT + XLM-R · Phase 2

Building core NLP tools by fine-tuning multilingual BERT (mBERT) and XLM-RoBERTa for Balochi. Includes a custom POS tagger, morphological analyzer using finite-state transducers, named entity recognition, and a dependency parser following Universal Dependencies.

POS Tagger · NER System · Morphological Analyzer
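One concrete step in fine-tuning mBERT or XLM-R for POS tagging is aligning word-level labels to subword tokens: only a word's first subword keeps its label, and the rest are masked with -100 so the loss ignores them. A self-contained sketch of that alignment (the `word_ids` list mimics what a Hugging Face fast tokenizer reports; the label IDs are hypothetical):

```python
def align_labels(word_labels: list[int], word_ids: list) -> list[int]:
    """Map word-level POS label IDs onto subword positions.

    word_ids holds, per subword token, the index of the source word
    (None for special tokens like [CLS]/[SEP] or padding).
    """
    labels, prev = [], None
    for wid in word_ids:
        if wid is None:
            labels.append(-100)              # special token: ignored by loss
        elif wid != prev:
            labels.append(word_labels[wid])  # first subword keeps the label
        else:
            labels.append(-100)              # continuation subword: masked
        prev = wid
    return labels
```

This masking convention is what lets a subword model be trained and evaluated at the word level, which matters for a morphologically rich language where one Balochi word often splits into several subwords.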

2,133 sentences indexed

The Data Vault

Phase 1

The foundation of everything — a structured, dialect-balanced linguistic corpus targeting 50M+ tokens and 100,000+ lexicon entries. Houses digitized books, annotated datasets, speech recordings, and parallel corpora that power all model fine-tuning.

50M+ Token Corpus · 100k+ Lexicon · Speech Data
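The "Progress to Target" percentages on the dashboard follow directly from these targets. A one-line sketch of the computation, using the figures shown above:

```python
def progress(current: int, target: int) -> int:
    """Whole-percent progress toward a resource target, capped at 100."""
    return min(100, int(current / target * 100))

# Dashboard figures against the roadmap targets:
tokens_pct = progress(41_810, 50_000_000)   # token corpus → 0
lexicon_pct = progress(5_307, 100_000)      # lexicon entries → 5
```

Flooring rather than rounding explains why the token counter still reads 0% despite 41,810 tokens processed: against a 50M target, that is under a tenth of a percent.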

97 books available

Extraction Pipeline

Gemini 2.5 Flash · Phase 1

Automated AI-powered engine using Gemini 2.5 Flash to extract sentences, tokens, and dictionary entries from Balochi books. Performs dialect tagging, noise filtering, and POS analysis — building the training data that all future fine-tuned models depend on.

Batch Extraction · Dialect Tagging · Lexicon Building
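The noise-filtering stage of the extraction pipeline can be sketched as a post-processing pass over the model's structured output: keep only records whose text actually contains Arabic-script letters. The JSON shape and field names below are assumptions for illustration, not the pipeline's real schema.

```python
import json
import re

# Matches letters in the Arabic Unicode block used by Balochi script.
ARABIC_LETTER = re.compile(r"[\u0600-\u06FF]")

def clean_records(raw: str, min_letters: int = 2) -> list[dict]:
    """Drop extracted records with too few Arabic-script letters (noise)."""
    records = json.loads(raw)["sentences"]
    return [r for r in records
            if len(ARABIC_LETTER.findall(r["text"])) >= min_letters]

# Hypothetical extractor output: one real sentence, one noise record.
raw = json.dumps({"sentences": [
    {"text": "بلوچی زبان", "dialect": "western"},
    {"text": "!!", "dialect": "western"},
]})
kept = clean_records(raw)  # only the first record survives
```

A production filter would add script-mixing checks and length heuristics, but this shows where noise filtering sits: between the model's batch output and the corpus tables that downstream fine-tuning reads.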

Ready to run