Balochi NLP Ecosystem | Balochi Academy

Text Analytics

Core NLP tools being built for Balochi — Phase 2 of the 18-month roadmap.

Phase 2: Core NLP Tools (Months 6-9)

These tools will be built by fine-tuning pretrained multilingual models on the Balochi corpus being extracted in Phase 1. Each tool requires thousands of annotated sentences, and the extraction pipeline is currently building this foundation.

Total Tokens: 79,594 (50M target)

Corpus Sentences: 4,053 (training data for all models)

Lexicon Entries: 11,995 (100K target)

Books Processed: 97 in library
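To put the counters in context, the short sketch below computes progress toward the two stated targets. It assumes "50M" means 50,000,000 tokens and "100K" means 100,000 entries; the helper name `progress_pct` is illustrative and not part of the project.

```python
def progress_pct(current: int, target: int) -> float:
    """Return progress toward a target as a percentage."""
    return 100.0 * current / target

# Figures copied from the counters on this page.
# Assumption: 50M = 50,000,000 tokens, 100K = 100,000 lexicon entries.
token_progress = progress_pct(79_594, 50_000_000)
lexicon_progress = progress_pct(11_995, 100_000)

print(f"Tokens:  {token_progress:.2f}% of target")
print(f"Lexicon: {lexicon_progress:.1f}% of target")
```

On these figures the token corpus is well under 1% of its target, while the lexicon is a little under 12% of its target.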

POS Tagger

Phase 2 (Months 6-9)

Part-of-speech tagging for Balochi text — labeling every word as noun, verb, adjective, adverb, etc. Essential for understanding sentence structure and grammar patterns across dialects.

Base Model: mBERT / XLM-RoBERTa

Approach: Fine-tune multilingual BERT on manually annotated Balochi sentences from our corpus. Use transfer learning from Urdu/Persian POS models to bootstrap accuracy.

Data Needed: 10,000+ manually annotated sentences

Collecting training data
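One practical step when fine-tuning a subword model for tagging is mapping word-level POS labels onto subword pieces. The sketch below shows one common convention: the first piece of each word carries the label and the remaining pieces get an ignore value (-100, the default `ignore_index` of PyTorch's cross-entropy loss). The splitter `toy_subword_split` is a hypothetical stand-in for a real tokenizer, and the words are placeholders rather than Balochi data. This is an illustration under those assumptions, not the project's pipeline.

```python
from typing import Callable, Dict, List, Tuple

IGNORE = -100  # positions with this label id are skipped by the loss

def toy_subword_split(word: str, size: int = 3) -> List[str]:
    """Hypothetical stand-in for a real subword tokenizer: fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]

def align_labels(
    words: List[str],
    labels: List[str],
    label2id: Dict[str, int],
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[int]]:
    """Label the first piece of each word; mask the remaining pieces."""
    pieces: List[str] = []
    label_ids: List[int] = []
    for word, label in zip(words, labels):
        subwords = split(word)
        if not subwords:  # skip empty strings so pieces and labels stay aligned
            continue
        pieces.extend(subwords)
        label_ids.append(label2id[label])
        label_ids.extend([IGNORE] * (len(subwords) - 1))
    return pieces, label_ids

# Placeholder tokens, not real Balochi text.
label2id = {"NOUN": 0, "VERB": 1}
pieces, label_ids = align_labels(["word", "longerword"], ["NOUN", "VERB"], label2id)
print(pieces)
print(label_ids)
```

With a real tokenizer the split points differ, but the alignment logic stays the same: one supervised position per annotated word.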

Morphological Analyzer

Phase 2 (Months 6-9)

Breaks Balochi words into their root forms, prefixes, suffixes, and inflections. Critical for understanding how Balochi words change form based on tense, gender, number, and case.

Base Model: Finite-State Transducers (FST)

Approach: Build a rule-based FST from Balochi grammar rules, augmented with neural models trained on our lexicon entries. Handles agglutinative morphology across Southern, Eastern, and Western dialects.

Data Needed: Complete morphological paradigm tables + lexicon

Lexicon growing
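The idea of mapping a surface form to a stem plus grammatical tags can be sketched with a plain stem-and-suffix lookup. This is a simplified stand-in for what an FST would encode, not an FST implementation. The stems, suffixes, and tags below are invented placeholders and are NOT real Balochi; the actual rules would come from the paradigm tables and lexicon.

```python
from typing import Dict, List

# Hypothetical toy data, NOT real Balochi: placeholder stems, suffixes,
# and tags chosen only to show the mechanics.
STEMS: Dict[str, str] = {"tala": "N", "miro": "N"}
SUFFIXES: Dict[str, str] = {"": "", "an": "+PL", "ra": "+OBJ"}

def analyze(word: str) -> List[str]:
    """Return every stem+suffix split licensed by the toy lexicon and rules."""
    analyses: List[str] = []
    for cut in range(1, len(word) + 1):
        stem, suffix = word[:cut], word[cut:]
        if stem in STEMS and suffix in SUFFIXES:
            analyses.append(f"{stem}+{STEMS[stem]}{SUFFIXES[suffix]}")
    return analyses

for form in ["tala", "talaan", "mirora", "unknown"]:
    print(form, "->", analyze(form))
```

Returning a list rather than a single analysis keeps ambiguous forms visible, which is where a neural component could later rank the candidates.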

Named Entity Recognition

Phase 2 (Months 6-9)

Identifies and classifies named entities in Balochi text — people, places, organizations, dates, and cultural terms. Key for information extraction and knowledge graph construction.

Base Model: XLM-RoBERTa

Approach: Fine-tune XLM-R on a Balochi NER dataset created from our corpus. Annotation is semi-automatic: entity patterns from the lexicon pre-label the text, followed by manual validation.

Data Needed: 5,000+ annotated sentences with entity labels

Awaiting corpus milestone
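The semi-automatic annotation step could look roughly like the sketch below: a gazetteer of known entity names pre-labels tokens with BIO tags, preferring the longest match, and a human annotator then corrects the output. The gazetteer entries, the label set, and the function name `pretag_bio` are illustrative assumptions, and the tokens are Latin-script placeholders.

```python
from typing import Dict, List, Tuple

def pretag_bio(tokens: List[str], gazetteer: Dict[Tuple[str, ...], str]) -> List[str]:
    """Pre-label tokens with BIO tags using longest-match gazetteer lookup."""
    tags = ["O"] * len(tokens)
    max_len = max((len(entry) for entry in gazetteer), default=0)
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate span first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in gazetteer:
                match_len = n
                break
        if match_len:
            label = gazetteer[tuple(tokens[i:i + match_len])]
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + match_len):
                tags[j] = f"I-{label}"
            i += match_len
        else:
            i += 1
    return tags

# Illustrative gazetteer; real entries would come from the lexicon.
GAZETTEER = {("Balochi", "Academy"): "ORG", ("Quetta",): "LOC"}
tokens = ["Balochi", "Academy", "in", "Quetta"]
tags = pretag_bio(tokens, GAZETTEER)
print(list(zip(tokens, tags)))
```

Pre-tags of this kind are only a starting point: exact string matching misses inflected or misspelled names, which is why the manual validation pass matters.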

Dependency Parser

Phase 2 (Months 6-9)

Analyzes the grammatical structure of Balochi sentences — identifying subject, object, verb relationships and modifier chains. Follows Universal Dependencies (UD) framework for cross-lingual compatibility.

Base Model: Universal Dependencies + mBERT

Approach: Create a Balochi UD treebank from annotated sentences. Fine-tune a neural dependency parser using the treebank, with transfer learning from related languages (Persian, Urdu).

Data Needed: 10,000+ parsed sentences in UD format

Framework selection done
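UD treebanks are stored in the CoNLL-U format, with ten tab-separated columns per token. The sketch below reads one sentence and runs two basic sanity checks that are useful during annotation: exactly one root, and every head pointing at a valid token. The sentence uses placeholder tokens (w1, w2, w3) rather than Balochi text, and `parse_conllu_sentence` and `check_tree` are hypothetical helper names.

```python
from typing import Dict, List

FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

def parse_conllu_sentence(block: str) -> List[Dict[str, str]]:
    """Parse one CoNLL-U sentence into a list of token dictionaries."""
    rows: List[Dict[str, str]] = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():  # skip multiword ranges (1-2) and empty nodes (1.1)
            continue
        rows.append(dict(zip(FIELDS, cols)))
    return rows

def check_tree(rows: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems: List[str] = []
    roots = [r for r in rows if r["head"] == "0"]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    for r in rows:
        if not r["head"].isdigit():
            problems.append(f"token {r['id']}: head is not annotated")
        elif int(r["head"]) > len(rows) or r["head"] == r["id"]:
            problems.append(f"token {r['id']}: invalid head {r['head']}")
    return problems

# Placeholder sentence: w3 is the root verb, w1 its subject, w2 its object.
SAMPLE = "\n".join(
    ["# text = w1 w2 w3"]
    + ["\t".join(cols) for cols in [
        ["1", "w1", "_", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
        ["2", "w2", "_", "NOUN", "_", "_", "3", "obj", "_", "_"],
        ["3", "w3", "_", "VERB", "_", "_", "0", "root", "_", "_"],
    ]]
)

rows = parse_conllu_sentence(SAMPLE)
print(check_tree(rows))
```

These checks do not detect cycles or validate relation labels; the official UD validation tooling covers those cases.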

What comes next: Once the corpus reaches critical mass (10K+ annotated sentences), model fine-tuning begins. Each tool will get a dedicated training pipeline, evaluation metrics, and a live demo on this page.