Text Analytics

Core NLP tools being built for Balochi — Phase 2 of the 18-month roadmap.

Phase 2: Core NLP Tools (Months 6-9)

Most of these tools will be built by fine-tuning pretrained multilingual models (mBERT, XLM-RoBERTa) on the Balochi corpus being extracted in Phase 1; the morphological analyzer is primarily rule-based. Each tool requires thousands of annotated sentences, and the extraction pipeline is currently building that foundation.

Total Tokens: 79,594 (50M target)

Corpus Sentences: 4,053 (training data for all models)

Lexicon Entries: 11,995 (100K target)

Books Processed: 1 (97 in library)
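
To put the counters above in proportion, the following minimal sketch computes progress toward each stated target. The figures are copied from this page; treating the 97 books in the library as the goal for "Books Processed" is an assumption, and the names `PROGRESS` and `percent_complete` are illustrative only.

```python
# Counters and targets copied from the dashboard figures on this page.
# Assumption: the 97 books in the library are treated as the goal for "books".
PROGRESS = {
    "tokens": (79_594, 50_000_000),
    "lexicon_entries": (11_995, 100_000),
    "books": (1, 97),
}


def percent_complete(current, target):
    """Return progress toward a target as a percentage."""
    if target <= 0:
        raise ValueError("target must be positive")
    return 100.0 * current / target


if __name__ == "__main__":
    for name, (current, target) in PROGRESS.items():
        print(f"{name}: {percent_complete(current, target):.2f}% of target")
```

By this measure the token count is well under 1% of its target, while the lexicon is roughly an eighth of the way to 100K entries.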

POS Tagger
Phase 2 (Months 6-9)
Part-of-speech tagging for Balochi text — labeling every word as noun, verb, adjective, adverb, etc. Essential for understanding sentence structure and grammar patterns across dialects.
Base Model: mBERT / XLM-RoBERTa
Approach: Fine-tune multilingual BERT on manually annotated Balochi sentences from our corpus. Use transfer learning from Urdu and Persian POS models to bootstrap accuracy.
Data Needed: 10,000+ manually annotated sentences
Status: Collecting training data
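
One practical step when fine-tuning mBERT or XLM-RoBERTa for tagging is aligning word-level POS labels with subword pieces, since annotators label words but the model sees pieces. The sketch below shows one common convention (label the first piece, mask the rest), not necessarily the one this project will adopt. The function name, the toy pieces, and the label ids are hypothetical; with Hugging Face fast tokenizers the word-to-piece grouping would typically come from `word_ids()`.

```python
# -100 is the default ignore_index of PyTorch's CrossEntropyLoss, so masked
# positions do not contribute to the loss.
IGNORE_INDEX = -100


def align_labels(pieces_per_word, word_label_ids):
    """Expand one label per word into one label per subword piece.

    The first piece of each word keeps the word's label; continuation
    pieces are masked with IGNORE_INDEX.
    """
    if len(pieces_per_word) != len(word_label_ids):
        raise ValueError("need exactly one label per word")
    aligned = []
    for pieces, label in zip(pieces_per_word, word_label_ids):
        if not pieces:
            raise ValueError("every word must have at least one piece")
        aligned.append(label)
        aligned.extend([IGNORE_INDEX] * (len(pieces) - 1))
    return aligned


if __name__ == "__main__":
    # Toy placeholder pieces, not real Balochi tokenizer output.
    pieces = [["ab", "##c"], ["d"], ["e", "##f", "##g"]]
    labels = [1, 2, 3]  # hypothetical ids, e.g. 1=NOUN, 2=VERB, 3=ADJ
    print(align_labels(pieces, labels))
```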
Morphological Analyzer
Phase 2 (Months 6-9)
Breaks Balochi words into their root forms, prefixes, suffixes, and inflections. Critical for understanding how Balochi words change form based on tense, gender, number, and case.
Base Model: Finite-State Transducers (FST)
Approach: Build a rule-based FST from Balochi grammar rules, augmented with neural models trained on our lexicon entries. The analyzer is meant to handle agglutinative morphology across the Southern, Eastern, and Western dialects.
Data Needed: Complete morphological paradigm tables + lexicon
Status: Lexicon growing
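
The rule-plus-lexicon idea can be illustrated with a plain suffix-stripping analyzer. This is a simplified stand-in for a real finite-state transducer, and the root, suffixes, and tags below are invented toy data, not Balochi morphology; a production analyzer would be compiled from the project's actual paradigm tables.

```python
def analyze(word, lexicon, suffix_rules):
    """Return every (root, tags) analysis found by stripping known suffixes.

    lexicon      -- set of known roots
    suffix_rules -- dict mapping a suffix string to a grammatical tag
    Tags are listed from the innermost suffix to the outermost.
    """
    analyses = []
    if word in lexicon:
        analyses.append((word, []))
    for suffix, tag in suffix_rules.items():
        # Require a non-empty remainder so the recursion always terminates.
        if word.endswith(suffix) and len(word) > len(suffix):
            remainder = word[: -len(suffix)]
            for root, tags in analyze(remainder, lexicon, suffix_rules):
                analyses.append((root, tags + [tag]))
    return analyses


if __name__ == "__main__":
    # Invented toy language: one root and two stackable suffixes.
    toy_lexicon = {"tala"}
    toy_suffixes = {"ki": "PL", "mo": "OBL"}
    print(analyze("talakimo", toy_lexicon, toy_suffixes))
    print(analyze("unknown", toy_lexicon, toy_suffixes))
```

Returning a list rather than a single answer keeps ambiguous words visible, which is where a neural component could later rank the candidates.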
Named Entity Recognition
Phase 2 (Months 6-9)
Identifies and classifies named entities in Balochi text — people, places, organizations, dates, and cultural terms. Key for information extraction and knowledge graph construction.
Base Model: XLM-RoBERTa
Approach: Fine-tune XLM-R on a Balochi NER dataset created from our corpus. Annotation is semi-automatic: entity patterns from the lexicon produce candidate labels, which are then validated manually.
Data Needed: 5,000+ annotated sentences with entity labels
Status: Awaiting corpus milestone
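
The semi-automatic annotation step could look roughly like the gazetteer matcher below, which pre-labels tokens in BIO format for a human to review. It is a sketch under stated assumptions: the function name, the PER/LOC label set, and the Latin-script example sentence are illustrative, and the project's real entity patterns may be richer than exact token matches.

```python
def bio_tag(tokens, gazetteer):
    """Pre-annotate tokens with BIO tags using longest-match lookup.

    gazetteer maps a tuple of tokens to an entity type, e.g.
    ("Gwadar",) -> "LOC". Unmatched tokens are tagged "O".
    """
    tags = ["O"] * len(tokens)
    max_len = max((len(entry) for entry in gazetteer), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-token names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    # Illustrative entries; a real gazetteer would come from the lexicon.
    gazetteer = {("Gul", "Khan", "Nasir"): "PER", ("Gwadar",): "LOC"}
    sentence = ["Gul", "Khan", "Nasir", "visited", "Gwadar"]
    print(list(zip(sentence, bio_tag(sentence, gazetteer))))
```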
Dependency Parser
Phase 2 (Months 6-9)
Analyzes the grammatical structure of Balochi sentences — identifying subject, object, verb relationships and modifier chains. Follows Universal Dependencies (UD) framework for cross-lingual compatibility.
Base Model: Universal Dependencies + mBERT
Approach: Create a Balochi UD treebank from annotated sentences, then fine-tune a neural dependency parser on it, with transfer learning from related languages (Persian, Urdu).
Data Needed: 10,000+ parsed sentences in UD format
Status: Framework selection done
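
UD treebanks are stored in the CoNLL-U format: one token per line with ten tab-separated fields. As a rough illustration of what "parsed sentences in UD format" means in practice, this sketch reads one sentence and checks that it has exactly one root. The sample is an English placeholder, not Balochi data, and the helper names are hypothetical.

```python
# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu_sentence(block):
    """Parse one CoNLL-U sentence block into a list of token dicts."""
    rows = []
    for line in block.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        row = dict(zip(CONLLU_FIELDS, cols))
        row["id"] = int(row["id"])
        row["head"] = int(row["head"])
        rows.append(row)
    return rows


def has_single_root(rows):
    """A well-formed UD tree has exactly one token whose head is 0."""
    return sum(1 for row in rows if row["head"] == 0) == 1


# English placeholder sentence, built with explicit tabs.
SAMPLE = "# text = She reads books\n" + "\n".join(
    "\t".join(cols) for cols in [
        ("1", "She", "she", "PRON", "_", "_", "2", "nsubj", "_", "_"),
        ("2", "reads", "read", "VERB", "_", "_", "0", "root", "_", "_"),
        ("3", "books", "book", "NOUN", "_", "_", "2", "obj", "_", "_"),
    ]
)

if __name__ == "__main__":
    tokens = parse_conllu_sentence(SAMPLE)
    for tok in tokens:
        print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"])
    print("single root:", has_single_root(tokens))
```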

What comes next: Once the corpus reaches critical mass (10K+ annotated sentences), model fine-tuning begins. Each tool will get a dedicated training pipeline, evaluation metrics, and a live demo on this page.
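
The data thresholds listed for each tool can be read as a simple gate on when fine-tuning may start. The sketch below encodes that reading. It simplifies by using a single annotated-sentence count, whereas each tool actually needs its own annotation type (POS tags, entity labels, UD parses); the names `DATA_NEEDED` and `tools_ready` are illustrative.

```python
# Minimum annotated sentences per tool, copied from the "Data Needed" rows.
# The morphological analyzer is omitted because its requirement is paradigm
# tables plus the lexicon rather than a sentence count.
DATA_NEEDED = {
    "POS Tagger": 10_000,
    "Named Entity Recognition": 5_000,
    "Dependency Parser": 10_000,
}


def tools_ready(annotated_sentences, thresholds=DATA_NEEDED):
    """Return the tools whose data threshold is met, sorted by name."""
    return sorted(
        name for name, needed in thresholds.items()
        if annotated_sentences >= needed
    )


if __name__ == "__main__":
    for count in (4_053, 5_000, 10_000):
        print(count, tools_ready(count))
```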