Translation Hub
Neural machine translation for Balochi — coming in Phase 4.
Phase 4: Advanced AI (Months 15-18)
Machine translation requires an aligned parallel corpus: sentence pairs in which each Balochi sentence is matched with its translation in the target language. The extraction pipeline is currently building the Balochi side; alignment with English and Urdu translations will follow.
OpenNMT Transformer
Encoder-Decoder Transformer
Bootstrap from multilingual models trained on related languages (Persian, Urdu, Hindi) to compensate for the scarcity of Balochi training data
50,000-100,000 aligned Balochi-English and Balochi-Urdu sentence pairs
Extract Balochi sentences
Pipeline extracts clean Balochi text from Academy books (in progress)
Generate translations
Use Gemini to generate draft English and Urdu translations of the extracted sentences
Human validation
Native speakers review and correct machine-generated translations
Train model
Fine-tune the OpenNMT Transformer on the validated parallel corpus
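The hand-off between human validation and training can be sketched in a few lines of Python. This is a minimal sketch and not the project's actual pipeline: the SentencePair record, the export_validated helper, the output file names, and the placeholder sentences are all hypothetical. It assumes the training step reads two line-aligned plain-text files (line i of the source file is the translation partner of line i of the target file), which is the kind of input NMT toolkits such as OpenNMT typically consume.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class SentencePair:
    """Hypothetical record for one sentence moving through the pipeline."""
    balochi: str             # sentence extracted from the source books
    translation: str         # machine draft, possibly corrected by a reviewer
    target_lang: str         # "en" (English) or "ur" (Urdu)
    validated: bool = False  # set to True once a native speaker approves the pair


def export_validated(pairs, target_lang, out_dir):
    """Write validated pairs for one target language as two line-aligned files.

    Returns the number of pairs written. File names are illustrative only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Keep only reviewed, non-empty pairs for the requested language.
    kept = [
        p for p in pairs
        if p.validated
        and p.target_lang == target_lang
        and p.balochi.strip()
        and p.translation.strip()
    ]

    src_path = out_dir / f"train.bal-{target_lang}.bal"
    tgt_path = out_dir / f"train.bal-{target_lang}.{target_lang}"
    with src_path.open("w", encoding="utf-8") as src, \
         tgt_path.open("w", encoding="utf-8") as tgt:
        for p in kept:
            # Collapse internal whitespace and newlines so each sentence
            # occupies exactly one line and the two files stay aligned.
            src.write(" ".join(p.balochi.split()) + "\n")
            tgt.write(" ".join(p.translation.split()) + "\n")
    return len(kept)


if __name__ == "__main__":
    # Placeholder strings stand in for real sentences.
    pairs = [
        SentencePair("<balochi sentence 1>", "<english draft 1>", "en", validated=True),
        SentencePair("<balochi sentence 2>", "<english draft 2>", "en"),  # not yet reviewed
        SentencePair("<balochi sentence 3>", "<urdu draft 3>", "ur", validated=True),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        written = export_validated(pairs, "en", tmp)
        print(f"wrote {written} validated Balochi-English pair(s)")
```

Keeping the validated flag on each record means unreviewed machine drafts never reach the training files, so the model is only fine-tuned on pairs a native speaker has approved.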
Prerequisites: Phase 1 corpus extraction must be complete, and the POS tagger from Phase 2 is needed to improve translation alignment. Parallel corpus creation begins in Phase 3.