Translation Hub

Neural machine translation for Balochi — coming in Phase 4.

Phase 4: Advanced AI (Months 15-18)

Coming Soon

Machine translation requires aligned parallel corpora: pairs of sentences that say the same thing in both languages. The extraction pipeline is currently building the Balochi side; alignment with English and Urdu translations will follow.
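As a rough illustration of what "aligned" means in practice, the sketch below stores each pair so that line i of a source file corresponds to line i of a target file, a layout commonly consumed by NMT toolkits. The `SentencePair` record, the function names, and the file paths are assumptions made for this example, not part of the project's actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned sentence pair (hypothetical record, for illustration only)."""
    source: str  # Balochi sentence
    target: str  # English or Urdu translation


def write_line_aligned(pairs, src_path, tgt_path):
    """Write pairs so that line i of src_path aligns with line i of tgt_path."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for pair in pairs:
            # Collapse internal whitespace and newlines so each sentence
            # occupies exactly one line in its file.
            src.write(" ".join(pair.source.split()) + "\n")
            tgt.write(" ".join(pair.target.split()) + "\n")


def read_line_aligned(src_path, tgt_path):
    """Read two line-aligned files back into pairs; fail if lengths differ."""
    with open(src_path, encoding="utf-8") as src:
        src_lines = [line.rstrip("\n") for line in src]
    with open(tgt_path, encoding="utf-8") as tgt:
        tgt_lines = [line.rstrip("\n") for line in tgt]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files are not aligned")
    return [SentencePair(s, t) for s, t in zip(src_lines, tgt_lines)]
```

The length check in `read_line_aligned` is the cheapest alignment test available: if the two files differ in line count, at least one pair has drifted and the corpus should not be used for training as-is.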

Fine-Tuning Plan
How we'll build the Balochi translation engine.
Base Model: OpenNMT Transformer

Architecture: Encoder-Decoder Transformer

Transfer Learning: Bootstrap from multilingual models trained on related languages (Persian, Urdu, Hindi) to overcome low-resource challenges

Training Data: 50,000-100,000 aligned Balochi-English and Balochi-Urdu sentence pairs

Balochi-English
Phase 4
Primary translation direction for international accessibility and academic research on Balochi language and culture.
Target Pairs: 50,000-100,000 aligned sentences
Parallel corpus being built
Balochi-Urdu
Phase 4
Essential for regional communication and bridging Balochi speakers with the national language of Pakistan.
Target Pairs: 50,000-100,000 aligned sentences
Parallel corpus being built
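For a sense of how progress toward the 50,000-100,000 pair target could be tracked for each language pair, here is a small sketch. The `progress` helper and the example count passed to it are hypothetical; no real corpus figures are implied.

```python
TARGET_MIN = 50_000   # lower bound of the stated target range
TARGET_MAX = 100_000  # upper bound of the stated target range


def progress(validated_pairs: int) -> dict:
    """Summarize progress toward the target range (hypothetical helper)."""
    if validated_pairs < 0:
        raise ValueError("validated_pairs must be non-negative")
    return {
        # Fractions are capped at 1.0 once a bound has been reached.
        "to_minimum": min(validated_pairs / TARGET_MIN, 1.0),
        "to_maximum": min(validated_pairs / TARGET_MAX, 1.0),
        "remaining_to_minimum": max(TARGET_MIN - validated_pairs, 0),
    }


# Example with a made-up count of validated pairs.
report = progress(12_500)
print(report)
```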
Parallel Corpus Pipeline
1. Extract Balochi sentences
Pipeline extracts clean Balochi text from Academy books (in progress)

2. Generate translations
Use Gemini to create initial English/Urdu translations of extracted sentences

3. Human validation
Native speakers review and correct machine-generated translations

4. Train model
Fine-tune OpenNMT transformer on validated parallel corpus
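The four steps above can be sketched as a simple status flow in which only human-validated pairs ever reach training. This is a minimal illustration under stated assumptions: `CorpusEntry`, `Status`, and the helper functions are hypothetical names, and `translate` is a placeholder callable standing in for the machine-translation step rather than a real API call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Optional, Tuple


class Status(Enum):
    EXTRACTED = "extracted"          # step 1: clean Balochi sentence only
    MACHINE_DRAFT = "machine_draft"  # step 2: draft translation attached
    VALIDATED = "validated"          # step 3: accepted or corrected by a reviewer


@dataclass
class CorpusEntry:
    balochi: str
    translation: Optional[str] = None
    status: Status = Status.EXTRACTED


def add_draft(entry: CorpusEntry, translate: Callable[[str], str]) -> None:
    """Step 2: attach a machine draft. `translate` is a stand-in callable."""
    entry.translation = translate(entry.balochi)
    entry.status = Status.MACHINE_DRAFT


def validate(entry: CorpusEntry, corrected: Optional[str] = None) -> None:
    """Step 3: a reviewer accepts the draft as-is or supplies a correction."""
    if entry.status is not Status.MACHINE_DRAFT:
        raise ValueError("only machine drafts can be validated")
    if corrected is not None:
        entry.translation = corrected
    entry.status = Status.VALIDATED


def training_pairs(entries: Iterable[CorpusEntry]) -> List[Tuple[str, str]]:
    """Step 4 input: keep only pairs that passed human validation."""
    return [
        (e.balochi, e.translation)
        for e in entries
        if e.status is Status.VALIDATED and e.translation is not None
    ]
```

Keeping the status on each entry makes the validation gate explicit: an unreviewed machine draft cannot slip into the training set, because `training_pairs` filters on `Status.VALIDATED` alone.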

Prerequisites: Phase 1 corpus extraction must complete, and the POS tagger from Phase 2 is needed for better translation alignment. Parallel corpus creation begins in Phase 3.