Balochi NLP Ecosystem | Balochi Academy

Translation Hub

Neural machine translation for Balochi — coming in Phase 4.

Phase 4: Advanced AI (Months 15-18)

Coming Soon

Machine translation requires aligned parallel corpora: pairs of sentences that say the same thing in two languages. The extraction pipeline is currently building the Balochi side; alignment with English and Urdu translations will follow.
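To make "aligned" concrete, here is a minimal sketch of one way sentence pairs could be represented. It assumes the common plain-text convention in which line i of the Balochi file corresponds to line i of the translation file. The names `SentencePair` and `align_lines` are hypothetical and not part of the Academy's pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SentencePair:
    source: str       # Balochi sentence
    target: str       # English or Urdu translation
    target_lang: str  # "en" or "ur" (assumed language codes)


def align_lines(source_lines: List[str], target_lines: List[str],
                target_lang: str) -> List[SentencePair]:
    """Pair line i of the source with line i of the target.

    Raises ValueError if the two sides have different lengths, because a
    length mismatch means the files are no longer aligned line by line.
    Pairs with an empty side are dropped.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"misaligned corpus: {len(source_lines)} source lines "
            f"vs {len(target_lines)} target lines"
        )
    pairs = []
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append(SentencePair(src, tgt, target_lang))
    return pairs
```

Checking the line counts up front matters because a single missing line would silently shift every later pair out of alignment.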

Fine-Tuning Plan

How we'll build the Balochi translation engine.

Base Model

OpenNMT Transformer

Architecture

Encoder-Decoder Transformer

Transfer Learning

Bootstrap from multilingual models trained on related languages (Persian, Urdu, Hindi) to overcome low-resource challenges

Training Data

50,000-100,000 aligned Balochi-English and Balochi-Urdu sentence pairs

Balochi — English

Phase 4

Primary translation direction for international accessibility and academic research on Balochi language and culture.

Target Pairs: 50,000-100,000 aligned sentences

Parallel corpus being built

Balochi — Urdu

Phase 4

Essential for regional communication and bridging Balochi speakers with the national language of Pakistan.

Target Pairs: 50,000-100,000 aligned sentences

Parallel corpus being built
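As an illustration of how progress toward the stated 50,000-100,000 pair target might be tracked for each direction, here is a small sketch. The class `DirectionStatus`, its fields, and the rule that training can start once the lower bound is reached are assumptions for this example, not a description of the project's tooling.

```python
from dataclasses import dataclass

TARGET_MIN = 50_000   # lower bound of the stated target
TARGET_MAX = 100_000  # upper bound of the stated target


@dataclass(frozen=True)
class DirectionStatus:
    direction: str        # e.g. "Balochi-English"
    validated_pairs: int  # hypothetical count of human-validated pairs

    @property
    def progress(self) -> float:
        """Fraction of the minimum target reached, capped at 1.0."""
        return min(self.validated_pairs / TARGET_MIN, 1.0)

    @property
    def ready_for_training(self) -> bool:
        """Assumed rule: training can start once the lower bound is met."""
        return self.validated_pairs >= TARGET_MIN
```

For example, `DirectionStatus("Balochi-English", 12_500).progress` evaluates to `0.25`, a quarter of the minimum target.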

Parallel Corpus Pipeline

Extract Balochi sentences

Pipeline extracts clean Balochi text from Academy books (in progress)

Generate translations

Use Gemini to create initial English/Urdu translations of extracted sentences

Human validation

Native speakers review and correct machine-generated translations

Train model

Fine-tune OpenNMT transformer on validated parallel corpus
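The four steps above can be sketched as a small data flow. This is a hedged illustration only: the translation step is passed in as a plain callable (`translate`), standing in for whatever Gemini integration the project uses, and the names `Candidate`, `generate_drafts`, `review`, and `training_pairs` are hypothetical. The point it shows is that only pairs approved by a human reviewer reach the training set, with the reviewer's correction taking priority over the machine draft.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Candidate:
    balochi: str                     # extracted Balochi sentence
    draft: str                       # machine-generated translation
    corrected: Optional[str] = None  # reviewer's correction, if any
    approved: bool = False           # set by a native-speaker reviewer


def generate_drafts(sentences: Iterable[str],
                    translate: Callable[[str], str]) -> List[Candidate]:
    """Steps 1-2: create a draft translation for each non-empty sentence."""
    return [Candidate(balochi=s.strip(), draft=translate(s.strip()))
            for s in sentences if s.strip()]


def review(candidate: Candidate, approved: bool,
           corrected: Optional[str] = None) -> Candidate:
    """Step 3: record a reviewer's decision and optional correction."""
    candidate.approved = approved
    candidate.corrected = corrected
    return candidate


def training_pairs(candidates: Iterable[Candidate]) -> List[Tuple[str, str]]:
    """Step 4 input: keep approved pairs, preferring the human correction."""
    return [(c.balochi, c.corrected or c.draft)
            for c in candidates if c.approved]
```

Keeping the draft and the correction as separate fields preserves a record of how often the machine translation needed fixing, which could be useful when judging draft quality.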

Prerequisites: Phase 1 corpus extraction must be complete, and the Phase 2 POS tagger is needed to improve translation alignment. Parallel corpus creation begins in Phase 3.