Self-Supervised Learning for Multilingual Visual Speech Recognition
Visual Speech Recognition (VSR) aims to decode spoken words from lip movements alone. My thesis explores adaptive self-supervision for cross-language generalization in a multilingual Conformer-based architecture.
The problem
Many languages lack large labeled VSR datasets. Models trained on high-resource languages often transfer poorly to low-resource languages, where there is too little labeled visual speech data to train or adapt them.
Approach
Self-supervised pre-training learns representations from unlabeled video before fine-tuning on limited labeled data. The Conformer architecture combines convolution and self-attention, so it can capture both local lip dynamics and longer temporal context.
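To make the local-versus-global distinction concrete, here is a toy sketch in plain Python. It is not the thesis model: frames are single scalars, the function names (`local_conv`, `self_attention`, `toy_conformer_mix`) are made up for illustration, and the feed-forward layers, normalization, multiple heads, and learned projections of a real Conformer block are all left out.

```python
import math


def local_conv(frames, kernel):
    """1-D convolution with zero padding: each output sees only nearby frames."""
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(frames):
                acc += weight * frames[idx]
        out.append(acc)
    return out


def self_attention(frames):
    """Scalar dot-product self-attention: each output mixes all frames."""
    out = []
    for query in frames:
        scores = [query * key for key in frames]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(sum(w / total * v for w, v in zip(weights, frames)))
    return out


def toy_conformer_mix(frames, kernel):
    """Attention branch, then convolution branch, each with a residual add.

    Assumption: this keeps only the attention -> convolution ordering of a
    Conformer block and drops everything else.
    """
    attended = [f + a for f, a in zip(frames, self_attention(frames))]
    convolved = local_conv(attended, kernel)
    return [x + c for x, c in zip(attended, convolved)]
```

Feeding in an impulse such as `[0, 0, 1, 0, 0]` with a three-tap kernel shows the difference: the convolution output at the first frame stays at zero because the impulse is outside its window, while the attention output there is nonzero because every frame attends to every other frame.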
What I'm measuring
- Cross-language transfer after self-supervised pre-training
- Robustness under varied lighting and speaker conditions
- Comparison against fully supervised baselines per language
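The list above does not name a metric. A minimal sketch of how the per-language comparison could be scored, assuming word error rate (WER) as the metric, which is a common choice for speech recognition. The function names and the relative-reduction summary are illustrative assumptions, not the thesis evaluation code.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER over (reference, hypothesis) transcript pairs for one language."""
    edits = 0
    words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        edits += edit_distance(ref_words, hypothesis.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return edits / words


def relative_wer_reduction(baseline_wer, model_wer):
    """Fraction by which the model lowers WER relative to a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - model_wer) / baseline_wer
```

Running `corpus_wer` once per language for the self-supervised model and once for the supervised baseline, then passing both values to `relative_wer_reduction`, gives one number per language that can be compared across high- and low-resource settings.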
Why it matters
Accessible speech interfaces shouldn't depend on whether your language has large labeled corpora. Self-supervision is one path toward more equitable VSR systems.