Self-Supervised Learning for Multilingual Visual Speech Recognition

AI/ML
Computer Vision
Research
NLP

Visual Speech Recognition (VSR) aims to decode spoken words from lip movements alone. My thesis explores adaptive self-supervision for cross-language generalization in a multilingual Conformer-based architecture.

The problem

Many languages lack large labeled VSR datasets. Models trained on high-resource languages often fail to transfer to low-resource settings where labeled visual speech data is scarce.

Approach

Self-supervised pre-training learns representations from unlabeled video, and the model is then fine-tuned on whatever limited labeled data exists for the target language. The Conformer architecture combines convolution, which captures local lip dynamics, with self-attention, which captures longer temporal context.
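To make the local versus global distinction concrete, here is a toy sketch in plain Python. It is not the thesis model and not a full Conformer block: it has no learned projections, feed-forward modules, normalization, or multiple heads, and the function names and the smoothing kernel are assumptions chosen for illustration. It only shows that a convolution mixes a few neighboring frames while self-attention mixes every frame in the clip.

```python
import math

def depthwise_conv1d(frames, kernel):
    """Local mixing: each output frame sees only len(kernel) neighboring frames.

    frames is a list of T feature vectors; edges are zero-padded.
    """
    half = len(kernel) // 2
    T, D = len(frames), len(frames[0])
    out = [[0.0] * D for _ in range(T)]
    for t in range(T):
        for k, w in enumerate(kernel):
            src = t + k - half
            if 0 <= src < T:
                for d in range(D):
                    out[t][d] += w * frames[src][d]
    return out

def self_attention(frames):
    """Global mixing: each output frame is a softmax-weighted average of ALL frames.

    Queries, keys and values are the raw frames (no learned projections).
    """
    D = len(frames[0])
    scale = 1.0 / math.sqrt(D)
    out = []
    for q in frames:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in frames]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, frames))
                    for d in range(D)])
    return out

def toy_block(frames, kernel=(0.25, 0.5, 0.25)):
    """Attention, then convolution, each wrapped in a residual connection."""
    attended = self_attention(frames)
    x = [[a + b for a, b in zip(f, g)] for f, g in zip(frames, attended)]
    conv = depthwise_conv1d(x, kernel)
    return [[a + b for a, b in zip(f, g)] for f, g in zip(x, conv)]
```

Feeding in a clip where only the first frame is non-zero shows the difference: the convolution output is zero everywhere beyond the immediate neighbor, while the attention output at the last frame still carries a share of the first frame.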

What I'm measuring

  • Cross-language transfer after self-supervised pre-training
  • Robustness under varied lighting and speaker conditions
  • Comparison against fully supervised baselines per language
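
The post does not name an evaluation metric, so the sketch below assumes word error rate (WER), a common choice for speech recognition. It shows one way the per-language comparison against a supervised baseline could be tabulated. The function names and the `ssl_finetuned` / `supervised` keys are hypothetical, and the same grouping could be keyed by lighting or speaker condition instead of language.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = cur
    return prev[-1]

def corpus_wer(pairs):
    """WER over (reference, hypothesis) transcript pairs: total edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("no reference words to score against")
    return errors / words

def compare_per_language(outputs):
    """outputs maps language -> {"ssl_finetuned": pairs, "supervised": pairs}.

    Returns WER for both systems plus delta = supervised - ssl_finetuned,
    so a positive delta means the pre-trained model made fewer errors.
    """
    report = {}
    for lang, systems in outputs.items():
        ssl = corpus_wer(systems["ssl_finetuned"])
        sup = corpus_wer(systems["supervised"])
        report[lang] = {"ssl_finetuned": ssl, "supervised": sup, "delta": sup - ssl}
    return report
```

Pooling edits over the whole test set, rather than averaging per-utterance WER, keeps short utterances from dominating the score.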

Why it matters

Accessible speech interfaces shouldn't depend on whether your language happens to have a large labeled corpus. Self-supervision is one path toward more equitable VSR systems.