A paper presented at ICASSP 2020

Created: May 30, 2020

Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms - IEEE Conference Publication

ieeexplore.ieee.org

This paper discusses a trilingual semantic embedding model that associates visual objects in images with segments of speech signals corresponding to spoken words in an unsupervised manner.

Spoken captions are spontaneous descriptions given by individual speakers rather than readings of prepared transcripts. As a result, captions in different languages, or from different speakers, may focus on different aspects of the same image.

Based on this insight, we introduce a self-attention mechanism into the model to better map the spoken captions associated with the same image into the embedding space. We hope that the self-attention mechanism efficiently captures relationships between widely separated word-like segments.
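To illustrate why self-attention suits widely separated segments, here is a minimal sketch of generic single-head scaled dot-product self-attention in plain Python. It is not the architecture from the paper: the function names, the toy sizes, and the random projection matrices `w_q`, `w_k`, `w_v` are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def self_attention(segments, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    segments: T x D list of per-segment feature vectors (hypothetical input).
    w_q, w_k, w_v: D x D projection matrices (hypothetical parameters).
    Returns (outputs, weights): T x D outputs and T x T attention weights.
    """
    q = matmul(segments, w_q)
    k = matmul(segments, w_k)
    v = matmul(segments, w_v)
    scale = math.sqrt(len(w_k[0]))
    # scores[i][j]: how strongly segment i attends to segment j, at any distance.
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of Gaussian samples (toy data only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D = 6, 4  # six word-like segments, four feature dimensions (toy sizes)
segments = random_matrix(T, D, rng)
w_q, w_k, w_v = (random_matrix(D, D, rng) for _ in range(3))
outputs, weights = self_attention(segments, w_q, w_k, w_v)
```

Each row of `weights` sums to 1, and `weights[0][T - 1]` is the weight the first segment places on the last one. That entry is computed in a single step regardless of how far apart the two segments are, which is the property that makes self-attention attractive for relating distant word-like segments.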

This is a collaborative work with MIT CSAIL.