Please check out our ICASSP 2020 paper “Trilingual semantic embeddings of visually grounded speech with self-attention mechanisms” by Yasunori Ohishi, me (Akisato Kimura), Takahito Kawanishi, Kunio Kashino, David Harwath, and James Glass.
This paper proposes a trilingual semantic embedding model that, in an unsupervised manner, associates visual objects in images with the segments of speech signals that correspond to spoken words.
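To make the idea of a shared image-speech embedding space concrete, here is a minimal sketch of how such a space is often trained with a margin-based ranking loss; the function name, margin value, and tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def image_speech_triplet_loss(image_emb, audio_emb, margin=1.0):
    """Hypothetical margin-based ranking loss over a shared embedding space.
    image_emb, audio_emb: (batch, dim) L2-normalized embeddings where row i
    of each tensor comes from the same image / spoken-caption pair."""
    sim = image_emb @ audio_emb.t()               # (batch, batch) similarity matrix
    pos = sim.diag().unsqueeze(1)                 # similarities of matched pairs
    # Hinge terms: impostor captions for each image (rows) and
    # impostor images for each caption (columns); the diagonal is zeroed out.
    cost_captions = F.relu(margin + sim - pos).fill_diagonal_(0)
    cost_images = F.relu(margin + sim - pos.t()).fill_diagonal_(0)
    return (cost_captions + cost_images).mean()
```

Training with such a loss pulls matched image/speech pairs together and pushes mismatched pairs apart, without ever using textual labels.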
Spoken captions are spontaneous descriptions by individual speakers, rather than readings of prepared transcripts. This means that captions in different languages, or from different speakers, may focus on different aspects of the same image.
Based on this insight, we introduce a self-attention mechanism into the model so that spoken captions associated with the same image are mapped closer together in the embedding space. We expect the self-attention mechanism to efficiently capture relationships between widely separated word-like segments.
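As a rough illustration, the sketch below shows self-attention pooling over a sequence of speech-segment features; the class name, dimensions, and mean-pooling choice are hypothetical stand-ins rather than the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Hypothetical self-attention pooling over speech-segment features."""
    def __init__(self, dim=512, heads=4):
        super().__init__()
        # Every segment attends to every other segment, so relationships
        # between widely separated word-like segments can be captured.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, segments):
        # segments: (batch, num_segments, dim) speech features
        attended, _ = self.attn(segments, segments, segments)
        # Pool the attended segments into a single caption-level vector and
        # project it into the shared audio-visual embedding space.
        caption_emb = attended.mean(dim=1)
        return F.normalize(self.proj(caption_emb), dim=-1)
```

A caption embedding produced this way can be compared with an image embedding by a simple dot product, as in the ranking-loss sketch above.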
This is joint work with MIT CSAIL.