We are pleased to share our paper “Pair expansion for learning multilingual semantic embeddings using disjoint visually-grounded speech audio datasets,” presented at Interspeech 2020.
We propose a data expansion method for learning a multilingual semantic embedding model using disjoint datasets containing images and their multilingual audio captions. Here, disjoint means that there are no shared images among the multiple language datasets, in contrast to existing works on multilingual semantic embedding based on visually-grounded speech audio, where it has been assumed that each image is associated with spoken captions of multiple languages.
The bottom part of the image below illustrates this disjoint setting, where the language datasets share no images.
Our main idea is "pair expansion": making use of even disjoint pairs by finding similarities that may exist between different images. We examine two approaches to measuring similarity: one using image embeddings and the other using object recognition results.
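To make the embedding-based approach concrete, here is a minimal sketch of cross-dataset pair matching by cosine similarity of image embeddings. The function name, threshold value, and nearest-neighbor matching scheme are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np


def expand_pairs(emb_a, emb_b, threshold=0.8):
    """Pair each image in dataset A with its most similar image in
    dataset B by cosine similarity of image embeddings.

    Illustrative sketch only; the actual similarity measures and
    matching criteria used in the paper may differ.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                # (n_a, n_b) cosine-similarity matrix
    best = sim.argmax(axis=1)    # nearest neighbor in B for each A image
    # Keep only sufficiently similar pairs.
    return [(i, j, float(sim[i, j]))
            for i, j in enumerate(best)
            if sim[i, j] >= threshold]
```

Each returned pair links an image (and hence its audio caption) from one language dataset to a visually similar image in the other, expanding the set of usable cross-lingual training pairs.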
This is a collaborative work with MIT CSAIL.