We are pleased to share our paper “Pair expansion for learning multilingual semantic embeddings using disjoint visually-grounded speech audio datasets,” presented at Interspeech 2020.
We propose a data expansion method for learning a multilingual semantic embedding model from disjoint datasets, each containing images and their audio captions in a different language.
Here, disjoint means that no images are shared across the datasets for the different languages, as shown in the bottom part of the figure below.
Our main idea is "pair expansion": utilizing even disjoint pairs by finding similarities that may exist between different images. We examine two approaches for measuring similarity: one using image embeddings and the other using object recognition results.
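As a rough illustration of the embedding-based variant, the sketch below matches images from two disjoint datasets by the cosine similarity of their embeddings and treats the captions of sufficiently similar image pairs as new cross-lingual pairs. The function name `expand_pairs`, the embedding dimensions, and the similarity threshold are hypothetical placeholders, not the paper's exact procedure.

```python
import numpy as np

def expand_pairs(emb_a, emb_b, threshold=0.8):
    """Match images across two disjoint datasets by embedding similarity.

    emb_a: (N, D) image embeddings from the dataset with language-A captions
    emb_b: (M, D) image embeddings from the dataset with language-B captions
    Returns (i, j) index pairs whose cosine similarity exceeds the threshold;
    the captions of images i and j can then serve as an expanded
    cross-lingual training pair.
    """
    # L2-normalize so the dot product equals cosine similarity
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T  # (N, M) cosine-similarity matrix
    return list(zip(*np.nonzero(sim > threshold)))

# Toy usage: random vectors stand in for the output of a real image encoder
rng = np.random.default_rng(0)
emb_en = rng.normal(size=(100, 512))  # images with English captions
emb_ja = rng.normal(size=(80, 512))   # images with Japanese captions
pairs = expand_pairs(emb_en, emb_ja, threshold=0.2)
print(f"{len(pairs)} expanded cross-lingual pairs")
```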
This is collaborative work with MIT CSAIL.