We are pleased to share our paper “Pair expansion for learning multilingual semantic embeddings using disjoint visually-grounded speech audio datasets,” presented at Interspeech 2020.
We propose a data expansion method for learning a multilingual semantic embedding model from disjoint datasets, each containing images and their audio captions in a different language.
Here, disjoint means that no images are shared across the datasets for the different languages, as shown in the bottom part of the figure below.
Our main idea is "pair expansion": utilizing even disjoint pairs by finding similarities that may exist between different images. We examine two approaches for measuring similarity: one using image embeddings and the other using object recognition results.
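As a rough illustration of the embedding-based variant, the sketch below matches images from two disjoint datasets by the cosine similarity of their embeddings and treats the captions of sufficiently similar image pairs as new cross-lingual pairs. The function name `expand_pairs`, the embedding dimensions, and the similarity threshold are hypothetical placeholders, not the paper's exact procedure.

```python
import numpy as np

def expand_pairs(emb_a, emb_b, threshold=0.8):
    """Match images across two disjoint datasets by embedding similarity.

    emb_a: (N, D) image embeddings from the dataset with language-A captions
    emb_b: (M, D) image embeddings from the dataset with language-B captions
    Returns (i, j) index pairs whose cosine similarity exceeds the threshold;
    the captions of images i and j can then serve as an expanded
    cross-lingual training pair.
    """
    # L2-normalize so the dot product equals cosine similarity
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T  # (N, M) cosine-similarity matrix
    return list(zip(*np.nonzero(sim > threshold)))

# Toy usage: random vectors stand in for the output of a real image encoder
rng = np.random.default_rng(0)
emb_en = rng.normal(size=(100, 512))  # images with English captions
emb_ja = rng.normal(size=(80, 512))   # images with Japanese captions
pairs = expand_pairs(emb_en, emb_ja, threshold=0.2)
print(f"{len(pairs)} expanded cross-lingual pairs")
```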
This is collaborative work with MIT CSAIL.