A paper accepted to ACMMM2022

July 4, 2022
August 7, 2022

We are excited to announce that our paper “ConceptBeam: Concept driven target speech extraction” has been accepted to ACMMM2022 (acceptance rate = 690/2473 = 27.9%).

This paper proposes a novel framework for target speech extraction based on semantic information, called ConceptBeam.

Target speech extraction means extracting the speech of a target speaker from a mixture of overlapping speakers. Typical approaches exploit properties of the audio signal, such as harmonic structure and direction of arrival.

In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or a speech signal. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to represent a target concept directly. In our proposed scheme, a concept is encoded as a semantic embedding by mapping the concept specifier into a shared embedding space.

This modality-independent space can be built by means of deep metric learning, using paired data consisting of images and their spoken captions. We use this space to bridge the modality-dependent information, i.e., the speech segments in the mixture, with the specified, modality-independent concept.
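To make the idea of a shared embedding space concrete, here is a minimal numpy sketch of cross-modal deep metric learning with a triplet-style loss. Everything here is an illustrative assumption, not the paper's actual training code: the "encoders" are random stand-ins, the 16-dimensional embedding size is arbitrary, and the hinge triplet loss is one common metric-learning objective among several that could be used.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Project embeddings onto the unit sphere so cosine similarity
    # reduces to a plain dot product.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(img_emb, pos_speech_emb, neg_speech_emb, margin=0.2):
    # Hinge-style triplet loss: pull an image toward its own spoken
    # caption and push it away from a caption of a different image.
    pos_sim = np.sum(img_emb * pos_speech_emb, axis=-1)
    neg_sim = np.sum(img_emb * neg_speech_emb, axis=-1)
    return np.maximum(0.0, margin - pos_sim + neg_sim).mean()

# Stand-ins for encoder outputs (hypothetical): 4 image/caption pairs
# embedded in a 16-dimensional shared space.
img = l2_normalize(rng.normal(size=(4, 16)))
pos = l2_normalize(img + 0.1 * rng.normal(size=(4, 16)))  # matching spoken captions
neg = l2_normalize(rng.normal(size=(4, 16)))              # mismatched captions

loss = triplet_loss(img, pos, neg)
```

Minimizing such a loss drives an image and its own spoken caption close together regardless of modality, which is what makes the space usable for comparing a concept specifier against speech segments later on.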

As a proof of concept, we perform experiments using a set of images associated with spoken captions. Specifically, we generate speech mixtures from these spoken captions and use the images or speech signals as the concept specifiers. We then extract the target speech using the acoustic characteristics estimated from the identified segments. We compare ConceptBeam with two baselines: one based on keywords obtained from recognition systems and another based on sound source separation. We show that ConceptBeam outperforms the baseline methods in most cases and effectively extracts speech based on the semantic representation.
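The identification step described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's pipeline: the concept and segment embeddings are synthetic, and the 0.5 similarity threshold is an arbitrary placeholder for whatever selection rule the actual system uses.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical shared-space embeddings: one concept specifier (e.g. an
# image) and six speech segments taken from the mixture.
concept = l2_normalize(rng.normal(size=16))
segments = l2_normalize(rng.normal(size=(6, 16)))
# For illustration, force segments 1 and 4 to be "about" the concept.
segments[[1, 4]] = l2_normalize(concept + 0.1 * rng.normal(size=(2, 16)))

# Cosine similarity between each segment and the concept embedding.
scores = segments @ concept
# Segments whose similarity clears the threshold are attributed to the
# target talker; their acoustic characteristics would then drive the
# subsequent extraction stage.
matched = np.where(scores > 0.5)[0]
```

The design point is that the comparison happens entirely in the modality-independent space, so the same scoring works whether the concept was specified by an image or by speech.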

More details are available on the project page prepared by the first author, Yasunori.