We are pleased to announce that our paper “Video discourse parsing and its application to multimodal summarization: A dataset and baseline approaches” has been accepted to the Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024).
Tsutomu Hirao, Naoki Kobayashi, Hidetaka Kamigaito, Manabu Okumura, and Akisato Kimura, “Video discourse parsing and its application to multimodal summarization: A dataset and baseline approaches,” Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
This paper tackles a new task, discourse parsing for videos, inspired by text discourse parsing based on Rhetorical Structure Theory (RST). The task aims to construct an RST tree for a video that represents its storyline and captures the relationships among its events.
We first construct a benchmark dataset by identifying events and their time spans in each video, providing corresponding captions, and building RST trees whose leaves are the events.
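To make the annotation structure concrete, the following is a minimal sketch of how one annotated video might be represented: events as leaves (time spans with captions) combined by rhetorical relations into a tree. The class and field names here are hypothetical illustrations, not the actual schema of the released dataset.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Event:
    """A video event: a time span plus its caption (an RST leaf)."""
    start_sec: float
    end_sec: float
    caption: str


@dataclass
class RSTNode:
    """An internal RST-tree node joining two subtrees with a rhetorical relation."""
    relation: str                      # e.g. "Elaboration", "Sequence" (illustrative labels)
    nuclearity: str                    # e.g. "nucleus-satellite"
    left: Union["RSTNode", Event]
    right: Union["RSTNode", Event]


# Toy example: three events from a cooking video combined into one tree.
e1 = Event(0.0, 12.5, "A chef chops vegetables.")
e2 = Event(12.5, 30.0, "The vegetables are stir-fried in a pan.")
e3 = Event(30.0, 41.0, "The finished dish is plated.")
tree = RSTNode("Sequence", "nucleus-nucleus",
               RSTNode("Sequence", "nucleus-nucleus", e1, e2),
               e3)
```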
We then evaluate two baseline approaches to video RST parsing: a “parsing after captioning” framework and a “parsing with visual features” approach. The results show that a parser using gold captions performs best, parsers relying on automatically generated captions perform worst, and a parser using visual features falls in between. We also observe that the visual-feature parser can be improved by pre-training it on video captioning designed to produce a coherent video story.
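The “parsing after captioning” idea can be summarized as a two-stage pipeline: caption each event clip first, then hand the caption sequence to a text-level RST parser. The sketch below illustrates only this pipeline shape; `caption_clip` and `TextRSTParser` are hypothetical placeholders, not the specific models evaluated in the paper.

```python
from typing import List, Tuple


def caption_clip(video_path: str, start_sec: float, end_sec: float) -> str:
    """Placeholder for a video captioning model applied to one event span."""
    raise NotImplementedError("plug in a video captioning model here")


class TextRSTParser:
    """Placeholder for a text RST parser that builds a tree over caption units."""

    def parse(self, units: List[str]):
        raise NotImplementedError("plug in a text RST parser here")


def parse_video(video_path: str, spans: List[Tuple[float, float]]):
    """Caption each (start, end) event span, then parse the caption sequence."""
    captions = [caption_clip(video_path, s, e) for s, e in spans]
    return TextRSTParser().parse(captions)
```

The “parsing with visual features” approach, by contrast, feeds features extracted from the video clips to the parser directly instead of going through generated captions.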
Furthermore, we demonstrate that the RST trees obtained from videos contribute to multimodal summarization that combines keyframes with text.
Please see the ACL Anthology page https://aclanthology.org/2024.findings-emnlp.581/ for more details. The dataset (video annotations) will be released on GitHub (currently in preparation).