A paper presented at DAS2022

Created

March 14, 2022

Tags

Paper, Computer Vision

Updated

March 14, 2022

We are pleased to announce that our paper “Font shape-to-impression translation” has been accepted to DAS2022 (IAPR International Workshop on Document Analysis Systems) as an oral presentation. Of 88 submissions, 53 papers (60%) were accepted, and 31 of those (the top 35% of submissions) were selected for oral presentation (see https://das2022.univ-lr.fr/index.php/list-of-accepted-papers/ for details).

Different fonts give different impressions, such as elegant, scary, and cool. The figure below shows several examples of fonts and their impression tags. Some tags directly express font style types, such as Sans-Serif and Script; others describe shape-related properties, such as Bold and Oblique; and still others express more abstract impressions, such as Elegant and Scary.

To understand the relationships between font shapes and impressions, one promising approach is the part-based approach, in which a font image is decomposed into a set of local parts, and the individual parts and their combinations are then correlated with impressions. The part-based approach can discard the overall letter shape while retaining various impression clues from local shapes, such as serifs, curvature, corner shape, and stroke width.
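This post does not say how the local parts are extracted, so the following is only a minimal sketch of the decomposition idea. It assumes a binary glyph image and plain sliding-window patches; the function name `extract_local_parts`, the window size, and the stride are hypothetical, and the paper itself may use a different local descriptor.

```python
from typing import List

Image = List[List[int]]  # binary glyph image: 1 = ink, 0 = background


def extract_local_parts(image: Image, size: int = 3, stride: int = 2) -> List[List[int]]:
    """Slide a size x size window over the image and keep windows containing ink.

    Each kept window is flattened into a vector and treated as one local part.
    The window position is deliberately dropped, so the result describes local
    shapes only, not the layout of the whole letter.
    """
    height, width = len(image), len(image[0])
    parts = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patch = [image[top + r][left + c] for r in range(size) for c in range(size)]
            if any(patch):  # skip windows that are pure background
                parts.append(patch)
    return parts


# A toy 5x5 glyph resembling the letter "L" (illustrative data, not from the paper).
glyph_l = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
parts = extract_local_parts(glyph_l)
print(len(parts), "local parts, each of dimension", len(parts[0]))
```

Because empty windows are skipped, the number of parts depends on the glyph, so different fonts generally yield sets of different sizes.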

This paper proposes a novel method for part-based shape-impression analysis that makes full use of the Transformer. We first train a Transformer to output impressions for a given set of local shapes. We then analyze the trained Transformer in various ways to identify the local shapes that are important for a specific impression. The advantages of the Transformer for our analysis are threefold.

First, the Transformer is a versatile model and offers us two different approaches. As shown in the figure below, the classification approach (a) accepts $N$ local descriptors as its input elements and outputs the probability of each of $K$ impression classes. In the translation approach (b), a Transformer encoder takes the $N$ local descriptors as input and encodes them into latent vectors that serve as “keys” and “values”; these are fed to a Transformer decoder, which outputs a set of impression words, much like a translation result.
Second, the Transformer can accept a variable number of input elements. Since the number of local shapes extracted from a single font image is not constant, this property suits our task.
Third, and most importantly, the Transformer has a self-attention mechanism. Self-attention determines a weight for every input element by considering the other input elements. Therefore, if we feed local shapes to the Transformer as multiple input elements, their correlations are calculated internally and used for the task.
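The self-attention and variable-length properties described above can be illustrated with a deliberately simplified sketch. It is not the paper's model: it assumes a single attention head with identity query/key/value projections (a real Transformer learns these), mean pooling over the attended descriptors, and an independent sigmoid per impression class. The names `attention_weights`, `classify`, and `class_weights`, and all numbers, are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(descriptors: List[Vector]) -> List[Vector]:
    """Return an N x N matrix; row i is how strongly element i attends to each element.

    Assumption: queries and keys are the raw descriptors (no learned projections).
    """
    scale = math.sqrt(len(descriptors[0]))
    return [softmax([dot(q, k) / scale for k in descriptors]) for q in descriptors]


def classify(descriptors: List[Vector], class_weights: List[Vector]) -> Vector:
    """Map N local descriptors (any N >= 1) to K impression probabilities."""
    n, d = len(descriptors), len(descriptors[0])
    weights = attention_weights(descriptors)
    # Each output element is a weighted mix of all input elements.
    attended = [
        [sum(weights[i][j] * descriptors[j][c] for j in range(n)) for c in range(d)]
        for i in range(n)
    ]
    # Mean pooling removes the dependence on N before the per-class scores.
    pooled = [sum(vec[c] for vec in attended) / n for c in range(d)]
    return [1.0 / (1.0 + math.exp(-dot(w, pooled))) for w in class_weights]


# K = 3 hypothetical impression classes over d = 2 dimensional descriptors.
class_weights = [[1.0, -1.0], [-0.5, 0.5], [0.2, 0.2]]
font_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 local descriptors
font_b = [[0.5, 0.5], [1.0, 0.0]]              # N = 2 local descriptors

for name, font in (("font_a", font_a), ("font_b", font_b)):
    probs = classify(font, class_weights)
    print(name, "N =", len(font), "->", [round(p, 3) for p in probs])
```

Both fonts produce $K$ probabilities even though they contain different numbers of local descriptors, and each row of the attention matrix shows how much one local shape draws on the others.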

A preprint is available at https://arxiv.org/abs/2203.05808 , and the official proceedings version is available at https://link.springer.com/chapter/10.1007/978-3-031-06555-2_1