# An end-to-end TextSpotter with Explicit Alignment and Attention

## Abstract

Text detection and recognition in natural images have long been considered as two separate tasks that are processed sequentially. Training the two tasks in a unified framework is non-trivial due to significant differences in optimisation difficulty. In this work, we present a conceptually simple yet efficient framework that simultaneously processes the two tasks in one shot. Our main contributions are three-fold: 1) we propose a novel text-alignment layer that precisely computes convolutional features of a text instance in arbitrary orientation, which is the key to boosting performance; 2) a character attention mechanism is introduced by using character spatial information as explicit supervision, leading to large improvements in recognition; 3) the two techniques, together with a new RNN branch for word recognition, are integrated seamlessly into a single model which is end-to-end trainable. This allows the two tasks to work collaboratively by sharing convolutional features, which is critical for identifying challenging text instances. Our model achieves impressive results in end-to-end recognition on the ICDAR2015 [1] dataset, significantly advancing the most recent results [2], with improvements of F-measure from (0.54, 0.51, 0.47) to (0.82, 0.77, 0.63), by using a strong, weak and generic lexicon respectively. Thanks to joint training, our method can also serve as a good detector by achieving a new state-of-the-art detection performance on two datasets.

## 1. Introduction

The goal of text spotting is to map an input natural image into a set of character sequences or word transcripts and their corresponding locations. It has attracted increasing attention in the vision community, due to its numerous potential applications. It has made rapid progress riding on the wave of recent deep learning technologies, as substantiated by recent works [3, 4, 2, 5, 6, 7, 8, 9, 10, 11]. However, text spotting in the wild still remains an open problem, since text instances often exhibit vast diversity in font, scale and orientation under various illumination effects, and often come with a highly complicated background.

Past works in text spotting often consider it as two individual tasks: text detection and word recognition, which are implemented sequentially. The goal of text detection is to precisely localize all text instances (e.g., words) in a natural image, and then a recognition model is applied repeatedly to all detected regions to recognize the corresponding text transcripts. Recent approaches for text detection are mainly extended from general object detectors (such as Faster R-CNN [12] and SSD [13]) by directly regressing a bounding box for each text instance, or from semantic segmentation methods (e.g., Fully Convolutional Networks (FCN) [14]) by predicting a text/non-text probability at each pixel. With careful model design and development, these approaches can be customized properly towards this highly domain-specific task, and achieve state-of-the-art performance [4, 6, 7, 8, 9, 15]. Word recognition can be cast as a sequence labeling problem, for which convolutional recurrent models have been developed recently [9, 16]. Some of them were further combined with an attention mechanism to improve performance [17, 18]. However, training the two tasks separately does not exploit the full potential of convolutional networks, because the convolutional features are not shared. It is natural for us to make a more reliable detection decision if we clearly understand or recognize the meaning of a word and all characters within it. Besides, a separate pipeline is likely to introduce a number of heuristic rules and hyper-parameters that are costly to tune, making the whole system highly complicated.

The recent Mask R-CNN [19] incorporates an instance segmentation task into the Faster R-CNN [12] detection framework, resulting in a multi-task learning model that jointly predicts a bounding box and a segmentation mask for each object instance. Our work draws inspiration from this pipeline, but has a different goal of learning a direct mapping between an input image and a set of character sequences. We create a recurrent sequence modeling branch for word recognition within a text detection framework, where the RNN-based word recognition is processed in parallel to the detection task.

However, the RNN branch, where the gradients are back-propagated through time, is clearly much more difficult to optimize than the task of bounding box regression in detection. This naturally leads to significant differences in learning difficulties and convergence rates between the two tasks, making the model particularly hard to train jointly. For example, the number of images used to train a text detection model is on the order of $10^3$ (e.g., 1000 training images in ICDAR 2015 [1]), but it increases by orders of magnitude when an RNN-based text recognition model is trained, such as the 800K synthetic images used in [20]. Furthermore, simply using a set of character sequences as direct supervision may be too abstract (high-level) to provide meaningful detailed information for training such an integrated model effectively, which makes it difficult for the model to converge. In this work, we introduce strong spatial constraints at both the word and character levels, which allows the model to be optimized gradually by reducing the search space at each step.

Contributions. In this work, we present a single-shot textspotter capable of learning a direct mapping between an input image and a set of character sequences or word transcripts. We propose a solution that combines a text-alignment layer tailored for multi-orientation text detection, together with a character attention mechanism that explicitly encodes strong spatial information of characters into the RNN branch, as shown in Fig. 1. These two techniques faithfully preserve the exact spatial information at both the text instance and character levels, playing a key role in boosting the overall performance. We develop a principled learning strategy that allows the two tasks to be trained collaboratively by sharing convolutional features. Our main contributions are described as follows.

Firstly, we develop a text-alignment layer by introducing a grid sampling scheme instead of conventional RoI pooling. It computes fixed-length convolutional features that precisely align to a detected text region of arbitrary orientation, successfully avoiding the negative effects caused by orientation changes and by the quantization in RoI pooling.

Secondly, we introduce a character attention mechanism by using character spatial information as additional supervision. This explicitly encodes strong spatial attention of characters into the model, which allows the RNN to focus on the current attentional features in decoding, leading to a performance boost in word recognition.

Thirdly, both approaches, together with a new RNN branch for word recognition, are integrated elegantly into a CNN detection framework, resulting in a single model that can be trained in an end-to-end manner. We develop a principled and intuitive learning strategy that allows the two tasks to be trained effectively by sharing features, with fast convergence.

Finally, we show by experiments that word recognition can significantly improve detection accuracy in our model, demonstrating the strongly complementary nature of the two tasks, which is unique to this highly domain-specific application. Our model achieves new state-of-the-art results on ICDAR2015 in end-to-end recognition of multi-orientation texts, largely outperforming the most recent results in [2], with improvements of F-measure from (0.54, 0.51, 0.47) to (0.82, 0.77, 0.63) in terms of using a strong, weak and generic lexicon. Code is available at https://github.com/tonghe90/textspotter

Related work. Here we briefly introduce related work on text detection, recognition and end-to-end wordspotting.

Scene text detection. Recently, some methods have recast earlier character-based detection [21, 22, 23, 24] as direct text region estimation [25, 8, 15, 26, 4, 27, 28], avoiding multiple bottom-up post-processing steps by treating a word or text line as a whole. Tian et al. [7] modified Faster R-CNN [12] by applying a recurrent structure horizontally on the convolutional feature maps of the top layer. The methods in [4, 25] were inspired by [13]; both adapt a framework designed for generic objects to scene text detection by adjusting the feature extraction process to this domain-specific task. However, these methods are based on prior boxes, which need to be carefully designed in order to fulfill the requirements for training. Methods that directly regress inclined bounding boxes, instead of offsets to fixed prior boxes, have been proposed recently. EAST [8] designed a fully convolutional network structure which outputs a pixel-wise prediction map for text/non-text and five values for every point of a text region, i.e., the distances from the current point to the four edges together with an inclined angle. He et al. [6] proposed a method to generate arbitrary quadrilaterals by calculating offsets between every point of a text region and the vertex coordinates.

Scene text recognition. With the success of recurrent neural networks on digit recognition and speech translation, many methods have been proposed for text recognition. He et al. [16] and Shi et al. [9, 29] treat text recognition as a sequence labeling problem by introducing LSTM [30] and connectionist temporal classification (CTC) [31] into a unified framework. [17] proposed an attention-based LSTM for text recognition, which mainly contains two parts: an encoder and a decoder. In the encoding stage, text images are transformed into a sequence of feature vectors by a CNN/LSTM. Attention weights, indicating relative importance for recognition, are learned during the decoding stage. However, these weights are learned purely from the distribution of the data, and no supervision is provided to guide the learning process.

End-to-end wordspotting. End-to-end wordspotting is an emerging research area. Previous methods usually try to solve it by splitting the whole process into two independent problems: training two cascaded models, one for detection and one for recognition. Detected text regions are first cropped from the original image, followed by affine transformation and rescaling. The corrected images are then repeatedly processed by the recognition model to obtain the corresponding transcripts. However, errors accumulate across the cascaded models, which have no shared features. Li et al. [5] proposed a unified network that simultaneously localizes and recognizes text in one forward pass by sharing convolutional features under a curriculum strategy. But the existing RoI pooling operation limits it to detecting and recognizing only horizontal examples. Busta et al. [2] proposed Deep TextSpotter, which can handle wordspotting of multi-orientation text. However, the method does not share features, meaning that the recognition loss of the later stage has no influence on the earlier localization results.

## 2. Single Shot TextSpotter by Joint Detection and Recognition

In this section, we present the details of the proposed textspotter which learns a direct mapping between an input image and a set of word transcripts with corresponding bounding boxes of arbitrary orientations. Our model is a fully convolutional architecture built on the PVAnet framework [32]. As shown in Fig. 2, we introduce a new recurrent branch for word recognition, which is integrated into our CNN model in parallel with the existing detection branch for text bounding box regression. The RNN branch is composed of a new text-alignment layer and an LSTM-based recurrent module with a novel character attention embedding mechanism. The text-alignment layer extracts precise sequence features within the detected region, which prevents irrelevant text or background information from being encoded. The character attention embedding mechanism regulates the decoding process by providing more detailed supervision of characters. Our textspotter directly outputs final results in one shot, without any post-processing step except for a simple non-maximum suppression (NMS).

Network architecture. Our model is a fully convolutional architecture inspired by [8], where a PVA network [32] is utilized as backbone due to its significantly low computational cost. Unlike generic objects, texts often have much larger variations in both size and aspect ratio. Thus the model not only needs to preserve local details for small-scale text instances, but should also maintain a large receptive field for very long instances. Inspired by the success in semantic segmentation [33], we exploit feature fusion by combining convolutional features of the conv5, conv4, conv3 and conv2 layers gradually, with the goal of maintaining both local detailed features and high-level context information. This results in more reliable predictions on multi-scale text instances. For simplicity, the top layer is $\frac{1}{4}$ the size of the input image.
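The gradual fusion of conv5 through conv2 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each map has twice the resolution of the previous one, that upsampling is nearest-neighbour, and that maps are merged by element-wise addition, none of which is specified in this section. The helper names `upsample2x`, `fuse` and `fuse_pyramid` are hypothetical, and single-channel maps stored as nested lists stand in for real feature tensors.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D map stored as a list of rows."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def fuse(coarse, fine):
    """Upsample the coarse map and add it to the finer one (assumed merge op)."""
    up = upsample2x(coarse)
    return [[a + b for a, b in zip(r_up, r_fine)]
            for r_up, r_fine in zip(up, fine)]


def fuse_pyramid(maps):
    """Fuse maps ordered coarse to fine, e.g. [conv5, conv4, conv3, conv2]."""
    fused = maps[0]
    for finer in maps[1:]:
        fused = fuse(fused, finer)
    return fused


# Toy single-channel maps with constant values, one per layer.
conv5 = [[1]]
conv4 = [[10] * 2 for _ in range(2)]
conv3 = [[100] * 4 for _ in range(4)]
conv2 = [[1000] * 8 for _ in range(8)]
top = fuse_pyramid([conv5, conv4, conv3, conv2])
```

In this toy setting the fused map has the resolution of the finest input, and every location carries a contribution from all four levels, which is the property the fusion is meant to provide.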

Text detection. This branch is similar to that of [8], where a multi-task prediction is implemented at each spatial location on the top convolutional maps, by adopting an Intersection over Union (IoU) loss described in [34]. It contains two sub-branches on the top convolutional layer designed for joint text/non-text classification and multi-orientation bounding box regression. The first sub-branch returns a classification map with the same spatial size as the top feature maps, indicating the predicted text/non-text probabilities using a softmax function. The second sub-branch outputs five localization maps with the same spatial size, which estimate five parameters for each bounding box with arbitrary orientation at each spatial location of text regions. The five parameters represent the distances of the current point to the top, bottom, left and right sides of an associated bounding box, together with its inclined orientation. With these configurations, the detection branch is able to predict a quadrilateral of arbitrary orientation for each text instance. The feature of the detected quadrilateral region is then fed into the RNN branch for word recognition via a text-alignment layer which is described below.
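To make the five-parameter geometry concrete, the sketch below decodes one spatial location into the four corners of an inclined box. The exact conventions are assumptions made for illustration (image coordinates with y pointing down, the angle `theta` in radians rotating the box's own axes about the current point, and corners returned in the order top-left, top-right, bottom-right, bottom-left); `decode_box` is a hypothetical helper, not a function from the released code.

```python
import math


def decode_box(x, y, top, bottom, left, right, theta):
    """Decode distances to the four sides plus an angle into four corners.

    (x, y) is the current point; the returned corners are ordered
    top-left, top-right, bottom-right, bottom-left.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Corners expressed in the box's own (unrotated) frame, centred at (x, y).
    local = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    # Rotate each corner by theta, then translate back to image coordinates.
    return [(x + u * cos_t - v * sin_t, y + u * sin_t + v * cos_t)
            for u, v in local]


corners = decode_box(10, 5, top=2, bottom=3, left=4, right=6, theta=0.0)
print(corners)  # [(6.0, 3.0), (16.0, 3.0), (16.0, 8.0), (6.0, 8.0)]
```

Whatever the angle, the decoded box has width `left + right` and height `top + bottom`, so the four distances fix the size of the box while the angle fixes only its orientation.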

### 2.1. Text-Alignment Layer

We create a new recurrent branch for word recognition, where a text-alignment layer is proposed to precisely compute fixed-size convolutional features from a quadrilateral region of arbitrary size. The text-alignment layer is extended from RoI pooling [35], which is widely used for general object detection. RoI pooling computes fixed-size convolutional features (e.g., 7 × 7) from a rectangular region of arbitrary size, by performing a quantization operation. It can be integrated into the convolutional layers for in-network region cropping, which is a key component for end-to-end training of a detection framework. However, directly applying RoI pooling to a text region will lead to a significant performance drop in word recognition due to the issue of misalignment.

– First, unlike object detection and classification where the RoI pooling computes global features of a RoI region for discriminating an object, word recognition requires more detailed and accurate local features and spatial information for predicting each character sequentially. As pointed out in [19], the RoI pooling performs quantizations which inevitably introduce misalignments between the original RoI region and the extracted features. Such misalignments have a significant negative effect on predicting characters, particularly small-scale ones such as ‘i’ and ‘l’.

– Second, RoI pooling was designed for a rectangular region, which is only capable of localizing horizontal instances. It produces larger misalignments when applied to multi-orientation text instances. Furthermore, a large amount of background information and irrelevant text is easily encoded when a rectangular RoI region is applied to a highly inclined text instance, as shown in Fig. 3. This severely degrades the RNN decoding process when recognizing sequential characters.
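The quantization issue raised in the first point can be illustrated numerically. The sketch below assumes a feature stride of 4 (the top layer is 1/4 of the input size) and assumes that RoI boundaries are snapped to the feature grid with `floor`; the exact rounding rule differs between RoI pooling implementations, and `quantization_error` is a hypothetical helper.

```python
import math


def quantization_error(x0, x1, stride):
    """Shift, in input-image pixels, caused by snapping RoI edges to the grid.

    x0 and x1 are the left and right RoI edges in image coordinates.
    """
    errors = []
    for x in (x0, x1):
        exact = x / stride            # continuous feature-map coordinate
        snapped = math.floor(exact)   # quantized coordinate (assumed rule)
        errors.append((exact - snapped) * stride)
    return tuple(errors)


print(quantization_error(18, 45, stride=4))  # (2.0, 1.0)
```

Under these assumptions an edge can move by up to almost a full stride, i.e. close to 4 pixels in the input image, which is comparable to the width of a thin character in a small word.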

The recent Mask R-CNN considers explicit per-pixel spatial correspondence by introducing RoIAlign pooling [19]. This inspires the present work, which develops a new text-alignment layer tailored for text instances, which are quadrilaterals of arbitrary orientation. It provides strong word-level alignment with accurate per-pixel correspondence, which is of critical importance for extracting exact text information from the convolutional maps, as shown in Fig. 3.
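One way to picture such a layer is as a fixed-size grid of sampling points laid over the quadrilateral, with each point read from the feature map without rounding. The sketch below is an illustration under stated assumptions rather than the paper's layer: it assumes bilinear interpolation in the style of RoIAlign, one sample at the centre of each output cell, corners given in feature-map coordinates in the order top-left, top-right, bottom-right, bottom-left, and a single-channel map stored as nested lists. The names `bilinear` and `text_align` are hypothetical.

```python
def bilinear(fmap, x, y):
    """Bilinearly interpolate a 2-D map (list of rows) at a continuous (x, y)."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)   # clamp to the valid range
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = fmap[y0][x0] * (1 - ax) + fmap[y0][x1] * ax
    bot = fmap[y1][x0] * (1 - ax) + fmap[y1][x1] * ax
    return top * (1 - ay) + bot * ay


def text_align(fmap, quad, out_h, out_w):
    """Sample an out_h x out_w grid inside a quadrilateral, with no quantization."""
    tl, tr, br, bl = quad
    out = []
    for i in range(out_h):
        v = (i + 0.5) / out_h           # relative vertical position of the cell centre
        row = []
        for j in range(out_w):
            u = (j + 0.5) / out_w       # relative horizontal position
            # Blend the four corners to locate the sample inside the quadrilateral.
            x = ((1 - u) * (1 - v) * tl[0] + u * (1 - v) * tr[0]
                 + u * v * br[0] + (1 - u) * v * bl[0])
            y = ((1 - u) * (1 - v) * tl[1] + u * (1 - v) * tr[1]
                 + u * v * br[1] + (1 - u) * v * bl[1])
            row.append(bilinear(fmap, x, y))
        out.append(row)
    return out


# Toy feature map whose value encodes its own position: f(x, y) = x + 10 * y.
fmap = [[x + 10 * y for x in range(6)] for y in range(4)]
aligned = text_align(fmap, [(0, 0), (4, 0), (4, 2), (0, 2)], out_h=2, out_w=4)
inclined = text_align(fmap, [(1, 0), (3, 1), (2, 3), (0, 2)], out_h=1, out_w=1)
```

Because the sampling grid follows the four corners, an inclined word yields the same fixed-size output as a horizontal one, and the sampled locations stay inside the text region instead of covering its axis-aligned bounding rectangle.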