Detecting Text in Natural Image with Connectionist Text Proposal Network论文翻译——中英文对照

文章作者:Tyan
博客:noahsnail.com  |  CSDN  |  简书

声明:作者翻译论文仅为学习,如有侵权请联系作者删除博文,谢谢!

Detecting Text in Natural Image with Connectionist Text Proposal Network

Abstract

We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural image. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of image, making it powerful to detect extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods requiring multi-step post filtering. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8,35] by a large margin. The CTPN is computationally efficient with 0.14s/image, by using the very deep VGG16 model [27]. Online demo is available at: http://textdet.com/.

摘要

我们提出了一种新颖的连接文本提议网络(CTPN),它能够准确定位自然图像中的文本行。CTPN直接在卷积特征映射中的一系列细粒度文本提议中检测文本行。我们开发了一个垂直锚点机制,联合预测每个固定宽度提议的位置和文本/非文本分数,大大提高了定位精度。序列提议通过循环神经网络自然地连接起来,该网络无缝地结合到卷积网络中,从而形成端到端的可训练模型。这使得CTPN可以探索丰富的图像上下文信息,使其能够检测极其模糊的文本。CTPN在多尺度和多语言文本上可靠地工作,而不需要进一步的后处理,脱离了以前的自底向上需要多步后过滤的方法。它在ICDAR 2013和2015的基准数据集上达到了0.88和0.61的F-measure,大大超过了最近的结果[8,35]。通过使用非常深的VGG16模型[27],CTPN的计算效率为0.14s每张图像。在线演示获取地址:http://textdet.com/

Keywords

Scene text detection, convolutional network, recurrent neural network, anchor mechanism

关键词

场景文本检测;卷积网络;循环神经网络;锚点机制

1. Introduction

Reading text in natural image has recently attracted increasing attention in computer vision [8,14,15,10,35,11,9,1,28,32]. This is due to its numerous practical applications such as image OCR, multi-language translation, image retrieval, etc. It includes two sub tasks: text detection and recognition. This work focus on the detection task [14,1,28,32], which is more challenging than recognition task carried out on a well-cropped word image [15,9]. Large variance of text patterns and highly cluttered background pose main challenge of accurate text localization.

1. 引言

在自然图像中阅读文本最近在计算机视觉中引起越来越多的关注[8,14,15,10,35,11,9,1,28,32]。这是由于它的许多实际应用,如图像OCR,多语言翻译,图像检索等。它包括两个子任务:文本检测和识别。这项工作的重点是检测任务[14,1,28,32],这是比在一个良好的裁剪字图像[15,9]进行的识别任务更具有挑战性。文本模式的大变化和高度杂乱的背景构成了精确文本定位的主要挑战。

Current approaches for text detection mostly employ a bottom-up pipeline [28,1,14,32,33]. They commonly start from low-level character or stroke detection, which is typically followed by a number of subsequent steps: non-text component filtering, text line construction and text line verification. These multi-step bottom-up approaches are generally complicated with less robustness and reliability. Their performance heavily rely on the results of character detection, and connected-components methods or sliding-window methods have been proposed. These methods commonly explore low-level features (e.g., based on SWT [3,13], MSER [14,33,23], or HoG [28]) to distinguish text candidates from background. However, they are not robust by identifying individual strokes or characters separately, without context information. For example, it is more confident for people to identify a sequence of characters than an individual one, especially when a character is extremely ambiguous. These limitations often result in a large number of non-text components in character detection, causing main difficulties for handling them in following steps. Furthermore, these false detections are easily accumulated sequentially in bottom-up pipeline, as pointed out in [28]. To address these problems, we exploit strong deep features for detecting text information directly in convolutional maps. We develop text anchor mechanism that accurately predicts text locations in fine scale. Then, an in-network recurrent architecture is proposed to connect these fine-scale text proposals in sequences, allowing them to encode rich context information.

目前的文本检测方法大多采用自下而上的流程[28,1,14,32,33]。它们通常从低级别字符或笔画检测开始,后面通常会跟随一些后续步骤:非文本组件过滤,文本行构建和文本行验证。这些自底向上的多步骤方法通常复杂,鲁棒性和可靠性较差。它们的性能很大程度上依赖于字符检测的结果,并且已经提出了连接组件方法或滑动窗口方法。这些方法通常探索低级特征(例如,基于SWT[3,13],MSER[14,33,23]或HoG[28])来区分候选文本和背景。但是,如果没有上下文信息,他们不能鲁棒的单独识别各个笔划或字符。例如,相比单个字符人们更信任一个字符序列,特别是当一个字符非常模糊时。这些限制在字符检测中通常会导致大量非文本组件,在后续步骤中的主要困难是处理它们。此外,正如[28]所指出的,这些误检很容易在自下而上的过程中连续累积。为了解决这些问题,我们利用强大的深度特征直接在卷积映射中检测文本信息。我们开发的文本锚点机制能在细粒度上精确预测文本位置。然后,我们提出了一种网内循环架构,用于按顺序连接这些细粒度的文本提议,从而允许它们编码丰富的上下文信息。

Deep Convolutional Neural Networks (CNN) have recently advanced general object detection substantially [25,5,6]. The state-of-the-art method is Faster Region-CNN (R-CNN) system [25] where a Region Proposal Network (RPN) is proposed to generate high-quality class-agnostic object proposals directly from convolutional feature maps. Then the RPN proposals are fed into a Fast R-CNN [5] model for further classification and refinement, leading to the state-of-the-art performance on generic object detection. However, it is difficult to apply these general object detection systems directly to scene text detection, which generally requires a higher localization accuracy. In generic object detection, each object has a well-defined closed boundary [2], while such a well-defined boundary may not exist in text, since a text line or word is composed of a number of separate characters or strokes. For object detection, a typical correct detection is defined loosely, e.g., by an overlap of > 0.5 between the detected bounding box and its ground truth (e.g., the PASCAL standard [4]), since people can recognize an object easily from major part of it. By contrast, reading text comprehensively is a fine-grained recognition task which requires a correct detection that covers a full region of a text line or word. Therefore, text detection generally requires a more accurate localization, leading to a different evaluation standard, e.g., the Wolf’s standard [30] which is commonly employed by text benchmarks [19,21].

深度卷积神经网络(CNN)最近已经基本实现了一般物体检测[25,5,6]。最先进的方法是Faster Region-CNN(R-CNN)系统[25],其中提出了区域提议网络(RPN)直接从卷积特征映射中生成高质量类别不可知的目标提议。然后将RPN提议输入Faster R-CNN[5]模型进行进一步的分类和微调,从而实现通用目标检测的最新性能。然而,很难将这些通用目标检测系统直接应用于场景文本检测,这通常需要更高的定位精度。在通用目标检测中,每个目标都有一个明确的封闭边界[2],而在文本中可能不存在这样一个明确定义的边界,因为文本行或单词是由许多单独的字符或笔划组成的。对于目标检测,典型的正确检测是松散定义的,例如,检测到的边界框与其实际边界框(例如,PASCAL标准[4])之间的重叠>0.5,因为人们可以容易地从目标的主要部分识别它。相比之下,综合阅读文本是一个细粒度的识别任务,需要正确的检测,覆盖文本行或字的整个区域。因此,文本检测通常需要更准确的定义,导致不同的评估标准,例如文本基准中常用的Wolf标准[19,21]。

In this work, we fill this gap by extending the RPN architecture [25] to accurate text line localization. We present several technical developments that tailor generic object detection model elegantly towards our problem. We strive for a further step by proposing an in-network recurrent mechanism that allows our model to detect text sequence directly in the convolutional maps, avoiding further post-processing by an additional costly CNN detection model.

在这项工作中,我们通过将RPN架构[25]扩展到准确的文本行定义来填补这个空白。我们提出了几种技术发展,针对我们的问题可以优雅地调整通用目标检测模型。我们通过提出一种网络内循环机制争取更进一步,使我们的模型能够直接在卷积映射中检测文本序列,避免通过额外昂贵的CNN检测模型进行进一步的后处理。

1.1 Contributions

We propose a novel Connectionist Text Proposal Network (CTPN) that directly localizes text sequences in convolutional layers. This overcomes a number of main limitations raised by previous bottom-up approaches building on character detection. We leverage the advantages of strong deep convolutional features and sharing computation mechanism, and propose the CTPN architecture which is described in Fig. 1. It makes the following major contributions:

Figure 1

Fig. 1: (a) Architecture of the Connectionist Text Proposal Network (CTPN). We densely slide a 3×3 spatial window through the last convolutional maps (conv5 ) of the VGG16 model [27]. The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input of the 256D BLSTM (including two 128D LSTMs). The RNN layer is connected to a 512D fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-axis coordinates and side-refinement offsets of $k$ anchors. (b) The CTPN outputs sequential fixed-width fine-scale text proposals. Color of each box indicates the text/non-text score. Only the boxes with positive scores are presented.

1.1 贡献

我们提出了一种新颖的连接文本提议网络(CTPN),它可以直接定位卷积层中的文本序列。这克服了以前的建立在字符检测基础上的自下而上方法带来的一些主要限制。我们利用强深度卷积特性和共享计算机制的优点,提出了如图1所示的CTPN架构。主要贡献如下:

Figure 1

图1:(a)连接文本提议网络(CTPN)的架构。我们通过VGG16模型[27]的最后一个卷积映射(conv5)密集地滑动3×3空间窗口。每行的序列窗口通过双向LSTM(BLSTM)[7]循环连接,其中每个窗口的卷积特征(3×3×C)被用作256维的BLSTM(包括两个128维的LSTM)的输入。RNN层连接到512维的全连接层,接着是输出层,联合预测$k$个锚点的文本/非文本分数,y轴坐标和边缘调整偏移。(b)CTPN输出连续的固定宽度细粒度文本提议。每个框的颜色表示文本/非文本分数。只显示文本框正例的分数。

First, we cast the problem of text detection into localizing a sequence of fine-scale text proposals. We develop an anchor regression mechanism that jointly predicts vertical location and text/non-text score of each text proposal, resulting in an excellent localization accuracy. This departs from the RPN prediction of a whole object, which is difficult to provide a satisfied localization accuracy.

首先,我们将文本检测的问题转化为一系列细粒度的文本提议。我们开发了一个锚点回归机制,可以联合预测每个文本提议的垂直位置和文本/非文本分数,从而获得出色的定位精度。这背离了整个目标的RPN预测,RPN预测难以提供令人满意的定位精度。

Second, we propose an in-network recurrence mechanism that elegantly connects sequential text proposals in the convolutional feature maps. This connection allows our detector to explore meaningful context information of text line, making it powerful to detect extremely challenging text reliably.

其次,我们提出了一种在卷积特征映射中优雅连接序列文本提议的网络内循环机制。通过这种连接,我们的检测器可以探索文本行有意义的上下文信息,使其能够可靠地检测极具挑战性的文本。

Third, both methods are integrated seamlessly to meet the nature of text sequence, resulting in a unified end-to-end trainable model. Our method is able to handle multi-scale and multi-lingual text in a single process, avoiding further post filtering or refinement.

第三,两种方法无缝集成,以符合文本序列的性质,从而形成统一的端到端可训练模型。我们的方法能够在单个过程中处理多尺度和多语言的文本,避免进一步的后过滤或细化。

Fourth, our method achieves new state-of-the-art results on a number of benchmarks, significantly improving recent results (e.g., 0.88 F-measure over 0.83 in [8] on the ICDAR 2013, and 0.61 F-measure over 0.54 in [35] on the ICDAR 2015). Furthermore, it is computationally efficient, resulting in a 0.14s/image running time (on the ICDAR 2013) by using the very deep VGG16 model [27].

第四,我们的方法在许多基准数据集上达到了新的最先进成果,显著改善了最近的结果(例如,0.88的F-measure超过了2013年ICDAR的[8]中的0.83,而0.64的F-measure超过了ICDAR2015上[35]中的0.54 )。此外,通过使用非常深的VGG16模型[27],这在计算上是高效的,导致了每张图像0.14s的运行时间(在ICDAR 2013上)。

Text detection. Past works in scene text detection have been dominated by bottom-up approaches which are generally built on stroke or character detection. They can be roughly grouped into two categories, connected-components (CCs) based approaches and sliding-window based methods. The CCs based approaches discriminate text and non-text pixels by using a fast filter, and then text pixels are greedily grouped into stroke or character candidates, by using low-level properties, e.g., intensity, color, gradient, etc. [33,14,32,13,3]. The sliding-window based methods detect character candidates by densely moving a multi-scale window through an image. The character or non-character window is discriminated by a pre-trained classifier, by using manually-designed features [28,29], or recent CNN features [16]. However, both groups of methods commonly suffer from poor performance of character detection, causing accumulated errors in following component filtering and text line construction steps. Furthermore, robustly filtering out non-character components or confidently verifying detected text lines are even difficult themselves [1,33,14]. Another limitation is that the sliding-window methods are computationally expensive, by running a classifier on a huge number of the sliding windows.

2. 相关工作

文本检测。过去在场景文本检测中的工作一直以自下而上的方法为主,一般建立在笔画或字符检测上。它们可以粗略地分为两类,基于连接组件(CC)的方法和基于滑动窗口的方法。基于CC的方法通过使用快速滤波器来区分文本和非文本像素,然后通过使用低级属性(例如强度,颜色,梯度等[33,14,32,13,3])将文本像素贪婪地分为笔划或候选字符。基于滑动窗口的方法通过在图像中密集地滑动多尺度窗口来检测候选字符。字符或非字符窗口通过预先训练的分类器,使用手动设计的特征[28,29]或最近的CNN特征[16]进行区分。然而,这两种方法通常都会受到较差的字符检测性能的影响,导致在接下来的组件过滤和文本行构建步骤中出现累积的错误。此外,强大地过滤非字符组件或者自信地验证检测到的文本行本身就更加困难[1,33,14]。另一个限制是通过在大量的滑动窗口上运行分类器,滑动窗口方法在计算上是昂贵的。

Object detection. Convolutional Neural Networks (CNN) have recently advanced general object detection substantially [25,5,6]. A common strategy is to generate a number of object proposals by employing inexpensive low-level features, and then a strong CNN classifier is applied to further classify and refine the generated proposals. Selective Search (SS) [4] which generates class-agnostic object proposals, is one of the most popular methods applied in recent leading object detection systems, such as Region CNN (R-CNN) [6] and its extensions [5]. Recently, Ren et al. [25] proposed a Faster R-CNN system for object detection. They proposed a Region Proposal Network (RPN) that generates high-quality class-agnostic object proposals directly from the convolutional feature maps. The RPN is fast by sharing convolutional computation. However, the RPN proposals are not discriminative, and require a further refinement and classification by an additional costly CNN model, e.g., the Fast R-CNN model [5]. More importantly, text is different significantly from general objects, making it difficult to directly apply general object detection system to this highly domain-specific task.

目标检测。卷积神经网络(CNN)近来在通用目标检测[25,5,6]上已经取得了实质的进步。一个常见的策略是通过使用廉价的低级特征来生成许多目标提议,然后使用强CNN分类器来进一步对生成的提议进行分类和细化。生成类别不可知目标提议的选择性搜索(SS)[4]是目前领先的目标检测系统中应用最广泛的方法之一,如CNN(R-CNN)[6]及其扩展[5]。最近,Ren等人[25]提出了Faster R-CNN目标检测系统。他们提出了一个区域提议网络(RPN),可以直接从卷积特征映射中生成高质量的类别不可知的目标提议。通过共享卷积计算RPN是快速的。然而,RPN提议不具有判别性,需要通过额外的成本高昂的CNN模型(如Fast R-CNN模型[5])进一步细化和分类。更重要的是,文本与一般目标有很大的不同,因此很难直接将通用目标检测系统应用到这个高度领域化的任务中。

3. Connectionist Text Proposal Network

This section presents details of the Connectionist Text Proposal Network (CTPN). It includes three key contributions that make it reliable and accurate for text localization: detecting text in fine-scale proposals, recurrent connectionist text proposals, and side-refinement.

3. 连接文本提议网络

本节介绍连接文本提议网络(CTPN)的细节。它包括三个关键的贡献,使文本定位可靠和准确:检测细粒度提议中的文本,循环连接文本提议和边缘细化。

3.1 Detecting Text in Fine-scale Proposals

Similar to Region Proposal Network (RPN) [25], the CTPN is essentially a fully convolutional network that allows an input image of arbitrary size. It detects a text line by densely sliding a small window in the convolutional feature maps, and outputs a sequence of fine-scale (e.g., fixed 16-pixel width) text proposals, as shown in Fig. 1 (b).

We take the very deep 16-layer vggNet (VGG16) [27] as an example to describe our approach, which is readily applicable to other deep models. Architecture of the CTPN is presented in Fig. 1 (a). We use a small spatial window, 3×3, to slide the feature maps of last convolutional layer (e.g., the conv5 of the VGG16). The size of conv5 feature maps is determined by the size of input image, while the total stride and receptive field are fixed as 16 and 228 pixels, respectively. Both the total stride and receptive field are fixed by the network architecture. Using a sliding window in the convolutional layer allows it to share convolutional computation, which is the key to reduce computation of the costly sliding-window based methods.

Generally, sliding-window methods adopt multi-scale windows to detect objects of different sizes, where one window scale is fixed to objects of similar size. In [25], Ren et al. proposed an efficient anchor regression mechanism that allows the RPN to detect multi-scale objects with a single-scale window. The key insight is that a single window is able to predict objects in a wide range of scales and aspect ratios, by using a number of flexible anchors. We wish to extend this efficient anchor mechanism to our text task. However, text differs from generic objects substantially, which generally have a well-defined enclosed boundary and center, allowing inferring whole object from even a part of it [2]. Text is a sequence which does not have an obvious closed boundary. It may include multi-level components, such as stroke, character, word, text line and text region, which are not distinguished clearly between each other. Text detection is defined in word or text line level, so that it may be easy to make an incorrect detection by defining it as a single object, e.g., detecting part of a word. Therefore, directly predicting the location of a text line or word may be difficult or unreliable, making it hard to get a satisfied accuracy. An example is shown in Fig. 2, where the RPN is directly trained for localizing text lines in an image.

We look for a unique property of text that is able to generalize well to text components in all levels. We observed that word detection by the RPN is difficult to accurately predict the horizontal sides of words, since each character within a word is isolated or separated, making it confused to find the start and end locations of a word. Obviously, a text line is a sequence which is the main difference between text and generic objects. It is natural to consider a text line as a sequence of fine-scale text proposals, where each proposal generally represents a small part of a text line, e.g., a text piece with 16-pixel width. Each proposal may include a single or multiple strokes, a part of a character, a single or multiple characters, etc. We believe that it would be more accurate to just predict the vertical location of each proposal, by fixing its horizontal location which may be more difficult to predict. This reduces the search space, compared to the RPN which predicts 4 coordinates of an object. We develop a vertical anchor mechanism that simultaneously predicts a text/non-text score and y-axis location of each fine-scale proposal. It is also more reliable to detect a general fixed-width text proposal than identifying an isolate character, which is easily confused with part of a character or multiple characters. Furthermore, detecting a text line in a sequence of fixed-width text proposals also works reliably on text of multiple scales and multiple aspect ratios.

To this end, we design the fine-scale text proposal as follow. Our detector investigates each spatial location in the conv5 densely. A text proposal is defined to have a fixed width of 16 pixels (in the input image). This is equal to move the detector densely through the conv5 maps, where the total stride is exactly 16 pixels. Then we design k vertical anchors to predict y-coordinates for each proposal. The k anchors have a same horizontal location with a fixed width of 16 pixels, but their vertical locations are varied in k different heights. In our experiments, we use ten anchors for each proposal, k = 10, whose heights are varied from 11 to 273 pixels (by ÷0.7 each time) in the input image. The explicit vertical coordinates are measured by the height and y-axis center of a proposal bounding box. We compute relative predicted vertical coordinates (v) with respect to the bounding box location of an anchor as,

where v = {vc,vh} and v∗ = {vc∗,vh∗} are the relative predicted coordinates and ground truth coordinates, respectively. cay and ha are the center (y-axis) and height of the anchor box, which can be pre-computed from an input image. cy and h are the predicted y-axis coordinates in the input image, while c∗y and h∗ are the ground truth coordinates. Therefore, each predicted text proposal has a bounding box with size of h × 16 (in the input image), as shown in Fig. 1 (b) and Fig. 2 (right). Generally, an text proposal is largely smaller than its effective receptive field which is 228×228.

The detection processing is summarised as follow. Given an input image, we have W × H × C conv5 features maps (by using the VGG16 model), where C is the number of feature maps or channels, and W × H is the spatial arrangement. When our detector is sliding a 3×3 window densely through the conv5, each sliding-window takes a convolutional feature of 3 × 3 × C for producing the prediction. For each prediction, the horizontal location (x-coordinates) and k- anchor locations are fixed, which can be pre-computed by mapping the spatial window location in the conv5 onto the input image. Our detector outputs the text/non-text scores and the predicted y-coordinates (v) for k anchors at each window location. The detected text proposals are generated from the anchors having a text/non-text score of > 0.7 (with non-maximum suppression). By the designed vertical anchor and fine-scale detection strategy, our detector is able to handle text lines in a wide range of scales and aspect ratios by using a single-scale image. This further reduces its computation, and at the same time, predicting accurate localizations of the text lines. Compared to the RPN or Faster R-CNN system [25], our fine-scale detection provides more detailed supervised information that naturally leads to a more accurate detection.

3.2 Recurrent Connectionist Text Proposals

To improve localization accuracy, we split a text line into a sequence of fine-scale text proposals, and predict each of them separately. Obviously, it is not robust to regard each isolated proposal independently. This may lead to a number of false detections on non-text objects which have a similar structure as text pat- terns, such as windows, bricks, leaves, etc. (referred as text-like outliers in [13]). It is also possible to discard some ambiguous patterns which contain weak text information. Several examples are presented in Fig. 3 (top). Text have strong sequential characteristics where the sequential context information is crucial to make a reliable decision. This has been verified by recent work [9] where a recurrent neural network (RNN) is applied to encode this context information for text recognition. Their results have shown that the sequential context information is greatly facilitate the recognition task on cropped word images.

Motivated from this work, we believe that this context information may also be of importance for our detection task. Our detector should be able to explore this important context information to make a more reliable decision, when it works on each individual proposal. Furthermore, we aim to encode this information directly in the convolutional layer, resulting in an elegant and seamless in-network connection of the fine-scale text proposals. RNN provides a natural choice for encoding this information recurrently using its hidden layers. To this end, we propose to design a RNN layer upon the conv5, which takes the convolutional feature of each window as sequential inputs, and updates its internal state recurrently in the hidden layer, Ht,

References

  1. Busta, M., Neumann, L., Matas, J.: Fastext: Efficient unconstrained scene text detector (2015), in IEEE International Conference on Computer Vision (ICCV)

  2. Cheng, M., Zhang, Z., Lin, W., Torr, P.: Bing: Binarized normed gradients for objectness estimation at 300fps (2014), in IEEE Computer Vision and Pattern Recognition (CVPR)

  3. Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke width transform (2010), in IEEE Computer Vision and Pattern Recognition (CVPR)

  4. Everingham, M., Gool, L.V., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International Journal of Computer Vision (IJCV) 88(2), 303–338 (2010)

  5. Girshick, R.: Fast r-cnn (2015), in IEEE International Conference on Computer Vision (ICCV)

  6. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation (2014), in IEEE Computer Vision and Pattern Recognition (CVPR)

  7. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks 18(5), 602–610 (2005)

  8. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images (2016), in IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  9. He,P.,Huang,W.,Qiao,Y.,Loy,C.C.,Tang,X.:Readingscenetextindeepconvo- lutional sequences (2016), in The 30th AAAI Conference on Artificial Intelligence (AAAI-16)

  10. He, T., Huang, W., Qiao, Y., Yao, J.: Accurate text localization in natural image with cascaded convolutional text network (2016), arXiv:1603.09423

  11. He, T., Huang, W., Qiao, Y., Yao, J.: Text-attentional convolutional neural net- works for scene text detection. IEEE Trans. Image Processing (TIP) 25, 2529–2541 (2016)

  12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Networks 9(8), 1735–1780 (1997)

  13. Huang, W., Lin, Z., Yang, J., Wang, J.: Text localization in natural images using stroke feature transform and text covariance descriptors (2013), in IEEE International Conference on Computer Vision (ICCV)

  14. Huang, W., Qiao, Y., Tang, X.: Robust scene text detection with convolutional neural networks induced mser trees (2014), in European Conference on Computer Vision (ECCV)

  15. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. International Journal of Computer Vision (IJCV) (2016)

  16. Jaderberg, M., Vedaldi, A., Zisserman, A.: Deep features for text spotting (2014), in European Conference on Computer Vision (ECCV)

  17. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding (2014), in ACM International Conference on Multimedia (ACM MM)

  18. Karatzas,D.,Gomez-Bigorda,L.,Nicolaou,A.,Ghosh,S.,Bagdanov,A.,Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., Shafait, F., Uchida, S.,Valveny, E.: Icdar 2015 competition on robust reading (2015), in International Conference on Document Analysis and Recognition (ICDAR)

  19. Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L.G., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., de las Heras., L.P.: Icdar 2013 robust reading competition (2013), in International Conference on Document Analysis and Recognition (ICDAR)

  20. Mao, J., Li, H., Zhou, W., Yan, S., Tian, Q.: Scale based region growing for scene text detection (2013), in ACM International Conference on Multimedia (ACM MM)

  21. Minetto, R., Thome, N., Cord, M., Fabrizio, J., Marcotegui, B.: Snoopertext: A multiresolution system for text detection in complex visual scenes (2010), in IEEE International Conference on Pattern Recognition (ICIP)

  22. Neumann, L., Matas, J.: Efficient scene text localization and recognition with local character refinement (2015), in International Conference on Document Analysis and Recognition (ICDAR)

  23. Neumann, L., Matas, J.: Real-time lexicon-free scene text localization and recognition. In IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI) (2015)

  24. Pan, Y., Hou, X., Liu, C.: Hybrid approach to detect and localize texts in natural scene images. IEEE Trans. Image Processing (TIP) 20, 800–813 (2011)

  25. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks (2015), in Neural Information Processing Systems (NIPS)

  26. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Li, F.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015)

  27. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2015), in International Conference on Learning Representation (ICLR)

  28. Tian, S., Pan, Y., Huang, C., Lu, S., Yu, K., Tan, C.L.: Text flow: A unified text detection system in natural scene images (2015), in IEEE International Conference on Computer Vision (ICCV)

  29. Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition (2011), in IEEE International Conference on Computer Vision (ICCV)

  30. Wolf, C., Jolion, J.: Object count / area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis 8, 280–296 (2006)

  31. Yao, C., Bai, X., Liu, W.: A unified framework for multioriented text detection and recognition. IEEE Trans. Image Processing (TIP) 23(11), 4737–4749 (2014)

  32. Yin, X.C., Pei, W.Y., Zhang, J., Hao, H.W.: Multi-orientation scene text detection with adaptive clustering. IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI) 37, 1930–1937 (2015)

  33. Yin, X.C., Yin, X., Huang, K., Hao, H.W.: Robust text detection in natural scene images. IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI) 36, 970–983 (2014)

  34. Zhang, Z., Shen, W., Yao, C., Bai, X.: Symmetry-based text line detection in natural scenes (2015), in IEEE Computer Vision and Pattern Recognition (CVPR)

  35. Zhang, Z., Zhang, C., Shen, W., Yao, C., Liu, W., Bai, X.: Multi-oriented text de- tection with fully convolutional networks (2016), in IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

如果有收获,可以请我喝杯咖啡!