Historical document digitization through layout analysis and deep content classification

Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation as well as a strategy to classify document regions, which is applied to the digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach and a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network and fed to a Random Forest classifier. Experiments are conducted on the first volume of the “Enciclopedia Treccani”, a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.


I. INTRODUCTION
Document digitization is a very important aspect of document preservation. For historical documents in particular, digitization improves not only the durability of the documents but also their accessibility. Historical documents have often been transcribed, but a plain-text version lacks all the metadata available in the original, such as images, layout arrangements and links to the original page. An extensive digitization process involves, along with the creation of a digital copy, the extraction of the information contained in a document. Simple optical character recognition (OCR) is only one of the steps needed to create a fully digital version of a document; the others include the recognition of all the different elements that might appear in a document, such as images, tables and formulas. All of these must be segmented and classified to obtain a truly digital version of the document.
Despite its long history, layout analysis is still a difficult task due to the great number of possible layout configurations and content arrangements. In this paper we present a document analysis pipeline which includes a layout segmentation algorithm and a content classification method to classify the actual content of the segmented regions. Moreover, when a manual transcription is available, we propose a technique to map OCR-ed text to the transcription, in order to enrich the digitized version with additional metadata.
We test our pipeline on the Italian historical encyclopedia, the "Enciclopedia Treccani", published between 1929 and 1936. The encyclopedia consists of 35 volumes, for a total of around 60000 articles. In our experiments we focus on the first volume, which contains all the different layout arrangements found in the encyclopedia and which we have manually annotated. A possible outcome of this work is the digitization of the entire encyclopedia.
The rest of this paper is structured as follows: Section II gives a brief discussion of the state of the art in layout analysis and content classification, Section III explains the main components of our pipeline, and Section IV reports the performance evaluation and a comparison against the state of the art.

II. RELATED WORK
Layout analysis has been an active area of research since the seventies. There are two main approaches to this task, namely bottom-up and top-down. Top-down methods, such as XY cuts [1], [2] or methods that exploit white streams [3] or projection profiles, are usually fast but tend to fail when dealing with complex layouts. Bottom-up methods are instead more flexible: they process the page image starting from the pixel level and subsequently aggregate pixels into higher-level regions, at the cost of a higher computational complexity.
These approaches are usually based on mathematical morphology, Connected Components (CCs), Voronoi diagrams [4] or run-length smearing [5]. Many other methods exist which do not fit exactly into either of these categories: the so-called mixed or hybrid approaches try to combine the high speed of the top-down approaches with the robustness of the bottom-up ones. Chen et al. [6] propose a method based on whitespace rectangle extraction and grouping: initially the foreground CCs are extracted and linked into chains according to their horizontal adjacency relationship; whitespace rectangles are then extracted from the gaps between horizontally adjacent CCs; CCs and whitespaces are progressively grouped and filtered to form text lines and afterward text blocks. Lazzara et al. [7] provide a chain of steps to first recognize text regions and then non-text elements. Foreground CCs are extracted, then delimiters (such as lines, whitespaces and tab-stops) are detected with object alignment and morphological algorithms. Since text components are usually well aligned, have a uniform size and are close to each other, the authors propose to regroup CCs by looking for their neighbors. Filters can also be applied to a group of CCs to validate the link between two CCs. Another algorithm based on whitespace analysis has been proposed by Baird et al. [8]: the algorithm uses the white space in the page as a layout delimiter and tries to find the largest empty background rectangles to extract connected regions.
On a different note, document content classification is also a problem that has been addressed by many researchers. With regard to the strategies applied, existing algorithms can be subdivided into two main categories: rule-based and statistical. Some papers present algorithms to classify whole pages into different categories, such as "title page", "table of contents", etc. [9], while a different approach is to classify homogeneous regions in a document into different classes, such as "text", "image", "table", etc. It is interesting to note that many papers address this problem by trying to distinguish only one class: for example, Zanibbi et al. [10] focus on mathematical expressions, Hu et al. [11] on tables, Chen et al. [12] and Pham [13] on logo detection and Futrelle et al. [14] on diagrams. These approaches solve only part of the classification problem.
Regarding multi-class algorithms, many of them exploit rules built specifically for certain document classes: for example, Krishnamoorthy et al. [15] and Lee et al. [16] developed algorithms to identify entities in scientific journals and papers, Mitchell et al. [17] identify entities in newspaper pages, and Jain and Yu [18] identify various types of entities in different types of documents. These approaches often rely on hand-built heuristics, strictly tied to specific document types. Other approaches use statistical features and image features to classify document regions into different categories. The algorithm proposed by Sivaramakrishnan et al. [19] distinguishes between nine different classes, extracting features for each region using run-length mean and variance, number of black pixels and aspect ratio of the region. Fan and Wang [20] use density features and connectivity histograms to classify regions into text, images and graphics. Li et al. [21] proposed an algorithm that models images using two-dimensional HMMs, and Wang et al. [22] proposed an algorithm based on the representation of a region by means of a 25-dimensional vector and on a classification tree.

A. Layout Segmentation
Layout segmentation aims at segmenting an input page into coherent regions, such as text, tables and images, and is therefore a necessary step for classifying the contents of a document. In the following, we propose a top-down layout analysis approach, which builds upon the classic XY-Cut algorithm [1] and is able to deal with more complex layouts.
The XY-Cut algorithm is applied as the first step of our layout segmentation pipeline. This classic top-down algorithm recursively splits a region of interest into two, based on the spatial arrangement of white pixels. The algorithm exploits the vertical and horizontal projections of the pixels in order to find low-density regions in the histograms of the projections. These low-density points correspond to white spaces separating two different layout elements, and therefore the region is split accordingly.
The XY-Cut algorithm is suitable for finding rectangular regions surrounded by white space, but in real layouts it is not uncommon to find images within text areas (see the image in Fig. 1 for an example). The text area involved, instead of being a plain rectangle, is a rectilinear polygon, which cannot be recognized by the XY-Cut algorithm. To address this problem, an additional analysis step is performed in order to detect images and illustrations in the page.
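The recursive splitting described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the implementation used in the experiments: the page is assumed to be a binarized list of rows (1 = ink, 0 = white), a "low-density" bin is simplified to a completely empty bin, and the names `xy_cut` and `min_gap` are hypothetical.

```python
def _profile(img, box, axis):
    """Ink count per row (axis=0) or per column (axis=1) inside box."""
    top, left, bottom, right = box
    if axis == 0:
        return [sum(img[y][left:right]) for y in range(top, bottom)]
    return [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]


def _widest_gap(profile, min_gap):
    """Return (start, end) of the widest interior run of empty bins, or None."""
    best, start = None, None
    for i, value in enumerate(profile + [1]):  # sentinel closes a trailing run
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            # Ignore runs touching a border: they are margins, not separators.
            interior = start > 0 and i < len(profile)
            if interior and i - start >= min_gap:
                if best is None or i - start > best[1] - best[0]:
                    best = (start, i)
            start = None
    return best


def xy_cut(img, box=None, min_gap=2):
    """Recursively split box = (top, left, bottom, right) at its widest white gap.

    Assumption: a gap is a run of at least min_gap completely empty rows or
    columns; the source only speaks of low-density regions of the projections.
    """
    if box is None:
        box = (0, 0, len(img), len(img[0]))
    top, left, bottom, right = box
    gaps = []
    for axis in (0, 1):
        gap = _widest_gap(_profile(img, box, axis), min_gap)
        if gap is not None:
            gaps.append((axis, gap))
    if not gaps:
        return [box]  # no separator found: this is a leaf region
    axis, (start, end) = max(gaps, key=lambda g: g[1][1] - g[1][0])
    if axis == 0:  # horizontal cut between two stacked regions
        first = (top, left, top + start, right)
        second = (top + end, left, bottom, right)
    else:  # vertical cut between two side-by-side regions
        first = (top, left, bottom, left + start)
        second = (top, left + end, bottom, right)
    return xy_cut(img, first, min_gap) + xy_cut(img, second, min_gap)
```

Cutting at the widest gap first is one possible ordering; other variants alternate horizontal and vertical cuts.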
The image detection problem is approached by exploiting local autocorrelation features, and the method we propose is inspired by the algorithm proposed in [23]. The assumption is that text areas are clearly distinguishable thanks to the pattern of text lines, which is absent in images; the autocorrelation matrix of a region is therefore used as an effective descriptor for this task. The image is subdivided into square blocks of size n × n, and for each block the autocorrelation matrix C(k, l), with k, l ∈ [−n/2, n/2], is computed. Then, the autocorrelation matrix is encoded into a directional histogram w(·), in which each bin contains the sum of the pixels along that direction. Formally,

$$w(\theta) = \sum_{r \in (0,\, n/2]} C(r\cos\theta,\; r\sin\theta).$$

We compute the directional histogram in the range θ ∈ [0°, 180°], quantizing θ with a step of 1° and r with a step of 1 pixel. The histogram is then concatenated with the vertical and horizontal projections of the autocorrelation matrix, to enhance the repeating pattern of the text lines. The resulting descriptor is fed to a two-class SVM classifier with RBF kernel, trained to distinguish between blocks of text and blocks of illustrations.
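As an illustration of the block descriptor, the sketch below computes the autocorrelation of a small block directly, sums it along each direction, and appends the two projections. Several details are assumptions made for the example and are not stated in the text: shifts are zero-padded, the first argument of C is taken as the horizontal shift, sample positions are rounded to the nearest pixel, and θ runs over 0°..179° (by symmetry of the autocorrelation, 180° would repeat 0°). The SVM step is omitted.

```python
import math


def autocorrelation(block):
    """Return (C, h) where C[(k, l)] is the zero-padded autocorrelation of an
    n x n block for horizontal shift k and vertical shift l in [-h, h], h = n // 2."""
    n = len(block)
    h = n // 2
    C = {}
    for k in range(-h, h + 1):
        for l in range(-h, h + 1):
            total = 0.0
            for y in range(max(0, -l), min(n, n - l)):
                for x in range(max(0, -k), min(n, n - k)):
                    total += block[y][x] * block[y + l][x + k]
            C[(k, l)] = total
    return C, h


def directional_histogram(C, h, n_angles=180):
    """w(theta): sum of C along direction theta, for r = 1..h in 1-pixel steps."""
    hist = []
    for deg in range(n_angles):
        t = math.radians(deg)
        hist.append(sum(C[(round(r * math.cos(t)), round(r * math.sin(t)))]
                        for r in range(1, h + 1)))
    return hist


def block_descriptor(block):
    """Directional histogram followed by the two projections of C."""
    C, h = autocorrelation(block)
    shifts = range(-h, h + 1)
    vertical = [sum(C[(k, l)] for l in shifts) for k in shifts]    # one value per column of C
    horizontal = [sum(C[(k, l)] for k in shifts) for l in shifts]  # one value per row of C
    return directional_histogram(C, h) + vertical + horizontal
```

On a block of horizontal stripes, which mimics text lines, the bin at 0° is larger than the bin at 90°, since shifting along the lines keeps them aligned while shifting across them does not. For realistic block sizes an FFT-based autocorrelation would be much faster than this direct sum.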
Once the original image is split by the first XY-Cut application, each region is subdivided into n × n blocks which are then classified into text and images. Illustration boundaries are then detected by finding the connected components created by the illustration blocks.
The last step of this process involves the removal of the detected illustrations in order to use the XY-Cut algorithm again on the same regions. The final result of the segmentation process is the union of the regions found by the XY-Cut applications and the illustration regions. Fig. 1 shows an example of a page through the various phases of the segmentation process.
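Grouping illustration blocks into connected components can be sketched as a breadth-first search on the grid of block labels. The function name `illustration_boxes`, the use of 4-connectivity and the choice of returning bounding boxes rather than exact outlines are assumptions of this example.

```python
from collections import deque


def illustration_boxes(is_image, n):
    """Group adjacent illustration blocks and return one pixel bounding box
    (top, left, bottom, right) per connected component.

    is_image is a 2D list with one truthy/falsy entry per n x n block.
    """
    rows, cols = len(is_image), len(is_image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not is_image[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first search over 4-connected illustration blocks.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            rmin = rmax = r0
            cmin = cmax = c0
            while queue:
                r, c = queue.popleft()
                rmin, rmax = min(rmin, r), max(rmax, r)
                cmin, cmax = min(cmin, c), max(cmax, c)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    if inside and is_image[rr][cc] and not seen[rr][cc]:
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            # Convert block indices to pixel coordinates.
            boxes.append((rmin * n, cmin * n, (rmax + 1) * n, (cmax + 1) * n))
    return boxes
```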

B. Content Classification
Once the input document has been divided into a set of coherent regions, these can be classified according to the class they belong to. Notice that in a structured document blocks of different classes are separated by whitespace, and it is reasonable to assume that each region belongs to a unique class. In the case of the "Enciclopedia Treccani", there are seven different classes: text, tables with border, borderless tables, images, graphics, scores and mathematical formulas.
Content classification is carried out by exploiting a combination of deeply learned features and classical local feature encoding techniques. Given an input region, a Convolutional Neural Network (CNN) is used to produce local features from square n × n blocks, which are then encoded according to their mean and covariance. The final descriptor, which is global with respect to a region, is then classified using a Random Forest classifier.
A CNN is a neural model composed of a sequence of Convolutional, Spatial Pooling and Fully Connected layers. Each Convolutional layer takes as input a c₁ × w × h tensor, and applies a set of convolutional filters to produce an output c₂ × w × h tensor, where each of the c₂ output channels is given by the application of c₁ learned convolutional filters to the c₁ input channels. The input tensor of the first Convolutional layer is the input image itself, which in our case is a grey-level image, thus having c₁ = 1. Spatial Pooling layers, instead, downsample their input tensor on the spatial dimensions, while preserving the same number of channels. This is usually done through a max-pooling operation, which computes the maximum over a square k × k kernel that slides over the input with stride k, thus reducing a tensor of size c × w × h to size c × ⌊w/k⌋ × ⌊h/k⌋. Eventually, the output tensor of the last Spatial Pooling layer is flattened and given to a sequence of Fully Connected (FC) layers. Given an input vector of size l₁, an FC layer with l₂ neurons learns an l₁ × l₂ matrix of weights and l₂ biases, and each output neuron is given by the dot product of the input vector and a column of the weight matrix, plus the bias. Convolutional and FC layers are usually followed by an activation function, which introduces non-linearities into the network.
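The pooling and fully connected operations described above can be written out directly. The following is a plain-Python sketch for a single channel, meant only to make the shapes concrete; the function names are hypothetical, and a real network would use a tensor library.

```python
def max_pool(channel, k):
    """Max-pooling with a k x k kernel and stride k on one h x w channel.

    The output has floor(h / k) rows and floor(w / k) columns.
    """
    h, w = len(channel), len(channel[0])
    return [[max(channel[y * k + dy][x * k + dx]
                 for dy in range(k) for dx in range(k))
             for x in range(w // k)]
            for y in range(h // k)]


def fully_connected(x, weights, biases):
    """FC layer: weights has len(x) rows and len(biases) columns; output j is
    the dot product of x with column j of the weight matrix, plus bias j."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j]
            for j in range(len(biases))]


def relu(v):
    """Element-wise rectified linear activation."""
    return [max(0.0, a) for a in v]
```

For example, pooling a 4 × 4 channel with k = 2 yields a 2 × 2 channel, and a 5 × 5 channel is also reduced to 2 × 2 because the last row and column do not fill a complete kernel.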
To perform content classification, we design a CNN which takes as input a square n × n patch extracted from a region, in a sliding window manner. The network contains three Convolutional-MaxPooling stages, plus two FC layers. The last FC layer has 7 neurons, and outputs the probability of each class, while the overall network is trained with Stochastic Gradient Descent by minimizing a categorical cross-entropy function over the activations of the last FC layer. Details of the architecture are given in Table I.
Once the network has been trained, it is able to predict the class of an n × n square block. In order to classify an entire region, we take all the blocks of a region, describe them with the activations of the last but one FC layer (fc1 in Table I), and encode the resulting set of feature vectors according to their mean and covariance matrix, plus geometrical statistics of the region. In particular, for a region R, having coordinates (R_x, R_y, R_w, R_h), and containing a set B of square blocks, the following feature vector is computed:

$$d(R) = \left[ \frac{R_x}{W},\; \frac{R_y}{H},\; \frac{R_w}{W},\; \frac{R_h}{H},\; \mu(B),\; \sigma(B) \right]$$

where W and H are the page width and height, and µ(B) and σ(B) are respectively the mean vector and the flattened upper triangular part of the covariance matrix computed over the descriptors of all blocks in B. The resulting descriptor is finally given to a Random Forest classifier, which is used to classify input regions.
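A minimal sketch of the region encoding follows. It assumes that the geometrical statistics are the region coordinates normalized by the page size, that the covariance is the population covariance (divided by the number of blocks), and that its upper triangle is flattened row by row; none of these details is confirmed by the text, and `region_descriptor` is a hypothetical name. The block features stand in for the fc1 activations.

```python
def region_descriptor(region, page_size, block_features):
    """Encode a region as normalized geometry + mean + upper-triangular covariance.

    region: (x, y, w, h) in pixels; page_size: (W, H) in pixels;
    block_features: one feature vector per block of the region.
    """
    rx, ry, rw, rh = region
    page_w, page_h = page_size
    m = len(block_features)
    dim = len(block_features[0])
    mean = [sum(f[i] for f in block_features) / m for i in range(dim)]
    cov_upper = []
    for i in range(dim):
        for j in range(i, dim):  # upper triangle, diagonal included
            cov_upper.append(
                sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in block_features) / m)
    geometry = [rx / page_w, ry / page_h, rw / page_w, rh / page_h]
    return geometry + mean + cov_upper
```

With d-dimensional block features the descriptor has 4 + d + d(d + 1)/2 entries, so its length does not depend on how many blocks the region contains, which is what allows regions of different sizes to be fed to the same classifier.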

C. OCR Mapping
After the layout segmentation and classification steps we are able to distinguish between text and non-text regions. Historical documents are often written using fonts that are no longer in use and thus present greater challenges for OCR systems, which have to be tuned on specific sets of characters to produce a reliable transcription of the content.
On the other hand, it is not uncommon for historical documents to be manually transcribed. The "Enciclopedia Treccani" is no exception to this practice. We present here a technique to map OCR-ed text, read using an open-source OCR system trained on generic documents and thus containing errors, to the manual transcription. This also allows for an enrichment of the extracted text by means of the metadata encoded in the manual transcription, such as paragraph subdivision, titles, bulleted and numbered lists and bibliography notes.
The OCR system output must include the bounding boxes of the text lines, since their position is used to determine a preliminary paragraph splitting. We assume that each paragraph starts with a tabbed line, as Figure 2 shows.
We assume that the manual transcription is hierarchically organized in articles and paragraphs. Following this structure, and since a brute-force search is not feasible, we decided to build a two-level hierarchy of word-based inverted indexes. The first level is built at the article level, where each word votes for the articles in which it appears. The lower level consists of one inverted index per article; these are built using the same principle but considering only the words in the article itself, so that each word votes for the paragraphs in which it appears. All the stop words are removed from the text during this process, since they do not carry substantial information and only add computational load. The inverted indexes are all prebuilt before the matching process.
The matching process analyzes each paragraph independently and does not use any information about the previous and following paragraphs, in order to avoid error propagation. During the matching, the inverted index hierarchy is followed from top to bottom, starting from the article level. Each word is looked up in the article-level inverted index and a histogram is built, in which each bin holds the number of votes that an article has received. The articles that have received the most votes are the best candidates to contain the searched paragraph. For each of the top-5 articles the same process is repeated at the lower level of the inverted index hierarchy. Each word of the paragraph votes for the paragraphs in which it appears within a specific article, and the top-5 candidate paragraphs are then compared with the searched one using the Levenshtein distance. The Levenshtein distance measures the difference between two character sequences as the number of single-character edits (insertions, deletions and substitutions), and it is suitable to determine which paragraph is the most similar despite errors that might have occurred during the OCR phase. The best match from each article is then compared with the others, and the one with the lowest Levenshtein distance is considered to be the best match for the searched paragraph.
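The two-level index and the voting step can be sketched as follows. The stop-word list, the tokenizer and the function names are placeholders chosen for the example (the encyclopedia is in Italian and would need its own stop-word list), and the transcription is assumed to be available as a dictionary mapping each article identifier to its list of paragraphs.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "of", "and", "a", "in", "is"}  # placeholder list


def tokenize(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]


def build_indexes(articles):
    """articles: dict mapping article id -> list of paragraph strings.

    Returns the article-level index (word -> set of article ids) and one
    paragraph-level index per article (word -> set of paragraph positions).
    """
    article_index = defaultdict(set)
    paragraph_indexes = {}
    for art_id, paragraphs in articles.items():
        local = defaultdict(set)
        for par_id, text in enumerate(paragraphs):
            for word in tokenize(text):
                article_index[word].add(art_id)
                local[word].add(par_id)
        paragraph_indexes[art_id] = local
    return article_index, paragraph_indexes


def vote(index, words, top_k=5):
    """Each word votes for every entry it is indexed under; return the top_k."""
    votes = Counter()
    for word in words:
        for target in index.get(word, ()):
            votes[target] += 1
    return [target for target, _ in votes.most_common(top_k)]
```

Words corrupted by OCR errors simply find no entry in the index and cast no vote, so a paragraph can still be located as long as enough of its words are read correctly.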
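The final comparison step can be illustrated with a standard dynamic-programming implementation of the Levenshtein distance. The candidate tuple layout and the name `best_match` are assumptions of this sketch; the candidates are supposed to come from the voting stage.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def best_match(ocr_paragraph, candidates):
    """candidates: iterable of (article_id, paragraph_id, text) tuples.

    Returns the candidate whose text is closest to the OCR-ed paragraph.
    """
    return min(candidates, key=lambda c: levenshtein(ocr_paragraph, c[2]))
```

Because the cost grows with the product of the two string lengths, restricting the comparison to a handful of candidates selected by voting keeps the matching tractable.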
A pseudo-code description of the algorithm is given in Algorithm 1.

IV. EXPERIMENTAL EVALUATION
The evaluation of layout segmentation performance can be approached at two different levels, the region level and the pixel level. The region-level approach takes into account the semantically coherent regions extracted from a page and tries to find the best matching between them and the annotated ground-truth regions in order to determine an accuracy value for the entire page. The shape of a region may vary depending on the segmentation methodology. The pixel-level approach evaluates the class assigned to each pixel in the segmented page and compares it with the class assigned to the same pixel in the ground-truth annotation. The segmentation accuracy is calculated as the percentage of correctly classified pixels in the page. In both cases, the overall accuracy of the segmentation method can be calculated as the mean accuracy over multiple pages.
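For reference, the pixel-level measure amounts to the following computation. Label maps are assumed to be 2D lists of class labels of equal size, and the function names are hypothetical.

```python
def pixel_accuracy(predicted, truth):
    """Fraction of pixels whose predicted class equals the ground-truth class."""
    total = correct = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            total += 1
            correct += (p == t)
    return correct / total


def mean_accuracy(pages):
    """Mean pixel accuracy over a list of (predicted, truth) page pairs."""
    return sum(pixel_accuracy(p, t) for p, t in pages) / len(pages)
```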
Since our hybrid method produces polygonal regions as output, we have chosen a region-level evaluation methodology. In particular, we used the method proposed by Phillips and Chhabra [24], which was used for the ICDAR 2003 page segmentation competition [25]. We used the suggested acceptance and rejection thresholds, respectively 0.85 and 0.05, and we used intersection over union as the similarity metric to evaluate matching scores between different regions. Block size n was set to 64 in all experiments.
Our dataset consists of all the 999 pages of the first volume of the "Enciclopedia Treccani". The whole dataset has been manually annotated by three different people and presents various layout configurations. Each page is accurately scanned at a maximum resolution of 2764×3602, and no color information is retained since the encyclopedia was not meant to be printed in color. The overall dataset contains 9489 text regions, 965 images, 126 graphics, 121 tables with border, 80 mathematical formulas, 21 music scores and 77 borderless tables.
To train the SVM classifier for layout analysis, we split the dataset into a training set and a test set, the first one containing 2/3 of the samples of each class. In this case text samples are considered positives, and samples from all other classes are considered negatives.
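The intersection-over-union score and the role of the two thresholds mentioned in this section can be illustrated as follows. This is a simplified reading and not the full matching protocol of [24]: regions are approximated by axis-aligned boxes rather than polygons, and the three outcome labels are names chosen for the example.

```python
def _area(box):
    """Area of a (top, left, bottom, right) box with exclusive bottom/right."""
    return (box[2] - box[0]) * (box[3] - box[1])


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_h = min(a[2], b[2]) - max(a[0], b[0])
    inter_w = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0, inter_h) * max(0, inter_w)
    union = _area(a) + _area(b) - inter
    return inter / union if union else 0.0


def classify_pair(score, accept=0.85, reject=0.05):
    """Assumed use of the thresholds: scores at or above accept count as a
    match, scores at or below reject as no match, anything else as partial."""
    if score >= accept:
        return "match"
    if score > reject:
        return "partial"
    return "none"
```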
We compare our method with the standard XY-Cut algorithm and with the Whitespace Analysis algorithm proposed by Baird in [8]. Results are shown in Table II: as can be seen, our hybrid approach outperforms the other classic algorithms by a large margin. Some segmentation samples are also reported in Figure 3, which shows some of the possible layout arrangements found in the encyclopedia, with entities including tables, formulas, musical notation and images. Different colors are related to different classes of images and also to the paragraph subdivision.
Regarding content classification, the proposed Deep Network was implemented using Theano, and we employed the Random Forest classifier included in the OpenCV library. Since the dataset splits previously described are heavily unbalanced, the network was trained with mini-batches containing the same number of samples for each class, randomly chosen from the training set. The same train and test splits were used to train the Random Forest classifier. Content classification performance is shown in Table IV.
On a different note, the OCR mapping achieves an accuracy of 97.5%. Nevertheless, a drawback of this algorithm involves very short paragraphs and titles. Paragraphs composed of a single word or two very short words often do not carry much information about their belonging to a specific article, unless they are composed of very uncommon words. For this reason we heuristically set a threshold, and only paragraphs with more than 15 characters have been considered by the algorithm. We have evaluated the algorithm by sampling 1160 paragraphs read from the encyclopedia using the open-source Tesseract OCR software [26]. An error breakdown, split into too short and wrongly mapped paragraphs, is presented in Table III.
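The class-balanced mini-batches mentioned above can be drawn as in the sketch below. Sampling with replacement is an assumption made here so that rare classes (for instance the 21 music scores) can always fill their quota; the text does not say how small classes were handled, and `balanced_batch` is a hypothetical helper.

```python
import random


def balanced_batch(samples_by_class, per_class, rng):
    """Draw per_class random samples from every class and shuffle them.

    samples_by_class: dict mapping class label -> list of training samples.
    rng: a random.Random instance, passed in for reproducibility.
    Returns a list of (sample, label) pairs.
    """
    batch = []
    for label, samples in samples_by_class.items():
        # With replacement, so classes smaller than per_class still contribute.
        batch.extend((s, label) for s in rng.choices(samples, k=per_class))
    rng.shuffle(batch)
    return batch
```

Each batch then contains the same number of samples per class regardless of how unbalanced the training set is.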

V. CONCLUSIONS
In this paper we presented a layout analysis pipeline and a document content classification method. The layout analysis process is based on the classic XY-Cut algorithm and on an SVM classifier used to detect illustrations. Content classification is approached using the combination of a Convolutional Neural Network and a Random Forest classifier, used to distinguish between seven different classes of layout entities. We also provided an algorithm to map OCR-ed text to a manual transcription, which exploits inverted indexes built on the manual transcription. Experimental results show the effectiveness of our approach when tested on a historical document, the "Enciclopedia Treccani".

Fig. 1. The page layout segmentation pipeline. First the recursive XY-Cut algorithm is applied to detect candidate regions inside the page; then, illustrations are detected using local autocorrelation features. A second application of the XY-Cut algorithm gives the final segmentation.

Fig. 2. Paragraph subdivision example. In this figure the bounding boxes of the text lines are the output of the OCR system. Tabbed lines are used to mark the beginning of a new paragraph.

Fig. 3. Some examples of the possible layout arrangements found in the "Enciclopedia Treccani" and the extracted segmentations. Different colors are related to the different classes associated with each entity.

TABLE I. STRUCTURE OF THE CONTENT CLASSIFIER CNN. THE NETWORK TAKES AS INPUT A GRAY-SCALE n × n PATCH, AND CONSISTS OF A SEQUENCE OF CONVOLUTIONAL (CONV), MAX-POOLING (MP) AND FULLY-CONNECTED (FC) LAYERS. NEURONS FROM ALL LAYERS USE RELU ACTIVATIONS, EXCEPT FOR THE FC2 LAYER, WHICH USES A SOFTMAX ACTIVATION.

TABLE IV. CONFUSION MATRIX FOR CLASSIFICATION RESULTS OF THE CNN CLASSIFIER.