Indexing of Historical Document Images: Ad Hoc Dewarping Technique for Handwritten Text

This work presents a research project, named XDOCS, aimed at extending to a much wider audience the possibility to access a variety of historical documents published on the web. The paper presents an overview of the indexing process that will be used to achieve the goal, focusing on the adopted dewarping technique. The proposed dewarping approach performs its task with the help of a transformation model which maps the projection of a curved surface to a 2D rectangular area. The novelty introduced with this work regards the possibility of applying dewarping to document images which contain both handwritten and typewritten text.


Introduction
XDOCS is designed with the intention of extending to a much wider audience of scholars, or even simply curious people, the possibility to access a variety of historical documents published on the web.
To that purpose, the project is developing an innovative data capturing technique able to extract document indexes from their handwritten contents in a quasi-automatic mode. The devised solution operates after the historical documents have been digitized by scanning, which yields one image per pair of adjacent pages. It is intended especially for long series of documents, such as the large number of civil registries that have been available since the constitution of the Italian state.
Since warping affects document readability as well as most high-level text processing tasks, such as OCR, word spotting, and handwriting recognition, dewarping the digitized text is one of the fundamental requirements for a correct extraction of indexes. This process starts from a curled page, usually captured by a flatbed scanner or by a digital camera, and aims to obtain an output image consisting only of straight horizontal text lines, free from any distortion due to perspective or page warping.

[Figure 1: The two phases of the XDOCS indexing process. (a) Image rectification, with labels: document image processing, identify the curved projection, bad shape, extraction, dewarping, rectified image. (b) Indexing & publication, with labels: cut, save, normalized registrations, indexes from each registration, link indexes to registration, portal.]

Over the last two decades many methods for document dewarping have been proposed. These approaches are usually classified into two categories according to the surface model adopted: restoration approaches based on 2D document image processing [13,8] and restoration approaches based on 3D document shape reconstruction [5,7]. Most of the dewarping techniques proposed in the past, in both categories, are specifically designed for typewritten text. These methodologies produce poor results when applied to handwritten text or, worse, to documents containing a mix of handwritten and typewritten text. In order to improve the XDOCS index extraction, the coarse dewarping technique originally proposed by Stamatopoulos et al. [12] was adjusted to also address handwritten documents.
The remainder of the paper is organized as follows. Section 2 presents an overall description of the indexing process. In Section 3, the adopted dewarping technique is detailed. Section 4 reports some visual experimental results. Finally, conclusions are drawn in Section 5.

Indexing Process
The XDOCS indexing process is split into two main phases, namely "image rectification" and "indexing & publication". The former is depicted in Figure 1(a), showing the steps that move from a scanned image up to its final squaring. More specifically:
- Repository is the place where the original scanned images are found.
- Preprocessing is the document image processing step, filtering out noise due to the intrinsic features of the original image and to the digitization process.
- Extraction aims to find the projection of the curved surface, represented by two almost vertical straight lines and by two third-degree polynomial curves surrounding the document page (see Section 3.2 for details). This is required by the proposed dewarping method.
- Adjust is the manual operation required whenever the Extraction operation fails.
- Dewarping is the core step of the "image rectification" phase; its purpose is to compute the dewarping, which transforms the original image into a rectified and normalized one.
- Save is the final step, associating the rectified image with the original one in the Repository for further processing.
The latter phase of the indexing process is depicted in turn in Figure 1(b), showing the steps that cut individual registrations from the rectified image and lead to indexing each of those registrations. More specifically:
- Portal is the place where the registrations and their indexes are made available for consultation.
- Cut is the image processing step separating and normalizing the three registrations that are present in every rectified image.
- Save is the step storing the normalized registrations into the Portal.
- Extract is the image processing step that obtains the indexes from each registration.
- Publish is the final step, associating the extracted indexes with the corresponding registration in the Portal.

Of course, the degree of completeness and confidence of the extracted indexes is strongly affected by the quality of the handwritten text and the state of preservation of the original registry. Both can however be increased by driving the index extraction with templates representing the limited areas to be examined in order to find each of the desired indexes. The templates are manually defined on the normalized registrations and in principle can depend on the single registry (year, municipality, handwriting style of the registrar).

Figure 2 reports two examples of birth registries, the first one dated June 1888 and the second one September 1900. Each image shows two pages containing three birth registrations: one on the upper left side, one on the lower right side and one split between the two pages. The three registrations share the same structure and present the intended indexes in equivalent and well identified positions. Moreover, the most critical indexes, namely family name and given name, appear twice in each registration, and this redundancy can increase the level of confidence in the indexing.

The metadata identifying the double-page image are: registry type, year and volume, and place of registration (typically, a municipality). The intended indexes are in turn: birth month and day, name and family name, sex, father's name, mother's name and family name, and possibly the grandparents' names.
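As an illustration only, the template-driven extraction described above could be represented with a small data structure. This is a minimal sketch, not the XDOCS implementation: the names `Region`, `Template`, `regions_for` and `crop`, the use of fractional coordinates, and all example values are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Axis-aligned box in fractions of the registration width and height
    (hypothetical convention, chosen so a template is resolution independent)."""
    left: float
    top: float
    right: float
    bottom: float

    def to_pixels(self, width, height):
        # Convert the fractional box to integer pixel bounds.
        return (round(self.left * width), round(self.top * height),
                round(self.right * width), round(self.bottom * height))


@dataclass
class Template:
    """Manually defined areas for one registry (hypothetical structure)."""
    registry_type: str
    municipality: str
    year: int
    # Index name -> list of regions; a list allows redundant occurrences,
    # such as a family name that appears twice in a registration.
    regions: dict


def regions_for(template, index_name, width, height):
    """Pixel boxes to examine for one index; empty if the index is unknown."""
    return [r.to_pixels(width, height)
            for r in template.regions.get(index_name, [])]


def crop(image, box):
    """Cut a (left, top, right, bottom) box out of an image stored as rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

Keeping a list of regions per index is one way to exploit the redundancy of family name and given name: each occurrence can be examined separately and the results compared.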

The Proposed Dewarping Approach
This section describes in detail the "image rectification" phase, focusing on the Preprocessing, Extraction and Dewarping steps.

Preprocessing
Before proceeding with the dewarping step, which is detailed in the following, the gray-level images are converted into black-and-white ones using the adaptive threshold described in [11] (see Figure 3(a) for an example). Then, noise is filtered out, mainly using statistics of the connected components, which are computed using [9]. An example of the preprocessing output is reported in Figure 3(b).
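The noise filtering idea can be sketched as follows, assuming the image has already been binarized (1 for ink, 0 for background). This is a simplified stand-in, not the labeling algorithm of [9]: it uses a plain breadth-first search, only one statistic (component area), and a hypothetical `min_area` threshold.

```python
from collections import deque


def remove_small_components(binary, min_area):
    """Return a copy of `binary` (list of rows of 0/1) in which 8-connected
    foreground components smaller than `min_area` pixels are erased."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if binary[sy][sx] != 1 or seen[sy][sx]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(sy, sx)])
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Components below the area threshold are treated as noise.
            if len(component) < min_area:
                for y, x in component:
                    out[y][x] = 0
    return out
```

In practice other component statistics, such as bounding box size or aspect ratio, could be added to the same loop to avoid deleting small but meaningful marks like dots and accents.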

Extraction
The extraction step aims at identifying the 2D projection of the curved surface, delimited by the four polynomial curves which surround the document text on every single page (see Figure 5). According to the warping model characterizing historical documents, the right and left curves are assumed to be straight lines, generically defined as:

$$y = a_1 x + a_0$$

To identify these lines, the approach combines the information obtained by the Hough transform [6] with the positions of the vertexes A, B, C and D of the curved surface projection, which are retrieved by applying the Harris algorithm to a thinned image [10]. The top and bottom curves, instead, are assumed to be third-degree polynomials whose coefficients are fitted with Least Squares Estimation, and which have the following general expression:

$$y = b_3 x^3 + b_2 x^2 + b_1 x + b_0$$

More accurate results could be achieved by modeling these curves as higher-degree polynomials. However, this change increases the dewarping computation time while only slightly improving accuracy, so it is not recommended.

Boundary extraction significantly influences the quality of the dewarping process, and consequently of the index extraction. If it fails, the Adjust step lets the user correct the curves via a GUI. Examples of automatically extracted 2D projections are reported in Figure 4.
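The least squares fit of a third-degree polynomial to boundary points can be sketched as below. This is an illustrative example under stated assumptions: the function names `fit_polynomial` and `evaluate` are hypothetical, the boundary points are assumed to be already detected, and the normal equations are solved directly, which is adequate only for well-scaled coordinates.

```python
def fit_polynomial(points, degree=3):
    """Least-squares fit of y = a0 + a1*x + ... + a_degree*x**degree.

    `points` is a list of (x, y) pairs; returns [a0, a1, ..., a_degree].
    """
    n = degree + 1
    # Normal equations (V^T V) a = V^T y, with V the Vandermonde matrix.
    ata = [[sum(x ** (i + j) for x, _ in points) for j in range(n)]
           for i in range(n)]
    aty = [sum((x ** i) * y for x, y in points) for i in range(n)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    m = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - s) / m[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(a * x ** i for i, a in enumerate(coeffs))
```

With raw pixel coordinates in the thousands the normal equations become ill-conditioned, so it would be prudent to rescale x to a small range first, or to rely on a library routine such as `numpy.polyfit`.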

Dewarping
This is the core step of the dewarping approach, which aims to map the projection of the curved surface to a 2D rectangular area with fixed dimensions H and W (see Figure 5 for details). The stages of the mapping process are detailed in this section using the following notation: $A(x_A, y_A)$, $B(x_B, y_B)$, $C(x_C, y_C)$, and $D(x_D, y_D)$ are the vertexes of the projection surface, whereas $A'(x'_A, y'_A)$, $B'(x'_B, y'_B)$, $C'(x'_C, y'_C)$, and $D'(x'_D, y'_D)$ are those of the rectangular destination area. Moreover, the Euclidean distances between points A and D, and between B and C, are called $|AD|$ and $|BC|$ respectively, and the lengths of the polynomial curves $|\widehat{AB}|$ and $|\widehat{DC}|$ are defined as:

$$|\widehat{AB}| = \int_{x_A}^{x_B} \sqrt{1 + [f'(x)]^2}\,dx, \qquad |\widehat{DC}| = \int_{x_D}^{x_C} \sqrt{1 + [g'(x)]^2}\,dx$$

where $f(x)$ and $g(x)$ are the functions describing the polynomial lines. Therefore, given a generic point $K(x_K, y_K)$ on the warped image, the corresponding point $K'(x'_K, y'_K)$ on the 2D rectangular area can be found by preserving the proportions between the dimensions of the projected curves and those of the 2D destination area. First of all, it is necessary to find the two points $T(x_T, y_T) \in \widehat{AB}$ and $G(x_G, y_G) \in \widehat{DC}$ such that $K \in TG$ and

$$|\widehat{AT}| : |\widehat{AB}| = |\widehat{DG}| : |\widehat{DC}|$$

The transformation equations are then defined as follows:

$$x'_K = x'_A + W \, \frac{|\widehat{AT}|}{|\widehat{AB}|}, \qquad y'_K = y'_A + H \, \frac{|TK|}{|TG|}$$

To compute the final page dewarping, every pixel in the dewarped image is mapped to a floating-point coordinate in the warped image; the process is therefore concluded using a simple interpolation.
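The inverse mapping from the destination rectangle back to the warped image can be sketched as below. This is a minimal illustration, not the authors' implementation. It assumes that T and G divide the top and bottom curves in the same arc-length ratio, in the spirit of the coarse method of [12], that arc lengths are approximated by a sampled polyline, and that the "simple interpolation" is bilinear. All function names and the `samples` parameter are hypothetical.

```python
import math
from bisect import bisect_left


def arc_table(func, x0, x1, samples=1000):
    """Sample func on [x0, x1]; return xs and cumulative polyline lengths."""
    xs = [x0 + (x1 - x0) * i / samples for i in range(samples + 1)]
    lengths = [0.0]
    for a, b in zip(xs, xs[1:]):
        lengths.append(lengths[-1] + math.hypot(b - a, func(b) - func(a)))
    return xs, lengths


def point_at_fraction(func, table, u):
    """Point on the curve at arc length u * total from its start (0 <= u <= 1)."""
    xs, lengths = table
    target = u * lengths[-1]
    i = min(max(bisect_left(lengths, target), 1), len(xs) - 1)
    span = lengths[i] - lengths[i - 1]
    t = 0.0 if span == 0 else (target - lengths[i - 1]) / span
    x = xs[i - 1] + t * (xs[i] - xs[i - 1])
    return x, func(x)


def bilinear(image, x, y):
    """Bilinear sample of a grayscale image (list of rows), clamped to bounds."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    upper = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    lower = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return upper * (1 - fy) + lower * fy


def dewarp(image, top, bottom, x_a, x_b, x_d, x_c, width, height):
    """Build a width x height image (both >= 2) from the curved region.

    `top` and `bottom` are callables for the curves AB and DC; the x values
    are the abscissas of the vertexes A, B, D and C.
    """
    top_table = arc_table(top, x_a, x_b)
    bottom_table = arc_table(bottom, x_d, x_c)
    # For each destination column, T and G share the same arc fraction u.
    columns = []
    for xp in range(width):
        u = xp / (width - 1)
        columns.append((point_at_fraction(top, top_table, u),
                        point_at_fraction(bottom, bottom_table, u)))
    out = []
    for yp in range(height):
        v = yp / (height - 1)  # position of K along the segment TG
        out.append([bilinear(image, tx + v * (gx - tx), ty + v * (gy - ty))
                    for (tx, ty), (gx, gy) in columns])
    return out
```

Iterating over destination pixels rather than source pixels avoids holes in the output, which is why each dewarped pixel is traced back to a floating-point position in the warped image before interpolating.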

Experimental Results
Common practices in the evaluation of dewarping techniques consist of comparing the error rate of OCR software applied to the original and to the dewarped images, or are simply based on visual impressions. Unfortunately, the first strategy is unfeasible for documents which contain handwritten text, so the second one is adopted in this paper.
Figure 6 reports an example result of the proposed dewarping technique applied to a regular modern form image, which contains a bounding box and both handwritten and typewritten text. It is possible to see that the horizontal lines in the dewarped image are well aligned with the horizontal boundaries of the box. Figure 7, instead, shows a result on one of the typical historical documents treated in this work. Here too, both handwritten and printed text are present, and the detection phase is complicated by the presence of many distracting elements.
The method has been tested on more than 4,000 birth records and on almost 200 generic digital documents similar to the one reported in Figure 6. Experimental results reveal that more than 85% of the curved 2D projections are correctly extracted and do not require the manual Adjust step before performing the dewarping procedure.

Conclusions
This paper describes the rationale and objectives of a research project presently underway at SATA s.r.l. in collaboration with the University of Modena and Reggio-Emilia, and co-funded by the Emilia-Romagna regional administration. In particular, a relatively novel approach for performing dewarping on digital document images containing both handwritten and typewritten text was detailed. The proposed method assumes that the original text is surrounded by a bounding box from which the projection of the curved surface is extracted. This is a strong assumption, but it is not uncommon to find such documents: most pre-printed forms present this kind of structure, and the historical documents tested confirm the assumption. Moreover, the experimental results support the quality of the proposed approach. Future work will explore Convolutional Neural Network architectures in order to improve the image Extract stage [1,2,3].