Historical Document Digitization through Layout Analysis and Deep Content Classification / Corbelli, Andrea; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita. - (2016). (Paper presented at the 23rd International Conference on Pattern Recognition, held in Cancun, Mexico, 4-8 Dec 2016) [10.1109/ICPR.2016.7900272].
Historical Document Digitization through Layout Analysis and Deep Content Classification
Corbelli, Andrea; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
2016
Abstract
Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation, together with a strategy for classifying document regions, applied to the digitization of a historical encyclopedia. Our layout analysis method combines a classic top-down approach with a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network and merged in a Random Forest classifier. Experiments are conducted on the first volume of the "Enciclopedia Treccani", a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.
File: main.pdf (open access)
Type: Author's original version submitted for publication
Size: 4.77 MB
Format: Adobe PDF
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.