Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation as well as a strategy to classify document regions, which is applied to the digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach and a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network and combined in a Random Forest classifier. Experiments are conducted on the first volume of the "Enciclopedia Treccani", a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.

Historical Document Digitization through Layout Analysis and Deep Content Classification / Corbelli, Andrea; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita. - (2016). (Paper presented at the 23rd International Conference on Pattern Recognition, held in Cancun, Mexico, 4-8 Dec 2016) [10.1109/ICPR.2016.7900272].

Historical Document Digitization through Layout Analysis and Deep Content Classification

Corbelli, Andrea; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
2016

Files in this record: main.pdf (open access; author's original version submitted for publication; Adobe PDF, 4.77 MB)
Creative Commons license
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1103792