The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

Cascianelli, Silvia; Pippi, Vittorio; Maarand, Martin; Cornia, Marcella; Baraldi, Lorenzo; Kermorvant, Christopher; Cucchiara, Rita

doi:10.1109/ICPR56361.2022.9956189

Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges, when dealing with historical manuscripts, are due to the preservation of the paper support, the variability of the handwriting – even of the same author over a wide time-span – and the scarcity of data from ancient, poorly represented languages. With the aim of fostering the research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic splitting and a date-based splitting which takes into account the age of the author. The first setting is intended to study HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same writer in time periods for which training data are not available. For both configurations, we analyze quantitative and qualitative characteristics, also with respect to other line-level HTR benchmarks, and present the recognition performance of state-of-the-art HTR architectures. The dataset is available for download at https://aimagelab.ing.unimore.it/go/lam.
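The date-based configuration described above can be pictured as holding out whole time periods, so that a model is tested on handwriting from years it never saw in training. The following is a minimal sketch under stated assumptions, not the authors' actual split procedure: the `LineSample` record, the `date_based_split` helper, the file paths, the transcriptions, and the years are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LineSample:
    """One text-line image with its transcription (hypothetical record layout)."""
    image_path: str  # illustrative path, not a real file in the dataset
    text: str        # illustrative transcription
    year: int        # year the line was written (made-up values below)


def date_based_split(samples, held_out_periods):
    """Put every sample whose year falls in a held-out period into the test set.

    held_out_periods is a list of (start_year, end_year) pairs, both inclusive.
    All remaining samples go to the training set.
    """
    train, test = [], []
    for sample in samples:
        held_out = any(lo <= sample.year <= hi for lo, hi in held_out_periods)
        (test if held_out else train).append(sample)
    return train, test


# Toy data: four lines written in four different (invented) years.
samples = [
    LineSample("lines/0001.png", "esempio uno", 1700),
    LineSample("lines/0002.png", "esempio due", 1715),
    LineSample("lines/0003.png", "esempio tre", 1730),
    LineSample("lines/0004.png", "esempio quattro", 1745),
]

train, test = date_based_split(samples, [(1710, 1720), (1740, 1750)])
print([s.year for s in train])  # [1700, 1730]
print([s.year for s in test])   # [1715, 1745]
```

A basic split, by contrast, would assign lines to training and test sets without regard to `year`, so both sets would cover the same time periods.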

The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition / Cascianelli, S., Pippi, V., Maarand, M., Cornia, M., Baraldi, L., Kermorvant, C., Cucchiara, R. - (2022), pp. 1506-1513. (26th International Conference on Pattern Recognition, ICPR 2022, Montréal, Québec, August 21-25, 2022) [10.1109/ICPR56361.2022.9956189].

Publication year: 2022
Conference title: 26th International Conference on Pattern Recognition, ICPR 2022
Conference location: Montréal, Québec
Conference dates: August 21-25, 2022
DOI: https://dx.doi.org/10.1109/ICPR56361.2022.9956189
WoS code: WOS:000897707601072
Scopus code: 2-s2.0-85143614192
Series: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION
Volume number: 2022-
First page: 1506
Last page: 1513
All authors: Cascianelli, Silvia; Pippi, Vittorio; Maarand, Martin; Cornia, Marcella; Baraldi, Lorenzo; Kermorvant, Christopher; Cucchiara, Rita
Type: Conference proceedings paper

Files in this record:

File	Size	Format
2022_ICPR_HTR.pdf (Open access; Type: AAM - Author's version, revised and accepted for publication)	630.29 kB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.