AZIM: Arabic-Centric Zero-Shot Inference for Multilingual Topic Modeling With Enhanced Performance on Summarized Text / Aftar, Sania; Rehman, Abdul; Bergamaschi, Sonia; Gagliardelli, Luca. - In: IEEE ACCESS. - ISSN 2169-3536. - 13:(2025), pp. 114370-114383. [10.1109/access.2025.3584309]
AZIM: Arabic-Centric Zero-Shot Inference for Multilingual Topic Modeling With Enhanced Performance on Summarized Text
Aftar, Sania; Bergamaschi, Sonia; Gagliardelli, Luca
2025
Abstract
Topic modeling is an unsupervised learning technique widely used to discover latent topics in large text corpora. However, existing models often fall short in cross-lingual scenarios, particularly for morphologically rich and low-resource languages such as Arabic. Cross-lingual topic analysis extracts shared topics across languages but often relies on resource-intensive datasets or limited translation dictionaries, restricting its diversity and effectiveness. Transfer learning offers a promising solution to these challenges. This paper presents AZIM, an Arabic-centric extension of ZeroShotTM, adapted to use Arabic as the training language for zero-shot multilingual topic modeling. The model's performance is evaluated across diverse Latin-script and non-Latin-script languages, with a focus on its adaptability to Modern Standard Arabic (MSA) and Classical Arabic (CA). Additionally, the study explores the impact of summarized versus general text. The results show that the summarized versions of the datasets consistently outperform their baselines in terms of interpretability and coherence. Furthermore, the model demonstrates robust cross-lingual generalization, with non-Latin-script languages such as Persian and Urdu outperforming certain Latin-script languages. However, variations in performance across languages reflect the complex nature of multilingual embeddings. The performance gap between Modern Standard Arabic and Classical Arabic reveals a limitation of the pre-trained embeddings, namely their bias towards modern corpora. These findings underscore the importance of adapting techniques for morphologically rich and low-resource languages to enhance cross-lingual topic modeling.
| File | Size | Format |
|---|---|---|
| AZIM_Arabic-Centric_Zero-Shot_Inference_for_Multilingual_Topic_Modeling_With_Enhanced_Performance_on_Summarized_Text.pdf (open access; VOR, version published by the publisher; Creative Commons license) | 1.74 MB | Adobe PDF |
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Supporto Iris.