Advancing Handwritten Text Recognition for Non-Latin Script

Call and Funding Body: Programa Momentum CSIC. Funded by the Secretaría de Estado de Digitalización e Inteligencia Artificial (Ministerio para la Transformación Digital y de la Función Pública), through Red.es, with funds from the Plan de Recuperación, Transformación y Resiliencia

Project reference: MMT24-ILC-01

PI: Jan Thiele (ILC, CSIC)

Execution period: 20/11/2024 to 19/12/2028

The textual production of cultures that use non-Latin scripts (particularly Arabic, Byzantine and Renaissance Greek, and Hebrew) remains vastly understudied compared to the Latin tradition. It is fair to say that more than half of the surviving texts remain in manuscript form, which severely limits their accessibility and leaves substantial gaps in our understanding of these cultures. If we continued to rely on traditional, manual methods of exploring, transcribing, and editing manuscript texts, we would likely need another century to obtain a reasonably complete picture of textual production in non-Latin scripts. However, harnessing the expertise of scholars versed in diverse non-Latin scripts to develop and refine artificial intelligence opens a genuine opportunity for a paradigm shift: the emergence of computer-assisted philology. Recent advances in technologies that automate the recognition and transcription of handwritten text offer promising prospects for a more efficient and less biased exploration of past cultures.

Handwritten Text Recognition (HTR) technology has long been heavily biased towards Latin script, leaving a significant gap in capabilities for non-Latin scripts. Our project seeks to address this disparity by targeting the challenges inherent in non-Latin handwritten texts and by focusing on three main pillars: open science principles, applicability to diverse linguistic and cultural contexts, and the development of specialists in HTR technology. To achieve this overarching goal, the project sets out the following objectives:

1. Creation of training datasets: The project aims to create datasets for Arabic and Greek handwritten texts, recognizing the unique challenges posed by these non-Latin scripts. Arabic's cursive nature and variability in character shapes, along with the presence of diacritics, present significant obstacles to existing HTR technologies. Similarly, the Greek minuscule script, which became prevalent in nearly all Greek manuscripts from the 9th century onward, developed an intricate system of ligatures and abbreviations, including tachygraphic symbols and "suspended" abbreviations. Together with its diverse range of character forms, this vast array of graphic symbols is difficult to systematize and to recognize.

2. Training and fine-tuning of pre-trained models: The project seeks to train and fine-tune pre-trained computer vision models on datasets of Arabic and Greek handwritten scripts, in order to specialize them for efficient HTR.

3. Training of HTR technology experts: The project will train experts in the experimental development of HTR technology, preparing them to play critical roles in advancing this technology at the ILC. These experts will contribute to ongoing research and development efforts, ensuring sustained progress in the field.

Jan Thiele (ILC, CSIC), Principal Investigator

Carmen García Bueno (ILC, CSIC), Postdoctoral Fellow
