Conference Paper/Proceeding/Abstract

Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati

Sanjay Booshanam, Kelly Chen, Ondrej Klejch, Thomas Reitmaier, Dani Kalarikalayil Raju, Electra Wallington, Nina Markl, Jen Pearson, Matt Jones, Simon Robinson, Peter Bell

Findings of the Association for Computational Linguistics: EMNLP 2025, Pages: 22497 - 22509

Swansea University Authors: Thomas Reitmaier, Jen Pearson, Matt Jones, Simon Robinson


Published in: Findings of the Association for Computational Linguistics: EMNLP 2025
ISBN: 979-8-89176-335-7
Published: Suzhou, China: Association for Computational Linguistics, 2025
Online Access: https://aclanthology.org/2025.findings-emnlp.1224/
URI: https://cronfa.swan.ac.uk/Record/cronfa70213
Abstract: Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system and can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati – an unwritten language – that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving hope that it may be usable by speakers of unwritten languages worldwide.
College: Faculty of Science and Engineering
Start Page: 22497
End Page: 22509