Pranav A

I am Pranav, and my research interests are in multilinguality. I currently work as a Machine Learning Engineer at Dayta AI. I graduated from the Hong Kong University of Science and Technology, where I studied Big Data Technology. My email is cs.pranav.a (at)

GitHub Link  /  LinkedIn Link  /  CV (Link to PDF)


Paper title: How to Make Virtual Conferences Queer-Friendly: A Guide
Authors: Organizers of QueerInAI, A Pranav, MaryLena Bleile, Arjun Subramonian, Luca Soldaini, Danica J. Sutherland, Sabine Weber and Pan Xu
Conference: Widening NLP, 2021
Link to Queer in AI page

Abstract: Queer in AI’s demographic survey reveals that most queer scientists in our community do not feel completely welcome at conferences and in their work environments, with the main reasons being a lack of queer community and role models. Over the past years, Queer in AI has worked towards eliminating these issues, yet we have observed that the voices of marginalized queer communities, especially transgender and non-binary folks and queer BIPOC folks, have been neglected. Furthermore, the coronavirus pandemic has introduced many novel scenarios, including the ubiquity of virtual conferences, with which D&I chairs may not have prior experience. Queer in AI frequently gets inquiries about making virtual conferences more inclusive from both conference organizers and queer community organizers. The purpose of this document is to provide a tutorial for D&I organizers on how to make virtual conferences queer-friendly.

Paper title: 2kenize: Tying Subword Sequences for Chinese Script Conversion
Authors: Pranav A, Isabelle Augenstein
Conference: ACL, 2020
GitHub Link  /  SIGTYP Newsletter Link  /  arXiv Link  /  Video and Slides Link

Abstract: Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have poor performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, as well as a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities.

Paper title: Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection
Authors: Pranav A
Conference: ACL Student Research Workshop, 2018
GitHub Link  /  ACL Anthology Link

Abstract: Ranking functions in information retrieval are often used in search engines to extract the relevant answers to a query. This paper takes this notion from information retrieval and applies it to the problem domain of cognate detection. The main contributions of this paper are: (1) positional tokenization, which incorporates the sequential notion; (2) graphical error modelling, which calculates the morphological shifts. Current research work only distinguishes whether a pair of words are cognates or not. However, we also study whether we could predict a possible cognate from the given input. Our study shows that language-modelling-based retrieval functions with positional tokenization and error modelling tend to give better results than competing baselines.