Pranav A

I am Pranav and my research interests are in multilinguality and fact-checking. I am currently working as a Machine Learning Engineer at Miro AI. I graduated from Hong Kong University of Science and Technology, where I studied Big Data Technology. My email is cs.pranav.a (at)

Github  /  LinkedIn  /  CV (for Industry Jobs)  /  1-page Academic CV


My research interests are mainly in Natural Language Processing, particularly representation learning and morphology. Currently, my work is focused on understanding subword models in multilingual tasks, specifically addressing these issues:

  • Sparsity: How to construct subword tokenization approaches which address the issue of sparsity and skewness for tokenization in monolingual NLU approaches?
  • Multilingual subwords: How to construct subword tokenization approaches for multilingual and multitask learning methods by leveraging mapping probabilities?
  • Cross-lingual transfer: How to construct subword mappings for cross-lingual transfer learning methods especially for morphologically rich languages?
Here are some of my publications addressing similar issues:

2kenize: Tying Subword Sequences for Chinese Script Conversion
Pranav A, Isabelle Augenstein
ACL, 2020
Github /  SIGTYP Newsletter /  arXiv /  Video and Slides

Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have poor performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, as well as a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities.
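
To make the one-to-many mapping problem concrete, here is a minimal sketch in Python. It is not the 2kenize model: it works on single characters rather than subword sequences, and it replaces the two language models with a hypothetical table of toy bigram counts. The mapping table, the counts, and the function names are all assumptions made for illustration. The only fact taken as given is that simplified 发 can correspond to either traditional 發 (as in 發展) or 髮 (as in 頭髮).

```python
import math

# Hypothetical one-to-many mapping table (Simplified -> Traditional candidates).
CANDIDATES = {
    "发": ["發", "髮"],
    "头": ["頭"],
    "展": ["展"],
}

# Toy bigram counts standing in for a Traditional Chinese language model.
# These numbers are invented for the example.
BIGRAM_COUNTS = {
    ("頭", "髮"): 50,
    ("發", "展"): 80,
    ("頭", "發"): 1,
    ("髮", "展"): 1,
}


def bigram_score(prev, cur):
    """Log of the add-one smoothed count, so unseen pairs score 0.0."""
    return math.log(BIGRAM_COUNTS.get((prev, cur), 0) + 1)


def convert(simplified):
    """Pick the highest-scoring Traditional sequence with a Viterbi search."""
    # best maps the last chosen character to (score, text so far).
    best = {"<s>": (0.0, "")}
    for ch in simplified:
        # Characters without an entry are assumed to map to themselves.
        options = CANDIDATES.get(ch, [ch])
        new_best = {}
        for cand in options:
            score, text = max(
                (prev_score + bigram_score(prev, cand), prev_text)
                for prev, (prev_score, prev_text) in best.items()
            )
            new_best[cand] = (score, text + cand)
        best = new_best
    return max(best.values())[1]


print(convert("头发"))  # 頭髮
print(convert("发展"))  # 發展
```

The same simplified character resolves to a different traditional character depending on its neighbour, which is the ambiguity that a plain character-by-character lookup table cannot handle.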

Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection
Pranav A
ACL Student Research Workshop, 2018
Github  /  ACL Anthology

Ranking functions in information retrieval are often used in search engines to extract the relevant answers to the query. This paper makes use of this notion of information retrieval and applies it to the problem of cognate detection. The main contributions of this paper are: (1) positional tokenization, which incorporates the sequential notion; (2) graphical error modelling, which calculates the morphological shifts. Existing work only distinguishes whether a pair of words are cognates or not. However, we also study whether a possible cognate can be predicted from the given input. Our study shows that language-modelling-based retrieval functions with positional tokenization and error modelling tend to give better results than competing baselines.

Bit Partitioning Schemes for Multicell Zero-Forcing Coordinated Beamforming
Pranav A
arXiv  /  Slides

In this paper, we study bit partitioning schemes for the multicell multiple-input single-output (MISO) infrastructure. Zero-forcing beamforming is used to null out the interference signals, and random vector quantization is used to quantize the channel vectors. For the minimal feedback period (MFP) scheme, the upper bound of the rate loss is calculated and the optimal bit partitioning among the channels is shown. For the adaptive feedback period (AFP) scheme, joint optimization schemes of feedback period and bit partitioning are proposed. Finally, we compare the sum rate efficiency of each scheme and conclude that the minimal feedback period outperforms other schemes.