Google Translate adds Konkani: What are the Limitations?

Saurabhk
4 min read · May 14, 2022

The dataset is the main bottleneck for Artificial Intelligence/Machine Learning adoption for Konkani

Sundar Pichai presenting at Google I/O '22: अति सुंदर (very beautiful)

So finally, the wait is over!
Google officially announced the addition of Konkani, along with 23 other languages, to the Google Translate ecosystem.
It is personally really exciting and satisfying to see my mother tongue finally find a place in Google Translate. Sanjeet Hegde Dessai (Google staff), a Goan, took this up and pushed the request to Google's managerial board to make it happen. Thanks a ton!

Konkani: A Low-Resourced Language

Goa🏖️ is the smallest state of India, and Konkani is the official language of the state of Goa🌴. Despite a population of around 2 million Konkani speakers, the language has a relatively low online presence: books, literature, and news are actively published in print (hard copies), but digital formats (soft copies) are often neglected.

Konkani as a language is recognized by the Constitution of India, and the state also has a dedicated Konkani department, which could work towards creating, digitizing, and maintaining language resources, along with building quality annotated datasets to enable more accessible AI systems.

Konkani Bhasha Mandal, the Goa University Konkani Department, the Goa Government's Department of Art and Culture, renowned authors, enthusiasts, and other such bodies should collaborate to work towards this.

Translation System:

Ideally, a machine translation system is devised as a supervised learning task: you have a parallel corpus containing bilingual sentence pairs (English sentence, Konkani sentence) that is fed to an encoder-decoder neural network architecture.
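To make this concrete, here is a minimal sketch in PyTorch of how one such sentence pair flows through an encoder-decoder network during supervised training; the token ids and vocabulary sizes are toy values, not a real Konkani setup:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))       # encode the source sentence
        dec, _ = self.decoder(self.tgt_emb(tgt_ids), state)  # decode conditioned on it
        return self.out(dec)                                 # per-token logits over target vocab

# One hypothetical (English, Konkani) pair as token ids.
src = torch.tensor([[5, 9, 2]])      # source tokens
tgt_in = torch.tensor([[1, 7, 4]])   # <bos> + target tokens (teacher forcing)
tgt_out = torch.tensor([[7, 4, 2]])  # target tokens shifted by one + <eos>

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(src, tgt_in)
loss = nn.CrossEntropyLoss()(logits.view(-1, 1000), tgt_out.view(-1))
loss.backward()  # one supervised training step over the pair
```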

Training such a system to produce quality translations requires a dataset of millions of sentence pairs. But many languages lack such datasets, and building them takes considerable time and effort, which is why techniques like the zero-shot approach and multilingual knowledge distillation evolved.

Google Multilingual Neural Machine Translation (GNMT)
GNMT demonstrated the capability of a translation system to translate between arbitrary languages, including language pairs it has never been trained on, in a zero-shot fashion.

Zero-Shot Machine Translation System:

A zero-shot machine translation model is a pre-trained and fine-tuned model that serves as a knowledge base, having been fine-tuned on a huge amount of multilingual sentence translation pairs. This enables the system to transfer "translation knowledge" from one language pair to the others.
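In Google's multilingual NMT work (see the reference at the end), this is done by training a single model on many language pairs and signalling the desired output language with an artificial token prefixed to the source sentence. The sketch below illustrates only the data format; the sentences are illustrative:

```python
# Training pairs for one multilingual model: an artificial <2xx> token
# prefixed to the source tells the decoder which language to produce.
train_pairs = [
    ("<2en> आप कैसे हैं?", "How are you?"),   # Hindi   -> English
    ("<2hi> How are you?", "आप कैसे हैं?"),   # English -> Hindi
    ("<2en> तू कसा आहेस?", "How are you?"),   # Marathi -> English
]

# Zero-shot: a direction never seen during training (here Marathi -> Hindi)
# can still be requested, because both languages appeared in other pairs.
zero_shot_input = "<2hi> तू कसा आहेस?"
```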

> The zero-shot model leverages what it has learned from the similarity Konkani shares with Hindi and Marathi to model representations for unseen Konkani translations.

> Initial monolingual training on Konkani text helps the model generate relevant Konkani translations from the source language (by mapping semantically similar vectors in embedding space), giving impressive results for frequently used sentences/phrases.

This transfer learning, and the need to translate between multiple languages, forces the system to make better use of its modelling power, even for languages it has never seen.

Future Possibilities (Speech): Importance of Dataset

Konkani is still not part of the speech ecosystem and needs improvement on Gboard; that is again mainly due to the lack of even unstructured datasets. Having a quality dataset is always a good way to build interpretable, faithful, and accountable ML systems. Where at least a little data exists, a few-shot learning system can be a bit more performant than a zero-shot one.
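As a rough sketch of what few-shot adaptation could look like, the snippet below fine-tunes a pretrained multilingual checkpoint on a single translation pair; the model choice (mT5), the input format, and the example pair are all assumptions for illustration, not a tested recipe:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint: any pretrained multilingual seq2seq model would do.
tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# A handful of (source, target) pairs -- the "few" in few-shot.
# The target is an illustrative Konkani greeting.
few_shot = [("translate English to Konkani: Good morning", "देव बरो दीस")]

opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
for src, tgt in few_shot:
    inputs = tok(src, return_tensors="pt")
    labels = tok(tgt, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
    loss.backward()
    opt.step()
    opt.zero_grad()
```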

Having more native datasets will help the system pick the correct word, like मान्ने (bathroom).

I urge the Konkani community to digitize content to enable such AI/ML adoption. With that data, companies like Google can provide more generalized and capable AI systems.

AI / Machine Learning in Konkani

Note: I have mentioned only my own contributions here.

GloVe, FastText, and Word2Vec Embedding Models

Non-contextual word embeddings are detailed in my blog post here.
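As a quick illustration of how such non-contextual embeddings are trained, here is a minimal gensim sketch; the two tokenized Konkani sentences are placeholders for a real corpus:

```python
from gensim.models import FastText, Word2Vec

# Placeholder corpus: each sentence is a list of Konkani tokens.
sentences = [["गोंय", "सुंदर", "आसा"], ["कोंकणी", "आमची", "भास"]]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
ft = FastText(sentences, vector_size=100, window=5, min_count=1)

# FastText composes vectors from character n-grams, so it can embed words
# never seen in training, which helps for a morphologically rich language.
print(ft.wv["गोंयकार"])  # out-of-vocabulary word still gets a vector
```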

Long Short-Term Memory Networks for Language Modelling

Around two years back, I used an AWD-LSTM-based sequence-to-sequence neural architecture to build a language model that generates relevant next-word predictions (link to video).
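For context, here is a minimal next-word prediction sketch with a plain LSTM in PyTorch; the actual work used AWD-LSTM (which adds weight-dropped recurrent connections and other regularization), and the vocabulary size and token ids here are toy values:

```python
import torch
import torch.nn as nn

class LstmLM(nn.Module):
    def __init__(self, vocab=5000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        hidden, _ = self.lstm(self.emb(ids))
        return self.head(hidden)  # logits for the next token at every position

model = LstmLM()
ids = torch.tensor([[11, 42, 7, 3]])     # a tokenized Konkani sentence (toy ids)
logits = model(ids)
next_id = logits[0, -1].argmax().item()  # most likely next-word id
```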

Transformer Networks: BERT (a Tiny Konkani Dataset for a Thirsty BERT)

Around the same time, I tried training the then newly released BERT (Bidirectional Encoder Representations from Transformers) model on the masked language modelling and next sentence prediction objectives, using a very small dataset.
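A minimal sketch of that masked language modelling setup using the Hugging Face transformers library is below; the small config sizes and the konkani.txt corpus file are assumptions, and the next sentence prediction objective is omitted for brevity:

```python
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Reuse a multilingual tokenizer; train a small BERT from scratch on top of it.
tok = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM(BertConfig(vocab_size=tok.vocab_size,
                                   num_hidden_layers=4, hidden_size=256,
                                   num_attention_heads=4, intermediate_size=512))

# konkani.txt is a hypothetical one-sentence-per-line Konkani corpus.
ds = load_dataset("text", data_files={"train": "konkani.txt"})["train"]
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="konkani-bert", num_train_epochs=3),
    train_dataset=ds,
    # Randomly masks 15% of tokens, producing the MLM training signal.
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
```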

But BERT being BERT (the base model: 12 attention layers and a 768-dimensional hidden state over WordPiece-encoded tokens) was far too large to learn meaningful vector representations, so it could not generate quality embeddings from the tiny dataset that Konkani has to offer.

Here's a link to a news article covering my work for Konkani in The Navhind Times: https://www.navhindtimes.in/2021/10/01/magazines/kuriocity/an-app-for-konkani-lovers/

Many advances are currently happening in the AI/ML NLP domain, and there is still a lot more scope for development in downstream NLP tasks (Named Entity Recognition, Relation Extraction, Semantic Search); but only when enough data is available!

References:

https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html
