Topic Modelling on Hindi Text: NLP

Saurabhk
3 min read · Sep 25, 2021


Get insights into the topics of a text corpus using an unsupervised approach

Pic Credit: Joviton D’costa

Topic modelling is a process to automatically identify the topics present in a text corpus and thereby derive hidden patterns. It is an unsupervised approach, which makes it different from rule-based text-mining approaches that use regular expressions or dictionary-based keyword searching.

About Hindi Language

Hindi is an Indic language written in the Devanagari script. It is the 4th most widely spoken language in the world, and its text follows the Unicode (UTF-8) standard.

1. Hindi is morphologically rich, which means that a lot of information is packed into each word compared to English.

2. Hindi is a free word order language, which means its words can be arranged in almost any order, unlike English, which must follow the Subject-Verb-Object order for a sentence to be grammatically valid.

Hence, preprocessing steps like lemmatization aren't a good idea here.

Refer to my previous blog on text preprocessing for more details: https://saurabhk30.medium.com/text-preprocessing-pipeline-nlp-c44ec82c7875
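Since all Devanagari characters live in a single Unicode block (U+0900 to U+097F), it is straightforward to check whether a token is Hindi at all. A minimal sketch (the helper name is mine, purely illustrative):

# Devanagari occupies the Unicode block U+0900 to U+097F.
# Illustrative helper (not from the post) to test whether a token
# contains at least one Devanagari character.
def is_devanagari(token):
    return any('\u0900' <= ch <= '\u097F' for ch in token)

print(is_devanagari('हिंदी'))  # True
print(is_devanagari('NDTV'))   # False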

Dataset Details:

The data has a total of 2,100 news headlines extracted from news agencies like NDTV and IndiaTV using a web scraper. The text collected from these sources covers a very diverse and broad range of topics.

Text Preprocessing

Text preprocessing is an important data-preparation step performed on a raw text dataset. It involves normalizing the text and stripping/removing all unwanted non-textual content or markup tags that act as noise for any information extraction or machine learning task. The steps below are combined into a single code sketch after the list.

1. Stripping markup tags & URLs:
HTML tags, unnecessary special characters, bad encoding/formatting, accented characters, etc.

2. Removing Hindi stopwords:

{'उनकी', 'रखें', 'करता', 'को', 'इंहिं', 'किर', 'कौन', 'तिन्हें', 'अदि', 'यहाँ', 'वहाँ', 'जहाँ', 'करते', 'ऐसे', 'इन्हें', 'जिन्हें', 'हें', 'तब', 'जैसा', 'अपनि', 'साबुत', 'इसकी', 'बही', 'उसि', 'दुसरे', 'बहुत', 'मेरा', 'ऱ्वासा', 'वगेरह', 'वे', 'सो', 'कोन', 'करना', 'हुई', 'वहीं', 'कोनसा', 'नहीं', 'सबसे', 'जिंहों', 'संग', 'जीधर', 'थि', 'उन', 'इन्हों', 'दो', 'वाले', 'ने', 'इसमें', 'रहा', 'दवारा', 'दिया', 'मैं', 'कुल', 'उंहें', 'उसी', 'उंहों', 'दूसरे', 'यह', 'सभि', 'किसी', 'हुइ', 'इसे', 'अत', 'अभी', 'जहां', 'अपनी', 'इसि', 'एवं', 'रहे', 'कोई', 'उसे', 'करने', 'यहां', 'कर', 'हुआ', 'कइ', 'की', 'बनी', 'होता', 'किस', 'हे', 'कई', 'ना', 'भितर', 'इसी', 'जितना', 'जैसे', 'इतयादि', 'इनका', 'जिंहें', 'अपने', 'यही', 'किन्हें', 'ये', 'कितना', 'साभ', 'हुए', 'जिधर', 'आदि', 'एस', 'का', 'फिर', 'तिसे', 'लिये', 'उसके', 'तिंहें', 'कुछ', 'लेकिन', 'उनका', 'दबारा', 'लिए', 'वग़ैरह', 'हि', 'जिस', 'जेसा', 'हैं', 'जेसे', 'किंहों', 'इन्हीं', 'के', 'इत्यादि', 'किसि', 'वरग', 'एक', 'बिलकुल', 'ओर', 'से', 'पहले', 'पे', 'मे', 'या', 'वहां', 'साथ', 'तिन', 'होने', 'तिन्हों', 'पूरा', 'कहा', 'इंहें', 'तो', 'में', 'वुह', 'कि', 'कहते', 'भि', 'हो', 'पर', 'एसे', 'उंहिं', 'अगर', 'इंहों', 'जो', 'होना', 'उन्हें', 'मुझको', 'भीतर', 'किसे', 'होति', 'उन्हीं', 'अपना', 'सारा', 'काफ़ी', 'थी', 'तिस', 'द्वारा', 'उनको', 'तक', 'जिसे', 'निहायत', 'बाला', 'हूँ', 'जिन्हों', 'बहि', 'होती', 'मगर', 'सकते', 'इन', 'ही', 'हुअ', 'किंहें', 'गया', 'बाद', 'अंदर', 'दुसरा', 'होते', 'कोइ', 'उनके', 'व', 'करें'}

3. Removing punctuation (a sketch combining all three steps follows below):

{'!','"','#','$','%','&',"'",'(',')','*','+',',','-','.','..','...','/',':',';','<','=','>','?','@','[','\\',']','^','_','`','``','{','|','}','~','..','...','``'}

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is the most popular topic modelling technique. The way it works is that it constructs a vocabulary and forms a document-word matrix, which is then decomposed/factorized into:
1. a document-topic matrix, and
2. a topic-word matrix.
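In Gensim, this vocabulary and document-word matrix can be built from the tokenised headlines as follows (a sketch; the variable names match the training snippet further below):

from gensim import corpora

# Build the vocabulary (id -> word) and the bag-of-words
# document-word matrix that LDA will factorize.
dictionary = corpora.Dictionary(tokenised_docs)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in tokenised_docs]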

LDA optimizes by iterating through the words of each document and adjusting their current topic assignments. A new topic “k” is assigned to word “w” with a probability P, which is the product of two probabilities, P1 and P2:

P1 = p(topic t | document d), the proportion of words in document d that are currently assigned to topic t.

P2 = p(word w | topic t), the proportion of assignments to topic t, across all documents, that come from word w.
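That update rule can be sketched in a few lines. This is a toy illustration only: every name below is hypothetical, and Gensim's LdaModel actually trains with online variational Bayes rather than this Gibbs-style sampler.

import random

# Draw a new topic for word w in document d with probability
# proportional to p(topic k | d) * p(word w | k).
def sample_new_topic(word, doc_topic_counts, topic_word_counts):
    num_topics = len(doc_topic_counts)
    weights = []
    for k in range(num_topics):
        p1 = doc_topic_counts[k] / max(sum(doc_topic_counts), 1)
        p2 = (topic_word_counts[k].get(word, 0)
              / max(sum(topic_word_counts[k].values()), 1))
        weights.append(p1 * p2)
    if sum(weights) == 0:
        return random.randrange(num_topics)
    return random.choices(range(num_topics), weights=weights)[0]

With Gensim, the actual training call looks like this: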

from gensim.models import LdaModel as LDA

# Train LDA on the bag-of-words corpus built above
lda_model = LDA(corpus=doc_term_matrix,
                id2word=dictionary,
                num_topics=3,      # number of topics to extract
                random_state=100,  # fixed seed for reproducible topics
                chunksize=1000,    # documents per training chunk
                passes=15)         # full passes over the corpus

The Gensim package provides a nice wrapper for tweaking different options, such as the number of topics (to create new clusters), the number of passes/iterations and others; do play around with these parameters as per your dataset.
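Once trained, the model can be inspected with Gensim's standard helpers, for example:

# Top terms per topic, with their learned weights
for topic_id, terms in lda_model.print_topics(num_words=8):
    print(topic_id, terms)

# Topic mixture of the first document
print(lda_model.get_document_topics(doc_term_matrix[0]))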

Observation

The interactive widget displayed in the notebook, like the one shown below, allows you to explore the relevant terms associated with each topic. With my current dataset it indeed gives relevant terms associated with topics such as sports, politics and defence-related news categories. In cases where there are no labels, we can quickly see how this technique helps to provide insights.
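A visualisation like this is typically produced with pyLDAvis (an assumption here, since the widget isn't named in the text); a sketch for pyLDAvis 3.x, where the Gensim adapter lives in pyLDAvis.gensim_models:

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # pyLDAvis.gensim in older releases

pyLDAvis.enable_notebook()  # render the widget inline in Jupyter
vis = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)
vis  # interactive inter-topic distance map plus per-topic term bar chart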

Find full code here.

References

models.ldamodel — Latent Dirichlet Allocation — gensim (radimrehurek.com)

ELI 5 link
