How to Slim Down Your Docker Image Size for a Machine Learning (ML) App?
Docker image sizes can get really big after adding all those packages, model weights, and other resources for your ML app. There are a few quick ways to reduce the image size so it stays manageable when you push it to a Docker registry or during deployment.
Here’s a quick guide for data scientists on dockerizing your Python app
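One common way to keep the image small is a multi-stage build on a slim base image, so build tools and pip caches never reach the final layer. The sketch below is illustrative only: the file names (`requirements.txt`, `app.py`, `model/`) and Python version are placeholders, not prescriptions.

```dockerfile
# Stage 1: build wheels so compilers and build deps stay out of the final image
FROM python:3.9-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2: runtime image containing only installed packages and app code
FROM python:3.9-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY app.py model/ ./
CMD ["python", "app.py"]
```

Using `--no-cache-dir` and discarding the builder stage are the two cheapest wins; pruning unused model weights from `model/` usually saves the most space in ML images.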
As an ML Engineer or Data Scientist, you often train various ML models. In most cases, you’d want to share them across teams so others can try them out on their own devices, or deploy them on a managed cloud service for a demo. In such cases, Docker helps by containerizing all the code, packages, and dependencies together, allowing you to ship across environments easily.
In Python there are plenty of frameworks that let you quickly build an app around a trained ML model, rather than it being limited to the Jupyter…
This blog post is in reference to my submission for NLP HACK 2021, held by CIE, IIIT Hyderabad. The theme of the hackathon was to work on Indian languages. Despite the many advances in the AI/ML & NLP domain, very little work has been done on the low-resource languages of India. Konkani is one such language where AI/ML & NLP-based techniques have not been adopted. Being a Goan, I decided to work on the native language of Goa. I was able to train 3 word embeddings & develop a Streamlit-based application. Follow along for details.
Goa🏖️ is the smallest state…
The experiment uses 50K labelled reviews from the IMDB dataset.
For binary text classification, Word2Vec, GloVe, and FastText embeddings are used together with neural-network architectures such as CNN and RNN (LSTM & Bi-LSTM).
Now let’s briefly discuss these word embeddings and neural-network architectures, and also look at the experimental setup used in my experiments.
Zero-shot learning in text classification is an effective way to predict a class label without any prior training data. It can be used for tasks such as sentiment analysis, document classification, and emotion analysis. The zero-shot approach uses transfer learning to achieve this amazing feat.
The Zero-Shot Classification pipeline is built on a pre-trained language model; by default it loads bart-large-mnli, which serves as the knowledge base, as it has been trained on a huge amount of text data and then fine-tuned on a Natural Language Inference (NLI) task to classify each premise–hypothesis pair as entailment, neutral, or contradiction.
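The control flow can be sketched in plain Python: each candidate label is turned into an NLI hypothesis, and the label whose hypothesis is most strongly entailed by the text wins. The `entailment_score` function below is a toy word-overlap stub standing in for bart-large-mnli (which returns a learned entailment probability instead), so only the structure, not the scoring, mirrors the real pipeline:

```python
def entailment_score(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model such as bart-large-mnli: scores how
    strongly the premise entails the hypothesis via naive word overlap.
    A real NLI model returns a learned entailment probability."""
    premise_words = set(premise.lower().split())
    hypo_words = set(hypothesis.lower().split())
    return len(premise_words & hypo_words) / max(len(hypo_words), 1)

def zero_shot_classify(text: str, candidate_labels: list[str]) -> str:
    # Each candidate label becomes an NLI hypothesis; most-entailed label wins.
    template = "This example is about {}"
    scores = {
        label: entailment_score(text, template.format(label))
        for label in candidate_labels
    }
    return max(scores, key=scores.get)

print(zero_shot_classify(
    "The plot was gripping and the acting superb, a great movie",
    ["movie", "sports", "politics"],
))
```

The hypothesis template is the same trick the real pipeline uses; swapping the stub for an actual NLI model is what turns this sketch into zero-shot classification proper.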
In supervised Machine Learning (ML) tasks, data has associated class labels that are used to train the ML model. An ML model also tends to perform better if the data has sufficient and representative samples for each class label. In Natural Language Processing (NLP), data augmentation is much harder than in numerical or computer vision tasks. The major challenge is the availability of annotated data, especially for low-resource languages. Collecting and annotating additional data is an obvious choice, but it is time-consuming & has its own challenges. To overcome this, we can use data augmentation techniques.
Data augmentation is a…
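One of the simplest text-augmentation techniques is synonym replacement. The sketch below uses a tiny hand-made synonym table purely for illustration; real pipelines draw synonyms from WordNet or embedding neighbours:

```python
import random

# Toy synonym table; in practice this comes from WordNet or word embeddings.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film", "picture"],
    "bad": ["poor", "awful"],
}

def synonym_replace(sentence: str, n: int = 1, seed: int = 0) -> str:
    """Replace up to n replaceable words with a randomly chosen synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

print(synonym_replace("a good movie overall", n=1))
```

Each call produces a slightly different but label-preserving variant of the sentence, which is exactly what a classifier on a low-resource language needs more of.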
A Machine Learning (ML) model tends to perform better when it has sufficient data and balanced class labels.
Imbalanced text data means an uneven distribution of class labels in the dataset. The imbalance can occur in any ratio (1:10, 1:100, etc.). Such a skewed distribution of class labels results in poor classification/predictive performance, because the ML model is unable to generalize well on the minority class labels.
In fact, most real-world datasets have unevenly distributed class labels, and often the minority class is…
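A simple baseline for rebalancing is naive random oversampling: duplicate minority-class samples until every class matches the majority count. A minimal sketch with toy data (libraries such as imbalanced-learn offer more principled variants):

```python
import random
from collections import Counter

def oversample_minority(texts, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    reaches the majority-class count (naive random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):
            out_texts.append(rng.choice(pool))
            out_labels.append(label)
    return out_texts, out_labels

texts = ["great", "awful", "fine", "nice", "superb", "bad"]
labels = ["pos", "neg", "pos", "pos", "pos", "neg"]  # 4:2 imbalance
X, y = oversample_minority(texts, labels)
print(Counter(y))  # classes are now balanced
```

Oversampling duplicates information rather than adding it, which is why it pairs well with the augmentation techniques discussed above.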
Role of preprocessing in Information Extraction & Text Categorization tasks
Text data collected from various sources such as online portals and social media is very diverse and messy. Text preprocessing is an important data preparation step performed on a raw text dataset. It involves normalizing the text and stripping/removing unwanted non-textual content or markup tags that act as noise for any Information Extraction or Machine Learning task.
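The normalizing and stripping steps described above can be sketched with a few regular expressions; this is an illustrative subset of a preprocessing pipeline, not an exhaustive one:

```python
import re

def preprocess(text: str) -> str:
    """Basic cleaning: strip markup tags, URLs, and non-alphanumeric noise,
    lowercase, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)               # HTML/markup tags
    text = re.sub(r"https?://\S+", " ", text)          # URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # non-alphanumeric noise
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace

print(preprocess("<p>Check https://example.com NOW!!</p>"))
```

Which steps to keep is task-dependent: for sentiment analysis, for example, punctuation and emoticons can carry signal and are sometimes worth preserving.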
Data collection is one of the most important and crucial aspects of a Sentiment Analysis application. Despite the wide adoption of machine learning models, simply having large datasets for a domain-specific task does not ensure superior performance. The performance of the model depends on the quality of the dataset and its labelling/annotation. As ML models learn from the data they are trained on, automatic predictions are likely to mirror the human disagreement identified during annotation. As a result, having proper guidelines for annotating data is also of utmost importance (Mohammad, S. 2016).
Scenario: a website that captures movie, car, restaurant, or product…
In the pre-word-embedding era, statistics-based text vectorization techniques such as N-grams, BoW, TF-IDF, word co-occurrence counting, and matrix weighting were not able to properly model the entire context around a word (Turney, P. D., & Pantel, P. 2010). They also suffered from text sparsity and could not handle the long sequential nature of text. All of these methods are explained in my previous post on Traditional Text Vectorization Techniques in NLP.
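For illustration, here is a Bag-of-Words vectorizer in a few lines of plain Python; the toy corpus makes its two weaknesses visible: the vectors are mostly zeros (sparsity), and word order, and hence context, is discarded entirely:

```python
from collections import Counter

def bow_vectorize(docs):
    """Bag-of-Words: one count vector per document over a shared vocabulary.
    Word order (and thus context) is discarded; most entries are zero."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

vocab, vecs = bow_vectorize(["the cat sat", "the dog sat on the mat"])
print(vocab)  # shared vocabulary, alphabetically sorted
print(vecs)   # one sparse count vector per document
```

TF-IDF reweights these same counts by inverse document frequency; word embeddings replace them altogether with dense, context-aware vectors.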
Data Science Enthusiast. Love Applied Research.