Challenges Handling Imbalanced Text Data

Saurabhk
3 min read · Nov 16, 2020
Image by Author

A Machine Learning (ML) model tends to perform better when it has sufficient data and balanced class labels.

Imbalanced text data means an uneven distribution of class labels in the dataset. The skew can occur at any ratio (1:10, 1:100, etc.). Such a skewed distribution of class labels results in poor classification/predictive performance, because the model is unable to generalize well on the minority class.
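As a minimal sketch of what such a ratio looks like in practice, the snippet below counts class labels in a hypothetical spam/ham dataset (the labels and counts are made up for illustration):

```python
from collections import Counter

# Hypothetical class labels for a small spam/ham dataset.
labels = ["ham"] * 90 + ["spam"] * 10

counts = Counter(labels)
majority = max(counts.values())   # 90
minority = min(counts.values())   # 10
ratio = majority // minority      # roughly a 1:9 imbalance
print(counts, f"imbalance ratio 1:{ratio}")
```

A quick check like this is usually the first step before deciding which of the techniques below is worth applying.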

In fact, most real-world datasets have unevenly distributed class labels, and often the minority class is the more important one, as in clickbait or spam detection. Such imbalanced datasets pose challenges while building, training, and evaluating an ML model. Below are examples of some publicly available imbalanced text datasets.

In the case of a spam/ham dataset: spam texts tend to use a formal conversational style, are somewhat longer, and avoid short forms such as LOL or ROFL and even emoji, unlike ham texts.
In the case of a clickbait/non-clickbait dataset: clickbait texts are short, attractive, catchy lines that often rely on adjectives, unlike non-clickbait texts.

So training these imbalanced spam and clickbait datasets on a standard ML model, along with a certain level of feature engineering, can yield good performance. But that is not the case with other text classification tasks such as subjective/objective classification, emotion analysis, or Twitter sentiment analysis. Feature engineering is also sometimes much harder to get right.

Several techniques are used to handle class imbalance. I have listed some of the drawbacks related to these techniques below.

1. Oversampling Minority Class Label:

The obvious way to handle data imbalance is to collect more minority-class data, but the flip side is that data annotation/labeling is labor-intensive and time-consuming. There is also a risk of human biases creeping in during labeling, which must be properly dealt with.
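When collecting more data is not feasible, the simplest alternative is random oversampling: duplicating minority samples until the classes match. The sketch below uses made-up (text, label) pairs to illustrate the idea:

```python
import random

random.seed(0)

# Hypothetical imbalanced dataset of (text, label) pairs.
data = [("long formal message", "ham")] * 8 + [("WIN cash now!!!", "spam")] * 2

spam = [d for d in data if d[1] == "spam"]
ham = [d for d in data if d[1] == "ham"]

# Random oversampling: draw minority samples with replacement
# until both classes are the same size.
extra = [random.choice(spam) for _ in range(len(ham) - len(spam))]
balanced = ham + spam + extra
```

Note that this only duplicates exact copies, which can encourage overfitting to the few minority samples; genuinely new labeled data, despite the annotation cost, is usually more valuable.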

2. Undersampling Majority Class Label:

Removing data points from the majority class runs the risk of losing the most representative samples, those most crucial in characterizing that class. Whether deleting samples from the majority class is acceptable depends on the size of the dataset and the distribution of class labels.
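A minimal sketch of random undersampling, using hypothetical texts, makes the risk concrete: the kept subset is chosen at random, so nothing guarantees the most representative majority samples survive.

```python
import random

random.seed(42)

# Hypothetical texts: 100 majority (ham) vs 10 minority (spam).
ham = [f"ham text {i}" for i in range(100)]
spam = [f"spam text {i}" for i in range(10)]

# Random undersampling: keep only a random subset of the majority
# class, the same size as the minority class.
kept_ham = random.sample(ham, len(spam))
balanced = [(t, "ham") for t in kept_ham] + [(t, "spam") for t in spam]
```

Here 90 of the 100 ham samples are simply discarded, which is why this option is only sensible when the dataset is large enough to spare them.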

3. Generating Synthetic Data:

Synthetically generated samples should come from a similar distribution to the original dataset; otherwise, you cannot expect sensible predictions from the model. Newly generated text should also be representative of its class label, or you risk adding attributional bias.

For non-textual data, SMOTE (Synthetic Minority Oversampling Technique) is commonly used; it primarily relies on a nearest-neighbors algorithm to generate artificial data points, and several variations of it exist. Specifically for text, several methods can be taken up, which are mentioned in my other post on Data Augmentation for Text.
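The core SMOTE idea, interpolating a minority sample toward one of its nearest neighbors, can be sketched in a few lines of NumPy. This is a simplified illustration, not the full algorithm (libraries such as imbalanced-learn implement the complete version), and the feature matrix here is random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class feature vectors (e.g. TF-IDF rows).
X_min = rng.random((5, 4))

def smote_like(X, n_new, k=2, rng=rng):
    """Generate synthetic points by interpolating a random sample
    toward one of its k nearest neighbours (the core SMOTE idea)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # Euclidean distances from sample i to all samples.
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

X_new = smote_like(X_min, n_new=3)
```

Because each synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies, which is exactly why SMOTE struggles with raw text: interpolating between two sentences does not produce a valid sentence, hence the text-specific augmentation methods mentioned above.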

Also, you shouldn't strive for perfectly balanced training data. Given the nature of most ML models and real-world datasets, it is important to adjust these trade-offs properly and find the right fit.

Reference:

Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence. https://doi.org/10.1007/s13748-016-0094-0
