5 Data Augmentation Techniques for Text Classification

Saurabhk
Nov 20, 2020

In a supervised machine learning (ML) task, the data comes with associated class labels that are used to train the ML model. An ML model also tends to perform better when the data has sufficient and representative samples for each class label. In Natural Language Processing (NLP), data augmentation is much harder than in numerical or computer vision tasks. The major challenge is the availability of annotated data, especially for low-resource languages. Collecting and annotating additional data is an obvious choice, but it is time-consuming and comes with its own challenges. To overcome this, we can use data augmentation techniques.

Data augmentation is a data oversampling technique that increases the size of a dataset by adding new samples that follow a distribution similar to the original data, or by marginally altering the original data. The data needs to be altered in a way that preserves the class label, so that the augmented samples actually help the classification task.

In this post, I will primarily address data augmentation for text classification. Some useful techniques are listed below.

1. Translation:

Suppose we want to build a text classifier for a specific domain in the Hindi language, and no labeled classification dataset exists. We can then use a machine translation service to translate already available labeled data from another language such as English (which has many publicly available classification datasets across various domains). The text translated from the English dataset into Hindi can serve as training data.
Though there is a risk of adding bias from the translation system, manually rechecking the translated text could still save time and make the data creation process much faster.
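
Here is a minimal sketch of this idea using the Hugging Face transformers library. The MarianMT English-to-Hindi checkpoint name is an assumption for illustration; any translation model or API could be swapped in.

```python
# A minimal sketch of translation-based augmentation with MarianMT.
# The checkpoint name below is an assumption; substitute whichever
# English -> Hindi model or translation service you actually use.
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-hi"  # assumed checkpoint name
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def translate(texts):
    """Translate a batch of English sentences into Hindi."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Labeled English samples become labeled Hindi samples.
english_samples = [("The movie was fantastic!", "positive")]
translations = translate([text for text, _ in english_samples])
hindi_samples = [(hi, label) for (_, label), hi in zip(english_samples, translations)]
print(hindi_samples)
```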

2. Backtranslation:

Backtranslation uses a machine translation service to translate text from a source language (e.g. English) into a target language (e.g. Spanish); the translated text in the target language (Spanish) is then translated back into the source language (English). This approach relies on the biases of the machine translation system to generate variation. The slight variations produced by back-translating the text can serve as additional training samples.
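
A quick way to try this is the nlpaug library (listed at the end of this post). The English-German model pair below is an assumption for illustration; any pair of translation models in opposite directions would do.

```python
# A minimal backtranslation sketch using nlpaug.
import nlpaug.augmenter.word as naw

back_translation = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",  # English -> German (assumed pair)
    to_model_name="facebook/wmt19-de-en",    # German -> back to English
)

text = "The quick brown fox jumps over the lazy dog."
augmented = back_translation.augment(text)
print(augmented)  # a paraphrase-like variation of the original sentence
```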

3. Synonym Word Replacement

Synonym word replacement means identifying a word/token in the original sentence (one that is not a stop word) and replacing it with an appropriate synonym. The variation formed by the synonym replacement is treated as a synthetic sample. This data augmentation technique can be achieved in either of the two ways described below.

3.1 Word Embedding based Replacement:
Pretrained word embeddings like GloVe, Word2Vec, and fastText can be used to find the nearest word vector in the latent space and use it as a replacement in the original sentence.
More recent contextual bidirectional embeddings such as ELMo and BERT can be used for greater reliability, since their vector representations are much richer: Bi-LSTM and Transformer based models encode longer text sequences and are contextually aware of the surrounding words.
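
Below is a minimal sketch of embedding-based replacement using gensim's pretrained GloVe vectors. The checkpoint name, the toy stop-word list, and the random word choice are assumptions for illustration.

```python
# A minimal sketch of embedding-based synonym replacement with GloVe vectors.
import random
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pretrained GloVe embeddings
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and"}  # toy stop list

def embedding_replace(sentence: str) -> str:
    """Replace one random non-stop-word with its nearest embedding neighbor."""
    tokens = sentence.split()
    candidates = [i for i, t in enumerate(tokens)
                  if t.lower() not in STOP_WORDS and t.lower() in vectors]
    if not candidates:
        return sentence
    i = random.choice(candidates)
    # most_similar returns (word, cosine similarity) pairs, nearest first.
    neighbor = vectors.most_similar(tokens[i].lower(), topn=1)[0][0]
    tokens[i] = neighbor
    return " ".join(tokens)

print(embedding_replace("The movie was absolutely wonderful"))
```

Note that nearest neighbors in embedding space are not always true synonyms (antonyms often live nearby too), which is one reason contextual models can be more reliable here.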

3.2 Lexical based Replacement:
WordNet is a lexical database for English that contains word meanings, hyponyms, and other semantic relations. It can be used to find a synonym for the token/word in the original sentence that needs to be replaced. NLP packages such as NLTK and spaCy can be used to find and replace synonyms in the original sentence.
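
A minimal WordNet-based sketch using NLTK might look like the following; the whitespace tokenization and random choices are simplifications for illustration.

```python
# A minimal WordNet synonym replacement sketch using NLTK.
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def wordnet_synonyms(word):
    """Collect distinct synonyms of `word` across its WordNet synsets."""
    lemmas = {
        lemma.name().replace("_", " ")
        for synset in wordnet.synsets(word)
        for lemma in synset.lemmas()
    }
    lemmas.discard(word)
    return sorted(lemmas)

def lexical_replace(sentence):
    """Replace one random word that has at least one WordNet synonym."""
    tokens = sentence.split()
    replaceable = [i for i, t in enumerate(tokens) if wordnet_synonyms(t)]
    if not replaceable:
        return sentence
    i = random.choice(replaceable)
    tokens[i] = random.choice(wordnet_synonyms(tokens[i]))
    return " ".join(tokens)

print(lexical_replace("The movie was absolutely wonderful"))
```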

4. Generative Models

Pretrained language models such as BERT, RoBERTa, BART, or the more recent T5 can be used to generate text in a more class-label-preserving manner [2]. The model is conditioned on the class label along with its associated text sequences, so it generates new samples with slight modifications. This approach is usually more reliable, and the generated samples are more representative of the associated class label.
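
As a simplified illustration of label conditioning, the sketch below prepends the class label to a masked sentence and lets a BERT fill-mask pipeline propose in-context completions. This is a toy version of the idea, not the exact fine-tuning recipe from [2].

```python
# A minimal sketch: condition a masked language model on the class label
# by prepending it to the input, then take the top completions as
# candidate augmented samples.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

label = "positive"
text = "The movie was [MASK] and I enjoyed every minute."
# Prepend the class label so the completions are (weakly) conditioned on it.
for candidate in fill_mask(f"{label} : {text}")[:3]:
    print(candidate["sequence"], round(candidate["score"], 3))
```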

5. Random Insertion / Swapping / Deletion

5.1 Random Insertion: Find a synonym for a randomly chosen word in the sentence that is not a stop word, and insert that synonym at a random position in the sentence.

5.2 Random Swapping: Randomly choose two words in the sentence and swap their positions. This may not be a good idea in morphologically rich languages like Hindi or Marathi, as it may entirely change the meaning of the sentence.

5.3 Random Deletion: Randomly remove words from the sentence, each with some small probability.
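
Here is a minimal sketch of all three operations in the spirit of EDA [1]; the stop-word list, toy synonym table, and per-word probability are illustrative assumptions.

```python
# A minimal sketch of the three random EDA-style operations.
import random

STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "in"}

def random_insertion(tokens, synonyms, n=1):
    """Insert synonyms of random non-stop-words at random positions."""
    tokens = tokens[:]
    candidates = [t for t in tokens if t not in STOP_WORDS and t in synonyms]
    for _ in range(n):
        if not candidates:
            break
        word = random.choice(candidates)
        tokens.insert(random.randrange(len(tokens) + 1),
                      random.choice(synonyms[word]))
    return tokens

def random_swap(tokens):
    """Swap the positions of two randomly chosen words."""
    tokens = tokens[:]
    i, j = random.sample(range(len(tokens)), 2)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each word independently with probability p."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]  # never return an empty sentence

sentence = "the quick brown fox jumps over the lazy dog".split()
synonyms = {"quick": ["fast", "speedy"]}  # toy synonym table
print(random_insertion(sentence, synonyms))
print(random_swap(sentence))
print(random_deletion(sentence))
```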

These operations may not work for many languages, as every language has a different semantic and syntactic form. Also, swaps, insertions, and deletions are very sensitive to the length of the input sentence. The modified sentences have to be explicitly rechecked to verify that the changes preserve the class label.

Data augmentation helps increase the number of data samples available to train an ML model. In a data-imbalance scenario, it can be used to generate additional samples for the minority class. In some cases, these augmentations are used to develop a classifier that is resistant to adversarial attacks, and during fine-tuning of an ML model the augmented samples act as a regularizer that helps control overfitting.

Data Augmentation Python Libraries

nlpaug
TextAugment

References

[1] J. Wei and K. Zou, “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks,” EMNLP-IJCNLP, 2019, doi: 10.18653/v1/D19-1670.

[2] V. Kumar, A. Choudhary, and E. Cho, “Data Augmentation Using Pre-trained Transformer Models,” arXiv, 2020.
