Text data collected from sources such as online portals and social media is diverse and messy. Text preprocessing is an important data preparation step performed on a raw text dataset. It involves normalizing the text and stripping unwanted non-textual content or markup tags that act as noise for any Information Extraction or Machine Learning task.
The text preprocessing step is primarily performed before text vectorization, to remove unwanted noise and reduce the vocabulary size. To learn more about vectorization, read Traditional Text Vectorization Techniques in NLP.
Some common text preprocessing steps are described below:
1. Lowercasing
Lowercasing is a popular practice because it reduces the vocabulary size: every token/word of the input text is lowercased. In some cases, however, it increases ambiguity, because capitalization carries part of a token's meaning.
Example: Apple is a company while apple is a fruit. Similarly, US refers to the United States while us is a pronoun.
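A minimal sketch using Python's built-in str.lower():

```python
text = "Apple announced that the US launch is delayed."
print(text.lower())
# apple announced that the us launch is delayed.
# Note: "Apple" (the company) and "US" (the country) lose their
# distinguishing capitalization and become ambiguous.
```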
2. Stop Word Removal
Stop words such as “the”, “a”, and “is” are frequently occurring words in the English language that carry little meaning in a sentence/document. Removing them helps reduce the vocabulary size, since large sparse representations are expensive to compute. In certain cases, punctuation is removed as well.
Example: This is a sample sentence showing off the stop word filtration.
After stop word removal: [‘This’, ‘sample’, ‘sentence’, ‘showing’, ‘stop’, ‘word’, ‘filtration’, ‘.’]
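A short sketch of this step with NLTK; the stopwords and punkt resources need a one-time download:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

sentence = "This is a sample sentence showing off the stop word filtration."
stop_words = set(stopwords.words("english"))
# The stop-word list is lowercase, so the capitalized "This" survives;
# lowercase the tokens first if it should be removed too.
filtered = [t for t in word_tokenize(sentence) if t not in stop_words]
print(filtered)
# ['This', 'sample', 'sentence', 'showing', 'stop', 'word', 'filtration', '.']
```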
3. Lemmatization
Lemmatization considers the meaning of a word in its sentence and reduces it to a root word (lemma) that is present in the dictionary/vocabulary. The main idea behind lemmatization is to reduce sparsity, but at the cost of ignoring the emphasis of the original term, and it is rarely used in recent Deep Learning based approaches.
Example: good, better, and best are all lemmatized to the word good since they share the same base meaning.
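A minimal sketch with NLTK's WordNet lemmatizer, which needs the word's part of speech to resolve irregular forms:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
# pos="a" marks the input as an adjective, letting the irregular
# comparative resolve to its dictionary lemma
print(lemmatizer.lemmatize("better", pos="a"))  # good
```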
4. Acronym and Shorthand Expansion
An acronym dictionary maps frequently used acronyms and shorthand to their English expansions. This dictionary is used to replace such tokens with their full forms for better insight.
Example: lol is expanded to laughing out loud, and they’ll is expanded to they will.
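A minimal dictionary-based sketch; the EXPANSIONS mapping below is a toy assumption, real systems use much larger curated lists:

```python
import re

# Toy acronym/shorthand dictionary (illustrative only)
EXPANSIONS = {
    "lol": "laughing out loud",
    "they'll": "they will",
    "idk": "i do not know",
}

def expand(text: str) -> str:
    # \b word boundaries prevent matches inside longer words
    # (e.g. "lol" at the start of "lollipop")
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, EXPANSIONS)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: EXPANSIONS[m.group(0).lower()], text)

print(expand("lol, they'll be late"))  # laughing out loud, they will be late
```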
5. Handling Misspelled Words
Text from social media often contains misspelled words, and these words frequently cause out-of-vocabulary (OOV) issues with pre-trained word embeddings like GloVe and Word2Vec. The TextBlob library has a module that corrects spelling, which can help address this OOV issue.
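For instance, with TextBlob's correct() method (the example below follows the TextBlob documentation):

```python
from textblob import TextBlob

blob = TextBlob("I havv goood speling!")
print(blob.correct())  # I have good spelling!
```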
6. Unwanted Text Removal
Most data scraped from the web contains HTML tags, unnecessary special characters inserted between text, and undesired accented characters that may have appeared due to bad encoding/formatting. Removing this text is an important step that keeps the most essential text data and discards unwanted noise.
Example: removal of all <HTML> tags and stray characters such as Á, à, [, etc.
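A minimal cleaning sketch using only the Python standard library; exactly which characters to keep depends on the downstream task:

```python
import html
import re
import unicodedata

def clean(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags (naive; use an HTML parser for messy markup)
    text = html.unescape(text)                  # decode entities like &amp;
    text = unicodedata.normalize("NFKD", text)  # split accented chars into base + accent
    text = text.encode("ascii", "ignore").decode("ascii")  # drop the accent marks
    text = re.sub(r"[^\w\s.,!?']", " ", text)   # remove leftover special characters
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

print(clean("<p>Café &amp; crème brûlée</p>"))  # Cafe creme brulee
```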
Recent deep learning models in NLP, trained on huge datasets, can avoid much of this additional preprocessing. Models like BERT and its family rely on subword tokenization: WordPiece in BERT's case, and Byte Pair Encoding (BPE), a data compression technique that greedily merges frequent sub-word units, in relatives like RoBERTa and GPT-2. Combined with a transformer-based architecture, this yields high-quality dense contextual embeddings.
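For instance, with the Hugging Face transformers library (assuming the pretrained bert-base-uncased checkpoint is available), rare words are split into known subword pieces instead of becoming OOV:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# The word is broken into pieces from the WordPiece vocabulary;
# "##" marks a piece that continues the previous token.
print(tokenizer.tokenize("embeddings"))  # e.g. ['em', '##bed', '##ding', '##s']
```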
Python packages for Text Preprocessing:
TextBlob
NLTK
spaCy