For binary text classification, I use Word2Vec, GloVe and FastText embeddings together with neural network architectures such as CNN and RNN (LSTM and Bi-LSTM).
Let's briefly discuss these word embeddings and neural network architectures, and look at the experimental setup used in my experiments.
A word embedding is a learned representation for text in which words with similar meanings have similar representations. The experiments use pre-trained Word2Vec, GloVe and FastText embeddings.
See my article for details: Word Embedding: New Age Text Vectorization in NLP
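To make "similar representation" concrete: closeness between embedding vectors is commonly measured with cosine similarity. Here is a minimal sketch in plain Python. The 3-dimensional vectors are made-up toy values for illustration only, not taken from Word2Vec, GloVe or FastText, whose vectors typically have 100 or more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not values from any real pre-trained embedding.
toy_embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "terrible": [-0.7, 0.6, 0.1],
}

sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["terrible"])
```

With these toy values, `sim_close` is larger than `sim_far`, which is the behaviour one hopes for from a trained embedding.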
Neural Network architecture
In the experiments I use single-channel CNN, multi-channel CNN, LSTM and Bi-LSTM architectures.
CNN (Single & Multi Channel)
Convolutional neural networks (CNN) are effective at text classification because they can pick out salient features (e.g. tokens or sequences of tokens) in a way that is invariant to their position within the input sequence.
The diagram shows a CNN setup with six filters, two each of size (2, 5), (3, 5) and (4, 5), applied to the embedding matrix. Each rectangular filter covers 2, 3 or 4 consecutive token embeddings, so it tries to capture a salient n-gram feature from the corresponding tokens. The filters stride downwards along the token axis. Each convolution output is then subsampled with a max-pooling operation, and the pooled values are concatenated into a final vector that is passed to a softmax activation for prediction.
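The convolution and pooling steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code from the notebooks: the embedding dimension of 5 is inferred from the filter sizes, the 7-token sentence is hypothetical, the ReLU activation is an assumed choice, and random numbers stand in for both the embedding matrix and the learned filter weights.

```python
import random

def conv_max_pool(embedded, kernel):
    """Slide one (height x dim) filter down the token axis, then max-pool."""
    height = len(kernel)
    activations = []
    for start in range(len(embedded) - height + 1):
        window = embedded[start:start + height]
        total = sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        activations.append(max(0.0, total))  # ReLU, an assumed choice
    return max(activations)  # max-over-time pooling: one value per filter

rng = random.Random(0)
embedding_dim = 5  # assumed from the (n, 5) filter sizes
num_tokens = 7     # hypothetical sentence length

# Random stand-ins for the sentence's embedding matrix and the learned filters.
embedded = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
            for _ in range(num_tokens)]
filters = [
    [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(height)]
    for height in (2, 3, 4)
    for _ in range(2)  # two filters per size, six in total
]

# One pooled value per filter, concatenated into the final feature vector.
features = [conv_max_pool(embedded, kernel) for kernel in filters]
```

The resulting `features` list has six entries, one per filter, regardless of sentence length. That fixed size is what lets a dense output layer sit on top.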
RNN (LSTM, Bi-LSTM)
In recurrent neural networks (RNN), predictions are made sequentially, and the hidden state from one step is fed into the hidden layer at the next step. This gives the network "memory", in the sense that the results of previous steps can inform future predictions.
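A minimal sketch of this recurrence, assuming a scalar hidden state and a tanh activation to keep the arithmetic readable. The weights `w_x`, `w_h` and `b` are arbitrary illustrative values, not trained parameters.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Return the hidden state after each step of a scalar tanh RNN."""
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first input is non-zero; later states still carry its trace.
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
```

Even though the second and third inputs are zero, the corresponding states are non-zero, which is the "memory" effect in miniature.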
An LSTM unit additionally has internal gates (input, forget and output, plus a candidate cell update, which is why four sets of weights are often counted) that give it finer control over its memory: how much of the current input to take in, how much of what was previously learned to keep, and what to pass on as output. A bidirectional LSTM trains two LSTMs instead of one on the input embedding sequence, one reading it forwards and the other backwards. The hidden states of the two at each time step are concatenated before being forwarded to another Bi-LSTM layer, which gives the model a better understanding of the context of the sequence. The output vectors of the final layer are passed through a fully connected layer, and a softmax activation is applied to get the prediction.
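The gating and the bidirectional concatenation can be sketched with a scalar LSTM in plain Python. This is a toy illustration under assumptions: inputs, hidden state and cell state are single numbers rather than vectors, and the weights in `params` are arbitrary values, not trained ones. The function names are hypothetical helpers, not a library API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM; p maps a name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to admit
    f = sigmoid(pre("forget"))        # how much old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell to expose
    g = math.tanh(pre("candidate"))   # candidate cell update
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

def run_lstm(xs, p):
    """Hidden state at every time step, starting from zero state."""
    h, c = 0.0, 0.0
    hidden = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        hidden.append(h)
    return hidden

def run_bilstm(xs, p_fwd, p_bwd):
    """Pair forward and backward hidden states at each time step."""
    forward = run_lstm(xs, p_fwd)
    backward = run_lstm(xs[::-1], p_bwd)[::-1]  # re-align to original order
    return list(zip(forward, backward))

# Arbitrary illustrative weights, shared by both directions for simplicity.
params = {name: (0.5, 0.5, 0.0)
          for name in ("input", "forget", "output", "candidate")}
outputs = run_bilstm([1.0, -1.0, 0.5], params, params)
```

Each element of `outputs` is a (forward, backward) pair, so every time step sees context from both the left and the right of the sequence. In a real Bi-LSTM the two directions have separate weights.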
To track and compare results across several experiments, reproducibility is taken care of. The hyperparameters are specific to each experiment and vary with the choice of neural network architecture or embedding.
The general details of the experiments are listed below.
- Neural networks are stochastic by nature, so to ensure reproducibility a fixed seed value is set for Python libraries such as NumPy, Keras and TensorFlow.
- The dataset is split into train, validation and test sets with stratification enabled and a fixed seed value.
- Text preprocessing is done separately, and the same pickle (.pkl) file is used as input to every experiment.
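The idea behind a seeded, stratified split can be sketched in plain Python. This stands in for whatever library call the notebooks actually use. The seed value 42, the helper name `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

SEED = 42  # hypothetical value; any fixed integer gives repeatable splits

def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx) preserving the label ratio in each part."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # deterministic because rng is seeded
        cut = round(len(indices) * test_fraction)
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 50 + [1] * 50  # toy balanced dataset
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=SEED)
```

Calling the function again with the same seed returns the same indices, and each part keeps the 50/50 class balance. In a NumPy/TensorFlow setup the equivalent seeding is typically done with calls such as `numpy.random.seed` and `tf.random.set_seed`.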
The general hyperparameter configuration is listed below, though the exact values may vary between experiments.
- The embedding vocabulary size is set to 20,000 tokens for all the embeddings (Word2Vec, GloVe, FastText) included in training.
- The sequence length for reviews is set to 294 tokens (i.e. mean review length + 2 × standard deviation of review length). Longer sequences are truncated.
- Dropout and recurrent dropout are used to avoid overfitting and improve the generalization of the CNN and RNN models.
- The batch size is set to 64, and the number of epochs is set to 4 or more.
- The optimizer is Adam and the loss function is binary cross-entropy.
- The model is optimized for the accuracy metric because the dataset is balanced.
- For early stopping, the patience value is set to 3, so that training stops once model performance stops improving on the hold-out validation set.
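The patience rule for early stopping can be sketched as follows. This assumes validation accuracy is the monitored metric (the notebooks may monitor validation loss instead), and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation score has failed to improve for
    `patience` consecutive epochs; otherwise it runs through all epochs.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_scores)

# Made-up validation accuracies, one per epoch.
val_accuracy = [0.80, 0.85, 0.87, 0.86, 0.87, 0.865, 0.84]
stop_epoch = early_stopping_epoch(val_accuracy, patience=3)
```

Here the best score is reached at epoch 3, the next three epochs bring no improvement, and training stops at epoch 6 without running the seventh.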
Additional details about padding, truncation and other parameters specific to each neural network architecture and embedding are documented in the respective .ipynb files (refer to the above-mentioned git repo).
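As a rough sketch of the sequence-length rule (mean + 2 × standard deviation) and of truncation and padding, in plain Python. The review lengths are toy values. Using the population standard deviation, padding on the right with id 0 and cutting the tail of long sequences are all assumptions, and the notebooks may make different choices.

```python
import statistics

def max_sequence_length(lengths):
    """mean + 2 * standard deviation of review lengths, truncated to an int."""
    return int(statistics.mean(lengths) + 2 * statistics.pstdev(lengths))

def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut sequences longer than max_len; right-pad shorter ones with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

review_lengths = [2, 4, 4, 4, 5, 5, 7, 9]  # toy token counts per review
max_len = max_sequence_length(review_lengths)

short_review = pad_or_truncate([11, 12, 13], 5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5)
```

For the toy lengths the mean is 5 and the standard deviation is 2, so `max_len` comes out as 9. The same rule applied to the real review lengths is what yields the 294-token limit mentioned above.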
The Bi-LSTM with pre-trained 100-dimensional GloVe embeddings performs best in my experiments. Setting the embedding layer's trainable parameter to true is also seen to have a significant impact on classification.
Further reading: Understanding LSTM Networks (posted on August 27, 2015)