Text Mining and Sentiment Analysis with Keras

Within text mining, sentiment analysis, or gauging the sentiment of a chunk of text based on its words, is becoming increasingly popular.

Here is how we can conduct sentiment analysis using Keras.

Text Mining: IMDB Movie Reviews Sentiment

In this instance, we are going to use the IMDB movie reviews sentiment classification dataset available from the Keras library.

Essentially, we have training and test data that has been classified as either positive (pos) or negative (neg).

For instance, here’s an example of text that was classified as positive:

“This was the Modesty that we didn’t know! It was hinted at and summarized in the comic strip for the syndicates to sell to newspapers! Lee and Janet Batchler were true Modesty Blaise fans who were given The Dream Job – tell a prequel story of Modesty that the fans never saw before. In their audio-commentary, they admitted that they made changes in her origin to make the story run smoother. The “purists” should also note that we really don’t know if everything she told Miklos was true because she was “stalling for time.” I didn’t rent or borrow the DVD like other “reviewers” did, I bought it! And I don’t want a refund! I watched it three times and I didn’t sleep through it! Great dialog and well-drawn characters that I cared about (even bad guy Miklos) just like in the novels and comic strips! I too can’t wait for the next Modesty (and Willie) film, especially if this “prequel” is a sign of what’s to come!”

Now, here’s an example of one classified as negative:

“I saw this film at its New York’s High Falls Film Festival screening as well and I must say that I found it a complete and awful bore. Although it was funny in some places, the only real laughs was that there appeared to be no real plot to talk about and the acting in some places was dreadful and wooden, especially the “Lovely Lady” and the voice of the narrator (whom I have never heard of) had a lot to be desired. J.C.Mac was, I felt, the redeeming feature of this film, true action and grit and (out of the cast) the only real acting. I am sure with another cast and a tighter reign on the directing, this could have been a half decent film. Let us just hope that it is not sent out on general release, or if you really want a copy, look in the bargain bin in Lidl.”

Essentially, we want Keras to take the positive and negative examples and train a model so that, when new text is fed in, it accurately identifies the sentiment of that text.

Model Generation

Firstly, the relevant libraries are loaded:

import numpy 
from keras.datasets import imdb 
from keras.models import Sequential 
from keras.layers import Dense 
from keras.layers import LSTM 
from keras.layers.embeddings import Embedding 
from keras.preprocessing import sequence
import keras

Out of all the words that appear in the IMDB movie reviews, we are only going to consider the 2,000 most frequently occurring words for the purposes of this model.

numpy.random.seed(7)
top_words = 2000 
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

# Rebuild the word index so the integer-encoded reviews can be decoded.
# imdb.load_data reserves indices 0-2 for special tokens, so the raw
# word index has to be offset by 3.
INDEX_FROM = 3
word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k: (v + INDEX_FROM) for k, v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {value: key for key, value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in X_train[0]))

When we print the output, we can see the first training review decoded back into words. Words outside the top 2,000 were mapped to the <UNK> placeholder (the special tokens have been stripped from the rendering below, leaving gaps):

 this film was just brilliant casting location scenery story direction  really  the part they played and you could just imagine being there robert  is an amazing actor and now the same being director  father came from the same  island as myself so i loved the fact there was a real  with this film the witty  throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for  and would recommend it to everyone to watch and the   was amazing really  at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also  to the two little  that played the  of  and paul they were just brilliant children are often left out of the  list i think because the stars that play them all  up are such a big  for the whole film but these children are amazing and should be  for what they have done don't you think the whole story was so lovely because it was true and was  life after all that was  with us all
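As a quick check (not part of the original output), we can also look at the matching label; in this dataset 1 denotes a positive review and 0 a negative one:

print(y_train[0])  # 1, i.e. this review is labelled positive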

Now, all the input sequences need to be padded (or truncated) to a common length of 500 so they can be fed into the neural network:

max_review_length = 500 
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length) 
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
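To make this concrete, here is a toy illustration (not from the original post) of what pad_sequences does with its default settings: short sequences are zero-padded on the left, and long ones are truncated from the front.

print(sequence.pad_sequences([[1, 2, 3]], maxlen=5))
# [[0 0 1 2 3]]  -- zero-padded at the start
print(sequence.pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5))
# [[2 3 4 5 6]]  -- truncated from the front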

We now have two arrays of equal-length integer sequences, one for the training reviews and one for the test reviews, in a suitable format for feeding into the neural network:

>>> X_train
array([[  0,   0,   0, ...,  19, 178,  32],
       [  0,   0,   0, ...,  16, 145,  95],
       [  0,   0,   0, ...,   7, 129, 113],
       ...,
       [  0,   0,   0, ...,   4,   2,   2],
       [  0,   0,   0, ...,  12,   9,  23],
       [  0,   0,   0, ..., 204, 131,   9]], dtype=int32)
>>> X_test
array([[  0,   0,   0, ...,  14,   6, 717],
       [  0,   0,   0, ..., 125,   4,   2],
       [ 33,   6,  58, ...,   9,  57, 975],
       ...,
       [  0,   0,   0, ...,  21, 846,   2],
       [  0,   0,   0, ...,   2,   7, 470],
       [  0,   0,   0, ...,  34,   2,   2]], dtype=int32)

Now, the neural network model is created and configured:

embedding_vector_length = 32 
model = Sequential() 
# Map each word index to a dense 32-dimensional vector
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length)) 
# A single LSTM layer with 100 units reads the embedded sequence
model.add(LSTM(100)) 
# Sigmoid output gives the probability that a review is positive
model.add(Dense(1, activation='sigmoid')) 
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) 
print(model.summary())

Long Short-Term Memory Network

In this particular instance, we are using a long short-term memory (LSTM) network, a type of recurrent neural network.

As well as for time series analysis, these networks are popular for language modelling, because they are specifically designed to make use of sequential information. A standard feed-forward network, by contrast, would treat each observation as independent of the others.

That is clearly not the case here, as the words in a review are linked to one another through their context.

Here is the output that is generated when we print the model.summary():

>>> print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 500, 32)           64000     
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
=================================================================
Total params: 117,301
Trainable params: 117,301
Non-trainable params: 0
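As a sanity check (not in the original post), the parameter counts above can be reproduced by hand: an Embedding layer has vocabulary-size times embedding-length weights, an LSTM has four gates, each with input, recurrent, and bias weights, and the Dense layer has one weight per LSTM unit plus a bias.

embedding_params = 2000 * 32                          # 64,000
lstm_params = 4 * ((32 + 100) * 100 + 100)            # 4 gates x (input + recurrent + bias) = 53,200
dense_params = 100 * 1 + 1                            # weights + bias = 101
print(embedding_params + lstm_params + dense_params)  # 117,301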

Now, we train the model across 3 epochs, and generate loss and accuracy readings:

history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

import matplotlib.pyplot as plt
print(history.history.keys())
# Loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['loss', 'val_loss'], loc='upper left')
plt.show()
# Accuracy (note: newer versions of Keras name these keys 'accuracy' and 'val_accuracy')
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['acc', 'val_acc'], loc='upper left')
plt.show()

Here are the graphs illustrating the loss and accuracy for our training and test (validation) data:

[Figure: model loss for the training (loss) and validation (val_loss) data across epochs]

[Figure: model accuracy for the training (acc) and validation (val_acc) data across epochs]

Now, we can gauge the accuracy of the model on the test data:

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Upon running this, the model achieves an accuracy of 85.26% on the test data.
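To close the loop on the goal stated at the outset, here is a minimal sketch (not from the original post) of how the trained model could score a new piece of text. It reuses the word_to_id mapping built earlier and relies on naive whitespace tokenization, which is cruder than the preprocessing applied to the original dataset:

def encode_review(text, maxlen=max_review_length):
    # Unknown or out-of-vocabulary words map to the <UNK> token
    ids = [word_to_id.get(w, word_to_id["<UNK>"]) for w in text.lower().split()]
    ids = [i if i < top_words else word_to_id["<UNK>"] for i in ids]
    return sequence.pad_sequences([[word_to_id["<START>"]] + ids], maxlen=maxlen)

prob = model.predict(encode_review("this film was just brilliant"))[0][0]
print("positive" if prob > 0.5 else "negative", prob)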

Conclusion

In this example, we have seen how to:

  • Construct a recurrent (LSTM) neural network
  • Train the network appropriately
  • Use such a network for sentiment analysis

Many thanks for reading, and please feel free to leave any questions or comments below.
