Deep Learning 4: Classifying movie reviews with Neural nets
We'll also cover how to build new model archetypes.
Let’s go tackle some real-world problems.
Two-class (binary) classification is one of the most common machine learning problems. In this post, you’ll get a chance to classify movie reviews as positive or negative, based entirely on the text content of the reviews.
1 - The dataset
You’ll be working with the IMDB dataset, a set of 50,000 highly polarized movie reviews. They’re split into 25,000 reviews for training and 25,000 for testing, and each set is about 50% positive and 50% negative. Luckily, we don’t need to scrape this data; it’s already available through the Keras library. On top of that, the words have already been mapped to integers, basically like a dictionary lookup. This lets us focus entirely on model building, training, and evaluation. We’ll deal with a model meant specifically for text analysis soon enough.
Anyway, let’s load up the data:
from tensorflow.keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Setting num_words=10000 means you’ll only keep the 10,000 most frequently occurring words in the training data. Basically, rare words will be discarded. Without this limit, we’d end up working with about 90,000 unique words in the training data, which is unnecessarily large, since very rare words add little for classification.
The variables train_data and test_data are lists of reviews. Each review is a list of word indices (an encoded sequence of words). train_labels and test_labels are lists of 0s and 1s, where 0 stands for negative and 1 stands for positive.
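To make the cutoff concrete, here is a toy sketch that needs no download. Strictly speaking, Keras does not delete rare words from a review; by default it replaces any index at or above num_words with the out-of-vocabulary index 2. The helper cap_vocabulary and the sample review are made up for illustration and are not part of Keras.

```python
# Toy sketch of the num_words cutoff; nothing here touches the real dataset.
OOV_INDEX = 2  # Keras' default index for out-of-vocabulary words


def cap_vocabulary(sequence, num_words, oov_index=OOV_INDEX):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in sequence]


# A made-up encoded review containing two rare words (indices >= 10000).
review = [1, 14, 22, 9999, 10000, 52341]
capped = cap_vocabulary(review, num_words=10000)
print(capped)  # [1, 14, 22, 9999, 2, 2]
```

After the cutoff, no index in the data is larger than 9,999.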
1.1 Decoding back to English (optional)
If you want to actually read one of these reviews, you’ll need to decode it back to English. Here’s how.
First, you’ll need to download the actual word_index:
word_index = imdb.get_word_index()
Once you’ve done that, you’ll need to reverse it, so that it maps integers to words:
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])
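And finally, apply it to your data set:
# Indices are offset by 3 because 0, 1, and 2 are reserved
# for "padding", "start of sequence", and "unknown".
decoded_review = " ".join(
    [reverse_word_index.get(i - 3, "?") for i in train_data[0]])
Print decoded_review to read the first review in English.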
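If you’d like to check the decoding logic, including the offset of 3, without downloading anything, here is a minimal sketch on a toy vocabulary. The names toy_word_index, toy_reverse_index, and decode are made up for illustration; only the lookup pattern mirrors the code above.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (starting at 1).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Reverse it so that it maps rank -> word.
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}


def decode(encoded_review, reverse_index):
    """Decode a list of indices, undoing the offset of 3.

    Indices 0, 1, and 2 are reserved (padding, start of sequence, unknown),
    so after subtracting 3 they miss the lookup and become "?".
    """
    return " ".join(reverse_index.get(i - 3, "?") for i in encoded_review)


# 1 is the start marker and 2 is an unknown word; the rest are rank + 3.
encoded = [1, 4, 5, 6, 2, 7]
decoded = decode(encoded, toy_reverse_index)
print(decoded)  # ? the movie was ? great
```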
2 - Preparing your data
You can’t directly feed lists of integers to a neural network. They all have different lengths, but a neural network expects to process contiguous batches of data. So we’ll have to turn our lists into tensors.
To handle this, we’ll multi-hot encode our lists to turn them into vectors of 0s and 1s. This would mean turning the sequence [8, 5] into a 10,000-dimensional vector that is all 0s except for indices 8 and 5, which are 1s.
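As a quick illustration of multi-hot encoding, here is a sketch that encodes the sequence [8, 5] into a vector. It assumes a toy dimension of 10 instead of 10,000 so the result is easy to read, and it uses NumPy index assignment rather than an explicit loop.

```python
import numpy as np

sequence = [8, 5]
dimension = 10  # toy size for readability; the post uses 10,000

vector = np.zeros(dimension)
vector[sequence] = 1.0  # sets positions 8 and 5 to 1 in one assignment
print(vector)  # [0. 0. 0. 0. 0. 1. 0. 0. 1. 0.]
```

The order of the words and how often each one appears are both lost; the vector only records which words are present.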
Here’s the basic rundown:
1 - We’ll make a function called vectorize_sequences. It takes two inputs: sequences and dimension.
import numpy as np
def vectorize_sequences(sequences, dimension=10000):
2 - We’ll then make a zero matrix (a matrix where every value is 0) of shape (len(sequences), dimension).
    results = np.zeros((len(sequences), dimension))
3 - Finally, we’ll set the indices that appear in each sequence to 1 and return the result.
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] = 1
    return results
So, here’s the whole thing all together:
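A minimal sketch assembling the three steps above, followed by a toy call. The names toy_sequences and x_toy, and the toy dimension of 10, are illustrative assumptions rather than part of the original walkthrough.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    """Multi-hot encode a list of integer sequences into a 2D float array."""
    # Step 2: start from a zero matrix with one row per sequence.
    results = np.zeros((len(sequences), dimension))
    # Step 3: set the column for every index that appears in a sequence to 1.
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] = 1.0
    return results


# Toy input: two "reviews" of different lengths, encoded with dimension 10.
toy_sequences = [[8, 5], [1, 1, 3]]
x_toy = vectorize_sequences(toy_sequences, dimension=10)
print(x_toy.shape)  # (2, 10)
print(x_toy[0])     # [0. 0. 0. 0. 0. 1. 0. 0. 1. 0.]
```

A repeated index, like the 1 in the second toy review, still produces a single 1. On the real data you would presumably call vectorize_sequences(train_data) and vectorize_sequences(test_data); note that 25,000 rows of 10,000 float64 values take about 2 GB of memory each.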