I wanted to work on an AI computer vision project and finally found something practical. The MNIST dataset is fun for getting started, but I wanted something more challenging.
This post outlines the first steps towards developing an algorithm that could be used as part of a mobile or web app. At the end of this project, the code will accept any user-supplied image as input. If a dog is detected in the image, it will provide an estimate of the dog's breed. If a human is detected, it will provide an estimate of the dog breed that most resembles them. The image below displays potential sample output of the finished project (though each individual's algorithm will behave differently!).
The Road Ahead
This project is divided into a series of steps. This post demonstrates some, but not all, of them. The full code is omitted here; I will include it in a later, more technical post.
- Step 0: Import Datasets
- Step 1: Detect Humans
- Step 2: Detect Dogs
- Step 3: Create a CNN to Classify Dog Breeds (from Scratch)
- Step 4: Use a CNN to Classify Dog Breeds (using Transfer Learning)
- Step 5: Create a CNN to Classify Dog Breeds (using Transfer Learning)
- Step 6: Write Your Algorithm
- Step 7: Test Your Algorithm
Step 0: Import Datasets
Import Dog Dataset
In the code cell below, we import a dataset of dog images. We populate a few variables through the use of the load_files function from the scikit-learn library:
- train_files, valid_files, test_files – numpy arrays containing file paths to images
- train_targets, valid_targets, test_targets – numpy arrays containing one-hot-encoded classification labels
- dog_names – list of string-valued dog breed names for translating labels
- There are 133 total dog categories.
- There are 8351 total dog images.
- There are 6680 training dog images.
- There are 835 validation dog images.
- There are 836 test dog images.
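A minimal sketch of that loading step follows. The directory layout and the load_dataset name are my own assumptions; the original cell used scikit-learn's load_files exactly as described above, and here a one-hot encoding via np.eye stands in for the usual Keras utility:

```python
import numpy as np
from sklearn.datasets import load_files

def load_dataset(path, n_classes=133):
    # load_files treats each subdirectory of `path` as one class;
    # load_content=False keeps only the file paths, not the image bytes
    data = load_files(path, load_content=False)
    files = np.array(data['filenames'])
    # one-hot encode the integer class labels (rows of an identity matrix)
    targets = np.eye(n_classes)[np.array(data['target'])]
    return files, targets

# hypothetical layout: dogImages/{train,valid,test}/<breed>/<image>.jpg
# train_files, train_targets = load_dataset('dogImages/train')
```

Calling load_dataset once per split populates the three pairs of variables listed above.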
Import Human Dataset
In the code cell below, we import a dataset of human images, where the file paths are stored in the numpy array human_files.
Step 1: Detect Humans
I use OpenCV’s implementation of Haar feature-based cascade classifiers to detect human faces in images. OpenCV provides many pre-trained face detectors, stored as XML files on github. I have downloaded one of these detectors and stored it in the haarcascades directory.
In the next code cell, I demonstrate how to use this detector to find human faces in a sample image.
Before using any of the face detectors, it is standard procedure to convert the images to grayscale. The detectMultiScale function executes the classifier stored in face_cascade and takes the grayscale image as a parameter.
In the above code, faces is a numpy array of detected faces, where each row corresponds to a detected face. Each detected face is a 1D array with four entries that specifies the bounding box of the detected face. The first two entries in the array (extracted in the above code as x and y) specify the horizontal and vertical positions of the top left corner of the bounding box. The last two entries in the array (extracted here as w and h) specify the width and height of the box.
Write a Human Face Detector
We can use this procedure to write a function that returns True if a human face is detected in an image and False otherwise. This function, aptly named face_detector, takes a string-valued file path to an image as input.
(IMPLEMENTATION) Assess the Human Face Detector
Question 1: Use the code cell below to test the performance of the face_detector function.
What percentage of the first 100 images in human_files have a detected human face?
What percentage of the first 100 images in dog_files have a detected human face?
Ideally, I would like 100% of human images to have a detected face and 0% of dog images to have a detected face. We will see that the algorithm falls short of this goal, but still gives acceptable performance. We extract the file paths for the first 100 images from each dataset and store them in the numpy arrays human_files_short and dog_files_short.
- In human_files 97% have a detected human face.
- In dog_files 11% have a detected human face.
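The assessment itself reduces to one small helper (detection_rate is my own name; the detector argument would be the face_detector function described above):

```python
import numpy as np

def detection_rate(file_paths, detector):
    # percentage of images for which `detector` returns True
    return 100.0 * np.mean([detector(f) for f in file_paths])

# e.g. detection_rate(human_files_short, face_detector)
```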
This algorithmic choice necessitates that we communicate to the user that we accept human images only when they provide a clear view of a face (otherwise, we risk having unnecessarily frustrated users!). In your opinion, is this a reasonable expectation to pose on the user? If not, can you think of a way to detect humans in images that does not necessitate an image with a clearly presented face?
Face detection is a reasonable request for many applications, but others come to mind, e.g. self-driving cars, that will not have the luxury of a clearly presented face. For those cases I like the idea of searching for other features such as hands, shoes, and even accessories, e.g. a watch, hat, or suit lapels.
Step 2: Detect Dogs
In this section, we use a pre-trained ResNet-50 model to detect dogs in images. Our first line of code downloads the ResNet-50 model, along with weights that have been trained on ImageNet, a very large, very popular dataset used for image classification and other vision tasks. ImageNet contains over 10 million URLs, each linking to an image containing an object from one of 1000 categories. Given an image, this pre-trained ResNet-50 model returns a prediction (derived from the available categories in ImageNet) for the object that is contained in the image.
Pre-process the Data
When using TensorFlow as backend, Keras CNNs require a 4D array (which we'll also refer to as a 4D tensor) as input, with shape (nb_samples, rows, columns, channels),
where nb_samples corresponds to the total number of images (or samples), and rows, columns, and channels correspond to the number of rows, columns, and channels for each image, respectively.
The path_to_tensor function below takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN. The function first loads the image and resizes it to a square image that is 224×224 pixels. Next, the image is converted to an array, which is then reshaped into a 4D tensor. In this case, since we are working with color images, each image has three channels. Likewise, since we are processing a single image (or sample), the returned tensor will always have shape (1, 224, 224, 3).
The paths_to_tensor function takes a numpy array of string-valued image paths as input and returns a 4D tensor with shape (nb_samples, 224, 224, 3).
Here, nb_samples is the number of samples, or number of images, in the supplied array of image paths. It is best to think of nb_samples as the number of 3D tensors (where each 3D tensor corresponds to a different image) in your dataset!
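A sketch of these two functions; the original presumably used Keras's image utilities, but an equivalent with PIL and NumPy looks like this:

```python
import numpy as np
from PIL import Image

def path_to_tensor(img_path):
    # load the image, force RGB, and resize to a 224x224 square
    img = Image.open(img_path).convert('RGB').resize((224, 224))
    x = np.asarray(img, dtype='float32')   # shape (224, 224, 3)
    return np.expand_dims(x, axis=0)       # shape (1, 224, 224, 3)

def paths_to_tensor(img_paths):
    # stack the single-image tensors into (nb_samples, 224, 224, 3)
    return np.vstack([path_to_tensor(p) for p in img_paths])
```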
Making Predictions with ResNet-50
Getting the 4D tensor ready for ResNet-50, and for any other pre-trained model in Keras, requires some additional processing. First, the RGB image is converted to BGR by reordering the channels. All pre-trained Keras models require an additional normalization step: the mean pixel (expressed in BGR as [103.939, 116.779, 123.68] and calculated from all pixels in all images in ImageNet) must be subtracted from every pixel in each image. This is implemented in the imported function preprocess_input. If you're curious, you can check the code for preprocess_input in the Keras source.
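A minimal NumPy sketch of what preprocess_input does for these models (the function name is mine; the real implementation lives in Keras):

```python
import numpy as np

# ImageNet mean pixel, in BGR channel order
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68])

def caffe_style_preprocess(x):
    # x: 4D tensor (nb_samples, rows, columns, 3) with RGB channels
    x = x[..., ::-1].astype('float64')   # reorder RGB -> BGR
    return x - IMAGENET_MEAN_BGR         # subtract the mean pixel everywhere
```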
Now that we have a way to format our image for supplying to ResNet-50, we are ready to use the model to extract predictions. This is accomplished with the predict method, which returns an array whose i-th entry is the model's predicted probability that the image belongs to the i-th ImageNet category. This is implemented in the ResNet50_predict_labels function.
By taking the argmax of the predicted probability vector, we obtain an integer corresponding to the model’s predicted object class, which we can identify with an object category through the use of this dictionary.
Write a Dog Detector
While looking at the dictionary, you will notice that the categories corresponding to dogs appear in an uninterrupted sequence, spanning dictionary keys 151–268 inclusive, from 'Chihuahua' to 'Mexican hairless'. Thus, to check whether an image is predicted to contain a dog by the pre-trained ResNet-50 model, we need only check whether the ResNet50_predict_labels function above returns a value between 151 and 268 (inclusive).
We use these ideas to complete the dog_detector function below, which returns True if a dog is detected in an image (and False if not).
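A sketch of dog_detector; to keep it self-contained, the prediction function is passed in as a parameter (predict_label would be the ResNet50_predict_labels function described above):

```python
# ImageNet categories 151 ('Chihuahua') through 268 ('Mexican hairless')
# are all dog breeds
DOG_FIRST, DOG_LAST = 151, 268

def dog_detector(img_path, predict_label):
    # predict_label maps an image path to the argmax ImageNet category index
    idx = predict_label(img_path)
    return DOG_FIRST <= idx <= DOG_LAST
```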
(IMPLEMENTATION) Assess the Dog Detector
Question 3: Use the code cell below to test the performance of your dog_detector function.
- What percentage of the images in human_files_short have a detected dog?
- What percentage of the images in dog_files_short have a detected dog?
The optimal model would accurately discriminate between humans and dogs: ideally, 100% of dog_files_short and 0% of human_files_short would report a detected dog.
- In human_files_short, 5% have a detected dog.
- In dog_files_short, 100% have a detected dog.
Step 3: Create a CNN to Classify Dog Breeds (from Scratch)
Now that we have functions for detecting humans and dogs in images, we need a way to predict breed from images. In this step, I create a CNN that classifies dog breeds from scratch (so, no transfer learning yet!); a test accuracy of at least 1% is required to proceed. In Step 5 of this post, I use transfer learning to create a CNN that attains greatly improved accuracy.
Be careful with adding too many trainable layers! More parameters mean longer training, which means you are more likely to need a GPU to accelerate the training process. Thankfully, Keras provides a handy estimate of the time that each epoch is likely to take; you can extrapolate this estimate to figure out how long it will take for your algorithm to train.
We mention that the task of assigning breed to dogs from images is considered exceptionally challenging. To see why, consider that even a human would have great difficulty in distinguishing between a Brittany and a Welsh Springer Spaniel.
It is not difficult to find other dog breed pairs with minimal inter-class variation (for instance, Curly-Coated Retrievers and American Water Spaniels).
Likewise, recall that labradors come in yellow, chocolate, and black. Your vision-based algorithm will have to conquer this high intra-class variation to determine how to classify all of these different shades as the same breed.
We also mention that random chance presents an exceptionally low bar: setting aside the fact that the classes are slightly un-balanced, a random guess will provide a correct answer roughly 1 in 133 times, which corresponds to an accuracy of less than 1%.
Remember that the practice is far ahead of the theory in deep learning. Experiment with many different architectures, and trust your intuition. And, of course, have fun!
Pre-process the Data
We rescale the images by dividing every pixel in every image by 255.
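In code (rescale is my own name for this one-liner, applied to the 4D tensors from paths_to_tensor):

```python
import numpy as np

def rescale(tensors):
    # map pixel values from [0, 255] to [0, 1]
    return tensors.astype('float32') / 255
```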
(IMPLEMENTATION) Model Architecture
Create a CNN to classify dog breed. At the end of your code cell block, summarize the layers of your model by executing the line model.summary(). I have imported some Python modules to get you started, but feel free to import as many modules as you need. If you end up getting stuck, here's a hint that specifies a model that trains relatively fast on CPU and attains >1% test accuracy in 5 epochs:
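As an illustration of what such a from-scratch model might look like, here is a small Keras sketch (the layer sizes are my own, not the original architecture): three conv/pool stages to shrink the spatial dimensions, then global average pooling to keep the parameter count low.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D,
                                     GlobalAveragePooling2D, Dense)

model = Sequential([
    Input(shape=(224, 224, 3)),
    Conv2D(16, 2, activation='relu'),
    MaxPooling2D(2),
    Conv2D(32, 2, activation='relu'),
    MaxPooling2D(2),
    Conv2D(64, 2, activation='relu'),
    MaxPooling2D(2),
    GlobalAveragePooling2D(),
    Dense(133, activation='softmax'),  # one output per dog breed
])
model.summary()
```

Global average pooling in place of Flatten is the main trick for CPU-friendly training: it collapses each feature map to one number, so the final Dense layer stays tiny.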
(IMPLEMENTATION) Train the Model
Train your model in the code cell below. Use model checkpointing to save the model that attains the best validation loss.
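A sketch of the training step with Keras's ModelCheckpoint callback (the helper name, weights path, and batch size are my own assumptions):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

def train_with_checkpointing(model, x, y, val_data, weights_path, epochs=5):
    # save weights only when validation loss improves on the best seen so far
    checkpointer = ModelCheckpoint(filepath=weights_path,
                                   save_weights_only=True,
                                   save_best_only=True, verbose=1)
    history = model.fit(x, y, validation_data=val_data, epochs=epochs,
                        batch_size=20, callbacks=[checkpointer])
    # reload the best checkpoint before evaluating on the test set
    model.load_weights(weights_path)
    return history
```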
Load the Model with the Best Validation Loss
Once satisfied with the model, I write an algorithm that accepts a file path to an image and first determines whether the image contains a human, a dog, or neither. Then,
- if a dog is detected in the image, return the predicted breed.
- if a human is detected in the image, return the resembling dog breed.
- if neither is detected in the image, provide output that indicates an error.
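Tying it together, a sketch of that top-level algorithm (the detector and predictor functions are passed in to keep it self-contained; predict_breed is a hypothetical stand-in for the breed-classification model's prediction function):

```python
def classify_image(img_path, dog_detector, face_detector, predict_breed):
    # dog_detector / face_detector return booleans for the given image;
    # predict_breed returns a breed name string
    if dog_detector(img_path):
        return 'Dog detected! Predicted breed: ' + predict_breed(img_path)
    if face_detector(img_path):
        return 'Human detected! Resembling breed: ' + predict_breed(img_path)
    return 'Error: neither a dog nor a human face was detected.'
```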
Some sample output for our algorithm is provided below, but feel free to design your own user experience!