Final remarks and contributions

Note: the slides of my presentation can be found here: cats_dogs__alexandre_de_brebisson.

Tomorrow is the deadline and, since I have not written for a long time, I thought I would write this post to summarize my work. It was an interesting project and many ideas came out of it; a few might even be good research directions.

At the beginning of the project, I spent quite a bit of time tuning the parameters of vanilla convnets, and I am glad it saved some time for the other students. I started with small models and progressively scaled them up.

I quickly reached the 80% required by the project with a model that has 4 hidden layers, takes 70×70 inputs and reaches 80% in less than 3 minutes on a Titan Black GPU. This small model later helped me to prototype new ideas very quickly.

My big model takes 5 hours to converge on a GTX 580. Thanks to CPU parallelization, data augmentation does not add any extra cost. It reached an accuracy of 97.84% on the test dataset. I evaluated a former (slightly worse) model on the Kaggle test dataset and got ~97.5%, which would place me around the 15th position. Note that for our coursework we had less training data than for the Kaggle competition and that we were not allowed to use additional datasets.

My main contribution is the multi-view architecture, which really takes advantage of the data augmentation. It is a way to learn co-adaptations between views and make the resulting common representations more robust (a single view may miss an important feature). Many questions arose:

– What data augmentation scheme should be used; should the views be generated independently?
– Where should the fusion occur?
– What is the best fusion strategy?
– Is memory an issue?
– Can we reduce the computational time?

The fusion-strategy and computational-time questions are those that I started to answer in this project. I found that gamma pooling (with a learned gamma) seems to work slightly better than the other fusion strategies. I also showed that pretraining the multi-view architecture with a single view works very well (it does not increase the error) and saves some training time. Therefore, there does not seem to be any additional training cost to using multi-view architectures. However, the testing time is inevitably longer.
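To make the idea concrete, here is a minimal numpy sketch of a multi-view forward pass. It is a toy stand-in, not my actual network: a shared linear layer plays the role of the shared convolutional branches, the fusion is a simple mean, and all names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)

# Shared branch weights: every augmented view goes through the SAME parameters.
W_branch = rng.randn(64, 32) * 0.1  # toy stand-in for the convolutional branch
W_top = rng.randn(32, 2) * 0.1      # single common pathway after the fusion

def branch(view):
    """Shared-weight branch applied to one augmented view (batch, 64)."""
    return np.maximum(0.0, view @ W_branch)  # ReLU features, shape (batch, 32)

def multi_view_forward(views):
    """views: list of n_views arrays of shape (batch, 64)."""
    features = np.stack([branch(v) for v in views])  # (n_views, batch, 32)
    fused = features.mean(axis=0)                    # mean fusion across views
    return fused @ W_top                             # common top pathway

views = [rng.randn(8, 64) for _ in range(5)]  # 5 augmented views of one minibatch
logits = multi_view_forward(views)
print(logits.shape)  # (8, 2)
```

With identical views, the mean fusion reduces exactly to the single-view pathway, which is what makes the single-view pretraining discussed below possible.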

I also did some side experiments that I did not report, such as batch normalization (I used an already implemented Lasagne version) or dependent data augmentations (views that are dependent on each other).


Optimal fusion strategy in multi-view architectures

In my previous experiments, I already tried two types of fusion strategies to merge the branches of my multi-view architecture: mean and max pooling. However, there are several other possibilities, which are summarised in the following figure:

[Figure: overview of the possible fusion strategies]

The “gamma-pooling” (suggested by Professor Pascal Vincent) is a pooling function parametrized by a parameter gamma:

y = sum_i a_i * softmax(gamma * a)_i = sum_i a_i * exp(gamma * a_i) / sum_j exp(gamma * a_j)

where a is the vector of branch activations (one per view).

The resulting functions range from mean pooling (gamma = 0) to max pooling (gamma = +inf). I obtained my best results by using the gamma-pooling fusion strategy and learning gamma. The final value of gamma is very sensitive to its initial value. Further experiments should consider a separate learning rate for gamma (in the experiments below, the learning rate is global).
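As a sanity check, here is a minimal numpy sketch of the gamma-pooling formula on a toy activation vector (in the real architecture, gamma is a learned parameter updated by backpropagation along with the weights):

```python
import numpy as np

def gamma_pool(a, gamma):
    """Fuse branch activations a along axis 0 (one entry per view).

    gamma = 0    -> softmax weights are uniform  -> mean pooling
    gamma -> inf -> weight concentrates on the largest activation -> max pooling
    """
    w = np.exp(gamma * (a - a.max(axis=0, keepdims=True)))  # numerically stable softmax
    w /= w.sum(axis=0, keepdims=True)
    return (w * a).sum(axis=0)

a = np.array([1.0, 2.0, 4.0])   # toy activations of 3 branches
print(gamma_pool(a, 0.0))       # 2.333... (mean of a)
print(gamma_pool(a, 50.0))      # ~4.0 (max of a)
```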

Here are the results of my experiments:

One branch:
not averaged: 0.0896
averaged over 20 views at test time: 0.0664

Five branches:

  • Mean (gamma=0):
    from scratch: 0.0704
    pre-trained: 0.0712
  • Max (gamma=+inf):
    from scratch: 0.0704
    pre-trained: 0.0724
  • Gamma learnt:
    from scratch (gamma init:2, gamma final: 2.3): 0.0676
    pre-trained (gamma init:1, gamma final: 0.22): 0.0664
    pre-trained (gamma init:0, gamma final: 0.17): 0.06
  • aggregation + 5 views:
    from scratch: 0.0744

We can see that the gamma strategy (with gamma being learnt) yields the best results. Different initial values of gamma converge towards different final values.

[Figures: comparison of training and validation errors for the different fusion strategies]

There is again a very small improvement for the pre-trained architecture (not on the training dataset though), which may simply be due to different initial conditions.

Pretraining the multi-view architecture

It seems that the main disadvantage of the multi-view architecture is that it takes longer to train. This might be the only reason not to use it for a computer vision problem. This post explores a possible way to reduce the computations.

First of all, it is important to note that the training time does not scale linearly with the number of views (equivalently, of branches), for the following reasons:
– the branches represent only the first part of the whole architecture; the second part is a single pathway as before.
– for the first part with multiple branches, there is only a single update of the parameters per minibatch, as the branches share the same weights.
– an update with the multi-view architecture contains more “information” than an update with a single view, as it analyses several augmented views of the same image. Hence, fewer epochs should be necessary to converge.

The last point, although intuitive, is not true in practice, as the following figure shows:

We can see that it takes around 130 epochs to converge for both architectures.

Another idea is to pre-train the multi-view architecture with the single-view architecture. The first step is to train the single-view architecture as usual and then use the resulting parameters to initialise the multi-view architecture. This turns out to work well, surprisingly well. Let’s first have a look at the training curves. The following figure shows the evolution of the validation error.


As you can see, merely combining five blue models without fine-tuning gives an error of around 8.5% (the red curve at t=0), which is a big improvement over the 9.5% of the single branch (blue curve at t=200). Furthermore, fine-tuning the model improves the results even more (the red curve reaches 6.5%) and, surprisingly, the fine-tuned pre-trained architecture is even better than the 5-view architecture trained without any pre-training. Before analysing these results, let’s have a look at the evolution of the training errors.


As you can see, pre-training the model also helps the optimization significantly.

Let’s try to analyse why a pre-trained multi-view architecture performs better. A possible reason that comes to mind is that there is less co-adaptation between branches. But then, why does the fine-tuned pre-trained architecture perform better than the pre-trained not-fine-tuned one? This is the interesting point. It seems that co-adaptation is helpful, or rather that some feature co-adaptations are helpful and others are not. Why?

1. co-adaptation can be good. I already gave some arguments in my original post. Say you have three branches. Individually, two of them detect a dog while the last one detects a cat. If you don’t fine-tune, you have a high chance of predicting a dog with the multi-view architecture (not always true; simply averaging the branches without fine-tuning is already good on its own, as pointed out before). But if you fine-tune, you may learn that the picture contains two features that indicate with certainty that it is a cat. More co-adaptation will likely improve the training error (and the generalisation error as a side effect).

2. too much co-adaptation is not good. This is the purpose of dropout: reducing the co-adaptation between neurons. Here we reduce it between branches.

To recap, here are the benefits of pre-training the multi-view architectures:
– it is faster
– it helps the optimization
– it helps the generalization

Recap, plans, speedup, batch size


A quick note about the architectures of fellow students. Some of them arrived at architectures and hyperparameters similar to mine after their hyper-parameter search. That is both good and bad. Good because it means that I did my hyper-parameter search quite well. Bad because I would have preferred to see better architectures than mine! Andrew has very interesting results though: he obtained 94% accuracy with only 10 neurons in the fully connected layers and a decreasing number of feature maps as we go deeper! Such a strange architecture is quite uncommon, and all the more interesting as it seems to work very well. It contradicts some of my experiments though, as I noted that adding capacity to the fully connected layers helps.

To recap my experiments so far:
– my hyper-parameter search led me to a 6-convolution architecture with 3×3 filters, two fully connected layers on top, 260×260 inputs, a learning rate of 0.1 and a momentum of 0.9. My “raw” network gets around 94.5% accuracy in less than 100 seconds.
– dropout helps on the fully connected layers.
– data augmentation, in particular cropping, flipping and rotating, helps.
– a weight-decay scheme (with the best parameters being reloaded) slightly improves the results.
– data augmentation helps even more with the multi-view architecture that I introduced (Sander Dieleman thought of a similar architecture for a Kaggle competition, which confirms that the idea is good!).
– I have a super-fast prototyping architecture that I use to test new ideas. It reaches 80% accuracy in less than 5 minutes.


Therefore, I will naturally resume my work on my multi-view architectures. I have many ideas and have already discussed a few of them with Pascal and Yoshua. As usual, I will run my experiments with my small prototyping architecture.


In my experiments, I noticed that the merging of the branches of my multi-view architecture was quite slow, so I changed it to gain a 2× speedup with 5 views. The former “merging” code was:

def pool_over_branches(self, layer, pool_function=T.max):
    ls_selectors = []
    step = 1 / float(self.n_chunks)
    for i in range(self.n_chunks):
        a = i / float(self.n_chunks)
        # Selector is a custom layer that selects the slice [a, a + step] of
        # the batch axis, i.e. the outputs of one branch.
        select = Selector(layer, numpy.array([a, a + step]))
        ls_selectors.append(select)
    # Summing with coeffs=step averages the branches (mean fusion);
    # pool_function was not used in this version.
    return lasagne.layers.ElemwiseSumLayer(ls_selectors, coeffs=step)

I had created a new Lasagne layer, called Selector. Now I am using the built-in Lasagne layers ReshapeLayer and DimshuffleLayer. It turns out to be much faster (though trickier to implement):

def pool_over_branches(self, layer, pool_function=T.max):
    # (n_chunks * batch_size, features) -> (n_chunks, batch_size, features)
    reshaped_layer = lasagne.layers.ReshapeLayer(layer, (self.n_chunks, self.batch_size, -1))
    # Put the view axis last: (batch_size, features, n_chunks), so that the
    # n_chunks values of each feature end up adjacent after flattening.
    reshaped_layer = lasagne.layers.DimshuffleLayer(reshaped_layer, (1, 2, 0))
    reshaped_layer = lasagne.layers.ReshapeLayer(reshaped_layer, (self.batch_size, -1))
    # Pool every group of n_chunks adjacent values, i.e. each feature across the views.
    return lasagne.layers.FeaturePoolLayer(reshaped_layer, self.n_chunks, 1, pool_function=pool_function)
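As a sanity check of the trick, here is a numpy equivalent of the reshape/dimshuffle/pool sequence. It assumes the branch outputs are stacked view-major along the batch axis; note that the view axis has to end up last, so that the groups of adjacent values pooled at the end correspond to the views of one feature:

```python
import numpy as np

n_chunks, batch_size, n_feat = 5, 4, 3
# Branch outputs stacked view-major along the batch axis: rows 0..batch-1 are
# view 0, rows batch..2*batch-1 are view 1, and so on.
x = np.arange(n_chunks * batch_size * n_feat, dtype=float).reshape(n_chunks * batch_size, n_feat)

# Same steps as the Lasagne layers: reshape, dimshuffle, reshape, pool groups.
r = x.reshape(n_chunks, batch_size, n_feat)
r = r.transpose(1, 2, 0)                     # (batch, feat, n_chunks): views adjacent
r = r.reshape(batch_size, n_feat * n_chunks)
pooled = r.reshape(batch_size, n_feat, n_chunks).max(axis=2)  # FeaturePoolLayer equivalent

# Reference: pool directly across the view axis.
reference = x.reshape(n_chunks, batch_size, n_feat).max(axis=0)
print(np.allclose(pooled, reference))  # True
```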

Batch size

I forgot to mention that I ran experiments to evaluate the effect of the batch size; I don’t think anyone has done it yet.

[Figures: batch size versus epoch time and versus error rate]

There are at least three interesting observations:
– there is a big speedup for batch sizes of 60/70. But would it be different with another architecture? Probably yes. I don’t really know why 60/70 are best here; I will probably ask the Theano devs. I would have expected bigger batch sizes to result in faster epochs (there are fewer updates), so it is quite strange.
– the error rate does not seem to be too dependent on the batch size (more experiments should be done).
– the error-rate plot shows the variability we get with different initial weights, sometimes more than 1%!

I forgot to monitor the total time of convergence unfortunately.

97.64% accuracy

I have trained my big model (actually a slightly different one: still 9 layers but with more capacity) with all the improvements I have developed so far (and described in my earlier posts). The trained model achieves the following error rates on the different datasets:
train: 0.00005 (1 mistake)
valid: 0.0292
test: 0.0236 (note: in my experiments the test scores are usually slightly better than the validation scores. Test images are likely a bit easier to classify.)

My precise model is detailed in this Lasagne file. It uses 3×3 convolution filters like VGGNet and has one pooling layer fewer than my previous model (thus more capacity). It uses the multi-view architecture that I detailed here, with 5 branches.
EDIT 17 March: Sander Dieleman used a somewhat similar architecture to win the Kaggle Plankton challenge.

Last year, Pierre Sermanet (now at Google Brain) won the competition with an accuracy of 98.5%. He used external data by doing transfer learning from his Overfeat network trained on ImageNet. I actually went to a meetup in Paris last year, in which he spoke about Overfeat and the Kaggle competition.

The 97.64% was obtained without external data. Unfortunately, there is no leaderboard for solutions without external data, but according to this Kaggle thread, Charlie Tang (PhD student of Hinton and former top Kaggler) obtained 96.9% without external data.
EDIT 21 March: Apparently Maxim Milakov obtained 98.137% without external data and was ranked 5th (thanks to Dong-Hyun Lee for pointing this out).

Note that the 97.64% was obtained in less favourable settings than those of the Kaggle competition. In our project, the training dataset is composed of 20,000 images versus 25,000 for the Kaggle competition. The extra images would likely improve the results (actually it is at most 5,000, depending on how much validation data you decide to hold out).

Furthermore, using bagging  (like many Kagglers do) would certainly improve the results even more.

Closer look at the data

Until now, I have not spent much time analysing the data. Actually, only two of my design choices are related to the dataset: the size of the input images (260×260) and the number of outputs (two). Otherwise, it could have been a competition about classifying carrots and potatoes; it would not have changed much.

Size of the input images

I chose 260×260 input images somewhat arbitrarily, by quickly looking at the distributions of heights and widths in the dataset. But if we have a closer look, we can see that the images are usually bigger:
[Figures: distributions of image heights and widths]

Taking bigger images as inputs would likely improve the results (but increase the computation time). Note that many of my experiments were done with images of size 60×60 to quickly evaluate new ideas.

We also observe that widths and heights vary a lot. Therefore, a solution with variable-size inputs may be beneficial. It would probably be more difficult to implement though (we would have to pool over variable-size inputs).
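For instance, a global pooling over the spatial dimensions is one way to handle variable-size inputs; here is a minimal numpy sketch (the shapes and channel count are hypothetical):

```python
import numpy as np

def global_max_pool(feature_map):
    """Collapse a (channels, height, width) map of ANY spatial size into a
    fixed-size (channels,) vector, so that images of different sizes yield
    representations of the same dimensionality."""
    return feature_map.reshape(feature_map.shape[0], -1).max(axis=1)

rng = np.random.RandomState(0)
small = rng.randn(16, 20, 30)  # 16 feature maps, 20x30 spatial size
large = rng.randn(16, 55, 80)  # same channels, different spatial size
print(global_max_pool(small).shape, global_max_pool(large).shape)  # (16,) (16,)
```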

Misclassified images

Misclassified images would almost always be correctly classified by a human, so there is room for improvement!
At the end of the post are typical misclassified images from my validation set. You will see that they are probably not the cleanest images of the dataset. By introducing transformations that produce similarly dirty images during training, we would likely improve the model. Such transformations could be: zoom, noise, blurring, bigger rotations. Elongated images would be solved as explained in the previous section.

– small head, dark head
– very elongated image
– many small cats
– Schrödinger’s cat
– very blurred image
– tricky posture
– really too far from the training data distribution (composed of natural images!)
– hide the head with your finger and try to guess if it is a dog or a cat…

Engineering speedup

This post is not about accelerating the convergence of the training (reducing the number of epochs) but about reducing the overhead in the training dataflow. Discussions with César have been particularly fruitful.


First of all, the dataset is too big to be loaded entirely on the GPU, so this is not an option.

Until now, I was reading the hdf5 file, loading the next batch into RAM, processing it with data augmentation and then loading it onto the GPU. However, reading the hdf5 file and loading the data into RAM at each minibatch takes a lot of time. Loading the entire dataset into RAM yields a big speedup but is only possible if you have more than 12GB (the model has to be stored as well).


Doing intensive data augmentation can significantly slow down the experiments, as it quickly becomes the bottleneck. Instead of using a single thread on a single CPU core, it seems natural to parallelise the data augmentations over several CPU cores. The best option I have found so far uses the Python multiprocessing library. One core does the IO while the others preprocess batches independently and store the resulting batches in a queue that is watched by the IO process.
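Here is a minimal, self-contained sketch of this scheme with the multiprocessing library. The augmentation is a toy stand-in (it just adds noise), each worker gets its own slice of the batches, and the fork start method used here is POSIX-only:

```python
import multiprocessing as mp
import random

def augment(batch, seed):
    """Toy stand-in for the expensive augmentations (crops, flips, rotations)."""
    rng = random.Random(seed)
    return [x + rng.uniform(-0.1, 0.1) for x in batch]

def worker(assigned_batches, queue, seed):
    """Each worker process augments its own share of the batches independently."""
    for batch in assigned_batches:
        queue.put(augment(batch, seed))

ctx = mp.get_context("fork")  # POSIX-only: children inherit memory directly
queue = ctx.Queue()
batches = [[float(i)] * 4 for i in range(8)]  # toy "dataset" of 8 minibatches
n_workers = 2
# Split the batches between the workers: each process only ever touches its
# own slice, so no shared memory is needed.
workers = [ctx.Process(target=worker, args=(batches[i::n_workers], queue, i))
           for i in range(n_workers)]
for w in workers:
    w.start()
# The main (IO/GPU) process consumes augmented batches as they become ready.
results = [queue.get() for _ in range(len(batches))]
for w in workers:
    w.join()
print(len(results))  # 8
```

The batches come off the queue in whatever order the workers finish them, which is fine for stochastic gradient training.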

I encountered several difficulties:
– if you have several processes, they cannot read the same hdf5 file. A solution is to create as many copies of the hdf5 file as there are processes. The advantage of this method is that the RAM is not heavily used. The disadvantage is that you read many hdf5 files in parallel, which has a lot of overhead.
– the processes have separate memory spaces. Therefore, if you have the whole dataset in memory, the processes will create their own copies if you are not careful. Python multiprocessing makes it possible to place numpy arrays in shared memory for all the processes, but our dataset has variable-size images, which thus cannot fit in one big numpy array (they are stored in a list instead). The other solution is to give each process only the part of the dataset it will process. This is what I do now; you just have to be careful with the indices, especially if you permute the data every time.

Unfortunately, with Theano, we cannot load batches onto the GPU asynchronously while the GPU is running.

Speedup in practice

Loading the entire dataset into RAM and parallelising the creation of the batches yields a significant speedup, especially when there is a lot of data augmentation (in particular rotations, which are quite expensive). For some experiments, I get a speedup of nearly 4×.