Note: the slides of my presentation can be found here: cats_dogs__alexandre_de_brebisson.
Tomorrow is the deadline, and since I have not written for a long time, I thought I would write this post to summarize my work. It was an interesting project and the students came up with many ideas, a few of which might even be good research directions.
At the beginning of the project, I spent quite a bit of time tuning the parameters of vanilla convnets, and I am glad it saved other students some time. I started with small models and progressively scaled them up.
I quickly reached the 80% accuracy required by the project with a model that has 4 hidden layers, takes 70×70 inputs, and trains to that accuracy in less than 3 minutes on a Titan Black GPU. This small model later helped me prototype new ideas very quickly.
My big model takes 5 hours to converge on a GTX 580. Thanks to CPU parallelization, data augmentation does not add any cost. It reached 97.84% accuracy on the test set. I evaluated an earlier (slightly worse) model on the Kaggle test set and got ~97.5%, which would place me around 15th position. Note that for our coursework we had less training data than in the Kaggle competition and that we were not allowed to use additional datasets.
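To illustrate how data augmentation can be hidden behind training, here is a minimal sketch of a background producer that prepares augmented batches while the consumer (the training loop) works on the previous one. It is not my actual pipeline: the `augment` function (a random horizontal flip), the `prefetch_batches` helper and all parameter names are hypothetical, and images are plain nested lists to keep the example self-contained. A real pipeline would more likely use several worker processes, since Python threads share one interpreter lock.

```python
import queue
import random
import threading


def augment(image, rng):
    """Hypothetical augmentation: random horizontal flip of a 2D image (list of rows)."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


def prefetch_batches(images, batch_size, n_batches, seed=0, max_prefetch=4):
    """Yield augmented batches that a background thread prepares ahead of time."""
    rng = random.Random(seed)
    q = queue.Queue(maxsize=max_prefetch)  # bounded, so the producer cannot run far ahead
    sentinel = object()  # marks the end of the stream

    def producer():
        for _ in range(n_batches):
            batch = [augment(rng.choice(images), rng) for _ in range(batch_size)]
            q.put(batch)  # blocks while the queue is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        batch = q.get()  # blocks until the next batch is ready
        if batch is sentinel:
            return
        yield batch


if __name__ == "__main__":
    toy_images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    for batch in prefetch_batches(toy_images, batch_size=2, n_batches=3):
        # A real training step on the GPU would go here.
        print(batch)
```

As long as preparing a batch takes less time than a training step, the queue stays non-empty and the training loop never waits for augmentation.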
My main contribution is the multi-view architecture, which really takes advantage of the data augmentation. It is a way to learn co-adaptation between views and make the resulting common representation more robust (a single view may miss an important feature). Many questions arose:
– Which data augmentation scheme should be used, and should the views be generated independently?
– Where should the fusion occur?
– What is the best fusion strategy?
– Is memory an issue?
– Can we reduce the computational time?
The questions in bold are those that I started to answer in this project. I found that gamma pooling (with a learned gamma) seems to work slightly better than other fusion strategies. I also showed that pretraining the multi-view architecture with a single view works very well (it does not increase the error) and saves some training time. Therefore, there does not seem to be any additional cost to using multi-view architectures during training. However, the testing time is inevitably longer.
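To make the fusion step more concrete, here is a rough sketch of three ways to merge the feature vectors computed from several views of the same image. This post does not spell out the gamma pooling formula, so the `gamma_pool` function below is only an assumption: it uses a generalized (power) mean over views, which equals mean pooling at gamma = 1 and approaches max pooling as gamma grows. The function names are hypothetical, features are assumed non-negative (for example after a ReLU), and gamma is a fixed argument here, whereas in training it would be a parameter updated by gradient descent.

```python
def mean_pool(views):
    """Element-wise mean over views; each view is a list of floats."""
    n = len(views)
    return [sum(column) / n for column in zip(*views)]


def max_pool(views):
    """Element-wise maximum over views."""
    return [max(column) for column in zip(*views)]


def gamma_pool(views, gamma):
    """Assumed form of gamma pooling: element-wise power mean over views.

    Requires gamma > 0 and non-negative features.
    """
    n = len(views)
    return [
        (sum(x ** gamma for x in column) / n) ** (1.0 / gamma)
        for column in zip(*views)
    ]


if __name__ == "__main__":
    # Two views of the same image, each reduced to a 2-dimensional feature vector.
    views = [[1.0, 4.0], [3.0, 2.0]]
    print(mean_pool(views))
    print(max_pool(views))
    print(gamma_pool(views, 1.0))    # same as mean pooling
    print(gamma_pool(views, 100.0))  # close to max pooling
```

Under this reading, a learned gamma lets the network choose a point between averaging the views and keeping only the strongest response.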
I also did some side experiments that I did not report, such as batch normalization (I used an existing Lasagne implementation) or dependent data augmentations (views that are not generated independently).