A dataset of 200,382 human object recognition judgments as a function of viewing time for 4,771 images from ImageNet and ObjectNet. The human judgments allow us to compute an image-level metric for recognition difficulty: the minimum exposure, in milliseconds, that humans needed to reliably classify an image correctly. This lets us explore dataset difficulty distributions and model performance as a function of recognition difficulty. These results indicate that object recognition datasets are skewed toward easy examples, and they are a first step toward tools for shaping datasets as they are gathered, focusing collection on the missing class of hard examples. Read more in our upcoming publication!
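To make the metric concrete, here is a minimal sketch of how a minimum-viewing-time difficulty score could be computed per image from trial-level judgments. The file name, column names (`image_id`, `viewing_time_ms`, `correct`), and the reliability threshold are illustrative assumptions, not our exact pipeline.

```python
import pandas as pd

# Hypothetical trial-level data: image_id, viewing_time_ms (exposure duration),
# correct (1 if the subject classified the image correctly, else 0).
judgments = pd.read_csv("human_judgments.csv")

# Accuracy at each exposure duration, per image.
acc = (judgments
       .groupby(["image_id", "viewing_time_ms"])["correct"]
       .mean()
       .reset_index())

RELIABLE = 0.5  # illustrative threshold for "reliably classified"

def min_viewing_time(group: pd.DataFrame) -> float:
    """Shortest exposure at which per-image accuracy meets the threshold.

    Returns inf if the image is never reliably recognized at any tested
    exposure, marking it as a maximally hard example.
    """
    ok = group[group["correct"] >= RELIABLE]
    return ok["viewing_time_ms"].min() if not ok.empty else float("inf")

# Image-level difficulty: minimum exposure (ms) needed for reliable recognition.
difficulty = (acc.groupby("image_id")
                 .apply(min_viewing_time)
                 .rename("min_viewing_time_ms"))
print(difficulty.describe())
```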
Our difficulty metric indicates that object recognition datasets are skewed toward easy images. Even ObjectNet, which was deliberately collected to control for biases found in other datasets, has a difficulty distribution remarkably similar to ImageNet's. Without deliberately collecting difficult images, datasets will continue to misrepresent the recognition abilities humans demonstrate in the real world.
Ensuring representation of difficult images is important: models not only perform significantly worse on hard images, but our results also show that the performance drops caused by distribution shift are exacerbated by image difficulty. Surprisingly, increasing model capacity within most model families makes no progress toward reversing this trend, although OpenAI's CLIP model demonstrates high robustness. There are likely other phenomena, beyond distribution shift, that show a similar dependence on image difficulty. If benchmarks continue to undersample hard images, misleadingly high model performance will mask these issues. See our paper for more details.
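As a sketch of the kind of analysis behind these findings, the snippet below bins images by a difficulty score like the one above and reports a model's accuracy per bin. The file name, column names, bin edges, and sentinel value are assumptions for illustration, not the paper's exact procedure.

```python
import pandas as pd

# Hypothetical per-image table with columns:
# image_id, min_viewing_time_ms (inf if never reliably recognized), model_correct (0/1).
results = pd.read_csv("model_eval_with_difficulty.csv")

MAX_MS = 10_000  # sentinel for images humans never reliably recognized
ms = results["min_viewing_time_ms"].clip(upper=MAX_MS)

# Illustrative bin edges (ms); not the paper's exact binning.
bins = [0, 25, 50, 150, MAX_MS]
labels = ["<=25ms", "25-50ms", "50-150ms", ">150ms"]
results["difficulty_bin"] = pd.cut(ms, bins=bins, labels=labels)

# Model accuracy as a function of human recognition difficulty:
# accuracy falling across bins indicates the model struggles on hard images.
print(results.groupby("difficulty_bin", observed=True)["model_correct"].mean())
```

Running the same binning on in-distribution and shifted test sets (e.g., ImageNet vs. ObjectNet) and comparing the per-bin gap is one way to see whether distribution-shift losses grow with image difficulty.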
This work was supported by the Center for Brains, Minds, and Machines, NSF STC award CCF-1231216, the NSF award 2124052, the MIT CSAIL Machine Learning Applications Initiative, the MIT-IBM Watson AI Lab, the DARPA Artificial Social Intelligence for Successful Teams (ASIST) program, the DARPA Knowledge Management at Scale and Speed (KMASS) program, the United States Air Force Research Laboratory and the Department of the Air Force Artificial Intelligence Accelerator under Cooperative Agreement Number FA8750-19-2-1000, the Air Force Office of Scientific Research (AFOSR) under award number FA9550-21-1-0014, and the Office of Naval Research under award number N00014-20-1-2589 and award number N00014-20-1-2643. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of the Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.