https://blog.oxen.ai/practical-ml-dive-how-to-customize-a-vi...
~ TLDR ~ ViT worked best in this small experiment, with minimal code. The task was classifying faces into 7 different emotions such as "happy", "sad", and "angry".
Model Accuracy
* ViT - 69%
* ResNet50 - 64%
* Zero-Shot CLIP - 53%
I was honestly most impressed with CLIP's ability for zero-shot transfer, even though it had the worst accuracy. You can give it a freeform list of prompts or labels and it will classify images into that set without any training. That feels like the future of prototyping products and models: start with zero-shot CLIP, then once you've defined your use case, go with something more performant like a ViT.
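For anyone who hasn't tried zero-shot CLIP, here is a rough sketch of what "give it a list of labels" looks like with the Hugging Face `transformers` library. This is not the code from the blog post. The checkpoint name, the prompt template, and the `build_prompts` / `classify_zero_shot` helper names are my own assumptions for illustration, and it only uses the three emotions named above.

```python
def build_prompts(labels, template="a photo of a {} face"):
    """Wrap bare labels in a sentence; the template is an assumed choice."""
    return [template.format(label) for label in labels]


def classify_zero_shot(image_path, labels,
                       checkpoint="openai/clip-vit-base-patch32"):
    """Score one image against freeform labels with no training step.

    The checkpoint is an assumption, not necessarily the one used in the post.
    """
    # Imported here so build_prompts works without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=build_prompts(labels), images=image,
                       return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds image-to-text similarity, one column per prompt.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return dict(zip(labels, probs.tolist()))


# Example call (the image path is hypothetical):
# scores = classify_zero_shot("face.jpg", ["happy", "sad", "angry"])
# best = max(scores, key=scores.get)
```

Changing the label list is all it takes to try a different set of classes, which is why it works so well for quick prototyping.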
Anyways, I had fun writing the code and running the experiments, so thought I would share!