Advanced Computer Vision - MobileCLIP

Case study on MobileCLIP, a lightweight CLIP model for mobile devices, focusing on architecture, training, and dataset creation.

Project Repo

Made in collaboration with Emanuele

This is not strictly a project but a case study of how MobileCLIP was developed, covering its architecture, training procedure, dataset creation, and the overall process of building a lightweight CLIP model for mobile devices. Possible improvements are proposed below.

Click to download presentation

Training

  • TinyCLIP distillation of MobileCLIP
    • Reduce the number of model parameters while retaining most of the original accuracy by applying TinyCLIP's affinity-mimicking distillation to MobileCLIP (see the sketch after this list).
  • Improving synthetic captions
    • Regenerate synthetic captions in the reinforced dataset when they are too similar to one another (a near-duplicate check is sketched after this list).
  • Sigmoid self-attention
    • Replace the softmax with an elementwise sigmoid in the self-attention layers (a sigmoid-attention sketch follows the list).
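
For the distillation idea, the snippet below is a minimal PyTorch sketch of a TinyCLIP-style affinity-mimicking loss: the student is trained to match the teacher's image-text similarity distributions in both directions. The function and variable names, the temperature `tau`, and the batch layout are illustrative assumptions, not MobileCLIP's or TinyCLIP's actual training code.

```python
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(s_img, s_txt, t_img, t_txt, tau=0.07):
    """TinyCLIP-style affinity mimicking: the student matches the
    teacher's image-text similarity distributions in both directions.
    All inputs are (B, D) batches of image/text embeddings."""
    # Normalize so dot products are cosine similarities.
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)

    s_logits = (s_img @ s_txt.t()) / tau   # (B, B) student affinities
    t_logits = (t_img @ t_txt.t()) / tau   # (B, B) teacher affinities

    # KL divergence teacher -> student, image-to-text and text-to-image.
    loss_i2t = F.kl_div(F.log_softmax(s_logits, dim=1),
                        F.softmax(t_logits, dim=1), reduction="batchmean")
    loss_t2i = F.kl_div(F.log_softmax(s_logits.t(), dim=1),
                        F.softmax(t_logits.t(), dim=1), reduction="batchmean")
    return 0.5 * (loss_i2t + loss_t2i)
```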
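For the synthetic-caption idea, a near-duplicate check could be as simple as the helper below, which assumes caption embeddings from some text encoder are already available. The function name and the 0.9 similarity threshold are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def needs_regeneration(new_emb, existing_embs, max_sim=0.9):
    """Flag a synthetic caption for regeneration when its embedding is a
    near-duplicate of captions already kept for the same image.
    new_emb: (D,) caption embedding; existing_embs: (K, D), K >= 1."""
    new_emb = F.normalize(new_emb, dim=-1)
    existing_embs = F.normalize(existing_embs, dim=-1)
    # Cosine similarity against every kept caption; regenerate if any is too close.
    return bool((existing_embs @ new_emb).max() > max_sim)
```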
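For the sigmoid-attention idea, here is a single-head sketch following the formulation in Apple's sigmoid-attention work, where the row-wise softmax is replaced by an elementwise sigmoid biased by -log(n) (n = sequence length). Masking and multi-head logic are omitted for clarity.

```python
import math
import torch

def sigmoid_attention(q, k, v):
    """Self-attention with the row-wise softmax replaced by an elementwise
    sigmoid. The -log(n) bias keeps each row's total attention mass near 1
    at initialization. q, k, v: (..., n, d)."""
    n, d = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d)  # (..., n, n)
    attn = torch.sigmoid(scores - math.log(n))         # no row normalization
    return attn @ v
```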

Inference

  • PuMer adaptation to MobileCLIP
    • PuMer applies text-informed token pruning (TIP) and modality-aware merging (MAM) within the ViLT architecture to improve latency without compromising model performance.
    • MobileCLIP differs structurally from ViLT, but modality-aware merging (MAM) could still yield small latency improvements (a simplified merging sketch follows this list).
  • Token Ranking Pruning
    • Patch Ranking was originally designed to prune image patches in transformer-based CLIP models through a learned predictor, reducing the number of tokens processed by the image encoder.
    • Adapting this technique to the MobileCLIP text encoder could yield improvements similar to those reported in the paper (a token-pruning sketch appears after this list).
  • Dynamic Input Resolution
    • Adjust input resolution on a per-sample basis: run low-resolution inference on simpler images and high-resolution inference on more complex ones (a resolution-selection heuristic is sketched after the list).
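
As a rough illustration of within-modality merging, the sketch below applies ToMe-style bipartite matching to a single modality's token sequence. PuMer's actual MAM operator differs in detail (for instance, it tracks merged-token sizes and interleaves merging with pruning), so treat this as a simplified stand-in rather than the paper's method.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def merge_tokens(x, r):
    """Merge the r most similar token pairs within one modality's sequence,
    ToMe-style: split tokens into two sets, match each token in set A to
    its most similar token in set B, and average the r best-matched pairs.
    x: (B, N, D); returns (B, N - r, D). Requires r <= N // 2."""
    B, N, D = x.shape
    a, b = x[:, ::2], x[:, 1::2]                      # alternate split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-2, -1)
    best_sim, best_idx = sim.max(dim=-1)              # best partner in b
    order = best_sim.argsort(dim=-1, descending=True)
    merged_src, kept_src = order[:, :r], order[:, r:] # a-tokens to merge/keep
    # Average each merged a-token into its matched b-token.
    dst = best_idx.gather(1, merged_src)              # (B, r) targets in b
    src_tok = a.gather(1, merged_src.unsqueeze(-1).expand(-1, -1, D))
    b = b.scatter_reduce(1, dst.unsqueeze(-1).expand(-1, -1, D),
                         src_tok, reduce="mean", include_self=True)
    a_kept = a.gather(1, kept_src.unsqueeze(-1).expand(-1, -1, D))
    return torch.cat([a_kept, b], dim=1)              # N - r tokens
```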
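A Patch-Ranking-style pruner adapted to text tokens could look like the module below: a lightweight scorer ranks tokens and only the top-k survive into later encoder layers. The scorer architecture and keep ratio are assumptions; a real adaptation would also always retain special tokens such as the EOT token.

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Hypothetical adaptation of Patch Ranking to text tokens: a small
    MLP scores each token, and only the top-k highest-ranked tokens
    continue through the remaining encoder layers."""
    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))
        self.keep_ratio = keep_ratio

    def forward(self, x):                        # x: (B, N, D)
        scores = self.scorer(x).squeeze(-1)      # (B, N) per-token importance
        k = max(1, int(x.shape[1] * self.keep_ratio))
        # Keep the k highest-scoring tokens, preserving their original order.
        idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values
        return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
```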
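Finally, a cheap per-sample resolution heuristic might use mean gradient magnitude as a complexity proxy, as sketched below. The two resolutions and the threshold are illustrative placeholders, not tuned values.

```python
import torch
import torch.nn.functional as F

def pick_resolution(img, low=160, high=256, threshold=0.08):
    """Resize one image for inference based on a crude complexity estimate.
    img: (C, H, W) tensor with values in [0, 1]; 'flat' images (low average
    gradient magnitude) get the low resolution."""
    gray = img.mean(dim=0, keepdim=True)                    # (1, H, W)
    dx = (gray[:, :, 1:] - gray[:, :, :-1]).abs().mean()    # horizontal edges
    dy = (gray[:, 1:, :] - gray[:, :-1, :]).abs().mean()    # vertical edges
    size = high if (dx + dy) / 2 > threshold else low
    return F.interpolate(img.unsqueeze(0), size=(size, size),
                         mode="bilinear", align_corners=False)
```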