CLIP model and its Zero shot capabilities

1. Introduction

CLIP, which stands for Contrastive Language-Image Pre-training, is a model from OpenAI that matches a given image with a suitable text description of that image. CLIP is trained on images together with textual descriptions of those images and learns which description matches which image. In this article we will use the Hugging Face implementation of CLIP, look at its zero-shot capabilities, and try to fine-tune CLIP by adding and training a few extra layers at the end.

2. Understanding CLIP model

CLIP was developed as a way to identify whether an image matches its text description or not. CLIP is trained with N (image, text) pairs as input. There are two parts to CLIP: an image encoder and a text encoder.

[Image: CLIP contrastive pre-training]

For a batch of N pairs, the image encoder produces N image embeddings and the text encoder produces N text embeddings. Multiplying one set against the other gives an N x N matrix of all possible image-text pairings (similarity scores). Ideally the diagonal of this matrix holds the highest values, since the i-th image corresponds to the i-th text description. During training the loss is computed along both the horizontal and vertical axes of this matrix.

[Image: the N x N similarity matrix; loss is computed along both axes]

The two losses are then averaged. The paper gives the following pseudocode:

  # extract feature representations of each modality
  I_f = image_encoder(I) #[n, d_i]
  T_f = text_encoder(T) #[n, d_t]

  # joint multimodal embedding [n, d_e]
  I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
  T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

  # scaled pairwise cosine similarities [n, n]
  logits = np.dot(I_e, T_e.T) * np.exp(t)

  # symmetric loss function
  labels = np.arange(n)
  loss_i = cross_entropy_loss(logits, labels, axis=0)
  loss_t = cross_entropy_loss(logits, labels, axis=1)
  loss = (loss_i + loss_t)/2

This teaches the model two things: which image and text pairs belong together and, just as importantly, which pairs do not. Doing this well requires a very large batch size; in the paper a minibatch size of 32,768 is used.
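For reference, here is a minimal PyTorch sketch of the same symmetric loss (my own illustration, not OpenAI's training code; image_features and text_features stand for the projected encoder outputs, and the fixed temperature replaces the learned parameter t from the pseudocode):

import torch
import torch.nn.functional as F

def clip_symmetric_loss(image_features, text_features, temperature=0.07):
    # L2-normalize so the dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (n, n) matrix of scaled pairwise similarities
    logits = image_features @ text_features.t() / temperature

    # the matching image-text pairs sit on the diagonal
    labels = torch.arange(logits.size(0), device=logits.device)

    # cross-entropy over rows (image -> text) and over columns (text -> image)
    loss_images = F.cross_entropy(logits, labels)
    loss_texts = F.cross_entropy(logits.t(), labels)
    return (loss_images + loss_texts) / 2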

To use CLIP as a zero-shot image classifier, we just give it the image along with a text description of every class, wrapped in a small prompt. For example, for an image of a dog we could pass ["this is an image of a dog", "this is an image of a cat"] and then pick the class with the maximum value in the output array.

[Image: zero-shot classification with CLIP]

3. Resources to learn more about CLIP

Here are some resources that I found useful for understanding how CLIP works.

  1. OpenAI blog
  2. AI Coffee Break with Letitia
  3. OpenAI CLIP: Connecting Text and Images Paper Explained

4. Code using the Hugging Face transformers library

Dataset

We are going to fine-tune our CLIP model on the Stanford Cars dataset, which contains images of cars together with their class names. The dataset covers 196 different classes of cars.
Let's download and extract it and look at how it is structured.

[Image: the Stanford Cars dataset]

!wget http://ai.stanford.edu/~jkrause/car196/cars_train.tgz
!wget https://ai.stanford.edu/~jkrause/cars/car_devkit.tgz
!tar -xvzf /content/cars_train.tgz
!tar -xvzf /content/car_devkit.tgz

The Stanford Cars dataset has two folders. The first, cars_train, stores all the car images. The second, car_devkit, contains metadata about those images, linking each image path with its label.

Let's look at the car_devkit folder.

[Image: car_devkit folder structure]

Here we can see that the file 'cars_train_annos.mat' contains the annotations for each image: bounding-box coordinates (bbox_x1, bbox_x2, bbox_y1, bbox_y2) as well as the label and file name. We don't need the bounding-box coordinates for this project.

The file 'cars_meta.mat' contains the class names. The actual images are stored in the 'cars_train' folder, without any labels.
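To see this structure for yourself, you can load both files with scipy and peek at the fields (a quick inspection sketch; the field names 'annotations' and 'class_names' are the same ones the dataset class below relies on):

import scipy.io

annos = scipy.io.loadmat('/content/devkit/cars_train_annos.mat')
meta = scipy.io.loadmat('/content/devkit/cars_meta.mat')

print(annos['annotations'][0][:2])   # first two annotation records (bbox, label, file name)
print(meta['class_names'][0][:5])    # first five class names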

Model

We will be using the Hugging Face transformers library. Let's install it quickly and run a trial on a sample image.

!pip install transformers
from transformers import CLIPProcessor, CLIPModel

Import all libraries

from sklearn.model_selection import train_test_split
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
from torch import nn
import torch
import torch.optim as optim
import scipy.io
from torch.utils.data import Dataset,DataLoader
from sklearn.metrics import accuracy_score, f1_score
from tqdm.notebook import tqdm as tq

Zero-Shot CLIP

Here we load the pre-trained CLIP model; it will take some time to download.

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

This processor prepares the input data for CLIP, so we don't need to write any custom data-manipulation code. CLIP takes two inputs, an image and a text. The processor converts the image to pixel values and the text to tokens, and returns a dictionary containing both, along with an attention mask.

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Let's try giving it some input. We will use this image:

[Image: the example image of two cats]

from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
#Here we give CLIP textual descriptions of the image
#CLIP will pick the description from the list that best matches the image
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"], images=[image], return_tensors="pt", padding=True
)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
probs
#The first entry has the highest probability, which is the correct answer
tensor([[0.9949, 0.0051]], grad_fn=<SoftmaxBackward0>)

Input to the CLIP model

The inputs variable is a dictionary.

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"], images=[image,image], return_tensors="pt", padding=True
)

Our CLIP model takes two inputs, one for each encoder: images for the image encoder and text for the text encoder.

The input dictionary contains the keys

‘input_ids’, ‘attention_mask’, ‘pixel_values’

input_ids is the tokenized text, attention_mask marks which tokens are real (not padding), and pixel_values contains the image pixel values. These three tensors are our input to the model.

inputs.keys()
dict_keys(['input_ids', 'attention_mask', 'pixel_values'])
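To get a feel for these tensors, we can print their shapes (here with two text prompts and two copies of the image; the exact sequence length depends on the tokenizer, and the patch32 processor resizes images to 224 x 224):

for k, v in inputs.items():
    print(k, tuple(v.shape))
# input_ids      -> (2, sequence_length): two tokenized, padded prompts
# attention_mask -> (2, sequence_length): 1 for real tokens, 0 for padding
# pixel_values   -> (2, 3, 224, 224): two images, 3 channels, 224 x 224 pixels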

Let's make a dataset class and test CLIP on our Stanford Cars dataset.

One more observation: since the text input is the same for every image, we only need to process it once. The images, on the other hand, need to be processed every time.

class StanfordCars(Dataset):
  def __init__(self,metaPath,imgDir,labelMeta,model_name="openai/clip-vit-base-patch32",cuda=False):
    """
    mataPath: path to the annotation file

    imgDir: Where images are stored

    labelMeta: File where label data is stored

    model_name: Name of model we need to store. It is needed because we need to use the
    processor of the particular model to process inputs.

    cuda : To enable gpu acceleration    

    text: to store text like "This is image of {image} car"

    textInput: Input_ids of the text which needs to be passed to CLIP model

    """
    super(StanfordCars,self).__init__()
    self.metaPath = metaPath
    self.labelMeta = labelMeta
    self.path = imgDir
    train_data = scipy.io.loadmat(self.metaPath)
    class_data = scipy.io.loadmat(self.labelMeta)
    #class names
    self.classes = class_data['class_names'][0]
    # This is our data i.e filenames and their labels
    self.data = train_data['annotations'][0]
    # To process inputs
    self.processor = CLIPProcessor.from_pretrained(model_name)
    self.text = []
    self.textInput = None
    self.cuda = cuda

  def processLabels(self):
    """
    Only needs to process text once since every image will belong to at least one class in labels.
    We just process labels one time and then add these 'input_ids' to our images. We will append these later
    to our image pixel_values and pass the whole dict to CLIP model.
    """
    for i in self.classes:
      # Adding text prompt to help clip
      self.text.append(f'This is photo of {i[0]} car')
    #processing this text
    self.textInput = self.processor(text=self.text,return_tensors="pt", padding=True)

    if(self.cuda):
      for k in self.textInput.keys():
        self.textInput[k] = self.textInput[k].cuda()

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    # check that the processLabels method has been run
    assert self.textInput!=None,'run the processLabels method'

    bbox_x1,bbox_x2,bbox_y1,bbox_y2,label,fname = self.data[idx]

    label = label.item() - 1 # because labeling starts from 1 in metadata file
    pth = self.path+'/'+fname.item()
    img = Image.open(pth)
    img = img.convert('RGB')
    #using CLIP processor to apply image pre-processing
    img = self.processor(images=img,return_tensors="pt")
    img['pixel_values'] = img['pixel_values'].squeeze() # the processor adds a batch dimension of 1; remove it

    if(self.cuda):
      img['pixel_values'] = img['pixel_values'].cuda()

    return (img,label)
dataset = StanfordCars(metaPath='/content/devkit/cars_train_annos.mat',imgDir='/content/cars_train',labelMeta='/content/devkit/cars_meta.mat',cuda=True)
dataset.processLabels()
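A quick sanity check (a small sketch using the dataset object we just created) confirms that a single item comes back as a pixel-value dict plus an integer label:

img, label = dataset[0]
print(img['pixel_values'].shape)         # torch.Size([3, 224, 224]) after the squeeze
print(label, dataset.classes[label][0])  # the integer label and its class name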

Let's split the dataset into train and eval sets.

def train_eval_split(dataset,per,seed):
  """
  dataset: Full dataset object

  per: How much train test split

  seed: Random seed

  Splitting dataset.data which contains file name and labels into two parts.
  and then creating two different dataset for train and eval
  """
  train_data,test_data = train_test_split(dataset.data,test_size = per,random_state=seed)
  dataset.data = train_data
  evalDataset = StanfordCars(metaPath='/content/devkit/cars_train_annos.mat',imgDir='/content/cars_train',labelMeta='/content/devkit/cars_meta.mat',cuda=True)
  evalDataset.processLabels()
  evalDataset.data = test_data
  return (dataset,evalDataset)
trainData,evalData = train_eval_split(dataset,0.05,3)
len(trainData)
7349
len(evalData)
387
trainLoader = DataLoader(trainData,batch_size=64,shuffle=True)
evalLoader = DataLoader(evalData,batch_size=8,shuffle=True)
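The default collate function of DataLoader stacks the per-image dictionaries key by key, so a batch already has the shape CLIP expects. A quick check with the loaders defined above:

batch_inputs, batch_labels = next(iter(evalLoader))
print(batch_inputs['pixel_values'].shape)  # torch.Size([8, 3, 224, 224])
print(batch_labels.shape)                  # torch.Size([8])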

Check the zero-shot capabilities of CLIP on the eval dataset.

The default CLIP model has not been trained on the Stanford Cars data specifically; it sees these inputs for the first time without any prior fine-tuning.

predictions = []
truth = []
#the pre-trained CLIP model we loaded earlier
model.cuda()
model.eval()
for inputs,label in tq(evalLoader):
  #add the attention mask and input_ids to input image pixel values
  for k in evalData.textInput.keys():
    inputs[k] = evalData.textInput[k]
  outputs = model(**inputs)
  logits_per_image = outputs.logits_per_image
  probs = logits_per_image.softmax(dim=1)
  preds =  torch.argmax(probs, dim=1)
  preds=preds.cpu()
  for i in preds:
    predictions.append(i.item())
  for j in label:
    truth.append(j.item())
acc = accuracy_score(truth,predictions)
print(acc)
0.6020671834625323
score = f1_score(truth,predictions,average='weighted')
print(score)
0.5738274215018401

This is a pretty good score for the zero-shot setting: the model reaches 60.2% accuracy and an F1 score of 0.573 without any task-specific training. It shows how powerful pre-training on huge amounts of data can be.

Fine-Tune the CLIP model

Let's add some layers to the end of the CLIP model. We will keep the CLIP weights frozen, add a few extra layers at the end, and train only those layers for a few epochs.

class FineTuneCLIP(nn.Module):
  def __init__(self,out_shape=196,model_name="openai/clip-vit-base-patch32",freeze=True):
    super(FineTuneCLIP,self).__init__()
    self.CLIP = CLIPModel.from_pretrained(model_name)
    # Freezing the CLIP model
    if(freeze):
      for parameter in self.CLIP.parameters():
        parameter.requires_grad=False
    # Adding extra last layers
    self.fc1 = nn.Sequential(
        nn.Linear(out_shape,out_shape*5),
        nn.BatchNorm1d(out_shape*5),
        nn.ReLU(),
        nn.Dropout(0.25)
    )

    self.fc2 =  nn.Sequential(
        nn.Linear(out_shape*5,out_shape*5),
        nn.BatchNorm1d(out_shape*5),
        nn.ReLU(),
        nn.Linear(out_shape*5,out_shape*5),
        nn.BatchNorm1d(out_shape*5),
        nn.ReLU(),
        nn.Dropout(0.3)
    )


    self.fc3 = nn.Sequential(
        nn.Linear(out_shape*5,out_shape),
        nn.BatchNorm1d(out_shape),
    )

  def forward(self,x):
    out = self.CLIP(**x)
    out = out.logits_per_image
    out = self.fc1(out)
    out = self.fc2(out)
    out = self.fc3(out)
    return out
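Why does fc1 take out_shape=196 as its input size? Because logits_per_image has shape (batch_size, number_of_text_prompts), and we pass one prompt per class, so the CLIP output for each image is a 196-dimensional vector of similarity scores. A quick check (a sketch that reuses the zero-shot model and evalLoader from earlier):

batch_imgs, batch_labels = next(iter(evalLoader))
for k in evalData.textInput.keys():
  batch_imgs[k] = evalData.textInput[k]
with torch.no_grad():
  print(model(**batch_imgs).logits_per_image.shape)  # torch.Size([8, 196])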

Training NN

We now write a PyTorch training loop. It is a basic loop; we have just added tqdm to track progress over time.

def train(model,train_loader,eval_loader,epochs,criterion,optimizer):
  """
  This function trains our model.

  model: Our model we need to train

  train_loader: contains training data

  eval_loader: Contains validation data

  epochs: No. of epochs

  criterion: Loss function

  optimizer: Optimizer for learning

  """
  model = model.cuda()
  loss_list=[]
  accuracy_list=[]
  size = len(train_loader)
  eval_size = len(eval_loader)
  #val_steps = size//2
  for epoch in range(epochs):
    model.train()
    steps = 1
    # initializing our tqdm progress bar to track progress
    train_tq = tq(train_loader)
    for inputs,labels in train_tq:
      steps+=1
      """
      add text input info to dict,
      Here we are adding our 'input_ids' and 'attention_masks'
      which we have already calculated by calling processLabels() function in dataset
      to our 'pixel_values' i.e inputs which are from train_loader

      dataset.textInput = {
        'input_ids' : [tensor]
        'attention_mask': [tensor]
      }

      inputs = {
        'pixel_values' : [tensor] of shape (3,224,224)
      }

      we are adding the 'input_ids' and 'attention_masks'  values so the final input should be

      inputs = {
        'input_ids' : [tensor]
        'attention_mask': [tensor]
        'pixel_values' : [tensor] of shape (3,224,224)
      }

      This is the input to our CLIP model

      """
      for k in dataset.textInput.keys():
        inputs[k] = dataset.textInput[k]
      optimizer.zero_grad()
      outputs = model(inputs)
      #predictions
      preds =  torch.argmax(outputs, dim=1)
      #loss
      loss = criterion(outputs, labels.cuda())
      #accuracy
      acc = torch.sum(preds.cpu() == labels.cpu().data).item()
      acc = acc/len(preds)
      accuracy_list.append(acc)
      loss_list.append(loss.item())
      #backprop
      loss.backward()
      optimizer.step()
      #setting the values of our progress bar
      train_tq.set_description(f'TRAIN :: steps: {steps}/{size+1} accuracy: {acc*100:.3f} loss: {loss.item():.4f} preds:{preds[0].item()} label:{labels[0].item()}')
    #calling evaluate method to check validation accuracy
    accuracy,val_loss_list = evaluate(model,eval_loader,criterion)


  return {
      "accuracy":accuracy,
      "train_loss":loss_list,
      "train_accuracy":accuracy_list,
      "val_loss":val_loss_list,
  }

The evaluate function computes accuracy and loss on the eval dataset.

def evaluate(model,eval_loader,criterion):
  #calculates validation accuracy
  eval_size = len(eval_loader)
  val_acc_list = []
  val_loss_list = []
  eval_tq = tq(eval_loader)
  esteps = 0
  model.eval()
  for inputs,labels in eval_tq:
    esteps+=1
    #add text info to dict
    for k in dataset.textInput.keys():
      inputs[k] = dataset.textInput[k]

    outputs = model(inputs)
    preds =  torch.argmax(outputs, dim=1)
    val_loss = criterion(outputs, labels.cuda())
    val_acc = torch.sum(preds.cpu() == labels.cpu().squeeze().data).item()
    val_acc = val_acc/len(preds)
    val_loss_list.append(val_loss.item())
    val_acc_list.append(val_acc)

    eval_tq.set_description(f'EVAL :=: steps: {esteps}/{eval_size} accuracy: {val_acc*100:.3f} loss: {val_loss.item():.4f}')

  accuracy = sum(val_acc_list)/len(val_acc_list)
  return (accuracy,val_loss_list)
fineCLIP = FineTuneCLIP()

Since we froze the weights of the CLIP model, we only need to update the parameters whose requires_grad property is True. The code below checks which parameters have requires_grad == True and adds them to a list.

This list is then passed to the optimizer, which will update only those parameters.

feature_extract = True
print("Params to learn:")
if feature_extract:
    params_to_update = []
    for name,param in fineCLIP.named_parameters():
        if param.requires_grad == True:#
            params_to_update.append(param)
            print("\t",name)
else:
    for name,param in fineCLIP.named_parameters():
        if param.requires_grad == True:
            print("\t",name)
Params to learn:
	 fc1.0.weight
	 fc1.0.bias
	 fc1.1.weight
	 fc1.1.bias
	 fc2.0.weight
	 fc2.0.bias
	 fc2.1.weight
	 fc2.1.bias
	 fc2.3.weight
	 fc2.3.bias
	 fc2.4.weight
	 fc2.4.bias
	 fc3.0.weight
	 fc3.0.bias
	 fc3.1.weight
	 fc3.1.bias
optimizer = optim.Adam(params_to_update,lr=0.0002)
criterion=nn.CrossEntropyLoss()
kwargs = {"model":fineCLIP,
          "train_loader":trainLoader,
          "eval_loader":evalLoader,
          "epochs":6,
          "criterion":criterion,
          "optimizer":optimizer,
}

Training the model

res=train(**kwargs)
Checking the validation accuracy of the fine-tuned CLIP
predictions = []
truth = []
fineCLIP.eval()
for inputs,label in tq(evalLoader):
  #add the attention mask and input_ids to input image pixel values
  for k in dataset.textInput.keys():
    inputs[k] = dataset.textInput[k]
  outputs = fineCLIP(inputs)
  probs = outputs.softmax(dim=1)
  preds =  torch.argmax(probs, dim=1)
  preds=preds.cpu()
  for i in preds:
    predictions.append(i.item())
  for j in label:
    truth.append(j.item())
acc = accuracy_score(truth,predictions)
print(acc)
0.7441860465116279
score = f1_score(truth,predictions,average='weighted')
print(score)
0.7289987359754804

As you can see, the accuracy is now 74.41% after just a few epochs, a solid improvement over zero-shot CLIP achieved by only adding a few final layers. Remember that we kept the parameters of CLIP completely frozen. This can be improved further with some hyperparameter tuning.
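As one hedged example of what that tuning could look like (these values are not from the run above), you could lower the learning rate and add a simple scheduler, stepping it once per epoch inside the training loop:

optimizer = optim.Adam(params_to_update, lr=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
# inside train(), after the evaluate() call at the end of each epoch:
# scheduler.step()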

Conclusion

CLIP is a very powerful model with great capability. It shows that transformer models trained on huge datasets can learn very effectively and are also good at zero-shot tasks.

More detailed results are given in CLIP's original paper, such as how CLIP is far more robust than traditional CNN-style models.
