Multimodal AI (CLIP, BLIP)

Image + text, together.

🎬 Starting with a story

Searching Google with a photo? Typing "a cat sitting on a laptop" to find a picture in your gallery? Behind features like these are models such as CLIP, OpenAI's multimodal model that maps images and text into the same embedding space.

What does Multimodal AI mean?

Understanding several modalities (image + text + audio + video) together. The output can be a caption, an answer, a similarity score, or an edit instruction.

CLIP — Contrastive Language-Image Pre-training

400M (image, caption) pairs collected from the web.
The image encoder (ViT) and the text encoder (Transformer) are separate networks.
Training raises the cosine similarity of matching pairs and lowers it for mismatched pairs.
Result: images and text live in the same vector space.

text

Batch of N (image, text):
  image emb: I₁..Iₙ   text emb: T₁..Tₙ
  Loss = average of row-wise and column-wise softmax cross-entropy (diagonal = match)
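
The loss outlined above can be sketched in a few lines of NumPy. This is a minimal illustration, not CLIP's training code: the function names are hypothetical, the embeddings are random stand-ins, and `logit_scale` is fixed at 100.0 as an assumption (CLIP learns this temperature during training).

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, logit_scale=100.0):
    """Symmetric contrastive loss over a batch of N matching (image, text) pairs."""
    # L2-normalise so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    # logits[i, j] = scaled cosine similarity between image i and text j.
    logits = logit_scale * image_emb @ text_emb.T
    # Row-wise: each image should pick its own text (the diagonal entry).
    loss_i2t = -np.diag(log_softmax(logits, axis=1)).mean()
    # Column-wise: each text should pick its own image.
    loss_t2i = -np.diag(log_softmax(logits, axis=0)).mean()
    return (loss_i2t + loss_t2i) / 2

# Stand-in embeddings (hypothetical): 4 pairs, 8 dimensions.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(text_emb + 0.01 * rng.normal(size=(4, 8)), text_emb)
shuffled = clip_contrastive_loss(text_emb[::-1], text_emb)
print(aligned, shuffled)
```

When the image embeddings line up with their own texts the loss is close to zero; pairing them in the wrong order makes it much larger, which is the signal that pulls matching pairs together during training.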

Zero-shot classification: no labelled data needed

bash

pip install open_clip_torch torch pillow

python

# clip_zeroshot.py

import open_clip
import torch
from PIL import Image

# Load a pretrained CLIP model, its image preprocessing transform, and its tokenizer.
model, _, prep = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tok = open_clip.get_tokenizer("ViT-B-32")
model.eval()

img = prep(Image.open("photo.jpg")).unsqueeze(0)  # add a batch dimension
labels = ["a cat", "a dog", "a person", "a car", "a tea cup"]
text = tok(labels)

with torch.no_grad():
    # L2-normalise both embeddings so the dot product is cosine similarity.
    img_f = model.encode_image(img)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = model.encode_text(text)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    # Scale the similarities by 100, then softmax over the labels.
    probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))

Strength

Adding a new class needs no retraining: just supply the text for the new label.
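
To make that point concrete, here is a small NumPy sketch of the scoring step only. The vectors are random stand-ins (an assumption; real ones would come from `encode_image` / `encode_text` as in the example above), and `zero_shot_probs` is a hypothetical helper name.

```python
import numpy as np

def zero_shot_probs(image_emb, label_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and every label."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=-1, keepdims=True)
    logits = logit_scale * label_embs @ image_emb
    logits = logits - logits.max()  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Stand-in embeddings (hypothetical): three existing labels and one image.
rng = np.random.default_rng(1)
label_embs = rng.normal(size=(3, 16))
image_emb = rng.normal(size=16)
probs_before = zero_shot_probs(image_emb, label_embs)

# Adding a class means appending one more text embedding; nothing is retrained.
new_label_emb = image_emb + 0.1 * rng.normal(size=16)  # a label that fits the image well
probs_after = zero_shot_probs(image_emb, np.vstack([label_embs, new_label_emb]))
print(probs_before.shape, probs_after.shape, int(probs_after.argmax()))
```

The classifier's "weights" are just the rows of the label matrix, so the set of classes can change at inference time.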

Semantic Image Search

python

# 1) Precompute embeddings for all images → store them in FAISS/Qdrant
# 2) User query text → text embedding
# 3) Cosine similarity → return the top-K images
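
The three steps above can be sketched with a brute-force NumPy index. This is a stand-in for FAISS/Qdrant, not their API: `build_index` and `search` are hypothetical helpers, and the embeddings are random placeholders for what CLIP's image and text encoders would produce.

```python
import numpy as np

def build_index(image_embs):
    """Step 1: L2-normalise image embeddings once so a dot product is cosine similarity."""
    return image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)

def search(index, query_emb, k=3):
    """Steps 2-3: score the query against every image, return the top-k (id, score) pairs."""
    query_emb = query_emb / np.linalg.norm(query_emb)
    scores = index @ query_emb
    top = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]

# Stand-in data (hypothetical): 1000 "image" embeddings of dimension 32.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 32))
index = build_index(image_embs)

# A "text query" embedding that lies close to image 42.
query_emb = image_embs[42] + 0.05 * rng.normal(size=32)
results = search(index, query_emb, k=3)
print(results)
```

A full matrix product like this is fine for a few thousand photos; a vector database becomes worthwhile once the collection is too large to scan on every query.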

BLIP / BLIP-2: Image Captioning and VQA

python

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the BLIP captioning processor and model from the Hugging Face Hub.
p = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
m = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

img = Image.open("scene.jpg").convert("RGB")  # make sure the input is 3-channel RGB
inputs = p(img, return_tensors="pt")
out = m.generate(**inputs, max_new_tokens=30)  # cap the caption length
print(p.decode(out[0], skip_special_tokens=True))

The next generation: Vision-Language Models (VLM)

LLaVA, Qwen-VL, MiniGPT-4: chat about an image.
GPT-4V, Gemini, Claude: production multimodal LLMs.
Florence-2, PaliGemma: detection + captioning + grounding in one model.

Practical use cases

E-commerce: finding products by text or photo.
Content moderation: zero-shot policy classifier.
Photo gallery auto-tagging and natural-language search.
Accessibility: captions for screen readers.

Practice tasks

Use CLIP to search your own photo collection by typing "sunset from the rooftop".
Use BLIP to auto-generate captions for 100 images, then translate them into Bangla with a translation API.
Run LLaVA-7B locally and build an image-based Q&A bot.