OCR Scanner System

Document digitization pipeline।

🎬 গল্প দিয়ে শুরু

Photo তোলো — searchable PDF পাও। বাংলা/ইংরেজি দলিল, প্রেসক্রিপশন, রসিদ — সবকিছু digitize। এই প্রজেক্টে CamScanner-grade একটি OCR scanner বানাবো।

Pipeline

text

Image / PDF
   ► Page detect (4-corner) ► Perspective unwarp
   ► Deskew + denoise + adaptive threshold
   ► Layout analysis (text vs table vs figure)
   ► OCR (PaddleOCR multilingual: Bn/En)
   ► Post-process (spell-correct, line merge)
   ► Searchable PDF / JSON / DOCX export

১. Document edge detection ও perspective fix

python

import cv2, numpy as np

def find_doc(img):
    g = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    g = cv2.GaussianBlur(g, (5,5), 0)
    edges = cv2.Canny(g, 50, 200)
    cnts, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:5]
    for c in cnts:
        approx = cv2.approxPolyDP(c, 0.02*cv2.arcLength(c, True), True)
        if len(approx) == 4: return approx.reshape(4, 2)
    return None

def unwarp(img, pts):
    pts = order_corners(pts)   # TL, TR, BR, BL
    w = int(max(np.linalg.norm(pts[1]-pts[0]), np.linalg.norm(pts[2]-pts[3])))
    h = int(max(np.linalg.norm(pts[3]-pts[0]), np.linalg.norm(pts[2]-pts[1])))
    dst = np.float32([[0,0],[w,0],[w,h],[0,h]])
    M   = cv2.getPerspectiveTransform(pts.astype("float32"), dst)
    return cv2.warpPerspective(img, M, (w, h))

২. Enhance — scan-like look

python

def enhance(img):
    g  = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    g  = cv2.fastNlMeansDenoising(g, h=10)
    th = cv2.adaptiveThreshold(g, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY,
            blockSize=31, C=10)
    return th

৩. OCR — PaddleOCR (Bangla + English)

bash

pip install paddleocr paddlepaddle

python

from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang="bn")    # 'en','bn','hi','ar' ...

res = ocr.ocr("page.jpg", cls=True)[0]
for box, (text, conf) in res:
    print(round(conf, 2), text)

Engine choice

PaddleOCR — সবচেয়ে ভালো Bangla support। EasyOCR — সবচেয়ে সহজ API। Tesseract — fastest CPU, hand-written-এ দুর্বল।

৪. Layout — text বনাম table

python

# Table detection: PP-Structure
from paddleocr import PPStructure
table_engine = PPStructure(show_log=False, layout=True, table=True)
result = table_engine("invoice.jpg")
for block in result:
    print(block["type"], block["bbox"])
    if block["type"] == "table":
        html = block["res"]["html"]   # → pandas.read_html

৫. Searchable PDF তৈরি

bash

# সবচেয়ে সহজ পথ — OCRmyPDF (Tesseract wrapper)
pip install ocrmypdf
ocrmypdf -l ben+eng input.pdf searchable.pdf

python

# অথবা reportlab দিয়ে নিজে invisible text layer বসান
from reportlab.pdfgen import canvas
c = canvas.Canvas("out.pdf")
c.drawImage("page.jpg", 0, 0, width=595, height=842)
c.setFillAlpha(0)                  # invisible text
for box, (txt, _) in res:
    x, y = box[0]                   # PDF coord-এ rescale করুন
    c.drawString(x, 842 - y, txt)
c.save()

৬. Smart features

Auto-rotate — 0/90/180/270 detect (PaddleOCR cls)।
Multi-page batching — pdf2image দিয়ে।
Key-Value extraction (invoice) — LayoutLMv3 / Donut।
Receipt parser — total, date, items → JSON।
Handwriting — TrOCR (Microsoft) বা Google Vision API।

FastAPI endpoint

python

@app.post("/scan")
async def scan(file: UploadFile):
    img    = decode(await file.read())
    pts    = find_doc(img)
    warped = unwarp(img, pts) if pts is not None else img
    clean  = enhance(warped)
    text   = "\n".join(t for _, (t, _) in ocr.ocr(clean)[0])
    pdf    = make_searchable_pdf(clean, ocr.ocr(clean)[0])
    return {"text": text, "pdf_url": upload_to_s3(pdf)}

প্র্যাকটিস টাস্ক

মোবাইল photo থেকে document edge detect ও unwarp করুন।
PaddleOCR দিয়ে বাংলা book page-এর full text extract করুন।
Restaurant receipt থেকে total + items JSON-এ বের করুন।