AI Notes · Vision Language Models · OCR
Vision Transformers and Multimodal Models for OCR
A concise, technical overview of how modern AI systems read text from images, and why resolution and context matter.
1. The problem
OCR is the task of converting the text in an image into machine-readable characters. This sounds simple, but the same text can appear in many visual forms: different fonts, sizes, lighting, angles, and backgrounds.
Formally, we want to map a pixel grid I ∈ R^(H×W×C), where H, W, and C are the image height, width, and channel count, to a sequence of tokens y = (y₁, …, yₙ). This mapping is ambiguous and depends heavily on context.
The key insight: OCR is not just perception. It is a reasoning problem. The same visual input can map to different characters, and only context (surrounding text, layout, semantics) resolves the correct output.
2. Earlier approach: CNNs
CNNs apply local filters (convolutions) to learn hierarchical features. They encode strong inductive biases: locality and translation equivariance.
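Translation equivariance can be checked numerically. The sketch below uses a one-dimensional signal and a hand-picked edge-detecting kernel (both illustrative, not taken from any real model) and shows that filtering then shifting gives the same result as shifting then filtering, as long as the content does not fall off the edge.

```python
def correlate1d(signal, kernel):
    """Valid-mode 1D cross-correlation, the operation a convolution layer computes."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def shift_right(values, steps):
    """Shift a list right by `steps`, padding with zeros on the left."""
    return [0] * steps + values[:len(values) - steps]


signal = [0, 0, 1, 3, 1, 0, 0, 0, 0]   # a small "stroke" near the left
kernel = [-1, 0, 1]                     # illustrative edge detector

filter_then_shift = shift_right(correlate1d(signal, kernel), 2)
shift_then_filter = correlate1d(shift_right(signal, 2), kernel)

# The same filter response appears, just moved by the same offset.
print(filter_then_shift == shift_then_filter)
```

The same property is what lets a CNN recognize a glyph wherever it sits on the page without relearning it for each position.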
Strengths: efficient and robust on clean, structured documents. Limitations: the receptive field grows only gradually with depth, so CNNs are weak at global reasoning and struggle with long-range dependencies and page layout.
3. Vision Transformer: images as tokens
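ViT splits an image into fixed-size patches (P×P), flattens each patch into a vector of P²·C values, linearly projects it to a D-dimensional embedding, and adds a position embedding. The sequence length is N = HW / P², assuming H and W are divisible by P. For example, a 224×224 image with P = 16 yields 196 tokens.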
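The patching step can be sketched in a few lines. This is a minimal illustration on a single-channel image stored as a list of rows; the function names are hypothetical, and the learned linear projection to D dimensions is left out.

```python
def num_patches(height, width, patch):
    """Sequence length N = HW / P^2 for an image that divides evenly into patches."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    return (height // patch) * (width // patch)


def patchify(image, patch):
    """Split an H x W grid into flattened P x P patches in row-major order."""
    height, width = len(image), len(image[0])
    num_patches(height, width, patch)  # validates divisibility
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches


# A 4x4 toy image whose pixel values are 0..15, split into 2x2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, 2)

print(len(patches), patches[0])
print(num_patches(224, 224, 16))
```

Each flattened patch would then be multiplied by a learned projection matrix to become one token.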
Self-attention provides global context: each token attends to all others. This is critical for OCR where characters depend on neighbors and layout.
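As a rough illustration of "each token attends to all others", here is single-head scaled dot-product attention in plain Python. It is a simplification: the query, key, and value projections are taken to be the identity, whereas a real ViT learns them, and the token vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Score this token against every token in the sequence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted average of all token vectors.
        outputs.append([sum(w * value[d] for w, value in zip(row, tokens))
                        for d in range(dim)])
    return outputs, weights


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # three toy patch embeddings
outputs, weights = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Every row of the weight matrix is nonzero everywhere, so even distant patches contribute to each token's output; similar tokens simply contribute more.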
4. Multimodal models: connecting vision with language
Contrastive training learns a joint embedding space for images and text. Given a batch of (image, text) pairs, the model maximizes the similarity of true pairs and minimizes the similarity of mismatched pairs.
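The objective can be sketched as a symmetric cross-entropy over the batch similarity matrix, in the style of CLIP. The embeddings below are toy values and the temperature of 0.07 is an illustrative choice, so treat this as a sketch under those assumptions and not a training recipe.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i in one list matches pair i in the other."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    # Each image should pick out its own text, and each text its own image.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy([logits[j][i] for j in range(n)], i)
                        for i in range(n)) / n
    return (image_to_text + text_to_image) / 2


image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(image_embs, [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss(image_embs, [[0.0, 1.0], [1.0, 0.0]])

print(matched < mismatched)
```

The loss is near zero when each image embedding lines up with its own caption and large when the captions are swapped.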
For OCR, language context resolves ambiguity (e.g., "O" vs "0"). This turns character recognition into sequence understanding.
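A toy version of the "O" versus "0" case makes the point concrete. The rule below is a deliberately crude stand-in for a language model: it looks only at whether the rest of the token is mostly digits or mostly letters. The function name and the mapping tables are hypothetical.

```python
TO_DIGIT = {"O": "0", "o": "0"}   # letter read where a digit is more plausible
TO_LETTER = {"0": "O"}            # digit read where a letter is more plausible


def resolve_token(token):
    """Pick 'O' or '0' based on what the surrounding characters look like."""
    digits = sum(ch.isdigit() for ch in token)
    letters = sum(ch.isalpha() for ch in token)
    if digits > letters:
        return "".join(TO_DIGIT.get(ch, ch) for ch in token)
    if letters > digits:
        return "".join(TO_LETTER.get(ch, ch) for ch in token)
    return token  # no majority, so leave the raw reading untouched


for raw in ["2O24", "C0DE", "O0"]:
    print(raw, "->", resolve_token(raw))
```

A rule like this breaks on mixed tokens such as part numbers, which is where a learned language prior over whole sequences earns its keep.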
5. A practical OCR pipeline
Image → quality checks (blur, skew, exposure) → region detection (text boxes) → crop at high resolution → encoder (ViT/CNN) → decoder / VLM → post-processing (rules, dictionaries) → validation (confidence, business constraints)
Key idea: separate detection (where text is) from recognition (what text is), and preserve resolution before recognition.
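The separation of detection from recognition can be expressed as a small skeleton. Everything here is hypothetical scaffolding: `detect` and `recognize` stand for whatever models are plugged in, the stubs at the bottom only exist to make the sketch runnable, and the 0.5 confidence threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
Image = List[List[int]]           # grayscale image as a list of rows


@dataclass
class Recognition:
    box: Box
    text: str
    confidence: float


def crop(image: Image, box: Box) -> Image:
    """Cut the region out of the original image, at its original resolution."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def run_ocr(image: Image,
            detect: Callable[[Image], List[Box]],
            recognize: Callable[[Image], Tuple[str, float]],
            min_confidence: float = 0.5) -> List[Recognition]:
    """Detect regions first, then recognize each full-resolution crop."""
    results = []
    for box in detect(image):
        text, confidence = recognize(crop(image, box))
        if confidence >= min_confidence:   # validation step
            results.append(Recognition(box, text, confidence))
    return results


# Stub models, for illustration only.
def fake_detect(image: Image) -> List[Box]:
    return [(0, 0, 4, 2), (5, 2, 7, 4)]


def fake_recognize(region: Image) -> Tuple[str, float]:
    wide = len(region[0]) > 3
    return ("wide", 0.9) if wide else ("narrow", 0.2)


page = [[0] * 8 for _ in range(4)]
results = run_ocr(page, fake_detect, fake_recognize)
print(results)
```

Because the crop is taken from the original image and not from a downsized copy, the recognizer sees each region at the best resolution available.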
6. The resolution curse
Fixed input sizes force downsampling. If a character shrinks to only a few pixels after the resize, its strokes merge and the information is lost irreversibly: no decoder can recover what the encoder never saw.
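The arithmetic is worth doing explicitly. The sketch below assumes an aspect-preserving resize driven by image height; the page size, character height, and the 8-pixel legibility threshold are illustrative numbers, not measured values.

```python
import math


def char_height_after_resize(char_height_px, image_height_px, input_height_px):
    """Character height in pixels once the image is scaled to the model's input height."""
    return char_height_px * input_height_px / image_height_px


def min_input_height(char_height_px, image_height_px, min_legible_px):
    """Smallest input height that keeps characters at or above min_legible_px."""
    return math.ceil(min_legible_px * image_height_px / char_height_px)


# Hypothetical scan: 3300 px tall, with characters about 40 px tall.
for input_height in (224, 448, 896):
    height = char_height_after_resize(40, 3300, input_height)
    print(input_height, round(height, 2))

# Assumed threshold: characters need roughly 8 px of height to stay legible.
print(min_input_height(40, 3300, 8))
```

Under these assumptions a 224-pixel input leaves each character under three pixels tall, far smaller than a single 16×16 patch.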
Mitigation: tiling/cropping, multi-scale inference, and higher-resolution encoders.
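Tiling can be sketched as generating overlapping crop boxes at the encoder's native size, so that no region is downsampled. The function names, tile size, and overlap below are assumptions for illustration; the overlap is there so that text cut by one tile boundary appears whole in a neighbor.

```python
def tile_starts(length, tile, overlap):
    """Start offsets along one axis so tiles cover [0, length) with the given overlap."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)   # final tile sits flush with the far edge
    return starts


def tile_boxes(width, height, tile, overlap):
    """Crop boxes (left, top, right, bottom) that cover the whole image."""
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in tile_starts(height, tile, overlap)
            for x in tile_starts(width, tile, overlap)]


# A 1000x448 strip covered by 448-pixel tiles with 64 pixels of overlap.
boxes = tile_boxes(1000, 448, 448, 64)
for box in boxes:
    print(box)
```

Each box is then recognized separately, and the per-tile results are merged, which requires de-duplicating text that appears in two overlapping tiles.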
7. Practical takeaway
- CNNs: strong local priors, efficient.
- ViT: global context via attention.
- VLMs: align vision with language for robustness.
- Pipeline matters more than a single model.
OCR today is sequence understanding under visual constraints.
References
- Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ViT)
- Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP)
- Vaswani et al., "Attention Is All You Need"
- Hugging Face Blog: "The Resolution Curse in Vision-Language Models"
- Standard OCR pipelines: detection + recognition + post-processing (industry practice)