
Vision Transformers and Multimodal Models for OCR

A concise, technical overview of how modern AI systems read text from images, and why resolution and context matter.

Vision Transformer · Multimodal AI · OCR

1. The problem

OCR is the task of converting the text in an image into machine-readable text. This sounds simple, but the core difficulty is that the same text can appear in many visual forms: different fonts, sizes, lighting, angles, and backgrounds.

Formally, we want to map a pixel grid I ∈ R^(H×W×C) to a sequence of tokens y = (y₁, …, yₙ). This mapping is ambiguous and depends heavily on context.

[Figure: Pixels (raw image values) → Visual patterns (ambiguous shapes, e.g. O, 0, 1, l) → Text output (final interpretation, e.g. "100 kcal"). The same pixels allow multiple interpretations; context resolves the ambiguity.]
OCR is not a direct mapping from pixels to characters. It requires resolving ambiguity using context.

The key insight: OCR is not just perception; it is also a reasoning problem. The same visual input can map to different characters, and only context (surrounding text, layout, semantics) determines the correct output.
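To make the role of context concrete, here is a minimal sketch in which a unit word following a token decides whether confusable letters should be read as digits. The confusion table, the unit list, and the function name `resolve_with_context` are illustrative assumptions, not part of any particular OCR system.

```python
# Hypothetical confusion table: letters that are often misread for digits.
LETTER_TO_DIGIT = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
# Hypothetical unit words that signal "the previous token should be a number".
UNITS = {"kcal", "g", "mg", "ml", "kg"}


def resolve_with_context(tokens):
    """Rewrite confusable letters as digits when the next token is a unit."""
    resolved = []
    for i, token in enumerate(tokens):
        next_token = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if next_token in UNITS:
            candidate = "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in token)
            # Accept the rewrite only if it yields a clean number.
            if candidate.isdigit():
                token = candidate
        resolved.append(token)
    return resolved


calories = resolve_with_context(["lOO", "kcal"])
phrase = resolve_with_context(["Oil", "change"])
```

The same glyphs are rewritten in the first call and left alone in the second, purely because of the neighboring token. A learned language model plays this role in modern systems; the rule here only stands in for it.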

2. Earlier approach: CNNs

CNNs apply local filters (convolutions) to learn hierarchical features. They encode strong inductive biases: locality and translation equivariance.

Strengths: efficient and robust on clean, structured documents. Limitations: weak global reasoning, because each filter sees only a local neighborhood, so long-range dependencies and page layout are hard to capture.
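The two inductive biases can be checked numerically. The sketch below uses a naive NumPy convolution (written out by hand for clarity, not taken from any framework) and verifies that shifting the input shifts the output by the same amount, and that the response far from the glyph is zero.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Naive 2D cross-correlation with 'valid' padding (no kernel flip)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Locality: each output value depends only on a kh x kw window.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = np.zeros((12, 12))
image[3:6, 2:5] = rng.random((3, 3))      # a small synthetic "glyph"
kernel = rng.random((3, 3))

# Move the glyph 2 pixels down and 3 pixels right (it stays inside the frame).
shifted = np.roll(image, shift=(2, 3), axis=(0, 1))

out = conv2d_valid(image, kernel)
out_shifted = conv2d_valid(shifted, kernel)

# Translation equivariance: filter-then-shift equals shift-then-filter.
equivariant = np.allclose(np.roll(out, shift=(2, 3), axis=(0, 1)), out_shifted)
# Locality: a window far from the glyph produces no response.
far_response = out[8, 8]
```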

3. Vision Transformer: images as tokens

ViT splits an image into fixed-size P×P patches, flattens each patch, and linearly projects it to a D-dimensional embedding. The sequence length is N = HW / P²; for example, a 224×224 image with P = 16 yields 196 tokens.

[Figure: Image → Patches → Tokens → Transformer]
ViT processes images as sequences with global self-attention.
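The patch step is mostly a reshape. The following NumPy sketch shows it for a 224×224×3 image with P = 16; the embedding width D = 64 and the random projection matrix are placeholders for learned parameters, chosen only to keep the example small.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into N = HW / P^2 flattened patches."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "H and W must be divisible by P"
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (H/P, W/P, P, P, C)
    return patches.reshape(-1, p * p * c)        # (N, P*P*C)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
P, D = 16, 64                                    # D is an arbitrary choice here

patches = patchify(image, P)                     # one row per patch
# Random matrix standing in for the learned linear projection.
projection = rng.normal(scale=0.02, size=(P * P * 3, D))
tokens = patches @ projection                    # (N, D) token sequence
```

In a full ViT, position embeddings are added to these tokens before the Transformer layers so the model knows where each patch came from.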

Self-attention provides global context: each token attends to all others. This is critical for OCR, where the reading of a character depends on its neighbors and on the page layout.
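A single attention head can be written in a few lines of NumPy. This is a minimal sketch with random weights standing in for learned ones; it omits multiple heads, residual connections, and normalization.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over an (N, D) sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (N, N) pairwise scores
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
n_tokens, d_model = 196, 64
tokens = rng.normal(size=(n_tokens, d_model))
# Random matrices standing in for learned query/key/value projections.
w_q, w_k, w_v = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(3)
)

output, weights = self_attention(tokens, w_q, w_k, w_v)
```

The N×N weight matrix is what makes the context global: row i holds how much token i draws from every other token, regardless of distance in the image.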

4. Multimodal models: connecting vision with language

Contrastive training learns a joint embedding space for images and text. Given a batch of (image, text) pairs, the model maximizes similarity for true pairs and minimizes it for mismatched pairs.

Shared space
Image and text encoders learn comparable representations.
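The objective can be sketched as a symmetric cross-entropy over the batch similarity matrix, in the style of CLIP. The embeddings below are random stand-ins for encoder outputs, and the temperature value is an illustrative assumption.

```python
import numpy as np


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature        # (B, B) similarities
    idx = np.arange(len(logits))
    # True pairs sit on the diagonal; everything else is a mismatch.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 32))
aligned = text_emb + 0.01 * rng.normal(size=(8, 32))     # images match texts
shuffled = np.roll(text_emb, 1, axis=0)                  # every pair is wrong

loss_aligned = contrastive_loss(aligned, text_emb)
loss_shuffled = contrastive_loss(shuffled, text_emb)
```

The loss is small when matching pairs lie close together and large when they do not, which is the pressure that pulls the two encoders into a shared space.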

For OCR, language context resolves ambiguity (e.g., "O" vs "0"). This turns character recognition into sequence understanding.

5. A practical OCR pipeline

Image
  → quality checks (blur, skew, exposure)
  → region detection (text boxes)
  → crop at high resolution
  → encoder (ViT/CNN)
  → decoder / VLM
  → post-processing (rules, dictionaries)
  → validation (confidence, business constraints)
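As one example of a quality check, blur can be estimated with the variance of the Laplacian, a common sharpness heuristic. This is a sketch of that heuristic rather than the check any specific pipeline uses; the box blur exists only to simulate a bad capture, and any rejection threshold would have to be tuned per application.

```python
import numpy as np


def laplacian_variance(gray):
    """Variance of a 4-neighbor Laplacian; low values suggest a blurry image."""
    g = gray.astype(np.float64)
    lap = (
        g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
        - 4.0 * g[1:-1, 1:-1]
    )
    return lap.var()


def box_blur(gray, k=5):
    """k x k mean filter, used here only to simulate a blurry capture."""
    pad = k // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    out = np.zeros(gray.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + gray.shape[0], dx:dx + gray.shape[1]]
    return out / (k * k)


rng = np.random.default_rng(0)
sharp = (rng.random((64, 64)) > 0.5).astype(np.float64) * 255  # high-contrast noise
blurry = box_blur(sharp)

sharp_score = laplacian_variance(sharp)
blurry_score = laplacian_variance(blurry)
```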

Key idea: separate detection (where text is) from recognition (what text is), and preserve resolution before recognition.
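One way to preserve resolution is to run detection on a small copy of the image, then map the boxes back and crop from the original. The sketch below assumes a hypothetical detector that returns (x0, y0, x1, y1) boxes in thumbnail coordinates; the function names and sizes are illustrative.

```python
import math

import numpy as np


def scale_box(box, scale_x, scale_y):
    """Map a box from thumbnail to full-image coordinates, rounding outward."""
    x0, y0, x1, y1 = box
    return (
        math.floor(x0 * scale_x),
        math.floor(y0 * scale_y),
        math.ceil(x1 * scale_x),
        math.ceil(y1 * scale_y),
    )


def crop_at_full_resolution(full_image, boxes_lowres, lowres_shape):
    """Crop regions from the original image using boxes found on a thumbnail."""
    full_h, full_w = full_image.shape[:2]
    low_h, low_w = lowres_shape
    sx, sy = full_w / low_w, full_h / low_h
    crops = []
    for box in boxes_lowres:
        x0, y0, x1, y1 = scale_box(box, sx, sy)
        # Clamp to the image bounds before slicing.
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, full_w), min(y1, full_h)
        crops.append(full_image[y0:y1, x0:x1])
    return crops


full = np.zeros((3000, 4000, 3), dtype=np.uint8)   # original capture (H, W, C)
boxes = [(10, 20, 110, 40)]                        # hypothetical detector output
crops = crop_at_full_resolution(full, boxes, lowres_shape=(300, 400))
```

The recognizer then sees a 1000×200 crop instead of the 100×20 region it would get from the thumbnail.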

6. The resolution curse

Fixed input sizes force downsampling. If a character spans only a few pixels after the resize, its strokes merge and the information is lost irreversibly; no later stage can reconstruct it.

Downsampling removes fine details required for small text recognition.

If the text is too small after resizing, the model cannot recover it.
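The effect is easy to quantify. The numbers below are illustrative assumptions (a 4000×3000 photo, 24-pixel-tall text, and an encoder that resizes the long side to 448 pixels), not measurements from a specific model.

```python
def height_after_resize(char_height_px, image_size, target_long_side):
    """Character height in pixels after the long side is resized to the target."""
    h, w = image_size
    scale = target_long_side / max(h, w)
    return char_height_px * scale


# Illustrative values: 24 px text in a 3000 x 4000 (H x W) photo, 448 px input.
resized_height = height_after_resize(24, (3000, 4000), 448)
# The same text cropped to a 448-wide region first is not downsampled at all.
cropped_height = height_after_resize(24, (200, 448), 448)
```

Under these assumptions the text shrinks to under 3 pixels tall when the whole photo is resized, while a crop taken before resizing keeps all 24.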

Mitigation: tiling/cropping, multi-scale inference, and higher-resolution encoders.
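Tiling can be sketched as follows: the image is cut into fixed-size tiles that overlap, so text on a tile border appears whole in at least one tile. The tile size of 448 and overlap of 64 are assumed values, and the sketch expects the image to be at least one tile in each dimension.

```python
import numpy as np


def tile_starts(length, tile, overlap):
    """Start offsets so that tiles cover [0, length) with at least `overlap` shared pixels."""
    assert 0 <= overlap < tile
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)      # last tile sits flush with the edge
    return starts


def tile_image(image, tile=448, overlap=64):
    """Return ((x, y), tile_array) pairs covering the whole image."""
    h, w = image.shape[:2]
    tiles = []
    for y in tile_starts(h, tile, overlap):
        for x in tile_starts(w, tile, overlap):
            tiles.append(((x, y), image[y:y + tile, x:x + tile]))
    return tiles


image = np.zeros((1000, 1500, 3), dtype=np.uint8)
tiles = tile_image(image)
```

Each tile is passed to the encoder at native resolution, and the stored (x, y) offsets allow recognized text to be placed back in page coordinates. Results in the overlap regions need deduplication, which is left out here.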

7. Practical takeaway

  • CNNs: strong local priors, efficient.
  • ViT: global context via attention.
  • VLMs: align vision with language for robustness.
  • Pipeline matters more than a single model.

OCR today is sequence understanding under visual constraints.

References

  • Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ViT)
  • Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP)
  • Vaswani et al., "Attention Is All You Need"
  • Hugging Face Blog: "The Resolution Curse in Vision-Language Models"
  • Standard OCR pipelines: detection + recognition + post-processing (industry practice)