How GLM-4.5V Was Developed

GLM-4.5V marks a significant milestone in multimodal AI, showcasing enhanced performance through innovative architecture and training methods.

Introduction

GLM-4.5V delivers substantial improvements in multimodal understanding and reasoning. Through its distinctive architectural design, carefully refined data construction, and the application of reinforcement learning, it demonstrates strong performance and broad application potential.


Last night, GLM-4.5V was released in the open-source multimodal arena, shaking things up.

Performance

GLM-4.5V has significantly improved compared to previous models in multimodal understanding and reasoning.

[Image: benchmark comparison of GLM-4.5V against earlier models across task domains]

In the above image, GLM-4.5V outperforms its predecessors in various fields such as STEM, spatial reasoning, GUI agents, OCR, document understanding, code understanding, video understanding, visual localization, and general VQA.

A reinforcement learning (RL) framework is the backbone of GLM-4.5V's performance gains.

[Image: performance gains from reinforcement learning by task category]

After reinforcement learning, the model showed a gain of up to +10.6% in coding tasks and +6.7% in STEM problems.

[Image: GLM-4.5V benchmark scores on high-difficulty tasks]

GLM-4.5V achieved the best scores in nearly all high-difficulty tasks, including MMStar (75.3), MMMU Pro (65.2), MathVista (84.6), ChartQAPro (64.0), and WebVoyager (84.4).

Architecture

The architectural design of GLM-4.5V focuses on three goals: native multimodal support, high resolution, and strong temporal understanding. It consists of three components: a ViT visual encoder, an MLP projector, and an LLM language decoder.
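The data flow through these three components can be sketched as follows. This is a toy illustration with numpy: the dimensions, the ReLU activation, and the single-projection layout are illustrative assumptions, not GLM-4.5V's actual configuration.

```python
import numpy as np

# Hypothetical embedding sizes for illustration only.
VIT_DIM, LLM_DIM, NUM_PATCHES, NUM_TEXT = 64, 128, 16, 8

rng = np.random.default_rng(0)
w1 = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.02
w2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02

def projector(patch_features):
    """MLP projector: maps ViT patch features into the LLM embedding space."""
    hidden = np.maximum(patch_features @ w1, 0.0)   # real models use GELU; ReLU here
    return hidden @ w2

def assemble_decoder_input(patch_features, text_embeds):
    """Projected visual tokens are placed alongside text embeddings, and the
    LLM decoder attends over the combined sequence."""
    visual_tokens = projector(patch_features)       # [NUM_PATCHES, LLM_DIM]
    return np.concatenate([visual_tokens, text_embeds], axis=0)

seq = assemble_decoder_input(
    rng.standard_normal((NUM_PATCHES, VIT_DIM)),
    rng.standard_normal((NUM_TEXT, LLM_DIM)),
)
# seq.shape == (NUM_PATCHES + NUM_TEXT, LLM_DIM) == (24, 128)
```

The key point is that the projector is the only bridge between the two embedding spaces: the encoder and decoder never see each other's native representations.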

[Image: GLM-4.5V architecture diagram — ViT encoder, MLP projector, LLM decoder]

Visual Encoder

Initialized from AIMv2-Huge, it incorporates 2D-RoPE and 3D convolution to natively handle images and videos of arbitrary resolution while effectively capturing temporal information.

Language Decoder

Based on GLM-4.5-Air, it enhances the understanding of spatial positions in multimodal inputs by extending 3D-RoPE.

Native Temporal Understanding

When processing videos, the model inserts a timestamp token after the visual features of each frame, allowing it to clearly perceive the real time intervals between frames, greatly improving the accuracy of video understanding and localization.
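The interleaving described above can be sketched in a few lines. The token string format here is illustrative, not GLM-4.5V's actual vocabulary:

```python
def interleave_timestamps(frame_tokens, timestamps_sec):
    """Insert a timestamp marker after each frame's visual tokens so the
    decoder can see the real time gap between frames."""
    sequence = []
    for tokens, t in zip(frame_tokens, timestamps_sec):
        sequence.extend(tokens)                 # visual tokens for this frame
        sequence.append(f"<ts:{t:.1f}>")        # timestamp token after the frame
    return sequence

# Frames sampled at uneven intervals: the model can tell that 2.0 seconds
# elapsed between the second and third frames, not just "one frame later".
seq = interleave_timestamps([["v1a", "v1b"], ["v2a"], ["v3a"]], [0.0, 0.5, 2.5])
# seq == ["v1a", "v1b", "<ts:0.0>", "v2a", "<ts:0.5>", "v3a", "<ts:2.5>"]
```

Without these markers, a model fed uniformly spaced frame tokens has no way to distinguish a 0.5-second gap from a 2-second one.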

Pre-training

The pre-training of GLM-4.5V consists of two parts: data construction and training paradigm.

Data Construction

The pre-training corpus of GLM-4.5V covers multidimensional data, including:

Image-Text Pair Data

Over 10 billion high-quality image-text pairs were constructed through a refined pipeline of heuristic filtering, CLIP-score selection, concept-balanced resampling, and factuality-centered recaptioning.

Recaptioning gives each image a richer description. For example, the simple caption “a northern cardinal singing” can be enriched to “a northern cardinal perched on a branch, with a clear blue sky in the background,” preserving the facts while greatly increasing detail and information density.
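Two of the curation steps named above can be sketched as a toy pipeline. The threshold, per-concept cap, and the `clip_score`/`concept` fields are illustrative assumptions, not values from the paper:

```python
import random
from collections import defaultdict

def curate_pairs(pairs, clip_threshold=0.28, cap_per_concept=2):
    """Toy curation pipeline: CLIP-score selection, then concept-balanced
    resampling so frequent concepts do not drown out rare ones."""
    # 1. Keep only pairs whose image-text similarity clears the threshold.
    kept = [p for p in pairs if p["clip_score"] >= clip_threshold]
    # 2. Cap each concept's count to rebalance the distribution.
    by_concept = defaultdict(list)
    for p in kept:
        by_concept[p["concept"]].append(p)
    balanced = []
    for concept, items in by_concept.items():
        random.shuffle(items)
        balanced.extend(items[:cap_per_concept])
    return balanced
```

In a real pipeline the threshold would be tuned per data source, and resampling would target a smoothed concept distribution rather than a hard cap.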

Interleaved Image-Text Data

High-quality mixed content was extracted from webpages and academic books to help the model learn complex logical relationships and domain knowledge.

OCR Data

An OCR dataset containing 220 million images was constructed, covering synthetic documents, natural scene text, and academic documents, significantly enhancing text recognition capabilities.

Grounding Data

A mixed grounding dataset was created with 40 million natural image annotations and over 140 million GUI question-answer pairs, endowing the model with precise pixel-level understanding capabilities.

Video Data

A high-quality video dataset was built through detailed manual annotation to capture complex actions, scene text, and cinematic elements.

Training Paradigm: Two-Stage, Long Context

GLM-4.5V’s training adopts a two-stage strategy:

Multimodal Pre-training

Using all data except video, the model was trained for 120,000 steps at a sequence length of 8,192.

Long-Context Continued Training

The sequence length was extended to 32,768 and video data was incorporated for an additional 10,000 training steps, enabling the model to handle high-resolution images, long videos, and lengthy documents.
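The two-stage schedule can be captured as plain configuration data. The field names and data-mix labels are illustrative; the sequence lengths and step counts come from the text:

```python
# Two-stage pre-training schedule, as described in the text.
PRETRAIN_STAGES = [
    {
        "name": "multimodal_pretraining",
        "seq_len": 8192,
        "steps": 120_000,
        "data": ["image_text", "interleaved", "ocr", "grounding"],  # no video yet
    },
    {
        "name": "long_context_continued_training",
        "seq_len": 32_768,
        "steps": 10_000,
        "data": ["image_text", "interleaved", "ocr", "grounding", "video"],
    },
]

total_steps = sum(s["steps"] for s in PRETRAIN_STAGES)  # 130,000 steps overall
```

Note the asymmetry: the long-context stage is less than a tenth of the total steps, but it is the only stage that sees video and 32K-token sequences.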

Post-training: SFT and RL

The post-training phase is crucial for enhancing GLM-4.5V’s reasoning capabilities, consisting of supervised fine-tuning (SFT) and reinforcement learning (RL).

Supervised Fine-Tuning (SFT): Aligning Thought Paradigms

The goal of SFT is to align the model’s thinking and expression style, teaching it to reason in the form of a “Chain-of-Thought.”

Standard Format

All training data follows a standard two-part format: a {thought process} section followed by a {final answer}.

Answer Extraction: For tasks requiring precise answers, the final answer is wrapped in special tokens <|begin_of_box|> and <|end_of_box|> to facilitate accurate judgment by the subsequent RL phase’s reward model.
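Extracting the boxed answer for the reward model is a simple matter of pattern matching. A minimal sketch of the idea:

```python
import re

# The special tokens come from the text; the extraction logic is illustrative.
BOX_RE = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.DOTALL)

def extract_boxed_answer(completion):
    """Pull the final answer out of the special tokens so a rule-based
    verifier can compare it against the reference answer."""
    m = BOX_RE.search(completion)
    return m.group(1).strip() if m else None

out = extract_boxed_answer("Let me reason... <|begin_of_box|>42<|end_of_box|>")
# out == "42"
```

Wrapping the answer in dedicated tokens, rather than parsing free-form text, is what makes the downstream reward check reliable: the verifier never has to guess which part of a long chain-of-thought is the answer.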

Dual Modality Support: GLM-4.5V mixes “thinking” and “non-thinking” data during the SFT phase and introduces the special token /nothink, enabling flexible switching between the two reasoning modes and balancing performance against efficiency.
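At inference time, mode switching reduces to appending the control token to the user turn. The /nothink token comes from the text; the surrounding prompt handling is a simplified, hypothetical stand-in for the real chat template:

```python
def build_prompt(question, thinking=True):
    """Toggle between reasoning modes: append /nothink to request a direct
    answer without a chain-of-thought."""
    return question if thinking else question + " /nothink"

slow = build_prompt("What is 17 * 3?")                   # model reasons step by step
fast = build_prompt("What is 17 * 3?", thinking=False)   # model answers directly
```

Because both modes were mixed during SFT, the same checkpoint serves latency-sensitive and accuracy-sensitive requests without switching models.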

Reinforcement Learning (RL): Unlocking Model Potential

GLM-4.5V enhances reasoning capabilities through large-scale, cross-domain reinforcement learning.

Curriculum Sampling (RLCS)

To improve training efficiency, the team proposed Reinforcement Learning with Curriculum Sampling (RLCS), which dynamically selects “moderately difficult” training samples based on the model’s current ability, avoiding wasted computational resources on overly easy or difficult problems, thus maximizing the benefits of each training step.
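The selection principle can be sketched with a simple scoring rule: prefer samples whose current pass rate is near 50%, since those are neither trivially solved nor hopeless. This scoring function is an illustrative stand-in for RLCS's actual selection logic:

```python
def rlcs_select(samples, batch_size=4):
    """Curriculum sampling sketch: rank samples by how informative they are
    at the model's current ability level and keep the top batch."""
    def informativeness(s):
        # pass_rate 0.5 -> 0.5 (max); pass_rate 0.0 or 1.0 -> 0.0 (useless)
        return min(s["pass_rate"], 1.0 - s["pass_rate"])
    ranked = sorted(samples, key=informativeness, reverse=True)
    return ranked[:batch_size]

pool = [{"id": i, "pass_rate": r}
        for i, r in enumerate([0.0, 0.1, 0.5, 0.6, 0.95, 1.0])]
batch = rlcs_select(pool, batch_size=2)
# Picks the samples with pass rates 0.5 and 0.6 — the "moderately difficult" ones.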

Robust Reward System

The success of RL largely depends on the quality of the reward signal. GLM-4.5V established a domain-specific reward system, designing specialized validation logic for different tasks such as mathematics, OCR, and GUI to avoid the phenomenon of “reward hacking.”

Image 19

As shown in the image above, even with high-quality reward signals in the STEM field, the presence of a flawed reward model in a multi-image VQA task can lead to a complete collapse of the entire training process after 150 steps.

This illustrates that any shortcoming can become a critical failure point for RL training.

Cross-domain generalization and collaborative RL not only enhance the model’s capabilities in specific domains but also bring significant cross-domain generalization effects.

Image 20

As shown in the image above, training in a single domain can enhance capabilities in other domains. For example, training solely on GUI Agent data can boost performance in STEM, OCR, visual localization, and general VQA.

This indicates that there exists a shared underlying logic between different multimodal capabilities, and mixing all domain data for training (Mix All) can achieve stronger results than single-domain training, realizing a synergistic effect of “1+1 > 2.”

Conclusion

The training of GLM-4.5V includes:

  • Architecture: natively supports high resolution, long videos, and temporal understanding.
  • Pre-training: refined data construction and two-stage training.
  • SFT: aligns the model with the “Chain-of-Thought” reasoning paradigm, preparing it for the RL phase.
  • RL: enhances reasoning through RLCS, a robust reward system, and cross-domain training.

Stay tuned for the upcoming GLM-4.5V-355B.

Was this helpful?

Likes and saves are stored in your browser on this device only (local storage) and are not uploaded to our servers.

Comments

Discussion is powered by Giscus (GitHub Discussions). Add repo, repoID, category, and categoryID under [params.comments.giscus] in hugo.toml using the values from the Giscus setup tool.