Overview of FlexiMMT. (a) Training: given one reference video, we insert trainable motion tokens alongside the text and video tokens. The object mask is obtained through a simple \(QK\) multiplication, and the M2M, M2V, and T2M parts of the attention map are then masked out. (b) Inference: given a multi-object conditional image, we first segment each object's mask with a semantic segmentation model, then concatenate the pre-trained motion tokens with the text and video tokens for inference. Each object's latent-space mask in subsequent frames is extracted via the Dynamic Regressive Mask Propagation Mechanism (Dynamic RMPM) and applied to the Motion and Text parts of the attention map.
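The block masking described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the tokens are concatenated in the order [text | motion | video], and that M2M, M2V, and T2M denote motion-to-motion, motion-to-video, and text-to-motion query/key blocks respectively; the function name and argument names are hypothetical.

```python
def build_attention_mask(n_text, n_motion, n_video):
    """Boolean attention mask over concatenated [text | motion | video] tokens.

    mask[q][k] == True means query token q may attend to key token k.
    The M2M, M2V, and T2M blocks are disallowed, mirroring the training-time
    masking in the caption (token ordering is an assumption for illustration).
    """
    n = n_text + n_motion + n_video
    mask = [[True] * n for _ in range(n)]
    text = range(0, n_text)
    motion = range(n_text, n_text + n_motion)
    video = range(n_text + n_motion, n)
    for q in motion:
        for k in motion:   # M2M: motion queries blocked from motion keys
            mask[q][k] = False
        for k in video:    # M2V: motion queries blocked from video keys
            mask[q][k] = False
    for q in text:
        for k in motion:   # T2M: text queries blocked from motion keys
            mask[q][k] = False
    return mask

# Example: 4 text, 2 motion, 8 video tokens -> 14x14 mask
mask = build_attention_mask(4, 2, 8)
```

In a transformer, such a mask would typically be converted to additive form (0 where allowed, a large negative value where blocked) and added to the attention logits before the softmax.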
@article{li2026letimagemotion,
title={Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer},
author={Yuze Li and Dong Gong and Xiao Cao and Junchao Yuan and Dongsheng Li and Lei Zhou and Yun Sing Koh and Cheng Yan and Xinyu Zhang},
year={2026},
eprint={2603.01000},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.01000},
}