Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer

1Tianjin University, 2University of New South Wales, 3University of Electronic Science and Technology of China, 4Hainan University, 5University of Auckland

Given multiple reference videos, our FlexiMMT extracts each motion independently and applies it to images containing any number of objects, enabling precise and compositional multi-object multi-motion transfer.

Abstract

Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and efficiently propagates them across subsequent frames. Extensive experiments demonstrate that FlexiMMT achieves precise and compositional motion control, attaining state-of-the-art performance in I2V-based multi-object multi-motion transfer.

Demo

Method


Overview of FlexiMMT. (a) Training: given one reference video, trainable motion tokens are inserted alongside the text and video tokens. The object mask is obtained through a simple \(QK\) multiplication, and the M2M, M2V, and T2M parts of the attention map are then masked out accordingly. (b) Inference: given a multi-object conditional image, each object's mask is first segmented with a semantic segmentation model. The pre-trained motion tokens are concatenated with the text and video tokens for inference. Each object's latent-space mask in subsequent frames is extracted via the Dynamic Regressive Mask Propagation Mechanism (Dynamic RMPM) and applied to the motion and text parts of the attention map.
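To make the attention-masking idea concrete, the sketch below shows one way such a decoupled attention mask could be built. This is a minimal NumPy toy, not the paper's implementation: the token layout (per-object motion tokens followed by video tokens), the function names, and the mask convention (True = attention allowed) are all our own assumptions for illustration. The key property matches the mechanism's intent: motion tokens of one object cannot interact with other objects' motion tokens (M2M) or with video tokens outside their object's region (M2V).

```python
import numpy as np

def build_decoupled_mask(num_objects, tokens_per_motion, video_len, video_obj_masks):
    """Build a boolean attention mask (True = attention allowed).

    Assumed token layout (illustrative): [motion tokens of obj 0 | ... |
    motion tokens of obj N-1 | video tokens].
    video_obj_masks: (num_objects, video_len) bool array, True where a video
    token lies inside that object's spatial mask.
    """
    n_motion = num_objects * tokens_per_motion
    n = n_motion + video_len
    allow = np.ones((n, n), dtype=bool)
    for i in range(num_objects):
        s, e = i * tokens_per_motion, (i + 1) * tokens_per_motion
        # M2M: block interaction with other objects' motion tokens.
        for j in range(num_objects):
            if j != i:
                allow[s:e, j * tokens_per_motion:(j + 1) * tokens_per_motion] = False
        # M2V / V2M: block video tokens outside object i's region.
        outside = np.where(~video_obj_masks[i])[0] + n_motion
        allow[s:e, outside] = False
        allow[np.ix_(outside, np.arange(s, e))] = False
    return allow

def masked_attention(q, k, v, allow):
    """Plain scaled dot-product attention with disallowed entries set to -inf."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allow, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```

With two objects, one motion token each, and four video tokens split evenly between the objects, `build_decoupled_mask(2, 1, 4, masks)` yields a 6x6 mask in which motion token 0 may attend only to itself and to object 0's two video tokens.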

Qualitative Results

Qualitative Comparison

BibTeX


      @article{li2026letimagemotion,
            title={Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer}, 
            author={Yuze Li and Dong Gong and Xiao Cao and Junchao Yuan and Dongsheng Li and Lei Zhou and Yun Sing Koh and Cheng Yan and Xinyu Zhang},
            year={2026},
            eprint={2603.01000},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2603.01000}, 
      }