Overview of FlexiMMT. (a) Training: given one reference video, we insert trainable motion tokens alongside the text and video tokens. The object mask is obtained through a simple \(QK\) multiplication, and the M2M, M2V, and T2M parts of the attention map are then masked out. (b) Inference: given a multi-object conditional image, we first segment each object's mask with a semantic segmentation model, then concatenate the pre-trained motion tokens with the text and video tokens for inference. Each object's latent-space mask in subsequent frames is extracted via the Dynamic Regressive Mask Propagation Mechanism (Dynamic RMPM) and applied to the Motion and Text parts of the attention map.
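The block masking described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the tokens are concatenated in the order [text | motion | video], and that M2M, M2V, and T2M denote motion-to-motion, motion-to-video, and text-to-motion query/key blocks respectively; the function name and argument names are hypothetical.

```python
def build_attention_mask(n_text, n_motion, n_video):
    """Boolean attention mask over concatenated [text | motion | video] tokens.

    mask[q][k] == True means query token q may attend to key token k.
    The M2M, M2V, and T2M blocks are disallowed, mirroring the training-time
    masking in the caption (token ordering is an assumption for illustration).
    """
    n = n_text + n_motion + n_video
    mask = [[True] * n for _ in range(n)]
    text = range(0, n_text)
    motion = range(n_text, n_text + n_motion)
    video = range(n_text + n_motion, n)
    for q in motion:
        for k in motion:   # M2M: motion queries blocked from motion keys
            mask[q][k] = False
        for k in video:    # M2V: motion queries blocked from video keys
            mask[q][k] = False
    for q in text:
        for k in motion:   # T2M: text queries blocked from motion keys
            mask[q][k] = False
    return mask

# Example: 4 text, 2 motion, 8 video tokens -> 14x14 mask
mask = build_attention_mask(4, 2, 8)
```

In a transformer, such a mask would typically be converted to additive form (0 where allowed, a large negative value where blocked) and added to the attention logits before the softmax.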
@article{li2026letimagemotion,
title={Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer},
author={Yuze Li and Dong Gong and Xiao Cao and Junchao Yuan and Dongsheng Li and Lei Zhou and Yun Sing Koh and Cheng Yan and Xinyu Zhang},
year={2026},
eprint={2603.01000},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.01000},
}