Abstract
Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are more pronounced in Diffusion Transformers (DiTs), where prior layer-selection heuristics are unsuitable, making effective guidance difficult. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. Specifically, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment, a mechanism that specifies what to edit and addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, allowing the model to fuse information dynamically. To specify where to edit, that is, where to apply this enrichment, we propose a systematic, data-driven analysis that identifies task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
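The idea of enriching the self-attention context, rather than replacing features, can be illustrated with a small sketch. This is not the ContextFlow implementation: it is a single-head, pure-Python toy that assumes the editing path supplies the queries and that Key-Value pairs from the two paths are concatenated along the token axis. The function names (`attention`, `enriched_attention`) and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    """Assumed form of context enrichment: editing-path queries attend over
    the union of editing-path and reconstruction-path Key-Value pairs."""
    keys = k_edit + k_recon      # concatenate along the token axis
    values = v_edit + v_recon
    return attention(q_edit, keys, values)


# Toy example: one query token, one K/V token per path (made-up numbers).
q_edit = [[1.0, 0.0]]
k_edit, v_edit = [[1.0, 0.0]], [[1.0, 0.0]]
k_recon, v_recon = [[0.0, 1.0]], [[0.0, 1.0]]

out = enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon)
print(out)
```

Because the softmax runs over both sets of keys, each output token is a weighted mix of editing-path and reconstruction-path values, with the weights determined by the query. In this sketch, that is what distinguishes enrichment from overwriting one path's features with the other's.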
Method

Object Insertion
Caption: A blue car is driving on the road.
Panels: ID, Reference Video, Ours, VACE, Kling, AnyV2V, Unic, Pika

Caption: A Pikachu is standing on the plane wing.
Panels: ID, Reference Video, Ours, VACE, Kling, AnyV2V, Unic, Pika

Caption: An octopus is swimming in the sea.
Panels: ID, Reference Video, Ours, VACE, Kling, AnyV2V, Unic, Pika

Caption: A red poppy flower surrounded by purple flowers. A large gorilla is gently trying to touch the red poppy flower.
Panels: ID, Reference Video, Ours, Kling, VACE, AnyV2V

Caption: A giant Pikachu is floating on the sea.
Panels: ID, Reference Video, Ours, VACE, AnyV2V
Object Swapping
Caption: A man in casual attire, including a black t-shirt, blue jeans, and sneakers, is seen walking on a paved path in a park with his Samoyed guide dog.
Panels: ID, Reference Video, Ours, VACE, AnyV2V, Unic, Pika

Caption: A man in a black tuxedo stands in a festive, dimly lit environment filled with sparkling lights and decorations.
Panels: ID, Reference Video, Ours, VACE, AnyV2V, Unic, Pika

Caption: A Pikachu sits on top of a toy jeep against a brick wall.
Panels: ID, Reference Video, Ours, VACE, AnyV2V, Unic, Pika

Caption: A colossal Statue of Liberty stands atop a white base with classical columns, surrounded by a dense forest.
Panels: ID, Reference Video, Ours, VACE, AnyV2V, Unic, Pika

Caption: In the scene, the young man with dark hair and a contemplative expression is now wearing a stylish black leather jacket.
Panels: ID, Reference Video, Ours, VACE, AnyV2V, Unic, Pika
Object Deletion
Caption: A man in a plaid shirt, denim overalls, and a white cap is seen using a pitchfork in a rural farm setting. He stands next to a green wheelbarrow filled with hay, against a backdrop of a wooden fence, a dark green-roofed building, and a clear blue sky.
Panels: Reference Video, Ours, VACE, AnyV2V, Kling

Caption: A serene landscape features a lush green field surrounded by rolling hills, a forest, and a small village nestled against a mountainous backdrop under a clear blue sky.
Panels: Reference Video, Ours, VACE, AnyV2V, Kling

Caption: Two young women, dressed in cozy winter attire, stand together in a tranquil snowy winter landscape, engaging in the warmth of sharing a hot drink.
Panels: Reference Video, Ours, VACE, AnyV2V, Kling

Caption: A cow is standing in the fence grazing. There is straw on the ground and a trough.
Panels: Reference Video, Ours, VACE, AnyV2V, Kling

Caption: A book.
Panels: Reference Video, Ours, VACE, AnyV2V, Kling
BibTeX
@misc{chen2025contextflowtrainingfreevideoobject,
title={ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment},
author={Yiyang Chen and Xuanhua He and Xiujun Ma and Yue Ma},
year={2025},
eprint={2509.17818},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.17818},
}