Dragging with Geometry: From Pixels to Geometry-Guided Image Editing

TL;DR: We introduce GeoDrag, a novel drag-based image editing method that integrates 3D geometric cues into pixel-level editing so that edits stay consistent with the underlying 3D structure. Benefiting from a unified displacement field and a conflict-free partitioning strategy, GeoDrag enables coherent, high-fidelity, and structure-consistent edits in a single forward pass.

Abstract: Interactive point-based image editing offers a controllable way to edit images, enabling precise and flexible manipulation of image content. However, previous methods operate predominantly on the 2D pixel plane and neglect the underlying 3D geometric structure. As a result, they often produce imprecise and inconsistent edits, particularly in geometry-intensive scenarios such as rotations and perspective transformations. To address these limitations, we propose GeoDrag, a novel geometry-guided drag-based image editing method that tackles three key challenges: 1) incorporating 3D geometric cues into pixel-level editing, 2) mitigating discontinuities caused by geometry-only guidance, and 3) resolving conflicts arising from multi-point dragging. Built upon a unified displacement field that jointly encodes 3D geometry and 2D spatial priors, GeoDrag enables coherent, high-fidelity, and structure-consistent editing in a single forward pass. In addition, a conflict-free partitioning strategy is introduced to isolate editing regions, effectively preventing interference and ensuring consistency. Extensive experiments across various editing scenarios validate the effectiveness of our method, showing superior precision, structural consistency, and reliable multi-point editability. Our code and models will be released publicly.

Method

Overall framework of GeoDrag. In the drag pipeline, the mask is decomposed into multiple sub-regions, each containing one pair of drag points. Within each sub-region, the geometry-aware and plane-aware displacement fields are computed independently and fused. The fused fields are then aggregated without conflict. Once the final field is obtained, one-step editing is performed via latent relocation and bilateral nearest neighbor interpolation (BNNI), with reference guidance to ensure semantic consistency.
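To make the partition-then-aggregate idea concrete, here is a minimal sketch under stated assumptions; it is not the GeoDrag implementation. It assumes each masked pixel is assigned to its nearest handle point, and it replaces the geometry-aware and plane-aware fields with a single hypothetical 2D falloff weight. The names `partition_mask` and `aggregate_field` are illustrative only.

```python
from math import hypot


def partition_mask(mask, handles):
    """Assign each masked pixel to the index of its nearest handle point.

    mask: 2D list of 0/1 values; handles: list of (row, col) points.
    Returns a 2D list holding a region index, or -1 outside the mask.
    Nearest-handle assignment is an assumption, not the paper's rule.
    """
    regions = []
    for r, row in enumerate(mask):
        out_row = []
        for c, inside in enumerate(row):
            if not inside:
                out_row.append(-1)
                continue
            dists = [hypot(r - hr, c - hc) for hr, hc in handles]
            # Ties go to the lowest handle index.
            out_row.append(dists.index(min(dists)))
        regions.append(out_row)
    return regions


def aggregate_field(regions, handles, targets):
    """Build one displacement vector per pixel from its own region's drag pair.

    Regions are disjoint, so each pixel receives exactly one vector and
    drags from different point pairs cannot overwrite each other.
    """
    field = []
    for r, row in enumerate(regions):
        out_row = []
        for c, k in enumerate(row):
            if k < 0:
                out_row.append((0.0, 0.0))  # pixels outside the mask stay put
                continue
            (hr, hc), (tr, tc) = handles[k], targets[k]
            # Hypothetical falloff: full drag at the handle, weaker farther away.
            w = 1.0 / (1.0 + hypot(r - hr, c - hc))
            out_row.append((w * (tr - hr), w * (tc - hc)))
        field.append(out_row)
    return field


mask = [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 0]]
handles = [(0, 0), (0, 3)]
targets = [(1, 0), (0, 1)]

regions = partition_mask(mask, handles)
field = aggregate_field(regions, handles, targets)
```

In this toy case the left half of the mask follows the first drag pair and the right half follows the second. A real pipeline would apply the resulting field to diffusion latents and fill vacated positions by interpolation, which this sketch omits.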

Visual Examples

Quantitative Results

Quantitative results on DragBench. Lower MD and DAI indicate higher editing precision, and higher IF reflects better perceptual fidelity. Time is the average editing time per point, and Mem is the peak GPU memory (GB).

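| Approach | MD ↓ | DAI1 | DAI10 | DAI20 | IF ↑ | Preparation | Time (s) | Mem (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DragDiffusion | 34.57 | 0.181 | 0.170 | 0.160 | 0.871 | ~1 min (LoRA) | 22.46 | 18.63 |
| FreeDrag | 30.80 | 0.183 | 0.166 | 0.151 | 0.845 | ~1 min (LoRA) | 42.90 | 18.90 |
| CLIPDrag | 34.62 | 0.195 | 0.174 | 0.158 | 0.891 | ~1 min (LoRA) | 38.21 | 22.72 |
| DragNoise | 33.84 | 0.179 | 0.169 | 0.158 | 0.861 | ~1 min (LoRA) | 21.12 | 18.36 |
| FastDrag | 32.10 | 0.131 | 0.123 | 0.115 | 0.850 | | 3.23 | 5.85 |
| GeoDrag (Ours) | 29.24 | 0.128 | 0.120 | 0.111 | 0.847 | | 3.95 | 5.44 |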
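For readers unfamiliar with the MD column, the sketch below shows one way such a score can be computed. It assumes MD denotes the mean Euclidean distance, in pixels, between where the handle points end up after editing and their intended targets, as is common on drag-editing benchmarks; the page itself does not define it. The function name `mean_distance` and the example points are hypothetical.

```python
from math import hypot


def mean_distance(final_points, target_points):
    """Mean Euclidean distance between final handle positions and targets.

    Both arguments are equal-length lists of (x, y) pixel coordinates.
    How final positions are located in the edited image is out of scope here.
    """
    if not final_points or len(final_points) != len(target_points):
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(
        hypot(fx - tx, fy - ty)
        for (fx, fy), (tx, ty) in zip(final_points, target_points)
    )
    return total / len(final_points)


# Two drag points: one lands exactly on target, one misses by 5 pixels.
score = mean_distance([(0, 0), (3, 4)], [(0, 0), (0, 0)])
```

Under this reading, a lower score means the dragged content landed closer to where the user asked, which matches the caption's statement that lower MD indicates higher editing precision.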