Help... looking for Smart Object Swap model with reference

Hi,

Can anyone point me to a model that can do a Smart Object Swap using a reference object? The main source is not a prompt but images: a base image in which, for example, a door needs to be replaced with another door that is provided as a reference object. If that is too difficult, a third image with the masked area is also an option. I have tried numerous models, but so far without success.

Thanks in advance!

Jos


What about image-to-image models?


You are looking for a model that can do reference-driven object replacement, not just ordinary prompt-based editing.

That distinction is the whole story.

Most image editors are good at:

  • “make this object look nicer,”
  • “change the style,”
  • “replace the object with a plausible new object.”

But you want something stronger:

Take object B from a reference image, and install it into image A, ideally in a chosen region, while preserving the rest of image A.

That is a much narrower and harder task. It sits between inpainting, image editing, reference-guided generation, and object-level compositing. That is why so many otherwise impressive models have not worked for you. (openaccess.thecvf.com)

The big picture

For your use case, the useful models fall into two families:

Family A — direct-fit models

These are the ones that are conceptually closest to your task: AnyDoor, the FLUX Kontext masked-inpaint path, and MimicBrush (each covered in detail below).

They work best when you can provide:

  • a base image,
  • a reference object image,
  • and ideally a mask or target region.

Family B — latest general HF editors that are now strong enough to try

These are newer, broader, more modern image editors that can often do the job if you set them up well:

  • black-forest-labs/FLUX.2-klein-4B
  • Qwen/Qwen-Image-Edit-2511
  • FireRedTeam/FireRed-Image-Edit-1.1
  • meituan-longcat/LongCat-Image-Edit
  • black-forest-labs/FLUX.2-dev if you can handle a large model
    (huggingface.co)

My real opinion is:

  • If you care about task fit, think like Family A.
  • If you care about Hugging Face practicality in 2026, start testing Family B.

That is the cleanest summary I can give.


My honest recommendation, in plain language

If I were doing this myself, I would not hunt for a single “perfect” model first.

I would use this strategy:

Best practical first test

black-forest-labs/FLUX.2-klein-4B
because it is:

  • current,
  • multi-reference capable,
  • small enough to be practical,
  • and Apache-2.0. (huggingface.co)

Best mature open baseline

Qwen/Qwen-Image-Edit-2511
because the Diffusers docs explicitly support multi-image reference workflows, and Qwen has become one of the safest open editing baselines on Hugging Face. (huggingface.co)

Best for identity consistency

FireRedTeam/FireRed-Image-Edit-1.1
because its model card explicitly emphasizes identity consistency and multi-image conditioning, which are exactly the two things that usually break in reference-based object replacement. (huggingface.co)

Best for localized reference-guided editing

meituan-longcat/LongCat-Image-Edit
because the card explicitly says it supports local editing and reference-guided editing. That makes it unusually relevant to what you are trying to do. (huggingface.co)

Best conceptual research fit

AnyDoor
because the paper is still one of the closest direct matches to “swap object in scene using object reference.” (openaccess.thecvf.com)

Best shape-preserving reference imitation

MimicBrush
because it is built around source image + selected region + reference image, and is especially good when the edit is more like “make this region become like the reference” than “fully reconstruct a new rigid object from scratch.” (xavierchen34.github.io)


The single most important thing: use a mask if you can

You said a third image with the masked area is an option.

That is not a fallback. That is a major advantage.

In practice, the masked version of your problem is much easier than the unmasked version, because the model no longer has to guess:

  • what to replace,
  • where to replace it,
  • and how much of the scene should stay untouched. (huggingface.co)

So I would strongly reframe the situation like this:

  • Two images only = hard mode
  • Base image + reference object + target mask = realistic mode

That is one of the biggest reasons people fail with this class of task. They are asking the model to solve localization, object transfer, and blending all at once. (openaccess.thecvf.com)
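If you do not already have a mask, "realistic mode" is cheap to set up. Here is a minimal Pillow sketch, assuming you can identify the target region as a pixel box; the image size and box coordinates below are placeholders, not values from any real workflow:

```python
# Build a binary mask image: white = region to replace, black = keep.
from PIL import Image, ImageDraw

def make_rect_mask(size, box):
    """Return an L-mode mask with a white rectangle over the target."""
    mask = Image.new("L", size, 0)                 # all black = keep everything
    ImageDraw.Draw(mask).rectangle(box, fill=255)  # white = replace here
    return mask

# placeholder numbers: a 1024x768 base image, door roughly at this box
mask = make_rect_mask((1024, 768), (300, 120, 520, 600))
```

Most Diffusers inpainting-style pipelines accept a mask in exactly this white-on-black form, though check the specific pipeline's docs before relying on that.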


Why general editors often fail here

There are four common failure modes.

1) The model preserves the category, not the instance

It gives you “a similar object,” not the actual reference object.
This is the classic identity-drift problem. It is exactly why I think FireRed is worth testing early, and why AnyDoor still matters conceptually. (huggingface.co)

2) The model edits too much

It changes the surrounding image when you only wanted one region edited.
That is why masked editing and locality-aware models matter so much. (huggingface.co)

3) The model gets the geometry wrong

The object looks plausible by itself, but does not fit the target scene:

  • wrong perspective,
  • wrong scale,
  • wrong orientation,
  • wrong relation to nearby objects.

That is where more spatially aware editors, or models that accept a mask/target region, help a lot. (huggingface.co)

4) The reference image carries too much clutter

If the reference contains extra background, lighting, or surrounding objects, the model may import the wrong things.
The AnyDoor paper explicitly reports that filtering background information from the reference object helps. (openaccess.thecvf.com)


Detailed model-by-model thoughts

1) AnyDoor

What it is

A research-oriented object-level image customization method designed for tasks like object insertion and object swapping. (openaccess.thecvf.com)

Why it matches your request so well

Because its whole framing is basically:

  • take a base image,
  • take a reference object,
  • place/swap that object into the base image. (openaccess.thecvf.com)

Why I do not recommend it as the easiest first option

Because the current Hugging Face Space shows a runtime error, so the practical HF experience is not as clean as with the newer families. (huggingface.co)

My real conclusion

Excellent conceptual fit.
Not the cleanest beginner-first Hugging Face experience today.


2) FLUX.1 Kontext Inpaint

What it is

A Diffusers pipeline that explicitly supports:

  • editing within a fixed mask region
  • with image-reference conditioning. (huggingface.co)

Why it matters so much

Because many models say “editing,” but the docs do not clearly spell out the exact local workflow. Kontext does. That makes it one of the most concrete “yes, this really matches your problem” options in the HF ecosystem. (huggingface.co)

My real conclusion

If you can provide a mask, this is one of the strongest practical paths, even though newer general models now exist.


3) FLUX.2-klein-4B

What it is

A newer FLUX.2 model with:

  • multi-reference editing
  • consumer-GPU friendliness
  • Apache-2.0 licensing
  • release date April 6, 2026. (huggingface.co)

Why I like it for your case

It hits a rare sweet spot:

  • current,
  • practical,
  • open enough,
  • and explicitly reference-aware. (huggingface.co)

Weak point

It is still a broader general editor family, not a pure object-swap paper architecture. So it may still need strong masking and good setup to shine.

My real conclusion

If you ask me for one Hugging Face model to try first today, this is near the top of the list.


4) Qwen-Image-Edit-2511

What it is

A strong open editing model with documented support for multi-image reference workflows in Diffusers. (huggingface.co)

Why I like it

Because it is one of the cleanest current “modern open image editor” stacks:

  • active,
  • documented,
  • relatively standard to use,
  • and not locked into obscure tooling. (huggingface.co)

Weak point

It is broader than your exact task. It is not a pure object-swap specialist the way AnyDoor is.

My real conclusion

This is the model I would use as a strong open baseline. If even this struggles with your case, that is useful information.


5) FireRed-Image-Edit-1.1

What it is

A general-purpose image editing model whose card explicitly highlights:

  • identity consistency
  • multi-image conditioning
  • real-world editing performance. (huggingface.co)

Why I like it for your problem

Because the biggest pain in reference-object swap is often:

“The edit happened, but the model did not really preserve the referenced object.”

That is the exact axis where FireRed is trying to improve. (huggingface.co)

My real conclusion

If your current attempts produce generic-looking replacements, test FireRed early.


6) LongCat-Image-Edit

What it is

A model card that explicitly says:

  • global editing,
  • local editing,
  • text modification,
  • reference-guided editing. (huggingface.co)

Why that matters

That wording is unusually aligned with your problem. It suggests a model that was designed with more structured edit control in mind, not just flashy broad edits.

My real conclusion

This is a strong candidate if your problem is mostly:

  • “edit only this region”
  • “follow the reference carefully”
  • “do not wreck the rest of the image”


7) MimicBrush

What it is

A project focused on local reference imitation: source image + selected region + reference image.

Why it matters

Because it directly respects the way you think about the task: not “prompt first,” but “image first.”
It is especially appealing when the edit is about making the target region look like the reference, while preserving more of the original shape. (xavierchen34.github.io)

Weak point

The current HF Space shows a configuration error. (huggingface.co)

My real conclusion

Useful and relevant, but more research/project-flavored than a modern turnkey HF-native option.


My recommended testing order

Here is the order I would actually use.

First wave

These give the best balance of practicality and relevance:

  1. FLUX.2-klein-4B
  2. Qwen-Image-Edit-2511
  3. FireRed-Image-Edit-1.1
  4. LongCat-Image-Edit
    (huggingface.co)

This wave tells you whether the problem is already solvable with current modern HF-native editors.

Second wave

If the first wave is close but not quite right:

  1. FLUX Kontext masked reference path
  2. AnyDoor
  3. MimicBrush
    (huggingface.co)

This wave tells you whether the missing ingredient is more explicit locality/object-level structure, not more raw model power.

Third wave

Only if you have heavy compute and want to test the ceiling:

  1. FLUX.2-dev
    (huggingface.co)

Practical setup advice

These details matter a lot.

1) Prepare the reference object well

Use a tight crop or segmented object if possible.
Do not feed a messy reference image if you can avoid it. (openaccess.thecvf.com)
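As a concrete version of that advice, here is a minimal sketch, assuming you have segmented the object so everything else in the reference is transparent (the toy canvas below is purely illustrative):

```python
# Tighten a segmented RGBA reference to the object's bounding box.
from PIL import Image

def tight_crop(ref_rgba):
    """Crop an RGBA reference image to its non-transparent region."""
    bbox = ref_rgba.getchannel("A").getbbox()   # bbox of alpha > 0
    return ref_rgba.crop(bbox) if bbox else ref_rgba

# toy example: transparent 200x200 canvas with one opaque 50x80 "object"
canvas = Image.new("RGBA", (200, 200), (0, 0, 0, 0))
canvas.paste(Image.new("RGBA", (50, 80), (120, 80, 40, 255)), (60, 30))
cropped = tight_crop(canvas)   # 50x80 crop containing just the object
```

If you have no segmentation, even a manual rectangular crop around the object is better than sending the whole reference photo.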

2) Prepare a real mask

If possible, the mask should cover the exact region you want replaced, not a huge loose box.
Precise locality is a big part of success. (huggingface.co)
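One detail worth automating: hand-drawn masks often have soft gray edges and may not match the base image's size, and either can push edits outside the intended region. A small normalization sketch (the 128 threshold is an arbitrary choice):

```python
# Normalize a mask: single channel, binary values, same size as the base.
from PIL import Image

def normalize_mask(mask, base_size, threshold=128):
    """Force a mask to L-mode, base-image size, and strict 0/255 values."""
    mask = mask.convert("L").resize(base_size, Image.NEAREST)
    return mask.point(lambda v: 255 if v >= threshold else 0)

# toy mask: gray-ish region that should become pure white
raw = Image.new("L", (100, 100), 0)
raw.paste(140, (20, 20, 60, 60))
clean = normalize_mask(raw, (200, 200))
```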

3) Crop around the edit region

If the region is small relative to the full image, cropping around it helps the model focus. The Diffusers docs explicitly mention this for local inpainting-style workflows. (huggingface.co)
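A sketch of that cropping step, assuming a white-on-black mask; the 25% padding fraction is a guess you should tune per image:

```python
# Crop the work area: the mask's bounding box plus some context,
# clamped to the image borders. Apply the same box to base and mask.
from PIL import Image

def crop_box_around(bbox, image_size, pad_frac=0.25):
    """Expand an (l, t, r, b) box by pad_frac of its size, clamped."""
    l, t, r, b = bbox
    pw, ph = int((r - l) * pad_frac), int((b - t) * pad_frac)
    W, H = image_size
    return (max(0, l - pw), max(0, t - ph), min(W, r + pw), min(H, b + ph))

mask = Image.new("L", (1024, 768), 0)
mask.paste(255, (400, 300, 500, 450))              # toy mask region
box = crop_box_around(mask.getbbox(), mask.size)
work_mask = mask.crop(box)                         # crop the base image the same way
```

Edit the cropped pair, then paste the result back into the full image at the same box.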

4) Keep the instruction short and concrete

Even in image-led editing, a small prompt helps:

  • “replace the masked object with the reference object”
  • “keep the rest of the image unchanged”
  • “preserve lighting, scale, and perspective”

Short, concrete, and visual is usually better than long creative prose.
The FLUX.2 editing app prompt rules are actually quite aligned with that style. (huggingface.co)


Final verdict

If you want the answer reduced to one clear recommendation:

Best overall practical strategy

Use a mask and test modern HF-native editors first.

Best first HF model to try

black-forest-labs/FLUX.2-klein-4B
because it is recent, open enough, multi-reference capable, and practical. (huggingface.co)

Best open baseline

Qwen/Qwen-Image-Edit-2511. (huggingface.co)

Best identity-focused alternative

FireRedTeam/FireRed-Image-Edit-1.1. (huggingface.co)

Best locality-focused alternative

meituan-longcat/LongCat-Image-Edit. (huggingface.co)

Best conceptual specialist

AnyDoor. (openaccess.thecvf.com)


Here is the compact, concrete start-here recipe I would use for your task.

1) Prepare exactly these 3 inputs

Input A — base image
The image you want to edit.

Input B — reference object image
A tight crop of the object you want to insert/transfer. Remove as much background as possible. The AnyDoor paper reports better results when background information around the reference object is filtered out. (openaccess.thecvf.com)

Input C — mask image
A mask of the region to replace. If you can provide this, do it. It makes the task much easier and much more controllable. Diffusers’ FLUX Kontext docs explicitly support image-reference conditioning inside a fixed mask region. (huggingface.co)

2) Crop the work area before editing

Do not feed the entire full-resolution image by default.

If the target region is small, crop around that region plus a little context. The Diffusers docs explicitly note that when the masked region is small compared with the whole image, cropping around it can improve results. (huggingface.co)

3) Try these models in this exact order

First try

FLUX Kontext masked reference workflow
Reason: this is the clearest officially documented Hugging Face path for masked local editing + image-reference conditioning. (huggingface.co)

Second try

black-forest-labs/FLUX.2-klein-4B
Reason: it is current, practical, Apache-2.0, and its card explicitly says it supports image-to-image multi-reference editing. (huggingface.co)

Third try

Qwen/Qwen-Image-Edit-2511
Reason: it is a strong open baseline, and Diffusers explicitly documents multi-image reference workflows for the Qwen image editing family. (huggingface.co)

Backup if those are close but not good enough

meituan-longcat/LongCat-Image-Edit
Reason: the card explicitly says it supports local editing and reference-guided editing. (huggingface.co)

4) Use a short prompt, not a long one

Use something like this:

Replace the masked object with the reference object. Keep the rest of the image unchanged. Match scale, lighting, and perspective.

Do not write a long creative paragraph. Keep it direct and visual.

5) Run the same 3 tests for every model

For each model, do these 3 runs:

Run A — clean reference + clean mask

This is your baseline.

Run B — same inputs, slightly larger mask

This checks whether the model needed a bit more freedom around edges.
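Run B does not need a redrawn mask; dilating the Run A mask is enough. A sketch using Pillow's MaxFilter, where the 8-pixel growth is an arbitrary starting point:

```python
# Grow a binary mask outward so the model gets freedom around edges.
from PIL import Image, ImageFilter

def dilate_mask(mask, px=8):
    """Dilate white regions of an L-mode mask by roughly `px` pixels."""
    return mask.filter(ImageFilter.MaxFilter(2 * px + 1))  # filter size must be odd

mask = Image.new("L", (200, 200), 0)
mask.paste(255, (80, 80, 120, 120))
bigger = dilate_mask(mask)   # white square grows by ~8 px on each side
```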

Run C — same inputs, tighter crop around the target region

This checks whether the full image was distracting the model.

6) Diagnose failure like this

If the model changes too much of the image

Problem: weak locality.
Action: use a better mask, tighter crop, or switch toward Kontext / LongCat. (huggingface.co)

If the model inserts the wrong-looking object

Problem: reference identity drift.
Action: clean the reference crop more aggressively; then try Qwen or FLUX.2-klein-4B again. The AnyDoor paper and newer model cards both point to reference handling as a key issue. (openaccess.thecvf.com)

If the object looks right but fits badly in the scene

Problem: geometry / perspective / scene integration.
Action: enlarge the crop around the target area a bit and make the prompt explicitly say match scale, lighting, and perspective.

If everything is almost right except the edges

Problem: blending.
Action: rerun with a slightly larger mask, then do a second cleanup pass.
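One cheap cleanup pass, assuming you kept the original image and the mask: recomposite so that only the masked region, plus a feathered border, comes from the edited output. A sketch:

```python
# Blend the edited result back into the original with soft mask edges.
from PIL import Image, ImageFilter

def feathered_composite(original, edited, mask, feather=6):
    """Take `edited` inside the mask, `original` outside, soft transition."""
    soft = mask.filter(ImageFilter.GaussianBlur(feather))
    return Image.composite(edited, original, soft)   # white mask -> edited pixels

# toy images just to show the call shape
orig = Image.new("RGB", (200, 200), (10, 10, 10))
edit = Image.new("RGB", (200, 200), (200, 50, 50))
m = Image.new("L", (200, 200), 0)
m.paste(255, (80, 80, 120, 120))
out = feathered_composite(orig, edit, m)
```

As a side effect, this keeps pixels far from the mask identical to the original, which also helps with failure mode 2 (the model editing too much).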

7) The shortest realistic recommendation

If you want the fastest sensible path:

  1. Prepare: base image + tight reference crop + mask.
  2. Test first: FLUX Kontext masked reference path.
  3. Test second: FLUX.2-klein-4B.
  4. Test third: Qwen-Image-Edit-2511.
  5. If locality is still weak: try LongCat-Image-Edit.