What about image-to-image models?
You are looking for a model that can do reference-driven object replacement, not just ordinary prompt-based editing.
That distinction is the whole story.
Most image editors are good at:
- “make this object look nicer,”
- “change the style,”
- “replace the object with a plausible new object.”
But you want something stronger:
Take object B from a reference image, and install it into image A, ideally in a chosen region, while preserving the rest of image A.
That is a much narrower and harder task. It sits between inpainting, image editing, reference-guided generation, and object-level compositing. That is why so many otherwise impressive models have not worked for you. (openaccess.thecvf.com)
The big picture
For your use case, the useful models fall into two families:
Family A — direct-fit models
These are the ones conceptually closest to your task: AnyDoor, MimicBrush, and the FLUX Kontext masked-reference path.
They are the best fit when you can provide:
- a base image,
- a reference object image,
- and ideally a mask or target region.
Family B — latest general HF editors that are now strong enough to try
These are newer, broader general-purpose image editors that can often do the job if you set them up well:
black-forest-labs/FLUX.2-klein-4B
Qwen/Qwen-Image-Edit-2511
FireRedTeam/FireRed-Image-Edit-1.1
meituan-longcat/LongCat-Image-Edit
black-forest-labs/FLUX.2-dev if you can handle a large model
(huggingface.co)
My real opinion is:
- If you care about task fit, think like Family A.
- If you care about Hugging Face practicality in 2026, start testing Family B.
That is the cleanest summary I can give.
My honest recommendation, in plain language
If I were doing this myself, I would not hunt for a single “perfect” model first.
I would use this strategy:
Best practical first test
black-forest-labs/FLUX.2-klein-4B
because it is:
- current,
- multi-reference capable,
- small enough to be practical,
- and Apache-2.0 licensed. (huggingface.co)
Best mature open baseline
Qwen/Qwen-Image-Edit-2511
because the Diffusers docs explicitly support multi-image reference workflows, and Qwen has become one of the safest open editing baselines on Hugging Face. (huggingface.co)
Best for identity consistency
FireRedTeam/FireRed-Image-Edit-1.1
because its model card explicitly emphasizes identity consistency and multi-image conditioning, which are exactly the two things that usually break in reference-based object replacement. (huggingface.co)
Best for localized reference-guided editing
meituan-longcat/LongCat-Image-Edit
because the card explicitly says it supports local editing and reference-guided editing. That makes it unusually relevant to what you are trying to do. (huggingface.co)
Best conceptual research fit
AnyDoor
because the paper is still one of the closest direct matches to “swap object in scene using object reference.” (openaccess.thecvf.com)
Best shape-preserving reference imitation
MimicBrush
because it is built around source image + selected region + reference image, and is especially good when the edit is more like “make this region become like the reference” than “fully reconstruct a new rigid object from scratch.” (xavierchen34.github.io)
The single most important thing: use a mask if you can
You said a third image with the masked area is an option.
That is not a fallback. That is a major advantage.
In practice, the masked version of your problem is much easier than the unmasked version, because the model no longer has to guess:
- what to replace,
- where to replace it,
- and how much of the scene should stay untouched. (huggingface.co)
So I would strongly reframe the situation like this:
- Two images only = hard mode
- Base image + reference object + target mask = realistic mode
That is one of the biggest reasons people fail with this class of task. They are asking the model to solve localization, object transfer, and blending all at once. (openaccess.thecvf.com)
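To make "realistic mode" concrete, here is a minimal sketch of building the third input, the target mask, as a simple rectangle over the region to replace. This assumes NumPy and hypothetical box coordinates; in practice you would get a tighter mask from a segmentation tool.

```python
import numpy as np

def box_mask(height, width, top, left, bottom, right):
    """Binary mask image: 255 inside the target box, 0 everywhere else."""
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[top:bottom, left:right] = 255
    return mask

# Hypothetical 512x512 base image with the object roughly
# at rows 200-330 and columns 180-300.
mask = box_mask(512, 512, 200, 180, 330, 300)
```

Save this array as a grayscale PNG and it becomes the "Input C" mask the recipe below relies on.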
Why general editors often fail here
There are four common failure modes.
1) The model preserves the category, not the instance
It gives you “a similar object,” not the actual reference object.
This is the classic identity-drift problem. It is exactly why I think FireRed is worth testing early, and why AnyDoor still matters conceptually. (huggingface.co)
2) The model edits too much
It changes the surrounding image when you only wanted one region edited.
That is why masked editing and locality-aware models matter so much. (huggingface.co)
3) The model gets the geometry wrong
The object looks plausible by itself, but does not fit the target scene:
- wrong perspective,
- wrong scale,
- wrong orientation,
- wrong relation to nearby objects.
That is where more spatially aware editors, or models that accept a mask/target region, help a lot. (huggingface.co)
4) The reference image carries too much clutter
If the reference contains extra background, lighting, or surrounding objects, the model may import the wrong things.
The AnyDoor paper explicitly reports that filtering background information from the reference object helps. (openaccess.thecvf.com)
Detailed model-by-model thoughts
1) AnyDoor
What it is
A research-oriented object-level image customization method designed for tasks like object insertion and object swapping. (openaccess.thecvf.com)
Why it matches your request so well
Because its whole framing is basically:
- take a base image,
- take a reference object,
- place/swap that object into the base image. (openaccess.thecvf.com)
Why I do not recommend it as the easiest first option
Because the current Hugging Face Space shows a runtime error, so the practical HF experience is not as clean as with the newer families. (huggingface.co)
My real conclusion
Excellent conceptual fit.
Not the cleanest beginner-first Hugging Face experience today.
2) FLUX.1 Kontext Inpaint
What it is
A Diffusers pipeline that explicitly supports:
- editing within a fixed mask region
- with image-reference conditioning. (huggingface.co)
Why it matters so much
Because many models say “editing,” but the docs do not clearly spell out the exact local workflow. Kontext does. That makes it one of the most concrete “yes, this really matches your problem” options in the HF ecosystem. (huggingface.co)
My real conclusion
If you can provide a mask, this is one of the strongest practical paths, even though newer general models now exist.
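A hedged sketch of what that path looks like in code. The pipeline class and argument names follow my reading of the Diffusers FLUX Kontext inpaint documentation and should be verified against the current docs; the file paths are hypothetical, and the heavy part is guarded so the helper can be used on its own. Diffusion pipelines generally want dimensions divisible by the VAE/patch factor, which is what the small helper handles.

```python
def fit_to_multiple(width, height, multiple=16):
    """Round dimensions down to the nearest multiple, since diffusion
    pipelines generally expect sizes divisible by the VAE/patch factor."""
    return (width // multiple) * multiple, (height // multiple) * multiple

if __name__ == "__main__":
    # Requires a GPU, `pip install diffusers torch pillow`, and a model
    # download. Class and keyword names are assumptions based on the
    # Diffusers FLUX Kontext inpaint docs -- check them before running.
    import torch
    from diffusers import FluxKontextInpaintPipeline
    from PIL import Image

    pipe = FluxKontextInpaintPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
    ).to("cuda")

    base = Image.open("base.png")        # hypothetical input paths
    mask = Image.open("mask.png")
    ref = Image.open("reference.png")

    w, h = fit_to_multiple(*base.size)
    result = pipe(
        prompt="Replace the masked object with the reference object. "
               "Keep the rest of the image unchanged.",
        image=base.resize((w, h)),
        mask_image=mask.resize((w, h)),
        image_reference=ref,
        strength=1.0,
    ).images[0]
    result.save("edited.png")
```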
3) FLUX.2-klein-4B
What it is
A newer FLUX.2 model with:
- multi-reference editing
- consumer-GPU friendliness
- Apache-2.0 licensing
- a release date of April 6, 2026. (huggingface.co)
Why I like it for your case
It hits a rare sweet spot:
- current,
- practical,
- open enough,
- and explicitly reference-aware. (huggingface.co)
Weak point
It is still a broader general editor family, not a pure object-swap paper architecture. So it may still need strong masking and good setup to shine.
My real conclusion
If you ask me for one Hugging Face model to try first today, this is near the top of the list.
4) Qwen-Image-Edit-2511
What it is
A strong open editing model with documented support for multi-image reference workflows in Diffusers. (huggingface.co)
Why I like it
Because it is one of the cleanest current “modern open image editor” stacks:
- active,
- documented,
- relatively standard to use,
- and not locked into obscure tooling. (huggingface.co)
Weak point
It is broader than your exact task. It is not a pure object-swap specialist the way AnyDoor is.
My real conclusion
This is the model I would use as a strong open baseline. If even this struggles with your case, that is useful information.
5) FireRed-Image-Edit-1.1
What it is
A general-purpose image editing model whose card explicitly highlights:
- identity consistency
- multi-image conditioning
- real-world editing performance. (huggingface.co)
Why I like it for your problem
Because the biggest pain in reference-object swap is often:
“The edit happened, but the model did not really preserve the referenced object.”
That is the exact axis where FireRed is trying to improve. (huggingface.co)
My real conclusion
If your current attempts produce generic-looking replacements, test FireRed early.
6) LongCat-Image-Edit
What it is
An image editing model whose card explicitly lists:
- global editing,
- local editing,
- text modification,
- reference-guided editing. (huggingface.co)
Why that matters
That wording is unusually aligned with your problem. It suggests a model that was designed with more structured edit control in mind, not just flashy broad edits.
My real conclusion
This is a strong candidate if your problem is mostly:
- “edit only this region”
- “follow the reference carefully”
- “do not wreck the rest of the image”
7) MimicBrush
What it is
A project focused on local reference imitation: source image + selected region + reference image, where the selected region is repainted to imitate the reference.
Why it matters
Because it directly respects the way you think about the task: not “prompt first,” but “image first.”
It is especially appealing when the edit is about making the target region look like the reference, while preserving more of the original shape. (xavierchen34.github.io)
Weak point
The current HF Space shows a configuration error. (huggingface.co)
My real conclusion
Useful and relevant, but more research-project-flavored than a turnkey Hugging Face experience today.
My recommended testing order
Here is the order I would actually use.
First wave
These give the best balance of practicality and relevance:
- FLUX.2-klein-4B
- Qwen-Image-Edit-2511
- FireRed-Image-Edit-1.1
- LongCat-Image-Edit
(huggingface.co)
This wave tells you whether the problem is already solvable with current modern HF-native editors.
Second wave
If the first wave is close but not quite right:
- FLUX Kontext masked reference path
- AnyDoor
- MimicBrush
(huggingface.co)
This wave tells you whether the missing ingredient is more explicit locality/object-level structure, not more raw model power.
Third wave
Only if you have heavy compute and want to test the ceiling:
- FLUX.2-dev
(huggingface.co)
Practical setup advice
These details matter a lot.
1) Prepare the reference object well
Use a tight crop or segmented object if possible.
Do not feed a messy reference image if you can avoid it. (openaccess.thecvf.com)
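The cleanup the AnyDoor paper describes can be approximated with a few lines of NumPy, assuming you already have a segmentation mask for the reference object (e.g. from a tool like SAM). The function name and the small padding value are my own choices.

```python
import numpy as np

def tight_reference_crop(ref_rgb, obj_mask, pad=4):
    """Crop the reference image to the object's bounding box (plus a small
    pad) and zero out background pixels, so only the object is passed on."""
    ys, xs = np.nonzero(obj_mask)
    top = max(ys.min() - pad, 0)
    bottom = min(ys.max() + 1 + pad, obj_mask.shape[0])
    left = max(xs.min() - pad, 0)
    right = min(xs.max() + 1 + pad, obj_mask.shape[1])
    crop = ref_rgb[top:bottom, left:right].copy()
    crop[~obj_mask[top:bottom, left:right]] = 0  # suppress background clutter
    return crop
```

Feeding the model this tight, background-free crop instead of the raw reference photo is exactly the "filter the background" advice from the paper.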
2) Prepare a real mask
If possible, the mask should cover the exact region you want replaced, not a huge loose box.
Precise locality is a big part of success. (huggingface.co)
3) Crop around the edit region
If the region is small relative to the full image, cropping around it helps the model focus. The Diffusers docs explicitly mention this for local inpainting-style workflows. (huggingface.co)
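One way to implement that cropping step: expand the mask's bounding box by a fraction of its own size and clamp to the image, then feed only that window to the editor. This is a small sketch with assumed names; the 50% context margin is a starting point, not a documented constant.

```python
def context_crop_box(mask_box, image_w, image_h, context=0.5):
    """Expand the mask's bounding box by `context` (a fraction of its size)
    on every side, clamped to the image bounds, so the model sees the edit
    region plus some surroundings instead of the whole image."""
    left, top, right, bottom = mask_box
    mx = int((right - left) * context)
    my = int((bottom - top) * context)
    return (max(left - mx, 0), max(top - my, 0),
            min(right + mx, image_w), min(bottom + my, image_h))
```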
4) Keep the instruction short and concrete
Even in image-led editing, a small prompt helps:
- “replace the masked object with the reference object”
- “keep the rest of the image unchanged”
- “preserve lighting, scale, and perspective”
Short, concrete, and visual is usually better than long creative prose.
The FLUX.2 editing app prompt rules are actually quite aligned with that style. (huggingface.co)
Final verdict
If you want the answer reduced to one clear recommendation:
Best overall practical strategy
Use a mask and test modern HF-native editors first.
Best first HF model to try
black-forest-labs/FLUX.2-klein-4B
because it is recent, open enough, multi-reference capable, and practical. (huggingface.co)
Best open baseline
Qwen/Qwen-Image-Edit-2511. (huggingface.co)
Best identity-focused alternative
FireRedTeam/FireRed-Image-Edit-1.1. (huggingface.co)
Best locality-focused alternative
meituan-longcat/LongCat-Image-Edit. (huggingface.co)
Best conceptual specialist
AnyDoor. (openaccess.thecvf.com)
Here is the compact, concrete start-here recipe I would use for your task.
1) Prepare exactly these 3 inputs
Input A — base image
The image you want to edit.
Input B — reference object image
A tight crop of the object you want to insert/transfer. Remove as much background as possible. The AnyDoor paper reports better results when background information around the reference object is filtered out. (openaccess.thecvf.com)
Input C — mask image
A mask of the region to replace. If you can provide this, do it. It makes the task much easier and much more controllable. Diffusers’ FLUX Kontext docs explicitly support image-reference conditioning inside a fixed mask region. (huggingface.co)
2) Crop the work area before editing
Do not feed the entire full-resolution image by default.
If the target region is small, crop around that region plus a little context. The Diffusers docs explicitly note that when the masked region is small compared with the whole image, cropping around it can improve results. (huggingface.co)
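If you edit a crop, you also need to paste the result back into the full-resolution original. A minimal sketch of that step, assuming NumPy arrays and that you resized the edited crop back to the original crop size first:

```python
import numpy as np

def paste_back(full_img, edited_crop, box):
    """Paste an edited crop back into the original full-resolution image.
    `box` is the (left, top, right, bottom) region that was cropped out."""
    left, top, right, bottom = box
    assert edited_crop.shape[:2] == (bottom - top, right - left), \
        "resize the edited crop back to the original crop size first"
    out = full_img.copy()
    out[top:bottom, left:right] = edited_crop
    return out
```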
3) Try these models in this exact order
First try
FLUX Kontext masked reference workflow
Reason: this is the clearest officially documented Hugging Face path for masked local editing + image-reference conditioning. (huggingface.co)
Second try
black-forest-labs/FLUX.2-klein-4B
Reason: it is current, practical, Apache-2.0, and its card explicitly says it supports image-to-image multi-reference editing. (huggingface.co)
Third try
Qwen/Qwen-Image-Edit-2511
Reason: it is a strong open baseline, and Diffusers explicitly documents multi-image reference workflows for the Qwen image editing family. (huggingface.co)
Backup if those are close but not good enough
meituan-longcat/LongCat-Image-Edit
Reason: the card explicitly says it supports local editing and reference-guided editing. (huggingface.co)
4) Use a short prompt, not a long one
Use something like this:
Replace the masked object with the reference object. Keep the rest of the image unchanged. Match scale, lighting, and perspective.
Do not write a long creative paragraph. Keep it direct and visual.
5) Run the same 3 tests for every model
For each model, do these 3 runs:
Run A — clean reference + clean mask
This is your baseline.
Run B — same inputs, slightly larger mask
This checks whether the model needed a bit more freedom around edges.
Run C — same inputs, tighter crop around the target region
This checks whether the full image was distracting the model.
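Run B needs a slightly larger mask. A dependency-free way to grow a binary mask is repeated 4-neighbour dilation in NumPy (the function name and default growth are my own choices; SciPy's `binary_dilation` does the same job if it is available):

```python
import numpy as np

def dilate_mask(mask, pixels=8):
    """Grow a binary mask by `pixels` in every direction using simple
    4-neighbour dilation -- used for the 'slightly larger mask' run."""
    grown = mask.astype(bool)
    for _ in range(pixels):
        padded = np.pad(grown, 1)
        grown = (padded[1:-1, 1:-1] | padded[:-2, 1:-1] | padded[2:, 1:-1]
                 | padded[1:-1, :-2] | padded[1:-1, 2:])
    return grown
```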
6) Diagnose failure like this
If the model changes too much of the image
Problem: weak locality.
Action: use a better mask, tighter crop, or switch toward Kontext / LongCat. (huggingface.co)
If the model inserts the wrong-looking object
Problem: reference identity drift.
Action: clean the reference crop more aggressively; then try Qwen or FLUX.2-klein-4B again. The AnyDoor paper and newer model cards both point to reference handling as a key issue. (openaccess.thecvf.com)
If the object looks right but fits badly in the scene
Problem: geometry / perspective / scene integration.
Action: enlarge the crop around the target area a bit and make the prompt explicitly say match scale, lighting, and perspective.
If everything is almost right except the edges
Problem: blending.
Action: rerun with a slightly larger mask, then do a second cleanup pass.
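For the cleanup pass, a feathered alpha-blend between the original and the edited image often hides hard seams. This sketch softens the mask edge with repeated 4-neighbour averaging (a stand-in for a proper Gaussian blur) and composites with it; names and the feather radius are assumptions.

```python
import numpy as np

def feathered_blend(original, edited, mask, feather=5):
    """Alpha-blend the edited image over the original with a softened mask
    edge, which hides hard seams around the pasted region."""
    alpha = mask.astype(np.float64)
    for _ in range(feather):  # crude blur: repeated local averaging
        padded = np.pad(alpha, 1, mode="edge")
        alpha = (padded[1:-1, 1:-1] + padded[:-2, 1:-1] + padded[2:, 1:-1]
                 + padded[1:-1, :-2] + padded[1:-1, 2:]) / 5.0
    alpha = alpha[..., None]  # broadcast over RGB channels
    return (alpha * edited + (1.0 - alpha) * original).astype(original.dtype)
```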
7) The shortest realistic recommendation
If you want the fastest sensible path:
- Prepare: base image + tight reference crop + mask.
- Test first: FLUX Kontext masked reference path.
- Test second: FLUX.2-klein-4B.
- Test third: Qwen-Image-Edit-2511.
- If locality is still weak: try LongCat-Image-Edit.