The Iterative Refinement Trick That Stops Your Diffusion Model Hallucinating


A breakdown of a clever technique combining rejection sampling with iterative refinement to boost the quality and fidelity of generated images without needing to retrain the base model.


Introduction

Tired of Janky AI-Generated Images? There's a Fix.

Let's be real: sometimes your diffusion model outputs... well, they just suck. You spend ages on a prompt, the model whirs, and you get a beautiful landscape... with a person who has eight fingers, or a car that forgot how to physics. The culprit is often unrealistic, low-probability details getting locked in early in the sampling process.

This new paper drops a straightforward, deployable method that dramatically cuts down on these low-fidelity hallucinations. The core idea is simple, but powerful: don't just generate an image once and hope for the best; instead, generate many, reject the bad ones, and then iteratively refine the best one until it's perfect.

They call the combined technique something academic, but you can just think of it as Quality-Controlled Iterative Refinement. It's a two-stage process that leverages the model's own capability to distinguish good from bad, making the final output much more aligned with high-probability (and thus, high-quality) samples. Crucially, this is a sampling-time hack, meaning you can use it with your existing Stable Diffusion, Imagen, or similar model right now without any expensive retraining.


How It Works

The magic here is in the combination of two well-known concepts, woven into a two-step generative loop: Rejection Sampling and Iterative Refinement.

The Quality Filter (Rejection Sampling)

We know that a standard diffusion model often produces a small handful of truly great images among a sea of garbage. Rejection sampling is all about filtering this noise.

  1. Generate a Batch: Instead of generating one image, generate a small batch of, say, $N=4$ or $N=8$ independent images from the same prompt.
  2. Score the Batch: Use an external quality estimator. The paper uses a classifier-free guidance (CFG) based score computed with a separate, much smaller pre-trained model. This score essentially measures how well the final output image aligns with the text prompt, which is a surprisingly good proxy for perceptual quality.
  3. Reject and Select: The model selects the $K$ best images (e.g., $K=1$ or $K=2$) based on the quality score. The rest are thrown away.

The key insight is that this initial filtering step removes low-probability noise early on. If an image is inherently janky (eight fingers, blurred mess), its text-alignment score is probably low, and it gets rejected. We're starting the refinement process from a much higher-quality seed.
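
To make that concrete, here's a minimal sketch of the rejection stage using Hugging Face diffusers, with a CLIP text-image similarity score standing in for the paper's CFG-based quality estimator. The model names, batch size, and the `select_best` helper are my own illustrative assumptions, not the authors' code:

```python
# Sketch of the rejection-sampling stage: generate N candidates, keep the best K.
# CLIP text-image similarity stands in for the paper's CFG-based score (assumption).
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_best(prompt: str, n: int = 4, k: int = 1):
    """Generate n candidate images and return the k with the highest CLIP score."""
    images = pipe(prompt, num_images_per_prompt=n).images
    inputs = clip_proc(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image.squeeze(1)  # one score per candidate
    best = scores.topk(k).indices.tolist()
    return [images[i] for i in best]
```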

The Fidelity Boost (Iterative Refinement)

Now that you have a high-quality seed image, you want to push its fidelity even higher.

  1. Re-Noising: Take the selected best image and apply a small, controlled amount of noise. This is the critical step. We don't want to destroy the image, just slightly perturb it. It's like gently shaking a perfectly-poured cement mold to settle it and remove air pockets.
  2. Re-Sampling: Run the perturbed image through a few steps of the original diffusion model's reverse process (the denoising part). Because the starting point was already high quality, the model's small adjustments during this limited denoising will focus on fixing minor inconsistencies, sharpening details, and generally increasing the likelihood of the final pixel configuration.
  3. Looping: You can repeat this Re-Noising $\rightarrow$ Re-Sampling loop a few times ($T$ iterations). Each pass nudges the image closer to the most probable, highest-fidelity version that your model knows how to make.

The final result is a model that starts with a good idea (via rejection) and polishes it to a brilliant finish (via iteration).
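
Here's a minimal sketch of that Re-Noising $\rightarrow$ Re-Sampling loop, continuing from the snippet above. An img2img pass with a low strength is one convenient way to "add a little noise, then denoise a little"; the strength value, iteration count, and `refine` helper are illustrative assumptions rather than the paper's exact recipe:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

# Reuse the same base weights for the refinement passes (assumed model name).
refiner = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def refine(image, prompt: str, iterations: int = 3, strength: float = 0.2):
    """Repeat re-noising + re-sampling: a low strength perturbs the image only
    slightly, then a short reverse pass cleans up low-probability details."""
    for _ in range(iterations):
        image = refiner(prompt=prompt, image=image, strength=strength).images[0]
    return image
```

Lower strength means a gentler shake: more of the original image survives each pass, and only the fine details get re-decided.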


Comparison

The results are solid, showing improvements across several metrics. Here's the takeaway:

| Metric | Improvement | What it means for you |
| --- | --- | --- |
| FID (Fréchet Inception Distance) | ⬇️ 7.9% | Images look more realistic and less "AI-generated." |
| CLIP Score | ⬆️ 3.4% | Images are more semantically aligned with the text prompt. |
| Human Preference Rate | ⬆️ 65% | People consistently prefer these images over standard generation. |

The most relevant result is the Human Preference Rate. On average, people chose the images generated with this technique 65% of the time. When you're shipping a product, that's all that matters. It means fewer re-generations for the user and higher perceived quality.

Compared to simply running the original model for more sampling steps, or cranking up the CFG scale, this method offers a more efficient quality gain because it strategically filters before it refines, minimizing wasted computation on bad seeds.


Practical Prompts

You can essentially wrap any existing diffusion call with this logic.

Fixing Compositional Errors

A hyper-realistic photograph of a wolf howling at a crescent moon 
in a snowy forest at night, deep focus, f/1.4, cinematic lighting

Standard Output (1-Pass): A wolf, but one paw is deformed, and the moon is oddly distorted into an oval.

Expected Output (Iterative Refinement): A perfectly formed wolf and a crisp crescent moon. The iterative steps fix the minor, low-probability deformations in the limbs and the celestial object.


Boosting Detail Fidelity

Steampunk robot serving a cup of tea, intricate brass and copper plating, 
leather apron, detailed oil painting style by Zdzisław Beksiński

Standard Output (1-Pass): A robot with a generally correct aesthetic, but the brass piping is inconsistent, and the hands are muddy and indistinct.

Expected Output (Iterative Refinement): The robot's brass plating is sharp and consistent, the gear mechanisms are clearly defined, and the hands holding the cup have coherent, detailed joints. The refinement loop cleans up the "muddy" parts of the oil painting texture.


Insights and Practical Takeaways

1. The Quality Estimator Doesn't Need to Be Perfect

The paper's key insight is that even a simple, off-the-shelf quality score (like a CFG-based score from a smaller model) is good enough to filter out the outright garbage. You don't need to train a massive, perfect quality classifier.

2. Low-Cost Iteration

The refinement step only involves a few extra forward passes. If your standard generation is 50 steps, the refinement might be 3 loops of (1-step noise + 5-step denoise). The computational overhead is worth the quality improvement, especially since you only refine one image instead of generating a huge batch of high-step images.

3. Deployable Now

This is a post-training technique. It requires no fine-tuning of your base model weights. You can implement the rejection and refinement logic as a wrapper around your existing sampling pipeline. This is an operator-level improvement.
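
As a rough sketch of what that wrapper might look like, assuming the `select_best` and `refine` helpers from the earlier snippets:

```python
def quality_controlled_generate(prompt: str):
    """Wrapper: reject first, then iteratively refine the surviving candidate."""
    seed_image = select_best(prompt, n=4, k=1)[0]    # rejection-sampling stage
    return refine(seed_image, prompt, iterations=3)  # iterative-refinement stage

final = quality_controlled_generate(
    "a hyper-realistic photograph of a wolf howling at a crescent moon"
)
final.save("wolf.png")
```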


Conclusion

The pursuit of better AI image generation usually defaults to "train a bigger model" or "use a longer prompt." This work provides an elegant, computational alternative: be smarter about the sampling process itself.

By combining initial Rejection Sampling to throw out the garbage early with Iterative Refinement to polish the chosen few, you leverage your existing model's own denoising ability to fix its low-probability mistakes. If you're a developer running a commercial image-generation service, implementing this technique is a fast, cost-effective way to immediately boost the perceptual quality and fidelity of your outputs, leading directly to happier users and fewer "re-roll" clicks. Go try it; your robot hands will look much, much better.

