SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation

Shuang Liang1,4*, Jing He3, Chuanmeizhi Wang1, Lejun Liao2, Guo Zhang1, Yingcong Chen3,5, Yuan Yuan2
1Rama Alpaca Technology Company 2Boston College 3HKUST(GZ)
4The University of Hong Kong 5HKUST

*Work done during an internship at Rama Alpaca Technology
SDPose Teaser

SDPose: OOD-robust pose via diffusion priors. On stylized paintings, SDPose surpasses Sapiens and ViTPose++-H, matching SoTA on COCO and setting new records on HumanArt and COCO-OOD; yellow boxes show baseline failures.

Abstract

Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold and Lotus adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs (e.g., human pose estimation) remains underexplored. In this paper, we propose SDPose, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net’s image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct COCO-OOD, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Furthermore, we showcase SDPose as a zero-shot pose annotator for downstream controllable generation tasks, including ControlNet-based image synthesis and video generation, where it delivers qualitatively superior pose guidance.

🎥 Demo: Animation Video Pose Estimation in the Wild

Experiments

Qualitative Results on Real-world Images

Real-world Images Results

Quantitative Results on COCO Validation Set

| Model | Pre-trained Backbone | Parameters | Input size | Train dataset | Train epochs | AP | AR |
|---|---|---|---|---|---|---|---|
| Sapiens-1B | Sapiens ViT | 1.169B | 1024 × 768 | COCO | 210 | 82.1 | 85.9 |
| Sapiens-2B | Sapiens ViT | 2.163B | 1024 × 768 | COCO | 210 | 82.2 | 86.0 |
| GenLoc | Stable Diffusion-v1.5 | 0.95B | N/A | COCO | 14 | 77.6 | 80.7 |
| ViTPose++-H | ViTAE | 0.63B | 256 × 192 | Mixture* | 210 | 79.4 | N/A |
| SDPose (Ours) | Stable Diffusion-v1.5 | 0.95B | 1024 × 768 | COCO | 40 | 81.2 | 85.3 |
| SDPose (Ours) | Stable Diffusion-v2 | 0.95B | 1024 × 768 | COCO | 40 | 81.3 | 85.2 |

* Mixture includes COCO, AIC, MPII, AP10K, APT36K, and WholeBody datasets.

Qualitative Comparison for Whole-Body Estimation on Stylized Paintings

Comparison on Stylized Paintings: Sapiens WholeBody vs. SDPose WholeBody

Comparison on Stylized Paintings: Sapiens Whole-Body vs. SDPose Whole-Body. All erroneous predictions are highlighted with yellow boxes. SDPose yields fewer false positives and notably better facial keypoint localization.

Details of COCO-OOD

To complement the HumanArt dataset and enable OOD evaluation under matched content and labels, we construct COCO-OOD by applying artistic style transfer to the original COCO images. We adopt the official CycleGAN framework to perform unpaired image-to-image translation from the COCO domain (natural photographs) to the target domain of Monet-style paintings. We use the monet2photo model provided in the CycleGAN repository, which is trained in an unsupervised manner to learn the mapping between Monet paintings and real-world photos. During conversion, all training and validation images in COCO are processed to produce style-transferred counterparts, while preserving their original human annotations (bounding boxes, keypoints). This yields an OOD variant of COCO in which the underlying scene structure is unchanged, but the texture, color palette, and brushstroke patterns are consistent with Monet's artistic style. Importantly, for fair comparison and to avoid introducing priors from large-scale pretrained diffusion models, we intentionally adopt the earlier CycleGAN framework rather than more recent style transfer methods. Such stylization introduces a significant appearance shift while keeping pose-related geometric information intact, making it suitable for robust pose estimation evaluation.
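As a rough illustration of this conversion (not the exact script used; the generator loading, file paths, and output layout below are placeholders), the stylization amounts to running each COCO image through a pre-trained CycleGAN generator and reusing the original annotation files unchanged:

```python
# Hypothetical sketch: stylize COCO images with a pre-trained CycleGAN generator
# (photo -> Monet-style direction) while keeping the original annotations.
# How the generator checkpoint is loaded is left open; `generator` is assumed to
# be the fully convolutional ResNet generator exported from the official repo.
import os
import torch
from PIL import Image
import torchvision.transforms as T

@torch.no_grad()
def stylize_coco(img_dir, out_dir, generator, device="cuda"):
    os.makedirs(out_dir, exist_ok=True)
    # CycleGAN generators expect inputs normalized to [-1, 1].
    to_tensor = T.Compose([T.ToTensor(), T.Normalize([0.5] * 3, [0.5] * 3)])
    generator.eval().to(device)
    for name in sorted(os.listdir(img_dir)):
        img = Image.open(os.path.join(img_dir, name)).convert("RGB")
        x = to_tensor(img).unsqueeze(0).to(device)
        y = generator(x)                      # photo -> Monet-style painting
        y = (y.clamp(-1, 1) + 1) / 2          # back to [0, 1] for saving
        out = T.ToPILImage()(y.squeeze(0).cpu()).resize(img.size)
        out.save(os.path.join(out_dir, name))
    # Keypoint and bounding-box annotations are untouched: stylization preserves
    # geometry, so the original COCO json can simply be copied next to the new images.
```

Because the output is resized back to the input resolution, bounding boxes and keypoint coordinates remain valid without any re-annotation.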

COCO-OOD Dataset Visualization Grid

COCO-OOD Dataset Visualization. A 5×10 grid showing 50 randomly sampled images from the COCO-OOD dataset, demonstrating the diversity of Monet-style oil painting transformations applied to COCO validation images.

COCO-OOD visualizations with Monet-style oil painting

COCO-OOD visualizations with Monet-style oil painting. We use CycleGAN to stylize COCO validation images into a Monet-like oil painting domain, creating an OOD split to evaluate pose estimation robustness under appearance shift.

Quantitative Comparison on OOD Benchmarks HumanArt and COCO-OOD

| Model | Pre-trained Backbone | Parameters | Train epochs | HumanArt AP | HumanArt AR | COCO-OOD AP | COCO-OOD AR |
|---|---|---|---|---|---|---|---|
| Sapiens-1B | Sapiens ViT | 1.169B | 210 | 64.3 | 67.4 | 58.8 | 63.3 |
| Sapiens-2B | Sapiens ViT | 2.163B | 210 | 69.6 | 72.2 | 59.6 | 64.0 |
| GenLoc | Stable Diffusion-v1.5 | 0.95B | 14 | 67.0 | 70.8 | N/A | N/A |
| SDPose (Ours) | Stable Diffusion-v2 | 0.95B | 40 | 71.2 | 73.9 | 63.5 | 68.2 |

All models are trained on COCO with an input size of 1024 × 768; full HumanArt results for other methods are in the supplementary.

Zero-shot Downstream Applications

Better Pose-guided Image Generation

For human or humanoid character generation, an accurate skeleton is essential for transferring poses between characters. Traditional pose estimators often fail to precisely capture the skeletons of stylized human or humanoid characters. Our method provides a generalizable pose estimation approach that can benefit animation production. As shown in the figure below, we compare ControlNet outputs using DWPose as the baseline pose annotator. Notably, our SDPose yields more precise and detailed skeletons than DWPose, enabling reliable pose transfer and high-quality image generation for artistic characters.

Visualization of pose-guided image generation results

Visualization of pose-guided image generation results. The lower images illustrate results from the baseline, which combines a pre-trained ControlNet with the DWPose estimator. In comparison, the upper images show results obtained using our SDPose as the pose annotator. Yellow boxes highlight baseline failures. Prompts, random seeds, and other settings are kept identical for fairness.

Pose-guided Video Generation

Controllable video generation has recently gained significant traction. Even as video generation models produce increasingly high-quality outputs, extracting reliable control conditions remains critical for achieving good results. As shown in the figure below, our SDPose provides more accurate poses for the driving frames, enabling more reliable pose-sequence transfer from one animation to another. Video frames are generated by Moore-AnimateAnyone.

Qualitative comparison for pose-controlled video generation in the wild

Qualitative comparison for pose-controlled video generation in the wild. The first row shows the source image and frames from the driving video. The second row shows output video frames generated from the pose sequence estimated by the baseline model DWPose, while the third row shows the results guided by our SDPose. Red boxes highlight failures in the generated video, and yellow boxes highlight errors in pose estimation.

Finetuning Protocol

Leveraging the Multi-scale Latent Features for Pose Estimation

In this paper, we directly leverage the multi-scale latent features of SD U-Net for the pose estimation task. The input image is encoded by the frozen SD-VAE encoder and then fed into the SD U-Net, from which we extract multi-scale features at the upsampling stage. These features serve as robust representations for downstream applications.
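The sketch below shows one way to tap these decoder features with the diffusers API. It is illustrative only: the checkpoint ID, the fixed timestep, and the zero tensor standing in for the empty-text embedding (in practice produced by the frozen CLIP text encoder on the empty prompt) are assumptions, not the authors' exact code.

```python
# Minimal sketch: extract multi-scale features from the SD U-Net decoder
# (up-blocks) with forward hooks on a frozen VAE + U-Net pair.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

model_id = "stabilityai/stable-diffusion-2"          # assumed SD-v2 checkpoint
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

features = []
hooks = [blk.register_forward_hook(lambda m, inp, out: features.append(out))
         for blk in unet.up_blocks]                  # one feature map per decoder scale

@torch.no_grad()
def extract_features(image):
    # image: (B, 3, H, W) in [-1, 1], H and W multiples of 64
    latent = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    # Placeholder for the empty-text embedding; SD-v2 uses a 1024-dim
    # cross-attention space with 77 tokens.
    empty_text = torch.zeros(image.shape[0], 77, 1024)
    features.clear()
    unet(latent, timestep=1000, encoder_hidden_states=empty_text)
    return list(features)                            # coarse-to-fine decoder features
```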

Lightweight Heatmap Decoder Head

Stable Diffusion's U-Net outputs a 4-channel latent for the VAE through a single convolutional layer. In contrast, pose estimation requires K-channel heatmaps with K ≫ 4, making the 4-channel latent a severe information bottleneck. To address this, we replace the original 4-channel head with a lightweight heatmap decoder. The decoder consists of a deconvolution layer for upsampling, followed by two 1×1 convolutions that output K-channel heatmaps. This modification removes the bottleneck and shortens the supervision path to keypoints.
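A minimal sketch of such a head follows. Only the overall structure (one deconvolution followed by two 1×1 convolutions) is specified above; the hidden width, normalization, and activation choices here are assumptions.

```python
# Sketch of a lightweight heatmap decoder: one deconvolution for 2x upsampling,
# then two 1x1 convolutions mapping to K keypoint channels.
import torch.nn as nn

class HeatmapHead(nn.Module):
    def __init__(self, in_channels, num_keypoints, hidden=256):
        super().__init__()
        self.up = nn.Sequential(                     # deconv upsampling
            nn.ConvTranspose2d(in_channels, hidden, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
        )
        self.out = nn.Sequential(                    # two 1x1 convs -> K heatmaps
            nn.Conv2d(hidden, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_keypoints, kernel_size=1),
        )

    def forward(self, feats):                        # feats: U-Net decoder features
        return self.out(self.up(feats))
```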

Training Pipeline of SDPose

Training Pipeline of SDPose. The input RGB image is first encoded into the latent space by a pre-trained VAE. The U-Net is conditioned for multi-task learning via a class embedding. When the class label is set to [0,1], the U-Net predicts the reconstructed RGB latent; when set to [1,0], it produces features for heatmap prediction. The output layer of the U-Net is task-specific: the original convolutional output layer is retained for RGB latent reconstruction, while a lightweight heatmap decoder is used to process the U-Net's intermediate features for keypoint heatmap prediction.

Auxiliary RGB Reconstruction

To preserve the fine-detail representation capability of diffusion priors and to avoid overfitting to the pose estimation domain, we adopt the Detail Preserver strategy. Concretely, we introduce a class embedding C ∈ {C_RGB, C_Pose} that controls the behavior of the denoising U-Net f_θ. When C_RGB is provided, the network is trained to reconstruct the RGB latent z_RGB; when C_Pose is provided, it learns to predict the ground-truth heatmap H_Pose. The overall objective is:

L = ||z_RGB − f_θ(z_input, t, C_RGB)||² + ||H_Pose − f_θ(z_input, t, C_Pose)||²

where z_input is the latent encoded from the input image by the SD-VAE, and t is fixed to t = 1000 in our experiments.
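The sketch below illustrates how this two-branch objective could be computed in a training step. It is not the official implementation: `unet_backbone`, `rgb_head`, and `pose_head` are stand-ins for the class-conditioned U-Net trunk, the original 4-channel output convolution, and the lightweight heatmap decoder, respectively.

```python
# Hedged sketch of the multi-task objective: the same U-Net trunk is queried
# twice with different class labels, once for RGB latent reconstruction and
# once for keypoint heatmap prediction, with equal loss weighting.
import torch.nn.functional as F

T_FIXED = 1000                                       # fixed timestep used in the paper

def sdpose_loss(unet_backbone, rgb_head, pose_head, z_input, z_rgb, gt_heatmaps,
                c_rgb, c_pose, empty_text):
    # Branch 1: reconstruct the RGB latent with the original 4-channel output layer.
    feats_rgb = unet_backbone(z_input, T_FIXED, empty_text, class_labels=c_rgb)
    loss_rgb = F.mse_loss(rgb_head(feats_rgb), z_rgb)

    # Branch 2: predict K-channel keypoint heatmaps with the lightweight decoder.
    feats_pose = unet_backbone(z_input, T_FIXED, empty_text, class_labels=c_pose)
    loss_pose = F.mse_loss(pose_head(feats_pose), gt_heatmaps)

    return loss_rgb + loss_pose                      # equal weighting, as in the equation above
```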

Inference

The input RGB image x is encoded by the SD-VAE into the latent representation z_RGB. The latent diffusion U-Net then performs a single-step regression with the timestep fixed at t = 1000, using the class label C_Pose to execute the pose estimation task. The text condition is disabled by feeding an empty text embedding to the U-Net.

SDPose Inference Pipeline

SDPose Inference Pipeline. The input RGB image is encoded by the SD-VAE into the latent representation. The latent diffusion U-Net then performs a single-step regression with the timestep fixed at t = 1000, using the class label C_Pose to execute the pose estimation task. The text condition is disabled by feeding an empty text embedding to the U-Net.
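Put together, inference reduces to a single forward pass. The sketch below reuses the same stand-in modules as the training sketch and reads out keypoint locations with a simple per-channel argmax; the paper's exact heatmap decoding procedure may differ.

```python
# Minimal inference sketch: one forward pass at t = 1000 with the pose class
# label and an empty-text embedding, followed by the heatmap decoder and an
# argmax readout per keypoint channel.
import torch

@torch.no_grad()
def predict_keypoints(vae, unet_backbone, pose_head, image, c_pose, empty_text):
    z = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    feats = unet_backbone(z, 1000, empty_text, class_labels=c_pose)   # single step
    heatmaps = pose_head(feats)                                       # (B, K, h, w)
    B, K, h, w = heatmaps.shape
    idx = heatmaps.flatten(2).argmax(-1)                              # peak per keypoint
    xs, ys = idx % w, idx // w
    return torch.stack([xs, ys], dim=-1), heatmaps                    # coords on the heatmap grid
```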

BibTeX

@misc{liang2025sdposeexploitingdiffusionpriors,
      title={SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation}, 
      author={Shuang Liang and Jing He and Chuanmeizhi Wang and Lejun Liao and Guo Zhang and Yingcong Chen and Yuan Yuan},
      year={2025},
      eprint={2509.24980},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.24980}, 
}