Computer Vision and Pattern Recognition
☆ 3D Aware Region Prompted Vision Language Model
An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu
We present Spatial Region 3D (SR-3D), a 3D-aware vision-language model that
connects single-view 2D images and multi-view 3D data through a shared visual
token space. SR-3D supports flexible region prompting, allowing users to
annotate regions with bounding boxes or segmentation masks on any frame, or
directly in 3D, without the need for exhaustive multi-frame labeling. We
achieve this by enriching 2D visual features with 3D positional embeddings,
which allows the 3D model to draw upon strong 2D priors for more accurate
spatial reasoning across frames, even when objects of interest do not co-occur
within the same view. Extensive experiments on both general 2D vision-language
and specialized 3D spatial benchmarks demonstrate that SR-3D achieves
state-of-the-art performance, underscoring its effectiveness in unifying 2D
and 3D representation spaces for scene understanding. Moreover, we observe
applicability to in-the-wild videos without sensory 3D inputs or ground-truth
3D annotations, where SR-3D accurately infers spatial relationships and metric
measurements.
comment: Project Website: https://www.anjiecheng.me/sr3d
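The idea of enriching 2D visual features with 3D positional embeddings can be illustrated with a minimal sketch. The abstract does not say how the embeddings are built, so everything below is an assumption made for illustration: a pinhole camera with known intrinsics and per-patch depth, a sinusoidal encoding of the back-projected point, and simple addition to the patch feature. The function names `backproject`, `sinusoidal_3d`, and `add_3d_position` are hypothetical and are not taken from SR-3D.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    Assumes a pinhole camera; a multi-view setup would additionally map the
    point into a shared world frame using the camera pose.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def sinusoidal_3d(point, dim):
    """Encode a 3D point as a sinusoidal vector of length `dim`.

    Assumption: `dim` is divisible by 6, so each coordinate receives
    dim // 6 sine/cosine pairs.
    """
    assert dim % 6 == 0, "dim must be divisible by 6"
    n_freq = dim // 6
    emb = []
    for coord in point:
        for k in range(n_freq):
            freq = 1.0 / (10000 ** (k / n_freq))
            emb.append(math.sin(coord * freq))
            emb.append(math.cos(coord * freq))
    return emb

def add_3d_position(feature, point):
    """Add a 3D positional embedding to one 2D patch feature (hypothetical fusion)."""
    pe = sinusoidal_3d(point, len(feature))
    return [f + p for f, p in zip(feature, pe)]

# Usage: a patch centered on the principal point, 2 m from the camera.
point = backproject(320.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
enriched = add_3d_position([0.0] * 12, point)
```

Because the positional term depends only on the 3D location, patches from different frames that observe the same point would receive the same embedding once they are expressed in a common frame, which is one plausible way for a model to relate objects that never appear in the same view.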
☆ StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance SIGGRAPH
Creating 3D assets that follow the texture and geometry style of existing
ones is often desirable, or even necessary, in practical applications like video
gaming and virtual reality. While impressive progress has been made in
generating 3D objects from text or images, creating style-controllable 3D
assets remains a complex and challenging problem. In this work, we propose
StyleSculptor, a novel training-free approach for generating style-guided 3D
assets from a content image and one or more style images. Unlike previous
works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner,
enabling fine-grained 3D style control that captures the texture style, the
geometry style, or both from user-provided style images. At the core of StyleSculptor is a
novel Style Disentangled Attention (SD-Attn) module, which establishes a
dynamic interaction between the input content image and style image for
style-guided 3D asset generation via a cross-3D attention mechanism, enabling
stable feature fusion and effective style-guided generation. To alleviate
semantic content leakage, we also introduce a style-disentangled feature
selection strategy within the SD-Attn module, which leverages the variance of
3D feature patches to disentangle style- and content-significant channels,
allowing selective feature injection within the attention framework. With
SD-Attn, the network can dynamically compute texture-, geometry-, or
both-guided features to steer the 3D generation process. Built upon this, we
further propose the Style Guided Control (SGC) mechanism, which enables
exclusive geometry- or texture-only stylization, as well as adjustable style
intensity control. Extensive experiments demonstrate that StyleSculptor
outperforms existing baseline methods in producing high-fidelity 3D assets.
comment: SIGGRAPH Asia 2025 Conference Paper
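The variance-based feature selection described in the abstract can be sketched roughly as follows. The abstract states only that the variance of 3D feature patches is used to separate style-significant from content-significant channels; the rest is assumed here for illustration, in particular that higher-variance channels are treated as style-significant, that a fixed ratio sets the split, and that injection is a plain channel overwrite. The names `channel_variances`, `split_channels`, and `inject_style` are hypothetical and are not the paper's API.

```python
from statistics import pvariance

def channel_variances(patches):
    """Per-channel variance across patches.

    `patches` is a list of N feature vectors, each of length C.
    """
    n_channels = len(patches[0])
    return [pvariance([p[c] for p in patches]) for c in range(n_channels)]

def split_channels(patches, style_ratio=0.5):
    """Split channel indices into (style, content) groups by variance.

    Assumption: the highest-variance channels are labeled style-significant.
    The actual criterion used by the method may differ.
    """
    var = channel_variances(patches)
    order = sorted(range(len(var)), key=lambda c: var[c], reverse=True)
    k = int(len(var) * style_ratio)
    return sorted(order[:k]), sorted(order[k:])

def inject_style(content_feat, style_feat, style_channels):
    """Overwrite only the selected channels of a content feature with style values."""
    out = list(content_feat)
    for c in style_channels:
        out[c] = style_feat[c]
    return out

# Usage: two patches with four channels each.
patches = [[0.0, 1.0, 5.0, 2.0], [0.0, 3.0, 5.0, 2.5]]
style_idx, content_idx = split_channels(patches)
mixed = inject_style([9.0, 9.0, 9.0, 9.0], [1.0, 2.0, 3.0, 4.0], style_idx)
```

Restricting injection to a subset of channels is what would limit semantic content leakage in this sketch: channels assigned to the content group are never overwritten by the style image.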
♻ ☆ CryoSplat: Gaussian Splatting for Cryo-EM Homogeneous Reconstruction
As a critical modality for structural biology, cryogenic electron microscopy
(cryo-EM) facilitates the determination of macromolecular structures at
near-atomic resolution. The core computational task in single-particle cryo-EM
is to reconstruct the 3D electrostatic potential of a molecule from a large
collection of noisy 2D projections acquired at unknown orientations. Gaussian
mixture models (GMMs) provide a continuous, compact, and physically
interpretable representation for molecular density and have recently gained
interest in cryo-EM reconstruction. However, existing methods rely on external
consensus maps or atomic models for initialization, limiting their use in
self-contained pipelines. Addressing this issue, we introduce cryoGS, a
GMM-based method that integrates Gaussian splatting with the physics of cryo-EM
image formation. In particular, we develop an orthogonal projection-aware
Gaussian splatting scheme, with adaptations such as a normalization term and
an FFT-aligned coordinate system tailored for cryo-EM imaging. All these
innovations enable stable and efficient homogeneous reconstruction directly
from raw cryo-EM particle images using random initialization. Experimental
results on real datasets validate the effectiveness and robustness of cryoGS
over representative baselines. The code will be released upon publication.
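To make the orthogonal projection of a Gaussian mixture concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: isotropic Gaussians, projection along the z axis with no particle rotation, no contrast transfer function, and a grid whose origin sits at index `size // 2`. For an isotropic Gaussian with amplitude A and standard deviation sigma, integrating along z gives a 2D Gaussian scaled by sigma * sqrt(2 * pi), and that factor is the only normalization used here. The function names are hypothetical, and the normalization term and coordinate convention in cryoGS may differ.

```python
import math

def project_gaussian(amplitude, center, sigma, x, y):
    """Line integral along z of an isotropic 3D Gaussian, evaluated at (x, y).

    The z coordinate of `center` drops out because the integral spans all z.
    """
    dx, dy = x - center[0], y - center[1]
    scale = amplitude * sigma * math.sqrt(2.0 * math.pi)
    return scale * math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

def render_projection(gaussians, size, pixel_size=1.0):
    """Render the summed projection of (amplitude, center, sigma) tuples.

    Assumption: the coordinate origin is placed at index size // 2 on both axes.
    """
    half = size // 2
    image = []
    for row in range(size):
        y = (row - half) * pixel_size
        image.append([
            sum(project_gaussian(a, c, s, (col - half) * pixel_size, y)
                for a, c, s in gaussians)
            for col in range(size)
        ])
    return image

# Usage: one unit Gaussian at the origin, rendered on a 4 x 4 grid.
image = render_projection([(1.0, (0.0, 0.0, 0.0), 1.0)], size=4)
```

Since the projection has a closed form, each Gaussian contributes to the image without sampling along the viewing direction, which is part of what makes a splatting-style formulation efficient.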