Computer Vision and Pattern Recognition
☆ Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation
Zhen Xu, Hongyu Zhou, Sida Peng, Haotong Lin, Haoyu Guo, Jiahao Shao, Peishan Yang, Qinglin Yang, Sheng Miao, Xingyi He, Yifan Wang, Yue Wang, Ruizhen Hu, Yiyi Liao, Xiaowei Zhou, Hujun Bao
Depth estimation is a fundamental task in 3D computer vision, crucial for
applications such as 3D reconstruction, free-viewpoint rendering, robotics,
autonomous driving, and AR/VR technologies. Traditional methods relying on
hardware sensors like LiDAR are often constrained by high costs, low resolution,
and environmental sensitivity, which limits their applicability in real-world
scenarios. Recent advances in vision-based methods offer a promising
alternative, yet they face challenges in generalization and stability due to
either low-capacity model architectures or reliance on domain-specific,
small-scale datasets. The emergence of scaling laws and foundation models
in other domains has inspired the development of "depth foundation models":
deep neural networks trained on large datasets with strong zero-shot
generalization capabilities. This paper surveys the evolution of deep learning
architectures and paradigms for depth estimation across the monocular, stereo,
multi-view, and monocular video settings. We explore the potential of these
models to address existing challenges and provide a comprehensive overview of
large-scale datasets that can facilitate their development. By identifying key
architectures and training strategies, we aim to highlight the path towards
robust depth foundation models, offering insights into their future research
and applications.
☆ Streaming 4D Visual Geometry Transformer
Perceiving and reconstructing 4D spatial-temporal geometry from videos is a
fundamental yet challenging computer vision task. To facilitate interactive and
real-time applications, we propose a streaming 4D visual geometry transformer
that shares a similar philosophy with autoregressive large language models. We
explore a simple and efficient design and employ a causal transformer
architecture to process the input sequence in an online manner. We use temporal
causal attention and cache the historical keys and values as implicit memory to
enable efficient streaming long-term 4D reconstruction. This design can handle
real-time 4D reconstruction by incrementally integrating historical information
while maintaining high-quality spatial consistency. For efficient training, we
propose to distill knowledge from the dense bidirectional visual geometry
grounded transformer (VGGT) to our causal model. For inference, our model
supports the migration of optimized efficient attention operators (e.g.,
FlashAttention) from the field of large language models. Extensive experiments
on various 4D geometry perception benchmarks demonstrate that our model
increases the inference speed in online scenarios while maintaining competitive
performance, paving the way for scalable and interactive 4D vision systems.
Code is available at: https://github.com/wzzheng/StreamVGGT.
comment: Code is available at: https://github.com/wzzheng/StreamVGGT
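To make the streaming design concrete, the sketch below (a minimal illustration, not the StreamVGGT implementation; layer sizes and token counts are assumed) shows temporal causal attention that caches the keys and values of past frames as implicit memory, so each newly arriving frame attends only to the accumulated history.

```python
import torch
import torch.nn.functional as F

class StreamingCausalAttention(torch.nn.Module):
    """Single-head attention whose keys/values of past frames are cached."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.cache_k, self.cache_v = None, None  # implicit memory of the history

    @torch.no_grad()
    def step(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (num_tokens, dim) for the newest frame only
        q, k, v = self.qkv(frame_tokens).chunk(3, dim=-1)
        # append the current keys/values to the cache (history grows over time)
        self.cache_k = k if self.cache_k is None else torch.cat([self.cache_k, k], dim=0)
        self.cache_v = v if self.cache_v is None else torch.cat([self.cache_v, v], dim=0)
        attn = F.softmax(q @ self.cache_k.T / q.shape[-1] ** 0.5, dim=-1)
        return attn @ self.cache_v  # (num_tokens, dim), conditioned on all past frames

layer = StreamingCausalAttention(dim=64)
for t in range(5):                         # frames arrive one by one (online setting)
    out = layer.step(torch.randn(16, 64))  # 16 tokens per frame (hypothetical size)
print(out.shape)  # torch.Size([16, 64])
```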
☆ CharaConsist: Fine-Grained Consistent Character Generation ICCV 2025
In text-to-image generation, producing a series of consistent images that
preserve the same identity is highly valuable for real-world applications.
Although a few works have explored training-free methods to enhance the
consistency of generated subjects, we observe that they suffer from the
following problems. First, they fail to maintain consistent background details,
which limits their applicability. Furthermore, when the foreground character
undergoes large motion variations, inconsistencies in identity and clothing
details become evident. To address these problems, we propose CharaConsist,
which employs point-tracking attention and adaptive token merge along with
decoupled control of the foreground and background. CharaConsist enables
fine-grained consistency for both foreground and background, supporting the
generation of one character in continuous shots within a fixed scene or in
discrete shots across different scenes. Moreover, CharaConsist is the first
consistent generation method tailored for the text-to-image DiT model. Its
ability to maintain fine-grained consistency, combined with the larger capacity
of the latest base model, enables it to produce high-quality visual outputs,
broadening its applicability to a wider range of real-world scenarios. The
source code has been released at https://github.com/Murray-Wang/CharaConsist
comment: ICCV 2025 accepted paper, project page:
https://murray-wang.github.io/CharaConsist/
☆ CATVis: Context-Aware Thought Visualization MICCAI 2025
EEG-based brain-computer interfaces (BCIs) have shown promise in various
applications, such as motor imagery and cognitive state monitoring. However,
decoding visual representations from EEG signals remains a significant
challenge due to their complex and noisy nature. We thus propose a novel
5-stage framework for decoding visual representations from EEG signals: (1) an
EEG encoder for concept classification, (2) cross-modal alignment of EEG and
text embeddings in CLIP feature space, (3) caption refinement via re-ranking,
(4) weighted interpolation of concept and caption embeddings for richer
semantics, and (5) image generation using a pre-trained Stable Diffusion model.
We enable context-aware EEG-to-image generation through cross-modal alignment
and re-ranking. Experimental results demonstrate that our method generates
high-quality images aligned with visual stimuli, outperforming SOTA approaches
by 13.43% in Classification Accuracy and 15.21% in Generation Accuracy, and
reducing the Fréchet Inception Distance by 36.61%, indicating superior semantic
alignment and image quality.
comment: Accepted at MICCAI 2025. This is the submitted version prior to peer
review. The final Version of Record will appear in the MICCAI 2025
proceedings (Springer LNCS)
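The sketch below illustrates stage (4) of the pipeline described above, weighted interpolation of concept and caption embeddings in CLIP feature space; the 512-d vectors and the weight alpha=0.6 are illustrative assumptions, not values from the paper.

```python
import numpy as np

def interpolate_embeddings(concept_emb, caption_emb, alpha=0.6):
    """Blend two L2-normalized embeddings; alpha weights the concept embedding."""
    concept_emb = concept_emb / np.linalg.norm(concept_emb)
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    mixed = alpha * concept_emb + (1.0 - alpha) * caption_emb
    return mixed / np.linalg.norm(mixed)  # keep the conditioning vector on the unit sphere

# Toy 512-d vectors standing in for CLIP text embeddings of a concept and a caption.
rng = np.random.default_rng(0)
concept, caption = rng.normal(size=512), rng.normal(size=512)
conditioning = interpolate_embeddings(concept, caption, alpha=0.6)
print(conditioning.shape, round(float(np.linalg.norm(conditioning)), 3))  # (512,) 1.0
```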
☆ COLIBRI Fuzzy Model: Color Linguistic-Based Representation and Interpretation
Pakizar Shamoi, Nuray Toganas, Muragul Muratbekova, Elnara Kadyrgali, Adilet Yerkin, Ayan Igali, Malika Ziyada, Ayana Adilova, Aron Karatayev, Yerdauit Torekhan
Colors are omnipresent in today's world and play a vital role in how humans
perceive and interact with their surroundings. However, it is challenging for
computers to imitate human color perception. This paper introduces the Human
Perception-Based Fuzzy Color Model, COLIBRI (Color Linguistic-Based
Representation and Interpretation), designed to bridge the gap between
computational color representations and human visual perception. The proposed
model uses fuzzy sets and logic to create a framework for color categorization.
Using a three-phase experimental approach, the study first identifies
distinguishable color stimuli for hue, saturation, and intensity through
preliminary experiments, followed by a large-scale human categorization survey
involving more than 1000 human subjects. The resulting data are used to extract
fuzzy partitions and generate membership functions that reflect real-world
perceptual uncertainty. The model incorporates a mechanism for adaptation that
allows refinement based on feedback and contextual changes. Comparative
evaluations demonstrate the model's alignment with human perception compared to
traditional color models, such as RGB, HSV, and LAB. To the best of our
knowledge, no previous research has documented the construction of a model for
color attribute specification based on a sample of this size or a comparable
sample of the human population (n = 2496). Our findings are significant for
fields such as design, artificial intelligence, marketing, and human-computer
interaction, where perceptually relevant color representation is critical.
comment: submitted to IEEE for consideration
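As a rough illustration of the kind of fuzzy partition COLIBRI derives from survey data, the sketch below defines trapezoidal membership functions over the hue axis; the linguistic terms and breakpoints are hypothetical, not the ones extracted in the study.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Hypothetical fuzzy partition of the hue axis (degrees) into linguistic terms.
HUE_TERMS = {
    "red":    (-20, -5, 15, 35),
    "yellow": (25, 45, 70, 90),
    "green":  (80, 100, 150, 170),
    "blue":   (160, 200, 260, 280),
}

def fuzzify_hue(hue_deg: float) -> dict:
    """Return the membership degree of a hue value in every linguistic color term."""
    return {term: trapezoid(hue_deg, *abcd) for term, abcd in HUE_TERMS.items()}

print(fuzzify_hue(30.0))  # partial membership in both "red" and "yellow"
```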
☆ C-FBI: A Combinatorial method using Convolutions for Circle Fitting in Blurry Images
This paper addresses the fundamental computer vision challenge of robust
circle detection and fitting in degraded imaging conditions. We present
Combinatorial Convolution-based Circle Fitting for Blurry Images (3C-FBI), an
algorithm that bridges the gap between circle detection and precise parametric
fitting by combining (1) efficient combinatorial edge pixel (edgel) sampling
and (2) convolution-based density estimation in parameter space.
We evaluate 3C-FBI across three experimental frameworks: (1) real-world
medical data from Parkinson's disease assessments (144 frames from 36 videos),
(2) controlled synthetic data following established circle-fitting benchmarks,
and (3) systematic analysis across varying spatial resolutions and outlier
contamination levels. Results show that 3C-FBI achieves state-of-the-art
accuracy (Jaccard index 0.896) while maintaining real-time performance (40.3
fps), significantly outperforming classical methods like RCD (6.8 fps) on a
standard CPU (i7-10875H). It maintains near-perfect accuracy (Jaccard almost
1.0) at high resolutions (480x480) and reliable performance (Jaccard higher
than 0.95) down to 160x160 with up to 20% outliers.
In extensive synthetic testing, 3C-FBI achieves a mean Jaccard Index of 0.989
across contamination levels, comparable to modern methods like Qi et al. (2024,
0.991), and surpassing RHT (0.964). This combination of accuracy, speed, and
robustness makes 3C-FBI ideal for medical imaging, robotics, and industrial
inspection under challenging conditions.
comment: 22 pages, 16 figures
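The sketch below is a simplified stand-in for the two ingredients named in the abstract, combinatorial edgel sampling and density estimation in circle-parameter space; it uses a coarse histogram in place of the paper's convolution-based density estimator, and all sizes are illustrative.

```python
import numpy as np

def circle_from_three_points(p1, p2, p3):
    """Solve for the circle (cx, cy, r) passing through three 2D points."""
    A = 2.0 * np.array([p2 - p1, p3 - p1])
    b = np.array([p2 @ p2 - p1 @ p1, p3 @ p3 - p1 @ p1])
    if abs(np.linalg.det(A)) < 1e-9:          # collinear triplet: no circle
        return None
    center = np.linalg.solve(A, b)
    return center[0], center[1], np.linalg.norm(center - p1)

def fit_circle(edgels, n_samples=2000, bin_size=2.0, rng=np.random.default_rng(0)):
    candidates = []
    for _ in range(n_samples):                # combinatorial sampling of edgel triplets
        i, j, k = rng.choice(len(edgels), size=3, replace=False)
        c = circle_from_three_points(edgels[i], edgels[j], edgels[k])
        if c is not None:
            candidates.append(c)
    candidates = np.array(candidates)
    bins = np.round(candidates / bin_size).astype(int)      # coarse parameter-space cells
    _, inv, counts = np.unique(bins, axis=0, return_inverse=True, return_counts=True)
    inv = np.ravel(inv)
    return candidates[inv == counts.argmax()].mean(axis=0)  # average of the densest cell

# Synthetic noisy circle (center (100, 120), radius 40) plus 20% outliers.
rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 200)
pts = np.c_[100 + 40 * np.cos(t), 120 + 40 * np.sin(t)] + rng.normal(0, 1.0, (200, 2))
pts = np.vstack([pts, rng.uniform(0, 240, (50, 2))])
print(fit_circle(pts))  # approximately [100, 120, 40]
```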
☆ HUG-VAS: A Hierarchical NURBS-Based Generative Model for Aortic Geometry Synthesis and Controllable Editing
Accurate characterization of vascular geometry is essential for
cardiovascular diagnosis and treatment planning. Traditional statistical shape
modeling (SSM) methods rely on linear assumptions, limiting their expressivity
and scalability to complex topologies such as multi-branch vascular structures.
We introduce HUG-VAS, a Hierarchical NURBS Generative model for Vascular
geometry Synthesis, which integrates NURBS surface parameterization with
diffusion-based generative modeling to synthesize realistic, fine-grained
aortic geometries. Trained with 21 patient-specific samples, HUG-VAS generates
anatomically faithful aortas with supra-aortic branches, yielding biomarker
distributions that closely match those of the original dataset. HUG-VAS adopts
a hierarchical architecture comprising a denoising diffusion model that
generates centerlines and a guided diffusion model that synthesizes radial
profiles conditioned on those centerlines, thereby capturing two layers of
anatomical variability. Critically, the framework supports zero-shot
conditional generation from image-derived priors, enabling practical
applications such as interactive semi-automatic segmentation, robust
reconstruction under degraded imaging conditions, and implantable device
optimization. To our knowledge, HUG-VAS is the first SSM framework to bridge
image-derived priors with generative shape modeling via a unified integration
of NURBS parameterization and hierarchical diffusion processes.
comment: 59 pages, 9 figures
☆ Elevating 3D Models: High-Quality Texture and Geometry Refinement from a Low-Quality Model SIGGRAPH 2025
High-quality 3D assets are essential for various applications in computer
graphics and 3D vision but remain scarce due to significant acquisition costs.
To address this shortage, we introduce Elevate3D, a novel framework that
transforms readily accessible low-quality 3D assets into higher quality. At the
core of Elevate3D is HFS-SDEdit, a specialized texture enhancement method that
significantly improves texture quality while preserving the overall appearance
and geometry and fixing their degradations. Furthermore, Elevate3D operates in a
view-by-view manner, alternating between texture and geometry refinement.
Unlike previous methods that have largely overlooked geometry refinement, our
framework leverages geometric cues from images refined with HFS-SDEdit by
employing state-of-the-art monocular geometry predictors. This approach ensures
detailed and accurate geometry that aligns seamlessly with the enhanced
texture. Elevate3D outperforms recent competitors by achieving state-of-the-art
quality in 3D model refinement, effectively addressing the scarcity of
high-quality open-source 3D assets.
comment: Accepted to SIGGRAPH 2025. For the project page, see
https://cg.postech.ac.kr/research/Elevate3D/
☆ Deep Equilibrium models for Poisson Imaging Inverse problems via Mirror Descent
Deep Equilibrium Models (DEQs) are implicit neural networks with fixed
points, which have recently gained attention for learning image regularization
functionals, particularly in settings involving Gaussian fidelities, where
assumptions on the forward operator ensure contractiveness of standard
(proximal) Gradient Descent operators. In this work, we extend the application
of DEQs to Poisson inverse problems, where the data fidelity term is more
appropriately modeled by the Kullback-Leibler divergence. To this end, we
introduce a novel DEQ formulation based on Mirror Descent defined in terms of a
tailored non-Euclidean geometry that naturally adapts to the structure of the
data term. This enables the learning of neural regularizers within a principled
training framework. We derive sufficient conditions to guarantee the
convergence of the learned reconstruction scheme and propose computational
strategies that enable both efficient training and fully parameter-free
inference. Numerical experiments show that our method outperforms traditional
model-based approaches and performs comparably to Bregman Plug-and-Play
methods, while mitigating their typical drawbacks, namely sensitivity to
initialization and the need for careful hyperparameter tuning. The code
is publicly available at https://github.com/christiandaniele/DEQ-MD.
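To clarify the optimization backbone, the sketch below runs plain entropic mirror descent on a Poisson (Kullback-Leibler) data fidelity; the learned DEQ regularizer and the convergence conditions from the paper are deliberately omitted, and the problem sizes are toy assumptions.

```python
import numpy as np

def kl_grad(x, A, y, eps=1e-8):
    """Gradient of the Poisson negative log-likelihood sum(Ax - y*log(Ax)) w.r.t. x."""
    Ax = A @ x
    return A.T @ (1.0 - y / (Ax + eps))

def mirror_descent(A, y, steps=500, step_size=1e-3):
    x = np.ones(A.shape[1])                             # strictly positive initialization
    for _ in range(steps):
        x = x * np.exp(-step_size * kl_grad(x, A, y))   # entropic (multiplicative) step
    return x

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=(200, 50))               # toy nonnegative forward operator
x_true = rng.uniform(0.5, 2.0, size=50)
y = rng.poisson(A @ x_true).astype(float)               # Poisson-distributed measurements
x_hat = mirror_descent(A, y)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # relative error
```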
☆ COLI: A Hierarchical Efficient Compressor for Large Images
The escalating adoption of high-resolution, large-field-of-view imagery
amplifies the need for efficient compression methodologies. Conventional
techniques frequently fail to preserve critical image details, while
data-driven approaches exhibit limited generalizability. Implicit Neural
Representations (INRs) present a promising alternative by learning continuous
mappings from spatial coordinates to pixel intensities for individual images,
thereby storing network weights rather than raw pixels and avoiding the
generalization problem. However, INR-based compression of large images faces
challenges including slow compression speed and suboptimal compression ratios.
To address these limitations, we introduce COLI (Compressor for Large Images),
a novel framework leveraging Neural Representations for Videos (NeRV). First,
recognizing that INR-based compression constitutes a training process, we
accelerate its convergence through a pretraining-finetuning paradigm,
mixed-precision training, and reformulation of the sequential loss into a
parallelizable objective. Second, capitalizing on INRs' transformation of image
storage constraints into weight storage, we implement Hyper-Compression, a
novel post-training technique to substantially enhance compression ratios while
maintaining minimal output distortion. Evaluations across two medical imaging
datasets demonstrate that COLI consistently achieves competitive or superior
PSNR and SSIM metrics at significantly reduced bits per pixel (bpp), while
accelerating NeRV training by up to 4 times.
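The sketch below illustrates the underlying INR idea, overfitting a coordinate-to-intensity MLP so that network weights replace raw pixels; COLI's NeRV backbone, pretraining-finetuning, mixed precision, and Hyper-Compression are not shown, and the network and image sizes are assumptions.

```python
import torch

H = W = 128
yy, xx = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xx, yy], dim=-1).reshape(-1, 2)                  # (H*W, 2)
target = torch.sin(4 * coords[:, :1]) * torch.cos(4 * coords[:, 1:])   # toy "image"

inr = torch.nn.Sequential(                 # tiny coordinate-to-intensity network
    torch.nn.Linear(2, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(inr.parameters(), lr=1e-3)
for step in range(2000):                   # "compression" amounts to fitting the network
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(inr(coords), target)
    loss.backward()
    opt.step()

n_params = sum(p.numel() for p in inr.parameters())
print(f"final MSE {loss.item():.5f}; stored weights: {n_params} vs {H * W} pixels")
```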
☆ Implementing Adaptations for Vision AutoRegressive Model ICML 2025
Vision AutoRegressive model (VAR) was recently introduced as an alternative
to Diffusion Models (DMs) in image generation domain. In this work we focus on
its adaptations, which aim to fine-tune pre-trained models to perform specific
downstream tasks, like medical data generation. While for DMs there exist many
techniques, adaptations for VAR remain underexplored. Similarly, differentially
private (DP) adaptations, which aim to preserve the privacy of the adaptation
data, have been extensively studied for DMs, while VAR lacks such solutions. In
our work, we implement and benchmark many strategies for VAR and compare them
to state-of-the-art DM adaptation strategies. We observe that VAR outperforms
DMs for non-DP adaptations; however, performance under DP suffers, which
necessitates further research in private adaptations for VAR. Code is available
at https://github.com/sprintml/finetuning_var_dp.
comment: Accepted at DIG-BUGS: Data in Generative Models Workshop @ ICML 2025
☆ U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV MICCAI2025
Achieving equity in healthcare accessibility requires lightweight yet
high-performance solutions for medical image segmentation, particularly in
resource-limited settings. Existing methods like U-Net and its variants often
suffer from limited global Effective Receptive Fields (ERFs), hindering their
ability to capture long-range dependencies. To address this, we propose U-RWKV,
a novel framework leveraging the Recurrent Weighted Key-Value (RWKV)
architecture, which achieves efficient long-range modeling at O(N)
computational cost. The framework introduces two key innovations: the
Direction-Adaptive RWKV Module (DARM) and the Stage-Adaptive
Squeeze-and-Excitation Module (SASE). DARM employs Dual-RWKV and QuadScan
mechanisms to aggregate contextual cues across images, mitigating directional
bias while preserving global context and maintaining high computational
efficiency. SASE dynamically adapts its architecture to different feature
extraction stages, balancing high-resolution detail preservation and semantic
relationship capture. Experiments demonstrate that U-RWKV achieves
state-of-the-art segmentation performance with high computational efficiency,
offering a practical solution for democratizing advanced medical imaging
technologies in resource-constrained environments. The code is available at
https://github.com/hbyecoding/U-RWKV.
comment: Accepted by MICCAI2025
☆ Stochastic Entanglement Configuration for Constructive Entanglement Topologies in Quantum Machine Learning with Application to Cardiac MRI
Efficient entanglement strategies are essential for advancing variational
quantum circuits (VQCs) for quantum machine learning (QML). However, most
current approaches use fixed entanglement topologies that are not adaptive to
task requirements, limiting potential gains over classical models. We introduce
a novel stochastic entanglement configuration method that systematically
generates diverse entanglement topologies to identify a subspace of
constructive entanglement configurations, defined as entanglement topologies
that boost hybrid model performance (e.g., classification accuracy) beyond
classical baselines. Each configuration is encoded as a stochastic binary
matrix, denoting directed entanglement between qubits. This enables scalable
exploration of the hyperspace of candidate entanglement topologies using
entanglement density and per-qubit constraints as key metrics. We define
unconstrained and constrained sampling modes, controlling entanglement per
qubit. Using our method, 400 stochastic configurations were generated and
evaluated in a hybrid QML model for cardiac MRI disease classification. We identified
64 (16%) novel constructive entanglement configurations that consistently
outperformed the classical baseline. Ensemble aggregation of top-performing
configurations achieved ~0.92 classification accuracy, exceeding the classical
model (~0.87) by over 5%. Compared to four conventional topologies (ring,
nearest neighbor, no entanglement, fully entangled), none surpassed the
classical baseline (maximum accuracy ~0.82), while our configurations delivered
up to ~20% higher accuracy. These results highlight the robustness and
generalizability of the identified constructive entanglement configurations.
comment: Accepted for publication at IEEE International Conference on Quantum
Computing and Engineering (QCE) 2025
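A minimal sketch of the encoding described above, with details that are not in the abstract filled in as assumptions: each topology is a stochastic binary matrix whose entry (i, j) marks a directed entangling link from qubit i to qubit j, with an optional per-qubit cap for the constrained sampling mode and entanglement density as a summary metric.

```python
import numpy as np

def sample_topology(n_qubits, p=0.3, max_per_qubit=None, rng=np.random.default_rng(0)):
    """Draw a random directed entanglement topology as a binary matrix."""
    M = (rng.random((n_qubits, n_qubits)) < p).astype(int)
    np.fill_diagonal(M, 0)                               # no self-entanglement
    if max_per_qubit is not None:                        # constrained sampling mode
        for i in range(n_qubits):
            ones = np.flatnonzero(M[i])
            if len(ones) > max_per_qubit:
                drop = rng.choice(ones, len(ones) - max_per_qubit, replace=False)
                M[i, drop] = 0
    return M

def entanglement_density(M):
    n = M.shape[0]
    return M.sum() / (n * (n - 1))                       # used links / possible links

configs = [sample_topology(4, p=0.5, max_per_qubit=2, rng=np.random.default_rng(s))
           for s in range(400)]                          # 400 candidate topologies
print(configs[0], entanglement_density(configs[0]))
```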
☆ Attributes Shape the Embedding Space of Face Recognition Models
Face Recognition (FR) tasks have made significant progress with the advent of
Deep Neural Networks, particularly through margin-based triplet losses that
embed facial images into high-dimensional feature spaces. During training,
these contrastive losses focus exclusively on identity information as labels.
However, we observe a multiscale geometric structure emerging in the embedding
space, influenced by interpretable facial (e.g., hair color) and image
attributes (e.g., contrast). We propose a geometric approach to describe the
dependence or invariance of FR models to these attributes and introduce a
physics-inspired alignment metric. We evaluate the proposed metric on
controlled, simplified models and widely used FR models fine-tuned with
synthetic data for targeted attribute augmentation. Our findings reveal that
the models exhibit varying degrees of invariance across different attributes,
providing insight into their strengths and weaknesses and enabling deeper
interpretability. Code available here:
https://github.com/mantonios107/attrs-fr-embs
☆ UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks
Real-world user-generated videos, especially on platforms like TikTok, often
feature rich and intertwined audio-visual content. However, existing video
captioning benchmarks and models remain predominantly visual-centric,
overlooking the crucial role of audio in conveying scene dynamics, speaker
intent, and narrative context. This lack of omnimodal datasets and lightweight,
capable models hampers progress in fine-grained, multimodal video
understanding. To address these challenges, we introduce UGC-VideoCap, a new
benchmark and model framework specifically designed for detailed omnimodal
captioning of short-form user-generated videos. Unlike prior datasets,
UGC-VideoCap emphasizes balanced integration of audio and visual modalities,
featuring 1000 TikTok videos annotated through a structured three-stage
human-in-the-loop pipeline covering audio-only, visual-only, and joint
audio-visual semantics. The benchmark also includes 4000 carefully crafted QA
pairs probing both unimodal and cross-modal understanding. Alongside the
dataset, we propose UGC-VideoCaptioner (3B), a 3B-parameter captioning model
distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy,
supervised fine-tuning followed by Group Relative Policy Optimization (GRPO),
our approach
enables efficient adaptation from limited data while maintaining competitive
performance. Together, our benchmark and model offer a high-quality foundation
and a data-efficient solution for advancing omnimodal video captioning in
unconstrained real-world UGC settings.
☆ MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network ICCV 2025
Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for
a sequence of calibrated images to recover dense point clouds. However,
existing MVS methods often struggle with challenging regions, such as
textureless regions and reflective surfaces, where feature matching fails. In
contrast, monocular depth estimation inherently does not require feature
matching, allowing it to achieve robust relative depth estimation in these
regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature
and depth guided MVS network that integrates powerful priors from a monocular
foundation model into multi-view geometry. Firstly, the monocular feature of
the reference view is integrated into source view features by the attention
mechanism with a newly designed cross-view position encoding. Then, the
monocular depth of the reference view is aligned to dynamically update the
depth candidates for edge regions during the sampling procedure. Finally, a
relative consistency loss is further designed based on the monocular depth to
supervise the depth prediction. Extensive experiments demonstrate that
MonoMVSNet achieves state-of-the-art performance on the DTU and
Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate
and Advanced benchmarks. The source code is available at
https://github.com/JianfeiJ/MonoMVSNet.
comment: Accepted by ICCV 2025
☆ HANS-Net: Hyperbolic Convolution and Adaptive Temporal Attention for Accurate and Generalizable Liver and Tumor Segmentation in CT Imaging
Arefin Ittesafun Abian, Ripon Kumar Debnath, Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Md Rafiqul Islam, Asif Karim, Reem E. Mohamed, Sami Azam
Accurate liver and tumor segmentation on abdominal CT images is critical for
reliable diagnosis and treatment planning, but remains challenging due to
complex anatomical structures, variability in tumor appearance, and limited
annotated data. To address these issues, we introduce Hyperbolic-convolutions
Adaptive-temporal-attention with Neural-representation and Synaptic-plasticity
Network (HANS-Net), a novel segmentation framework that synergistically
combines hyperbolic convolutions for hierarchical geometric representation, a
wavelet-inspired decomposition module for multi-scale texture learning, a
biologically motivated synaptic plasticity mechanism for adaptive feature
enhancement, and an implicit neural representation branch to model fine-grained
and continuous anatomical boundaries. Additionally, we incorporate
uncertainty-aware Monte Carlo dropout to quantify prediction confidence and
lightweight temporal attention to improve inter-slice consistency without
sacrificing efficiency. Extensive evaluations on the LiTS dataset demonstrate
that HANS-Net achieves a mean Dice score of 93.26%, an IoU of 88.09%, an
average symmetric surface distance (ASSD) of 0.72 mm, and a volume overlap
error (VOE) of 11.91%. Furthermore, cross-dataset validation on the
3D-IRCADb-01 dataset obtains an average Dice of 87.45%, IoU of 80.30%, ASSD of
1.525 mm, and VOE of 19.71%, indicating strong generalization across different
datasets. These results confirm the effectiveness and robustness of HANS-Net in
providing anatomically consistent, accurate, and confident liver and tumor
segmentation.
comment: 10 figures. Will be submitted to IEEE Transactions on Radiation and
Plasma Medical Sciences
☆ A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction
Recently, Gaussian Splatting (GS) has received a lot of attention in surface
reconstruction. However, although real-world 3D objects can have complex and
diverse shapes, existing GS-based methods use only a single type of splatting
primitive (Gaussian ellipse or Gaussian ellipsoid) to represent object surfaces
during reconstruction. In this paper, we highlight that a single primitive type
can be insufficient for representing object surfaces in high quality.
Thus, we propose a novel framework that, for the first time, enables Gaussian
Splatting to incorporate multiple types of (geometrical) primitives during its
surface reconstruction process. Specifically, in our framework, we first
propose a compositional splatting strategy, enabling the splatting and
rendering of different types of primitives in the Gaussian Splatting pipeline.
In addition, we also design our framework with a mixed-primitive-based
initialization strategy and a vertex pruning mechanism so that its surface
representation learning process can be executed effectively while leveraging
different types of primitives. Extensive experiments show the efficacy of our
framework and its accurate surface reconstruction performance.
☆ All Eyes, no IMU: Learning Flight Attitude from Vision Alone
Vision is an essential part of attitude control for many flying animals, some
of which have no dedicated sense of gravity. Flying robots, on the other hand,
typically depend heavily on accelerometers and gyroscopes for attitude
stabilization. In this work, we present the first vision-only approach to
flight control for use in generic environments. We show that a quadrotor drone
equipped with a downward-facing event camera can estimate its attitude and
rotation rate from just the event stream, enabling flight control without
inertial sensors. Our approach uses a small recurrent convolutional neural
network trained through supervised learning. Real-world flight tests
demonstrate that our combination of event camera and low-latency neural network
is capable of replacing the inertial measurement unit in a traditional flight
control loop. Furthermore, we investigate the network's generalization across
different environments, and the impact of memory and different fields of view.
While networks with memory and access to horizon-like visual cues achieve best
performance, variants with a narrower field of view achieve better relative
generalization. Our work showcases vision-only flight control as a promising
candidate for enabling autonomous, insect-scale flying robots.
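The sketch below shows the kind of small recurrent convolutional network the abstract describes, assuming a CNN frame encoder, a GRU memory, and a five-dimensional attitude-and-rate output head; the layer sizes and output split are illustrative assumptions, not the authors' network.

```python
import torch
import torch.nn as nn

class EventAttitudeNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(                    # per-frame feature extractor
            nn.Conv2d(2, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.gru = nn.GRU(32, hidden, batch_first=True)  # memory across the event stream
        self.head = nn.Linear(hidden, 5)                 # roll, pitch, and 3 rotation rates

    def forward(self, frames, h=None):
        # frames: (batch, time, 2, H, W) with two event-polarity channels
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, h = self.gru(feats, h)
        return self.head(out), h                         # per-time-step predictions + state

net = EventAttitudeNet()
pred, state = net(torch.randn(4, 10, 2, 64, 64))         # 4 clips of 10 event frames
print(pred.shape)  # torch.Size([4, 10, 5])
```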
☆ Detection and Quantification of Fluvial Erosion with Computer Vision (Detección y Cuantificación de Erosión Fluvial con Visión Artificial)
Fluvial erosion is a natural process that can generate significant impacts on
soil stability and strategic infrastructures. The detection and monitoring of
this phenomenon is traditionally addressed by photogrammetric methods and
analysis in geographic information systems. These tasks require specific
knowledge and intensive manual processing. This study proposes an artificial
intelligence-based approach for automatic identification of eroded zones and
estimation of their area. The state-of-the-art computer vision model YOLOv11,
adjusted by fine-tuning and trained with photographs and LiDAR images, is used.
This combined dataset was segmented and labeled using the Roboflow platform.
Experimental results indicate efficient detection of erosion patterns with an
accuracy of 70%, precise identification of eroded areas and reliable
calculation of their extent in pixels and square meters. As a final product,
the EROSCAN system has been developed, an interactive web application that
allows users to upload images and obtain automatic segmentations of fluvial
erosion, together with the estimated area. This tool optimizes the detection
and quantification of the phenomenon, facilitating decision making in risk
management and territorial planning.
comment: 18 pages, in Spanish language, 13 figures, 4 tables
☆ 3D Magnetic Inverse Routine for Single-Segment Magnetic Field Images
In semiconductor packaging, accurately recovering 3D information is crucial
for non-destructive testing (NDT) to localize circuit defects. This paper
presents a novel approach called the 3D Magnetic Inverse Routine (3D MIR),
which leverages Magnetic Field Images (MFI) to retrieve the parameters for the
3D current flow of a single-segment. The 3D MIR integrates a deep learning
(DL)-based Convolutional Neural Network (CNN), spatial-physics-based
constraints, and optimization techniques. The method operates in three stages:
i) The CNN model processes the MFI data to predict the ratio $\ell/z_o$, where
$\ell$ is the wire length and $z_o$ is the wire's vertical depth beneath the
magnetic sensors, and to classify the segment type ($c$). ii) By leveraging
spatial-physics-based constraints, the routine provides initial estimates for
the position ($x_o$, $y_o$, $z_o$), length ($\ell$), current ($I$), and current
flow direction (positive or negative) of the current segment. iii) An optimizer
then adjusts these five parameters ($x_o$, $y_o$, $z_o$, $\ell$, $I$) to
minimize the difference between the reconstructed MFI and the actual MFI. The
results demonstrate that the 3D MIR method accurately recovers 3D information
with high precision, setting a new benchmark for magnetic image reconstruction
in semiconductor packaging. This method highlights the potential of combining
DL and physics-driven optimization in practical applications.
comment: copyright 2025 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works
☆ Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers ICCV 2025
In this paper, we study task-oriented human grasp synthesis, a new grasp
synthesis task that demands both task and context awareness. At the core of our
method is the task-aware contact maps. Unlike traditional contact maps that
only reason about the manipulated object and its relation with the hand, our
enhanced maps take into account scene and task information. This comprehensive
map is critical for hand-object interaction, enabling accurate grasping poses
that align with the task. We propose a two-stage pipeline that first constructs
a task-aware contact map informed by the scene and task. In the subsequent
stage, we use this contact map to synthesize task-oriented human grasps. We
introduce a new dataset and a metric for the proposed task to evaluate our
approach. Our experiments validate the importance of modeling both scene and
task, demonstrating significant improvements over existing methods in both
grasp quality and task performance. See our project page for more details:
https://hcis-lab.github.io/TOHGS/
comment: Accepted by ICCV 2025
☆ Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping
Observer bias and inconsistencies in traditional plant phenotyping methods
limit the accuracy and reproducibility of fine-grained plant analysis. To
overcome these challenges, we developed TomatoMAP, a comprehensive dataset for
Solanum lycopersicum using an Internet of Things (IoT) based imaging system
with standardized data acquisition protocols. Our dataset contains 64,464 RGB
images that capture 12 different plant poses from four camera elevation angles.
Each image includes manually annotated bounding boxes for seven regions of
interest (ROIs), including leaves, panicle, batch of flowers, batch of fruits,
axillary shoot, shoot and whole plant area, along with 50 fine-grained growth
stage classifications based on the BBCH scale. Additionally, we provide a
subset of 3,616 high-resolution images with pixel-wise semantic and instance segmentation
annotations for fine-grained phenotyping. We validated our dataset using a
cascading deep learning framework combining MobileNetv3 for
classification, YOLOv11 for object detection, and MaskRCNN for segmentation.
Through AI vs. Human analysis involving five domain experts, we demonstrate
that the models trained on our dataset achieve accuracy and speed comparable to
the experts. Cohen's Kappa and inter-rater agreement heatmap confirm the
reliability of automated fine-grained phenotyping using our approach.
☆ YOLOatr : Deep Learning Based Automatic Target Detection and Localization in Thermal Infrared Imagery
Automatic Target Detection (ATD) and Recognition (ATR) from Thermal Infrared
(TI) imagery in the defense and surveillance domain is a challenging computer
vision (CV) task in comparison to the commercial autonomous vehicle perception
domain. Limited datasets, peculiar domain-specific and TI modality-specific
challenges, i.e., limited hardware, scale invariance issues due to greater
distances, deliberate occlusion by tactical vehicles, lower sensor resolution
and resultant lack of structural information in targets, effects of weather,
temperature, and time of day variations, and varying target to clutter ratios
all result in increased intra-class variability and higher inter-class
similarity, making accurate real-time ATR a challenging CV task. As a result,
contemporary state-of-the-art (SOTA) deep learning architectures underperform
in the ATR domain. We propose a modified anchor-based single-stage detector,
called YOLOatr, based on a modified YOLOv5s, with optimal modifications to the
detection heads, feature fusion in the neck, and a custom augmentation profile.
We evaluate the performance of our proposed model on a comprehensive DSIAC MWIR
dataset for real-time ATR over both correlated and decorrelated testing
protocols. The results demonstrate that our proposed model achieves
state-of-the-art ATR performance of up to 99.6%.
comment: Published in 25th Irish Machine Vision and Image Processing Conf.,
Galway, Ireland, Aug 30-Sep 1 2023. Also available at
https://doi.org/10.5281/zenodo.8264062
☆ ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition ICCV 2025
3D visual grounding aims to identify and localize objects in a 3D space based
on textual descriptions. However, existing methods struggle with disentangling
targets from anchors in complex multi-anchor queries and resolving
inconsistencies in spatial descriptions caused by perspective variations. To
tackle these challenges, we propose ViewSRD, a framework that formulates 3D
visual grounding as a structured multi-view decomposition process. First, the
Simple Relation Decoupling (SRD) module restructures complex multi-anchor
queries into a set of targeted single-anchor statements, generating a
structured set of perspective-aware descriptions that clarify positional
relationships. These decomposed representations serve as the foundation for the
Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates
textual and scene features across multiple viewpoints using shared, Cross-modal
Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a
Textual-Scene Reasoning module synthesizes multi-view predictions into a
unified and robust 3D visual grounding. Experiments on 3D visual grounding
datasets show that ViewSRD significantly outperforms state-of-the-art methods,
particularly in complex queries requiring precise spatial differentiation.
comment: Accepted by ICCV 2025
☆ MFGDiffusion: Mask-Guided Smoke Synthesis for Enhanced Forest Fire Detection
Smoke is the first visible indicator of a wildfire. With the advancement of
deep learning, image-based smoke detection has become a crucial method for
detecting and preventing forest fires. However, the scarcity of smoke image
data from forest fires is one of the significant factors hindering the
detection of forest fire smoke. Image generation models offer a promising
solution for synthesizing realistic smoke images. However, current inpainting
models exhibit limitations in generating high-quality smoke representations,
particularly manifesting as inconsistencies between synthesized smoke and
background contexts. To solve these problems, we proposed a comprehensive
framework for generating forest fire smoke images. Firstly, we employed a
pre-trained segmentation model and a multimodal model to obtain smoke masks
and image captions. Then, to address the insufficient utilization of masks and
masked images by inpainting models, we introduced a network architecture guided
by mask and masked image features. We also proposed a new loss function, the
mask random difference loss, which enhances the consistency of the generated
effects around the mask by randomly expanding and eroding the mask
edges. Finally, to generate a smoke image dataset using random masks for
subsequent detection tasks, we incorporated smoke characteristics and used a
multimodal large language model as a filtering tool to select diverse and
reasonable smoke images, thereby improving the quality of the synthetic
dataset. Experiments showed that our generated smoke images are realistic and
diverse, and effectively enhance the performance of forest fire smoke detection
models. Code is available at https://github.com/wghr123/MFGDiffusion.
comment: 18 pages, 11 figures
☆ Fairness-Aware Grouping for Continuous Sensitive Variables: Application for Debiasing Face Analysis with respect to Skin Tone
Within a legal framework, fairness in datasets and models is typically
assessed by dividing observations into predefined groups and then computing
fairness measures (e.g., Disparate Impact or Equality of Odds with respect to
gender). However, when sensitive attributes such as skin color are continuous,
dividing into default groups may overlook or obscure the discrimination
experienced by certain minority subpopulations. To address this limitation, we
propose a fairness-based grouping approach for continuous (possibly
multidimensional) sensitive attributes. By grouping data according to observed
levels of discrimination, our method identifies the partition that maximizes a
novel criterion based on inter-group variance in discrimination, thereby
isolating the most critical subgroups.
We validate the proposed approach using multiple synthetic datasets and
demonstrate its robustness under changing population distributions, revealing
how discrimination is manifested within the space of sensitive attributes.
Furthermore, we examine a specialized setting of monotonic fairness for the
case of skin color. Our empirical results on both CelebA and FFHQ, leveraging
the skin tone as predicted by an industrial proprietary algorithm, show that
the proposed segmentation uncovers more nuanced patterns of discrimination than
previously reported, and that these findings remain stable across datasets for
a given model. Finally, we leverage our grouping model for debiasing purposes,
aiming to predict fair scores via group-by-group post-processing. The
results demonstrate that our approach improves fairness while having minimal
impact on accuracy, thus confirming our partition method and opening the door
for industrial deployment.
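A heavily simplified, one-dimensional sketch of the grouping idea: estimate a discrimination level along the continuous attribute and pick the split that maximizes the between-group variance of that signal. The paper's exact criterion, the multidimensional case, and the skin-tone predictor are not reproduced; the data and bin counts below are synthetic assumptions.

```python
import numpy as np

def per_bin_discrimination(attr, y_pred, n_bins=20):
    """Discrimination per attribute bin: gap between the bin's and the overall positive rate."""
    edges = np.quantile(attr, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, attr, side="right") - 1, 0, n_bins - 1)
    overall = y_pred.mean()
    disc = np.array([y_pred[bins == b].mean() - overall for b in range(n_bins)])
    weights = np.array([(bins == b).sum() for b in range(n_bins)])
    return disc, weights

def best_split(disc, weights):
    """Scan one split point maximizing the weighted between-group variance of discrimination."""
    total_mean = np.average(disc, weights=weights)
    best, best_var = None, -np.inf
    for s in range(1, len(disc)):
        groups = [(disc[:s], weights[:s]), (disc[s:], weights[s:])]
        var = sum(w.sum() * (np.average(d, weights=w) - total_mean) ** 2
                  for d, w in groups) / weights.sum()
        if var > best_var:
            best, best_var = s, var
    return best, best_var

rng = np.random.default_rng(0)
skin_tone = rng.uniform(0, 1, 5000)                      # continuous sensitive attribute
y_pred = (rng.uniform(0, 1, 5000) < 0.7 - 0.3 * (skin_tone > 0.6)).astype(float)
disc, w = per_bin_discrimination(skin_tone, y_pred)
print(best_split(disc, w))                               # split near bin 12 (attribute ~0.6)
```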
☆ NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models
With the rapid development of foundation video generation technologies, long
video generation models have exhibited promising research potential thanks to
expanded content creation space. Recent studies reveal that the goal of long
video generation tasks is not only to extend video duration but also to
accurately express richer narrative content within longer videos. However, due
to the lack of evaluation benchmarks specifically designed for long video
generation models, the current assessment of these models primarily relies on
benchmarks with simple narrative prompts (e.g., VBench). To the best of our
knowledge, our proposed NarrLV is the first benchmark to comprehensively
evaluate the Narrative expression capabilities of Long Video generation models.
Inspired by film narrative theory, (i) we first introduce the basic narrative
unit maintaining continuous visual presentation in videos as Temporal Narrative
Atom (TNA), and use its count to quantitatively measure narrative richness.
Guided by three key film narrative elements influencing TNA changes, we
construct an automatic prompt generation pipeline capable of producing
evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based
on the three progressive levels of narrative content expression, we design an
effective evaluation metric using the MLLM-based question generation and
answering framework. (iii) Finally, we conduct extensive evaluations on
existing long video generation models and the foundation generation models.
Experimental results demonstrate that our metric aligns closely with human
judgments. The derived evaluation outcomes reveal the detailed capability
boundaries of current video generation models in narrative content expression.
comment: Project Page: https://amap-ml.github.io/NarrLV-Website/
☆ A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition
Xinkui Zhao, Jinsong Shu, Yangyang Wu, Guanjie Cheng, Zihe Liu, Naibo Wang, Shuiguang Deng, Zhongle Xie, Jianwei Yin
Multimodal Emotion Recognition (MER) often encounters incomplete
multimodality in practical applications due to sensor failures or privacy
protection requirements. While existing methods attempt to address various
incomplete multimodal scenarios by balancing the training of each modality
combination through additional gradients, these approaches face a critical
limitation: training gradients from different modality combinations conflict
with each other, ultimately degrading the performance of the final prediction
model. In this paper, we propose a unimodal decoupled dynamic low-rank
adaptation method based on modality combinations, named MCULoRA, which is a
novel framework for the parameter-efficient training of incomplete multimodal
learning models. MCULoRA consists of two key modules, modality combination
aware low-rank adaptation (MCLA) and dynamic parameter fine-tuning (DPFT). The
MCLA module effectively decouples the shared information from the distinct
characteristics of individual modality combinations. The DPFT module adjusts
the training ratio of modality combinations based on the separability of each
modality's representation space, optimizing the learning efficiency across
different modality combinations. Our extensive experimental evaluation in
multiple benchmark datasets demonstrates that MCULoRA substantially outperforms
previous incomplete multimodal learning approaches in downstream task accuracy.
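To ground the low-rank adaptation building block, the sketch below shows a generic LoRA linear layer (frozen base weights plus a trainable low-rank update); MCULoRA's modality-combination-aware decoupling (MCLA) and dynamic fine-tuning ratios (DPFT) sit on top of this primitive and are not reproduced here, and the rank and sizes are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        # frozen path plus scaled low-rank update B @ A applied to the input
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(layer(torch.randn(4, 512)).shape, trainable)        # ~8k trainable vs ~262k frozen weights
```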
☆ How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study
Vision-Language Models (VLMs) trained on web-scale corpora excel at natural
image tasks and are increasingly repurposed for healthcare; however, their
competence in medical tasks remains underexplored. We present a comprehensive
evaluation of open-source general-purpose and medically specialised VLMs,
ranging from 3B to 72B parameters, across eight benchmarks: MedXpert,
OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To observe model
performance across different aspects, we first separate it into understanding
and reasoning components. Three salient findings emerge. First, large
general-purpose models already match or surpass medical-specific counterparts
on several benchmarks, demonstrating strong zero-shot transfer from natural to
medical images. Second, reasoning performance is consistently lower than
understanding, highlighting a critical barrier to safe decision support. Third,
performance varies widely across benchmarks, reflecting differences in task
design, annotation quality, and knowledge demands. No model yet reaches the
reliability threshold for clinical deployment, underscoring the need for
stronger multimodal alignment and more rigorous, fine-grained evaluation
protocols.
comment: Accepted by the International Conference on AI in Healthcare 2025
☆ Clustering-Guided Multi-Layer Contrastive Representation Learning for Citrus Disease Classification
Citrus, as one of the most economically important fruit crops globally,
suffers severe yield depressions due to various diseases. Accurate disease
detection and classification serve as critical prerequisites for implementing
targeted control measures. Recent advancements in artificial intelligence,
particularly deep learning-based computer vision algorithms, have substantially
decreased time and labor requirements while maintaining the accuracy of
detection and classification. Nevertheless, these methods predominantly rely on
massive, high-quality annotated training examples to attain promising
performance. By introducing two key designs: contrasting with cluster centroids
and a multi-layer contrastive training (MCT) paradigm, this paper proposes a
novel clustering-guided self-supervised multi-layer contrastive representation
learning (CMCRL) algorithm. The proposed method demonstrates several advantages
over existing counterparts: (1) optimizing with massive unannotated samples;
(2) effective adaptation to the symptom similarity across distinct citrus
diseases; (3) hierarchical feature representation learning. The proposed method
achieves state-of-the-art performance on the public citrus image set CDD,
outperforming existing methods by 4.5%-30.1% accuracy. Remarkably, our method
narrows the performance gap with fully supervised counterparts (all samples are
labeled). Beyond classification accuracy, our method shows great performance on
other evaluation metrics (F1 score, precision, and recall), highlighting the
robustness against the class imbalance challenge.
comment: 11 pages, 5 figures
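The sketch below illustrates "contrasting with cluster centroids" in its generic form: k-means prototypes computed from unlabeled features and a temperature-scaled cross-entropy pulling each feature toward its own centroid. Only a single level is shown; the paper's multi-layer contrastive training (MCT) and all hyperparameters here are assumptions.

```python
import torch
import torch.nn.functional as F

def kmeans(feats, k, iters=10):
    """Plain k-means over feature vectors; returns centroids and hard assignments."""
    centroids = feats[torch.randperm(len(feats))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(feats, centroids).argmin(dim=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = feats[assign == c].mean(dim=0)
    return centroids, assign

def centroid_contrastive_loss(feats, centroids, assign, temperature=0.1):
    """Pull each feature toward its own centroid, push it away from the others."""
    feats = F.normalize(feats, dim=1)
    centroids = F.normalize(centroids, dim=1)
    logits = feats @ centroids.T / temperature            # similarity to every centroid
    return F.cross_entropy(logits, assign)                # target = own cluster index

feats = torch.randn(256, 128)                             # unlabeled citrus-image embeddings
centroids, assign = kmeans(feats, k=10)
print(centroid_contrastive_loss(feats, centroids, assign).item())
```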
☆ Assessing Color Vision Test in Large Vision-language Models
With the widespread adoption of large vision-language models, the capacity
for color vision in these models is crucial. However, the color vision
abilities of large visual-language models have not yet been thoroughly
explored. To address this gap, we define a color vision testing task for large
vision-language models and construct a dataset (an anonymous GitHub repository
showing some of the data is available at
https://anonymous.4open.science/r/color-vision-test-dataset-3BCD) that covers
multiple categories of test questions and tasks of varying difficulty levels.
Furthermore, we analyze the types of errors made by large vision-language
models and propose fine-tuning strategies to enhance their performance in color
vision tests.
☆ Latent Space Consistency for Sparse-View CT Reconstruction
Computed Tomography (CT) is a widely utilized imaging modality in clinical
settings. Using densely acquired rotational X-ray arrays, CT can capture 3D
spatial features. However, it is confronted with challenges such as significant
time consumption and high radiation exposure. CT reconstruction methods based
on sparse-view X-ray images have garnered substantial attention from
researchers as they present a means to mitigate costs and risks. In recent
years, diffusion models, particularly the Latent Diffusion Model (LDM), have
demonstrated promising potential in the domain of 3D CT reconstruction.
Nonetheless, due to the substantial differences between the 2D latent
representation of X-ray modalities and the 3D latent representation of CT
modalities, the vanilla LDM is incapable of achieving effective alignment
within the latent space. To address this issue, we propose the Consistent
Latent Space Diffusion Model (CLS-DM), which incorporates cross-modal feature
contrastive learning to efficiently extract latent 3D information from 2D X-ray
images and achieve latent space alignment between modalities. Experimental
results indicate that CLS-DM outperforms classical and state-of-the-art
generative models in terms of standard voxel-level metrics (PSNR, SSIM) on the
LIDC-IDRI and CTSpine1K datasets. This methodology not only aids in enhancing
the effectiveness and economic viability of sparse X-ray reconstructed CT but
can also be generalized to other cross-modal transformation tasks, such as
text-to-image synthesis. We have made our code publicly available at
https://anonymous.4open.science/r/CLS-DM-50D6/ to facilitate further research
and applications in other domains.
comment: ACMMM2025 Accepted
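The sketch below shows a generic symmetric InfoNCE objective for aligning paired 2D X-ray latents with 3D CT latents in a shared space, which is the kind of cross-modal contrastive learning the abstract refers to; the encoders, latent size, and temperature are assumed, not taken from CLS-DM.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(z_xray, z_ct, temperature=0.07):
    """Symmetric InfoNCE: matched X-ray/CT latent pairs sit on the diagonal."""
    z_xray = F.normalize(z_xray, dim=1)
    z_ct = F.normalize(z_ct, dim=1)
    logits = z_xray @ z_ct.T / temperature                # (B, B) cross-modal similarities
    targets = torch.arange(len(z_xray))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy latents standing in for pooled outputs of a 2D X-ray encoder and a 3D CT encoder.
z_xray = torch.randn(8, 256)
z_ct = torch.randn(8, 256)
print(symmetric_info_nce(z_xray, z_ct).item())
```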
☆ RMAU-NET: A Residual-Multihead-Attention U-Net Architecture for Landslide Segmentation and Detection from Remote Sensing Images
Lam Pham, Cam Le, Hieu Tang, Khang Truong, Truong Nguyen, Jasmin Lampert, Alexander Schindler, Martin Boyer, Son Phan
In recent years, landslide disasters have been reported frequently due to
extreme weather events such as droughts, floods, and storms, or as a
consequence of human activities such as deforestation and excessive
exploitation of natural resources. However, automatically observing landslides
is challenging due to the extremely large observation area and rugged
topography such as mountains or highlands. This motivates us to propose an
end-to-end deep-learning-based model that explores remote sensing images for
automatically observing landslide events. By using remote sensing images as the
input data, we can rely on a freely available resource and observe large, rough
terrains over time. To explore the remote sensing images, we propose a novel
neural network architecture for the two tasks of landslide detection and
landslide segmentation. We evaluated our proposed model on three different
benchmark datasets: LandSlide4Sense, Bijie, and Nepal. By conducting extensive
experiments, we achieve F1 scores of 98.23 and 93.83 for the landslide
detection task on the LandSlide4Sense and Bijie datasets, and mIoU scores of
63.74 and 76.88 for the segmentation task on the LandSlide4Sense and Nepal
datasets. These experimental results demonstrate the potential of integrating
our proposed model into real-life landslide observation systems.
☆ MMOne: Representing Multiple Modalities in One Scene ICCV 2025
Humans perceive the world through multimodal cues to understand and interact
with the environment. Learning a scene representation for multiple modalities
enhances comprehension of the physical world. However, modality conflicts,
arising from inherent distinctions among different modalities, present two
critical challenges: property disparity and granularity disparity. To address
these challenges, we propose a general framework, MMOne, to represent multiple
modalities in one scene, which can be readily extended to additional
modalities. Specifically, a modality modeling module with a novel modality
indicator is proposed to capture the unique properties of each modality.
Additionally, we design a multimodal decomposition mechanism to separate
multi-modal Gaussians into single-modal Gaussians based on modality
differences. We address the essential distinctions among modalities by
disentangling multimodal information into shared and modality-specific
components, resulting in a more compact and efficient multimodal scene
representation. Extensive experiments demonstrate that our method consistently
enhances the representation capability for each modality and is scalable to
additional modalities. The code is available at
https://github.com/Neal2020GitHub/MMOne.
comment: Accepted to ICCV 2025
☆ Try Harder: Hard Sample Generation and Learning for Clothes-Changing Person Re-ID
Hard samples pose a significant challenge in person re-identification (ReID)
tasks, particularly in clothing-changing person Re-ID (CC-ReID). Their inherent
ambiguity or similarity, coupled with the lack of explicit definitions, makes
them a fundamental bottleneck. These issues not only limit the design of
targeted learning strategies but also diminish the model's robustness under
clothing or viewpoint changes. In this paper, we propose a novel
multimodal-guided Hard Sample Generation and Learning (HSGL) framework, which
is the first effort to unify textual and visual modalities to explicitly
define, generate, and optimize hard samples within a unified paradigm. HSGL
comprises two core components: (1) Dual-Granularity Hard Sample Generation
(DGHSG), which leverages multimodal cues to synthesize semantically consistent
samples, including both coarse- and fine-grained hard positives and negatives
for effectively increasing the hardness and diversity of the training data. (2)
Hard Sample Adaptive Learning (HSAL), which introduces a hardness-aware
optimization strategy that adjusts feature distances based on textual semantic
labels, encouraging the separation of hard positives and drawing hard negatives
closer in the embedding space to enhance the model's discriminative capability
and robustness to hard samples. Extensive experiments on multiple CC-ReID
benchmarks demonstrate the effectiveness of our approach and highlight the
potential of multimodal-guided hard sample generation and learning for robust
CC-ReID. Notably, HSAL significantly accelerates the convergence of the
targeted learning procedure and achieves state-of-the-art performance on both
PRCC and LTCC datasets. The code is available at
https://github.com/undooo/TryHarder-ACMMM25.
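As a rough illustration of hardness-aware optimization, the sketch below uses a triplet-style loss whose margin scales with a per-sample hardness weight; in HSAL that weight would come from textual semantic labels, whereas here it is simply a given scalar, so this is an assumption-laden stand-in rather than the paper's loss.

```python
import torch
import torch.nn.functional as F

def hardness_aware_triplet(anchor, positive, negative, hardness, base_margin=0.3):
    """Triplet loss whose margin grows with a per-sample hardness weight."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    margin = base_margin * (1.0 + hardness)        # harder samples demand a larger gap
    return F.relu(d_pos - d_neg + margin).mean()

anchor, pos, neg = (torch.randn(16, 256) for _ in range(3))
hardness = torch.rand(16)                          # e.g., a clothing-change severity score
print(hardness_aware_triplet(anchor, pos, neg, hardness).item())
```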
☆ Jellyfish Species Identification: A CNN Based Artificial Neural Network Approach
Jellyfish, a diverse group of gelatinous marine organisms, play a crucial
role in maintaining marine ecosystems but pose significant challenges for
biodiversity and conservation due to their rapid proliferation and ecological
impact. Accurate identification of jellyfish species is essential for
ecological monitoring and management. In this study, we proposed a deep
learning framework for jellyfish species detection and classification using an
underwater image dataset. The framework integrates advanced feature extraction
techniques, including MobileNetV3, ResNet50, EfficientNetV2-B0, and VGG16,
combined with seven traditional machine learning classifiers and three
Feedforward Neural Network classifiers for precise species identification.
Additionally, we used a softmax output layer to directly classify jellyfish
species with the convolutional neural network models. The combination of the
Artificial Neural Network with MobileNetV3 is our best-performing model,
achieving an exceptional accuracy of 98%, significantly outperforming other
feature extractor-classifier combinations. This study demonstrates the efficacy
of deep learning and hybrid frameworks in addressing biodiversity challenges
and advancing species detection in marine environments.
comment: This paper has been accepted at the IEEE QPAIN 2025. The final
version will be available in the IEEE Xplore Digital Library
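As a rough illustration of the feature-extractor-plus-classifier pipeline
described above, the following PyTorch sketch freezes a MobileNetV3 backbone
and attaches a small feedforward head with a softmax output; the class count,
layer sizes, and the use of torchvision pretrained weights are assumptions, not
the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_SPECIES = 6  # hypothetical class count

backbone = models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.DEFAULT)
backbone.classifier = nn.Identity()          # expose the 960-d pooled features
for p in backbone.parameters():
    p.requires_grad = False                  # use the CNN purely as a feature extractor

head = nn.Sequential(                        # small feedforward classifier
    nn.Linear(960, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, NUM_SPECIES),
)

def predict(images):
    with torch.no_grad():
        feats = backbone(images)             # (B, 960) deep features
    return torch.softmax(head(feats), dim=1) # species probabilities

probs = predict(torch.randn(4, 3, 224, 224))
```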
☆ KptLLM++: Towards Generic Keypoint Comprehension with Large Language Model
The emergence of Multimodal Large Language Models (MLLMs) has revolutionized
image understanding by bridging textual and visual modalities. However, these
models often struggle with capturing fine-grained semantic information, such as
the precise identification and analysis of object keypoints. Keypoints, as
structure-aware, pixel-level, and compact representations of objects,
particularly articulated ones, play a crucial role in applications such as
fine-grained image analysis, object retrieval, and behavior recognition. In
this paper, we propose KptLLM++, a novel multimodal large language model
specifically designed for generic keypoint comprehension through the
integration of diverse input modalities guided by user-defined instructions. By
unifying keypoint detection across varied contexts, KptLLM++ establishes itself
as an advanced interface, fostering more effective human-AI collaboration. The
model is built upon a novel identify-then-detect paradigm, which first
interprets keypoint semantics and subsequently localizes their precise
positions through a structured chain-of-thought reasoning mechanism. To push
the boundaries of performance, we have scaled up the training dataset to over
500K samples, encompassing diverse objects, keypoint categories, image styles,
and scenarios with complex occlusions. This extensive scaling enables KptLLM++
to unlock its potential, achieving remarkable accuracy and generalization.
Comprehensive experiments on multiple keypoint detection benchmarks demonstrate
its state-of-the-art performance, underscoring its potential as a unified
solution for fine-grained image understanding and its transformative
implications for human-AI interaction.
comment: Extended Version of KptLLM. arXiv admin note: text overlap with
arXiv:2411.01846
☆ A Survey on Interpretability in Visual Recognition
In recent years, visual recognition methods have advanced significantly,
finding applications across diverse fields. While researchers seek to
understand the mechanisms behind the success of these models, there is also a
growing impetus to deploy them in critical areas like autonomous driving and
medical diagnostics, where failures must be diagnosed reliably; both pressures
promote the development of interpretability research. This paper systematically
reviews existing
research on the interpretability of visual recognition models and proposes a
taxonomy of methods from a human-centered perspective. The proposed taxonomy
categorizes interpretable recognition methods based on Intent, Object,
Presentation, and Methodology, thereby establishing a systematic and coherent
set of grouping criteria for these XAI methods. Additionally, we summarize the
requirements for evaluation metrics and explore new opportunities enabled by
recent technologies, such as large multimodal models. We aim to organize
existing research in this domain and inspire future investigations into the
interpretability of visual recognition models.
comment: 20 pages, 7 figures, 2 tables. Under review
☆ Atmos-Bench: 3D Atmospheric Structures for Climate Insight
Atmospheric structure, represented by backscatter coefficients (BC) recovered
from satellite LiDAR attenuated backscatter (ATB), provides a volumetric view
of clouds, aerosols, and molecules, playing a critical role in human
activities, climate understanding, and extreme weather forecasting. Existing
methods often rely on auxiliary inputs and simplified physics-based
approximations, which may introduce additional uncertainties and insufficiently
capture realistic radiative transfer and atmospheric scattering-absorption
effects; moreover, the field lacks a standardized 3D benchmark for fair
evaluation. To bridge these gaps, we present Atmos-Bench:
the first 3D atmospheric benchmark, along with a novel FourCastX:
Frequency-enhanced Spatio-Temporal Mixture-of-Experts Network that (a)
generates 921,600 image slices from 3D scattering volumes simulated at 532 nm
and 355 nm by coupling WRF with an enhanced COSP simulator over 384 land-ocean
time steps, yielding high-quality voxel-wise references; (b) embeds ATB-BC
physical constraints into the model architecture, promoting energy consistency
during restoration; (c) achieves consistent improvements on the Atmos-Bench
dataset across both 355 nm and 532 nm bands, outperforming state-of-the-art
baseline models without relying on auxiliary inputs. Atmos-Bench establishes a
new standard for satellite-based 3D atmospheric structure recovery and paves
the way for deeper climate insight.
☆ Automatic Road Subsurface Distress Recognition from Ground Penetrating Radar Images using Deep Learning-based Cross-verification
Ground penetrating radar (GPR) has become a rapid and non-destructive
solution for road subsurface distress (RSD) detection. However, RSD recognition
from GPR images is labor-intensive and heavily relies on inspectors' expertise.
Deep learning offers the possibility for automatic RSD recognition, but its
current performance is limited by two factors: the scarcity of high-quality
datasets for network training and the insufficient capability of networks to
distinguish RSD. In this study, a rigorously validated 3D GPR dataset containing
2134 samples of diverse types was constructed through field scanning. Based on
the finding that a YOLO model trained on each of the three GPR scans exhibits
varying sensitivity to specific types of RSD, we propose a novel
cross-verification strategy with outstanding accuracy in RSD recognition,
achieving recall over 98.6% in field tests. The approach, integrated into an
online RSD detection system, can reduce the labor of inspection by around 90%.
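The cross-verification strategy is not detailed in the abstract; the sketch
below shows one simple interpretation, where a detection from one scan is kept
only if a detector on another scan places a spatially overlapping box. The IoU
threshold, vote count, and box format are illustrative assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def cross_verify(detections_per_scan, iou_thr=0.3, min_votes=2):
    """Keep a box only if detectors on >= min_votes scans agree spatially.

    detections_per_scan: list (one per scan) of lists of boxes.
    This is a simplified voting rule, not the paper's exact strategy.
    """
    verified = []
    for s, boxes in enumerate(detections_per_scan):
        for box in boxes:
            votes = 1 + sum(
                any(iou(box, other) >= iou_thr for other in detections_per_scan[t])
                for t in range(len(detections_per_scan)) if t != s
            )
            if votes >= min_votes:
                verified.append(box)
    return np.array(verified)
```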
☆ GKNet: Graph-based Keypoints Network for Monocular Pose Estimation of Non-cooperative Spacecraft
Monocular pose estimation of non-cooperative spacecraft is significant for
on-orbit service (OOS) tasks, such as satellite maintenance, space debris
removal, and station assembly. Considering the high demands on pose estimation
accuracy, mainstream monocular pose estimation methods typically consist of
keypoint detectors and a PnP solver. However, current keypoint detectors remain
vulnerable to structural symmetry and partial occlusion of non-cooperative
spacecraft. To this end, we propose a graph-based keypoints network for the
monocular pose estimation of non-cooperative spacecraft, GKNet, which leverages
the geometric constraint of the keypoint graph. In order to better validate
keypoint detectors, we present a moderate-scale dataset for spacecraft keypoint
detection, named SKD, which consists of 3 spacecraft targets, 90,000 simulated
images, and corresponding high-precision keypoint annotations.
Extensive experiments and an ablation study have demonstrated the high accuracy
and effectiveness of our GKNet, compared to the state-of-the-art spacecraft
keypoint detectors. The code for GKNet and the SKD dataset is available at
https://github.com/Dongzhou-1996/GKNet.
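To make the keypoint-detector-plus-PnP pipeline concrete, here is a minimal
OpenCV sketch that recovers a 6-DoF pose from 2D keypoints such as those a
detector like GKNet would output; the 3D model points, 2D detections, and
camera intrinsics are all hypothetical placeholders.

```python
import cv2
import numpy as np

# Hypothetical 3D keypoints of the target spacecraft in its body frame (metres)
object_pts = np.array([[0.5, 0.5, 0.0], [-0.5, 0.5, 0.0], [-0.5, -0.5, 0.0],
                       [0.5, -0.5, 0.0], [0.0, 0.0, 1.2]], dtype=np.float64)
# 2D detections a keypoint network would produce (pixels, illustrative values)
image_pts = np.array([[640.0, 300.0], [420.0, 310.0], [430.0, 520.0],
                      [650.0, 510.0], [535.0, 180.0]], dtype=np.float64)

K = np.array([[1000.0, 0.0, 512.0],   # assumed pinhole intrinsics
              [0.0, 1000.0, 384.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                     # assume no lens distortion

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)             # rotation matrix + translation = 6-DoF pose
print(ok, R.shape, tvec.ravel())
```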
☆ Joint angle model based learning to refine kinematic human pose estimation
Chang Peng, Yifei Zhou, Huifeng Xi, Shiqing Huang, Chuangye Chen, Jianming Yang, Bao Yang, Zhenyu Jiang
Marker-free human pose estimation (HPE) has found increasing applications in
various fields. Current HPE suffers from occasional errors in keypoint
recognition and random fluctuation in keypoint trajectories when analyzing
kinematic human poses. The performance of existing deep learning-based models
for HPE refinement is considerably limited by inaccurate training datasets in
which the keypoints are manually annotated. This paper proposes a novel method
to overcome this difficulty through joint angle-based modeling. The key
techniques include: (i) a joint angle-based model of human pose, which robustly
describes kinematic human poses; (ii) approximating the temporal variation of
joint angles with a high-order Fourier series to obtain a reliable "ground
truth"; and (iii) a bidirectional recurrent network designed as a
post-processing module to refine the estimates of the well-established HRNet.
Trained with the high-quality dataset constructed using our method, the network
demonstrates outstanding performance in correcting wrongly recognized joints
and smoothing their spatiotemporal trajectories. Tests show that joint angle-based
refinement (JAR) outperforms the state-of-the-art HPE refinement network in
challenging cases like figure skating and breaking.
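Step (ii) above, fitting a high-order Fourier series to joint-angle
trajectories to obtain a smooth reference, can be sketched with a simple
least-squares fit in NumPy; the series order, period handling, and the
synthetic trajectory are illustrative assumptions rather than the paper's
settings.

```python
import numpy as np

def fit_fourier(t, angles, order=6, period=None):
    """Least-squares fit of a truncated Fourier series to a joint-angle trajectory.

    Returns a callable that evaluates the smoothed reference angle.
    """
    T = period if period is not None else (t[-1] - t[0])
    w = 2 * np.pi / T
    cols = [np.ones_like(t)]
    for k in range(1, order + 1):
        cols += [np.cos(k * w * t), np.sin(k * w * t)]
    A = np.stack(cols, axis=1)                     # design matrix
    coef, *_ = np.linalg.lstsq(A, angles, rcond=None)
    return lambda tq: np.stack(
        [np.ones_like(tq)] + [f(k * w * tq) for k in range(1, order + 1)
                              for f in (np.cos, np.sin)], axis=1) @ coef

t = np.linspace(0, 2.0, 200)                       # 2 s of motion at 100 Hz
noisy = np.deg2rad(30) * np.sin(2 * np.pi * t) + 0.05 * np.random.randn(t.size)
smooth = fit_fourier(t, noisy)(t)                  # smoothed reference trajectory
```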
☆ LogTinyLLM: Tiny Large Language Models Based Contextual Log Anomaly Detection
Log anomaly detection with traditional rule-based or deep learning-based
methods is often challenging due to the large volume and highly complex nature
of log sequences, so an effective way of detecting anomalous log sequences is
crucial for system maintenance and development. This paper proposes
parameter-efficient fine-tuning approaches, specifically low-rank adaptation
(LoRA) and adapter-based methods, for finding contextual anomalies in log
sequences within large log datasets. It compares different tiny large language
models (LLMs) on the Thunderbird dataset. The results show that LoRA-based
fine-tuning provides substantial performance improvements of 18 to 19
percentage points over the LogBert-based full fine-tuning approach, achieving
accuracy scores between 97.76% and 98.83% compared to 79.37%.
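A minimal sketch of LoRA-based parameter-efficient fine-tuning for
sequence-level log classification is shown below using Hugging Face
transformers and peft; the DistilBERT backbone, target modules, and
hyperparameters are stand-in assumptions and are not the tiny LLMs evaluated in
the paper.

```python
# pip install transformers peft  (exact versions are assumptions)
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "distilbert-base-uncased"          # stand-in for a "tiny" LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_lin", "v_lin"],  # DistilBERT attention projections
                      task_type="SEQ_CLS")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # only the low-rank adapters are updated

batch = tokenizer(["<log line 1> <log line 2> ..."], return_tensors="pt",
                  truncation=True, padding=True)
out = model(**batch)                      # logits over {normal, anomalous}
```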
☆ TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update
Understanding the 3D geometry of transparent objects from RGB images is
challenging due to their inherent physical properties, such as reflection and
refraction. To address these difficulties, especially in scenarios with sparse
views and dynamic environments, we introduce TRAN-D, a novel 2D Gaussian
Splatting-based depth reconstruction method for transparent objects. Our key
insight lies in separating transparent objects from the background, enabling
focused optimization of Gaussians corresponding to the object. We mitigate
artifacts with an object-aware loss that places Gaussians in obscured regions,
ensuring coverage of invisible surfaces while reducing overfitting.
Furthermore, we incorporate a physics-based simulation that refines the
reconstruction in just a few seconds, effectively handling object removal and
chain-reaction movement of remaining objects without the need for rescanning.
TRAN-D is evaluated on both synthetic and real-world sequences, and it
consistently demonstrates robust improvements over existing GS-based
state-of-the-art methods. In comparison with baselines, TRAN-D reduces the mean
absolute error by over 39% for the synthetic TRansPose sequences. Furthermore,
despite being updated using only one image, TRAN-D reaches a $\delta$ < 2.5 cm
accuracy of 48.46%, over 1.5 times that of baselines, which use six images.
Code and more results are available at https://jeongyun0609.github.io/TRAN-D/.
☆ Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling
Recent advances in 3D neural representations and instance-level editing
models have enabled the efficient creation of high-quality 3D content. However,
achieving precise local 3D edits remains challenging, especially for Gaussian
Splatting, due to inconsistent multi-view 2D part segmentations and the
inherently ambiguous nature of the Score Distillation Sampling (SDS) loss. To address these
limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that
enables precise and drastic part-level modifications. First, we introduce a
robust 3D mask generation module with our 3D-Geometry Aware Label Prediction
(3D-GALP), which uses spherical harmonics (SH) coefficients to model
view-dependent label variations and soft-label property, yielding accurate and
consistent part segmentations across viewpoints. Second, we propose a
regularized SDS loss that combines the standard SDS loss with additional
regularizers. In particular, an L1 anchor loss is introduced via our Scheduled
Latent Mixing and Part (SLaMP) editing method, which generates high-quality
part-edited 2D images and confines modifications only to the target region
while preserving contextual coherence. Additional regularizers, such as
Gaussian prior removal, further improve flexibility by allowing changes beyond
the existing context, and robust 3D masking prevents unintended edits.
Experimental results demonstrate that our RoMaP achieves state-of-the-art local
3D editing on both reconstructed and generated Gaussian scenes and objects
qualitatively and quantitatively, enabling more robust and flexible part-level
3D Gaussian editing.
☆ Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation ICCV 2025
Medical language-guided segmentation, integrating textual clinical reports as
auxiliary guidance to enhance image segmentation, has demonstrated significant
improvements over unimodal approaches. However, its inherent reliance on paired
image-text input, which we refer to as ``textual reliance'', presents two
fundamental limitations: 1) many medical segmentation datasets lack paired
reports, leaving a substantial portion of image-only data underutilized for
training; and 2) inference is limited to retrospective analysis of cases with
paired reports, limiting its applicability in most clinical scenarios where
segmentation typically precedes reporting. To address these limitations, we
propose ProLearn, the first Prototype-driven Learning framework for
language-guided segmentation that fundamentally alleviates textual reliance. At
its core, in ProLearn, we introduce a novel Prototype-driven Semantic
Approximation (PSA) module to enable approximation of semantic guidance from
textual input. PSA initializes a discrete and compact prototype space by
distilling segmentation-relevant semantics from textual reports. Once
initialized, it supports a query-and-respond mechanism which approximates
semantic guidance for images without textual input, thereby alleviating textual
reliance. Extensive experiments on QaTa-COV19, MosMedData+ and Kvasir-SEG
demonstrate that ProLearn outperforms state-of-the-art language-guided methods
when limited text is available.
comment: Accepted to ICCV 2025
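The query-and-respond mechanism over a compact prototype space can be
illustrated with a small PyTorch module that attends an image feature over a
bank of learned prototypes to approximate text-derived guidance; the prototype
count, feature dimension, and temperature are illustrative assumptions, not
ProLearn's actual design.

```python
import torch
import torch.nn.functional as F

class PrototypeBank(torch.nn.Module):
    """Toy query-and-respond over a compact prototype space.

    Prototypes would be distilled from text reports during training; at
    inference, an image feature queries them to approximate text guidance.
    """
    def __init__(self, num_prototypes=64, dim=256, tau=0.07):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(num_prototypes, dim))
        self.tau = tau

    def forward(self, img_feat):                          # img_feat: (B, dim)
        q = F.normalize(img_feat, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        attn = torch.softmax(q @ p.t() / self.tau, dim=-1)  # query
        return attn @ self.prototypes                        # respond: pseudo text guidance

guidance = PrototypeBank()(torch.randn(2, 256))  # used in place of report embeddings
```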
☆ Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery
We present GLOD, a transformer-first architecture for object detection in
high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin
Transformer for end-to-end feature extraction, combined with novel UpConvMixer
blocks for robust upsampling and Fusion Blocks for multi-scale feature
integration. Our approach achieves 32.95\% on xView, outperforming SOTA methods
by 11.46\%. Key innovations include asymmetric fusion with CBAM attention and a
multi-path head design capturing objects across scales. The architecture is
optimized for satellite imagery challenges, leveraging spatial priors while
maintaining computational efficiency.
comment: 11 pages, 9 figures
☆ A Multi-View High-Resolution Foot-Ankle Complex Point Cloud Dataset During Gait for Occlusion-Robust 3D Completion
The kinematics analysis of foot-ankle complex during gait is essential for
advancing biomechanical research and clinical assessment. Collecting accurate
surface geometry data from the foot and ankle during dynamic gait conditions is
inherently challenging due to swing foot occlusions and viewing limitations.
Thus, this paper introduces FootGait3D, a novel multi-view dataset of
high-resolution ankle-foot surface point clouds captured during natural gait.
Different from existing gait datasets that typically target whole-body or
lower-limb motion, FootGait3D focuses specifically on the detailed modeling of
the ankle-foot region, offering a finer granularity of motion data.
Specifically, FootGait3D consists of 8,403 point cloud frames collected from 46
subjects using a custom five-camera depth sensing system. Each frame includes a
complete 5-view reconstruction of the foot and ankle (serving as ground truth)
along with partial point clouds obtained from only four, three, or two views.
This structured variation enables rigorous evaluation of 3D point cloud
completion methods under varying occlusion levels and viewpoints. Our dataset
is designed for shape completion tasks, facilitating the benchmarking of
state-of-the-art single-modal (e.g., PointTr, SnowflakeNet, Anchorformer) and
multi-modal (e.g., SVDFormer, PointSea, CSDN) completion networks on the
challenge of recovering the full foot geometry from occluded inputs. FootGait3D
has significant potential to advance research in biomechanics and multi-segment
foot modeling, offering a valuable testbed for clinical gait analysis,
prosthetic design, and robotics applications requiring detailed 3D models of
the foot during motion. The dataset is now available at
https://huggingface.co/datasets/ljw285/FootGait3D.
comment: 15 pages, 10 figures, 2 tables
☆ Efficient Dual-domain Image Dehazing with Haze Prior Perception
Transformer-based models exhibit strong global modeling capabilities in
single-image dehazing, but their high computational cost limits real-time
applicability. Existing methods predominantly rely on spatial-domain features
to capture long-range dependencies, which are computationally expensive and
often inadequate under complex haze conditions. While some approaches introduce
frequency-domain cues, the weak coupling between spatial and frequency branches
limits the overall performance. To overcome these limitations, we propose the
Dark Channel Guided Frequency-aware Dehazing Network (DGFDNet), a novel
dual-domain framework that performs physically guided degradation alignment
across spatial and frequency domains. At its core, the DGFDBlock comprises two
key modules: 1) the Haze-Aware Frequency Modulator (HAFM), which generates a
pixel-level haze confidence map from dark channel priors to adaptively enhance
haze-relevant frequency components, thereby achieving global degradation-aware
spectral modulation; 2) the Multi-level Gating Aggregation Module (MGAM), which
fuses multi-scale features through diverse convolutional kernels and hybrid
gating mechanisms to recover fine structural details. Additionally, a Prior
Correction Guidance Branch (PCGB) incorporates a closed-loop feedback
mechanism, enabling iterative refinement of the prior by intermediate dehazed
features and significantly improving haze localization accuracy, especially in
challenging outdoor scenes. Extensive experiments on four benchmark haze
datasets demonstrate that DGFDNet achieves state-of-the-art performance with
superior robustness and real-time efficiency. Code is available at:
https://github.com/Dilizlr/DGFDNet.
comment: 12 pages
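The dark channel prior that drives HAFM's haze confidence map is a classical,
easily sketched quantity; the following NumPy/SciPy snippet computes it and
normalizes it into a [0, 1] confidence map. The patch size and normalization
are illustrative, and the real HAFM module is learned rather than this
hand-crafted mapping.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    """Classic dark channel prior: per-pixel min over channels, then a local min filter.

    img: float RGB array in [0, 1], shape (H, W, 3).
    """
    return minimum_filter(img.min(axis=2), size=patch)

def haze_confidence(img, patch=15):
    """Map the dark channel to a [0, 1] haze confidence map.

    A bright dark channel indicates haze; this min-max normalisation is a
    simple illustrative choice.
    """
    dc = dark_channel(img, patch)
    return (dc - dc.min()) / (dc.max() - dc.min() + 1e-8)

conf = haze_confidence(np.random.rand(256, 256, 3))  # could modulate frequency bands
```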
☆ Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation ICCV 2025
Sunghyun Park, Jungsoo Lee, Shubhankar Borse, Munawar Hayat, Sungha Choi, Kyuwoong Hwang, Fatih Porikli
While open-vocabulary semantic segmentation (OVSS) can segment an image into
semantic regions based on arbitrarily given text descriptions even for classes
unseen during training, it fails to understand personal texts (e.g., `my mug
cup') for segmenting regions of specific interest to users. This paper
addresses challenges like recognizing `my mug cup' among `multiple mug cups'.
To overcome this challenge, we introduce a novel task termed
\textit{personalized open-vocabulary semantic segmentation} and propose a text
prompt tuning-based plug-in method designed to recognize personal visual
concepts using a few pairs of images and masks, while maintaining the
performance of the original OVSS. Based on the observation that reducing false
predictions is essential when applying text prompt tuning to this task, our
proposed method employs a `negative mask proposal' that captures visual concepts
other than the personalized concept. We further improve the performance by
enriching the representation of text prompts by injecting visual embeddings of
the personal concept into them. This approach enhances personalized OVSS
without compromising the original OVSS performance. We demonstrate the
superiority of our method on our newly established benchmarks for this task,
including FSS$^\text{per}$, CUB$^\text{per}$, and ADE$^\text{per}$.
comment: Accepted to ICCV 2025; 15 pages
☆ Human-Guided Shade Artifact Suppression in CBCT-to-MDCT Translation via Schrödinger Bridge with Conditional Diffusion
We present a novel framework for CBCT-to-MDCT translation, grounded in the
Schrödinger Bridge (SB) formulation, which integrates GAN-derived priors with
human-guided conditional diffusion. Unlike conventional GANs or diffusion
models, our approach explicitly enforces boundary consistency between CBCT
inputs and pseudo targets, ensuring both anatomical fidelity and perceptual
controllability. Binary human feedback is incorporated via classifier-free
guidance (CFG), effectively steering the generative process toward clinically
preferred outcomes. Through iterative refinement and tournament-based
preference selection, the model internalizes human preferences without relying
on a reward model. Subtraction image visualizations reveal that the proposed
method selectively attenuates shade artifacts in key anatomical regions while
preserving fine structural detail. Quantitative evaluations further demonstrate
superior performance across RMSE, SSIM, LPIPS, and Dice metrics on clinical
datasets -- outperforming prior GAN- and fine-tuning-based feedback methods --
while requiring only 10 sampling steps. These findings underscore the
effectiveness and efficiency of our framework for real-time, preference-aligned
medical image translation.
☆ First-Order Error Matters: Accurate Compensation for Quantized Large Language Models
Post-training quantization (PTQ) offers an efficient approach to compressing
large language models (LLMs), significantly reducing memory access and
computational costs. Existing compensation-based weight calibration methods
often rely on a second-order Taylor expansion to model quantization error,
under the assumption that the first-order term is negligible in well-trained
full-precision models. However, we reveal that the progressive compensation
process introduces accumulated first-order deviations between latent weights
and their full-precision counterparts, making this assumption fundamentally
flawed. To address this, we propose FOEM, a novel PTQ method that explicitly
incorporates first-order gradient terms to improve quantization error
compensation. FOEM approximates gradients by directly computing the difference
between latent and full-precision weights, avoiding the high cost and limited
generalization of backpropagation-based gradient computation. This approach
introduces minimal additional computational overhead. Moreover, FOEM leverages
precomputed Cholesky factors to efficiently recover the inverse of Hessian
submatrices in real time. Extensive experiments across a wide range of models
and benchmarks demonstrate that FOEM consistently outperforms the classical
GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of
Llama3-8B by 89.6%, and improves the 5-shot MMLU accuracy of Llama3-70B from
51.7% to 74.9%, approaching the full-precision performance of 78.6%.
Furthermore, FOEM can be seamlessly integrated with advanced techniques such as
GPTAQ and SpinQuant, yielding additional improvements under the challenging
W4A4KV4 setting, and further narrowing the accuracy gap with full-precision
baselines beyond what current state-of-the-art methods achieve. The code is
available at https://github.com/Xingyu-Zheng/FOEM.
☆ Semantically Informed Salient Regions Guided Radiology Report Generation
Recent advances in automated radiology report generation from chest X-rays
using deep learning algorithms have the potential to significantly reduce the
arduous workload of radiologists. However, due to the inherent massive data
bias in radiology images, where abnormalities are typically subtle and sparsely
distributed, existing methods often produce fluent yet medically inaccurate
reports, limiting their applicability in clinical practice. To address this
issue effectively, we propose a Semantically Informed Salient Regions-guided
(SISRNet) report generation method. Specifically, our approach explicitly
identifies salient regions with medically critical characteristics using
fine-grained cross-modal semantics. Then, SISRNet systematically focuses on
these high-information regions during both image modeling and report
generation, effectively capturing subtle abnormal findings, mitigating the
negative impact of data bias, and ultimately generating clinically accurate
reports. Compared to its peers, SISRNet demonstrates superior performance on
widely used IU-Xray and MIMIC-CXR datasets.
☆ Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection
With the advent of vision-language models (e.g., CLIP) in zero- and few-shot
settings, CLIP has been widely applied to zero-shot anomaly detection (ZSAD) in
recent research, where the rare classes are essential and expected in many
applications. This study introduces \textbf{FiSeCLIP} for ZSAD with
training-free \textbf{CLIP}, combining the feature matching with the
cross-modal alignment. Testing with the entire dataset is impractical, while
batch-based testing better aligns with real industrial needs, and images within
a batch can serve as mutual reference points. Accordingly, FiSeCLIP utilizes
other images in the same batch as reference information for the current image.
However, since the lack of labels for these references can introduce ambiguity,
we apply text information to \textbf{fi}lter out noisy features. In addition, we
further explore CLIP's inherent potential to restore its local
\textbf{se}mantic correlation, adapting it for fine-grained anomaly detection
tasks to enable a more accurate filtering process. Our approach exhibits
superior performance for both anomaly classification and segmentation on
anomaly detection benchmarks, building a stronger baseline for this direction;
e.g., on MVTec-AD, FiSeCLIP outperforms the SOTA AdaCLIP by
+4.6\%$\uparrow$/+5.7\%$\uparrow$ in segmentation metrics AU-ROC/$F_1$-max.
☆ Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation
Service robots are increasingly deployed in diverse and dynamic environments,
where both physical layouts and social contexts change over time and across
locations. In these unstructured settings, conventional navigation systems that
rely on fixed parameters often fail to generalize across scenarios, resulting
in degraded performance and reduced social acceptance. Although recent
approaches have leveraged reinforcement learning to enhance traditional
planners, these methods often fail in real-world deployments due to poor
generalization and limited simulation diversity, which hampers effective
sim-to-real transfer. To tackle these issues, we present LE-Nav, an
interpretable and scene-aware navigation framework that leverages multi-modal
large language model reasoning and conditional variational autoencoders to
adaptively tune planner hyperparameters. To achieve zero-shot scene
understanding, we utilize one-shot exemplars and chain-of-thought prompting
strategies. Additionally, a conditional variational autoencoder captures the
mapping between natural language instructions and navigation hyperparameters,
enabling expert-level tuning. Experiments show that LE-Nav can generate
hyperparameters achieving human-level tuning across diverse planners and
scenarios. Real-world navigation trials and a user study on a smart wheelchair
platform demonstrate that it outperforms state-of-the-art methods on
quantitative metrics such as success rate, efficiency, safety, and comfort,
while receiving higher subjective scores for perceived safety and social
acceptance. Code is available at https://github.com/Cavendish518/LE-Nav.
☆ SpaRTAN: Spatial Reinforcement Token-based Aggregation Network for Visual Recognition IJCNN 2025
The resurgence of convolutional neural networks (CNNs) in visual recognition
tasks, exemplified by ConvNeXt, has demonstrated their capability to rival
transformer-based architectures through advanced training methodologies and
ViT-inspired design principles. However, both CNNs and transformers exhibit a
simplicity bias, favoring straightforward features over complex structural
representations. Furthermore, modern CNNs often integrate MLP-like blocks akin
to those in transformers, but these blocks suffer from significant information
redundancies, necessitating high expansion ratios to sustain competitive
performance. To address these limitations, we propose SpaRTAN, a lightweight
architectural design that enhances spatial and channel-wise information
processing. SpaRTAN employs kernels with varying receptive fields, controlled
by kernel size and dilation factor, to capture discriminative multi-order
spatial features effectively. A wave-based channel aggregation module further
modulates and reinforces pixel interactions, mitigating channel-wise
redundancies. Combining the two modules, the proposed network can efficiently
gather and dynamically contextualize discriminative features. Experimental
results on ImageNet and COCO demonstrate that SpaRTAN achieves remarkable
parameter efficiency while maintaining competitive performance. In particular,
on the ImageNet-1k benchmark, SpaRTAN achieves 77.7% accuracy with only 3.8M
parameters and approximately 1.0 GFLOPs, demonstrating its ability to deliver
strong performance through an efficient design. On the COCO benchmark, it
achieves 50.0% AP, surpassing the previous benchmark by 1.2% with only 21.5M
parameters. The code is publicly available at
[https://github.com/henry-pay/SpaRTAN].
comment: Accepted at International Joint Conference on Neural Networks (IJCNN
2025)
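The idea of capturing multi-order spatial features with kernels of varying
receptive fields can be sketched as a multi-branch depthwise convolution block
in PyTorch; the kernel size, dilation rates, and residual fusion below are
illustrative assumptions, not SpaRTAN's exact design.

```python
import torch
import torch.nn as nn

class MultiOrderSpatialBlock(nn.Module):
    """Parallel depthwise convolutions with different receptive fields."""
    def __init__(self, dim, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, kernel_size=3, padding=d, dilation=d, groups=dim)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(dim * len(dilations), dim, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # multi-order spatial features
        return x + self.fuse(torch.cat(feats, dim=1))     # residual aggregation

y = MultiOrderSpatialBlock(64)(torch.randn(1, 64, 56, 56))
```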
☆ Mind the Gap: Bridging Occlusion in Gait Recognition via Residual Gap Correction
Gait is becoming popular as a method of person re-identification because of
its ability to identify people at a distance. However, most current works in
gait recognition do not address the practical problem of occlusions. Among
those which do, some require paired tuples of occluded and holistic sequences,
which are impractical to collect in the real world. Further, these approaches
work on occlusions but fail to retain performance on holistic inputs. To
address these challenges, we propose RG-Gait, a method for residual correction
for occluded gait recognition with holistic retention. We model the problem as
a residual learning task, conceptualizing the occluded gait signature as a
residual deviation from the holistic gait representation. Our proposed network
adaptively integrates the learned residual, significantly improving performance
on occluded gait sequences without compromising the holistic recognition
accuracy. We evaluate our approach on the challenging Gait3D, GREW and BRIAR
datasets and show that learning the residual can be an effective technique to
tackle occluded gait recognition with holistic retention.
comment: Accepted at IJCB 2025
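A toy version of the residual-correction idea, where an occluded gait feature
is modeled as the holistic representation plus an adaptively gated residual, is
sketched below in PyTorch; dimensions and the gating form are illustrative
assumptions rather than RG-Gait's architecture.

```python
import torch
import torch.nn as nn

class ResidualGaitHead(nn.Module):
    """Toy adaptive residual correction for occluded gait features.

    A gate predicted from the input decides how much of the learned residual
    to add, so holistic sequences can pass through nearly unchanged.
    """
    def __init__(self, dim=256):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feat):            # feat: (B, dim) backbone gait feature
        g = self.gate(feat)             # ~0 for holistic, ~1 for heavily occluded inputs
        return feat + g * self.residual(feat)

out = ResidualGaitHead()(torch.randn(8, 256))
```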
☆ Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection IJCNN 2025
Human-object interaction (HOI) detection is essential for accurately
localizing and characterizing interactions between humans and objects,
providing a comprehensive understanding of complex visual scenes across various
domains. However, existing HOI detectors often struggle to deliver reliable
predictions efficiently, relying on resource-intensive training methods and
inefficient architectures. To address these challenges, we conceptualize a
wavelet attention-like backbone and a novel ray-based encoder architecture
tailored for HOI detection. Our wavelet backbone addresses the limitations of
expressing middle-order interactions by aggregating discriminative features
from the low- and high-order interactions extracted from diverse convolutional
filters. Concurrently, the ray-based encoder facilitates multi-scale attention
by optimizing the focus of the decoder on relevant regions of interest and
mitigating computational overhead. By harnessing the attenuated intensity of
learnable ray origins, our decoder aligns query embeddings with
emphasized regions of interest for accurate predictions. Experimental results
on benchmark datasets, including ImageNet and HICO-DET, showcase the potential
of our proposed architecture. The code is publicly available at
[https://github.com/henry-pay/RayEncoder].
comment: Accepted at International Joint Conference on Neural Networks (IJCNN
2025)
☆ Teach Me Sign: Stepwise Prompting LLM for Sign Language Production ICIP 2025
Large language models, with their strong reasoning ability and rich
knowledge, have revolutionized many AI tasks, but their impact on sign language
generation remains limited due to its complexity and unique rules. In this
paper, we propose TEAch Me Sign (TEAM-Sign), treating sign
language as another natural language. By fine-tuning an LLM, we enable it to
learn the correspondence between text and sign language, and facilitate
generation. Considering the differences between sign and spoken language, we
employ a stepwise prompting strategy to extract the inherent sign language
knowledge within the LLM, thereby supporting the learning and generation
process. Experimental results on How2Sign and Phoenix14T datasets demonstrate
that our approach effectively leverages both the sign language knowledge and
reasoning capabilities of the LLM to align the different distributions and
grammatical rules between sign and spoken language.
comment: Accepted by IEEE ICIP 2025
☆ Women Sport Actions Dataset for Visual Classification Using Small Scale Training Data
Sports action classification representing complex body postures and
player-object interactions is an emerging area in image-based sports analysis.
Some works have contributed to automated sports action recognition using
machine learning techniques over the past decades. However, sufficient image
datasets representing women's sports actions with enough intra- and inter-class
variations are not available to researchers. To overcome this limitation, this
work presents a new dataset named WomenSports for women's sports classification
using small-scale training data. This dataset includes a variety
of sports activities, covering wide variations in movements, environments, and
interactions among players. In addition, this study proposes a convolutional
neural network (CNN) for deep feature extraction. A channel attention scheme
over local contextual regions is applied to refine and enhance feature
representation. The experiments are carried out on three different sports
datasets and one dance dataset for generalizing the proposed algorithm, and the
performances on these datasets are noteworthy. The deep learning method
achieves 89.15% top-1 classification accuracy using ResNet-50 on the proposed
WomenSports dataset, which is publicly available for research at Mendeley Data.
☆ Whom to Respond To? A Transformer-Based Model for Multi-Party Social Robot Interaction
Prior human-robot interaction (HRI) research has primarily focused on
single-user interactions, where robots do not need to consider the timing or
recipient of their responses. However, in multi-party interactions, such as at
malls and hospitals, social robots must understand the context and decide both
when and to whom they should respond. In this paper, we propose a
Transformer-based multi-task learning framework to improve the decision-making
process of social robots, particularly in multi-user environments. Considering
the characteristics of HRI, we propose two novel loss functions: one that
enforces constraints on active speakers to improve scene modeling, and another
that guides response selection towards utterances specifically directed at the
robot. Additionally, we construct a novel multi-party HRI dataset that captures
real-world complexities, such as gaze misalignment. Experimental results
demonstrate that our model achieves state-of-the-art performance in response
decisions, outperforming existing heuristic-based and single-task approaches.
Our findings contribute to the development of socially intelligent robots
capable of engaging in natural and context-aware multi-party
interactions.
☆ Robust ID-Specific Face Restoration via Alignment Learning
The latest developments in Face Restoration have yielded significant
advancements in visual quality through the utilization of diverse diffusion
priors. Nevertheless, the uncertainty of face identity introduced by
identity-obscure inputs and stochastic generative processes remains unresolved.
To address this challenge, we present Robust ID-Specific Face Restoration
(RIDFR), a novel ID-specific face restoration framework based on diffusion
models. Specifically, RIDFR leverages a pre-trained diffusion model in
conjunction with two parallel conditioning modules. The Content Injection
Module inputs the severely degraded image, while the Identity Injection Module
integrates the specific identity from a given image. Subsequently, RIDFR
incorporates Alignment Learning, which aligns the restoration results from
multiple references with the same identity in order to suppress the
interference of ID-irrelevant face semantics (e.g., pose, expression, make-up,
hair style). Experiments demonstrate that our framework outperforms the
state-of-the-art methods, reconstructing high-quality ID-specific results with
high identity fidelity and demonstrating strong robustness.
comment: 17 pages, 8 figures
☆ Graph Aggregation Prototype Learning for Semantic Change Detection in Remote Sensing
Semantic change detection (SCD) extends the binary change detection task to
provide not only the change locations but also the detailed "from-to"
categories in multi-temporal remote sensing data. Such detailed semantic
insights into changes offer considerable advantages for a wide array of
applications. However, since SCD involves the simultaneous optimization of
multiple tasks, the model is prone to negative transfer due to task-specific
learning difficulties and conflicting gradient flows. To address this issue, we
propose Graph Aggregation Prototype Learning for Semantic Change Detection in
remote sensing (GAPL-SCD). In this framework, a multi-task joint optimization
method is designed to optimize the primary tasks of semantic segmentation and
change detection, along with the auxiliary task of graph aggregation prototype
learning. Adaptive weight allocation and gradient rotation methods are used to
alleviate the conflict between training tasks and improve multi-task learning
capabilities. Specifically, the graph aggregation prototype learning module
constructs an interaction graph using high-level features. Prototypes serve as
class proxies, enabling category-level domain alignment across time points and
reducing interference from irrelevant changes. Additionally, the proposed
self-query multi-level feature interaction and bi-temporal feature fusion
modules further enhance multi-scale feature representation, improving
performance in complex scenes. Experimental results on the SECOND and
Landsat-SCD datasets demonstrate that our method achieves state-of-the-art
performance, with significant improvements in accuracy and robustness for the
SCD task.
☆ GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization ICCV2025
Cross-view localization, the task of estimating a camera's
3-degrees-of-freedom (3-DoF) pose by aligning ground-level images with
satellite images, is crucial for large-scale outdoor applications like
autonomous navigation and augmented reality. Existing methods often rely on
fully supervised learning, which requires costly ground-truth pose annotations.
In this work, we propose GeoDistill, a geometry-guided weakly supervised
self-distillation framework that uses teacher-student learning with Field-of-View
(FoV)-based masking to enhance local feature learning for robust cross-view
localization. In GeoDistill, the teacher model localizes a panoramic image,
while the student model predicts locations from a limited FoV counterpart
created by FoV-based masking. By aligning the student's predictions with those
of the teacher, the student focuses on key features like lane lines and ignores
textureless regions, such as roads. This results in more accurate predictions
and reduced uncertainty, regardless of whether the query images are panoramas
or limited FoV images. Our experiments show that GeoDistill significantly
improves localization performance across different frameworks. Additionally, we
introduce a novel orientation estimation network that predicts relative
orientation without requiring precise planar position ground truth. GeoDistill
provides a scalable and efficient solution for real-world cross-view
localization challenges. Code and model can be found at
https://github.com/tongshw/GeoDistill.
comment: accepted by ICCV2025
☆ Commuting Distance Regularization for Timescale-Dependent Label Inconsistency in EEG Emotion Recognition
In this work, we address the often-overlooked issue of Timescale Dependent
Label Inconsistency (TsDLI) in training neural network models for EEG-based
human emotion recognition. To mitigate TsDLI and enhance model generalization
and explainability, we propose two novel regularization strategies: Local
Variation Loss (LVL) and Local-Global Consistency Loss (LGCL). Both methods
incorporate classical mathematical principles--specifically, functions of
bounded variation and commute-time distances--within a graph theoretic
framework. Complementing our regularizers, we introduce a suite of new
evaluation metrics that better capture the alignment between temporally local
predictions and their associated global emotion labels. We validate our
approach through comprehensive experiments on two widely used EEG emotion
datasets, DREAMER and DEAP, across a range of neural architectures including
LSTM and transformer-based models. Performance is assessed using five distinct
metrics encompassing both quantitative accuracy and qualitative consistency.
Results consistently show that our proposed methods outperform state-of-the-art
baselines, delivering superior aggregate performance and offering a principled
trade-off between interpretability and predictive power under label
inconsistency. Notably, LVL achieves the best aggregate rank across all
benchmarked backbones and metrics, while LGCL frequently ranks the second,
highlighting the effectiveness of our framework.
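As a simplified illustration of the bounded-variation idea behind Local
Variation Loss, the snippet below penalizes the total variation of windowed
emotion predictions along time; the paper's actual LVL and LGCL operate on
graph commute-time distances, so this is only the plain temporal special case
under assumed tensor shapes.

```python
import torch

def local_variation_loss(logits, weight=0.1):
    """Toy bounded-variation regularizer on temporally local predictions.

    logits: (B, T, C) per-window emotion predictions for one trial. Penalizing
    the total variation of class probabilities along time discourages
    predictions from flipping faster than the trial-level label can justify.
    """
    probs = torch.softmax(logits, dim=-1)
    tv = (probs[:, 1:] - probs[:, :-1]).abs().sum(dim=-1).mean()
    return weight * tv

reg = local_variation_loss(torch.randn(4, 50, 3))
```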
☆ NavComposer: Composing Language Instructions for Navigation Trajectories through Action-Scene-Object Modularization
Language-guided navigation is a cornerstone of embodied AI, enabling agents
to interpret language instructions and navigate complex environments. However,
expert-provided instructions are limited in quantity, while synthesized
annotations often lack quality, making them insufficient for large-scale
research. To address this, we propose NavComposer, a novel framework for
automatically generating high-quality navigation instructions. NavComposer
explicitly decomposes semantic entities such as actions, scenes, and objects,
and recomposes them into natural language instructions. Its modular
architecture allows flexible integration of state-of-the-art techniques, while
the explicit use of semantic entities enhances both the richness and accuracy
of instructions. Moreover, it operates in a data-agnostic manner, supporting
adaptation to diverse navigation trajectories without domain-specific training.
Complementing NavComposer, we introduce NavInstrCritic, a comprehensive
annotation-free evaluation system that assesses navigation instructions on
three dimensions: contrastive matching, semantic consistency, and linguistic
diversity. NavInstrCritic provides a holistic evaluation of instruction
quality, addressing limitations of traditional metrics that rely heavily on
expert annotations. By decoupling instruction generation and evaluation from
specific navigation agents, our method enables more scalable and generalizable
research. Extensive experiments provide direct and practical evidence for the
effectiveness of our method.
☆ Modernizing CNN-based Weather Forecast Model towards Higher Computational Efficiency
Recently, AI-based weather forecast models have achieved impressive advances.
These models have reached accuracy levels comparable to traditional NWP
systems, marking a significant milestone in data-driven weather prediction.
However, they mostly leverage Transformer-based architectures, which often lead
to high training complexity and resource demands due to the massive
parameter sizes. In this study, we introduce a modernized CNN-based model for
global weather forecasting that delivers competitive accuracy while
significantly reducing computational requirements. To present a systematic
modernization roadmap, we highlight key architectural enhancements across
multiple design scales from an earlier CNN-based approach. KAI-a incorporates a
scale-invariant architecture and InceptionNeXt-based blocks within a
geophysically-aware design, tailored to the structure of Earth system data.
Trained on the ERA5 daily dataset with 67 atmospheric variables, the model
contains about 7 million parameters and completes training in just 12 hours on
a single NVIDIA L40s GPU. Our evaluation shows that KAI-a matches the
performance of state-of-the-art models in medium-range weather forecasting,
while offering a significantly lightweight design. Furthermore, case studies on
the 2018 European heatwave and the East Asian summer monsoon demonstrate
KAI-a's robust skill in capturing extreme events, reinforcing its practical
utility.
comment: 26 pages, 9 figures
☆ Trexplorer Super: Topologically Correct Centerline Tree Tracking of Tubular Objects in CT Volumes MICCAI 2025
Tubular tree structures, such as blood vessels and airways, are essential in
human anatomy and accurately tracking them while preserving their topology is
crucial for various downstream tasks. Trexplorer is a recurrent model designed
for centerline tracking in 3D medical images, but it tends to predict duplicate
branches and terminate tracking prematurely. To address these
issues, we present Trexplorer Super, an enhanced version that notably improves
performance through novel advancements. However, evaluating centerline tracking
models is challenging due to the lack of public datasets. To enable thorough
evaluation, we develop three centerline datasets, one synthetic and two real,
each with increasing difficulty. Using these datasets, we conduct a
comprehensive evaluation of existing state-of-the-art (SOTA) models and compare
them with our approach. Trexplorer Super outperforms previous SOTA models on
every dataset. Our results also highlight that strong performance on synthetic
data does not necessarily translate to real datasets. The code and datasets are
available at https://github.com/RomStriker/Trexplorer-Super.
comment: Submitted Version. Accepted at MICCAI 2025
☆ Focus on Texture: Rethinking Pre-training in Masked Autoencoders for Medical Image Classification MICCAI 2025
Masked Autoencoders (MAEs) have emerged as a dominant strategy for
self-supervised representation learning in natural images, where models are
pre-trained to reconstruct masked patches with a pixel-wise mean squared error
(MSE) between original and reconstructed RGB values as the loss. We observe
that MSE encourages blurred image reconstruction, but still works for natural
images as it preserves dominant edges. However, in medical imaging, where
texture cues are more important for classifying a visual abnormality, this
strategy fails. Taking inspiration from the Gray Level Co-occurrence Matrix
(GLCM) feature in radiomics studies, we propose a novel MAE-based pre-training
framework, GLCM-MAE, using a reconstruction loss based on matching GLCMs. GLCM
captures intensity and spatial relationships in an image, hence the proposed
loss helps preserve morphological features. Further, we propose a novel formulation
to convert matching GLCM matrices into a differentiable loss function. We
demonstrate that unsupervised pre-training on medical images with the proposed
GLCM loss improves representations for downstream tasks. GLCM-MAE outperforms
the current state-of-the-art across four tasks - gallbladder cancer detection
from ultrasound images by 2.1%, breast cancer detection from ultrasound by
3.1%, pneumonia detection from x-rays by 0.5%, and COVID detection from CT by
0.6%. Source code and pre-trained models are available at:
https://github.com/ChetanMadan/GLCM-MAE.
comment: To appear at MICCAI 2025
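A differentiable GLCM-matching loss can be sketched by softly binning gray
levels and accumulating co-occurrences for a fixed pixel offset, as below in
PyTorch; the bin count, bandwidth, offset, and L1 comparison are illustrative
assumptions and may differ from GLCM-MAE's actual formulation.

```python
import torch
import torch.nn.functional as F

def soft_glcm(img, levels=8, bandwidth=0.05, offset=(0, 1)):
    """Differentiable GLCM via soft binning of gray levels.

    img: (B, 1, H, W) in [0, 1]. Returns a (B, levels, levels) normalized
    co-occurrence matrix for a single pixel offset.
    """
    centers = torch.linspace(0.0, 1.0, levels, device=img.device)
    # soft one-hot assignment of each pixel to gray-level bins
    assign = F.softmax(-((img - centers.view(1, -1, 1, 1)) ** 2) / bandwidth, dim=1)
    dy, dx = offset
    a = assign[..., : img.shape[-2] - dy, : img.shape[-1] - dx]
    b = assign[..., dy:, dx:]
    glcm = torch.einsum('bihw,bjhw->bij', a, b)
    return glcm / glcm.sum(dim=(1, 2), keepdim=True)

def glcm_loss(recon, target):
    """Reconstruction objective matching texture statistics instead of raw pixels."""
    return F.l1_loss(soft_glcm(recon), soft_glcm(target))

loss = glcm_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```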
♻ ☆ Biomechanics-Guided Residual Approach to Generalizable Human Motion Generation and Estimation
Human pose, action, and motion generation are critical for applications in
digital humans, character animation, and humanoid robotics. However, many
existing methods struggle to produce physically plausible movements that are
consistent with biomechanical principles. Although recent autoregressive and
diffusion models deliver impressive visual quality, they often neglect key
biodynamic features and fail to ensure physically realistic motions.
Reinforcement Learning (RL) approaches can address these shortcomings but are
highly dependent on simulation environments, limiting their generalizability.
To overcome these challenges, we propose BioVAE, a biomechanics-aware framework
with three core innovations: (1) integration of muscle electromyography (EMG)
signals and kinematic features with acceleration constraints to enable
physically plausible motion without simulations; (2) seamless coupling with
diffusion models for stable end-to-end training; and (3) biomechanical priors
that promote strong generalization across diverse motion generation and
estimation tasks. Extensive experiments demonstrate that BioVAE achieves
state-of-the-art performance on multiple benchmarks, bridging the gap between
data-driven motion synthesis and biomechanical authenticity while setting new
standards for physically accurate motion generation and pose estimation.
♻ ☆ Augmenting End-to-End Steering Angle Prediction with CAN Bus Data
In recent years, end-to-end steering prediction for autonomous vehicles has
become a major area of research. The primary method for achieving end-to-end
steering has been to use computer vision models on a live feed of video data.
To further increase accuracy, many companies have added data from light
detection and ranging (LiDAR) and/or radar sensors through sensor fusion, but
the addition of these sensors comes at a high financial cost. In this paper, I
address both of these issues by increasing the accuracy of the computer vision
models without the added cost of LiDAR and/or radar sensors. I achieve this by
fusing CAN bus data, a vehicle communication protocol, with video data. CAN bus data
is a rich source of information about the vehicle's state, including its speed,
steering angle, and acceleration. By fusing this data with video data, the
accuracy of the computer vision model's predictions can be improved. When I
trained the model without CAN bus data, I obtained an RMSE of 0.02492, while
the model trained with the CAN bus data achieved an RMSE of 0.01970. This
finding indicates that fusing CAN bus data with video data can reduce the
computer vision model's prediction error by 20%, with some models decreasing
the error by 80%.
comment: 5 pages
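A minimal late-fusion baseline in the spirit of the paper, concatenating CNN
image features with a few CAN bus signals before regressing the steering angle,
is sketched below in PyTorch; the chosen CAN channels, backbone, and layer
sizes are assumptions, since the abstract does not specify the architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class CanVisionSteering(nn.Module):
    """Toy late-fusion model: image features concatenated with CAN bus signals.

    The CAN feature set (speed, current steering angle, acceleration) and all
    layer sizes are illustrative.
    """
    def __init__(self, can_dim=3):
        super().__init__()
        self.backbone = models.resnet18(weights=None)
        self.backbone.fc = nn.Identity()                   # 512-d image embedding
        self.can_mlp = nn.Sequential(nn.Linear(can_dim, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(512 + 32, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, frame, can):
        fused = torch.cat([self.backbone(frame), self.can_mlp(can)], dim=1)
        return self.head(fused).squeeze(1)                 # predicted steering angle

pred = CanVisionSteering()(torch.randn(2, 3, 224, 224), torch.randn(2, 3))
```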
♻ ☆ QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning ICCV 2025
The practical deployment of diffusion models is still hindered by the high
memory and computational overhead. Although quantization paves a way for model
compression and acceleration, existing methods face challenges in achieving
low-bit quantization efficiently. In this paper, we identify imbalanced
activation distributions as a primary source of quantization difficulty, and
propose to adjust these distributions through weight finetuning to be more
quantization-friendly. We provide both theoretical and empirical evidence
supporting finetuning as a practical and reliable solution. Building on this
approach, we further distinguish two critical types of quantized layers: those
responsible for retaining essential temporal information and those particularly
sensitive to bit-width reduction. By selectively finetuning these layers under
both local and global supervision, we mitigate performance degradation while
enhancing quantization efficiency. Our method demonstrates its efficacy across
three high-resolution image generation tasks, obtaining state-of-the-art
performance across multiple bit-width settings.
comment: ICCV 2025. Code is available at
https://github.com/hatchetProject/QuEST
♻ ☆ Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays
Recent works have revisited the infamous task ``Name That Dataset'',
demonstrating that non-medical datasets contain underlying biases and that the
dataset origin task can be solved with high accuracy. In this work, we revisit
the same task applied to popular open-source chest X-ray datasets. Medical
images are naturally more difficult to release for open-source due to their
sensitive nature, which has led to certain open-source datasets being extremely
popular for research purposes. By performing the same task, we wish to explore
whether dataset bias also exists in these datasets. To extend our work, we
apply simple transformations to the datasets, repeat the same task, and perform
an analysis to identify and explain any detected biases. Given the importance
of AI applications in medical imaging, it's vital to establish whether modern
methods are taking shortcuts or are focused on the relevant pathology. We
implement a range of different network architectures on the datasets: NIH,
CheXpert, MIMIC-CXR and PadChest. We hope this work will encourage more
explainable research being performed in medical imaging and the creation of
more open-source datasets in the medical domain. Our code can be found here:
https://github.com/eedack01/x_ray_ds_bias.
♻ ☆ Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing
The development of image time series retrieval (ITSR) methods is a growing
research interest in remote sensing (RS). Given a user-defined image time
series (i.e., the query time series), ITSR methods search and retrieve from
large archives the image time series that have similar content to the query
time series. Existing ITSR methods in RS are designed for unimodal retrieval
problems, relying on an assumption that users always have access to a query
image time series in the considered image modality. In operational scenarios,
this assumption may not hold. To overcome this issue, as a first time in RS we
introduce the task of cross-modal text-image time series retrieval (text-ITSR).
In detail, we present a self-supervised cross-modal text-ITSR method that
enables the retrieval of image time series using text sentences as queries, and
vice versa. We focus our attention on text-ITSR in pairs of images (i.e.,
bitemporal images). Our text-ITSR method consists of two key components: 1)
modality-specific encoders to model the semantic content of bitemporal images
and text sentences with discriminative features; and 2) modality-specific
projection heads to align textual and image representations in a shared
embedding space. To effectively model the temporal information in the
bitemporal images, we exploit two fusion strategies: i) global feature fusion
(GFF) strategy that combines global image features through simple yet effective
operators; and ii) transformer-based feature fusion (TFF) strategy that
leverages transformers for fine-grained temporal integration. Extensive
experiments conducted on two benchmark RS archives demonstrate the
effectiveness of our method in accurately retrieving semantically relevant
bitemporal images (or text sentences) to a query text sentence (or bitemporal
image). The code of this work is publicly available at
https://git.tu-berlin.de/rsim/cross-modal-text-tsir .
♻ ☆ Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation ICCV 2025
Vision Foundation Models (VFMs) have delivered remarkable performance in
Domain Generalized Semantic Segmentation (DGSS). However, recent methods often
overlook the fact that visual cues are susceptible, whereas the underlying
geometry remains stable, rendering depth information more robust. In this
paper, we investigate the potential of integrating depth information with
features from VFMs, to improve the geometric consistency within an image and
boost the generalization performance of VFMs. We propose a novel fine-tuning
DGSS framework, named DepthForge, which integrates the visual cues from frozen
DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of
the VFMs, we incorporate depth-aware learnable tokens to continuously decouple
domain-invariant visual and spatial information, thereby enhancing depth
awareness and attention of the VFMs. Finally, we develop a depth refinement
decoder and integrate it into the model architecture to adaptively refine
multi-layer VFM features and depth-aware learnable tokens. Extensive
experiments are conducted based on various DGSS settings and five different
datasets as unseen target domains. The qualitative and quantitative results
demonstrate that our method significantly outperforms alternative approaches
with stronger performance, steadier visual-spatial attention, and superior
generalization ability. In particular, DepthForge exhibits outstanding
performance under extreme conditions (e.g., night and snow). Code is available
at https://github.com/anonymouse-xzrptkvyqc/DepthForge.
comment: Accepted by ICCV 2025
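The following is a minimal sketch of the token-injection idea described above: depth-aware learnable tokens, conditioned on depth features, are appended to the visual tokens of a frozen VFM block and discarded afterwards. The token count, dimensions, and conditioning are assumptions, not DepthForge's exact design.

```python
# Hedged sketch of injecting depth-aware learnable tokens into a frozen ViT block.
import torch
import torch.nn as nn

class DepthAwareBlock(nn.Module):
    def __init__(self, frozen_block: nn.Module, dim: int, num_depth_tokens: int = 8):
        super().__init__()
        self.frozen_block = frozen_block
        for p in self.frozen_block.parameters():
            p.requires_grad = False                     # keep the VFM block frozen
        self.depth_tokens = nn.Parameter(torch.zeros(1, num_depth_tokens, dim))
        self.depth_proj = nn.Linear(dim, dim)           # maps pooled depth features to token space

    def forward(self, visual_tokens, depth_feat):       # visual_tokens: (B, N, dim), depth_feat: (B, dim)
        B = visual_tokens.size(0)
        # condition the learnable tokens on the depth features
        tokens = self.depth_tokens.expand(B, -1, -1) + self.depth_proj(depth_feat).unsqueeze(1)
        x = torch.cat([visual_tokens, tokens], dim=1)    # append depth-aware tokens
        x = self.frozen_block(x)
        return x[:, :visual_tokens.size(1)]              # drop the auxiliary tokens afterwards
```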
♻ ☆ Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence SC 2025
The collection and release of street-level recordings as Open Data play a
vital role in advancing autonomous driving systems and AI research. However,
these datasets pose significant privacy risks, particularly for pedestrians,
due to the presence of Personally Identifiable Information (PII) that extends
beyond biometric traits such as faces. In this paper, we present cRID, a novel
cross-modal framework combining Large Vision-Language Models, Graph Attention
Networks, and representation learning to detect textually describable clues of
PII and enhance person re-identification (Re-ID). Our approach focuses on
identifying and leveraging interpretable features, enabling the detection of
semantically meaningful PII beyond low-level appearance cues. We conduct a
systematic evaluation of PII presence in person image datasets. Our experiments
show improved performance in practical cross-dataset Re-ID scenarios, notably
from Market-1501 to CUHK03-np (detected), highlighting the framework's
practical utility. Code is available at https://github.com/RAufschlaeger/cRID.
comment: accepted for publication at the 2025 IEEE 28th International
Conference on Intelligent Transportation Systems (ITSC 2025), taking place
during November 18-21, 2025 in Gold Coast, Australia
♻ ☆ Moner: Motion Correction in Undersampled Radial MRI with Unsupervised Neural Representation ICLR 2025
Motion correction (MoCo) in radial MRI is a particularly challenging problem
due to the unpredictability of subject movement. Current state-of-the-art
(SOTA) MoCo algorithms often rely on extensive high-quality MR images to
pre-train neural networks, which constrains the solution space and leads to
outstanding image reconstruction results. However, the need for large-scale
datasets significantly increases costs and limits model generalization. In this
work, we propose Moner, an unsupervised MoCo method that jointly reconstructs
artifact-free MR images and estimates accurate motion from undersampled, rigid
motion-corrupted k-space data, without requiring any training data. Our core
idea is to leverage the continuous prior of implicit neural representation
(INR) to constrain this ill-posed inverse problem, facilitating optimal
solutions. Specifically, we integrate a quasi-static motion model into the INR,
granting it the ability to correct the subject's motion. To stabilize model
optimization, we reformulate radial MRI reconstruction as a back-projection
problem using the Fourier-slice theorem. Additionally, we propose a novel
coarse-to-fine hash encoding strategy, significantly enhancing MoCo accuracy.
Experiments on multiple MRI datasets show our Moner achieves performance
comparable to SOTA MoCo techniques on in-domain data, while demonstrating
significant improvements on out-of-domain data. The code is available at:
https://github.com/iwuqing/Moner
comment: Accepted by ICLR 2025 Spotlight
♻ ☆ Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs IJCAI
Dung Nguyen, Minh Khoi Ho, Huy Ta, Thanh Tam Nguyen, Qi Chen, Kumar Rav, Quy Duong Dang, Satwik Ramchandre, Son Lam Phung, Zhibin Liao, Minh-Son To, Johan Verjans, Phi Le Nguyen, Vu Minh Hieu Phan
Medical Large Multi-modal Models (LMMs) have demonstrated remarkable
capabilities in medical data interpretation. However, these models frequently
generate hallucinations contradicting source evidence, particularly due to
inadequate localization reasoning. This work reveals a critical limitation in
current medical LMMs: instead of analyzing relevant pathological regions, they
often rely on linguistic patterns or attend to irrelevant image areas when
responding to disease-related queries. To address this, we introduce
HEAL-MedVQA (Hallucination Evaluation via Localization MedVQA), a comprehensive
benchmark designed to evaluate LMMs' localization abilities and hallucination
robustness. HEAL-MedVQA features (i) two innovative evaluation protocols to
assess visual and textual shortcut learning, and (ii) a dataset of 67K VQA
pairs, with doctor-annotated anatomical segmentation masks for pathological
regions. To improve visual reasoning, we propose the Localize-before-Answer
(LobA) framework, which trains LMMs to localize target regions of interest and
self-prompt to emphasize segmented pathological areas, generating grounded and
reliable answers. Experimental results demonstrate that our approach
significantly outperforms state-of-the-art biomedical LMMs on the challenging
HEAL-MedVQA benchmark, advancing robustness in medical VQA.
comment: Accepted at Joint Conference on Artificial Intelligence (IJCAI) 2025
♻ ☆ Similarity Memory Prior is All You Need for Medical Image Segmentation
In recent years, it has been found that "grandmother cells" in the primary
visual cortex (V1) of macaques can directly recognize visual input with complex
shapes. This inspires us to examine the value of these cells in promoting the
research of medical image segmentation. In this paper, we design a Similarity
Memory Prior Network (Sim-MPNet) for medical image segmentation. Specifically,
we propose a Dynamic Memory Weights-Loss Attention (DMW-LA), which matches and
remembers the category features of specific lesions or organs in medical images
through the similarity memory prior in the prototype memory bank, thus helping
the network to learn subtle texture changes between categories. DMW-LA also
dynamically updates the similarity memory prior in reverse through the Weight-Loss
Dynamic (W-LD) update strategy, effectively helping the network directly
extract category features. In addition, we propose the Double-Similarity Global
Internal Enhancement Module (DS-GIM) to deeply explore the internal differences
in the feature distribution of input data through cosine similarity and
Euclidean distance. Extensive experiments on four public datasets show that
Sim-MPNet has better segmentation performance than other state-of-the-art
methods. Our code is available on https://github.com/vpsg-research/Sim-MPNet.
♻ ☆ Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding ICCV 2025
Recent advances in deep learning have led to increasingly complex models with
deeper layers and more parameters, reducing interpretability and making their
decisions harder to understand. While many methods explain black-box reasoning,
most lack effective interventions or only operate at the sample level without
modifying the model itself. To address this, we propose the Concept Bottleneck
Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU).
CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable
framework to approximate black-box reasoning and communicate conceptual
understanding. Detrimental concepts are automatically identified and refined
(removed/replaced) based on global gradient contributions. The modified CBM
then distills corrected knowledge back into the black-box model, enhancing both
interpretability and accuracy. We evaluate CBM-HNMU on various CNN and
transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft,
and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum
increase in average accuracy of 1.03%. Source code is available at:
https://github.com/XiGuaBo/CBM-HNMU.
comment: Accepted by ICCV 2025
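As a rough illustration of identifying detrimental concepts from global gradient contributions, the sketch below accumulates a gradient-times-activation score per concept over a dataset and zeros out the most harmful ones. The scoring rule, the pruning threshold, and all function names are assumptions, not CBM-HNMU's exact procedure.

```python
# Hedged sketch: score concepts by global gradient contribution, then prune the worst ones.
import torch
import torch.nn.functional as F

def concept_gradient_scores(concept_model, head, loader, device="cpu"):
    """Accumulate gradient x activation contributions per concept over a dataset."""
    scores = None
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        concepts = concept_model(images)          # (B, num_concepts) concept activations
        concepts.retain_grad()
        loss = F.cross_entropy(head(concepts), labels)
        loss.backward()
        contrib = (concepts.grad * concepts.detach()).mean(dim=0)  # per-concept contribution
        scores = contrib if scores is None else scores + contrib
    return scores

def prune_detrimental_concepts(scores, concepts, k=5):
    """Zero out the k concepts whose gradient contribution most increases the loss."""
    detrimental = torch.topk(scores, k).indices   # large positive score => raises the loss
    pruned = concepts.clone()
    pruned[:, detrimental] = 0.0
    return pruned, detrimental
```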
♻ ☆ Text Embedding Knows How to Quantize Text-Guided Diffusion Models ICCV 2025
Despite the success of diffusion models in image generation tasks such as
text-to-image, the enormous computational complexity of diffusion models limits
their use in resource-constrained environments. To address this, network
quantization has emerged as a promising solution for designing efficient
diffusion models. However, existing diffusion model quantization methods do not
consider input conditions, such as text prompts, as an essential source of
information for quantization. In this paper, we propose a novel quantization
method dubbed Quantization of Language-to-Image diffusion models using text
Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit
precision for every layer at each time step. In addition, QLIP can be
seamlessly integrated into existing quantization methods to enhance
quantization efficiency. Our extensive experiments demonstrate the
effectiveness of QLIP in reducing computational complexity and improving the
quality of the generated images across various datasets.
comment: ICCV 2025
♻ ☆ petBrain: A New Pipeline for Amyloid, Tau Tangles and Neurodegeneration Quantification Using PET and MRI
Pierrick Coupé, Boris Mansencal, Floréal Morandat, Sergio Morell-Ortega, Nicolas Villain, Jose V. Manjón, Vincent Planche
INTRODUCTION: Quantification of amyloid plaques (A), neurofibrillary tangles
(T2), and neurodegeneration (N) using PET and MRI is critical for Alzheimer's
disease (AD) diagnosis and prognosis. Existing pipelines face limitations
regarding processing time, variability in tracer types, and challenges in
multimodal integration.
METHODS: We developed petBrain, a novel end-to-end processing pipeline for
amyloid-PET, tau-PET, and structural MRI. It leverages deep learning-based
segmentation, standardized biomarker quantification (Centiloid, CenTauR,
HAVAs), and simultaneous estimation of A, T2, and N biomarkers. The pipeline is
implemented as a web-based platform, requiring no local computational
infrastructure or specialized software knowledge.
RESULTS: petBrain provides reliable and rapid biomarker quantification, with
results comparable to existing pipelines for A and T2. It shows strong
concordance with data processed in ADNI databases. The staging and
quantification of A/T2/N by petBrain demonstrated good agreement with
CSF/plasma biomarkers, clinical status, and cognitive performance.
DISCUSSION: petBrain represents a powerful and openly accessible platform for
standardized AD biomarker analysis, facilitating applications in clinical
research.
♻ ☆ ED$^4$: Explicit Data-level Debiasing for Deepfake Detection
Learning intrinsic bias from limited data has been considered the main reason
for the failure of deepfake detection with generalizability. Apart from the
discovered content and specific-forgery bias, we reveal a novel spatial bias,
where detectors inertly anticipate observing structural forgery clues at the
image center, which can also lead to the poor generalization of existing
methods. We present ED$^4$, a simple and effective strategy to address the
aforementioned biases explicitly at the data level in a unified framework
rather than implicit disentanglement via network design. In particular, we
develop ClockMix to produce facial structure preserved mixtures with arbitrary
samples, which allows the detector to learn from an exponentially extended data
distribution with much more diverse identities, backgrounds, local manipulation
traces, and the co-occurrence of multiple forgery artifacts. We further propose
the Adversarial Spatial Consistency Module (AdvSCM) to prevent extracting
features with spatial bias, which adversarially generates spatial-inconsistent
images and constrains their extracted feature to be consistent. As a
model-agnostic debiasing strategy, ED$^4$ is plug-and-play: it can be
integrated with various deepfake detectors to obtain significant benefits. We
conduct extensive experiments to demonstrate its effectiveness and superiority
over existing deepfake detection approaches.
♻ ☆ Supercharging Floorplan Localization with Semantic Rays ICCV 2025
Floorplans provide a compact representation of the building's structure,
revealing not only layout information but also detailed semantics such as the
locations of windows and doors. However, contemporary floorplan localization
techniques mostly focus on matching depth-based structural cues, ignoring the
rich semantics communicated within floorplans. In this work, we introduce a
semantic-aware localization framework that jointly estimates depth and semantic
rays, consolidating over both for predicting a structural-semantic probability
volume. Our probability volume is constructed in a coarse-to-fine manner: We
first sample a small set of rays to obtain an initial low-resolution
probability volume. We then refine these probabilities by performing a denser
sampling only in high-probability regions and process the refined values for
predicting a 2D location and orientation angle. We conduct an evaluation on two
standard floorplan localization benchmarks. Our experiments demonstrate that
our approach substantially outperforms state-of-the-art methods, achieving
significant improvements in recall metrics compared to prior works. Moreover,
we show that our framework can easily incorporate additional metadata such as
room labels, enabling additional gains in both accuracy and efficiency.
comment: Accepted at ICCV 2025. https://tau-vailab.github.io/SemRayLoc/
♻ ☆ Robustifying 3D Perception via Least-Squares Graphs for Multi-Agent Object Tracking
The critical perception capabilities of EdgeAI systems, such as autonomous
vehicles, are required to be resilient against adversarial threats, by enabling
accurate identification and localization of multiple objects in the scene over
time, mitigating their impact. Single-agent tracking offers resilience to
adversarial attacks but lacks situational awareness, underscoring the need for
multi-agent cooperation to enhance context understanding and robustness. This
paper proposes a novel mitigation framework on 3D LiDAR scene against
adversarial noise by tracking objects based on least-squares graph on
multi-agent adversarial bounding boxes. Specifically, we employ the
least-squares graph tool to reduce the induced positional error of each
detection's centroid utilizing overlapped bounding boxes on a fully connected
graph via differential coordinates and anchor points. Hence, the multi-vehicle
detections are fused and refined to mitigate the adversarial impact, and then
associated with existing tracks in a two-stage tracking process to further
suppress the adversarial threat. An extensive evaluation study on the
real-world V2V4Real dataset demonstrates that the proposed method significantly
outperforms both state-of-the-art single and multi-agent tracking frameworks by
up to 23.3% under challenging adversarial conditions, operating as a resilient
approach without relying on additional defense mechanisms.
comment: 6 pages, 3 figures, 4 tables
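A minimal sketch of the least-squares idea described above: refine fused detection centroids on a fully connected graph by preserving differential (Laplacian) coordinates while softly pinning trusted anchor points. Graph weights, the anchor choice, and the anchor weight are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of least-squares centroid refinement with differential coordinates and anchors.
import numpy as np

def refine_centroids(noisy_xyz, anchor_idx, anchor_xyz, anchor_weight=10.0):
    """noisy_xyz: (N, 3) fused multi-agent centroids; anchors pin trusted detections."""
    n = noisy_xyz.shape[0]
    # graph Laplacian of a fully connected graph (degree n-1 on the diagonal)
    L = np.full((n, n), -1.0) + n * np.eye(n)
    delta = L @ noisy_xyz                      # differential coordinates of the noisy layout
    # anchor rows: soft constraints x_i ~= trusted anchor position
    A_anchor = np.zeros((len(anchor_idx), n))
    A_anchor[np.arange(len(anchor_idx)), anchor_idx] = anchor_weight
    b_anchor = anchor_weight * anchor_xyz
    A = np.vstack([L, A_anchor])
    b = np.vstack([delta, b_anchor])
    refined, *_ = np.linalg.lstsq(A, b, rcond=None)
    return refined                             # (N, 3) refined centroids

# Example: three detections, the first trusted as an anchor
pts = np.array([[0.0, 0.0, 0.0], [2.1, 0.2, 0.0], [4.0, -0.1, 0.1]])
print(refine_centroids(pts, anchor_idx=[0], anchor_xyz=pts[:1]))
```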
♻ ☆ Partition Map-Based Fast Block Partitioning for VVC Inter Coding
Among the new techniques of Versatile Video Coding (VVC), the quadtree with
nested multi-type tree (QT+MTT) block structure yields significant coding gains
by providing more flexible block partitioning patterns. However, the recursive
partition search in the VVC encoder increases the encoder complexity
substantially. To address this issue, we propose a partition map-based
algorithm to pursue fast block partitioning in inter coding. Based on our
previous work on partition map-based methods for intra coding, we analyze the
characteristics of VVC inter coding, and thus improve the partition map by
incorporating an MTT mask for early termination. Next, we develop a neural
network that uses both spatial and temporal features to predict the partition
map. It consists of several special designs including stacked top-down and
bottom-up processing, quantization parameter modulation layers, and
partitioning-adaptive warping. Furthermore, we present a dual-threshold
decision scheme to achieve a fine-grained trade-off between complexity
reduction and rate-distortion (RD) performance loss. The experimental results
demonstrate that the proposed method achieves an average 51.30% encoding time
saving with a 2.12% Bjontegaard Delta Bit Rate (BDBR) under the random access
configuration.
comment: 23 pages, 26 figures. Project page: https://github.com/ustcivclab/IPM
♻ ☆ EEG Emotion Copilot: Optimizing Lightweight LLMs for Emotional EEG Interpretation with Assisted Medical Record Generation
Hongyu Chen, Weiming Zeng, Chengcheng Chen, Luhui Cai, Fei Wang, Yuhu Shi, Lei Wang, Wei Zhang, Yueyang Li, Hongjie Yan, Wai Ting Siok, Nizhuan Wang
In the fields of affective computing (AC) and brain-machine interface (BMI),
the analysis of physiological and behavioral signals to discern individual
emotional states has emerged as a critical research frontier. While deep
learning-based approaches have made notable strides in EEG emotion recognition,
particularly in feature extraction and pattern recognition, significant
challenges persist in achieving end-to-end emotion computation, including
real-time processing, individual adaptation, and seamless user interaction.
This paper presents the EEG Emotion Copilot, a system optimizing a lightweight
large language model (LLM) with 0.5B parameters operating in a local setting,
which first recognizes emotional states directly from EEG signals, subsequently
generates personalized diagnostic and treatment suggestions, and finally
supports the automation of assisted electronic medical records. Specifically,
we demonstrate the critical techniques, including a novel prompt data structure,
model pruning and fine-tuning, and deployment strategies aimed at
improving real-time performance and computational efficiency. Extensive
experiments show that our optimized lightweight LLM-based copilot achieves an
enhanced intuitive interface for participant interaction, superior accuracy of
emotion recognition and assisted electronic medical records generation, in
comparison to models with similar or larger parameter scales such as 1.5B,
1.8B, 3B and 7B. In summary, through these efforts,
the proposed copilot is expected to advance the application of AC in the
medical domain, offering an innovative solution to mental health monitoring. The
codes will be released at https://github.com/NZWANG/EEG_Emotion_Copilot.
comment: 17 pages, 16 figures, 5 tables
♻ ☆ Archival Faces: Detection of Faces in Digitized Historical Documents ICDAR 2025
When digitizing historical archives, it is necessary to search for the faces
of celebrities and ordinary people, especially in newspapers, link them to the
surrounding text, and make them searchable. Existing face detectors on datasets
of scanned historical documents fail remarkably -- current detection tools only
achieve around 24% mAP at 50:90% IoU. This work compensates for this failure by
introducing a new manually annotated domain-specific dataset in the style of
the popular Wider Face dataset, containing 2.2k new images from digitized
historical newspapers from the 19th to 20th century, with 11k new bounding-box
annotations and associated facial landmarks. This dataset allows existing
detectors to be retrained to bring their results closer to the standard in the
field of face detection in the wild. We report several experimental results
comparing different families of fine-tuned detectors against publicly available
pre-trained face detectors and ablation studies of multiple detector sizes with
comprehensive detection and landmark prediction performance results.
comment: Accepted to ICDAR 2025 Workshops, GREC2025
♻ ☆ AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization ICDAR2025
We introduce the AnnoPage Dataset, a novel collection of 7,550 pages from
historical documents, primarily in Czech and German, spanning from 1485 to the
present, focusing on the late 19th and early 20th centuries. The dataset is
designed to support research in document layout analysis and object detection.
Each page is annotated with axis-aligned bounding boxes (AABB) representing
elements of 25 categories of non-textual elements, such as images, maps,
decorative elements, or charts, following the Czech Methodology of image
document processing. The annotations were created by expert librarians to
ensure accuracy and consistency. The dataset also incorporates pages from
multiple, mainly historical, document datasets to enhance variability and
maintain continuity. The dataset is divided into development and test subsets,
with the test set carefully selected to maintain the category distribution. We
provide baseline results using YOLO and DETR object detectors, offering a
reference point for future research. The AnnoPage Dataset is publicly available
on Zenodo (https://doi.org/10.5281/zenodo.12788419), along with ground-truth
annotations in YOLO format.
comment: 17 pages, 2 tables, 7 figures; Accepted to GREC Workshop at ICDAR2025
♻ ☆ TorchCP: A Python Library for Conformal Prediction
Conformal prediction (CP) is a robust statistical framework that generates
prediction intervals or sets with guaranteed coverage probability, addressing
the challenge of quantifying predictive uncertainty in deep learning. Despite
advancements in deep learning architectures and datasets, reliable uncertainty
estimation remains elusive, making CP increasingly vital. This paper introduces
TorchCP, a PyTorch-native library designed to integrate state-of-the-art CP
algorithms into deep learning tasks, including classification, regression,
graph neural networks, and large language models. TorchCP offers a
comprehensive suite of advanced methodologies, a modular design for easy
customization, and full GPU-accelerated scalability. Released under the
LGPL-3.0 license, TorchCP has gained widespread adoption with over 12,582 PyPI
downloads. It is supported by approximately 16,132 lines of code, 564 unit
tests achieving 100\% coverage, and comprehensive documentation. By bridging
statistics and computer science, TorchCP empowers researchers and practitioners
to advance conformal prediction in diverse deep learning applications.
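To ground the idea such libraries implement, here is a generic split conformal prediction sketch for classification in plain PyTorch. It deliberately does not reproduce TorchCP's own API; function names and the threshold-style nonconformity score are assumptions chosen for brevity.

```python
# Generic split conformal prediction for classification (not TorchCP's API).
import torch

def calibrate_threshold(probs_cal, labels_cal, alpha=0.1):
    """Compute the conformal quantile from calibration-set nonconformity scores."""
    scores = 1.0 - probs_cal[torch.arange(len(labels_cal)), labels_cal]  # 1 - p(true class)
    n = len(scores)
    q_level = min(1.0, (n + 1) * (1 - alpha) / n)       # finite-sample corrected quantile level
    return torch.quantile(scores, q_level)

def predict_sets(probs_test, qhat):
    """Return boolean prediction sets targeting >= 1 - alpha marginal coverage."""
    return probs_test >= 1.0 - qhat                     # include classes with small score

# Usage with random softmax outputs (stand-ins for a trained classifier):
probs_cal = torch.softmax(torch.randn(500, 10), dim=-1)
labels_cal = torch.randint(0, 10, (500,))
qhat = calibrate_threshold(probs_cal, labels_cal, alpha=0.1)
sets = predict_sets(torch.softmax(torch.randn(5, 10), dim=-1), qhat)
print(sets.sum(dim=-1))   # sizes of the prediction sets
```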
♻ ☆ ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models
With the growing research focus on multimodal dialogue systems, the
capability for proactive interaction is gradually gaining recognition. As an
alternative to conventional turn-by-turn dialogue, users increasingly expect
multimodal systems to take more initiative, for example, by autonomously
determining the timing of multi-turn responses in real time during video
playback. To facilitate progress in this emerging area, we introduce
ProactiveVideoQA, the first comprehensive benchmark to evaluate a system's
ability to engage in proactive interaction. Since model responses are generated
at varying timestamps, we further propose PAUC, the first metric that accounts
for the temporal dynamics of model responses. This enables a more accurate
evaluation of systems operating in proactive settings. Through extensive
benchmarking of various baseline systems on ProactiveVideoQA and a user study
of human preferences, we show that PAUC is in better agreement with human
preferences than traditional evaluation metrics, which typically only consider
the textual content of responses. These findings demonstrate that PAUC provides
a more faithful assessment of user experience in proactive interaction
scenarios. Project homepage:
https://github.com/yellow-binary-tree/ProactiveVideoQA
♻ ☆ A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation
Advances in architectural design, data availability, and compute have driven
remarkable progress in semantic segmentation. Yet, these models often rely on
relaxed Bayesian assumptions, omitting critical uncertainty information needed
for robust decision-making. The resulting reliance on point estimates has
fueled interest in probabilistic segmentation, but the literature remains
fragmented. In response, this review consolidates and contextualizes
foundational concepts in uncertainty modeling, including the non-trivial task
of distinguishing between epistemic and aleatoric uncertainty and examining
their roles across four key downstream segmentation tasks, highlighting Active
Learning as particularly promising. By unifying theory, terminology, and
applications, we provide a coherent foundation for researchers and identify
critical challenges, such as strong assumptions in spatial aggregation, lack of
standardized benchmarks, and pitfalls in current uncertainty quantification
methods. We identify trends such as the adoption of contemporary generative
models, driven by advances in the broader field of generative modeling, with
segmentation-specific innovation primarily in the conditioning mechanisms.
Moreover, we observe growing interest in distribution- and sampling-free
approaches to uncertainty estimation. We further propose directions for
advancing uncertainty-aware segmentation in deep learning, including pragmatic
strategies for disentangling different sources of uncertainty, novel
uncertainty modeling approaches and improved Transformer-based backbones. In
this way, we aim to support the development of more reliable, efficient, and
interpretable segmentation models that effectively incorporate uncertainty into
real-world applications.
comment: 31 pages of content, revised
♻ ☆ VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning IJCNN 2025
Recognising emotions in context involves identifying an individual's apparent
emotions while considering contextual cues from the surrounding scene. Previous
approaches to this task have typically designed explicit scene-encoding
architectures or incorporated external scene-related information, such as
captions. However, these methods often utilise limited contextual information
or rely on intricate training pipelines to decouple noise from relevant
information. In this work, we leverage the capabilities of
Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion
classification in a more straightforward manner. Our proposed method follows a
simple yet effective two-stage approach. First, we prompt VLLMs to generate
natural language descriptions of the subject's apparent emotion in relation to
the visual context. Second, the descriptions, along with the visual input, are
used to train a transformer-based architecture that fuses text and visual
features before the final classification task. This method not only simplifies
the training process but also significantly improves performance. Experimental
results demonstrate that the textual descriptions effectively guide the model
to constrain the noisy visual input, allowing our fused architecture to
outperform individual modalities. Our approach achieves state-of-the-art
performance across three datasets, BoLD, EMOTIC, and CAER-S, without bells and
whistles. The code will be made publicly available on github:
https://github.com/NickyFot/EmoCommonSense.git
comment: A. Xenos, N. Foteinopoulou and I. Ntinou contributed equally to this
work; 14 pages, 5 figures; Accepted at IJCNN 2025
♻ ☆ IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution ICCV 2025
Super-resolution (SR) has been a pivotal task in image processing, aimed at
enhancing image resolution across various applications. Recently, look-up table
(LUT)-based approaches have attracted interest due to their efficiency and
performance. However, these methods are typically designed for fixed scale
factors, making them unsuitable for arbitrary-scale image SR (ASISR). Existing
ASISR techniques often employ implicit neural representations, which come with
considerable computational cost and memory demands. To address these
limitations, we propose Interpolation Mixing LUT (IM-LUT), a novel framework
that operates ASISR by learning to blend multiple interpolation functions to
maximize their representational capacity. Specifically, we introduce IM-Net, a
network trained to predict mixing weights for interpolation functions based on
local image patterns and the target scale factor. To enhance efficiency of
interpolation-based methods, IM-Net is transformed into IM-LUT, where LUTs are
employed to replace computationally expensive operations, enabling lightweight
and fast inference on CPUs while preserving reconstruction quality.
Experimental results on several benchmark datasets demonstrate that IM-LUT
consistently achieves a superior balance between image quality and efficiency
compared to existing methods, highlighting its potential as a promising
solution for resource-constrained applications.
comment: ICCV 2025
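A rough sketch of the interpolation-mixing idea before any LUT conversion: a tiny network predicts per-pixel mixing weights over a set of fixed interpolation functions (bilinear and bicubic here), and the candidate upsamplings are blended accordingly. The weight predictor, candidate set, and shapes are assumptions, not IM-Net's exact design.

```python
# Hedged sketch of mixing multiple interpolation functions with predicted weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpolationMixer(nn.Module):
    def __init__(self, num_kernels=2):
        super().__init__()
        self.weight_net = nn.Sequential(          # predicts per-pixel mixing weights
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, num_kernels, 3, padding=1))

    def forward(self, lr, scale=2.0):
        size = (int(lr.shape[-2] * scale), int(lr.shape[-1] * scale))
        cands = torch.stack([
            F.interpolate(lr, size=size, mode="bilinear", align_corners=False),
            F.interpolate(lr, size=size, mode="bicubic", align_corners=False),
        ], dim=1)                                  # (B, K, C, H, W) candidate upsamplings
        w = torch.softmax(self.weight_net(lr), dim=1)             # (B, K, h, w) mixing weights
        w = F.interpolate(w, size=size, mode="nearest").unsqueeze(2)
        return (w * cands).sum(dim=1)              # weighted blend of interpolation outputs

sr = InterpolationMixer()(torch.rand(1, 1, 32, 32), scale=3.0)
print(sr.shape)   # torch.Size([1, 1, 96, 96]) -- arbitrary (non-integer-capable) scale factor
```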
♻ ☆ SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition ICCV 2025
Connectionist temporal classification (CTC)-based scene text recognition
(STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due
to their simple architecture, which only contains a visual model and a
CTC-aligned linear classifier, and therefore fast inference. However, they
generally exhibit worse accuracy than encoder-decoder-based methods (EDTRs) due
to struggling with text irregularity and linguistic missing. To address these
challenges, we propose SVTRv2, a CTC model endowed with the ability to handle
text irregularities and model linguistic context. First, a multi-size resizing
strategy is proposed to resize text instances to appropriate predefined sizes,
effectively avoiding severe text distortion. Meanwhile, we introduce a feature
rearrangement module to ensure that visual features accommodate the requirement
of CTC, thus alleviating the alignment puzzle. Second, we propose a semantic
guidance module. It integrates linguistic context into the visual features,
allowing CTC model to leverage language information for accuracy improvement.
This module can be omitted at the inference stage and would not increase the
time cost. We extensively evaluate SVTRv2 in both standard and recent
challenging benchmarks, where SVTRv2 is fairly compared to popular STR models
across multiple scenarios, including different types of text irregularity,
languages, long text, and whether employing pretraining. SVTRv2 surpasses most
EDTRs across the scenarios in terms of accuracy and inference speed. Code:
https://github.com/Topdu/OpenOCR.
comment: Accepted by ICCV 2025
♻ ☆ CLA: Latent Alignment for Online Continual Self-Supervised Learning
Self-supervised learning (SSL) is able to build latent representations that
generalize well to unseen data. However, only a few SSL techniques exist for
the online CL setting, where data arrives in small minibatches, the model must
comply with a fixed computational budget, and task boundaries are absent. We
introduce Continual Latent Alignment (CLA), a novel SSL strategy for Online CL
that aligns the representations learned by the current model with past
representations to mitigate forgetting. We found that our CLA is able to speed
up the convergence of the training process in the online scenario,
outperforming state-of-the-art approaches under the same computational budget.
Surprisingly, we also discovered that using CLA as a pretraining protocol in
the early stages of pretraining leads to a better final performance when
compared to a full i.i.d. pretraining.
comment: Accepted at CoLLAs 2025 conference (oral)
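A minimal sketch of a latent-alignment regularizer of the kind described above: the current encoder's features are pulled toward features stored from a past snapshot to mitigate forgetting. The cosine objective, the weighting, and the buffer interface are assumptions, not CLA's precise formulation.

```python
# Hedged sketch of a latent-alignment term for online continual self-supervised learning.
import torch
import torch.nn.functional as F

def latent_alignment_loss(current_feats, past_feats):
    """Cosine alignment between current and stored past representations of the same inputs."""
    cur = F.normalize(current_feats, dim=-1)
    past = F.normalize(past_feats.detach(), dim=-1)   # past representations are fixed targets
    return (1.0 - (cur * past).sum(dim=-1)).mean()

# Usage inside an online-CL training step (encoder, ssl_loss, buffer are assumed to exist):
#   feats = encoder(minibatch)
#   loss = ssl_loss(feats) + lambda_align * latent_alignment_loss(feats, buffer.lookup(minibatch))
```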
♻ ☆ Fully Unified Motion Planning for End-to-End Autonomous Driving
Lin Liu, Caiyan Jia, Ziying Song, Hongyu Pan, Bencheng Liao, Wenchao Sun, Yongchang Zhang, Lei Yang, Yandan Luo
Current end-to-end autonomous driving methods typically learn only from
expert planning data collected from a single ego vehicle, severely limiting the
diversity of learnable driving policies and scenarios. However, a critical yet
overlooked fact is that in any driving scenario, multiple high-quality
trajectories from other vehicles coexist with a specific ego vehicle's
trajectory. Existing methods fail to fully exploit this valuable resource,
missing important opportunities to improve the models' performance (including
long-tail scenarios) through learning from other experts. Intuitively, jointly
learning from both ego and other vehicles' expert data is beneficial for
planning tasks. However, this joint learning faces two critical challenges. (1)
Different scene observation perspectives across vehicles hinder inter-vehicle
alignment of scene feature representations; (2) The absence of partial modality
in other vehicles' data (e.g., vehicle states) compared to ego-vehicle data
introduces learning bias. To address these challenges, we propose FUMP (Fully
Unified Motion Planning), a novel two-stage trajectory generation framework.
Building upon probabilistic decomposition, we model the planning task as a
specialized subtask of motion prediction. Specifically, our approach decouples
trajectory planning into two stages. In Stage 1, a shared decoder jointly
generates initial trajectories for both tasks. In Stage 2, the model performs
planning-specific refinement conditioned on an ego-vehicle's state. The
transition between the two stages is bridged by a state predictor trained
exclusively on ego-vehicle data. To address the cross-vehicle discrepancy in
observational perspectives, we propose an Equivariant Context-Sharing Adapter
(ECSA) before Stage 1 for improving cross-vehicle generalization of scene
representations.
♻ ☆ Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
Hong Zhang, Zhongjie Duan, Xingjun Wang, Yuze Zhao, Weiyi Lu, Zhipeng Di, Yixuan Xu, Yingda Chen, Yu Zhang
Unified multimodal generative models aim to integrate image understanding and
generation abilities, offering significant advantages in harnessing multimodal
corpora, particularly interleaved text-image data. However, existing unified
models exhibit limitations in image synthesis quality, autoregressive error
accumulation, and image editing capability. In this work, we propose Nexus-Gen,
a novel architecture that unifies image understanding, generation, and editing
tasks in a shared image embedding space. This shared space serves as a bridge
for the autoregressive and diffusion models, which seamlessly integrates their
complementary strengths in cross-modal modeling. To mitigate the severe error
accumulation during autoregressive embedding prediction, we propose a novel
prefilled autoregression strategy that aligns training-inference dynamics by
prefilling input sequences with learnable embeddings. After multi-stage and
multi-task training on our constructed large-scale dataset with 26.3 million
samples, Nexus-Gen achieves state-of-the-art performance on the evaluation
benchmarks spanning image understanding, generation and editing tasks. All
models, datasets, and source codes are released in
https://github.com/modelscope/Nexus-Gen to facilitate further advancements
across the field.
♻ ☆ COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation ICCV 2025
Cell instance segmentation (CIS) is crucial for identifying individual cell
morphologies in histopathological images, providing valuable insights for
biological and medical research. While unsupervised CIS (UCIS) models aim to
reduce the heavy reliance on labor-intensive image annotations, they fail to
accurately capture cell boundaries, causing missed detections and poor
performance. Recognizing the absence of error-free instances as a key
limitation, we present COIN (COnfidence score-guided INstance distillation), a
novel annotation-free framework with three key steps: (1) Increasing the
sensitivity for the presence of error-free instances via unsupervised semantic
segmentation with optimal transport, leveraging its ability to discriminate
spatially minor instances, (2) Instance-level confidence scoring to measure the
consistency between model prediction and refined mask and identify highly
confident instances, offering an alternative to ground truth annotations, and
(3) Progressive expansion of confidence with recursive self-distillation.
Extensive experiments across six datasets show COIN outperforming existing UCIS
methods, even surpassing semi- and weakly-supervised approaches across all
metrics on the MoNuSeg and TNBC datasets. The code is available at
https://github.com/shjo-april/COIN.
comment: Accepted at ICCV 2025
♻ ☆ Learning and Transferring Better with Depth Information in Visual Reinforcement Learning
Depth information is robust to scene appearance variations and inherently
carries 3D spatial details. In this paper, a visual backbone based on the
vision transformer is proposed to fuse RGB and depth modalities for enhancing
generalization. Different modalities are first processed by separate CNN stems,
and the combined convolutional features are delivered to the scalable vision
transformer to obtain visual representations. Moreover, a contrastive
unsupervised learning scheme is designed with masked and unmasked tokens to
accelerate the sample efficiency during the reinforcement learning progress.
For sim2real transfer, a flexible curriculum learning schedule is developed to
deploy domain randomization over training processes.
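A compact sketch of the fusion backbone described above: separate CNN stems process RGB and depth, their convolutional features are combined, and the result is tokenized and passed to a transformer encoder. Stem design, token layout, and pooling are illustrative assumptions.

```python
# Hedged sketch of an RGB-D backbone: modality-specific CNN stems + shared transformer.
import torch
import torch.nn as nn

class RGBDBackbone(nn.Module):
    def __init__(self, dim=128, depth=4, heads=4):
        super().__init__()
        def stem(in_ch):
            return nn.Sequential(nn.Conv2d(in_ch, dim, 8, stride=8), nn.GELU())
        self.rgb_stem, self.depth_stem = stem(3), stem(1)     # modality-specific CNN stems
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)

    def forward(self, rgb, depth):
        f = self.rgb_stem(rgb) + self.depth_stem(depth)       # combine convolutional features
        tokens = f.flatten(2).transpose(1, 2)                 # (B, H*W, dim) visual tokens
        return self.transformer(tokens).mean(dim=1)           # pooled visual representation

feat = RGBDBackbone()(torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64))
print(feat.shape)   # torch.Size([2, 128])
```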
♻ ☆ CycleSAM: Few-Shot Surgical Scene Segmentation with Cycle- and Scene-Consistent Feature Matching
Surgical image segmentation is highly challenging, primarily due to scarcity
of annotated data. Generalist prompted segmentation models like the
Segment-Anything Model (SAM) can help tackle this task, but because they
require image-specific visual prompts for effective performance, their use is
limited to improving data annotation efficiency. Recent approaches extend SAM
to automatic segmentation by using a few labeled reference images to predict
point prompts; however, they rely on feature matching pipelines that lack
robustness to out-of-domain data like surgical images. To tackle this problem,
we introduce CycleSAM, an improved visual prompt learning approach that employs
a data-efficient training phase and enforces a series of soft constraints to
produce high-quality feature similarity maps. CycleSAM addresses the domain gap
in a label-efficient manner by leveraging surgery-specific self-supervised feature
extractors, then adapts the resulting features through a short
parameter-efficient training stage, enabling it to produce informative
similarity maps. CycleSAM further filters the similarity maps with a series of
consistency constraints before robustly sampling diverse point prompts for each
object instance. In our experiments on four diverse surgical datasets, we find
that CycleSAM outperforms existing few-shot SAM approaches by a factor of 2-4x
in both 1-shot and 5-shot settings, while also achieving strong performance
gains over traditional linear probing, parameter-efficient adaptation, and
pseudo-labeling methods.
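To make the matching idea concrete, here is a rough sketch of a cycle-consistency check for selecting point prompts: a labeled reference point is matched forward into the target feature map, the best target location is matched back, and the match is kept only if the cycle returns near the original point. Feature shapes, the tolerance, and the similarity measure are assumptions, not CycleSAM's full set of constraints.

```python
# Hedged sketch of cycle-consistent feature matching for point-prompt selection.
import torch
import torch.nn.functional as F

def cycle_consistent_match(ref_feat, tgt_feat, ref_xy, cycle_tol=2):
    """ref_feat/tgt_feat: (C, H, W) dense features; ref_xy: (x, y) of a labeled reference point."""
    C, H, W = ref_feat.shape
    ref_vec = F.normalize(ref_feat[:, ref_xy[1], ref_xy[0]], dim=0)
    tgt = F.normalize(tgt_feat.flatten(1), dim=0)             # (C, H*W), per-location unit vectors
    # forward match: reference point -> best target location
    fwd_idx = (ref_vec @ tgt).argmax()
    ty, tx = divmod(fwd_idx.item(), W)
    # backward match: that target location -> best reference location
    tgt_vec = F.normalize(tgt_feat[:, ty, tx], dim=0)
    ref = F.normalize(ref_feat.flatten(1), dim=0)
    back_idx = (tgt_vec @ ref).argmax()
    by, bx = divmod(back_idx.item(), W)
    # keep the match only if the cycle returns close to the original reference point
    consistent = abs(bx - ref_xy[0]) <= cycle_tol and abs(by - ref_xy[1]) <= cycle_tol
    return (tx, ty), consistent
```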
♻ ☆ Advancing Depth Anything Model for Unsupervised Monocular Depth Estimation in Endoscopy IROS2025
Depth estimation is a cornerstone of 3D reconstruction and plays a vital role
in minimally invasive endoscopic surgeries. However, most current depth
estimation networks rely on traditional convolutional neural networks, which
are limited in their ability to capture global information. Foundation models
offer a promising approach to enhance depth estimation, but those models
currently available are primarily trained on natural images, leading to
suboptimal performance when applied to endoscopic images. In this work, we
introduce a novel fine-tuning strategy for the Depth Anything Model and
integrate it with an intrinsic-based unsupervised monocular depth estimation
framework. Our approach includes a low-rank adaptation technique based on
random vectors, which improves the model's adaptability to different scales.
Additionally, we propose a residual block built on depthwise separable
convolution to compensate for the transformer's limited ability to capture
local features. Our experimental results on the SCARED dataset and Hamlyn
dataset show that our method achieves state-of-the-art performance while
minimizing the number of trainable parameters. Applying this method in
minimally invasive endoscopic surgery can enhance surgeons' spatial awareness,
thereby improving the precision and safety of the procedures.
comment: Accepted by IROS2025, 8 pages, 7 figures
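For reference, below is a generic low-rank adaptation (LoRA-style) wrapper around a frozen linear layer; the paper's variant builds the low-rank factors from random vectors, so the initialization and rank here are illustrative assumptions rather than the method's exact form.

```python
# Hedged sketch of a low-rank adapter added to a frozen linear layer.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, frozen_linear: nn.Linear, rank=4, alpha=8.0):
        super().__init__()
        self.base = frozen_linear
        for p in self.base.parameters():
            p.requires_grad = False                       # only the adapter is trained
        in_f, out_f = frozen_linear.in_features, frozen_linear.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))   # zero init: no change at the start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LowRankAdapter(nn.Linear(256, 256))
print(layer(torch.rand(2, 256)).shape)   # torch.Size([2, 256])
```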
♻ ☆ Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
The rapid evolution of multimodal large language models (MLLMs) has
significantly enhanced their real-world applications. However, achieving
consistent performance across languages, especially when integrating cultural
knowledge, remains a significant challenge. To better assess this issue, we
introduce two new benchmarks: KnowRecall and VisRecall, which evaluate
cross-lingual consistency in MLLMs. KnowRecall is a visual question answering
benchmark designed to measure factual knowledge consistency in 15 languages,
focusing on cultural and historical questions about global landmarks. VisRecall
assesses visual memory consistency by asking models to describe landmark
appearances in 9 languages without access to images. Experimental results
reveal that state-of-the-art MLLMs, including proprietary ones, still struggle
to achieve cross-lingual consistency. This underscores the need for more robust
approaches that produce truly multilingual and culturally aware models.
comment: https://github.com/nlp-waseda/traveling-across-languages
♻ ☆ Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control
Hyperspectral pansharpening has received much attention in recent years due
to technological and methodological advances that open the door to new
application scenarios. However, research on this topic is only now gaining
momentum. The most popular methods are still borrowed from the more mature
field of multispectral pansharpening and often overlook the unique challenges
posed by hyperspectral data fusion, such as i) the very large number of bands,
ii) the overwhelming noise in selected spectral ranges, iii) the significant
spectral mismatch between panchromatic and hyperspectral components, iv) a
typically high resolution ratio. Imprecise data modeling especially affects
spectral fidelity. Even state-of-the-art methods perform well in certain
spectral ranges and much worse in others, failing to ensure consistent quality
across all bands, with the risk of generating unreliable results. Here, we
propose a hyperspectral pansharpening method that explicitly addresses this
problem and ensures uniform spectral quality. To this end, a single lightweight
neural network is used, with weights that adapt on the fly to each band. During
fine-tuning, the spatial loss is turned on and off to ensure a fast convergence
of the spectral loss to the desired level, according to a hysteresis-like
dynamic. Furthermore, the spatial loss itself is appropriately redefined to
account for nonlinear dependencies between panchromatic and spectral bands.
Overall, the proposed method is fully unsupervised, with no prior training on
external data, flexible, and low-complexity. Experiments on a recently
published benchmarking toolbox show that it ensures excellent sharpening
quality, competitive with the state-of-the-art, consistently across all bands.
The software code and the full set of results are shared online on
https://github.com/giu-guarino/rho-PNN.
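A minimal sketch of one way the hysteresis-like toggling of the spatial loss could be realized during per-image fine-tuning: the spatial term is disabled when the spectral loss drifts above an upper threshold and re-enabled once it drops below a lower one. The thresholds, the toggle direction, and the class name are assumptions, not the paper's exact rule.

```python
# Hedged sketch of a hysteresis gate controlling the spatial loss term.
class HysteresisGate:
    def __init__(self, low=0.01, high=0.02):
        self.low, self.high, self.spatial_on = low, high, True

    def update(self, spectral_loss_value: float) -> bool:
        if self.spatial_on and spectral_loss_value > self.high:
            self.spatial_on = False        # prioritize restoring spectral fidelity
        elif not self.spatial_on and spectral_loss_value < self.low:
            self.spatial_on = True         # spectral quality recovered, re-enable the spatial loss
        return self.spatial_on

# In the fine-tuning loop (loss tensors are assumed to exist):
#   total = spectral_loss + (spatial_loss if gate.update(spectral_loss.item()) else 0.0)
```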
♻ ☆ Longitudinal Study of Facial Biometrics at the BEZ: Temporal Variance Analysis
Mathias Schulz, Alexander Spenke, Pia Funk, Florian Blümel, Markus Rohde, Ralph Breithaupt, Gerd Nolden, Norbert Jung, Robert Lange
This study presents findings from long-term biometric evaluations conducted
at the Biometric Evaluation Center (bez). Over the course of two and a half
years of ongoing research, over 400 participants representing diverse
ethnicities, genders, and age groups were regularly assessed using a variety of
biometric tools and techniques at the controlled testing facilities. Our
findings are based on the General Data Protection Regulation-compliant local
bez database with more than 238,000 biometric data sets categorized into
multiple biometric modalities such as face and finger. We used state-of-the-art
face recognition algorithms to analyze long-term comparison scores. Our results
show that these scores fluctuate more significantly between individual days
than over the entire measurement period. These findings highlight the
importance of testing biometric characteristics of the same individuals over a
longer period of time in a controlled measurement environment and lay the
groundwork for future advancements in biometric data analysis.
comment: 11 pages, 10 figures, 8 tables
♻ ☆ EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention
Crack detection on road surfaces is a critical measurement technology in the
instrumentation domain, essential for ensuring infrastructure safety and
transportation reliability. However, due to limited energy and low-resolution
imaging, smart terminal devices struggle to maintain real-time monitoring
performance. To overcome these challenges, this paper proposes a multi-stage
detection approach for road crack detection, EECD-Net, to enhance accuracy and
energy efficiency of instrumentation. Specifically, the sophisticated
Super-Resolution Convolutional Neural Network (SRCNN) is employed to address
the inherent challenges of low-quality images, which effectively enhance image
resolution while preserving critical structural details. Meanwhile, a Spike
Convolution Unit (SCU) with Continuous Integrate-and-Fire (CIF) neurons is
proposed to convert these images into sparse pulse sequences, significantly
reducing power consumption. Additionally, a Gated Attention Transformer (GAT)
module is designed to strategically fuse multi-scale feature representations
through adaptive attention mechanisms, effectively capturing both long-range
dependencies and intricate local crack patterns, and significantly enhancing
detection robustness across varying crack morphologies. The experiments on the
CrackVision12K benchmark demonstrate that EECD-Net achieves a remarkable 98.6\%
detection accuracy, surpassing state-of-the-art counterparts such as
Hybrid-Segmentor by a significant 1.5\%. Notably, the EECD-Net maintains
exceptional energy efficiency, consuming merely 5.6 mJ, which is a substantial
33\% reduction compared to baseline implementations. This work pioneers a
transformative approach in instrumentation-based crack detection, offering a
scalable, low-power solution for real-time, large-scale infrastructure
monitoring in resource-constrained environments.
comment: After further careful review and additional checks, we have
identified multiple issues in our experimental results and data analysis that
significantly affect the validity and reliability of our findings. We believe
that these issues are substantial enough to compromise the scientific
integrity of the manuscript
♻ ☆ Driving by Hybrid Navigation: An Online HD-SD Map Association Framework and Benchmark for Autonomous Vehicles
Autonomous vehicles rely on global standard-definition (SD) maps for
road-level route planning and online local high-definition (HD) maps for
lane-level navigation. However, recent work concentrates on constructing online HD
maps, often overlooking the association of global SD maps with online HD maps
for hybrid navigation, which makes it challenging to utilize online HD maps in the
real world. Observing this gap in the navigation capability of autonomous
vehicles, we introduce \textbf{O}nline \textbf{M}ap \textbf{A}ssociation, the
first benchmark for the association of hybrid navigation-oriented online maps,
which enhances the planning capabilities of autonomous vehicles. Based on
existing datasets, the OMA contains 480k roads and 260k lane paths and
provides the corresponding metrics to evaluate the performance of the model.
Additionally, we propose a novel framework, named Map Association Transformer,
as the baseline method, using path-aware attention and spatial attention
mechanisms to enable the understanding of geometric and topological
correspondences. The code and dataset can be accessed at
https://github.com/WallelWan/OMA-MAT.
comment: Fix bug for repeat reference
♻ ☆ MARL-MambaContour: Unleashing Multi-Agent Deep Reinforcement Learning for Active Contour Optimization in Medical Image Segmentation
We introduce MARL-MambaContour, the first contour-based medical image
segmentation framework based on Multi-Agent Reinforcement Learning (MARL). Our
approach reframes segmentation as a multi-agent cooperation task focused on
generating topologically consistent object-level contours, addressing the
limitations of traditional pixel-based methods, which can lack topological
constraints and holistic structural awareness of anatomical regions. Each
contour point is modeled as an autonomous agent that iteratively adjusts its
position to align precisely with the target boundary, enabling adaptation to
blurred edges and intricate morphologies common in medical images. This
iterative adjustment process is optimized by a contour-specific Soft
Actor-Critic (SAC) algorithm, further enhanced with the Entropy Regularization
Adjustment Mechanism (ERAM), which dynamically balances agent exploration with
contour smoothness. Furthermore, the framework incorporates a Mamba-based
policy network featuring a novel Bidirectional Cross-attention Hidden-state
Fusion Mechanism (BCHFM). This mechanism mitigates potential memory confusion
limitations associated with long-range modeling in state space models, thereby
facilitating more accurate inter-agent information exchange and informed
decision-making. Extensive experiments on five diverse medical imaging datasets
demonstrate the state-of-the-art performance of MARL-MambaContour, highlighting
its potential as an accurate and robust clinical application.
♻ ☆ Pavlok-Nudge: A Feedback Mechanism for Atomic Behaviour Modification with Snoring Usecase
This paper proposes an atomic behaviour intervention strategy using the
Pavlok wearable device. Pavlok utilises beeps, vibrations and shocks as
aversion techniques to help individuals with behaviour modification. While
the device can be useful in certain periodic daily life situations, like alarms
and exercise notifications, it relies on manual operations that limit its
usage. To automate behaviour modification, we propose a framework that first
detects targeted behaviours through a lightweight deep learning model and
subsequently nudges the user. Our proposed solution is implemented and verified
in the context of snoring: it captures audio from the environment and predicts
whether the audio content is a snore using a lightweight 1D convolutional neural
network. Based on the prediction, we use Pavlok to
nudge users for preventive measures, such as a change in sleeping posture. We
believe that this simple solution can help people change their atomic habits,
which may lead to long-term health benefits. Our proposed lightweight model
(99.8% fewer parameters over SOTA; 790,273$\rightarrow$1,337) achieves SOTA
test accuracy of 0.99 on a public benchmark. The code and model are publicly
available at https://github.com/hasan-rakibul/pavlok-nudge-snore.
comment: Md Rakibul Hasan and Shreya Ghosh are co-first authors
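As a rough illustration of a lightweight 1D-CNN snore classifier of roughly the reported parameter scale, here is a tiny sketch; the layer sizes are assumptions and do not reproduce the paper's exact architecture.

```python
# Hedged sketch of a tiny 1D-CNN binary snore classifier.
import torch
import torch.nn as nn

class TinySnoreNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.classifier = nn.Linear(16, 2)        # snore vs. non-snore

    def forward(self, waveform):                  # waveform: (B, 1, num_samples)
        return self.classifier(self.features(waveform))

model = TinySnoreNet()
print(sum(p.numel() for p in model.parameters()))   # ~1.3k parameters, similar in scale to the reported 1,337
logits = model(torch.randn(2, 1, 16000))             # one second of 16 kHz audio
```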
♻ ☆ Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA ICCV 2025
Amodal segmentation aims to infer the complete shape of occluded objects,
even when the occluded region's appearance is unavailable. However, current
amodal segmentation methods lack the capability to interact with users through
text input and struggle to understand or reason about implicit and complex
purposes. While methods like LISA integrate multi-modal large language models
(LLMs) with segmentation for reasoning tasks, they are limited to predicting
only visible object regions and face challenges in handling complex occlusion
scenarios. To address these limitations, we propose a novel task named amodal
reasoning segmentation, aiming to predict the complete amodal shape of occluded
objects while providing answers with elaborations based on user text input. We
develop a generalizable dataset generation pipeline and introduce a new dataset
focusing on daily life scenarios, encompassing diverse real-world occlusions.
Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a
novel model with advanced global and spatial-level designs specifically
tailored to handle complex occlusions. Extensive experiments validate AURA's
effectiveness on the proposed dataset.
comment: Accepted by ICCV 2025, 17 pages, 9 figures, 5 tables
♻ ☆ Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
Chain-of-thought (CoT) reasoning greatly improves the interpretability and
problem-solving abilities of multimodal large language models (MLLMs). However,
existing approaches are focused on text CoT, limiting their ability to leverage
visual cues. Visual CoT remains underexplored, and the only work is based on
supervised fine-tuning (SFT) that relies on extensive labeled bounding-box data
and is hard to generalize to unseen cases. In this paper, we introduce
Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT
reasoning via preference optimization. UV-CoT performs preference comparisons
between model-generated bounding boxes (one is preferred and the other is
dis-preferred), eliminating the need for bounding-box annotations. We get such
preference data by introducing an automatic data generation pipeline. Given an
image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using
a template prompt and then answers the question using each bounded region as
input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these
rankings serve as supervision to train the target MLLM with UV-CoT by
minimizing negative log-likelihood losses. By emulating human
perception--identifying key regions and reasoning based on them--UV-CoT can
improve visual comprehension, particularly in spatial reasoning tasks where
textual descriptions alone fall short. Our experiments on six datasets
demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual
and visual CoT methods. Our zero-shot testing on four unseen datasets shows the
strong generalization of UV-CoT. The code is available in
https://github.com/kesenzhao/UV-CoT.
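To make the preference-optimization step concrete, below is a DPO-style objective over preferred versus dis-preferred responses, one common form of such losses; UV-CoT's exact objective may differ, so treat this as an assumption-laden sketch.

```python
# Hedged sketch of a DPO-style preference loss over ranked responses.
import torch
import torch.nn.functional as F

def preference_loss(logp_pref, logp_dispref, ref_logp_pref, ref_logp_dispref, beta=0.1):
    """All inputs are summed token log-probabilities of the full responses."""
    margin = (logp_pref - ref_logp_pref) - (logp_dispref - ref_logp_dispref)
    return -F.logsigmoid(beta * margin).mean()    # negative log-likelihood of the preference

# Usage: log-probs from the trained policy and a frozen reference model
loss = preference_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                       torch.tensor([-12.8]), torch.tensor([-15.0]))
```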
♻ ☆ GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding ICCV 2025
Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
Pixel grounding, encompassing tasks such as Referring Expression Segmentation
(RES), has garnered considerable attention due to its immense potential for
bridging the gap between vision and language modalities. However, advancements
in this domain are currently constrained by limitations inherent in existing
datasets, including limited object categories, insufficient textual diversity,
and a scarcity of high-quality annotations. To mitigate these limitations, we
introduce GroundingSuite, which comprises: (1) an automated data annotation
framework leveraging multiple Vision-Language Model (VLM) agents; (2) a
large-scale training dataset encompassing 9.56 million diverse referring
expressions and their corresponding segmentations; and (3) a meticulously
curated evaluation benchmark consisting of 3,800 images. The GroundingSuite
training dataset facilitates substantial performance improvements, enabling
models trained on it to achieve state-of-the-art results, specifically a cIoU
of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the
GroundingSuite annotation framework demonstrates superior efficiency compared
to the current leading data annotation method, i.e., $4.5 \times$ faster than
GLaMM.
comment: To appear at ICCV 2025. Code:
https://github.com/hustvl/GroundingSuite
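For readers unfamiliar with the reported metrics, the following minimal sketch computes gIoU and cIoU under the convention common in generalized RES work (gIoU as the mean of per-sample mask IoUs, cIoU as cumulative intersection over cumulative union); the benchmark's exact protocol may differ:

```python
# Minimal sketch of the two reported metrics under the usual RES convention.
import numpy as np

def g_ciou(preds, gts):
    """preds, gts: lists of boolean masks, each pair sharing the same shape."""
    per_sample_ious, inter_total, union_total = [], 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        per_sample_ious.append(inter / union if union > 0 else 1.0)
        inter_total += inter
        union_total += union
    giou = float(np.mean(per_sample_ious))            # mean of per-sample IoUs
    ciou = inter_total / union_total if union_total > 0 else 1.0
    return giou, ciou
```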
♻ ☆ MVCTrack: Boosting 3D Point Cloud Tracking via Multimodal-Guided Virtual Cues ICRA 2025
3D single object tracking is essential in autonomous driving and robotics.
Existing methods often struggle with sparse and incomplete point cloud
scenarios. To address these limitations, we propose a Multimodal-guided Virtual
Cues Projection (MVCP) scheme that generates virtual cues to enrich sparse
point clouds. Additionally, we introduce an enhanced tracker MVCTrack based on
the generated virtual cues. Specifically, the MVCP scheme seamlessly integrates
RGB sensors into LiDAR-based systems, leveraging a set of 2D detections to
create dense 3D virtual cues that substantially alleviate the sparsity of the point
clouds. These virtual cues integrate naturally with existing LiDAR-based 3D
trackers, yielding substantial performance gains. Extensive experiments
demonstrate that our method achieves competitive performance on the NuScenes
dataset.
comment: Accepted by ICRA 2025
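A hedged illustration of the general idea of lifting 2D detections into virtual 3D cues (not the paper's exact MVCP scheme; the per-box depth source here is an assumption for illustration):

```python
# Illustrative sketch: densify a sparse LiDAR cloud by back-projecting pixels
# inside a 2D detection box with a per-box depth estimate (e.g., the median depth
# of LiDAR hits falling in the box; the depth source is an assumption).
import numpy as np

def virtual_cues_from_box(box, depth, K, stride=4):
    """box: (x1, y1, x2, y2) in pixels; depth: scalar metres; K: 3x3 intrinsics."""
    x1, y1, x2, y2 = box
    us, vs = np.meshgrid(np.arange(x1, x2, stride), np.arange(y1, y2, stride))
    pix = np.stack([us.ravel(), vs.ravel(), np.ones(us.size)], axis=0)  # 3 x N homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                        # camera rays
    return (rays * depth).T                                              # N x 3 virtual points

# Usage idea: concatenate these virtual points with the original LiDAR points
# before feeding an unchanged LiDAR-based 3D tracker.
```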
♻ ☆ Retinex-RAWMamba: Bridging Demosaicing and Denoising for Low-Light RAW Image Enhancement
Low-light image enhancement, particularly in cross-domain tasks such as
mapping from the raw domain to the sRGB domain, remains a significant
challenge. Many deep learning-based methods have been developed to address this
issue and have shown promising results in recent years. However, single-stage
methods, which attempt to unify the complex mapping across both domains, suffer
from limited denoising performance. In contrast, existing two-stage
approaches typically overlook the characteristic of demosaicing within the
Image Signal Processing (ISP) pipeline, leading to color distortions under
varying lighting conditions, especially in low-light scenarios. To address
these issues, we propose a novel Mamba-based method customized for low-light
RAW images, called RAWMamba, to effectively handle raw images with different
CFAs. Furthermore, we introduce a Retinex Decomposition Module (RDM) grounded
in Retinex prior, which decouples illumination from reflectance to facilitate
more effective denoising and automatic non-linear exposure correction, reducing
the effect of manual linear illumination enhancement. By bridging demosaicing
and denoising, better enhancement for low light RAW images is achieved.
Experimental evaluations conducted on public datasets SID and MCR demonstrate
that our proposed RAWMamba achieves state-of-the-art performance on
cross-domain mapping. The code is available at
https://github.com/Cynicarlos/RetinexRawMamba.
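For intuition, the classical Retinex prior underlying the RDM models an image as reflectance times illumination, I = R ⊙ L; a simple non-learned decomposition is sketched below (the paper's module learns this decoupling end to end, so this is background, not the method itself):

```python
# Classical Retinex-style decomposition: estimate illumination as a heavily
# smoothed luminance map and recover reflectance by division.
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_decompose(img, sigma=15.0, eps=1e-6):
    """img: HxWxC float array in [0, 1]. Returns (reflectance, illumination)."""
    luminance = img.max(axis=-1, keepdims=True)                       # per-pixel luminance
    illumination = gaussian_filter(luminance, sigma=(sigma, sigma, 0)) + eps
    reflectance = np.clip(img / illumination, 0.0, 1.0)
    return reflectance, illumination
```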
♻ ☆ FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation
Open-vocabulary semantic segmentation (OVSS) aims to segment objects from
arbitrary text categories without requiring densely annotated datasets.
Although contrastive learning based models enable zero-shot segmentation, they
often lose fine spatial precision at pixel level, due to global representation
bias. In contrast, diffusion-based models naturally encode fine-grained spatial
features via attention mechanisms that capture both global context and local
details. However, they often face challenges in balancing the computation costs
and the quality of the segmentation mask. In this work, we present FA-Seg, a
Fast and Accurate training-free framework for open-vocabulary segmentation
based on diffusion models. FA-Seg performs segmentation with only a (1+1)-step
inference of a pretrained diffusion model. Moreover, instead of running multiple times
for different classes, FA-Seg performs segmentation for all classes at once. To
further enhance the segmentation quality, FA-Seg introduces three key
components: (i) a dual-prompt mechanism for discriminative, class-aware
attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD)
that enhances semantic precision via multi-resolution attention fusion, and
(iii) a Test-Time Flipping (TTF) scheme designed to improve spatial
consistency. Extensive experiments show that FA-Seg achieves state-of-the-art
training-free performance, obtaining 43.8% average mIoU across PASCAL VOC,
PASCAL Context, and COCO Object benchmarks while maintaining superior inference
efficiency. Our results demonstrate that FA-Seg provides a strong foundation
for extensibility, bridging the gap between segmentation quality and inference
efficiency. The source code will be open-sourced after this paper is accepted.
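The Test-Time Flipping (TTF) idea can be illustrated in a few lines; `segment` below is an assumed stand-in for the FA-Seg forward pass returning a class-score map:

```python
# Minimal sketch of Test-Time Flipping: average the class-score maps from the
# original image and from its horizontal mirror (mirrored prediction flipped back).
import numpy as np

def segment_with_ttf(image, segment):
    """image: HxWx3 array; segment: callable image -> (num_classes, H, W) score map."""
    scores = segment(image)
    flipped_scores = segment(image[:, ::-1, :])[:, :, ::-1]  # flip prediction back
    return 0.5 * (scores + flipped_scores)
```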
♻ ☆ (Almost) Free Modality Stitching of Foundation Models
Foundation multi-modal models are often designed by stitching together multiple
existing pretrained uni-modal models: for example, an image classifier with a
text model. This stitching process is performed by training a connector module
that aims to align the representation spaces of these uni-modal models towards
a multi-modal objective. However, given the complexity of training such
connectors on large scale web-based datasets coupled with the ever-increasing
number of available pretrained uni-modal models, the task of uni-modal model
selection and subsequent connector module training becomes computationally
demanding. To address this under-studied critical problem, we propose
Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal
uni-modal model selection and connector training by leveraging hypernetworks.
Specifically, our framework utilizes the parameter prediction capability of a
hypernetwork to obtain jointly trained connector modules for $N \times M$
combinations of uni-modal models. In our experiments, Hyma reduces the cost of
searching for the best performing uni-modal model pair by $10\times$, while
matching the ranking and trained connector performance obtained via grid search
across a suite of diverse multi-modal benchmarks.
comment: Pre-print
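A minimal sketch of the hypernetwork idea, assuming a single linear connector per uni-modal pair (dimensions and the connector form are illustrative assumptions, not the paper's design):

```python
# One small hypernetwork predicts the weights of a linear connector for each of
# the N x M uni-modal model pairs, so all candidate connectors are trained jointly.
import torch
import torch.nn as nn

class ConnectorHypernetwork(nn.Module):
    def __init__(self, num_pairs, in_dim, out_dim, embed_dim=64):
        super().__init__()
        self.pair_embed = nn.Embedding(num_pairs, embed_dim)
        self.weight_head = nn.Linear(embed_dim, in_dim * out_dim)
        self.bias_head = nn.Linear(embed_dim, out_dim)
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, pair_id, image_features):
        z = self.pair_embed(pair_id)                          # pair-specific code
        W = self.weight_head(z).view(self.out_dim, self.in_dim)
        b = self.bias_head(z)
        return image_features @ W.T + b                       # predicted connector applied

# Usage (hypothetical dimensions):
#   hyper = ConnectorHypernetwork(num_pairs=N * M, in_dim=768, out_dim=512)
#   aligned = hyper(torch.tensor(3), image_features)   # connector for pair 3
```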
♻ ☆ PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening ICCV 2025
PAN-sharpening aims to fuse high-resolution panchromatic (PAN) images with
low-resolution multi-spectral (MS) images to generate high-resolution
multi-spectral (HRMS) outputs. However, cross-modality misalignment -- caused
by sensor placement, acquisition timing, and resolution disparity -- poses a
fundamental challenge. Conventional deep learning methods assume perfect
pixel-wise alignment and rely on per-pixel reconstruction losses, leading to
spectral distortion, double edges, and blurring when misalignment is present.
To address this, we propose PAN-Crafter, a modality-consistent alignment
framework that explicitly mitigates the misalignment gap between PAN and MS
modalities. At its core, Modality-Adaptive Reconstruction (MARs) enables a
single network to jointly reconstruct HRMS and PAN images, leveraging PAN's
high-frequency details as auxiliary self-supervision. Additionally, we
introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel mechanism
that bidirectionally aligns MS texture to PAN structure and vice versa,
enabling adaptive feature refinement across modalities. Extensive experiments
on multiple benchmark datasets demonstrate that our PAN-Crafter outperforms the
most recent state-of-the-art method in all metrics, while running 50.11$\times$
faster and using only 0.63$\times$ the memory. Furthermore, it
demonstrates strong generalization performance on unseen satellite datasets,
showing its robustness across different conditions.
comment: ICCV 2025 (camera-ready version). Please visit our project page
https://kaist-viclab.github.io/PAN-Crafter_site
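A generic sketch of bidirectional cross-modality attention in the spirit of CM3A (the paper's actual alignment-aware mechanism likely differs; this only shows the two-way attention pattern):

```python
# MS features attend to PAN features and vice versa, then the refined streams
# are fused. A generic cross-attention block, not the paper's exact design.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.ms_to_pan = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pan_to_ms = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, ms_tokens, pan_tokens):
        # ms_tokens, pan_tokens: (B, N, dim) flattened spatial features.
        ms_refined, _ = self.ms_to_pan(ms_tokens, pan_tokens, pan_tokens)
        pan_refined, _ = self.pan_to_ms(pan_tokens, ms_tokens, ms_tokens)
        return self.fuse(torch.cat([ms_refined, pan_refined], dim=-1))
```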
♻ ☆ A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection
Jiangning Zhang, Haoyang He, Zhenye Gan, Qingdong He, Yuxuan Cai, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu
Visual anomaly detection aims to identify anomalous regions in images through
unsupervised learning paradigms, with increasing application demand and value
in fields such as industrial inspection and medical lesion detection. Despite
significant progress in recent years, there is a lack of comprehensive
benchmarks to adequately evaluate the performance of various mainstream methods
across different datasets under the practical multi-class setting. The absence
of standardized experimental setups can lead to potential biases in training
epochs, resolution, and metric results, resulting in erroneous conclusions.
This paper addresses this issue by proposing a comprehensive visual anomaly
detection benchmark, ADer, which is a modular framework that is highly
extensible for new methods. The benchmark includes multiple datasets from
industrial and medical domains, implementing fifteen state-of-the-art methods
and nine comprehensive metrics. Additionally, we have proposed the GPU-assisted
ADEval package to address the slow evaluation problem of metrics like
time-consuming mAU-PRO on large-scale data, significantly reducing evaluation
time by more than \textit{1000-fold}. Through extensive experimental results,
we objectively reveal the strengths and weaknesses of different methods and
provide insights into the challenges and future directions of multi-class
visual anomaly detection. We hope that ADer will become a valuable resource for
researchers and practitioners in the field, promoting the development of more
robust and generalizable anomaly detection systems. The full code is open-sourced
at https://github.com/zhangzjn/ader.
♻ ☆ From Real Artifacts to Virtual Reference: A Robust Framework for Translating Endoscopic Images
Domain adaptation, which bridges the distributions across different
modalities, plays a crucial role in multimodal medical image analysis. In
endoscopic imaging, combining pre-operative data with intra-operative imaging
is important for surgical planning and navigation. However, existing domain
adaptation methods are hampered by distribution shift caused by in vivo
artifacts, necessitating robust techniques for aligning noisy, artifact-abundant
patient endoscopic videos with clean virtual images reconstructed from
pre-operative tomographic data for pose estimation during intraoperative
guidance. This paper presents an artifact-resilient image translation method
and an associated benchmark for this purpose. The method incorporates a novel
``local-global'' translation framework and a noise-resilient feature extraction
strategy. For the former, it decouples the image translation process into a
local step for feature denoising, and a global step for global style transfer.
For feature extraction, a new contrastive learning strategy is proposed, which
can extract noise-resilient features for establishing robust correspondence
across domains. Detailed validation on both public and in-house clinical
datasets has been conducted, demonstrating significantly improved performance
compared to the current state-of-the-art.
comment: The conclusions of the paper contain errors. It requires substantial
re-evaluation, and I plan to resubmit an updated version in the future
♻ ☆ Text-Visual Semantic Constrained AI-Generated Image Quality Assessment
With the rapid advancements in Artificial Intelligence Generated Image (AGI)
technology, the accurate assessment of their quality has become an increasingly
vital requirement. Prevailing methods typically rely on cross-modal models like
CLIP or BLIP to evaluate text-image alignment and visual quality. However, when
applied to AGIs, these methods encounter two primary challenges: semantic
misalignment and a failure to perceive fine-grained details. To address these limitations, we
propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment
(SC-AGIQA), a unified framework that leverages text-visual semantic constraints
to significantly enhance the comprehensive evaluation of both text-image
consistency and perceptual distortion in AI-generated images. Our approach
integrates key capabilities from multiple models and tackles the aforementioned
challenges by introducing two core modules: the Text-assisted Semantic
Alignment Module (TSAM), which leverages Multimodal Large Language Models
(MLLMs) to bridge the semantic gap by generating an image description and
comparing it against the original prompt for a refined consistency check, and
the Frequency-domain Fine-Grained Degradation Perception Module (FFDPM), which
draws inspiration from Human Visual System (HVS) properties by employing
frequency domain analysis combined with perceptual sensitivity weighting to
better quantify subtle visual distortions and enhance the capture of
fine-grained visual quality details in images. Extensive experiments conducted
on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing
state-of-the-art methods. The code is publicly available at
https://github.com/mozhu1/SC-AGIQA.
comment: 9 pages, 5 figures, Accepted at ACMMM 2025
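A hedged sketch of a TSAM-style consistency check, with the captioning MLLM and text encoder left as assumed callables rather than the paper's exact components:

```python
# An MLLM captions the generated image; the caption is then compared to the
# original prompt in a text-embedding space. The cosine score is only one
# possible consistency measure.
import numpy as np

def text_consistency_score(image, prompt, caption_image, embed_text):
    """caption_image: image -> str; embed_text: str -> 1D numpy vector."""
    description = caption_image(image)
    a, b = embed_text(description), embed_text(prompt)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```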
♻ ☆ PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models ICCV 2025
Jinhua Zhang, Hualian Sheng, Sijia Cai, Bing Deng, Qiao Liang, Wen Li, Ying Fu, Jieping Ye, Shuhang Gu
Controllable generation is considered a potentially vital approach to address
the challenge of annotating 3D data, and the precision of such controllable
generation becomes particularly imperative in the context of data production
for autonomous driving. Existing methods focus on the integration of diverse
generative information into controlling inputs, utilizing frameworks such as
GLIGEN or ControlNet, to produce commendable outcomes in controllable
generation. However, such approaches intrinsically restrict generation
performance to the learning capacities of predefined network architectures. In
this paper, we explore the innovative integration of controlling information
and introduce PerLDiff (\textbf{Per}spective-\textbf{L}ayout \textbf{Diff}usion
Models), a novel method for effective street view image generation that fully
leverages perspective 3D geometric information. Our PerLDiff employs 3D
geometric priors to guide the generation of street view images with precise
object-level control within the network learning process, resulting in a more
robust and controllable output. Moreover, it demonstrates superior
controllability compared to alternative layout control methods. Empirical
results show that our PerLDiff markedly enhances the precision of
controllable generation on the NuScenes and KITTI datasets.
comment: Accepted by ICCV 2025
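As a rough illustration of turning 3D geometric priors into per-object perspective layout guidance, the sketch below projects a 3D box into a coarse image-plane mask; the axis-aligned rasterisation is a simplification, and how PerLDiff injects such guidance into the diffusion network is not shown:

```python
# Project a 3D box's corners through the camera to obtain a coarse per-object
# layout mask in the image plane (not PerLDiff's implementation).
import numpy as np

def perspective_layout_mask(corners_3d, K, extrinsic, image_hw):
    """corners_3d: (8, 3) box corners; K: 3x3 intrinsics; extrinsic: 3x4 [R|t]."""
    h, w = image_hw
    cam = extrinsic @ np.hstack([corners_3d, np.ones((8, 1))]).T   # (3, 8) camera coords
    pix = K @ cam
    pix = pix[:2] / np.clip(pix[2:], 1e-6, None)                   # (2, 8) pixel coords
    ys, xs = np.mgrid[0:h, 0:w]
    return ((xs >= pix[0].min()) & (xs <= pix[0].max()) &
            (ys >= pix[1].min()) & (ys <= pix[1].max()))
```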
♻ ☆ Unveiling Differences in Generative Models: A Scalable Differential Clustering Approach
A fine-grained comparison of generative models requires the identification of
sample types generated differently by each of the involved models. While
quantitative scores have been proposed in the literature to rank different
generative models, score-based evaluation and ranking do not reveal the nuanced
differences between the generative models in producing different sample types.
In this work, we propose solving a differential clustering problem to detect
sample types generated differently by two generative models. To solve the
differential clustering problem, we develop a spectral method called
Fourier-based Identification of Novel Clusters (FINC) to identify modes
produced by a generative model with a higher frequency in comparison to a
reference distribution. FINC provides a scalable algorithm based on random
Fourier features to estimate the eigenspace of kernel covariance matrices of
two generative models and utilize the principal eigendirections to detect the
sample types present more dominantly in each model. We demonstrate the
application of the FINC method to large-scale computer vision datasets and
generative modeling frameworks. Our numerical results suggest the scalability
of the developed Fourier-based method in highlighting the sample types produced
with different frequencies by generative models. The project code is available
at https://github.com/buyeah1109/FINC.
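A simplified reading of the method as code (random Fourier features, a feature-covariance difference, and its top eigendirections); the exact FINC algorithm may differ in detail:

```python
# Approximate a Gaussian kernel with random Fourier features, form the feature
# covariance of samples from each model, and eigendecompose their difference;
# top eigendirections indicate sample types over-represented in model A
# relative to model B.
import numpy as np

def rff(x, omega, b):
    """Random Fourier features for the Gaussian kernel; x: (n, d)."""
    return np.sqrt(2.0 / omega.shape[1]) * np.cos(x @ omega + b)

def differential_directions(samples_a, samples_b, num_features=2000, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = samples_a.shape[1]
    omega = rng.normal(scale=1.0 / sigma, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    phi_a, phi_b = rff(samples_a, omega, b), rff(samples_b, omega, b)
    diff_cov = phi_a.T @ phi_a / len(phi_a) - phi_b.T @ phi_b / len(phi_b)
    eigvals, eigvecs = np.linalg.eigh(diff_cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:5]]   # directions favouring model A
    scores_a = phi_a @ top                            # how strongly each sample loads on them
    return eigvals, top, scores_a
```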
♻ ☆ Graph-based Multi-Modal Interaction Lightweight Network for Brain Tumor Segmentation (GMLN-BTS) in Edge Iterative MRI Lesion Localization System (EdgeIMLocSys)
Brain tumor segmentation plays a critical role in clinical diagnosis and
treatment planning, yet the variability in imaging quality across different MRI
scanners presents significant challenges to model generalization. To address
this, we propose the Edge Iterative MRI Lesion Localization System
(EdgeIMLocSys), which integrates Continuous Learning from Human Feedback to
adaptively fine-tune segmentation models based on clinician feedback, thereby
enhancing robustness to scanner-specific imaging characteristics. Central to
this system is the Graph-based Multi-Modal Interaction Lightweight Network for
Brain Tumor Segmentation (GMLN-BTS), which employs a Modality-Aware Adaptive
Encoder (M2AE) to extract multi-scale semantic features efficiently, and a
Graph-based Multi-Modal Collaborative Interaction Module (G2MCIM) to model
complementary cross-modal relationships via graph structures. Additionally, we
introduce a novel Voxel Refinement UpSampling Module (VRUM) that
synergistically combines linear interpolation and multi-scale transposed
convolutions to suppress artifacts while preserving high-frequency details,
improving segmentation boundary accuracy. Our proposed GMLN-BTS model achieves
a Dice score of 85.1% on the BraTS2017 dataset with only 4.58 million
parameters, representing a 98% reduction compared to mainstream 3D Transformer
models, and significantly outperforms existing lightweight approaches. This
work demonstrates a synergistic breakthrough in achieving high-accuracy,
resource-efficient brain tumor segmentation suitable for deployment in
resource-constrained clinical environments.
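A hedged sketch of the VRUM idea, combining a trilinear-interpolation branch with multi-scale transposed convolutions (kernel sizes and the summation-based fusion are assumptions, not the paper's exact design):

```python
# Upsample 3D feature maps by fusing an interpolation branch with two
# transposed-convolution branches of different kernel sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelRefinementUpsample(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv_small = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.deconv_large = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.proj = nn.Conv3d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        interp = self.proj(F.interpolate(x, scale_factor=2, mode="trilinear",
                                         align_corners=False))
        return interp + self.deconv_small(x) + self.deconv_large(x)   # all branches yield 2x size
```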
♻ ☆ Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models ICCV 2025
Multimodal large language models (MLLMs) hold considerable promise for
applications in healthcare. However, their deployment in safety-critical
settings is hindered by two key limitations: (i) sensitivity to prompt design,
and (ii) a tendency to generate incorrect responses with high confidence. As
clinicians may rely on a model's stated confidence to gauge the reliability of
its predictions, it is especially important that when a model expresses high
confidence, it is also highly accurate. We introduce Prompt4Trust, the first
reinforcement learning (RL) framework for prompt augmentation targeting
confidence calibration in MLLMs. A lightweight LLM is trained to produce
context-aware auxiliary prompts that guide a downstream task MLLM to generate
responses in which the expressed confidence more accurately reflects predictive
accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically
prioritizes aspects of calibration most critical for safe and trustworthy
clinical decision-making. Beyond improvements driven by this clinically
motivated calibration objective, our proposed method also improves task
accuracy, achieving state-of-the-art medical visual question answering (VQA)
performance on the PMC-VQA benchmark, which is composed of multiple-choice
questions spanning diverse medical imaging modalities. Moreover, our framework
trained with a small downstream task MLLM showed promising zero-shot
generalization to larger MLLMs in our experiments, suggesting the potential for
scalable calibration without the associated computational costs. This work
demonstrates the potential of automated yet human-aligned prompt engineering
for improving the trustworthiness of MLLMs in safety-critical settings. Our
codebase can be found at https://github.com/xingbpshen/prompt4trust.
comment: Accepted to ICCV 2025 Workshop CVAMD
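For intuition only, a generic calibration-aware reward of the kind such an RL framework might optimize (a Brier-style term plus an accuracy bonus; not the paper's exact reward design):

```python
# Reward the prompt-augmentation policy when the downstream MLLM's stated
# confidence matches whether its answer is actually correct.
def calibration_reward(stated_confidence: float, is_correct: bool,
                       accuracy_bonus: float = 0.5) -> float:
    target = 1.0 if is_correct else 0.0
    brier = (stated_confidence - target) ** 2          # 0 means perfectly calibrated
    return (1.0 - brier) + (accuracy_bonus if is_correct else 0.0)

# Example: a confidently wrong answer (confidence 0.9, incorrect) earns 0.19,
# while a calibrated correct answer (confidence 0.9, correct) earns 1.49.
```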
♻ ☆ EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation CVPR2025
Recent work on human animation usually involves audio, pose, or movement-map
conditions, thereby achieving vivid animation quality. However, these methods
often face practical challenges due to extra control conditions, cumbersome
condition injection modules, or a limitation to head-region driving. Hence, we
ask if it is possible to achieve striking half-body human animation while
simplifying unnecessary conditions. To this end, we propose a half-body human
animation method, dubbed EchoMimicV2, that leverages a novel Audio-Pose Dynamic
Harmonization strategy, including Pose Sampling and Audio Diffusion, to enhance
half-body details and facial and gestural expressiveness while reducing
condition redundancy. To compensate for the scarcity of half-body data, we
utilize Head Partial Attention to seamlessly accommodate headshot data into our
training framework, which can be omitted during inference, providing a free
lunch for animation. Furthermore, we design the Phase-specific Denoising Loss
to guide motion, detail, and low-level quality for animation in specific
phases, respectively. Besides, we also present a novel benchmark for evaluating
the effectiveness of half-body human animation. Extensive experiments and
analyses demonstrate that EchoMimicV2 surpasses existing methods in both
quantitative and qualitative evaluations.
comment: CVPR2025
♻ ☆ What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies
Advances in vision-based sensors and computer vision algorithms have
significantly improved the analysis and understanding of traffic scenarios. To
facilitate the use of these improvements for road safety, this survey
systematically categorizes the critical elements that demand attention in
traffic scenarios and comprehensively analyzes available vision-driven tasks
and datasets. Compared to existing surveys that focus on isolated domains, our
taxonomy categorizes attention-worthy traffic entities into two main groups,
anomalies and normal-but-critical entities, integrating ten categories
and twenty subclasses. It establishes connections between inherently related
fields and provides a unified analytical framework. Our survey highlights the
analysis of 35 vision-driven tasks and comprehensive examinations and
visualizations of 73 available datasets based on the proposed taxonomy. The
cross-domain investigation covers the pros and cons of each benchmark with the
aim of providing information on standards unification and resource
optimization. Our article concludes with a systematic discussion of the
existing weaknesses, underlining the potential effects and promising solutions
from various perspectives. The integrated taxonomy, comprehensive analysis, and
recapitulatory tables serve as valuable contributions to this rapidly evolving
field by providing researchers with a holistic overview, guiding strategic
resource selection, and highlighting critical research gaps.
comment: 45 pages, 52 figures, 2 large tables (divided into 5), 73 datasets,
35 tasks
♻ ☆ View Invariant Learning for Vision-Language Navigation in Continuous Environments
Vision-Language Navigation in Continuous Environments (VLNCE), where an agent
follows instructions and moves freely to reach a destination, is a key research
problem in embodied AI. However, most navigation policies are sensitive to
viewpoint changes, i.e., variations in camera height and viewing angle that
alter the agent's observation. In this paper, we introduce a generalized
scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View
Invariant Learning), a view-invariant post-training strategy that enhances the
robustness of existing navigation policies to changes in camera viewpoint. VIL
employs a contrastive learning framework to learn sparse and view-invariant
features. Additionally, we introduce a teacher-student framework for the
Waypoint Predictor Module, a core component of most VLNCE baselines, where a
view-dependent teacher model distills knowledge into a view-invariant student
model. We employ an end-to-end training paradigm to jointly optimize these
components, thus eliminating the cost of individual module training. Empirical
results show that our method outperforms state-of-the-art approaches on
V2-VLNCE by 8-15% in Success Rate on the two standard benchmarks
R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE
setting and find that, despite being trained for varied viewpoints, it often
still improves performance. On the more challenging RxR-CE dataset, our method
also achieved state-of-the-art performance across all metrics when compared to
other map-free methods. This suggests that adding VIL does not diminish the
standard viewpoint performance and can serve as a plug-and-play post-training
method.
comment: Under review
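A minimal sketch of a view-invariant contrastive objective of the kind described (a standard InfoNCE form is used here; the paper's exact formulation may differ):

```python
# Pull together features of the same observation seen from two camera viewpoints
# and push apart features from different observations.
import torch
import torch.nn.functional as F

def view_invariant_infonce(feats_view_a, feats_view_b, temperature=0.07):
    """feats_view_*: (B, D) features of the same B observations under two viewpoints."""
    a = F.normalize(feats_view_a, dim=-1)
    b = F.normalize(feats_view_b, dim=-1)
    logits = a @ b.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```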
♻ ☆ Roadside Monocular 3D Detection Prompted by 2D Detection
Roadside monocular 3D detection requires detecting objects of predefined
classes in an RGB frame and predicting their 3D attributes, such as
bird's-eye-view (BEV) locations. It has broad applications in traffic control,
vehicle-vehicle communication, and vehicle-infrastructure cooperative
perception. To address this task, we introduce Promptable 3D Detector (Pro3D),
a novel detector design that leverages 2D detections as prompts. We build our
Pro3D upon two key insights. First, compared to a typical 3D detector, a 2D
detector is ``easier'' to train due to fewer loss terms and performs
significantly better at localizing objects w.r.t. 2D metrics. Second, once 2D
detections precisely locate objects in the image, a 3D detector can focus on
lifting these detections into 3D BEV, especially when fixed camera pose or
scene geometry provide an informative prior. To encode and incorporate 2D
detections, we explore three methods: (a) concatenating features from both 2D
and 3D detectors, (b) attentively fusing 2D and 3D detector features, and (c)
encoding properties of predicted 2D bounding boxes \{$x$, $y$, width, height,
label\} and attentively fusing them with the 3D detector feature.
Interestingly, the third method significantly outperforms the others,
underscoring the effectiveness of 2D detections as prompts that offer precise
object targets and allow the 3D detector to focus on lifting them into 3D.
Pro3D is adaptable for use with a wide range of 2D and 3D detectors with
minimal modifications. Comprehensive experiments demonstrate that our Pro3D
significantly enhances existing methods, achieving state-of-the-art results on
two contemporary benchmarks.
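A hedged sketch of the best-performing variant described above: embed each 2D box's properties and let the 3D detector's features attend to these prompt tokens (layer sizes and the attention configuration are illustrative assumptions):

```python
# Encode {x, y, width, height, label} of each 2D detection as a prompt token and
# fuse it with 3D detector features via cross-attention with a residual connection.
import torch
import torch.nn as nn

class BoxPromptFusion(nn.Module):
    def __init__(self, num_classes, feat_dim, heads=4):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, feat_dim)
        self.box_mlp = nn.Sequential(nn.Linear(4, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, det3d_feats, boxes_2d, labels_2d):
        # det3d_feats: (B, N, feat_dim); boxes_2d: (B, M, 4) normalized xywh;
        # labels_2d: (B, M) integer class ids.
        prompts = self.box_mlp(boxes_2d) + self.label_embed(labels_2d)
        fused, _ = self.cross_attn(det3d_feats, prompts, prompts)
        return det3d_feats + fused      # residual fusion with the prompt tokens
```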