Exploring Top-Down Visual Attention for Transportation Behavior Analysis: Probing Top-Down Attention Mechanisms in Vision-Language Models for Pedestrian Behavior Analysis
-
2026-05-01
-
Details
-
Creators:
-
Corporate Creators:
-
Corporate Contributors:
-
Subject/TRT Terms:
-
Resource Type:
-
Geographical Coverage:
-
Edition: Final Report: 2024-2026
-
Corporate Publisher:
-
Abstract: Human action recognition is relevant to a number of interactions that occur in transportation systems, including those between a driver and their vehicle, between drivers and pedestrians, and between other humans and existing vehicles or traffic control devices. As action recognition systems scale to handle complex urban environments, understanding exactly where a model focuses its attention is critical for ensuring reliability and fairness. This study proposes a framework for integrating top-down visual attention into transportation applications, focusing on the case of pedestrians navigating complex environments such as signalized intersections. The project developed a specialized probing tool for Vision-Language Models (VLMs) that visualizes cross-modal attention maps, and it employed a bio-inspired feedback mechanism to improve attention in the vision backbone. The study also established a framework for tracking how text prompts guide visual focus. The core technical achievement is an algorithmic pipeline that accurately maps 1D sequence tokens back to 2D pixel coordinates, overcoming the spatial distortions caused by aggressive image padding, resizing, and sliding-window cropping. This tool significantly enhances the interpretability of VLMs used in transportation applications.
-
Format:
-
Funding:
-
Collection(s):
-
Main Document Checksum: urn:sha-512:5ba50b8d3fac215c1da62d827636cf32fae43297aff4e42a2883cee9dffc536fe0b6b930afc7f56acd4388fe9bf04f11bccc52c819f2d2d9f121d7b245f5fef7
-
Download URL:
-
File Type: