Exploring Top-Down Visual Attention for Transportation Behavior Analysis: Probing Top-Down Attention Mechanisms in Vision-Language Models for Pedestrian Behavior Analysis

AbdulRahman, Bilal; Chen, Gong Qi; Conway, Alison; Zhu, Zhigang

2026-05-01

File Language:

English

Details

Creators:

AbdulRahman, Bilal ; Chen, Gong Qi ; Conway, Alison ; Zhu, Zhigang
Corporate Creators:

City College of New York
Corporate Contributors:

National Center for Understanding Future Travel Behavior and Demand (TBD) ; United States. Department of Transportation. University Transportation Centers (UTC) Program ; United States. Department of Transportation. Office of the Assistant Secretary for Research and Technology
Subject/TRT Terms:

Analysis ; Attention ; Consumer Behavior ; Drivers ; Interpretability ; Neural Networks ; Pedestrian Behavior ; Pedestrians ; Signalized Intersections ; Top-Down Attention ; Traffic Control ; Transportation Planning ; Urban Areas ; Vision ; Vision-Language Models
Resource Type:

Tech Report
Geographical Coverage:

United States
Edition:

Final Report: 2024-2026
Corporate Publisher:

United States. Department of Transportation. University Transportation Centers (UTC) Program
Abstract:

Human action recognition is relevant to many interactions in transportation systems, including those between drivers and their vehicles, between drivers and pedestrians, and between other humans and existing vehicles or traffic control devices. As action recognition systems scale to complex urban environments, understanding exactly where a model focuses its attention is critical for ensuring reliability and fairness. This study proposes a framework for integrating top-down visual attention into transportation applications, focusing on pedestrians navigating complex environments such as signalized intersections. The project developed a specialized probing tool for Vision-Language Models (VLMs) that visualizes cross-modal attention maps, and it employed a bio-inspired feedback mechanism to improve the attention mechanism in the vision backbone. The resulting framework tracks how text prompts guide visual focus. The core technical achievement is an algorithmic pipeline that accurately maps 1D sequence tokens back to 2D pixel coordinates, correcting for the spatial distortions caused by aggressive image padding, resizing, and sliding-window cropping. The tool improves the interpretability of VLMs used in transportation applications by showing which image regions a text prompt draws attention to.
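
The token-to-pixel mapping described in the abstract can be illustrated with a minimal sketch. This is not the report's pipeline, which is not reproduced in this record. It assumes a common preprocessing order (uniform resize, then padding, then an optional sliding-window crop) and a row-major grid of square patches with any special tokens already removed. The names `Preprocess`, `token_to_box`, and every parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Preprocess:
    """Hypothetical record of how an image was prepared for the vision encoder."""
    scale: float      # uniform resize factor applied to the original image
    pad_left: int     # padding added on the left, in resized pixels
    pad_top: int      # padding added on the top, in resized pixels
    crop_x: int = 0   # left edge of the sliding-window crop in the padded canvas
    crop_y: int = 0   # top edge of the sliding-window crop in the padded canvas


def _clip(value: float, upper: float) -> float:
    """Clamp a coordinate to the range [0, upper]."""
    return max(0.0, min(float(upper), value))


def token_to_box(
    index: int, grid_w: int, patch: int, prep: Preprocess, orig_w: int, orig_h: int
) -> Optional[Tuple[float, float, float, float]]:
    """Map a patch-token index to (x0, y0, x1, y1) in original-image pixels.

    Returns None when the patch lies entirely inside padding.
    """
    # 1D index -> 2D grid position (row-major order is an assumption).
    row, col = divmod(index, grid_w)
    # Patch corner in the padded canvas, shifted by the crop origin.
    left = prep.crop_x + col * patch
    top = prep.crop_y + row * patch
    # Undo padding, undo resizing, then clip to the original image bounds.
    x0 = _clip((left - prep.pad_left) / prep.scale, orig_w)
    y0 = _clip((top - prep.pad_top) / prep.scale, orig_h)
    x1 = _clip((left + patch - prep.pad_left) / prep.scale, orig_w)
    y1 = _clip((top + patch - prep.pad_top) / prep.scale, orig_h)
    if x1 <= x0 or y1 <= y0:
        return None
    return (x0, y0, x1, y1)
```

For example, under these assumptions a 200 x 100 image scaled by 2 and padded by 100 pixels on top and bottom gives a 400 x 400 canvas. With 100-pixel patches in a 4 x 4 grid, token 4 maps to the box (0, 0, 50, 50) in the original image, while token 0 falls entirely in padding and maps to `None`.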
Format:

PDF
Funding:

69A3552344815 ; 69A3552348320
Collection(s):

University Transportation Centers
Main Document Checksum:

urn:sha-512:5ba50b8d3fac215c1da62d827636cf32fae43297aff4e42a2883cee9dffc536fe0b6b930afc7f56acd4388fe9bf04f11bccc52c819f2d2d9f121d7b245f5fef7
Download URL:

https://rosap.ntl.bts.gov/view/dot/91969/dot_91969_DS1.pdf
File Type:

[PDF - 1.19 MB]

