FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

CVPR 2026 Findings

Dian Shao1, Zhengzheng Xu1, Peiyang Wang2, Like Liu1, Yule Wang1, Jieqi Shi2, Jing Huo2
1Northwestern Polytechnical University    2Nanjing University
Corresponding Authors

📄 Paper 📝 arXiv 💻 Code 💾 Dataset

Abstract

Figure: FineCog-Nav poster.

Figure: FineCog-Nav teaser.

Vision-Language Navigation (VLN) for UAVs requires understanding ambiguous multi-step instructions and executing long-horizon planning in complex 3D environments from an egocentric view. Existing zero-shot methods often rely on very large models, generic prompts, and loosely organized modules, which limits robustness and interpretability. We propose FineCog-Nav, a top-down framework that simulates human cognition through fine-grained collaborative modules, including perception, attention, memory, imagination, reasoning, planning, and decision-making. Each module is driven by a moderate-sized foundation model with carefully designed cognitive-role prompts and structured input-output protocols. To support detailed evaluation, we construct AerialVLN-Fine, a curated benchmark with 300 trajectories, sentence-level instruction-trajectory alignment, and refined landmark-grounded instructions. Experiments show that FineCog-Nav consistently improves instruction adherence, long-horizon planning quality, and generalization to unseen environments in zero-shot UAV navigation.

Method

Figure: FineCog-Nav framework pipeline.

FineCog-Nav organizes zero-shot UAV navigation into a closed cognitive loop: (1) instruction parsing and subgoal extraction, (2) attention-guided perception, (3) imagination-assisted subgoal judgment, (4) multi-level memory management at step/subgoal/instruction granularity, and (5) explainable decision-making with collision-aware action selection. This modular interdependence improves interpretability and task progression over monolithic prompting pipelines.
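To make the loop concrete, here is a minimal Python sketch of the closed cognitive loop. The callables (`parse`, `attend`, `judge`, `decide`), the `Memory` fields, and the `env` interface are hypothetical stand-ins for the prompt-driven LLM modules and multi-level memory described above, not the released implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Multi-level memory at step / subgoal / instruction granularity (hypothetical sketch)."""
    instruction: str
    subgoals: list
    steps: list = field(default_factory=list)        # step-level records
    completed: list = field(default_factory=list)    # finished subgoals

    def log_step(self, obs, action):
        self.steps.append((obs, action))

    def commit_subgoal(self, idx):
        self.completed.append(self.subgoals[idx])

def navigate(instruction, env, parse, attend, judge, decide, max_steps=100):
    subgoals = parse(instruction)                 # (1) parsing & subgoal extraction
    memory = Memory(instruction, subgoals)
    goal = 0
    for _ in range(max_steps):
        obs = env.observe()                       # egocentric observation
        focus = attend(obs, subgoals[goal])       # (2) attention-guided perception
        if judge(focus, subgoals[goal], memory):  # (3) imagination-assisted subgoal check
            memory.commit_subgoal(goal)           # (4) subgoal-level memory update
            goal += 1
            if goal == len(subgoals):
                break                             # all subgoals satisfied
        else:
            action = decide(focus, subgoals[goal], memory)  # (5) collision-aware decision
            memory.log_step(obs, action)
            env.step(action)
    return env.stop()                             # task-completion action
```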

Task objective: an episode succeeds when the final UAV position $p_{stop}$ lies within a threshold $\delta$ of the destination $p_{dest}$, i.e., $\|p_{stop} - p_{dest}\| \leq \delta$. The action space comprises forward movement, turning, altitude changes, and a stop action that declares task completion.
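As a worked example, the success criterion and a discretized action space might look like the sketch below. The exact action set follows the simulator setup, and the 20 m default for $\delta$ is an assumption, not a value confirmed by the paper.

```python
import math
from enum import Enum

class Action(Enum):
    # Illustrative discretization of the action space named above;
    # the exact action set follows the paper's simulator setup.
    MOVE_FORWARD = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2
    ASCEND = 3
    DESCEND = 4
    STOP = 5  # declares task completion

def is_success(p_stop, p_dest, delta=20.0):
    """Success iff ||p_stop - p_dest|| <= delta (meters).
    delta=20.0 is an assumed threshold, not confirmed by the paper."""
    return math.dist(p_stop, p_dest) <= delta
```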

Experimental Results

Table 1. Comparison with Single-module BaseModel

We evaluate several base LLMs driven by basic prompts, forming single-module baselines referred to as BaseModels. For a fair and meaningful comparison, all experiments are conducted on AerialVLN-Fine, which provides a cleaner and more discriminative benchmark. For every base LLM, FineCog-Nav improves success rates (SR2D, SR3D, OSR), navigation error (NE), and path fidelity (nDTW) over the corresponding BaseModel.

| Model | Method | SR2D (%) | SR3D (%) | OSR (%) | NE (m) | nDTW (%) | PL (m) | Steps |
|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | BaseModel | 1.33 | 1.00 | 3.67 | 240.04 | 8.84 | 329.78 | 129.70 |
| Gemini 2.5 Flash-Lite | FineCog-Nav | 4.00 | 2.67 | 6.33 | 120.70 | 15.66 | 146.24 | 57.52 |
| GPT-4o-mini | BaseModel | 0.33 | 0.33 | 2.00 | 325.98 | 8.74 | 350.43 | 103.10 |
| GPT-4o-mini | FineCog-Nav | 4.00 | 2.33 | 3.67 | 100.37 | 20.45 | 56.98 | 31.73 |
| InternLM3-8B | BaseModel | 0.67 | 0.33 | 3.33 | 128.13 | 14.01 | 86.30 | 34.65 |
| InternLM3-8B | FineCog-Nav | 2.67 | 2.33 | 6.67 | 120.72 | 14.91 | 139.72 | 42.22 |
| ChatGLM-4-9B | BaseModel | 1.00 | 0.33 | 1.33 | 124.03 | 13.27 | 50.19 | 20.36 |
| ChatGLM-4-9B | FineCog-Nav | 3.00 | 2.67 | 2.67 | 97.05 | 19.26 | 30.49 | 19.39 |
| InternLM2.5-20B | BaseModel | 0.33 | 0.33 | 1.67 | 152.34 | 9.94 | 141.27 | 74.16 |
| InternLM2.5-20B | FineCog-Nav | 2.00 | 2.00 | 4.00 | 103.94 | 19.35 | 61.05 | 28.49 |
| ChatGLM-4-32B | BaseModel | 2.33 | 2.00 | 5.00 | 180.66 | 10.59 | 235.03 | 104.20 |
| ChatGLM-4-32B | FineCog-Nav | 3.33 | 2.33 | 5.33 | 94.18 | 21.25 | 45.91 | 20.18 |
| Qwen3-32B | BaseModel | 2.67 | 3.00 | 6.33 | 142.72 | 17.07 | 114.56 | 39.12 |
| Qwen3-32B | FineCog-Nav | 5.00 | 4.00 | 7.00 | 95.31 | 20.31 | 60.36 | 26.05 |
| Llama3.3-70B | BaseModel | 3.00 | 2.67 | 5.67 | 263.10 | 9.67 | 329.36 | 111.48 |
| Llama3.3-70B | FineCog-Nav | 6.67 | 6.00 | 9.67 | 98.84 | 20.17 | 84.96 | 38.91 |
| Qwen2.5-72B | BaseModel | 2.67 | 2.67 | 6.33 | 270.10 | 10.69 | 298.24 | 116.91 |
| Qwen2.5-72B | FineCog-Nav | 8.00 | 6.00 | 9.00 | 91.43 | 22.48 | 67.96 | 33.90 |
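For reference, NE is the Euclidean distance from the stop position to the destination, and nDTW scores how closely the flown path follows the reference path. Below is a minimal nDTW sketch, assuming the standard formulation nDTW = exp(-DTW(P, R) / (|R| * d_th)) with d_th set to the success threshold; whether the paper uses exactly this normalization is an assumption.

```python
import math

def ndtw(pred, ref, d_th=20.0):
    """Normalized Dynamic Time Warping between a predicted path `pred` and a
    reference path `ref` (lists of 3D points). d_th is the success threshold
    in meters (assumed here to match the SR threshold)."""
    n, m = len(pred), len(ref)
    INF = float("inf")
    dtw = [[INF] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(pred[i - 1], ref[j - 1])
            dtw[i][j] = cost + min(dtw[i - 1][j], dtw[i][j - 1], dtw[i - 1][j - 1])
    return math.exp(-dtw[n][m] / (len(ref) * d_th))
```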

Table 2. Comparison with Framework Baselines

We first evaluate on AerialVLN-Fine, followed by larger-scale experiments on the bigger and noisier AerialVLN-S-Val subset. Across both datasets, FineCog-Nav outperforms both baselines on all success and path-fidelity metrics, validating the efficacy of the cognitive collaborative framework.

| Dataset | Method | SR2D (%) | SR3D (%) | OSR (%) | nDTW (%) | NE (m) | PL (m) | Steps |
|---|---|---|---|---|---|---|---|---|
| AerialVLN-Fine | NavGPT | 0.33 | 0.00 | 0.67 | 15.90 | 110.94 | 21.44 | 9.33 |
| AerialVLN-Fine | DiscussNav | 2.67 | 2.67 | 3.33 | 19.63 | 98.36 | 39.51 | 17.76 |
| AerialVLN-Fine | FineCog-Nav | 8.00 | 6.00 | 9.00 | 22.48 | 91.43 | 67.96 | 33.90 |
| AerialVLN-S-Val | NavGPT | 0.12 | 0.12 | 0.46 | 8.29 | 135.65 | 44.92 | 13.02 |
| AerialVLN-S-Val | DiscussNav | 0.46 | 0.46 | 0.93 | 8.33 | 158.46 | 81.06 | 22.46 |
| AerialVLN-S-Val | FineCog-Nav | 1.97 | 1.50 | 1.85 | 11.47 | 130.32 | 75.28 | 34.52 |

Table 3. Ablation Analysis

Single-module ablations show that removing any module leads to performance drops, with the largest decline occurring when hierarchical memory is replaced by plain history. Joint ablations also degrade performance. The best results are achieved when all modules are enabled.

| Attn. | Imag. | Subgoal | Mem. | SR (%) | OSR (%) | NE (m) | nDTW (%) | S-SR (%) | S-OSR (%) | S-NE (m) |
|---|---|---|---|---|---|---|---|---|---|---|
| No | Yes | Yes | H | 3.00 | 7.00 | 104.38 | 19.55 | 19.92 | 26.46 | 66.21 |
| Yes | No | Yes | H | 3.67 | 5.00 | 101.27 | 19.67 | 20.48 | 24.21 | 64.85 |
| Yes | Yes | No | H | 2.00 | 4.33 | 102.32 | 19.13 | 19.42 | 24.28 | 65.54 |
| Yes | Yes | Yes | P | 0.67 | 2.67 | 97.76 | 19.69 | 19.42 | 23.50 | 62.90 |
| No | No | Yes | H | 3.67 | 8.67 | 106.59 | 18.14 | 19.35 | 27.30 | 70.16 |
| Yes | No | No | H | 4.00 | 8.67 | 98.75 | 19.78 | 16.68 | 26.46 | 65.36 |
| Yes | Yes | Yes | H | 6.00 | 9.00 | 91.43 | 22.48 | 22.03 | 27.23 | 59.02 |

Attn. = attention, Imag. = imagination, Subgoal = subgoal judgment; Mem.: H = hierarchical memory, P = plain history.
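The Mem. column contrasts hierarchical memory (H) with a plain history buffer (P). Below is a minimal sketch of the distinction, with hypothetical field and method names; the actual implementation realizes this through structured prompts rather than these exact classes.

```python
from dataclasses import dataclass, field

@dataclass
class PlainHistory:
    """P: a flat log of past steps; context grows linearly and unstructured."""
    events: list = field(default_factory=list)

    def context(self):
        return "\n".join(self.events)

@dataclass
class HierarchicalMemory:
    """H: step-level detail is summarized into subgoal records, which roll up
    under the full instruction, keeping the LLM context short and structured."""
    instruction: str
    subgoal_summaries: list = field(default_factory=list)  # one entry per completed subgoal
    recent_steps: list = field(default_factory=list)       # only the current subgoal's steps

    def close_subgoal(self, summary):
        self.subgoal_summaries.append(summary)
        self.recent_steps.clear()

    def context(self):
        return "\n".join([self.instruction, *self.subgoal_summaries, *self.recent_steps])
```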

Table 4. Sentence-level Analysis

Recall that AerialVLN-Fine provides sentence-level annotations for fine-grained evaluation. Performance generally improves with base LLM scale, though not strictly monotonically, and the 70B-class models perform best. Small sentence-level differences can accumulate over long trajectories, leading to larger gaps in full-trajectory metrics.

| Base LLM | S-SR (%) | S-nDTW (%) | S-NE (m) |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | 16.40 | 26.33 | 79.00 |
| GPT-4o-mini | 19.56 | 28.58 | 65.96 |
| InternLM3-8B | 15.62 | 28.19 | 72.01 |
| ChatGLM-4-9B | 19.92 | 31.42 | 62.18 |
| InternLM2.5-20B | 19.00 | 34.11 | 67.67 |
| ChatGLM-4-32B | 15.95 | 23.20 | 62.67 |
| Qwen3-32B | 18.76 | 30.04 | 62.42 |
| Llama3.3-70B | 21.32 | 33.89 | 64.82 |
| Qwen2.5-72B | 22.03 | 35.37 | 59.02 |

Across all four tables, the results consistently support the same conclusion: a cognition-inspired, collaborative modular framework enables more reliable and interpretable zero-shot UAV navigation than both single-module prompting and existing multi-agent baselines.

Qualitative Analysis

Figure: Qualitative navigation trajectories.

Representative trajectories show that the agent grounds language cues to visible landmarks and executes coherent subgoal transitions. Even in difficult cases, FineCog-Nav tends to maintain instruction-consistent behavior rather than drifting off course early.

Figure: Human study results.

Human study results further support the quantitative findings: participants generally prefer FineCog-Nav for instruction-following quality and perceived navigation intelligence.

AerialVLN-Fine Dataset

Figure: Overview of AerialVLN-Fine.

AerialVLN-Fine is a curated benchmark built from AerialVLN for more reliable zero-shot UAV VLN evaluation. It provides sentence-level alignment between instruction segments and trajectory segments, and refines ambiguous expressions with explicit visual endpoints and landmark references.
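To illustrate what sentence-level alignment means in practice, a hypothetical annotation entry is sketched below; the field names and values are illustrative, not the released schema.

```python
# Illustrative (not the released schema) sentence-aligned trajectory entry.
example_episode = {
    "episode_id": "0001",
    "instruction": "Take off and fly over the red-roofed building, then land near the fountain.",
    "sentences": [
        {"text": "Take off and fly over the red-roofed building,",
         "landmark": "red-roofed building",
         "trajectory_slice": [0, 41]},   # step indices of the aligned segment
        {"text": "then land near the fountain.",
         "landmark": "fountain",
         "trajectory_slice": [41, 73]},
    ],
}
```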

Key Statistics

The benchmark contains 300 trajectories with sentence-level instruction-trajectory alignment and refined, landmark-grounded instructions. It is designed to support fine-grained capability diagnosis and sentence-level evaluation, with annotation quality maintained through repeated manual verification.

💾 Dataset Link 💻 Code Link

Citation

Coming soon.

Contact

For questions about this work, please contact:
shaodian@nwpu.edu.cn  |  isjieqi@nju.edu.cn
