FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation
CVPR 2026 Findings
Abstract
Vision-Language Navigation (VLN) for UAVs requires understanding ambiguous multi-step instructions and executing long-horizon planning in complex 3D environments from an egocentric view. Existing zero-shot methods often rely on very large models, generic prompts, and loosely organized modules, which limits robustness and interpretability. We propose FineCog-Nav, a top-down framework that simulates human cognition through fine-grained collaborative modules, including perception, attention, memory, imagination, reasoning, planning, and decision-making. Each module is driven by a moderate-sized foundation model with carefully designed cognitive-role prompts and structured input-output protocols. To support detailed evaluation, we construct AerialVLN-Fine, a curated benchmark with 300 trajectories, sentence-level instruction-trajectory alignment, and refined landmark-grounded instructions. Experiments show that FineCog-Nav consistently improves instruction adherence, long-horizon planning quality, and generalization to unseen environments in zero-shot UAV navigation.
Method
FineCog-Nav Framework Pipeline
FineCog-Nav organizes zero-shot UAV navigation into a closed cognitive loop: (1) instruction parsing and subgoal extraction, (2) attention-guided perception, (3) imagination-assisted subgoal judgment, (4) multi-level memory management at step/subgoal/instruction granularity, and (5) explainable decision-making with collision-aware action selection. This modular interdependence improves interpretability and task progression over monolithic prompting pipelines.
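For concreteness, the loop can be summarized as the sketch below. All module names and interfaces (`parse_instruction`, `attend`, `imagine`, and so on) are hypothetical simplifications of the five stages above, not the released implementation.

```python
# Minimal sketch of the FineCog-Nav cognitive loop; every interface here is a
# hypothetical simplification, not the released implementation.
def finecog_nav_step(instruction, obs, memory, modules):
    # (1) Instruction parsing and subgoal extraction (cached after the first call).
    subgoals = memory.get("subgoals") or modules.parse_instruction(instruction)
    memory["subgoals"] = subgoals
    idx = min(memory.get("subgoal_idx", 0), len(subgoals) - 1)
    current = subgoals[idx]

    # (2) Attention-guided perception: describe only instruction-relevant regions.
    percept = modules.perceive(obs, focus=modules.attend(current, obs))

    # (3) Imagination-assisted subgoal judgment: compare an imagined view of the
    #     subgoal endpoint against the current percept.
    if modules.judge_subgoal_done(percept, modules.imagine(current)):
        memory["subgoal_idx"] = idx + 1

    # (4) Multi-level memory update at step / subgoal / instruction granularity.
    memory.setdefault("steps", []).append(percept)

    # (5) Explainable, collision-aware decision over the discrete action space.
    action, rationale = modules.decide(current, percept, memory)
    return action, rationale
```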
Task objective: an episode succeeds if the final UAV position $p_{stop}$ lies within a threshold $\delta$ of the destination $p_{dest}$, i.e., $\|p_{stop} - p_{dest}\| \leq \delta$. The action space covers movement, turning, altitude changes, and declaring task completion.
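As a minimal sketch of this criterion, the following interprets SR2D as using horizontal distance only and SR3D as using the full 3D distance; this reading is inferred from the metric names rather than stated explicitly here.

```python
import numpy as np

def navigation_error(p_stop, p_dest, use_altitude=True):
    # Distance between stop position and destination; dropping the z axis
    # corresponds to the (assumed) 2D variant of the metric.
    p_stop, p_dest = np.asarray(p_stop, float), np.asarray(p_dest, float)
    if not use_altitude:
        p_stop, p_dest = p_stop[:2], p_dest[:2]
    return float(np.linalg.norm(p_stop - p_dest))

def is_success(p_stop, p_dest, delta, use_altitude=True):
    # Episode succeeds when the UAV stops within delta meters of the goal.
    return navigation_error(p_stop, p_dest, use_altitude) <= delta
```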
Experimental Results
We evaluate several base LLMs with basic prompts, forming single-module baselines referred to as BaseModels. For a fair and meaningful comparison, all experiments are conducted on AerialVLN-Fine, which provides a cleaner and more discriminative benchmark. FineCog-Nav consistently outperforms the corresponding BaseModel on the success and path-fidelity metrics (SR2D, SR3D, OSR, NE, nDTW); PL and Steps are reported as descriptive statistics.
| Model | Method | SR2D (%) | SR3D (%) | OSR (%) | NE (m) | nDTW (%) | PL (m) | Steps |
|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | BaseModel | 1.33 | 1.00 | 3.67 | 240.04 | 8.84 | 329.78 | 129.70 |
| Gemini 2.5 Flash-Lite | FineCog-Nav | 4.00 | 2.67 | 6.33 | 120.70 | 15.66 | 146.24 | 57.52 |
| GPT-4o-mini | BaseModel | 0.33 | 0.33 | 2.00 | 325.98 | 8.74 | 350.43 | 103.10 |
| GPT-4o-mini | FineCog-Nav | 4.00 | 2.33 | 3.67 | 100.37 | 20.45 | 56.98 | 31.73 |
| InternLM3-8B | BaseModel | 0.67 | 0.33 | 3.33 | 128.13 | 14.01 | 86.30 | 34.65 |
| InternLM3-8B | FineCog-Nav | 2.67 | 2.33 | 6.67 | 120.72 | 14.91 | 139.72 | 42.22 |
| ChatGLM-4-9B | BaseModel | 1.00 | 0.33 | 1.33 | 124.03 | 13.27 | 50.19 | 20.36 |
| ChatGLM-4-9B | FineCog-Nav | 3.00 | 2.67 | 2.67 | 97.05 | 19.26 | 30.49 | 19.39 |
| InternLM2.5-20B | BaseModel | 0.33 | 0.33 | 1.67 | 152.34 | 9.94 | 141.27 | 74.16 |
| InternLM2.5-20B | FineCog-Nav | 2.00 | 2.00 | 4.00 | 103.94 | 19.35 | 61.05 | 28.49 |
| ChatGLM-4-32B | BaseModel | 2.33 | 2.00 | 5.00 | 180.66 | 10.59 | 235.03 | 104.20 |
| ChatGLM-4-32B | FineCog-Nav | 3.33 | 2.33 | 5.33 | 94.18 | 21.25 | 45.91 | 20.18 |
| Qwen3-32B | BaseModel | 2.67 | 3.00 | 6.33 | 142.72 | 17.07 | 114.56 | 39.12 |
| Qwen3-32B | FineCog-Nav | 5.00 | 4.00 | 7.00 | 95.31 | 20.31 | 60.36 | 26.05 |
| Llama3.3-70B | BaseModel | 3.00 | 2.67 | 5.67 | 263.10 | 9.67 | 329.36 | 111.48 |
| Llama3.3-70B | FineCog-Nav | 6.67 | 6.00 | 9.67 | 98.84 | 20.17 | 84.96 | 38.91 |
| Qwen2.5-72B | BaseModel | 2.67 | 2.67 | 6.33 | 270.10 | 10.69 | 298.24 | 116.91 |
| Qwen2.5-72B | FineCog-Nav | 8.00 | 6.00 | 9.00 | 91.43 | 22.48 | 67.96 | 33.90 |
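For reference, the nDTW column reports normalized dynamic time warping between the executed and reference paths. Below is a minimal sketch, assuming the standard VLN definition $\mathrm{nDTW} = \exp(-\mathrm{DTW}/(|R| \cdot d_{th}))$ with Euclidean point costs; the success-threshold normalizer `d_th` is left as a parameter.

```python
import numpy as np

def ndtw(pred, ref, d_th):
    """Normalized DTW between predicted and reference paths (standard VLN form):
    nDTW = exp(-DTW(pred, ref) / (len(ref) * d_th))."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    n, m = len(pred), len(ref)
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - ref[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return float(np.exp(-dtw[n, m] / (m * d_th)))
```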
We first evaluate on AerialVLN-Fine, then scale up to the larger and noisier AerialVLN-S-Val subset. Across both datasets, FineCog-Nav outperforms NavGPT and DiscussNav on every success and path-fidelity metric (SR2D, SR3D, OSR, nDTW, NE), validating the efficacy of the cognitive collaborative framework.
| Dataset | Method | SR2D (%) | SR3D (%) | OSR (%) | nDTW (%) | NE (m) | PL (m) | Steps |
|---|---|---|---|---|---|---|---|---|
| AerialVLN-Fine | NavGPT | 0.33 | 0.00 | 0.67 | 15.90 | 110.94 | 21.44 | 9.33 |
| AerialVLN-Fine | DiscussNav | 2.67 | 2.67 | 3.33 | 19.63 | 98.36 | 39.51 | 17.76 |
| AerialVLN-Fine | FineCog-Nav | 8.00 | 6.00 | 9.00 | 22.48 | 91.43 | 67.96 | 33.90 |
| AerialVLN-S-Val | NavGPT | 0.12 | 0.12 | 0.46 | 8.29 | 135.65 | 44.92 | 13.02 |
| AerialVLN-S-Val | DiscussNav | 0.46 | 0.46 | 0.93 | 8.33 | 158.46 | 81.06 | 22.46 |
| AerialVLN-S-Val | FineCog-Nav | 1.97 | 1.50 | 1.85 | 11.47 | 130.32 | 75.28 | 34.52 |
Single-module ablations show that removing any module degrades performance, with the largest drop occurring when hierarchical memory (H) is replaced by a plain flat history (P). Joint ablations of multiple modules also degrade performance. The best results are achieved when all modules are enabled. S-prefixed columns report sentence-level metrics.
| Attn. | Imag. | Subgoal | Mem. | SR (%) | OSR (%) | NE (m) | nDTW (%) | S-SR (%) | S-OSR (%) | S-NE (m) |
|---|---|---|---|---|---|---|---|---|---|---|
| No | Yes | Yes | H | 3.00 | 7.00 | 104.38 | 19.55 | 19.92 | 26.46 | 66.21 |
| Yes | No | Yes | H | 3.67 | 5.00 | 101.27 | 19.67 | 20.48 | 24.21 | 64.85 |
| Yes | Yes | No | H | 2.00 | 4.33 | 102.32 | 19.13 | 19.42 | 24.28 | 65.54 |
| Yes | Yes | Yes | P | 0.67 | 2.67 | 97.76 | 19.69 | 19.42 | 23.50 | 62.90 |
| No | No | Yes | H | 3.67 | 8.67 | 106.59 | 18.14 | 19.35 | 27.30 | 70.16 |
| Yes | No | No | H | 4.00 | 8.67 | 98.75 | 19.78 | 16.68 | 26.46 | 65.36 |
| Yes | Yes | Yes | H | 6.00 | 9.00 | 91.43 | 22.48 | 22.03 | 27.23 | 59.02 |
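To make the memory contrast concrete (H = hierarchical, P = plain history), the sketch below shows one plausible organization of the three granularities named in the Method section; the class and field names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass, field

# Illustrative hierarchical memory (the "H" setting); the "P" baseline would
# instead keep a single flat list of raw step records.
@dataclass
class HierarchicalMemory:
    instruction_summary: str = ""                      # instruction-level: global goal
    subgoal_log: list = field(default_factory=list)    # subgoal-level: completed subgoals
    step_buffer: list = field(default_factory=list)    # step-level: recent percepts/actions

    def add_step(self, percept, action, window=5):
        self.step_buffer.append((percept, action))
        self.step_buffer = self.step_buffer[-window:]  # keep only recent steps

    def close_subgoal(self, subgoal, outcome):
        # Compress the finished step history into a one-line subgoal record.
        self.subgoal_log.append(f"{subgoal}: {outcome}")
        self.step_buffer.clear()
```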
Recall that AerialVLN-Fine provides sentence-level annotations for fine-grained evaluation. Sentence-level performance broadly improves as the base LLM scales up, though not strictly monotonically. Small sentence-level differences can accumulate over long trajectories, leading to larger gaps in full-trajectory metrics.
| Base LLM | S-SR (%) | S-nDTW (%) | S-NE (m) |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | 16.40 | 26.33 | 79.00 |
| GPT-4o-mini | 19.56 | 28.58 | 65.96 |
| InternLM3-8B | 15.62 | 28.19 | 72.01 |
| ChatGLM-4-9B | 19.92 | 31.42 | 62.18 |
| InternLM2.5-20B | 19.00 | 34.11 | 67.67 |
| ChatGLM-4-32B | 15.95 | 23.20 | 62.67 |
| Qwen3-32B | 18.76 | 30.04 | 62.42 |
| Llama3.3-70B | 21.32 | 33.89 | 64.82 |
| Qwen2.5-72B | 22.03 | 35.37 | 59.02 |
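A plausible aggregation for these sentence-level scores, reusing `navigation_error` from the earlier sketch: each instruction sentence is scored against its aligned trajectory segment and the results are averaged. The exact AerialVLN-Fine protocol may differ.

```python
def sentence_level_sr(segments, delta):
    # segments: list of (p_stop, p_dest) pairs, one per aligned sentence.
    # Hypothetical aggregation; returns the percentage of sentences whose
    # segment endpoint lands within delta meters of its annotated target.
    hits = [navigation_error(p_stop, p_dest) <= delta for p_stop, p_dest in segments]
    return 100.0 * sum(hits) / len(hits)
```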
Across all four tables, the results consistently support the same conclusion: a cognition-inspired, collaborative modular framework enables more reliable and interpretable zero-shot UAV navigation than both single-module prompting and existing multi-agent baselines.
Qualitative Analysis
Qualitative Navigation Trajectories
Representative trajectories show that the agent grounds language cues to visible landmarks and executes coherent subgoal transitions. Even in difficult cases, FineCog-Nav tends to maintain instruction-consistent behavior rather than drifting off course early.
Human Study Results
Human study results further support the quantitative findings: participants generally prefer FineCog-Nav for instruction-following quality and perceived navigation intelligence.
AerialVLN-Fine Dataset
Overview of AerialVLN-Fine
AerialVLN-Fine is a curated benchmark built from AerialVLN for more reliable zero-shot UAV VLN evaluation. It provides sentence-level alignment between instruction segments and trajectory segments, and refines ambiguous expressions with explicit visual endpoints and landmark references.
The dataset is designed to support fine-grained capability diagnosis and sentence-level evaluation, while keeping annotation quality high through repeated manual verification.
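To illustrate what sentence-level alignment could look like in practice, here is a hypothetical record layout; all field names and values are invented for illustration, so consult the dataset release for the actual schema.

```python
# Hypothetical AerialVLN-Fine record layout (illustrative field names only).
record = {
    "trajectory_id": "fine_0001",
    "instruction": "Take off, fly past the red-roofed warehouse, then land by the fountain.",
    "sentences": [
        {"text": "Take off and fly past the red-roofed warehouse.",
         "landmark": "red-roofed warehouse",
         "trajectory_span": [0, 18]},   # step indices of the aligned segment
        {"text": "Then land by the fountain.",
         "landmark": "fountain",
         "trajectory_span": [18, 31]},
    ],
    "destination": [123.4, 56.7, 12.0],  # (x, y, z) in meters
}
```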
Citation
Coming soon.
Contact
For questions about this work, please contact:
shaodian@nwpu.edu.cn | isjieqi@nju.edu.cn