A Survey on Autonomous Mobile Navigation Using Multimodal Agents and Natural Language Commands
- DOI
- 10.2991/978-94-6239-616-6_15
- Keywords
- Accessibility; Android Accessibility Service; Speech-to-Text (STT); Voice-controlled navigation; Large Language Models (LLM); CrewAI; LangGraph; Vision-Language Models (VLM); Multimodal AI; Hands-free smartphone control
- Abstract
Smartphones are controlled mostly by touch, and current voice assistants offer only limited navigation, especially in complex or changing interfaces. This limitation drives research into multimodal systems that combine speech, vision, and language understanding. Past work in this area falls into three categories: rule-based, OCR-based, and large language model (LLM)-driven methods. This survey reviews and compares these approaches, focusing on their methods, strengths, and weaknesses. It identifies major research gaps, such as adapting to changing user interfaces, improving multimodal perception, and the absence of unified benchmarks. It also points out ongoing challenges such as latency, privacy, and real-world scalability, particularly in hands-free and accessibility settings. By summarizing recent advances and comparing key systems, this work offers a reference for researchers and practitioners working to improve autonomous mobile navigation.
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - L. Durgadevi
AU  - C. Dinesh Kumar
AU  - G. Nithish
AU  - R. S. Shankar
PY  - 2026
DA  - 2026/03/31
TI  - A Survey on Autonomous Mobile Navigation Using Multimodal Agents and Natural Language Commands
BT  - Proceedings of the International Conference on Artificial Intelligence and Secure Data Analytics (ICAISDA 2025)
PB  - Atlantis Press
SP  - 174
EP  - 190
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6239-616-6_15
DO  - 10.2991/978-94-6239-616-6_15
ID  - Durgadevi2026
ER  -