Smart agents must be able to interact with humans or with other agents in their environment. We propose a computational model that interacts efficiently with an oracle while navigating an unknown indoor environment. When the agent is uncertain about where to go next, it triggers the dialogue model to generate a question-and-answer pair about the direction it should take. The dialogue model acts as an oracle: it sees the future images along the trajectory and generates an instruction for the navigator agent. The navigator agent can query the dialogue model multiple times, thereby engaging in dialogue during navigation.
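To make the interaction concrete, the following is a minimal sketch of this navigate-or-ask loop. All names here (navigate_with_dialogue, ASK_ACTION, the navigator/oracle/env interfaces) are hypothetical illustrations of the protocol, not the paper's actual implementation.

```python
# A minimal sketch of the navigate-or-ask loop described above.
# The interfaces (navigator, oracle, env) are assumed, not the paper's API.

ASK_ACTION = "ask"    # special action the navigator emits when confused
STOP_ACTION = "stop"

def navigate_with_dialogue(navigator, oracle, env, max_steps=80):
    """Run one episode: the agent moves, and when it is uncertain it
    queries the oracle, which peeks at future trajectory images to answer."""
    obs = env.reset()
    dialogue_history = []
    for _ in range(max_steps):
        action = navigator.decide(obs, dialogue_history)
        if action == ASK_ACTION:
            # Oracle privilege: it can look ahead along the ground-truth path.
            question = oracle.generate_question(obs.current_image)
            answer = oracle.generate_answer(obs.current_image,
                                            env.future_images())
            dialogue_history.append((question, answer))
            continue  # re-decide with the new instruction in context
        obs = env.step(action)
        if action == STOP_ACTION:
            break
    return obs
```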
The dialogue model is built on a GPT-2 decoder that handles multimodal input consisting of both text and images. First, the dialogue model is trained to generate a question from the current image, and an answer from the current image together with the future images toward the goal. Next, a VLN model is trained so that the agent either outputs a navigation action or triggers the dialogue model when it needs help.
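As a rough illustration, one common way to feed images to a GPT-2 decoder is to project pre-extracted visual features into the token-embedding space and prepend them to the text sequence. The sketch below assumes the HuggingFace Transformers library and a hypothetical visual feature size VIS_DIM; it is an assumed construction, not the paper's exact architecture. Under this layout, only the current image would be prefixed when generating a question, and the current plus future images when generating an answer.

```python
# Sketch: projecting image features into GPT-2's embedding space so that
# text and images share one decoder. VIS_DIM and the sequence layout are
# illustrative assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

VIS_DIM = 2048  # assumed size of pre-extracted visual features

class MultimodalGPT2(nn.Module):
    def __init__(self):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        hidden = self.gpt2.config.n_embd
        # Map image features into GPT-2's token-embedding space.
        self.img_proj = nn.Linear(VIS_DIM, hidden)

    def forward(self, image_feats, input_ids, labels=None):
        # image_feats: (B, n_imgs, VIS_DIM); input_ids: (B, T)
        img_emb = self.img_proj(image_feats)             # (B, n_imgs, H)
        txt_emb = self.gpt2.transformer.wte(input_ids)   # (B, T, H)
        inputs_embeds = torch.cat([img_emb, txt_emb], dim=1)
        if labels is not None:
            # Mask the image positions (-100) so no LM loss is computed there.
            ignore = torch.full(img_emb.shape[:2], -100,
                                dtype=labels.dtype, device=labels.device)
            labels = torch.cat([ignore, labels], dim=1)
        return self.gpt2(inputs_embeds=inputs_embeds, labels=labels)
```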
In our experimental analysis, we show that the proposed model achieves state-of-the-art performance on the main navigation tasks involving dialogue, Cooperative Vision-and-Dialog Navigation (CVDN) and Navigation from Dialog History (NDH), demonstrating that our approach is effective at generating useful questions and answers to guide navigation.