Beyond Visual Field of View: Perceiving 3D Environment with Echoes and Vision

Abstract


This paper focuses on perceiving and navigating 3D environments using echoes and an RGB image. In particular, we perform depth estimation by fusing an RGB image with echoes received from multiple orientations. Unlike previous works, we go beyond the field of view of the RGB image and estimate dense depth maps for substantially larger parts of the environment. We show that the echoes provide holistic and inexpensive information about the 3D structures, complementing the RGB image. Moreover, we study how echoes and the wide field-of-view depth maps can be utilised in robot navigation. We compare the proposed methods against recent baselines on two sets of challenging realistic 3D environments: Replica and Matterport3D.

The framework of depth estimation from echoes and RGB image


We study how to estimate a large field-of-view depth map using a narrow field-of-view RGB image and echoes received from multiple orientations. Given only echoes as input, we propose the model in the figure above (without the vision encoder) to exploit signals received from different orientations for depth prediction. Each echo encoder maps a pair of binaural echo spectrograms into a vector that preserves the spatial cues of the 3D environment. The spatial vectors computed from the different echo orientations are concatenated before being passed to the depth decoder for depth prediction. The echo encoders share parameters. When an RGB image is available together with the echoes, we use the full model in the figure to fuse the echoes with the RGB features and predict a wide-FoV depth map.
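A minimal sketch of this echoes-to-depth model is given below (PyTorch). The layer sizes, spectrogram shape, and output resolution are illustrative assumptions rather than the exact configuration from the paper; when an RGB image is available, its encoder features would be concatenated with the echo vectors before decoding.

```python
# Minimal sketch of the echoes-to-depth model described above (PyTorch).
# Layer sizes and the output resolution are illustrative assumptions.
import torch
import torch.nn as nn

class EchoEncoder(nn.Module):
    """Maps a binaural echo spectrogram (2 channels) to a spatial feature vector."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, spec):                         # spec: (B, 2, F, T)
        return self.fc(self.conv(spec).flatten(1))   # (B, feat_dim)

class EchoesToDepth(nn.Module):
    """Fuses echoes from several orientations and decodes a wide-FoV depth map.
    A single EchoEncoder is shared across all orientations (shared parameters)."""
    def __init__(self, num_orientations=4, feat_dim=512, out_hw=(128, 128)):
        super().__init__()
        self.encoder = EchoEncoder(feat_dim)         # shared across orientations
        self.out_hw = out_hw
        self.decoder = nn.Sequential(
            nn.Linear(num_orientations * feat_dim, 8 * 8 * 64), nn.ReLU(inplace=True),
        )
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, echo_specs):
        # echo_specs: list of (B, 2, F, T) spectrograms, one per orientation.
        # In the audio-visual variant, RGB encoder features would be
        # concatenated with `feats` here before decoding.
        feats = torch.cat([self.encoder(s) for s in echo_specs], dim=1)
        x = self.decoder(feats).view(-1, 64, 8, 8)
        depth = self.upsample(x)
        return nn.functional.interpolate(depth, self.out_hw, mode="bilinear",
                                         align_corners=False)
```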

An RGB image is a strong cue for inferring depth. However, RGB is typically available only for a very limited FoV and thus provides a narrow picture of the scene in a human-like setup, and adding more cameras introduces substantial extra processing. In contrast, ambisonic audio is an inherently omnidirectional sensory signal, analogous to a 360° image, and provides rich holistic geometric information about the 3D environment. Echoes can therefore be a strong cue outside the RGB FoV. This motivates us to leverage echoes to overcome the limitations of visual observation and obtain a better perception of the environment.

Comparison with state-of-the-art


We start by comparing our models against the competitive baselines Average, Echo2Depth, RGB2Depth, VisualEchoes [38], and Materials [69] for estimating depth inside the RGB FoV (90°) in Table 1. We adopt the same experimental setup as [38, 69]. When using echoes alone, our method of combining echoes received from different orientations outperforms Echo2Depth [38] by a large margin. When the target-orientation RGB image is available, our proposed approach achieves improvements of 15.0% (Replica) and 22.1% (Matterport3D) over VisualEchoes [38]. Furthermore, we observe that Materials [69] performs best on the Replica dataset while only attaining results similar to VisualEchoes [38] on Matterport3D. This suggests that the material cues from the pretrained material approach [9] have a dominant impact on depth prediction for Replica, whereas their influence declines for the larger Matterport3D scenes. Remarkably, our model achieves state-of-the-art results on the Matterport3D dataset, surpassing Materials [69] by around 18.2% in RMSE. It is worth noting that the Materials [69] model has 316.9M parameters compared to our 21.7M. In addition, we show qualitative visualizations of depth estimation in Fig. 4; more qualitative examples are presented in the supplementary material. These results indicate that our proposed methods better perceive the geometric structure of the scene.
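For reference, the sketch below shows how standard depth-estimation metrics such as RMSE and the δ threshold accuracies are typically computed. The exact masking and clipping protocol follows the setup of [38, 69], so the details here are assumptions for illustration only.

```python
# Illustrative implementation of common depth-evaluation metrics
# (RMSE, log10 error, delta threshold accuracies); masking/clipping
# details are assumptions, the exact protocol follows [38, 69].
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3):
    """pred, gt: (H, W) depth maps in meters."""
    valid = gt > min_depth                       # ignore invalid ground-truth pixels
    pred, gt = pred[valid], gt[valid]
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    log10 = np.mean(np.abs(np.log10(pred.clip(min_depth)) - np.log10(gt)))
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "RMSE": rmse,
        "LOG10": log10,
        "DELTA_1.25": np.mean(ratio < 1.25),
        "DELTA_1.25^2": np.mean(ratio < 1.25 ** 2),
        "DELTA_1.25^3": np.mean(ratio < 1.25 ** 3),
    }
```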

Estimating depth maps beyond the visual field of view


We evaluate the model of Fig. 2 (with and without echoes) on depth extension to a 120° FoV, using echoes and RGB images with FoV ∈ {15°, 30°, 45°, 60°, 75°, 90°, 105°, 120°}. Fig. 5 shows that incorporating echoes outperforms the counterpart without echoes across all FoVs. For Replica in particular, the improvement shrinks as the RGB FoV increases. This indicates that echoes serve as a strong spatial cue in the regions where RGB is not available; once echoes are used, enlarging the RGB FoV brings only a small additional gain. Interestingly, for Matterport3D, the improvement of using echoes and RGB over RGB alone remains relatively stable as the RGB FoV grows. This may be because Matterport3D contains large scenes, for which the RGB image captures important geometric structure.
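As an illustration of how the narrower-FoV inputs relate to wider ones (an assumption about the setup, not necessarily the paper's data pipeline), a smaller horizontal FoV can be simulated by center-cropping a wider perspective image under a pinhole camera model:

```python
# Hypothetical helper: center-crop a perspective image to a narrower
# horizontal FoV, assuming a pinhole camera model.
import math
import numpy as np

def crop_to_fov(img: np.ndarray, src_fov_deg: float, dst_fov_deg: float) -> np.ndarray:
    """img: (H, W, C) perspective image rendered with horizontal FoV src_fov_deg."""
    assert dst_fov_deg <= src_fov_deg
    h, w = img.shape[:2]
    # Focal length implied by the source FoV: f = W / (2 * tan(theta / 2))
    f = w / (2.0 * math.tan(math.radians(src_fov_deg) / 2.0))
    new_w = int(round(2.0 * f * math.tan(math.radians(dst_fov_deg) / 2.0)))
    left = (w - new_w) // 2
    return img[:, left:left + new_w]
```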

Extending depth prediction to complete unseen areas


The problem is more challenging when there is no overlap between the input RGB and the target depth. Fig. 6 visualizes depth-prediction metrics for the sideways (“Left” and “Right”) and “Back” orientations, given echoes and an RGB image. The input RGB and predicted depth both have a 90° FoV. Using an RGB image to infer the depth of an unseen orientation may benefit from the similarity and continuation of visible surfaces. The visual similarity between the forward and backward views is comparatively lower, which is reflected in the worse performance of the blue bar for “Back” depth prediction compared to “Left” and “Right”. The model using echoes alone (dashed black line) performs better than the one using the RGB image (blue bars). We also experiment with estimating the depth of a target orientation using the RGB images from the three remaining orientations; for instance, we use the RGB images from the “Left”, “Right”, and “Back” sides to predict the front depth. The result, shown as the dashed green line in Fig. 6, indicates that additional cameras can bring a performance gain, but at a substantial increase in computational complexity.

However, fusing echoes with a single RGB image (red bars) attains superior improvements. For all metrics, the red bar surpasses the blue bars, the black dashed line, and the green dashed line by a large margin. This reflects the efficacy of fusing echoes with the RGB image for exploiting geometric information. Specifically, we observe that, after fusing echoes with RGB, the performance differences among the “Left”, “Right”, and “Back” predictions become smaller for all metrics. This is interesting because it suggests that the echoes capture complementary spatial information for each orientation, and it again shows that echoes carry strong geometric cues outside the RGB FoV.

Navigating Using Echoes and RGB


We introduce PointGoal echo navigation (Fig. 3), which directly uses echoes to perceive the spatial cues of the physical space for 3D navigation. The echo navigation network is composed of an echo encoder and an action predictor. The echo encoder maps the binaural echoes into a feature vector. The action predictor processes the echo feature vector together with the GPS signal to predict the agent's actions. Moreover, we take advantage of audio-visual learning by fusing echoes with visual observations for better embodied 3D navigation.
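A minimal sketch of such an echo-navigation policy is shown below (PyTorch). The architecture details (layer sizes, the recurrent state, the discrete action set) are illustrative assumptions rather than the paper's exact design.

```python
# Hedged sketch of a PointGoal echo-navigation policy: echo encoder +
# action predictor conditioned on the GPS (relative goal) signal.
import torch
import torch.nn as nn

class EchoNavPolicy(nn.Module):
    def __init__(self, num_actions=4, feat_dim=512, gps_dim=2, hidden_dim=512):
        super().__init__()
        # Echo encoder: binaural echo spectrogram -> feature vector
        self.echo_encoder = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(inplace=True),
        )
        # Action predictor: echo feature + GPS -> action logits
        self.rnn = nn.GRUCell(feat_dim + gps_dim, hidden_dim)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, echo_spec, gps, hidden):
        # echo_spec: (B, 2, F, T), gps: (B, 2) relative goal, hidden: (B, hidden_dim)
        feat = self.echo_encoder(echo_spec)
        hidden = self.rnn(torch.cat([feat, gps], dim=1), hidden)
        return self.action_head(hidden), hidden   # e.g. {STOP, FORWARD, LEFT, RIGHT}
```

In the audio-visual variant, features from a visual encoder would simply be concatenated with the echo feature before the recurrent action predictor.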

Navigation Trajectories Visualization


Fig. 7 shows examples of navigation trajectories on top-down maps using echoes, compared with using the raw RGB and the original depth image. The PointGoal RGB agent moves back and forth (light blue path) and bumps into obstacles multiple times. In contrast, the echo agent, especially in the highlighted regions (dashed red circles), better senses the obstacles and efficiently avoids backtracking.