Article

Multilevel Inverse Patchmatch Network with Local and Global Refinement for Underwater Stereo Matching

1 Haide College, Ocean University of China, Qingdao 266100, China
2 College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2023, 11(5), 930; https://0-doi-org.brum.beds.ac.uk/10.3390/jmse11050930
Submission received: 7 April 2023 / Revised: 22 April 2023 / Accepted: 25 April 2023 / Published: 26 April 2023
(This article belongs to the Section Ocean Engineering)

Abstract

Vision-based underwater autonomous systems play a significant role in marine exploration. Stereo matching is one of the most popular applications for vision-based underwater autonomous systems; it recovers the geometric information of underwater scenes via stereo disparity estimation. While stereo matching in air has achieved great progress with the development of neural networks, it generalizes poorly to the underwater scenario due to challenging underwater degradation. In this paper, we propose a novel Multilevel Inverse Patchmatch Network (MIPNet) to iteratively model pair-wise correlations under underwater degradation and estimate stereo disparity with both local and global refinements. Specifically, we first utilize an inverse Patchmatch module in a novel multilevel pyramid structure to recover detailed stereo disparity from the input stereo images. Second, we introduce a powerful Attentional Feature Fusion module to model pair-wise correlations with global context, ensuring high-quality stereo disparity estimation for both in-air and underwater scenarios. We evaluate the proposed method on the popular real-world ETH3D benchmark, and its highly competitive performance against popular baselines demonstrates its effectiveness. Moreover, its superior performance on our real-world underwater dataset, where it outperforms the popular baseline RAFT-Stereo by 27.1%, shows the good generalization ability of our method to underwater scenarios. We finally discuss the potential challenges for underwater stereo matching via our experiments on the impact of water.

1. Introduction

Stereo matching aims at recovering a scene's geometric information via stereo disparity estimation, which is a long-standing vision task for many underwater autonomous systems [1,2] and applications [3,4,5]. Early rectified stereo matching methods focus on finding the per-pixel disparity along the horizontal baseline of a rectified binocular image pair in a handcrafted manner. This standard pipeline [6] can be roughly divided into four parts: per-pixel matching cost computation, cost aggregation, disparity optimization, and disparity refinement. Despite their high efficiency, these traditional methods are not able to achieve satisfactory results on recent challenging benchmarks.
Recent stereo matching techniques have achieved tremendous progress in the air environment with the rapid development of deep learning. Although recent advanced network designs bring performance gains to some extent, their complex modules and large numbers of parameters usually limit underwater application due to high computational and memory costs [7]. To address this limitation, some works propose to treat stereo matching as a special case of optical flow estimation with a horizontal-line constraint, inspiring the recent popular iterative methods that show notable performance with high efficiency. These methods can be roughly classified as local-based or global-based according to how they model the pair-wise correlation.
Global-based methods, such as RAFT-Stereo [8], compute all-pair correlations in a global view and utilize a multilevel structure to update the correlations with manageable memory consumption for standard input image sizes. However, processing high-resolution images remains a major challenge, especially for underwater scenarios. Local-based methods, such as CREStereo [9], leverage local adaptive correlation to predict the stereo disparity. Nevertheless, their results are sometimes unsatisfactory under occlusion or in textureless cases, especially for underwater scenarios [10]. Learning-based underwater stereo matching is rarely studied, and it remains challenging to directly apply existing in-air stereo matching models to underwater scenarios due to the complex conditions of underwater degradation, occlusions, textureless regions, and low illumination. Moreover, due to the irreconcilable difference between in-air and underwater features, it is also difficult for a pretrained in-air model to extract underwater features directly, which may result in magnified local defects.
To address these limitations and explore a stereo matching approach for both in-air and underwater purposes, we propose to model accurate pixel correspondences from the perspective of Patchmatch [11]. Generally, a Patchmatch algorithm contains three basic parts: random initialization, propagation, and random local search. After random initialization, the initialized correlations are passed to their neighbors by propagation according to the law of large numbers. Afterwards, a random search is adopted to refine the correlations. For robust underwater stereo matching with high efficiency, in this paper, we present the novel Multilevel Inverse Patchmatch Network (MIPNet) that iteratively leverages the inverse propagation strategy [12] in a multilevel pyramid structure to model the pair-wise correlation for stereo matching. With a novel Attentional Feature Fusion module, we iteratively extract deep features with a ResNet [13] encoder to progressively increase the receptive field for underwater feature extraction. The Patchmatch module, which consists of an inverse propagation module and a rectangular local search block, operates on different resolution levels to update the disparity in a coarse-to-fine manner and output the refined prediction, ensuring a robust disparity search that can deal with underwater degradation, occlusions, textureless regions, and low illumination. The overall structure of our network is illustrated in Figure 1.
Different from most existing methods, our approach has two major advantages. On the one hand, it only performs correlation on nearby pixels iteratively, instead of constructing a global matching cost, which will reduce memory costs significantly. On the other hand, we update the correlation by propagation, and a local random search is performed only for refinement, avoiding the defects caused by using a local search. The highly competitive results on the public benchmark ETH3D [14] and the real-world underwater dataset demonstrate the effectiveness as well as the good generalization ability of our proposed method for underwater stereo matching research.

2. Related Works

2.1. Stereo Matching in Atmosphere

When developing stereo matching methods, measuring per-pixel matching costs and aggregating them is extremely important, and most impactful works contribute to the development of accurate and economical cost volumes. In the era of traditional methods, there were many inspiring ideas, such as using a support window centered at each pixel to compute the matching cost [15,16] and using belief propagation to formulate the cost function [17,18]. Patchmatch was also introduced at this time [19].
After Zbontar and LeCun [20] trained a convolutional neural network to evaluate the matching score, the performance of such algorithms improved rapidly. Mayer et al. proposed the first end-to-end networks, DispNet/DispNetC [21], based on 2D convolutions. Kendall et al. [22] constructed the first 3D architecture, which aims to act as a differentiable approximation of classical filtering algorithms. All these methods achieved state-of-the-art performance at the time, but early end-to-end networks were plagued either by poor generalization ability or by high computation costs. To address these issues, Guo et al. [23] designed group-wise correlation to improve similarity measurement and save memory at the same time. Yang et al. [24] proposed a hierarchical coarse-to-fine architecture based on pyramidal approaches to deal with high-resolution images. Shen et al. [25] constructed a pyramid cost volume to cover multi-scale receptive fields and a 3D warping volume to refine the initial disparity map. Moreover, networks based on optical flow estimation were introduced into stereo matching at this time to deal with the above issues. Tankovich et al. [26] guided the stereo predictions using predicted tiles to leverage the planar geometry of the scene as an inductive prior in the network design. Lipson et al. [8] constructed a 3D correlation volume and introduced a multilevel update operator to provide robust disparity predictions and save memory by avoiding computation at high resolutions. Li et al. [9] introduced an adaptive group correlation layer that is more memory-efficient and overcomes imperfect calibration for real-world stereo cameras. They also constructed a synthetic dataset that focuses on challenging cases in real-world scenes and features various enhancements. Moreover, Cheng et al. [27] introduced Neural Architecture Search to reduce the human effort in neural network design.
Another line of work focuses on improving the disparity or matching cost. Most traditional methods and early machine learning methods within the stereo pipeline use individual steps such as a consistency check [28] to obtain more accurate disparity maps along the disparity dimension. After the emergence of end-to-end networks, many works focused on optimizing other dimensions. Liu et al. [29] designed a feature fusion module based on the Local Similarity Pattern to provide higher-quality feature maps. Xu et al. [30] used deformable convolution to enhance the standard convolution's capability of modeling geometric transformations; this strategy is also used in [9,29], where a Cost Self-Reassembling module and a deformable search window are designed, respectively. Patchmatch was also introduced into a 3D architecture by DeepPruner [31] to improve the distinctiveness of the matching cost generated by the backbone. Xu et al. [32] proposed multilevel adaptive patch matching to generate reliable attention weights, which aim to suppress redundant information and enhance matching-related information in the concatenation volume. However, this is only a refinement method, so a high memory cost is still unavoidable.

2.2. Stereo Matching under Water

Compared to stereo matching in the atmosphere, underwater stereo matching is much less developed. Xu et al. [7] used zero-based normalized cross-correlation and the Hamming distance to measure similarity and designed the smoothness term based on the color metric to overcome the discontinuity of the disparity map. Zhuang et al. [10] proposed a new method to express underwater images via a direction-information image and position image, and an image optimization method that fuses color correction and dark channels in advance.
In general, most works chose to improve the performance by using extra techniques, such as camera calibration [33], instead of the algorithms themselves. Only a few of them focused on improving the algorithms by using the unique characteristics of images from underwater scenarios, and most of them are traditional methods or early machine learning methods within the stereo pipeline. Therefore, our method based on an end-to-end network is certainly innovative for underwater stereo matching.

3. Methods

The main contribution of our method is that we have customized various components in our network for underwater scenarios.

3.1. Inverse Patchmatch for Stereo Matching

Calculating correlations by Patchmatch is the most crucial step in our method. To better adapt to underwater stereo matching, we first integrated prior experience obtained by observing public datasets while initializing the disparity field. After this, we introduced a process named “Inverse Propagation” [12], which was first developed for optical flow estimation. During this process, we added a vertical constraint to maintain the disparity predictions on the horizontal line. Moreover, due to the unequal states between the vertical and horizontal directions, we applied a rectangular local search window to refine the correlation computed in the inverse propagation stage.

3.1.1. Gamma Initialization

According to the law of large numbers, there are always some good estimates when initializing the disparity field with a large number of random samples. In theory, we can use any distribution to generate a random disparity field as long as the appropriate range is covered. However, providing more good estimates helps the network propagate more correct information during the early training stage, which may allow the network to converge faster. According to the statistics in [34], the distribution of disparity values in most public datasets is not uniform. Considering the shape of this distribution, we use a gamma distribution for initialization, which is different from [12,31]. Eventually, our final model only needs 6 iterations to obtain very competitive results.
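As a minimal illustration of this step (not the exact implementation), the coarsest-level disparity field could be initialized by sampling every pixel from a gamma distribution; the concentration and rate values below are illustrative assumptions, since only the family of the distribution is specified above.

```python
import torch

def gamma_init_disparity(batch, height, width, concentration=2.0, rate=0.1, device="cpu"):
    """Randomly initialize a positive disparity field from a gamma distribution.

    The concentration/rate values are placeholders chosen for illustration; the
    text above only states that a gamma-shaped distribution matches the disparity
    statistics of common benchmarks better than a uniform one.
    """
    dist = torch.distributions.Gamma(concentration, rate)
    # One disparity value per pixel at the coarsest (e.g., 1/16) resolution.
    return dist.sample((batch, 1, height, width)).to(device)

# Example: initialization at 1/16 of a 1920 x 1080 input.
disp0 = gamma_init_disparity(1, 1080 // 16, 1920 // 16)
```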

3.1.2. Inverse Propagation for Stereo Matching

The key idea of propagation is to pass information to neighbors. In this work, the information that we need to pass is the matching cost under the current disparity prediction. Specifically, we regard the element-wise multiplication of two feature maps as the correlation. Given a disparity prediction d, the target feature map should be warped according to the shift in disparity prediction, so the correlation during the propagation process can be formulated as
C = F_1 · W(F_2, d + Δd)
where C is the 4-dimensional correlation cost volume, F_1 and F_2 are the feature maps, W is the warping operation, and d and Δd are the current disparity field and the shift disparity, respectively. It is noteworthy that we add an extra constraint in the vertical direction, so the disparity prediction d only falls on the horizontal line.
However, as mentioned in [12], when choosing to receive information from n neighbor seed points and recurrently update it m times, there will be n × m shifting and warping operations. Therefore, to overcome the computation cost caused by traditional propagation, inverse propagation was proposed. Differing from the original process, inverse propagation directly shifts the feature map with the shift disparity prediction Δ d before warping the shifted feature map and offsets the shift disparity prediction later. The process can be formulated as
C = F_1 · S(W(S(F_2, Δd), d), Δd)
where S represents the shift operation.
This is reasonable because the disparity shift and feature warping are serial and coupled operations, which means that inverse propagation is equivalent to the original propagation in terms of the resulting correlation, while avoiding many operations by design. As a target feature point is scattered to its seed points, the target features can be shifted and stacked in advance. Moreover, since Δd is very small in each iteration, if we ignore the backpropagation, the inverse propagation process becomes
C = F_1 · W(S(F_2, Δd), d)
and we can only perform a warping operation once during each iteration. The process of inverse propagation is illustrated in Figure 2.
In our implementation, we follow [12] and choose 4 static seed neighbors to balance performance and complexity. In our final model, we only need to shift the target features 4 times and warp the shifted target features 5 times to obtain a very accurate correlation.
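The sketch below illustrates the two primitives that the propagation step builds on: warping the right-view features along the horizontal baseline by the current disparity (the vertical constraint), and taking the element-wise product of the two feature maps as the matching cost. The channel-wise reduction and the helper names are assumptions made for illustration, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def warp_by_disparity(feat, disp):
    """Warp right-view features along the horizontal baseline by `disp`.

    feat: (B, C, H, W) right features; disp: (B, 1, H, W) positive disparities.
    Only the x coordinate is displaced, which realizes the vertical constraint.
    """
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    xs = xs.unsqueeze(0).float() - disp.squeeze(1)      # sample d pixels to the left
    ys = ys.unsqueeze(0).float().expand_as(xs)
    grid = torch.stack([2.0 * xs / (w - 1) - 1.0,       # normalize to [-1, 1] for grid_sample
                        2.0 * ys / (h - 1) - 1.0], dim=-1)
    return F.grid_sample(feat, grid, align_corners=True)

def correlation(left_feat, warped_right_feat):
    """Per-pixel matching cost from the element-wise product of the feature maps.

    The channel-wise mean is one possible reduction; the text above only
    specifies the element-wise multiplication itself.
    """
    return (left_feat * warped_right_feat).mean(dim=1, keepdim=True)
```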

3.1.3. Rectangular Local Search Window

In order to pursue state-of-the-art amphibious performance, we additionally refine the local correlation. We also perform a local neighborhood search with a fixed radius. This radius is much smaller than the size of the images, so it is very memory-efficient. Unlike [9,12], we developed a rectangular local search window to emphasize the unequal states between the vertical and horizontal directions. In general, the rectangular local search can be formulated as
C = F_1 · S(W(F_2, d), Δd)
and the rectangular local search block is illustrated in Figure 3.
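A hedged sketch of this search is shown below: correlations are gathered over a window of candidate shifts whose horizontal radius is larger than its vertical radius (3 and 1, the radii chosen in the ablation of Section 4.3). It reuses the hypothetical warp_by_disparity helper from the previous sketch; the circular shift and border handling are simplifications rather than the exact implementation.

```python
import torch

def rectangular_local_search(left_feat, right_feat, disp, rx=3, ry=1):
    """Local correlation volume over a rectangular window around the current disparity."""
    warped = warp_by_disparity(right_feat, disp)   # warp once by the current disparity d
    costs = []
    for dy in range(-ry, ry + 1):                  # small vertical radius
        for dx in range(-rx, rx + 1):              # larger horizontal radius
            # Candidate shift (dx, dy); circular shift used here for simplicity.
            shifted = torch.roll(warped, shifts=(dy, dx), dims=(2, 3))
            costs.append((left_feat * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)                 # (B, (2*ry + 1) * (2*rx + 1), H, W)
```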
Thus far, we have introduced the components of inverse Patchmatch for stereo matching. In the following sections, we will introduce the overall architecture of our network, including the feature extraction and multilevel update pyramid.

3.2. Network Architecture

Our network is composed of an encoder and an iterative decoder. In our encoder, we added an iterative Attentional Feature Fusion module to improve the performance in underwater scenarios by providing feature maps with higher quality. As for the decoder, it is composed of a series of inverse Patchmatch modules operating at different resolutions.

3.2.1. Feature Extraction with Global Context

The aim of our encoder (Figure 4) is to map the rectified binocular image pairs to dense feature maps so that our iterative decoder can use them to predict the disparity field. In [8], the authors introduced two types of feature extractors, the feature encoder and context encoder, where the former is used to compute the correlation and the latter is used to initialize the following modules and provide the information of the relationship in context to avoid local noise. As in many previous works [8,9,23], we used residual blocks [13] to extract these two types of features and then construct a feature pyramid, but we did not capture them separately. Instead, we follow [9,12], where the context is split from part of the feature map. This will slim the module and reduce computation and memory costs.
Meanwhile, reflection, scattering and other underwater phenomena can introduce much more noise. Inspired by [29], we use more context information to refine the features obtained by the residual blocks, and the solution that we adopt is to increase the receptive field while extracting features. Most works choose to fuse multi-scale features via interpolation and element-wise addition, but we use element-wise adaptive fusion weights to suppress low-level noise. More specifically, we adopt the two-stage iterative Attentional Feature Fusion proposed in [35], which has proven its ability in classification tasks. The element-wise adaptive fusion weight is generated by a multi-channel attention module, which combines a local channel context L(X) and a global channel context G(X), computed by
L(X) = B(PWConv_2(δ(B(PWConv_1(X)))))
G(X) = B(W_2 δ(B(W_1 g(X))))
where X is an input feature, PWConv_1 and PWConv_2 are point-wise convolutions forming a bottleneck structure, g(X) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} X[:, i, j] is global average pooling, δ is the Rectified Linear Unit [36], B represents Batch Normalization [37], and W_1 and W_2 are two fully connected layers. The feature is then refined by
X′ = X ⊗ σ(L(X) ⊕ G(X))
where σ is the sigmoid function, ⊗ represents element-wise matrix multiplication, and ⊕ represents broadcasting addition. After the above operations, the element-wise adaptive fusion weight used to refine and fuse features is obtained.
Thus far, we have provided the weight to refine features while they are fused. However, ref. [35] points out that it is also important to consider how features are injected into the attention module, so an iterative approach is proposed. It is composed of two intertwined multi-channel attention modules, so that the weights used when injecting features into the attention module can be updated iteratively. In general, the iterative attentional fusion can be formulated as
Z = M(X ⊎ Y) ⊗ X + (1 − M(X ⊎ Y)) ⊗ Y
where X ⊎ Y = M(X + Y) ⊗ X + (1 − M(X + Y)) ⊗ Y and M is the multi-channel attention module mentioned above.
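A compact sketch of this fusion, following the public design of [35], is given below: an attention module M produces the element-wise fusion weights from a local branch of point-wise convolutions and a global branch on pooled features, and two such modules are stacked for the iterative scheme. The channel-reduction ratio and the exact layer arrangement are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """The attention module M: a local point-wise-convolution branch plus a
    global branch on globally averaged features (a sketch of [35])."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.local = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels))
        self.glob = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels))

    def forward(self, x):
        # L(X) ⊕ G(X) by broadcasting, then the sigmoid gives the fusion weight.
        return torch.sigmoid(self.local(x) + self.glob(x))

class IterativeAFF(nn.Module):
    """Two-stage iterative attentional fusion of feature maps X and Y."""
    def __init__(self, channels):
        super().__init__()
        self.m1 = ChannelAttention(channels)
        self.m2 = ChannelAttention(channels)

    def forward(self, x, y):
        w1 = self.m1(x + y)             # first-stage weight M(X + Y)
        xy = w1 * x + (1.0 - w1) * y    # intermediate fusion X ⊎ Y
        w2 = self.m2(xy)                # refined weight M(X ⊎ Y)
        return w2 * x + (1.0 - w2) * y
```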

3.2.2. Multilevel Update Pyramid

Our decoder is a multilevel update pyramid that consists of a series of inverse Patchmatch modules. In each inverse Patchmatch block, the inverse propagation block and rectangular local search block are followed by corresponding cross-connected GRUs [38]:
z_t = σ(Conv_{3×3}([h_{t−1}, x_t], W_z))
r_t = σ(Conv_{3×3}([h_{t−1}, x_t], W_r))
h̃_t = tanh(Conv_{3×3}([r_t ⊙ h_{t−1}, x_t], W_h))
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
For one inverse Patchmatch block, there is one inverse propagation block and one rectangular local search block, with their corresponding cross-connected GRUs concatenated to each other. After the correlation C is obtained by the inverse propagation block or rectangular local search block, it is concatenated with the current disparity field and the context feature separated from the output of the encoder, and then injected into the GRU to predict the shift disparity. After several iterations of updates, the predicted disparity estimation is upsampled [38] to a higher resolution level and another update process is launched. Eventually, after gamma initialization at a 1/16 resolution, the disparity field is updated 6 times and then upsampled to a 1/4 resolution for another 6 iterations.
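The update operator can be read directly off the GRU equations above; a minimal convolutional GRU cell, assuming 3 × 3 convolutions for all three gates, might look as follows.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """Convolutional GRU cell matching the update equations above (a sketch)."""
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        # x is the concatenation of correlation, current disparity and context.
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                         # update gate z_t
        r = torch.sigmoid(self.convr(hx))                         # reset gate r_t
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))  # candidate state
        return (1.0 - z) * h + z * q                              # new hidden state h_t
```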

3.3. Loss Function

Following [8,9], we supervise the full sequence of predictions.
L = Σ_{i=1}^{N} γ^{N−i} ‖d_gt − d_i‖_1,   γ = 0.8
where d_gt represents the ground-truth disparity, d_i represents the estimated disparity at each iteration, and N is the number of GRU iterations.
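A straightforward sketch of this sequence loss is shown below; the use of a validity mask to exclude pixels without ground truth is an assumption, as it is not stated explicitly above.

```python
def sequence_loss(disp_preds, disp_gt, valid_mask, gamma=0.8):
    """Exponentially weighted L1 loss over the sequence of disparity predictions."""
    n = len(disp_preds)
    loss = 0.0
    for i, disp in enumerate(disp_preds):
        weight = gamma ** (n - i - 1)                 # the last prediction gets weight 1
        err = (disp - disp_gt).abs() * valid_mask     # mask out invalid ground truth
        loss = loss + weight * err.sum() / valid_mask.sum().clamp(min=1)
    return loss
```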

4. Experiment

All models mentioned in the following sections were implemented in PyTorch and trained on four RTX2080Ti GPUs with 11 GB memory, using the AdamW optimizer and a one-cycle learning rate schedule [39]. We performed a two-stage pretraining process on the synthetic dataset proposed in [9], and the ablation study was conducted on the Sceneflow [21] cleanpass subset. Moreover, we evaluated the method on the ETH3D [14] benchmark and our hand-crafted underwater dataset. All experiments in this section, excluding the results submitted to the benchmark, are reported as the mean of three independent runs.

4.1. Dataset and Metrics

Synthetic dataset. The models used for submission to the public benchmark and for comparison with other baselines are pretrained on the synthetic dataset proposed in ref. [9]. It is composed of approximately 200K image pairs, each rendered in Blender with dual virtual cameras and custom-positioned objects. This dataset addresses the generalization from synthetic datasets to real-world scenarios, which enables us to pursue better performance on public benchmarks and in underwater scenarios. For the ablation study, we chose another synthetic dataset, Sceneflow [21]. It consists of three subsets: FlyingThings3D, Monkaa and Driving. This dataset covers a wide range of scenarios, including common objects, animated short films and street scenes. The cleanpass subset that we used consists of around 15K image pairs. The metrics that we monitor are the end-point error (EPE), calculated as (1/N) Σ_{i=1}^{N} |d_i − d_i*| over the N valid pixels, and the 1-pixel error (>1 px), the percentage of pixels with an EPE larger than 1.
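For reference, both metrics can be computed in a few lines; this is a generic sketch rather than the evaluation code behind the reported numbers.

```python
def epe_and_bad_px(disp_pred, disp_gt, threshold=1.0):
    """End-point error and the percentage of pixels whose error exceeds `threshold`."""
    err = (disp_pred - disp_gt).abs()
    epe = err.mean().item()
    bad = 100.0 * (err > threshold).float().mean().item()
    return epe, bad
```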
ETH3D. ETH3D [14] is a real-world dataset that covers indoor and outdoor scenes. It is recorded by a high-precision laser scanner and captured by DSLR cameras. The stereo matching portion consists of a training set with 27 image pairs and a test set with 20 image pairs. The default metric on the benchmark is the 1-pixel error, but it also provides other metrics, such as the 90% error quantile (A90) and the root-mean-square disparity error (rms) in pixels.
Underwater dataset. Our underwater dataset consists of 49 high-resolution (1920 × 1080) image pairs. The images were captured with a ZED camera, and we also used a high-precision laser scanner to generate the ground-truth disparity. To simulate real-world underwater scenarios, some images were taken under a low-light condition, similar to the situation in the deep ocean, while others were taken in more turbid liquid, which simulates nearshore water. We compare our results with other baselines on metrics similar to those of ETH3D.
Some samples from the above datasets are shown in Figure 5.

4.2. Data Augmentation

Appropriate data augmentation techniques can significantly improve the performance of networks. In this work, we followed RAFT-Stereo [8] and applied saturation changes, perturbation and stretching before feeding the data into our network. Specifically, we adjusted the image saturation of both left and right images between 0 and 1.4, and the right images were perturbed to simulate imperfect rectification. Additionally, to cover a broader disparity range in the pretraining process, the images and ground-truth disparity were stretched by a factor in the range [2^0.2, 2^0.4]. One sample after data augmentation is shown in Figure 6.
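A simplified sketch of this augmentation is given below. It covers only the saturation jitter and the spatial stretch (the asymmetric perturbation of the right image is omitted), and the interpolation choices are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import ColorJitter

def augment_pair(left, right, disp, scale_min=2 ** 0.2, scale_max=2 ** 0.4):
    """Saturation jitter plus a spatial stretch of the images and disparity."""
    jitter = ColorJitter(saturation=(0.0, 1.4))
    left, right = jitter(left), jitter(right)

    # Stretch images and scale the ground-truth disparity by the same factor.
    s = float(torch.empty(1).uniform_(scale_min, scale_max))
    size = (int(left.shape[-2] * s), int(left.shape[-1] * s))
    left = F.interpolate(left.unsqueeze(0), size=size, mode="bilinear",
                         align_corners=True).squeeze(0)
    right = F.interpolate(right.unsqueeze(0), size=size, mode="bilinear",
                          align_corners=True).squeeze(0)
    disp = F.interpolate(disp.unsqueeze(0), size=size, mode="bilinear",
                         align_corners=True).squeeze(0) * s
    return left, right, disp
```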

4.3. Ablation Study

Vertical constraint. As shown in Table 1, if we cancel the vertical constraint, which means that we directly handle the stereo matching task as an optical flow estimation task, the result shows that the entire algorithm will fail. This implies that although stereo matching can be regarded as a special case of optical flow estimation, it is still not practical to merge these two tasks intuitively.
Attentional Feature Fusion. We implemented a very lightweight attention mechanism in our feature encoder, which aims to provide feature maps with higher quality. According to the results shown in Table 1, if we replace the attentional fusion with a plain addition Z = X + Y, where X and Y are the given feature maps, the performance of our network drops.
Inverse Propagation and Rectangular Local Search. In fact, the inverse propagation module and rectangular local search module can perform stereo matching independently, so we launched two experiments to test their impacts on the performance. The results in Table 1 show that the inverse propagation module has more parameters than the rectangular local search module and makes a greater contribution to the accuracy of the network.
Radius of Rectangular Local Search. To prove the effectiveness of the rectangular local search module, we also explored different searching radii. To balance performance and computation costs, we chose the vertical searching radius to be 1 and horizontal searching radius to be 3, according to the results shown in Table 2. Moreover, an excessive searching radius in the vertical direction will cause a significant drop in the performance.
Computation Cost. We also tested the impact of different components on the computation cost on our underwater dataset. The results of the fine-tuned model are shown in Figure 7. We can see that the inverse propagation (IP) occupies most of the time and memory in the process, but it is still unwise to remove any of the components. Moreover, our full model is able to handle a 1920 × 1080 image pair in 0.53 s on a single RTX2080Ti GPU and only uses 5.54 GB of memory. This implies that our method is very memory- and time-efficient. Additionally, to prove the efficiency of our method, we also tested our full model on an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz, where the runtime was 95.36 s, which means that our method can even be applied to edge devices where resources are limited.

4.4. Performance in Amphibious Scenarios

The results shown in the following sections were obtained based on our pretrained model. We launched a two-stage pretraining process on the 200K synthetic dataset in [9]. In the first stage, the network was trained with a maximum learning rate of 6 × 10^−4 and a batch size of 8. The random crop size was set to 512 × 384. The first stage lasted for 100K iterations. In the second stage, the network was trained for another 100K iterations with a maximum learning rate of 2 × 10^−4 and a batch size of 4. The random crop size was set to 768 × 384 at this stage.
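The stage-1 optimization setup could be reproduced roughly as follows; the weight decay, warm-up fraction, and the MIPNet and train_loader names are assumptions, since only the optimizer, schedule, peak learning rate, batch size and step count are stated above.

```python
import torch

model = MIPNet()                         # hypothetical model class name
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=6e-4, total_steps=100_000, pct_start=0.01,
    cycle_momentum=False, anneal_strategy="linear")

for left, right, disp_gt, valid in train_loader:      # assumed data loader of crops
    preds = model(left, right)                        # list of per-iteration disparities
    loss = sequence_loss(preds, disp_gt, valid)       # loss sketched in Section 3.3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```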
We monitored the metrics in the pretraining process. As seen in Figure 8 and Figure 9, the end-point error was calculated on the training batch for each step and the valid EPE was calculated on the validation set.
We can notice that while implementing a two-stage pretraining process, the model can achieve very competitive performance at stage 1 and can still obtain additional improvements after stage 2. This phenomenon fits the design of our network.

ETH3D Benchmark

As with many other real-world datasets, ETH3D is too small to train on directly, so we applied a fine-tuning strategy. We fine-tuned for 100 iterations, with a maximum learning rate of 2 × 10^−4 and a batch size of 8. The random crop size was set to 512 × 384. During our experiments on the validation set, we noticed that although we could improve the performance using extra inverse Patchmatch iterations, it is not worthwhile considering the time cost.
Observing the samples in Figure 10, we noticed that our method could handle the overexposed and underexposed areas extraordinarily well, which will especially benefit underwater stereo matching. Eventually, our results were compared to other baselines, as shown in Table 3.

4.5. Underwater Scenarios

In this section, we describe fine-tuning on the training set for 7000 steps with a batch size of 4 and a maximum learning rate of 4 × 10^−4. The random crop size was 640 × 400.
Different from scenarios in the atmosphere, the underwater environment significantly increases the difficulty of stereo matching, despite the fact that our images were captured at a high resolution. From the results shown in Table 4, our method outperforms the others by 27.1%, but the accuracy drops significantly compared to the results on ETH3D. It is noteworthy that all methods shown in Table 4 were fine-tuned on the training set, and the inference of all methods was performed on a single RTX2080Ti GPU.
In order to better visualize the output of our network, we transform our disparity maps into depth maps by
depth = (b · f) / d
where b is the length of the baseline, f is the focal length and d is the disparity. The final depth maps and their ground truths are also shown in Figure 11.
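This conversion is a one-liner; the baseline and focal length of the ZED camera are not stated above, so any values plugged into this sketch are assumptions.

```python
def disparity_to_depth(disp, baseline_m, focal_px, eps=1e-6):
    """Convert a disparity map (in pixels) to depth via depth = b * f / d."""
    return baseline_m * focal_px / (disp + eps)
```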
Moreover, to further explore the impact of water on underwater stereo matching, we launched a series of experiments with different numbers of GRU iterations. Figure 12 shows the experiments on the turbid group in our test set. In the first image pair (first row), where there are many larger objects, the performance of our network seems to be more affected by the turbid water, which leads to significant bias in the first iteration. However, for the second image pair, where there are more small objects, the bias caused by turbid water is not as significant as in the first pair but requires more GRU iterations to be erased. Meanwhile, smaller objects are more difficult to perceive, especially for underwater environments. This is why we implement the Attentional Feature Fusion module in our feature encoder.
For the low-light group in the test set shown in Figure 13, we adopted a similar strategy. In the first image pair, the objects appear very vague against the low-light environment, and our network fails to distinguish them from the background. This reflects the oversmoothing problem, which is very common in convolutional networks. Although our Attentional Feature Fusion module can alleviate this phenomenon, it is still a great challenge for underwater stereo matching. As for the second and third image pairs, the main challenge is to recover the details of the objects. Our network shows a strong ability to deal with this issue.
Furthermore, comparing Figure 12 and Figure 13, we can notice that in different environments, the network can achieve satisfactory results with different numbers of iterations. This implies that we can adjust it in real-world practice to pursue real-time stereo matching.

5. Conclusions

In this paper, we have proposed a novel Multilevel Inverse Patchmatch Network with local and global refinement to perform underwater stereo matching. Based on the similarity between the disparity and optical flow, we utilize inverse propagation, which was originally designed to calculate the correlation for optical flow estimation. Specifically, we maintain the displacement falling on the horizontal line to satisfy the stereo matching task. Moreover, to improve the performance of our network in underwater scenarios, we customize other components, such as adding an attention mechanism for feature fusion and resizing the local search window.
Eventually, after pretraining our networks on a large synthetic dataset, we fine-tuned the pretrained models on the ETH3D dataset and our hand-crafted underwater dataset. According to the results shown on the benchmark’s official website and our dataset, our methods achieved competitive performance. Additionally, we launched a series of ablation studies to show the impact of each component on the performance of our network, as well as a series of experiments to explore the impact of water.
When analyzing the results in underwater scenarios, we discovered that although our method had already outperformed other novel methods, it still suffered from noise in the underwater environment. In the future, we plan to construct an underwater synthetic dataset where we can emphasize the impact of noise. We believe that this can significantly improve the performance. Moreover, our network also suffers from the oversmoothing problem. Inspired by [29], we plan to add cross-entropy loss to our loss function because it has been proven to be useful to overcome the oversmoothing problem [46].

Author Contributions

Conceptualization, J.L., S.Z. and H.F.; Data curation, J.L.; Formal analysis, J.L.; Investigation, J.L.; Methodology, J.L.; Project administration, J.L. and H.F.; Resources, Q.L. and H.F.; Software, J.L. and Q.L.; Supervision, S.Z. and H.F.; Validation, J.L., S.Z., Y.R., H.F. and Y.L.; Visualization, J.L.; Writing—original draft, J.L.; Writing—review and editing, J.L., S.Z., Y.R., H.F. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the Natural Science Foundation of Shandong Province (Grant No. ZR2020QF030), the China Postdoctoral Science Foundation (Grant No. 2020M672144), the Hainan Provincial Joint Project of Sanya Yazhou Bay Science and Technology City (2021JJLH0061), and the National Natural Science Foundation of China (Grant No. 41927805, 42106193, 41906177).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The synthetic dataset in ref. [9] can be obtained at https://github.com/megvii-research/CREStereo (accessed on 6 April 2023). The Sceneflow synthetic dataset [21] is available at https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html (accessed on 6 April 2023). The ETH3D dataset [14] and the reported results are available at https://www.eth3d.net (accessed on 6 April 2023). The underwater dataset is unavailable due to privacy restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AFF  Attentional Feature Fusion
RLS  Rectangular Local Search
IPM  Inverse Patchmatch Module
IP   Inverse Propagation

References

  1. Klapwijk, M.; Lemaire, S. And…Action! Setting the Scene for Accurate Visual CFD Comparisons Using Ray Tracing. J. Mar. Sci. Eng. 2021, 9, 1066. [Google Scholar] [CrossRef]
  2. Sun, B.; Mei, Y.; Yan, N.; Chen, Y. UMGAN: Underwater Image Enhancement Network for Unpaired Image-to-Image Translation. J. Mar. Sci. Eng. 2023, 11, 447. [Google Scholar] [CrossRef]
  3. Cebrián-Robles, D.; Ortega-Casanova, J. Low cost 3D underwater surface reconstruction technique by image processing. Ocean Eng. 2016, 113, 24–33. [Google Scholar] [CrossRef]
  4. Drap, P.; Seinturier, J.; Hijazi, B.; Merad, D.; Boi, J.M.; Chemisky, B.; Seguin, E.; Long, L. The ROV 3D Project. J. Comput. Cult. Herit. 2015, 8, 1–24. [Google Scholar] [CrossRef]
  5. Williams, K.; Rooper, C.N.; De Robertis, A.; Levine, M.; Towler, R. A method for computing volumetric fish density using stereo cameras. J. Exp. Mar. Biol. Ecol. 2018, 508, 21–26. [Google Scholar] [CrossRef]
  6. Scharstein, D.; Szeliski, R.; Zabih, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In Proceedings of the IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001), Kauai, HI, USA, 9–10 December 2001; pp. 131–140. [Google Scholar] [CrossRef]
  7. Xu, Y.; Yu, D.; Ma, Y.; Li, Q.; Zhou, Y. Underwater stereo-matching algorithm based on belief propagation. Signal Image Video Process. 2022, 17, 891–897. [Google Scholar] [CrossRef]
  8. Lipson, L.; Teed, Z.; Deng, J. RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching. In Proceedings of the 2021 International Conference on 3D Vision (3DV), Online, 1–3 December 2021; pp. 218–227. [Google Scholar] [CrossRef]
  9. Li, J.; Wang, P.; Xiong, P.; Cai, T.; Yan, Z.; Yang, L.; Liu, J.; Fan, H.; Liu, S. Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16242–16251. [Google Scholar] [CrossRef]
  10. Zhuang, S.; Zhang, X.; Tu, D.; Ji, Y.; Yao, Q. A dense stereo matching method based on optimized direction-information images for the real underwater measurement environment. Measurement 2021, 186, 110142. [Google Scholar] [CrossRef]
  11. Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch. ACM Trans. Graph. 2009, 28, 1–11. [Google Scholar] [CrossRef]
  12. Zheng, Z.; Nie, N.; Ling, Z.; Xiong, P.; Liu, J.; Wang, H.; Li, J. DIP: Deep Inverse Patchmatch for High-Resolution Optical Flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8925–8934. [Google Scholar]
  13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  14. Schops, T.; Schonberger, J.L.; Galliani, S.; Sattler, T.; Schindler, K.; Pollefeys, M.; Geiger, A. A Multi-view Stereo Benchmark with High-Resolution Images and Multi-camera Videos. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  15. Van Meerbergen, G.; Vergauwen, M.; Pollefeys, M.; Van Gool, L. A hierarchical stereo algorithm using dynamic programming. In Proceedings of the IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001), Kauai, HI, USA, 9–10 December 2002. [Google Scholar] [CrossRef]
  16. Birchfield, S.; Tomasi, C. Depth discontinuities by pixel-to-pixel stereo. In Proceedings of the Sixth International Conference on Computer Vision, Bombay, India, 4–7 January 1998. [Google Scholar] [CrossRef]
  17. Klaus, A.; Sormann, M.; Karner, K. Segment-Based Stereo Matching Using Belief Propagation and a Self-Adapting Dissimilarity Measure. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006. [Google Scholar] [CrossRef]
  18. Yang, Q.; Wang, L.; Yang, R.; Stewenius, H.; Nister, D. Stereo Matching with Color-Weighted Correlation, Hierarchical Belief Propagation, and Occlusion Handling. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 492–504. [Google Scholar] [CrossRef] [PubMed]
  19. Bleyer, M.; Rhemann, C.; Rother, C. PatchMatch Stereo - Stereo Matching with Slanted Support Windows. In Proceedings of the British Machine Vision Conference 2011, Dundee, UK, 29 August–2 September 2011; pp. 14.1–14.11. [Google Scholar] [CrossRef]
  20. Zbontar, J.; LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 2016, 17, 2287–2318. [Google Scholar]
  21. Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048. [Google Scholar] [CrossRef]
  22. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-End Learning of Geometry and Context for Deep Stereo Regression. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 66–75. [Google Scholar] [CrossRef]
  23. Guo, X.; Yang, K.; Yang, W.; Wang, X.; Li, H. Group-Wise Correlation Stereo Network. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3268–3277. [Google Scholar] [CrossRef]
  24. Yang, G.; Manela, J.; Happold, M.; Ramanan, D. Hierarchical Deep Stereo Matching on High-Resolution Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5510–5519. [Google Scholar] [CrossRef]
  25. Shen, Z.; Dai, Y.; Song, X.; Rao, Z.; Zhou, D.; Zhang, L. PCW-Net: Pyramid Combination and Warping Cost Volume for Stereo Matching. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Lecture Notes in Computer Science. pp. 280–297. [Google Scholar] [CrossRef]
  26. Tankovich, V.; Hane, C.; Zhang, Y.; Kowdle, A.; Fanello, S.; Bouaziz, S. HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14362–14372. [Google Scholar] [CrossRef]
  27. Cheng, X.; Zhong, Y.; Harandi, M.; Dai, Y.; Chang, X.; Drummond, T.; Li, H.; Ge, Z. Hierarchical Neural Architecture Search for Deep Stereo Matching. Adv. Neural Inf. Process. Syst. 2020, 33, 22158–22169. [Google Scholar]
  28. Hirschmuller, H. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 2, pp. 807–814. [Google Scholar]
  29. Liu, B.; Yu, H.; Long, Y. Local Similarity Pattern and Cost Self-Reassembling for Deep Stereo Matching Networks. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1647–1655. [Google Scholar] [CrossRef]
  30. Xu, H.; Zhong, J. AANet: Adaptive Aggregation Network for Efficient Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1959–1968. [Google Scholar]
  31. Duggal, S.; Wang, S.; Ma, W.C.; Hu, R.; Urtasun, R. DeepPruner: Learning Efficient Stereo Matching via Differentiable PatchMatch. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4383–4392. [Google Scholar] [CrossRef]
  32. Xu, G.; Cheng, J.; Guo, P.; Yang, X. Attention Concatenation Volume for Accurate and Efficient Stereo Matching. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12971–12980. [Google Scholar] [CrossRef]
  33. Deng, Z.; Sun, Z. Binocular Camera Calibration for Underwater Stereo Matching. J. Phys. Conf. Ser. 2020, 1550, 032047. [Google Scholar] [CrossRef]
  34. Rao, Z.; Dai, Y.; Shen, Z.; He, R. Rethinking Training Strategy in Stereo Matching. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–14. [Google Scholar] [CrossRef] [PubMed]
  35. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional Feature Fusion. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021. [Google Scholar] [CrossRef]
  36. Nair, V.; Hinton, G. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  37. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  38. Teed, Z.; Deng, J. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 402–419. [Google Scholar] [CrossRef]
  39. Smith, L.N.; Topin, N. Super-convergence: Very fast training of neural networks using large learning rates. Artif. Intell. Mach. Learn. Multi-Domain Oper. Appl. 2019, 11006, 369–386. [Google Scholar] [CrossRef]
  40. ETH Zurich Computer Vision and Geometry Group. ETH Low-Res Two-View Results—ETH3D. Available online: https://www.eth3d.net/low_res_two_view (accessed on 1 April 2023).
  41. Xu, H.; Zhang, J.; Cai, J.; Rezatofighi, H.; Yu, F.; Tao, D.; Geiger, A. Unifying Flow, Stereo and Depth Estimation. arXiv 2022, arXiv:2211.05783. [Google Scholar]
  42. Zhao, H.; Zhou, H.; Zhang, Y.; Zhao, Y.; Yang, Y.; Ouyang, T. EAI-Stereo: Error Aware Iterative Network for Stereo Matching. In Proceedings of the Computer Vision—ACCV 2022, Macau, China, 4–8 December 2023; pp. 3–19. [Google Scholar] [CrossRef]
  43. Song, X.; Yang, G.; Zhu, X.; Zhou, H.; Wang, Z.; Shi, J. AdaStereo: A Simple and Efficient Approach for Adaptive Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10328–10337. [Google Scholar]
  44. Rao, Z.; He, M.; Dai, Y.; Zhu, Z.; Li, B.; He, R. NLCA-Net: A non-local context attention network for stereo matching. APSIPA Trans. Signal Inf. Process. 2020, 9, e18. [Google Scholar] [CrossRef]
  45. Zhang, F.; Qi, X.; Yang, R.; Prisacariu, V.; Wah, B.; Torr, P. Domain-Invariant Stereo Matching Networks. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 420–439. [Google Scholar] [CrossRef]
  46. Chen, C.; Ma, H.; Cheng, H. On the Over-Smoothing Problem of CNN Based Disparity Estimation. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8996–9004. [Google Scholar] [CrossRef]
Figure 1. Overall structure of our network. The rectified binocular image pair is first fed into the encoder to generate feature maps at different resolutions and initialize the inverse Patchmatch module (IPM) (blue rectangle), which is composed of a concatenated inverse propagation (IP) block (pink) and rectangular local search block (purple) with their respective GRUs (green) to predict the shift disparity. After random initialization of the disparity field using a gamma distribution, it is updated at a 1/16 resolution and a 1/4 resolution separately, and we obtain the final disparity prediction.
Figure 2. Inverse Propagation (IP) Block. After padding and shifting according to the seed points, the concatenated feature maps are warped by the current disparity flow and the correlation is computed. The process in the blue rectangle is performed in the initializing stage because no disparity flow is needed.
Figure 3. Rectangular Local Search Block. Right features are first warped by the current disparity flow and then the correlation within the rectangular search window is calculated to generate the shift disparity for the next iteration.
Figure 4. Encoder. The encoder is composed of a series of residual blocks and Attentional Feature Fusion modules concatenated to each other. In this figure, ⊗ and ⊕ represent element-wise matrix multiplication and broadcasting addition, respectively.
Figure 5. Sample image pairs. From left to right, they are image pairs from synthetic datasets in ref. [9], Sceneflow, ETH3D, low-light group in underwater dataset and turbid group in underwater dataset.
Figure 6. Data augmentation. From left to right, they are the original image pair, images after color transformation and images after spatial transformation.
Figure 7. Impact of different components on computation cost. The inference time of all experiments was measured on a single RTX2080Ti GPU. (a) Without RLS; (b) without IP; (c) full model.
Figure 8. Training process in stage 1. The validation set is a subset of 400 image pairs separated from the 200K pretraining dataset. (a) <1 px; (b) <3 px; (c) <5 px; (d) valid EPE.
Figure 9. Training process in stage 2. The validation set is a subset of 400 image pairs separated from the 200K pretraining dataset. All curves are smoothed (dark orange) to capture the overall trends. (a) <1 px; (b) <3 px; (c) <5 px; (d) valid EPE.
Figure 10. Samples from the ETH3D test set. (a) CRE-Stereo; (b) GM-Stereo; (c) EAI-Stereo; (d) RAFT-Stereo; (e) ACVNet; (f) HITNet; (g) Ada-Stereo; (h) DeepPruner; (i) HSM; (j) NLCA-Net; (k) left image; (l) ours.
Figure 11. Results of our network on the test set in underwater dataset. The first two image pairs are from the turbid group and the rest are from the low-light group. From top to bottom, they are the original left images, disparity maps generated from our network, depth maps transformed from our predicted disparity maps and depth maps transformed from ground truth disparity.
Figure 12. Experiments on the turbid group with different numbers of GRU iterations. From left to right, the number of iterations increases from 1 to 3.
Figure 13. Experiments on the low-light group with different numbers of GRU iterations. From left to right, the number of iterations increases from 1 to 3.
Table 1. Ablation on the key components of the network.
Experiment | >1 px | >2 px | >4 px | EPE | Parameters
Without AFF | 6.734 | 4.159 | 2.590 | 0.589 | 5.37 M
Without RLS | 8.454 | 5.061 | 3.150 | 0.724 | 3.06 M
Without IP | 7.061 | 4.308 | 2.655 | 0.609 | 3.67 M
Without constraint | 18.867 | - | - | 1.260 | 5.37 M
AFF—Attentional Feature Fusion, RLS—Rectangular Local Search, IP—Inverse Propagation.
Table 2. Ablation on the radius of vertical and horizontal search.
Experiment | >1 px | >2 px | >4 px | EPE
r_x = 1, r_y = 1 | 6.570 | 4.062 | 2.541 | 0.574
r_x = 2, r_y = 1 | 6.578 | 4.145 | 2.557 | 0.574
r_x = 3, r_y = 1 | 6.538 | 4.008 | 2.468 | 0.568
r_x = 4, r_y = 1 | 6.537 | 4.007 | 2.468 | 0.567
r_x = 2, r_y = 2 | 7.190 | 4.472 | 2.796 | 0.660
Table 3. Results compared to other baselines on ETH3D benchmark [40].
Method | >0.5 px | >1 px | >2 px | >4 px | A90 | A95 | A99 | EPE
CRE-Stereo [9] | 3.58 | 0.98 | 0.22 | 0.10 | 0.28 | 0.39 | 0.92 | 0.13
GM-Stereo [41] | 5.94 | 1.83 | 0.25 | 0.08 | 0.39 | 0.55 | 1.01 | 0.19
EAI-Stereo [42] | 5.21 | 2.31 | 1.14 | 0.70 | 0.34 | 0.65 | 2.35 | 0.21
RAFT-Stereo [8] | 7.04 | 2.44 | 0.44 | 0.15 | 0.39 | 0.57 | 1.20 | 0.18
HITNet [26] | 7.83 | 2.79 | 0.80 | 0.19 | 0.41 | 0.65 | 2.07 | 0.20
Ada-Stereo [43] | 10.22 | 3.09 | 0.65 | 0.20 | 0.50 | 0.70 | 1.45 | 0.24
DeepPruner [31] | 10.77 | 3.52 | 0.86 | 0.23 | 0.53 | 0.74 | 2.05 | 0.26
NLCA-Net [44] | 12.05 | 3.84 | 1.04 | 0.35 | 0.53 | 0.79 | 2.54 | 0.27
HSM [24] | 11.33 | 4.00 | 1.36 | 0.63 | 0.51 | 1.19 | 2.35 | 0.28
Ours | 6.75 | 2.06 | 0.64 | 0.34 | 0.37 | 0.56 | 2.33 | 0.21
>1 px is the default metric.
Table 4. Results compared with other baselines on the test set of the underwater dataset.
Method | >3 px | >5 px | >7 px | EPE | Time (s)
GWC-Net [23] | 53.89 | 47.36 | 40.20 | 7.14 | 0.55
DSM [45] | 51.13 | 40.20 | 33.01 | 5.67 | 4.81
CRE-Stereo * [9] | 51.02 | 39.68 | 1.25 | 5.32 | 7.52
RAFT-Stereo [8] | 51.02 | 39.68 | 1.25 | 5.32 | 7.52
Ours | 50.08 | 27.27 | 17.66 | 3.88 | 0.53
* Non-official code repository.
