AI Summary • Published on Dec 3, 2025
Accurately perceiving transparent objects is a long-standing challenge in computer vision: conventional depth sensors, such as time-of-flight cameras, struggle to measure their depth because light refracts through and reflects off their surfaces, yielding incomplete or erroneous depth maps. Supervised learning methods can complete these depth maps, but they demand large amounts of painstakingly labeled data, which is expensive and impractical to acquire for transparent objects. Existing self-supervised approaches to general depth completion typically simulate global depth deficits, which fail to reflect the localized, object-specific patterns of missing depth characteristic of transparent objects.
The authors propose a novel self-supervised learning method for transparent object depth completion that simulates transparent-style depth deficits within non-transparent (opaque) object regions. The process begins by using the Segment Anything Model (SAM) to generate segmentation masks for both transparent and non-transparent objects in an RGB image. These masks are then applied to the raw depth map to create masked depth inputs. Because the depth loss around transparent objects often retains some edge information, the non-transparent object masks undergo morphological erosion (shrinking) before masking, so that depth near object boundaries survives. During self-supervised training, transparent object regions are completely obscured in both the RGB and depth images to prevent them from influencing the loss. The masked RGB and depth inputs are fed into the TDCNet model, while the original raw depth map (excluding transparent object regions) serves as the ground truth. A custom loss (denoted L1 in the paper), combining an RMSE term with a cosine-similarity term on surface normals, is computed only over non-transparent regions. For scenarios with limited labeled data, the model is first pre-trained with this self-supervised objective and then fine-tuned with a standard supervised loss (the paper's L2 and L_Supervised terms) that prioritizes depth recovery in transparent regions.
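The masking strategy can be summarized in a short sketch. The code below is an illustrative reconstruction, not the paper's implementation: the function name, the erosion kernel size, and the iteration count are assumptions, and the masks are assumed to be boolean HxW arrays produced by SAM.

```python
import numpy as np
import cv2

def build_masked_inputs(rgb, raw_depth, transparent_masks, opaque_masks,
                        kernel_size=5, erosion_iters=2):
    """Simulate transparent-style depth deficits on opaque objects.

    Hypothetical sketch: names and parameters are illustrative, not the
    paper's API. Masks are boolean HxW arrays, e.g. from SAM.
    """
    # Union of all transparent-object masks; these pixels are fully
    # obscured in both RGB and depth so they cannot influence training.
    transparent = np.any(transparent_masks, axis=0)

    # Erode each opaque mask before masking so a thin rim of edge depth
    # survives, mimicking how transparent objects often keep edge depth.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    deficits = np.zeros_like(transparent)
    for m in opaque_masks:
        eroded = cv2.erode(m.astype(np.uint8), kernel,
                           iterations=erosion_iters).astype(bool)
        deficits |= eroded

    masked_rgb = rgb.copy()
    masked_rgb[transparent] = 0                   # hide transparent pixels
    masked_depth = raw_depth.copy()
    masked_depth[transparent | deficits] = 0.0    # simulated depth deficits

    # Supervision target: raw depth, valid only outside transparent
    # regions and where the sensor returned a measurement.
    valid = ~transparent & (raw_depth > 0)
    return masked_rgb, masked_depth, valid
```

The self-supervised loss then pairs an RMSE term with a normal-vector cosine term over those valid pixels; again, a minimal sketch, assuming precomputed 3xHxW normal maps and an illustrative weight `lambda_n`:

```python
import torch
import torch.nn.functional as F

def self_supervised_loss(pred, gt, valid, n_pred, n_gt, lambda_n=0.1):
    """Sketch of the self-supervised loss: RMSE plus (1 - cos) on surface
    normals, over non-transparent valid pixels only. `lambda_n` is an
    assumed weighting, not a value taken from the paper."""
    rmse = torch.sqrt(F.mse_loss(pred[valid], gt[valid]))
    cos = F.cosine_similarity(n_pred, n_gt, dim=0)   # HxW cosine map
    return rmse + lambda_n * (1.0 - cos)[valid].mean()
```

Restricting the loss to opaque regions keeps the objective fully self-supervised: the network never sees ground-truth depth for transparent pixels, which is exactly what is missing at test time.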
Experiments were conducted on the TransCG dataset, a large-scale real-world dataset containing both transparent and non-transparent objects, using the standard metrics RMSE, REL, MAE, and threshold δ. In a fully self-supervised setting, the proposed method reached approximately 70% of the performance of existing supervised methods and outperformed both some non-end-to-end approaches and a global random-masking MAE baseline in reducing depth errors. For supervised fine-tuning with limited data (5% and 10% of the training set), pre-training with the proposed self-supervised method significantly boosted performance compared to training from scratch or MAE-based pre-training, with the gain growing as the amount of supervised data shrank. An ablation study further confirmed that the morphological erosion step in the masking strategy is important for simulating realistic depth-loss patterns.
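For reference, these metrics have standard definitions in depth-completion evaluation. The sketch below uses the common formulations; the δ cutoffs 1.05/1.10/1.25 are the values typically reported on TransCG, assumed here rather than taken from this paper.

```python
import numpy as np

def depth_metrics(pred, gt, mask, deltas=(1.05, 1.10, 1.25)):
    """Standard depth-completion metrics over an evaluation mask
    (e.g. transparent-object pixels with valid ground truth)."""
    p, g = pred[mask], gt[mask]
    err = p - g
    rmse = np.sqrt(np.mean(err ** 2))       # root mean squared error
    rel = np.mean(np.abs(err) / g)          # mean absolute relative error
    mae = np.mean(np.abs(err))              # mean absolute error
    ratio = np.maximum(p / g, g / p)        # per-pixel max ratio
    delta = {d: float(np.mean(ratio < d)) for d in deltas}
    return {"rmse": rmse, "rel": rel, "mae": mae, "delta": delta}
```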
This work is a significant step toward transparent object depth completion that does not rely on costly, manually labeled datasets. The proposed self-supervised method offers a viable alternative to purely supervised approaches, achieving competitive performance in the fully self-supervised setting and providing substantial benefits when combined with limited supervised fine-tuning; the novel masking strategy, which simulates localized depth deficits, is key to its effectiveness. However, the method currently requires datasets containing both transparent and non-transparent objects, and the simulated deficits are imperfect, since real transparent-object depth-loss patterns can differ from the designed masking strategy. Future work could refine the masking to better reflect diverse transparent-object depth anomalies and to mitigate visual disparities between object types.