Lang3DSG
Language-based contrastive pre-training for
3D scene graph prediction
Our method takes as input a class-agnostic segmented point cloud and
extracts point sets for individual objects and for pairs of objects (a). The point sets are passed through a PointNet backbone to construct an initial feature graph
(b). A GCN refines the features in the graph (c), and node, edge, and node-edge-node triplet features are projected into the language
feature space (d). Using a contrastive loss, we align the 3D graph features with the CLIP embeddings of the scene description (e).
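The alignment step in (e) can be illustrated with a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss. This is not the authors' implementation; the function name, the numpy formulation, and the temperature value are illustrative assumptions. It assumes a batch where row i of the projected 3D graph features is matched with row i of the CLIP text embeddings:

```python
import numpy as np

def clip_style_contrastive_loss(graph_feats, text_feats, temperature=0.07):
    """Illustrative symmetric InfoNCE loss (CLIP-style), not the paper's code.

    graph_feats, text_feats: (N, D) arrays; row i of each forms a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    g = graph_feats / np.linalg.norm(graph_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = g @ t.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(g))           # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the graph-to-text and text-to-graph directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each 3D graph feature toward the CLIP embedding of its matching description while pushing it away from the other descriptions in the batch.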
Qualitative results of 3D scene graph prediction with Lang3DSG for
three different example scenes. We visualize the top-1 object class prediction for each node and the predicates with a probability greater
than 0.5 for each edge. Ground truth labels are shown in square brackets.
@inproceedings{koch2024lang3dsg,
title={Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph prediction},
author={Koch, Sebastian and Hermosilla, Pedro and Vaskevicius, Narunas and Colosi, Mirco and Ropinski, Timo},
booktitle={2024 International Conference on 3D Vision (3DV)},
year={2024},
}