Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
Given a point cloud and RGB-D images with their poses, we distill the knowledge of two vision-language
models into our GNN. The nodes are supervised by the embedding of OpenSeg and the edges are supervised by the embedding of the
InstructBLIP vision encoder. At inference time, we first compute the cosine similarity between object queries encoded by CLIP
and our distilled 3D node features to infer the object classes. Then we use the edge embedding as well as the inferred object classes to
predict relationships for pairs of objects using the Q-Former and LLM from InstructBLIP.
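The object-class inference step can be illustrated with a short sketch. This is not the authors' code: it assumes the distilled 3D node features share an embedding space with a CLIP text encoder (ViT-L/14@336px is an assumption here), and the names `node_features`, `classify_nodes`, and the prompt template are purely illustrative.

```python
# Minimal sketch: open-vocabulary object classification by comparing distilled
# 3D node features against CLIP text embeddings of label queries.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14@336px", device=device)  # text encoder assumed to match the 2D teacher

def classify_nodes(node_features: torch.Tensor, queries: list) -> list:
    """Assign each 3D node (row of `node_features`) the query with the highest cosine similarity."""
    tokens = clip.tokenize([f"a {q} in a scene" for q in queries]).to(device)  # prompt template is an assumption
    with torch.no_grad():
        text_feats = model.encode_text(tokens).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    node_feats = node_features / node_features.norm(dim=-1, keepdim=True)
    sim = node_feats @ text_feats.T               # [num_nodes, num_queries] cosine similarities
    return [queries[i] for i in sim.argmax(dim=-1).tolist()]
```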
For each instance in the 3D point cloud, we select the top-k frames for object and predicate
supervision. For objects, we encode the frames using OpenSeg and aggregate the computed features over the projected points. For
predicates, we identify object pairs in the frame, crop the image at multiple scales, and compute the image features with the BLIP image
encoder. The features are aggregated over all crops. Finally, both object and predicate features are fused across the multiple views.
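A rough sketch of the object-feature aggregation described above, under the assumptions that per-pixel OpenSeg features are precomputed for each selected frame and that a hypothetical `project` helper maps instance points to pixel coordinates with a validity mask; mean pooling over pixels and views is likewise an assumption, not necessarily the exact fusion used in the paper. Predicate features would be pooled analogously, but over BLIP image-encoder features of the multi-scale crops around each object pair.

```python
# Minimal sketch of multi-view 2D-to-3D feature aggregation for one instance.
import torch

def aggregate_object_feature(points: torch.Tensor, frames: list, project) -> torch.Tensor:
    """Pool per-pixel 2D features over an instance's projected points, then fuse across the top-k views."""
    per_view = []
    for frame in frames:                                    # top-k frames selected for this instance
        uv, valid = project(points, frame["pose"], frame["intrinsics"])
        if valid.sum() == 0:                                # instance not visible in this frame
            continue
        uv = uv.long()
        feats = frame["openseg_features"]                   # [H, W, D] per-pixel embeddings
        picked = feats[uv[valid, 1], uv[valid, 0]]          # gather features at the projected pixels
        per_view.append(picked.mean(dim=0))                 # average over the instance's pixels
    return torch.stack(per_view).mean(dim=0)                # fuse across views (mean pooling assumed)
```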
We show the top-1 predictions of Open3DSG on ScanNet. The nodes are queried using the 160-class 3DSSG label set, while the edges are generated directly from the graph-conditioned LLM.
Qualitative examples: object retrieval using a relationship description (left) and a 3D scene graph with open-vocabulary attributes (right).
Reasoning over inter-object affordances by LLM prompting
@inproceedings{koch2024open3dsg,
  title={Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships},
  author={Koch, Sebastian and Vaskevicius, Narunas and Colosi, Mirco and Hermosilla, Pedro and Ropinski, Timo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month={June},
  year={2024},
}