Abstract: Vision-Language models like CLIP have been widely adopted for various tasks due to their impressive zero-shot capabilities. However, CLIP is not suitable for extracting 3D geometric features ...