YouTube video summary

Stanford Seminar - Living Scenes: Creating and updating 3D representations of evolving indoor scenes

Artificial intelligence

16 Nov 2024 · 18 min summary · From Stanford Online


Introduction and Motivation

The speaker is from the Civil and Environmental Engineering department, but their team consists mainly of roboticists and computer scientists working to solve problems in the building industry using computer vision 22s.
The team's work focuses on understanding what exists in a space, how it was constructed, and how it changes over time, with the goal of creating sustainable, inclusive, and adaptive built environments that prioritize human needs 46s.
The team is also exploring the intersection of physical and digital spaces, including immersive technologies like VR and AR, and designing physical spaces that allow interaction between the physical and virtual worlds, which they call "gradient realities" 1m12s.
The team's lab is called "Gradient Spaces" and is working on creating and updating digital replicas of evolving indoor scenes, which they call "living scenes" 1m31s.

Creating and Updating 3D Representations of Evolving Indoor Scenes

The concept of living scenes is based on the idea that buildings are like living organisms that evolve over time, and the team is working on developing methods to realistically make, maintain, and update their representations throughout their lifespan 1m53s.
Agents that exist in the environment can navigate, act, and interact with their surroundings, but need to be able to map and understand the environment to do so 2m4s.
The team is working on developing methods to align and merge spatial and temporal data collected by agents to create evolving representations of indoor environments 2m53s.
By acquiring data from agents performing repetitive tasks within a single scene, the team can create a cumulative scene understanding and representation that improves in geometric completeness and accuracy over time 3m20s.
This can enhance interaction with objects within the scene by having a more accurate geometry, and is particularly useful for scenes or parts of scenes that have not been seen before 3m51s.
Creating and updating 3D representations of evolving indoor scenes involves understanding how objects move within the scene and having a foundational understanding of the scene's geometry and semantics 4m7s.

Methods for Acquiring 3D Representations

Two methods for acquiring 3D representations at a single temporal point are LoopSplat and Adaptive Realtime 3D Semantic Understanding 4m41s.
LoopSplat uses the 3D Gaussian splat representation to reconstruct the scene with high accuracy, and performs loop detection and loop closure by registering 3D Gaussian splats together 4m47s.
LoopSplat reduces drift from the ground-truth trajectory once loop closure is detected; in the demonstration, the trajectory changes color at the moment a loop closure is detected 5m2s.
Adaptive Realtime 3D Semantic Understanding creates a single map with adaptive quality, meaning certain areas can be in high fidelity while others are in low resolution, depending on user-defined semantics or geometric complexity 5m31s.
This adaptive approach to 3D semantic understanding prioritizes sustainability by collecting only necessary data and reducing resolution in areas of less importance 5m38s.
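The idea of a single map with adaptive quality can be illustrated with a small sketch. This is not the speaker's method: it is a toy voxel downsampler in which the voxel size depends on a semantic label, and the label names, voxel sizes, and the function `adaptive_downsample` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical voxel sizes in metres per semantic label (smaller = higher fidelity).
VOXEL_SIZE_BY_LABEL = {"chair": 0.02, "wall": 0.20}
DEFAULT_VOXEL_SIZE = 0.10


def adaptive_downsample(points):
    """Average the points that fall in the same voxel; voxel size depends on the label.

    `points` is a list of (x, y, z, label) tuples.
    """
    buckets = defaultdict(list)
    for x, y, z, label in points:
        size = VOXEL_SIZE_BY_LABEL.get(label, DEFAULT_VOXEL_SIZE)
        key = (label, int(x // size), int(y // size), int(z // size))
        buckets[key].append((x, y, z))

    reduced = []
    for key, members in buckets.items():
        centroid = tuple(sum(axis) / len(members) for axis in zip(*members))
        reduced.append((*centroid, key[0]))
    return reduced


# The same four positions are sampled once as "chair" and once as "wall".
xs = [0.001, 0.031, 0.051, 0.071]
cloud = [(x, 0.001, 0.001, "chair") for x in xs] + [(x, 0.001, 0.001, "wall") for x in xs]
reduced = adaptive_downsample(cloud)
```

With these assumed sizes, the chair points stay separate while the wall points collapse into a single averaged point, which is the sense in which less important regions carry less data.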

Relocalizing and Reconstructing Objects in Evolving Environments

For evolving environments, the goal is to relocalize objects within the scene and reconstruct them on an individual object level given sparse 3D observations 6m48s.
A method for achieving this involves instance matching, registering different point clouds, and relocalizing objects within the scene 7m16s.
This method assumes instance segmentation and is robust to noise, as it has been tested with both ground truth and predicted instances 7m29s.
The method takes as input the scene at two temporal points where changes have occurred and aims to reconstruct the scene and relocalize objects 7m20s.
The method reconstructs the point clouds of objects over time, so that each object instance gains more complete and more accurate geometry 7m52s.
The representation is able to solve three tasks - matching, relocalization, and reconstruction - by utilizing different embedding spaces, specifically equivariant and invariant feature spaces 8m6s.
The model is trained only on synthetic data from the ShapeNet database and evaluated zero-shot on a real-world Noy data set, demonstrating its ability to reconstruct unseen parts of objects by understanding geometric priors 8m19s.
The model uses a Vector Neurons encoder to provide the equivariant and invariant feature spaces, and DeepSDF for shape completion; it is category-agnostic in design, although it was trained on seven categories from ShapeNet 8m42s.
The equivariant embedding space provides information about the pose of objects in the scene, while the invariant embedding space provides information about the actual geometry and shape of the objects 9m8s.
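The difference between the two feature spaces can be shown with hand-made descriptors. The sketch below does not use learned features or the encoder named in the talk; it only assumes two simple stand-ins: centroid-relative coordinates (which rotate with the object, so they carry pose) and sorted pairwise distances (which do not change under rotation, so they carry shape only).

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def centered(points):
    """Equivariant stand-in: coordinates relative to the centroid rotate with the object."""
    n = len(points)
    cx, cy, cz = (sum(axis) / n for axis in zip(*points))
    return [(x - cx, y - cy, z - cz) for x, y, z in points]


def pairwise_distances(points):
    """Invariant stand-in: sorted pairwise distances are unchanged by rotation."""
    return sorted(
        math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]
    )


obj = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
moved = rotate_z(obj, 0.7)  # the same object observed in a new pose
```

Comparing `pairwise_distances(obj)` with `pairwise_distances(moved)` gives matching values, which is what makes an invariant space suitable for matching instances, while `centered(moved)` differs from `centered(obj)` by exactly the applied rotation, which is what makes an equivariant space suitable for relocalization.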
Qualitative results on synthetic data sets show the model's ability to track objects and reconstruct their geometry and shape over time 9m28s.
The model is also evaluated on a real-world 3RScan data set, which includes multiple scans of the same scene over time, demonstrating its ability to handle temporal changes and reconstruct the geometry and completion of objects 10m1s.
Experiments show that accumulating point clouds from different viewpoints and temporal times improves the model's ability to reconstruct the geometry and completeness of objects and minimize registration error 10m51s.
The model's performance is compared to a baseline, demonstrating its ability to outperform it in terms of geometry and completion accuracy 10m27s.

Scene Graph-Based Representations and Alignment

A second approach to evolving scene representation uses scene graphs, work done by a student named Shiyen and submitted to ICCV 2023 11m27s.
A map can be created in a low-level manner using representations such as occupancy maps, voxel grids, octo maps, hash grids, point clouds, and others, but these methods have limitations, including decision-making taking place on the metric space, which can limit higher-level understanding and generalization 11m46s.
Building a map in a higher level can be achieved using 3D scene graphs, which allow for both high-level and low-level information, enabling decision-making on a more abstract space, and are lightweight and privacy-preserving 12m28s.
3D scene graphs are used by agents to build maps on the fly and to perform robotic navigation or task completion; many robotic agents already use and benefit from this representation 12m51s.
The goal is to leverage the information that agents are already building in the background to create 3D maps of environments, which can be static or changed scenes with overlaps between zero to partial to full 13m22s.
SGAligner is a method that takes scene graphs as input and performs node matching to identify how two graphs align, providing a good initialization for tasks such as point cloud registration or point cloud mosaicking 13m45s.
Existing methods for point cloud registration have limitations, including focusing on local feature descriptors, which can lead to issues with changes in the scene, low overlap, point cloud density, and large scenes 14m29s.
SGAligner formulates the alignment of scene graphs as the alignment of multimodal knowledge graphs, which contain three types of information 15m7s.
Scene graphs represent semantic entities in a scene, including object instances with attributes such as category, size, and material, as well as relationships between entities like relative position or attribute similarity 15m19s.
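A scene graph of the kind described above can be sketched as a small data structure. The class and field names below (`ObjectNode`, `Relationship`, `SceneGraph`) are hypothetical and do not reflect the file layout of any particular data set; the sketch only shows object instances with attributes, relationships between them, and serialization to JSON.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ObjectNode:
    node_id: int
    category: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Relationship:
    subject_id: int
    object_id: int
    predicate: str


@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def neighbors(self, node_id):
        """Ids of all nodes that share at least one relationship with `node_id`."""
        return sorted(
            {e.object_id for e in self.edges if e.subject_id == node_id}
            | {e.subject_id for e in self.edges if e.object_id == node_id}
        )

    def to_json(self):
        """Serialize the whole graph; dataclasses are converted to plain dicts first."""
        return json.dumps(asdict(self), sort_keys=True)


graph = SceneGraph(
    nodes=[
        ObjectNode(0, "table", {"material": "wood", "size": "large"}),
        ObjectNode(1, "chair", {"material": "wood", "size": "small"}),
        ObjectNode(2, "lamp", {"material": "metal"}),
    ],
    edges=[
        Relationship(1, 0, "left of"),           # relative position
        Relationship(2, 0, "standing on"),       # relative position
        Relationship(1, 0, "same material as"),  # attribute similarity
    ],
)
```

Because the graph stores categories, attributes, and relationships rather than raw geometry, it stays small, which is consistent with the lightweight and privacy-preserving properties mentioned earlier.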
Entity alignment methods from the multimodal Knowledge Graph domain can be redesigned to align spatial maps together, but existing works assume overlapping and accurate information, which is not the case with 3D graphs built by agents 15m51s.
SGAligner takes 3D scene graphs as input and uses unimodal encoders, including point cloud encoders, structure encoders, and meta encoders, to encode each modality separately before the embeddings interact in a joint space 16m39s.
The goal of SGAligner is to pull embeddings of the same object instance closer together and push different object instances further apart, enabling tasks such as point cloud registration 17m12s.
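Once nodes live in a joint embedding space, matching can be as simple as a nearest-neighbour search. The sketch below is an assumption-laden illustration, not the published matching procedure: the embeddings are hand-written three-dimensional vectors standing in for learned ones, and the mutual nearest-neighbour rule with a `min_similarity` threshold is one simple choice among many.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def match_nodes(emb_a, emb_b, min_similarity=0.9):
    """Return mutual nearest-neighbour pairs (id_a, id_b) above a similarity threshold."""
    def best(query, candidates):
        return max(candidates, key=lambda k: cosine(query, candidates[k]))

    matches = []
    for id_a, vec_a in emb_a.items():
        id_b = best(vec_a, emb_b)
        is_mutual = best(emb_b[id_b], emb_a) == id_a
        if is_mutual and cosine(vec_a, emb_b[id_b]) >= min_similarity:
            matches.append((id_a, id_b))
    return matches


# Hypothetical node embeddings for two partially overlapping graphs.
graph_a = {"chair_a": (1.0, 0.1, 0.0), "table_a": (0.0, 1.0, 0.1), "sofa_a": (0.5, 0.5, 0.7)}
graph_b = {"chair_b": (0.9, 0.2, 0.0), "table_b": (0.1, 1.0, 0.0), "plant_b": (0.0, 0.0, -1.0)}
matches = match_nodes(graph_a, graph_b)
```

Nodes without a counterpart in the other graph, such as the sofa and the plant here, are left unmatched, which is the behaviour needed when two scans overlap only partially.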
The method was evaluated on the 3RScan data set and its extension, 3DSSG, and achieved robust performance in matching nodes together, even with predicted scene graphs and low overlap conditions 17m42s.
SGAligner also performed well in aligning 3D scene graphs with temporal changes and varying overlap conditions, outperforming standard 3D point cloud registration methods 18m11s.
The method can match nodes together and figure out whether they align together, even with a large number of nodes, and can handle scenarios with 50% or all nodes matched 18m47s.
The performance of the system is improved when at least two nodes are matched for the particular graphs, and this information is used to perform 3D point cloud registration 18m56s.
In contrast to previous approaches, the system registers object instances instead of entire scenes by calculating the registration between object instances that are matched with each other 19m34s.
This approach allows for a more robust and faster alignment of the entire scene, resulting in a 49% improvement in chamfer distance and a 40% improvement in relative translation error 20m4s.
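For readers unfamiliar with the two metrics, a minimal sketch follows. Definitions vary across papers, so the choices here are assumptions: the Chamfer distance is taken as the mean nearest-neighbour distance summed over both directions, and the translation error as the Euclidean distance between an estimated and a reference translation vector. The brute-force search is only suitable for tiny point sets.

```python
import math


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, summed over both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


def translation_error(t_estimated, t_reference):
    """Euclidean distance between an estimated and a reference translation vector."""
    return math.dist(t_estimated, t_reference)


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # the same two points, misaligned by 1 m in z
```

A perfectly aligned pair of clouds has a Chamfer distance of zero, and the value grows as the alignment degrades, so a lower number after registration indicates a better result.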
The system also performs well with noisy point cloud predictions and can handle cases with overlap as low as 10 to 30% 20m17s.
The system can identify overlapping pairs of point clouds more correctly and three times faster than prior art, which is useful for robotics platforms 21m21s.
The system can also handle scenes with zero overlap, where previous methods are not able to perform robustly 21m2s.
The system's geometry-based and semantic-based alignments are designed for scenes that evolve with minimum changes in geometry, such as furniture being relocated or added/removed 21m42s.

Spatiotemporal 3D Point Cloud Registration for Large-Scale Changes

However, the system can also handle more drastic changes in the scene, such as those that occur over time, through spatiotemporal 3D point cloud registration 22m9s.
This approach involves finding pairwise correspondences from the static parts of the scene and excluding temporal changes 22m13s.
The system is tested on datasets with small-scale scenes, such as rooms, and captures standard daily human interaction activities 22m28s.
A student, Tan, has performed work on spatiotemporal 3D point cloud registration, which is an extension of the original system 22m3s.
Existing methods for 3D Point Cloud temporal registration can only handle small changes in the geometry of a scene, such as those found in self-driving car scenarios, but struggle with large changes, like those found in construction sites 22m52s.
Construction sites are a particular scenario where large, drastic changes in the world occur in a small amount of time, making them a challenging environment for existing methods 23m46s.
The "Nothing Stands Still" benchmark was created to evaluate the performance of existing spatiotemporal point cloud registration methods on scenes with large changes, such as construction sites 24m7s.
The benchmark dataset was collected from different construction sites over time, using a tripod-based device to capture the scene at multiple temporal points 24m26s.
The dataset includes interior layout construction scenes, with a focus on the slabs, ceilings, walls, and empty spaces, but excludes exterior elements like foundations and excavators 25m36s.
The dataset is challenging due to the large changes in the scene, inconsistent capture over time, and inaccessible areas, making it difficult to align the Point Clouds temporally 25m5s.
The benchmark evaluates both pairwise and multi-way registration of the scene, allowing for the assessment of existing methods' performance on very large scenes 25m21s.
The dataset includes snapshots of the interior layout construction scenes, showcasing the drastic changes that occur over time, from empty spaces to the addition of walls, insulation, pipes, air ducts, and materials 25m54s.
Indoor scenes have repetitive elements, such as studs in walls, which can make registration algorithms struggle to match corresponding elements due to their similar appearance 26m19s.
The environment is also very uniform, with most elements being gray or brown, making it difficult to perform tasks in a robust manner 26m46s.
The data set shows the interior layout being constructed, with changes over time, including the addition of static furniture 27m2s.
Explorations of the meshes in virtual reality demonstrate how spaces change over time and the complexity of the scenes 27m20s.
Pairwise registration methods are typically used, taking two small point clouds as input, performing correspondence estimation, and then using RANSAC and ICP to determine the final transformation 27m36s.
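The RANSAC stage of such a pipeline can be sketched in a few lines. This is a simplified illustration under several assumptions: it works in 2D rather than 3D, it takes the correspondences as given (pair i in `src` matches pair i in `dst`), it omits the ICP refinement, and the function names and thresholds are hypothetical. Correspondences on objects that moved between scans act as outliers, so the transform is estimated from the static parts of the scene only.

```python
import math
import random


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst (2D)."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    cross = dot = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay, bx, by = px - sx, py - sy, qx - dx, qy - dy
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, (dx - (c * sx - s * sy), dy - (s * sx + c * sy))


def apply_rigid_2d(theta, t, p):
    """Rotate point p by theta, then translate by t."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + t[0], s * p[0] + c * p[1] + t[1])


def ransac_rigid_2d(src, dst, threshold=0.05, iterations=200, seed=0):
    """Fit on random pairs, keep the largest inlier set, then refit on those inliers."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        i, j = rng.sample(range(len(src)), 2)
        theta, t = fit_rigid_2d([src[i], src[j]], [dst[i], dst[j]])
        inliers = [k for k in range(len(src))
                   if math.dist(apply_rigid_2d(theta, t, src[k]), dst[k]) < threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    theta, t = fit_rigid_2d([src[k] for k in best_inliers], [dst[k] for k in best_inliers])
    return theta, t, best_inliers


static = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0), (1.0, 3.0), (3.0, 2.0)]
relocated = [(1.0, 1.0), (2.5, 2.5)]  # e.g. furniture that moved between the two scans
src = static + relocated
dst = [apply_rigid_2d(0.3, (1.0, -0.5), p) for p in static] + [(6.0, 6.0), (-4.0, 2.0)]
theta, t, inliers = ransac_rigid_2d(src, dst)
```

In this toy setup the recovered inlier set contains only the static points, and the refit on those inliers recovers the rotation and translation used to generate `dst`. When most of the scene changes, as on a construction site, the inlier ratio drops and this strategy becomes much less reliable.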
Multi-way registration is also used, connecting point clouds over space and time with edges, and minimizing the weighted RMSD of the poses of the pose graph 28m1s.
Most existing methods struggle to handle multi-way registration, but a recently developed method has shown better performance 28m23s.
Before and after multi-way registration, the best-performing algorithm at the time shows improvement, but still with some failures 28m36s.
The colors in the visualization represent different temporal points, with each color indicating a different point in time 28m52s.
Better algorithms are needed to perform tasks involving large, drastic changes in the environment, and solving this problem could lead to solutions for other related tasks 29m1s.
The assumption is that if this hard problem can be solved, other related problems can also be solved, and more people should work on this using the provided data set 29m10s.

Applications in the Building Industry and Circular Economy

The goal is to create and update representations of evolving indoor scenes, and this technology has potential applications in various fields, including civil and environmental engineering 29m19s.
The building industry is a significant area of focus, with renovation being a major scope in architecture, engineering, and construction 29m46s.
The construction industry aims to increase sustainability and create a circular built environment by reusing materials from demolished buildings in new designs, extending the life of existing buildings, and utilizing existing resources without depleting new ones 29m51s.
Most buildings lack digital information, as computer-aided design software became widespread only after the 1980s, leaving billions of buildings on Earth without digital representations 30m21s.
The construction industry is extremely expensive, with 50% of construction costs increased due to rework, and poor estimations of costs due to a lack of knowledge about the building process 30m55s.
Construction workers accounted for 20% of all occupation fatalities in 2020 in the US, with around a thousand people killed in construction sites due to errors 31m28s.
% of non-hazardous construction and demolition material is either reusable or recyclable, but often ends up in landfills 31m44s.
Understanding spatial and temporal information can have a significant impact on human life and planet sustainability, motivating researchers to work on these problems 32m1s.
The model can be applied to improve the circular economy and reduce construction costs by capturing information about existing materials in buildings, allowing for better planning and harvesting of materials during demolition or new building design 32m50s.
Potential application scenarios include taking down old buildings, capturing information about existing materials, and planning ahead to harvest materials from demolished buildings for use in new designs 32m39s.
The goal of circularity is to plan ahead and know what materials will be available when creating a new building, allowing for the harvesting of materials from demolished buildings and designing with those materials in mind 33m7s.
To understand evolving indoor scenes, it's essential to consider not only new construction but also the deterioration of materials and their current condition, especially in areas with no new construction, and to build and update a map of the space 33m56s.
For these operations, both geometric and semantic information are necessary, which involves understanding where things are and what they are, enabling better planning for sustainable building construction 34m27s.

Knowledge Graphs and Scene Comparison

Knowledge graphs are used to represent relationships between entities, with nodes representing objects categorized using object categories, and relationships including semantic and geometric connections 35m5s.
These relationships can be relative, such as "in front of" or "to the left of," and are typically taken from a particular viewpoint, including object instances, attributes like size and material, and relative relationships within the space 35m39s.
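The viewpoint dependence of such relationships can be made concrete with a small sketch. Everything here is an assumption for illustration: objects are reduced to 2D floor-plane positions, the viewer is described by a position and a forward direction, and the function `relative_relations` and the predicate strings are hypothetical.

```python
def relative_relations(a, b, viewer, forward):
    """Describe where object `a` sits relative to object `b` as seen by a viewer.

    All arguments are 2D floor-plane tuples; `forward` is the viewing direction.
    """
    fx, fy = forward
    rx, ry = fy, -fx  # the viewer's right-hand direction

    def in_view_frame(p):
        dx, dy = p[0] - viewer[0], p[1] - viewer[1]
        return dx * rx + dy * ry, dx * fx + dy * fy  # (rightward offset, depth)

    (a_right, a_depth), (b_right, b_depth) = in_view_frame(a), in_view_frame(b)
    relations = []
    if a_right != b_right:
        relations.append("left of" if a_right < b_right else "right of")
    if a_depth != b_depth:
        relations.append("in front of" if a_depth < b_depth else "behind")
    return relations


chair, table = (-1.0, 2.0), (1.0, 3.0)
from_door = relative_relations(chair, table, viewer=(0.0, 0.0), forward=(0.0, 1.0))
from_window = relative_relations(chair, table, viewer=(0.0, 5.0), forward=(0.0, -1.0))
```

The same pair of objects yields opposite predicates from the two viewpoints, which is why a scene graph needs to record the viewpoint from which relative relationships were derived.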
Comparing point clouds or 3D scene graphs to existing CAD or BIM models is challenging due to issues with completeness, scale, and level of detail, making geometric alignment difficult and often inaccurate 36m32s.
While 3D scene graphs might be more robust for alignment, current methods are insufficient, and researchers are working on addressing these challenges 36m14s.
The method can be used to compare a CAD model to a point cloud, allowing for the comparison of a built environment to its designed state, but this is more challenging in construction settings where detailed building information models or CAD models are not always available 37m15s.
The method does not infer any physics from the scene, such as whether an object is brittle or will break if it falls, and assumes that any changes to the scene have already occurred and will not evolve further 37m54s.
The scene graphs created by the method are in JSON format and could potentially be loaded into Unity or other simulation software, but this has not been attempted 38m40s.
The method has not been used for simulations or virtual reality applications, but it could potentially be used for these purposes 39m14s.

Data Sets and Future Work

The data sets used to test the method are existing data sets that were not produced by the researchers, and are available on GitHub 39m36s.
The researchers are working on tools to automate the annotation of 3D geometry in construction sites, but this work is still in its early stages 40m6s.
The method could potentially be integrated with Internet of Things (IoT) devices, such as sensors that store information about the materials and condition of objects in the environment, but this is not currently being explored 40m25s.
The deconstruction process can be enhanced by using living scenes to understand the materials present in a building and what can be harvested, such as the size of panels, windows, and ducts, providing an initial estimate and hypothesis for potential reuse 41m11s.
Connections between materials, such as glues and nails, play a significant role in deconstruction, but this information is often not captured in the data, making it harder to have an accurate estimate of what can be reused 41m47s.
The deterioration of materials behind what is visible is also a challenge, as it is unknown what is happening behind the wall 50 or 100 years later, making assumptions necessary 42m32s.
Workers can use the information from the models to detect which parts can be reused and how to cut them, but an on-the-spot survey is still necessary for accurate information 41m32s.
The approach does not eliminate the need for on-site surveys, but it helps narrow down the best options for new designs by matching the existing building's characteristics 43m27s.
For construction progress monitoring and deconstruction, even a basic sensor would be sufficient, as there is currently a lack of data, but laser scanners are also useful, although they need to be fast and efficient to collect data on a large scale 44m29s.
Current methods for capturing 3D representations of indoor scenes, such as using a laser scanner, can take a long time to complete one rotation, making them not scalable for large areas 44m54s.
Backpack systems are faster and similarly accurate but may have issues with potential drift, and they capture a large number of points that can be hard to process 45m0s.
A method like Adaptive Reconstruction can be helpful for construction progress monitoring as it focuses on the newly installed elements and their correct position and time of installation 45m20s.
Imagery can also be used to capture the installation of elements, but it may not provide perfect point clouds, and sizes could be off; however, it can still provide valuable information on what has been installed and when 45m41s.
Laser scanners with adaptive reconstruction capabilities and the ability to perform fast iterative processing, combined with images, can provide the best solution for capturing 3D representations of indoor scenes 46m1s.
Images are necessary to understand materials, as point clouds cannot provide this information, and high-frequency information from images is needed to characterize materials 46m19s.
Material characterization can be done using visual information from images, but it may not always be accurate, and additional documentation or information on the planned installation of elements can help make this decision easier 47m15s.
The model can also work in non-confined spaces, such as outdoor areas like squares, roads, or parks, where there is no clear ending edge around the whole space 48m25s.
A model can be used to create a 3D representation of an urban area, such as a city, to determine potential locations for solar panels, for example on rooftops, rather than on the ground 48m43s.
Aerial scanning or imagery, such as that provided by Google Street Maps, which has 3D information in certain cities, can be helpful in this task 49m4s.
The Living Scenes model works by doing instance matching, and while it has not been tried outdoors, the scene graph work, which operates in the semantic space, would likely generalize more easily to outdoor environments 49m20s.
The Living Scenes model is restricted to being trained on certain categories, which would need to be expanded to accommodate outdoor environments, and one potential workaround is to consider open scene understanding 49m32s.
Open scene understanding is a possible extension of the Living Scenes model, but it has not yet been explored 49m41s.
