HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling
arxiv_cs_lg · Apr 1, 2026, 01:33 PM


Summary

HCLSM is a world-model architecture for object-centric world modeling that addresses the limitations of flat latent representations. It rests on three core principles: object-centric decomposition via slot attention with spatial broadcast decoding; hierarchical temporal dynamics that combine selective state space models, sparse transformers, and compressed transformers; and causal structure learning via graph neural networks. A two-stage training protocol ensures slots specialize before dynamics prediction begins. Trained on the PushT robotic manipulation benchmark, the 68M-parameter model achieves high next-state prediction accuracy, and a custom Triton kernel accelerates the SSM scan by 38x, demonstrating practical optimization. This work is a step toward more capable and efficient world models for embodied AI.
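The object-centric decomposition described above relies on slot attention, in which a fixed set of slots competes for input features over several refinement iterations. The sketch below is a minimal NumPy illustration of that competitive mechanism only; the learned query/key/value projections and the GRU slot update of the full method are deliberately simplified away, and all shapes and names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Minimal slot-attention sketch (NumPy, no learned parameters).

    inputs: (N, dim) set of encoder features.
    Returns (slots, attn) with slots (num_slots, dim), attn (num_slots, N).
    """
    n, dim = inputs.shape
    rng = np.random.default_rng(seed)
    slots = rng.normal(size=(num_slots, dim))  # random slot initialization
    k = v = inputs  # identity projections in this sketch
    for _ in range(iters):
        # Dot-product attention logits between slots (queries) and inputs (keys)
        attn_logits = slots @ k.T / np.sqrt(dim)        # (num_slots, N)
        # Softmax over the *slot* axis: slots compete for each input feature
        attn = np.exp(attn_logits - attn_logits.max(axis=0, keepdims=True))
        attn = attn / attn.sum(axis=0, keepdims=True)
        # Weighted mean of inputs per slot (normalize over the input axis)
        w = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = w @ v  # direct assignment stands in for the GRU update
    return slots, attn
```

Because the softmax runs over slots rather than inputs, each input feature distributes its "vote" across slots, which is what drives slots to specialize on distinct objects.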

Technical Impact

HCLSM offers a substantial technical advance for development stacks in embodied AI, robotics, and simulation.

- **Advanced architecture**: It integrates slot attention, hierarchical SSMs/transformers, and GNNs into a single model, providing a blueprint for world models that disentangle objects, learn causal relationships, and capture dynamics across multiple timescales. This moves beyond simpler, monolithic latent representations.
- **Performance optimization**: The custom Triton kernel for the SSM scan shows how low-level optimization can dramatically improve the efficiency of complex models. This is crucial for deploying world models in real-time or resource-constrained environments, and suggests how future high-performance AI components might be engineered.
- **Training methodology**: The two-stage protocol (spatial reconstruction, then dynamics prediction) offers a more robust and potentially data-efficient way to train complex world models, ensuring foundational representations are learned before dynamic interactions.
- **Impact on development stacks**: Implemented in Python with PyTorch, HCLSM integrates readily into existing ML pipelines. However, the custom Triton kernel ties it to specific hardware acceleration (e.g., NVIDIA GPUs), which must be considered for deployment.

This research provides a powerful new paradigm for designing world models, influencing how developers build agents that predict and interact with complex physical environments, and potentially leading to more robust and generalizable robotic control and planning systems.
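To make the performance-optimization point concrete: a selective SSM applies a linear recurrence whose gates vary per time step, and the sequential reference loop below is exactly the computation a fused Triton kernel would parallelize on the GPU. This is a diagonal-state NumPy sketch under assumed shapes, not the paper's kernel or its parameterization.

```python
import numpy as np

def selective_ssm_scan(x, A, B, C):
    """Reference (sequential) scan for a diagonal selective SSM.

    x: (T, D) inputs; A, B: (T, D) input-dependent gates; C: (T, D) readout.
    Recurrence: h_t = A_t * h_{t-1} + B_t * x_t,  y_t = C_t * h_t.
    A fused GPU kernel replaces this Python loop with a parallel scan,
    which is where the reported 38x speedup would come from.
    """
    T, D = x.shape
    h = np.zeros(D)
    ys = np.empty((T, D))
    for t in range(T):
        h = A[t] * h + B[t] * x[t]  # gated state update
        ys[t] = C[t] * h            # per-step readout
    return ys
```

With A = 1, B = 1, C = 1 the scan reduces to a cumulative sum, a handy sanity check when validating a custom kernel against the reference loop.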

HCLSM · Slot Attention · Spatial Broadcast Decoding · Selective State Space Models · Sparse Transformers · Compressed Transformers · Graph Neural Networks · Triton · PyTorch · PushT · Open X-Embodiment
