Behavior-Aware Anthropometric Scene Generation for Human-Usable 3D Layouts

04/20/2026

1. Introduction
  - Physical environments fundamentally shape human movements and behavior [1]. In the real world, layouts are rarely static; users naturally adjust their surroundings to fit their specific body dimensions and movements.
  - Because the model ignores the anthropometric clearance required to actually push a chair back and stand up, the resulting layout creates an immediate conflict zone — a functional failure that a human user would have instinctively avoided by adjusting the furniture distance.
    
    Left: the stand-up clearance (blue) blocked by furniture set too close, with the unusable overlap in magenta. Right: the same clearance kept free by spacing the furniture apart.
  - We leverage VLMs to infer object functions from visual cues and reason about potential human interactions based on scene type and layout criteria.
  - For PO and HO conditions, we instantiated participant-specific anthropometric profiles from each participant's Skinned Multi-Person Linear (SMPL) [2] model, parameterizing the constraints (e.g., passage widths, reaching envelopes, and viewing requirements) to each participant's actual body dimensions.
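As a rough illustration of how body dimensions could parameterize such constraints, the sketch below maps one participant's measurements to a passage width, a reach limit, and a viewing height. The field names, the formulas, the margin, and the two example profiles are all hypothetical and chosen only for illustration; they are not values or definitions from the study, and in practice the measurements would be extracted from the participant's SMPL mesh.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnthropometricProfile:
    """Body measurements in metres (hypothetical field names)."""
    stature: float
    shoulder_breadth: float
    arm_reach: float   # shoulder to fingertip
    eye_height: float  # standing eye height


def constraint_parameters(profile: AnthropometricProfile,
                          passage_margin: float = 0.10) -> dict:
    """Derive constraint parameters from one participant's body dimensions.

    The formulas and the default margin are illustrative assumptions.
    """
    return {
        # A passage must fit the shoulders plus a margin on each side.
        "min_passage_width": profile.shoulder_breadth + 2 * passage_margin,
        # Objects that are operated by hand should sit within arm's reach.
        "max_reach_distance": profile.arm_reach,
        # Viewing targets are referenced to the standing eye height.
        "viewing_height": profile.eye_height,
    }


# Two made-up profiles standing in for a smaller and a larger participant.
small = AnthropometricProfile(stature=1.52, shoulder_breadth=0.38,
                              arm_reach=0.64, eye_height=1.41)
large = AnthropometricProfile(stature=1.88, shoulder_breadth=0.50,
                              arm_reach=0.80, eye_height=1.76)

params_small = constraint_parameters(small)
params_large = constraint_parameters(large)
```

Under these assumptions the larger profile yields a wider minimum passage and a longer reach limit, which is the kind of per-participant difference the PO and HO conditions are meant to capture.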
2. Related Works
  - Ergonomics literature distinguishes between structural anthropometry—static body dimensions—and functional anthropometry, which describes the dynamic range of motion and clearance required for tasks, emphasizing that true usability depends on accommodating the latter [3].
    
    Structural anthropometry (gray, the static body box) and functional anthropometry (green, the movement and reach envelope).
  - Our work builds on the constraint optimization framework of LayoutVLM [4] and extends it by integrating anthropometric data directly into the constraint quantification process. Rather than relying on generic distances drawn from LLM common sense, we compute person-specific operational requirements from individual body measurements and intended interactions, advancing scene generation from semantically plausible to human-operational.
  - The importance of this personalization becomes evident when considering human diversity: a 5th-percentile female and a 95th-percentile male require substantially different operational spaces [5]. However, current anthropometric-driven design methods apply uniform constraints that may be insufficient for larger individuals or wastefully spacious for smaller ones. By integrating anthropometric data directly into the generation process, we ensure that layouts accommodate the specific individuals who will inhabit these spaces.
  - Although the Fréchet Inception Distance and Kernel Inception Distance metrics [6, 7, 8] effectively validate visual quality and learning performance, they fail to capture the behavioral and operational aspects that determine actual usability.
3. Method
  - Our framework takes scene instructions and assets as inputs and proceeds through two main stages: (1) Semantic and Behavioral Representation for Spatial Relation Construction, which constructs behavior-aware relational representations that integrate object semantics, human-object interaction patterns, and group-level spatial relations; and (2) Constraint-based Layout Generation, which infers anthropometrically grounded constraint representations suitable for differentiable spatial optimization.
  - The [A-E] stages interpret raw 3D assets and layout criteria to produce behavior-aware relational representations that link object geometry, function, and human interaction.
  - Each object is rendered from four orthogonal viewpoints (0°, 90°, 180°, and 270°) to capture comprehensive geometric details. The multi-view approach captures fine-grained features (e.g., casters, hinges, or door knobs) that are often absent from text descriptions.
    
    [B] Data Preprocessing · each asset is rendered from 0° / 90° / 180° / 270° with an orthographic camera; the four captures shown on the view planes (each perpendicular to its camera axis) are what the VLM reads.
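The four-view setup can be sketched as camera placements on a circle around the asset. This is a minimal illustration that assumes a z-up coordinate frame with azimuth measured in the XY plane from the +X axis. The function name `orbit_camera_positions` and the radius are hypothetical, and the actual rendering call depends on whichever renderer is used.

```python
import math


def orbit_camera_positions(center, radius, azimuths_deg=(0, 90, 180, 270)):
    """Return one camera pose per azimuth, each looking at the asset center.

    Assumes a z-up frame; with an orthographic camera the radius only needs
    to keep the asset inside the clipping range.
    """
    cx, cy, cz = center
    views = []
    for azimuth in azimuths_deg:
        theta = math.radians(azimuth)
        position = (cx + radius * math.cos(theta),
                    cy + radius * math.sin(theta),
                    cz)
        # Unit vector from the camera toward the asset center.
        forward = (-math.cos(theta), -math.sin(theta), 0.0)
        views.append({"azimuth_deg": azimuth,
                      "position": position,
                      "forward": forward})
    return views


views = orbit_camera_positions(center=(0.0, 0.0, 0.5), radius=2.0)
for view in views:
    print(view["azimuth_deg"], view["position"], view["forward"])
```

Because consecutive azimuths differ by 90°, neighbouring viewing directions are perpendicular, so each image plane is perpendicular to its own camera axis and to the two adjacent image planes.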
  - This stage infers each object's functional and behavioral properties: how it operates and how humans can interact with it — a cabinet may open outward or slide laterally, and a chair may swivel or remain fixed. The VLM produces a [C] Functional Description, capturing qualitative information about movable parts and usage requirements.
    
    [C] Functional Description · the inferred movable parts: an upper door that slides along X and a lower door that swings open on a hinge; keep the front clear for access.
  - The VLM also produces a [D] Human-Object Interaction Pattern, identifying the top five human actions associated with the object based on atomic visual actions (e.g., sit, open, pull) [9].
    
    [D] Human-Object Interaction Pattern · the top actions the object affords — open / close (door), pull / push (drawer), touch; two-bone inverse kinematics lands each hand on the movable part.
  - Meaningful spatial reasoning emerges when objects are considered as part of behavioral configurations; for example, chairs around a desk forming a workspace or sofas arranged around a coffee table creating a lounge area. The [E] Semantic Grouping stage identifies both intra-group spatial relations (internal arrangement within a functional unit) and inter-group spatial relations (connectivity between distinct groups).
    
    [E] Semantic Grouping · objects that function together form a group (work / lounge / storage); solid links are intra-group relations, dashed magenta links are inter-group relations.
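One simple way to represent the output of this stage is a set of named groups plus a list of pairwise relations, where a relation counts as intra-group when both objects belong to the same group and as inter-group otherwise. The class names, the function `classify_relations`, and the example objects below are hypothetical; in the framework the groups and relations come from the VLM rather than from hand-written data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    description: str  # natural-language spatial relation


@dataclass(frozen=True)
class SemanticGroup:
    name: str
    members: tuple


def classify_relations(groups, relations):
    """Split relations into intra-group and inter-group lists.

    Every object named in a relation must belong to exactly one group;
    an ungrouped object raises KeyError.
    """
    owner = {obj: group.name for group in groups for obj in group.members}
    intra, inter = [], []
    for relation in relations:
        if owner[relation.source] == owner[relation.target]:
            intra.append(relation)
        else:
            inter.append(relation)
    return intra, inter


groups = [
    SemanticGroup("work", ("desk", "chair")),
    SemanticGroup("lounge", ("sofa", "coffee_table")),
]
relations = [
    Relation("chair", "desk", "chair faces the desk within reach"),
    Relation("sofa", "coffee_table", "sofa faces the coffee table"),
    Relation("desk", "sofa", "walkable passage between work and lounge"),
]

intra, inter = classify_relations(groups, relations)
```

Here the chair-desk and sofa-coffee table relations are intra-group, while the desk-sofa passage is the single inter-group relation.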
  - The [F-G] stages infer anthropometrically grounded constraint representations from natural-language spatial relations. They convert these relations into executable, differentiable constraints required for spatial optimization. Our framework infers constraint specifications by explicitly referencing behavioral semantics and anthropometric rationale.
  - The taxonomy organizes natural-language relations into learnable constraint types (positional, orientational, and height-based) such as chair against wall, table aligned with sofa, or lamp on top of desk.
    \[ \begin{array}{llll} \hline \textbf{Constraint Type} & \textbf{Constraint Name} & \textbf{Method} & \textbf{Description} \\ \hline \text{Position-based} & L_{\text{distance}}(p_i, p_j, d_{\min}, d_{\max}) & \text{LayoutVLM} & \text{Distance between the two assets should fall within the range } [d_{\min}, d_{\max}]. \\ & & \text{Ours} & \text{Constrains the distance between two objects to } [d_{\min}, d_{\max}]\text{, where the bounds are inferred from reach and clearance requirements based on anthropometric data.} \\ & L_{\text{against wall}}(p_i, w_j, b_i) & \text{LayoutVLM} & \text{Place an asset against wall } w_j. \\ & & \text{Ours} & \text{Places the object against a specific wall while considering accessibility and clearance requirements for nearby interactions.} \\ \hline \text{Orientation-based} & L_{\text{align with}}(p_i, p_j, \Theta) & \text{LayoutVLM} & \text{Align two assets at a specified angle } \Theta. \\ & & \text{Ours} & \text{Aligns the rotations of two objects; the angle parameter } \Theta \text{ reflects task-oriented alignment (e.g., parallel or perpendicular configurations for joint use).} \\ & L_{\text{point towards}}(p_i, p_j, \Theta) & \text{LayoutVLM} & \text{Orient one asset to face another with an offset angle } \Theta. \\ & & \text{Ours} & \text{Adjusts orientation so that an object's front faces the target, with } \Theta \text{ encoding preferred viewing or interaction directions (e.g., facing a desk or seating area).} \\ \hline \text{Height-based} & L_{\text{on top of}}(p_i, p_j, h) & \text{LayoutVLM} & \text{Position one asset on top of another.} \\ & & \text{Ours} & \text{Defines a vertical stacking relationship for placing smaller objects on surfaces while keeping sufficient interaction area.} \\ \hline \end{array} \]
    
    Table 1. Spatial Constraint Taxonomy for Behavior-Aware Anthropometric Scene Generation, comparing our behavior-aware, anthropometrically grounded constraints with the geometric relations of LayoutVLM [4]. · Notation: \(p_i, p_j\) object poses; \(w_j\) wall index; \(b_i\) object bounding box; \(d_{\min}, d_{\max}\) minimum/maximum center distance; \(\Theta\) relative angle; \(h\) height offset.
    
    Spatial constraint taxonomy, animated · each of Table 1's five losses pulls its movable object from a violating placement (high loss) into the satisfied one (loss → 0); the coloured arrow is the loss's negative gradient. Grouped by type — position (blue), orientation (amber), height (green).
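To make the "negative gradient pulls the object into place" picture concrete, the sketch below implements one plausible form of a point-towards loss for a 2D pose and minimizes it by plain gradient descent on the object's heading. The loss form 1 − cos(θ − θ*), the learning rate, and the example positions are assumptions made for illustration; they are not claimed to match the loss used in LayoutVLM or in this framework.

```python
import math


def target_heading(p_i, p_j, offset=0.0):
    """Heading (radians) from object i toward object j, plus the offset angle."""
    return math.atan2(p_j[1] - p_i[1], p_j[0] - p_i[0]) + offset


def point_towards_loss(theta, p_i, p_j, offset=0.0):
    """Zero when the object's front faces the target, at most 2 when it faces away."""
    return 1.0 - math.cos(theta - target_heading(p_i, p_j, offset))


def point_towards_grad(theta, p_i, p_j, offset=0.0):
    """Analytic derivative of the loss with respect to the heading theta."""
    return math.sin(theta - target_heading(p_i, p_j, offset))


def descend(theta, p_i, p_j, offset=0.0, lr=0.5, steps=100):
    """Follow the negative gradient of the loss for a fixed number of steps."""
    for _ in range(steps):
        theta -= lr * point_towards_grad(theta, p_i, p_j, offset)
    return theta


chair, desk = (0.0, 0.0), (1.0, 1.0)
theta_start = 2.0  # violating heading, in radians
theta_final = descend(theta_start, chair, desk)
loss_start = point_towards_loss(theta_start, chair, desk)
loss_final = point_towards_loss(theta_final, chair, desk)
```

Starting from a heading of 2.0 rad, the descent settles at the direction from the chair to the desk (π/4 rad in this example) and the loss falls toward zero.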
  - Among these, the distance constraint requires additional clarification because it directly governs human accessibility and functional clearance.
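A distance constraint of the form L_distance(p_i, p_j, d_min, d_max) can be sketched as a two-sided penalty that is zero inside [d_min, d_max] and grows quadratically outside it. The quadratic hinge, the analytic gradient, the step size, and the example bounds below are illustrative assumptions; in the framework the bounds would be inferred from reach and clearance requirements rather than hard-coded.

```python
import math


def distance_loss(p_i, p_j, d_min, d_max):
    """Zero when the center distance lies in [d_min, d_max], quadratic outside."""
    d = math.hypot(p_i[0] - p_j[0], p_i[1] - p_j[1])
    return max(0.0, d_min - d) ** 2 + max(0.0, d - d_max) ** 2


def distance_grad(p_i, p_j, d_min, d_max):
    """Gradient of the loss with respect to p_i, with p_j held fixed."""
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        return (0.0, 0.0)  # direction undefined when the centers coincide
    dloss_dd = -2.0 * max(0.0, d_min - d) + 2.0 * max(0.0, d - d_max)
    return (dloss_dd * dx / d, dloss_dd * dy / d)


def descend(p_i, p_j, d_min, d_max, lr=0.25, steps=60):
    """Move object i along the negative gradient until the loss vanishes."""
    x, y = p_i
    for _ in range(steps):
        gx, gy = distance_grad((x, y), p_j, d_min, d_max)
        x, y = x - lr * gx, y - lr * gy
    return (x, y)


desk = (0.0, 0.0)
chair_start = (0.3, 0.0)   # too close: no room to push the chair back
d_min, d_max = 0.6, 0.9    # hypothetical bounds in metres
chair_final = descend(chair_start, desk, d_min, d_max)
final_distance = math.hypot(chair_final[0] - desk[0], chair_final[1] - desk[1])
```

In this example the chair starts 0.3 m from the desk, violating the lower bound, and the descent pushes it outward along the line joining the two centers until it reaches the d_min boundary, where the loss and its gradient vanish.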
4. Validation
5. User Studies
6. Conclusion
7. References
  - [1] Weizhou Luo, Zhongyuan Yu, Rufat Rzayev, Marc Satkowski, Stefan Gumhold, Matthew McGinity, and Raimund Dachselt. 2023. Pearl: Physical Environment Based Augmented Reality Lenses for In-Situ Human Movement Analysis. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), 1–15.
  - [2] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: A Skinned Multi-Person Linear Model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 34, 6 (2015), 248:1–248:16.
  - [3] Carlos Viviani, Pedro M. Arezes, Sara Bragança, Johan Molenbroek, Iman Dianat, and Hector I. Castellucci. 2018. Accuracy, Precision and Reliability in Anthropometric Surveys for Ergonomics Purposes in Adult Working Populations: A Literature Review. International Journal of Industrial Ergonomics 65 (2018), 1–16.
  - [4] Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu. 2025. LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 29469–29478.
  - [5] Julius Panero and Martin Zelnik. 1979. Human Dimension and Interior Space: A Source Book of Design Reference Standards. Watson-Guptill, New York.
  - [6] Chenguo Lin and Yadong Mu. 2024. InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior. arXiv:2402.04717.
  - [7] Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. 2021. ATISS: Autoregressive Transformers for Indoor Scene Synthesis. In Advances in Neural Information Processing Systems (NeurIPS) 34 (2021), 12013–12026.
  - [8] Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner. 2024. DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 20507–20518.
  - [9] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. 2018. AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6047–6056.