Behavior-Aware Anthropometric Scene Generation for Human-Usable 3D Layouts

04/20/2026

Introduction

Physical environments fundamentally shape human movements and behavior. In the real world, layouts are rarely static; users naturally adjust their surroundings to fit their specific body dimensions and movements. When a layout-generation model ignores the anthropometric clearance required to actually push a chair back and stand up, the resulting layout creates an immediate conflict zone, a functional failure that a human user would have instinctively avoided by adjusting the furniture distance.

Figure: From the left: the stand-up clearance (blue) blocked by furniture set too close, with the unusable overlap in magenta · the same clearance kept free by spacing the furniture apart.

We leverage VLMs to infer object functions from visual cues and reason about potential human interactions based on scene type and layout criteria. For PO and HO conditions, we instantiated participant-specific anthropometric profiles from each participant's Skinned Multi-Person Linear (SMPL) model, parameterizing the constraints (e.g., passage widths, reaching envelopes, and viewing requirements) to each participant's actual body dimensions.

Related Works

Ergonomics literature distinguishes between structural anthropometry (static body dimensions) and functional anthropometry, which describes the dynamic range of motion and clearance required for tasks, emphasizing that true usability depends on accommodating the latter.

Figure: Structural anthropometry (gray, the static body box) and functional anthropometry (green, the movement and reach envelope).

Our work builds upon LayoutVLM's constraint optimization framework but extends it by integrating anthropometric data directly into the constraint quantification process.
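As a rough illustration of what tying constraints to body dimensions can look like, the sketch below derives a passage width and a stand-up clearance from a per-person profile and scores a candidate gap with a hinge penalty. Everything in it is an assumption made for illustration: the `BodyProfile` fields, the `margin` and `push_back` defaults, the hinge form of the penalty, and the numeric profiles (placeholders, not survey data). It is not the constraint formulation used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BodyProfile:
    """Per-person dimensions in metres. The field choice is a hypothetical example."""
    shoulder_breadth: float
    buttock_knee_length: float


def passage_width(profile: BodyProfile, margin: float = 0.10) -> float:
    """Static breadth plus a hypothetical sway margin on each side."""
    return profile.shoulder_breadth + 2.0 * margin


def stand_up_clearance(profile: BodyProfile, push_back: float = 0.30) -> float:
    """Depth needed behind a desk edge to push a chair back and stand up."""
    return profile.buttock_knee_length + push_back


def clearance_violation(available: float, required: float) -> float:
    """Hinge penalty: zero when the space suffices, otherwise the shortfall in metres."""
    return max(0.0, required - available)


# Placeholder profiles for illustration only, NOT measured data.
smaller = BodyProfile(shoulder_breadth=0.38, buttock_knee_length=0.53)
larger = BodyProfile(shoulder_breadth=0.50, buttock_knee_length=0.65)

gap_behind_desk = 0.90  # metres of free floor in a candidate layout (assumed)
for name, profile in (("smaller", smaller), ("larger", larger)):
    shortfall = clearance_violation(gap_behind_desk, stand_up_clearance(profile))
    print(f"{name}: shortfall {shortfall:.2f} m")
```

With these placeholder values, the same gap is sufficient for the smaller profile and falls short for the larger one, which is the kind of difference a single generic distance cannot express.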
Rather than relying on generic distances from LLM common sense, we compute person-specific operational requirements based on individual body measurements and intended interactions, advancing scene generation from semantically plausible to human-operational. The importance of this personalization becomes evident when considering human diversity: a 5th-percentile female and a 95th-percentile male require substantially different operational spaces, yet current anthropometric-driven design methods apply uniform constraints that may be insufficient for larger individuals or wastefully spacious for smaller ones. By integrating anthropometric data directly into the generation process, we ensure that layouts accommodate the specific individuals who will inhabit these spaces. Although Fréchet Inception Distance and Kernel Inception Distance metrics effectively validate visual quality and learning performance, they fail to capture the behavioral and operational aspects that determine actual usability.

Method

Our framework takes scene instructions and assets as inputs and proceeds through two main stages:

(1) Semantic and Behavioral Representation for Spatial Relation Construction: constructing behavior-aware relational representations that integrate object semantics, human-object interaction patterns, and group-level spatial relations.

(2) Constraint-based Layout Generation: inferring anthropometrically grounded constraint representations suitable for differentiable spatial optimization.

The [A-E] stages interpret raw 3D assets and layout criteria to produce behavior-aware relational representations that link object geometry, function, and human interaction. Each object is rendered from four orthogonal viewpoints (0°, 90°, 180°, and 270°) to capture comprehensive geometric details. The multi-view approach captures fine-grained features (e.g., casters, hinges, or door knobs) that are often absent from text descriptions.
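The four-view setup can be sketched as camera placements on a circle around the asset. This is a minimal sketch under stated assumptions: a Z-up coordinate frame, yaw measured about the vertical axis, and a hypothetical helper `view_cameras` with a `radius` parameter. The actual renderer and camera conventions are not specified here. For an orthographic camera the distance does not change the image scale, so `radius` only needs to keep the camera outside the object.

```python
import math


def view_cameras(center, radius, yaw_degrees=(0, 90, 180, 270)):
    """Return one camera per yaw angle, each looking at `center` horizontally.

    Assumes a Z-up frame; `center` is an (x, y, z) tuple in scene units.
    """
    cx, cy, cz = center
    cameras = []
    for yaw in yaw_degrees:
        angle = math.radians(yaw)
        # Place the camera on a horizontal circle around the object.
        position = (cx + radius * math.cos(angle),
                    cy + radius * math.sin(angle),
                    cz)
        # Unit vector pointing from the camera back toward the object.
        look_dir = (-math.cos(angle), -math.sin(angle), 0.0)
        cameras.append({"yaw_deg": yaw, "position": position, "look_dir": look_dir})
    return cameras


cams = view_cameras(center=(0.0, 0.0, 1.0), radius=2.0)
for cam in cams:
    x, y, z = cam["position"]
    print(f"yaw {cam['yaw_deg']:>3}°: camera at ({x:.2f}, {y:.2f}, {z:.2f})")
```

Neighbouring views are perpendicular and opposite views face each other, so every side of the asset appears in exactly one capture.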
[B] Data Preprocessing · each asset is rendered from 0° / 90° / 180° / 270° with an orthographic camera; the four captures shown on the view planes (each perpendicular to its camera axis) are what the VLM reads.

This stage infers each object's functional and behavioral properties, that is, how it operates and how humans can interact with it: a cabinet may open outward or slide laterally, and a chair may swivel or remain fixed. The VLM produces a [C] Functional Description, capturing qualitative information about movable parts and usage requirements.

[C] Functional Description · the inferred movable parts: an upper door that slides along X and a lower door that swings open on a hinge; keep the front clear for access.

The VLM also produces a [D] Human-Object Interaction Pattern, identifying the top five human actions associated with the object based on atomic visual actions (e.g., sit, open, pull).

[D] Human-Object Interaction Pattern · the top actions the object affords: open / close (door), pull / push (drawer), touch; two-bone inverse kinematics lands each hand on the movable part.

Meaningful spatial reasoning emerges when objects are considered as part of behavioral configurations; for example, chairs around a desk forming a workspace or sofas arranged around a coffee table creating a lounge area. The [E] Semantic Grouping stage identifies both intra-group spatial relations (internal arrangement within a functional unit) and inter-group spatial relations (connectivity between distinct groups).

[E] Semantic Grouping · objects that function together form a group (work / lounge / storage); solid links are intra-group relations, dashed magenta links are inter-group relations.

Validation User Studies