Design Intents Disentanglement: A Multimodal Approach For Grounding Design Attributes In Objects
Language is ambiguous: many terms and expressions convey the same idea. This is especially true in design fields, where conceptual ideas are generally described by high-level, qualitative attributes called design intents. Words such as "organic", sentences like "this chair is a mixture between Japanese aesthetics and Scandinavian design", or more complex structures such as "we made the furniture layering materials like a bird weaving its nest" all express design intents. Furthermore, most design intents do not have unique visual representations and are highly entangled within the design artifact, leading to complex relationships between language and images. Despite advances in vision-and-language representations, state-of-the-art (SOTA) machine learning (ML) models cannot efficiently disentangle those relationships and are consequently incapable of modeling their joint distribution. Beyond its relevance in pushing ML research boundaries, solving this problem would significantly impact creative practitioners such as designers, architects, and engineers. Real-time understanding of design intents could open new design scenarios (e.g., voice-assisted natural language input) that reduce the need to reinterpret intents as the imperative commands (move, circle, radius, extrude, vertical) required by digital design engines. ML provides an alternative to such current design frameworks: the means to learn visual representations of high-level descriptions such as "dynamic", "light", or "minimalist". For a design audience, this paper examines an alternative design scenario based on the everyday natural language used by designers, in which inputs such as "a minimal and sleek looking chair" are visually inferred by algorithms that have previously learned complex associations between designs and intents (vision and language, respectively) through hundreds of thousands of examples of designs.
This contrasts with current design software, such as AutoCAD or Rhinoceros, which requires a lengthy, skillful sequence of deterministic commands that manually shape the design object. We hypothesize that design objects latently register designers' conceptual ideas, normally expressed in natural language at both the conceptual and post-design phases, and that ML has the potential to abstract these conceptual ideas from images of the design object. We propose a multimodal sequence-to-sequence model based on the Transformer architecture of Vaswani et al., which takes design images and their descriptions as input and outputs a probability distribution over regions of the images in which design attributes are grounded. As expected, our model can reason about and ground objective, simple descriptors such as "black" or "curved". Surprisingly, it is also able to reason about and ground more complex, subjective attributes such as "rippled" or "free", suggesting potential regions where the design object might register such vague descriptions. In addition, we introduce the Conceptual Design Descriptions (CODED) dataset of contemporary design work, which aims to provide a foundational resource for grounding high-level attributes. It was assembled by collecting articles that include editorial descriptions along with associated images of the creative work. CODED is an organized dataset of over 240k images and 260k descriptive sentences of visual work provided by the original designers or by curators.
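The core mechanism described above, mapping a textual attribute to a probability distribution over image regions, can be illustrated with a single cross-attention step of the kind used in Transformer layers. The sketch below is not the paper's implementation: the dimensions, the random projection weights, and the 7x7 region grid are all hypothetical stand-ins.

```python
# Minimal sketch of text-to-image-region grounding via cross-attention.
# All features and weights are random placeholders; in a trained model,
# region features would come from a visual backbone and the projections
# would be learned parameters.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 64                                          # embedding dimension (assumed)
n_regions = 49                                  # e.g., a 7x7 grid of image regions
region_feats = rng.normal(size=(n_regions, d))  # visual features, one per region
text_query = rng.normal(size=(d,))              # embedding of an attribute, e.g. "minimalist"

# Query/key projections (random stand-ins for learned weights)
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_k = rng.normal(size=(d, d)) / np.sqrt(d)

q = text_query @ W_q
k = region_feats @ W_k
scores = (k @ q) / np.sqrt(d)   # one compatibility score per image region
p_regions = softmax(scores)     # probability distribution over regions

assert p_regions.shape == (n_regions,)
assert np.isclose(p_regions.sum(), 1.0)
print(int(p_regions.argmax()))  # index of the region where the attribute is most grounded
```

The softmax output sums to one, so the attention weights can be read directly as a grounding distribution: high-probability regions are those the model associates with the queried attribute.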
Keywords: SDG 9, Multimodal Machine Learning, Image-Language Grounding, Design Intents, Natural Language Processing