The fashion industry is a highly lucrative sector, projected to reach a value of $2.1 trillion by 2025, according to the World Bank. It spans the design, manufacturing, distribution, and sale of apparel, footwear, and accessories. Because new styles and trends emerge constantly, fashion companies must remain agile to stay relevant and succeed in the market.
Generative artificial intelligence (AI) refers to algorithms that create new content—be it images, text, audio, or video—based on learned data patterns. In the fashion realm, this technology can revolutionize apparel design by enhancing personalization and reducing costs. AI-powered design tools can craft unique clothing designs based on customer input via text prompts. Moreover, AI can tailor designs to individual preferences; for instance, a customer might choose from various colors and patterns, leading to the generation of a one-of-a-kind design. Although the adoption of AI in fashion faces several technical, feasibility, and cost-related challenges, advanced generative AI techniques like natural language-based image semantic segmentation and diffusion can now overcome these hurdles for virtual styling.
This blog details the implementation of generative AI-assisted online fashion styling through text prompts. Machine learning (ML) engineers can fine-tune and deploy text-to-semantic-segmentation and in-painting models built on pre-trained models such as CLIPSeg and Stable Diffusion using Amazon SageMaker. This empowers fashion designers and consumers to create virtual modeling images from textual descriptions while selecting preferred styles.
Generative AI Solutions
The CLIPSeg model introduced a groundbreaking method for image semantic segmentation, allowing users to identify fashion items in images with simple text commands. Built on CLIP's text and image encoders, it maps both textual and visual information into a shared multimodal embedding space, enabling accurate segmentation of target objects based on the input prompt. The model benefits from extensive training that incorporates zero-shot transfer, natural language supervision, and multimodal self-supervised contrastive learning, which means you can use the publicly available pre-trained model released by Timo Lüddecke et al. without further customization.
CLIPSeg uses a text encoder to convert the text prompt into a text embedding, while the image encoder processes the input image into an image embedding. The two embeddings are then combined and passed through a lightweight transformer-based decoder to produce the final segmentation mask. The model is trained on a dataset of images paired with text prompts describing the objects to be segmented; once trained, it can take new text prompts and images and generate segmentation masks for the described objects.
Stable Diffusion is another innovative technique that lets fashion designers generate highly realistic images in large quantities from text descriptions alone, removing the need for lengthy and costly manual customization. This is particularly advantageous for designers who want to prototype trendy styles quickly and for manufacturers aiming to offer personalized products at lower cost.
The architecture and data flow of Stable Diffusion are shown in the accompanying diagram. Unlike traditional GAN-based methods, Stable Diffusion generates more stable and photorealistic images that closely align with the original image distribution. The model can be conditioned for various purposes, including text-to-image generation, layout-to-image generation with bounding boxes, in-painting with masked images, and super-resolution of lower-resolution images. The practical applications of diffusion models are vast, benefiting sectors such as fashion, retail, e-commerce, entertainment, social media, and marketing.
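The in-painting mode is the one that powers the virtual styling workflow described later: given an original photo and a mask covering a garment, the model redraws only the masked region according to a text prompt. The following is a minimal sketch using the Hugging Face Diffusers library; the model ID, the prompt, and the init_image and mask_image inputs are illustrative assumptions rather than part of the original solution.

import torch
from diffusers import StableDiffusionInpaintPipeline

# Load a pre-trained Stable Diffusion in-painting pipeline (model ID chosen for illustration)
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# init_image is the original photo and mask_image marks the garment to redraw (both PIL images)
styled = pipe(
    prompt="a red floral summer dress",
    image=init_image,
    mask_image=mask_image,
).images[0]
styled.save("styled.png")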
Vogue online styling provides customers with AI-driven fashion advice and recommendations through a digital platform. This service selects clothing and accessories that enhance the customer’s appearance, align with their budget, and reflect their personal preferences. With generative AI, these tasks can be performed more efficiently, leading to greater customer satisfaction and lower costs.
The solution can be deployed on an Amazon Elastic Compute Cloud (Amazon EC2) p3.2xlarge instance, which has a single V100 GPU with 16 GB of memory. Several techniques were employed to improve performance and reduce GPU memory usage, resulting in faster image generation: using half-precision (fp16) weights and enabling memory-efficient attention to reduce bandwidth in the attention blocks, as sketched below.
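As a sketch of these optimizations, assuming the Diffusers in-painting pipeline from the earlier example, half precision is selected when the weights are loaded and memory-efficient attention is enabled explicitly:

import torch
from diffusers import StableDiffusionInpaintPipeline

# Loading the weights in fp16 roughly halves GPU memory use and speeds up inference
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# Memory-efficient attention (via xFormers) reduces memory bandwidth in the attention blocks
pipe.enable_xformers_memory_efficient_attention()

# Alternatively, attention slicing trades a little speed for lower peak GPU memory
# pipe.enable_attention_slicing()

Note that enabling xFormers memory-efficient attention requires the xFormers package to be installed on the instance.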
First, the user uploads a fashion image, and the pre-trained CLIPSeg model is downloaded and extracted. The image is then normalized and resized to meet the model's size requirements; Stable Diffusion V2 supports image resolutions up to 768×768, while V1 supports up to 512×512.
import requests
import torch
from PIL import Image
from torchvision import transforms
from models.clipseg import CLIPDensePredT

# Helper assumed from the accompanying notebook: fetch the original fashion image
def download_image(url):
    return Image.open(requests.get(url, stream=True).raw).convert("RGB")

# The original image; img_url points to the uploaded fashion image
image = download_image(img_url).resize((768, 768))

# Download the pre-trained CLIPSeg model and unzip the package
! wget https://owncloud.gwdg.de/index.php/s/ioHbRzFx6th32hn/download -O weights.zip
! unzip -d weights -j weights.zip

# Load the CLIPSeg model. Available CLIP backbones = ['RN50', 'RN101', 'RN50x4',
# 'RN50x16', 'RN50x64', 'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']
model = CLIPDensePredT(version='ViT-B/16', reduce_dim=64)
model.eval()

# Load non-strictly, because only the decoder weights were stored (not the CLIP weights)
model.load_state_dict(torch.load('weights/rd64-uni.pth',
                                 map_location=torch.device('cuda')), strict=False)

# Image normalization and resizing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.Resize((768, 768)),
])
img = transform(image).unsqueeze(0)
Using the pre-trained CLIPSeg model, we can extract the target object from an image with a text prompt. The prompt is passed through the text encoder to obtain a text embedding, and the image is passed through the image encoder to obtain an image embedding. The two embeddings are then combined and decoded into the final segmentation mask, which highlights the target object described in the prompt, as shown in the sketch below.
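A minimal sketch of this step follows, assuming the model and the img tensor prepared earlier; the prompts and the 0.5 threshold used to binarize the mask are illustrative assumptions.

import torch

# Illustrative text prompts describing the fashion items to segment
prompts = ['a shirt', 'a pair of trousers']

with torch.no_grad():
    # The image is repeated so that one prediction is made per prompt
    preds = model(img.repeat(len(prompts), 1, 1, 1), prompts)[0]

# Convert the logits for the first prompt into a binary mask
mask = torch.sigmoid(preds[0][0]) > 0.5

The resulting mask can then be converted into an image and passed as the mask_image input to the in-painting pipeline sketched earlier, so that only the selected garment is redrawn.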