Multimodal models are large neural network models, such as those behind ChatGPT and Gemini, that can work with both language and images. Over the past three years, several social media companies have started building them. For example, Meta has built the Llama models, and X (previously Twitter) has built the Grok models. These models are often trained on data collected from the companies' social media platforms, albeit with user consent.