Microsoft has introduced “Visual ChatGPT,” a system that lets users interact with ChatGPT using both text and images. It connects ChatGPT to a set of Visual Foundation Models, such as Transformers, ControlNet, and Stable Diffusion, so that users can send and receive images during a chat and supply visual prompts for editing images.
According to the paper, titled “Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models,” ChatGPT is trained on language and cannot process or generate images on its own, while Visual Foundation Models (VFMs) such as visual transformers and Stable Diffusion are experts on specific tasks with fixed inputs and outputs. Combining them lets a single conversation cover image understanding, generation, and editing. To bridge the gap between ChatGPT and the VFMs, the paper proposes a Prompt Manager that tells ChatGPT what each VFM can do and which input and output formats it uses, converts visual information into language format, and manages the histories, priorities, and conflicts of the different VFMs.
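The first two Prompt Manager duties can be pictured with a small sketch. It is an illustration only and not the project's code: `ToolSpec`, `build_system_prompt`, `image_to_text_reference`, the tool names, and the file path are all hypothetical. It shows how each VFM's capability and input/output format might be written out in plain language, and how an image might be handed to a text-only model as a file name plus a caption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical plain-language description of one Visual Foundation Model."""
    name: str
    description: str  # what the model can do
    inputs: str       # expected input format, stated in words
    outputs: str      # produced output format, stated in words


def build_system_prompt(tools: list[ToolSpec]) -> str:
    """List every tool in text so a language model can choose among them."""
    lines = ["You can call the following visual tools:"]
    for tool in tools:
        lines.append(
            f"- {tool.name}: {tool.description} "
            f"Input: {tool.inputs}. Output: {tool.outputs}."
        )
    return "\n".join(lines)


def image_to_text_reference(path: str, caption: str) -> str:
    """Represent an image in language form: a file name plus a short caption."""
    return f"Image file: {path}. Description: {caption}"


# Made-up tool names and formats, used only for illustration.
TOOLS = [
    ToolSpec("text_to_image", "Generates an image from a text description.",
             "a text prompt", "an image file path"),
    ToolSpec("edge_detection", "Extracts an edge map from an image.",
             "an image file path", "an image file path"),
]

prompt = build_system_prompt(TOOLS)
reference = image_to_text_reference("image/cat.png", "a cat sitting on a sofa")
print(prompt)
print(reference)
```

Describing tools and images as text is what allows a model that only reads language to reason about which visual model to call next.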
With the Prompt Manager, ChatGPT can call VFMs and receive their feedback iteratively until the user's requirements are met. Users can send images as well as text, pose complex visual questions or editing instructions that require several AI models to collaborate over multiple steps, and give feedback to request corrected results. The project's GitHub repository provides more information.
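The iterative interaction described above can be sketched as a simple loop. This is a toy model under stated assumptions and not the actual implementation: `stub_planner` stands in for the language model, and the two tools are fake functions that only manipulate file names. Every name here is hypothetical.

```python
from __future__ import annotations

from typing import Callable, Optional


def fake_text_to_image(prompt: str) -> str:
    """Stand-in for an image generator; returns a made-up file name."""
    return "image/generated.png"


def fake_recolor(path: str) -> str:
    """Stand-in for an image editor; derives a new file name from the old one."""
    return path.replace(".png", "_recolored.png")


TOOLS: dict[str, Callable[[str], str]] = {
    "text_to_image": fake_text_to_image,
    "recolor": fake_recolor,
}


def stub_planner(request: str, history: list[str]) -> Optional[tuple[str, str]]:
    """Placeholder for the language model: pick (tool, argument), or None when done."""
    if not history:
        return ("text_to_image", request)
    if len(history) == 1:
        return ("recolor", history[-1])
    return None


def run(request: str,
        planner: Callable[[str, list[str]], Optional[tuple[str, str]]] = stub_planner,
        max_steps: int = 5) -> list[str]:
    """Call tools one at a time, feeding each result back to the planner."""
    history: list[str] = []
    for _ in range(max_steps):
        step = planner(request, history)
        if step is None:  # the planner judges the request satisfied
            break
        tool_name, argument = step
        history.append(TOOLS[tool_name](argument))
    return history


results = run("a red apple on a table")
print(results)
```

The `max_steps` cap is a design choice in this sketch that keeps the loop from running forever if the planner never signals completion.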