Semantron 23 Summer 2023

AI and creative jobs

programme. DALL-E-2 is also capable of understanding, in the second image, what an oil painting looks like and that the red panda should hold the coffee in its hands.

DALL-E-2 does this with two processes. One converts the caption into a representation of an image; the component that does this is called the prior. The other turns that representation into an actual image; this component is called the decoder.

To do this, DALL-E-2 uses a different technology called CLIP. CLIP is a neural network model that returns the best caption when an image is inputted, the opposite of what DALL-E-2 is designed to do. CLIP is a contrastive model: rather than classifying images, it matches images to corresponding captions collected from the internet. To do this, CLIP trains two encoders. One encoder turns images into image embeddings and the other turns text captions into text embeddings; an embedding is a representation of data as numbers that can be compared mathematically. CLIP is trained so that the similarity between the embedding of an image and the embedding of its caption is as high as possible.

The prior takes the CLIP text embedding and creates a CLIP image embedding from it. The researchers tried two options for the prior, an autoregressive prior and a diffusion prior, and found that the diffusion model worked better. Diffusion models take a piece of data and add noise to it until it is unrecognizable; from that point, they try to reconstruct the data in its original form, and by doing so they learn how to generate images.

The decoder also uses a diffusion model and builds on another technology made by OpenAI called GLIDE. GLIDE is an image-generating model, but unlike a pure diffusion model it includes the embedding of the text that was given to the model, to guide the image-creation process.

DALL-E-2 can also make variations of an image, keeping the main element and style but changing the trivial details. For example, if the user wanted different variations of the Mona Lisa, DALL-E-2 would create images that looked like the Mona Lisa but from different angles.
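The contrastive matching that CLIP performs can be sketched in a few lines. This is a toy illustration, not CLIP itself: the four-dimensional embeddings and the captions below are invented for the example, and real CLIP embeddings have hundreds of dimensions and come from trained encoders. The idea it shows is the one described above: the matched image-caption pair should have the highest similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity: how closely two embedding vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dimensional embeddings for three image-caption pairs.
image_embeddings = [
    [1.0, 0.1, 0.0, 0.0],  # image of a red panda
    [0.0, 1.0, 0.1, 0.0],  # image of the Mona Lisa
    [0.0, 0.0, 1.0, 0.1],  # image of a cup of coffee
]
text_embeddings = [
    [0.9, 0.0, 0.1, 0.0],  # caption "a red panda"
    [0.1, 0.9, 0.0, 0.0],  # caption "the Mona Lisa"
    [0.0, 0.1, 0.9, 0.0],  # caption "a cup of coffee"
]

# For each image, pick the caption whose embedding is most similar.
# Training pushes each matched pair's similarity as high as possible,
# so image i should end up matched with caption i.
best_matches = []
for img in image_embeddings:
    sims = [cosine(img, txt) for txt in text_embeddings]
    best_matches.append(sims.index(max(sims)))

print(best_matches)  # → [0, 1, 2]
```

Because both encoders map into the same embedding space, the same similarity comparison works in either direction, which is what lets DALL-E-2 reuse CLIP's image embeddings for generation.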
It does this by obtaining the image's CLIP image embedding and running it through the decoder, creating a new image. Because there are random elements in the diffusion process, running the same CLIP image embedding through the decoder again returns a different image.

Another thing DALL-E-2 can do is called inpainting. With inpainting, undesirable parts of an inputted image or video can be cut out, and the algorithm fills in the holes with content that makes sense given the context of the image. For example, if someone appeared in a picture of yourself that you wanted to remove, DALL-E-2 could remove them and make the result look real. Other programmes already do this, but DALL-E-2 has a very complex knowledge of the world, so it can make the result more realistic, with no distortion to the image or video. DALL-E-2 is also capable of adding to an image something that was never there: if you describe what you want added, DALL-E-2 will add it, although the technology is not yet at the point where many objects can be added and still look very realistic. A good example of this would be a photo where part of the image was out of focus: you could inpaint the photo to make it look as though the whole photo was taken in focus.
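The core idea of inpainting, filling a hole using the surrounding context, can be shown with a deliberately simple toy. This is not DALL-E-2's method: DALL-E-2 generates the missing region with a diffusion model, whereas the sketch below just averages known neighbouring pixel values into the hole. The grid and its values are invented for the example.

```python
# A 4x4 "image" of brightness values; None marks the cut-out region to fill.
grid = [
    [10, 10, 10, 10],
    [10, None, None, 10],
    [10, None, None, 10],
    [10, 10, 10, 10],
]

def inpaint(grid):
    """Fill each hole with the average of its already-known neighbours."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    # Keep sweeping until every hole has been filled from known neighbours.
    while any(v is None for row in out for v in row):
        for y in range(h):
            for x in range(w):
                if out[y][x] is None:
                    neighbours = [out[ny][nx]
                                  for ny, nx in ((y - 1, x), (y + 1, x),
                                                 (y, x - 1), (y, x + 1))
                                  if 0 <= ny < h and 0 <= nx < w
                                  and out[ny][nx] is not None]
                    if neighbours:
                        out[y][x] = sum(neighbours) / len(neighbours)
    return out

filled = inpaint(grid)
print(filled[1][1])  # → 10.0, matching the surrounding area
```

A neighbour average can only smear existing colours into the hole; the reason DALL-E-2's inpainting looks real is that its model can invent plausible new content for the hole rather than just blending what is already there.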

