Podcast: Play in new window | Download
AI image tools can produce impressive results – but getting something useful for your brand is a different story. Leading AI content educator Ross Symons breaks down why most people struggle with AI-generated visuals and how a few key shifts can dramatically improve your output. If you’ve ever felt like your results don’t match what’s in your head, this episode will help close that gap.
Why Good AI Image Generation Is Harder Than It Looks
I’ve been interested in AI-generated imagery for a while now, but it wasn’t until I sat down with Ross Symons, co-founder of Zen Robot and one of the more clear-eyed thinkers I’ve encountered in the generative AI space, that I really started to understand why so many people’s results look, well, like AI made them.
The short version: most people are using the wrong mental model for these tools. And once that clicks, a lot of other things start to make sense.
Diffusion Models Are Not Chatbots
Here’s the thing that trips up a lot of marketers and business owners getting started with AI imagery. They’ve been trained, by ChatGPT or Gemini or Claude, to speak naturally to AI tools. You explain what you want, it responds. A conversation.
Diffusion models, the type that power tools like Midjourney and Imagen 3, don’t work that way. As Ross explained it: “With a diffusion model, you’re not having a conversation with it. Every prompt that you send, that conversation is basically a closed-off session.”
That means asking Midjourney to “change the color of the shirt in the previous image” doesn’t work the way you’d expect. The model doesn’t have a conversation memory. You’re not refining. You’re starting over.
The language that works best isn’t conversational at all. It’s more like a production brief: subject, comma, style, comma, lighting, comma, mood. “Man walking on a beach, sunset, 4K, moody.” Not how you’d talk to anyone. But exactly how you get results.
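If it helps to see that structure spelled out, here’s a minimal sketch in Python. The field names are just my own illustration, not anything the tools require; the image model only ever sees the final comma-separated string.

```python
# A minimal sketch of the "production brief" prompt structure described above.
# The field names are illustrative only; the image tool just sees the final string.

def build_prompt(subject: str, style: str, lighting: str, mood: str, quality: str = "4K") -> str:
    """Assemble a comma-separated diffusion prompt from brief-style fields."""
    return ", ".join([subject, style, lighting, mood, quality])

print(build_prompt(
    subject="man walking on a beach",
    style="cinematic photograph",
    lighting="golden hour sunset",
    mood="moody",
))
# -> man walking on a beach, cinematic photograph, golden hour sunset, moody, 4K
```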
The LLM Shortcut Most People Miss
So what do you do if you don’t have a background in photography, film, or design? How do you write a prompt that includes the kind of specific visual language these tools need?
Ross’s answer is elegant: use an LLM as your creative director.
Describe what you’re going for in plain English to ChatGPT or Gemini. Tell it you’re a complete beginner and need a Midjourney prompt that captures a specific style or mood. The LLM can translate your vague idea into the kind of detailed, technically accurate prompt a diffusion model can work with.
As Ross put it: “You now have a thousand interns at your disposal.” Or, more accurately, a super-intern who never sleeps, works free, and has absorbed the visual vocabulary of every photographer, cinematographer, and art director who ever published anything online.
If you know what you want but not how to ask for it, that intermediary step is your best friend.
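If you’d rather script that intermediary step than work in a chat window, here’s a rough sketch using the OpenAI Python SDK. The model name and the creative-director instructions are placeholders I picked for illustration; any capable LLM will do the same job.

```python
# A rough sketch of the "LLM as creative director" step, using the OpenAI Python SDK.
# The model name and the system instructions are placeholders; any capable LLM works.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

idea = "A cozy coffee shop ad photo that feels warm and a little nostalgic."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a creative director with 25 years of photography experience. "
                "Rewrite the user's idea as a single comma-separated image prompt: "
                "subject, style, lighting, mood, technical quality. Return only the prompt."
            ),
        },
        {"role": "user", "content": idea},
    ],
)

print(response.choices[0].message.content)  # paste this into Midjourney or your image tool
```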
Taste Is Still the Differentiator
Here’s the part that validated something I’d been noticing across a lot of different AI applications: the people getting the best results aren’t necessarily the most technical.
They’re the ones with taste.
As Ross said, “If you’ve got bad taste, whether you’re creating with AI or without AI, your taste is still going to be bad.” AI doesn’t give you an aesthetic sensibility. It amplifies whatever you already have.
The professionals in fashion, architecture, film, and graphic design: these are the people blowing everyone else’s results out of the water. They know how to describe quality. They know what makes a good composition, what lighting creates a specific mood, what makes a garment sit right on a body. That domain knowledge translates directly into better prompts, which translates directly into better output.
If you want to improve your AI imagery, one of the best investments isn’t learning more prompt tricks. It’s sharpening your visual literacy. Study photography. Spend time on Behance or Dribbble. Watch films with intention. Develop an eye, and the prompts will follow.
The Reverse Engineering Trick
One of my favorite techniques Ross shared is using an LLM to reverse engineer visual styles you admire.
Take an image you love. Drop it into ChatGPT or Gemini. Ask it to describe the visual style in language a diffusion model can understand, without using the original artist’s name. What comes back is a detailed breakdown of composition, color palette, lighting, texture, mood. That’s the full production brief, essentially, written in language your image tool will understand.
From there, you have a base style to build from. You refine, iterate, push it in new directions. This is, as Ross pointed out, more or less how human artists have always worked. Every musician, filmmaker, and designer has people they’ve studied and borrowed from. AI just speeds up the process of translating that influence into something new.
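That reverse-engineering request can be scripted too. Here’s a sketch using a vision-capable LLM through the OpenAI Python SDK; the reference image URL and the wording of the request are placeholders, not a fixed recipe.

```python
# A sketch of the reverse-engineering step: ask a vision-capable LLM to describe a
# reference image's style in language an image model can reuse. OpenAI Python SDK;
# the reference URL and the wording of the request are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Describe this image's visual style as a reusable, comma-separated "
                        "style description covering composition, color palette, lighting, "
                        "texture, and mood. Do not name any artist."
                    ),
                },
                {"type": "image_url", "image_url": {"url": "https://example.com/reference.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)  # your base style to iterate from
```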
Stop Blaming the Tools
There’s been a lot of noise lately about “AI slop,” the low-quality generated imagery flooding social feeds. Ross had a good take on this: “The AI slop vibe that’s going around at the moment is a result of people being excited about the technology, using it, and not iterating over it.”
The problem isn’t the tools. It’s that people are sharing their first draft.
If you’ve ever tried to write a blog post and published the version you typed in five minutes, you know what happens. The final version looks nothing like that. AI imagery is no different. The professionals you’re seeing produce stunning work? They’re going through many rounds of refinement. They’re adjusting prompts, tweaking parameters, uploading reference images, going back and forth between an LLM and a diffusion model until they get something they’re proud of.
That takes time, practice, and a certain willingness to sit with failure until it turns into something good. There’s no shortcut past that part.
What to Do Next
If you want to get better results from AI image tools, here’s where I’d start:
Get clear on which tool to use for what. Midjourney remains excellent for artistic, stylized imagery. Nano Banana (Gemini’s image model) handles text-in-image well, and its combination of LLM and diffusion model in a single interface makes iteration faster. Know the strengths of each before you start.
Learn comma-separated prompting. Stop writing sentences. Start writing production briefs. Subject, style, lighting, mood, technical quality. Practice that structure and your results will improve immediately.
Use an LLM to write your first prompt. Give ChatGPT or Gemini a plain-English description of what you’re going for and ask it to generate a detailed image prompt. Then test it, refine it, and go from there.
Develop your visual vocabulary. The more specifically you can describe what you want to see, the better your results will be. Study design, photography, film. It’s not wasted time. It’s building the creative foundation that AI will amplify.
Commit to iteration. Budget for it in your time and (if you’re using a tool with credits) your spend. The first image is a starting point, not a finished product.
Ross and his team at Zen Robot offer training programs for marketers and agencies looking to build these skills properly, from a masterclass to an in-person bootcamp to their Gen AI Academy. If your team is serious about working AI imagery into your content pipeline, it’s worth checking out what they’re doing at zenrobot.ai.
The tools are getting better every month. But the fundamentals (knowing how to communicate what you want, having a strong visual sensibility, and being willing to iterate) aren’t going away. That’s still the work, and how you differentiate your creative output from the AI slop filling the newsfeeds.
Transcript from Ross Symons’ Episode
Rich: My next guest is a leading AI content creator, animator, and educator who bridges creativity and technology. Starting his career as a web developer in advertising, he transitioned to origami art and animation, building a strong following on Instagram. His fascination with technology led him to embrace generative AI tools like Midjourney and ChatGPT, mastering their potential for creative innovation.
As co-founder of Zen Robot, he develops workshops that train marketers, agencies, and creatives to leverage AI for tasks like social media content, brand campaigns, and storyboarding. His sessions demystify AI, empowering professionals to enhance their creativity and productivity. He equips brands and agencies with the tools to thrive in a rapidly evolving digital landscape.
Today we’re going to look at how you can get better outputs from AI generated imagery with Ross Symons. Ross, welcome to the podcast.
Ross: Hey, thanks for having me, Rich.
Rich: So how did you get started working with AI generated imagery and what led you to focus on this space?
Ross: Yeah, so I come from advertising. I’m a trained web developer, so building apps and websites. And I quickly found out that working for somebody was not really going to be my jam for the foreseeable future. Fortunately, I worked that out quite early on. I’ll say early on, I’m 44 now, and I worked it out probably around my thirties. I was just like, I need to find a way to just exit from the system. I had no idea what I was going to do.
So I quit my job at Ogilvy. I had freelance work coming in and the idea was just to go try and do something by myself as a freelancer. And at the time I had started an Instagram project where I was folding a different origami figure every day. For the longest time, I’d wanted to do a 365-day project. There were quite a lot of those. This was 2014, so Instagram was just on its up. I wasn’t really much of a social media person, but I wanted a collection of all the stuff that I’d created over time, over a year. So by the end of that year, after just posting regularly, I started making stop motion animations with origami.
And I started getting approached by brands, because it was just before the whole influencer sort of wave hit as well. So brands would see opportunities with creatives like myself to make content for them and post it on their account. My account at that stage was probably about, I don’t know, a hundred thousand followers, which was quite big for Instagram at that time. And I did that for the better part of 10 years. But what I never really let go of was my attempt to try and merge technology with creativity. And I learned how to do motion graphics. And this was all by myself, I didn’t really have a team.
And in 2022, Midjourney was the first AI tool I had ever used. And when I was able to make an image using text, it blew my mind. I had really just, coming from having a computer in front of me for as long as I had, and knowing what was possible with a machine like that. And then seeing what this technology was able to allow you to do, just really just shifted something inside me. And I just really went kind of headfirst into it, trying to understand it, trying to just see if there was a way to build apps and just anything I could try and get my hands on in terms of some of this tech.
Midjourney was the best at that stage, and there were a few other tools that came out. Later that year ChatGPT arrived, that was November of 2022. Same thing, large language model. And I just tried to do as much as I could with it. And what I was doing, I assumed, was what most people were doing, which was pretty much like, AI is the future. This is obviously going to be the future of a lot of creative work. So I just thought everyone was doing that.
And about six or seven months in, I was asking people have they tried this Midjourney. And again, just assuming that they obviously had. And it was like I was talking to them about this foreign entity that didn’t exist, they’re like, “He really speaks about this AI thing. What is it?” So I realized at that time that I was maybe not ahead, but I was just looking into things that a lot of other people weren’t.
I got invited to do a stop motion animation pitch at an agency, and the pitch didn’t go that well. They were like, “Cool, don’t call us” kind of thing. And I remember, at the end, it was just one guy that I was sitting with, and I said, “Are you guys using AI in your production pipeline at all?” This was probably around halfway through 2023. And he said, “No, not really.” So I showed him a couple of things. I said, “Do you mind if I just show you a couple of the things I’d made and a few tools?” And his eyes at the end of that were like, “Dude, you need to come back. You need to come show the rest of this team all of this stuff.” So they’re like, “What would you charge for a little workshop?” I was like, cool. They’re going to pay for this. It’s amazing.
So I went back to the team. It was a room of about 10 or 15 people and I just showed them what I knew. I mean, there was no structure. There was no curriculum or anything. I was just showing them some of the things that I’d made and some of the tools that are available and what you can do with them. And at the end of that I realized, I’m like, you know what? If I tell enough people about this, I should actually be able to just kick something off. That might happen.
And yeah, that was the beginning of what ended up with me doing a lot of these workshops, getting flown around the country here in South Africa to do these seminars and workshops. Which were just really inspirational sessions, guiding creative teams on: this is the landscape, this is the difference between a large language model and a diffusion model, this is how I use them together, this is how you make images and video.
My ex-boss, when I was working in advertising, him and I just kept in touch and he approached me and said, “Look, I’ve got a few clients that might be interested in this whole AI thing, and your discovery session or your little inspiration session might be worth exploring if you don’t mind me introducing you to some of them. So let’s do that.” And at the end of that, he said, “Look, there’s something here. We are ahead, you clearly have an understanding of it.”
I was growing a bit of a following on LinkedIn and he just said, let’s just formalize this. So Zen Robot was born, that was at the end of 2024. And essentially we had two pillars to the business side, which is content production using generative AI in the form of animation and video, anything you really need, banner ads, that sort of thing. As well as the training. And the training, we have a few products. We have a masterclass, a bootcamp, and the Gen AI Academy, and that keeps us pretty busy with the growing community. The Gen AI Academy is something that we’ve launched at the beginning of this year.
Rich: Sounds like you’ve got your hands full. Before we jump into some of the AI stuff, just a quick question. Origami, was that something that you had always done or was that just something where you’re like, I need something to do for 365 days, origami seems interesting?
Ross: Yeah, almost exactly that. I think I could only fold the crane, which is the standard little bird that most people who have dabbled in origami could do. And I saw there were a lot of other models and a lot of books and tutorials that I could find on YouTube and wherever I could find them. I was just obsessed.
But I think it was my creative nature just really clawing its way out of the corporate world. It was just, at the time, I didn’t think, okay, I’m going to start this project and finish it by the end of the year and grow a following. I mean, there was nothing like that on my radar at all. But I saw the opportunity and I realized, okay, no one else is doing this, so let’s just do that. And I think subsequently, what I learned in terms of content production, putting content out regularly on social media, commenting, getting a community or getting involved with the community definitely aided me in what I’m doing now.
Which is, again, I did the same thing on LinkedIn, which was connected with people. I found people that were doing similar work to me, posted regularly, found a kind of niche in the teaching space. And now very fortunate enough to be teaching creatives all over the world how to use AI.
Rich: Nice. So when people hear AI generated images, they often think about tools like you mentioned, Midjourney or DALL-E. How do you define what the space is really like for creatives today?
Ross: Look, I think that Midjourney was the best from the beginning. There were other tools that were a little bit more technical. There was something called Stable Diffusion, which was a very technical interface, but it was able to produce really good results.
But Midjourney, for whatever reason, just although you were using it through Discord, I mean, this was right at the beginning, you had to use special commands like forward slash imagine or forward slash blend. So it was quite techy. It was quite, you know, not to say that you had to be technical to use it, but I think it just made sense for people that were slightly more technical and using digital interfaces for them to engage with these tools.
So Midjourney and DALL-E, when DALL-E came out it was kind of like, that was ChatGPT’s first image generation tool, and it wasn’t great. And I think that a lot of people just thought, well, you know, that’s the extent to which you could use a diffusion model, which is what an image generation model is called. And you know, how it’s changed going forward is just that there hasn’t been this massive eruption of these tools. And I think purely because you have to train the models and the best of the models are already out there.
So for you to catch up to that would require, firstly, a lot of training material, and I don’t want to get into that sort of realm because there’s a lot of debate as to how these models have been trained and where they get the content from, which is understandable. But if you’re starting out now and creating an image model, I think you will be left in the dirt. There’s no way you’re going to catch up to what’s happening now.
But for the most part, Midjourney is still one of the best. Nano Banana being Google Gemini’s image model is really, that is phenomenal. I think the big difference there is it’s almost a combination of a large language model as well as a diffusion model blended into one. So you can generate an image, but then you can edit the image using text. So if you take a snapshot of me sitting over here and you say, “place a cowboy hat on top of his head”, without having to use software like Photoshop or Canva, you can place a hat on my head within a still shot.
Which again, that’s just not the type of technology that we had two years ago, and now it’s really just dominating and very, very powerful.
Rich: So you used the phrase ‘diffusion model’, comparing it to kind of an LLM. What is a diffusion model exactly?
Ross: That’s a good question.
Rich: Perhaps the layman’s definition of a diffusion model.
Ross: Yeah. A diffusion model is basically an application or a program or an AI application that allows you to type text into a prompt box or just a little input box. And whatever you’ve written in the prompt, it’ll then generate an image as close as possible to what the prompt is.
So if you type a man walking on a beach, it then uses a process called ‘diffusion’. The diffusion process basically just takes the image and it blasts a whole bunch of noise at it. So if you’ve ever used Photoshop and you take the grain or the noise generator, if you crank that all the way up – exactly how the technology works, I will not be able to tell you – but it basically just reverses that process. So it blasts noise and it reverses back to what the prompt is.
So it starts with what is called a seed. That seed then generates this massive block of noise. And then slowly but surely, it’s got these two little kind of, think of it as a creative director and a painter, they’re having a conversation based on what the prompt is. Which is essentially the brief, and they’re like, how close are we getting to this image?
So the brief comes in, two men are walking on a beach, and these two start talking to each other. Looking at the image, are we close? Are we further away, close, further away? How do we guide it? And yeah, I’m kind of butchering what very amazing smart people have done to eventually create, what in my opinion is, magic. Like how this has been worked out is just absolutely crazy to me.
Rich: Ross, you emphasize teaching fundamentals over tools, even though we have been talking about tools. Why is that distinction important when it comes to AI generated visuals?
Ross: I think that there is a fundamental understanding that you have to have between the difference between a large language model and a diffusion model. Now, those sound like they’re two very new terms which didn’t exist sort of five years ago. But I think understanding that the difference between the two, and also that there is a different way of engaging with the two, there maybe not engaging, but there’s a different way of giving instructions to the two.
So for example, a large language model like ChatGPT or Gemini, you’ll jump on and start speaking to it with very natural language. You’ll say to it, “Hey Gem, or Hey Chat”, you speak to it in very clear English or whatever your native tongue is. So you’ll tell it to perform something, and you’ll write down all your thoughts and then hit the ‘go’ button and it’ll respond with something that looks like a script or a response.
With the diffusion model, the difference is you’re not having a conversation with it. Every prompt that you send it, basically that prompt goes in and when it develops the image, what comes out is a closed-off session. You don’t then start a conversation with it, particularly in Midjourney’s case. So when you say “a man walking on a beach” and you hit ‘run’, it develops four images. But with those four images that come out, outside of downloading the image or editing it in a different application, you don’t have much control over what that image looks like past that point. So it’s a closed-off conversation.
And I think that is what a lot of people struggle with. And I know this because every time I train people, I see that they say to Midjourney, for example, or whichever image tool they’re using, please change the color of the person’s shirt to x. What they don’t understand, and I think it’s normal for them to do this because they’ve been using ChatGPT as probably the first AI tool that most people use, or a large language model. So now jumping into the diffusion model space, you’re actually directing using tokens or key phrases instead of using natural language to explain what it is you want on the screen.
So when you say, “man walking on the beach, sunset, 4K, moody”, for example, that is not how you speak to a human. That’s not how you type anything via email or telegram to anybody. But that is how you get the best results when working with a diffusion model. And I think that’s just one of the fundamentals that I think people struggle with. But once they click, okay, cool. So I’m engaging slightly differently with this entity compared to a large language model.
Rich: So you touched on one of the key fundamentals right there. I’m wondering what are some of the other key fundamentals we need to understand before we can start getting some real value from these AI image tools?
Because we’ve all seen AI images that we can’t tell from reality, and we’ve all seen ones that get labeled as AI slop, that as soon as we see them, we know that they were generated by AI.
Ross: Yeah, so I think that it comes with practice. It’s really knowing the limitations of the tools. So for example, Midjourney is not amazing at putting text into an image, but Nano Banana is excellent at doing that. So that is one of many examples of just knowing which tool to use.
Also, the prompt structure I think is quite important. I’ll be honest, most of the tools as they’ve matured, you can type a very simple prompt. You can say “a portrait of a woman or a detailed photograph of a woman standing in a coffee shop” and you will not be able to tell. Well, the average person would not be able to tell whether that is a photograph or it is a generated image. And that is just what would naturally have happened with all of these tools. They would have got better to the point of trying to achieve as much realism as possible.
But if I wanted something that was illustrative, I would then add that into the prompt. Instead of saying “high quality photo”, I would say “illustration or retro style illustration”, using specific colors to develop that image. So I think that those sort of nuances and knowing what sort of commands to use with practice, and obviously also knowing the interface. Because they are very standard, basic sort of user interfaces. But knowing what certain buttons do and what parameters do, it just comes down to practice really.
But I think that most people, at the moment, you can make an image in ChatGPT, you can make one in Gemini, in Midjourney, all of them are going to come out really good. But where the trick comes in, where it becomes very difficult, is someone says, “Cool, we love this image. Can you take that character you’ve just created and now put them into a different scenario? Can you change their shirt? Can you change the sunglasses that they’re wearing? Can you put them into a car?” And moving from that to the sort of storyboarding phase, if you want to call it that, I think that is where it becomes abundantly clear that people don’t know that next sort of step and how to approach that. And that’s just one of the things that we tackle as teachers.
Rich: So would you say that some of the problems or some of the challenges that the average user has right now is the fact that they’re treating these diffusion models like LLMs? And obviously the models seem to be advancing, so maybe that won’t be a problem in the future. It’s not so much a problem with Nano Banana, but it’s about understanding the types of commands that these diffusion models, these AI image generators need. Is that going to help us get to that next level?
Ross: It’s possible. I think that how intuitive the tools become is definitely the next, well, I think it’s probably been like that from the beginning, but they will become easier to use. My only, I’m not saying it will happen, but my guess is that you’re going to get a lot of the content that looks quite the same.
And a small example: someone I saw recently posted something about how Nano Banana uses the same face if you don’t physically describe what the face needs to be like. So it’s a generic face, and he pointed out, look at what the eyes look like. And this was him creating images over the course of a week with different prompts but keeping “the man’s face”, for example, which is quite a generic prompt. As opposed to describing: the man has a long, gray beard, he’s in his fifties, he has rings on his forehead, and has a receding hairline, or whatever the case is. Those are the details that I don’t think people know to add in order to get a different style of image.
But going forward, that is going to be left for those who are willing and curious enough to want to try and know how to create better imagery. But for the most part, I think there is going to be a generic, kind of standard AI look that maybe on close inspection you’d be able to see. But the tools will get better. I think a lot of these platforms and a lot of the creators are selling this whole dream of, it only took me two seconds to make this entire storyboard, and it took me 30 minutes to make a Hollywood movie. And it’s just hype because I don’t think that it did.
And could you do this with one prompt? Maybe you could, but if you really pay a lot of attention to the quality of the content that’s come out, it’s not amazing. The amazing and very good quality content that is coming out comes from people that understand a little bit about film. Maybe not a little bit, but quite a bit about film, about photography, about maybe animation, technical understanding of lighting, environment, set design. Maybe some fashion people know what looks good and how a garment looks on someone. Being able to describe that is the skill, and I think that is something that comes from having a deep understanding of a specific field of expertise.
For example, graphic design or fashion or architecture. They know what makes good fashion, architecture, and graphic design, but the average person would just type in, “create me a poster” or “make me a cool dress that has blue sequins on it.” I think that’s the fundamental difference between what is going to be good quality content and bad.
Rich: And that’s a really interesting point, Ross, because I have seen this across a number of different disciplines, where anybody with very little skill can create something that is okay using AI, but it’s the people who have years of experience, or a specific perspective on whatever it is that they’re good at, who can suddenly use AI to levels that no one else is touching.
And so I think that that’s one of the things you’re talking about. The people who are going to get the most out of it, at least right now, have a background in visuals, in design, and know how to say something that is very specific, rather than “man walking down a beach”.
Ross: Exactly. No, you’re totally right. And what’s interesting about most of the people that we have taught is the people that have been in industry for, let’s say 10 years plus, and it doesn’t matter what industry, let’s just use creative visual industry.
So anyone in film, animation, graphic design, art direction, those are the people that once they see what is possible with the tools, they realize and they’re like, ah, I already know what I need to know in order to make a good piece of graphic design or a good piece of fashion video, whatever the case is. And you’re exactly right. It comes down to taste. And if you’ve got bad taste, whether you’re creating with AI or without AI, your taste is still going to be bad.
So I think it’s about knowing what it is and using AI to enhance that, and just speed things up a little bit.
Rich: Yeah. You’ve talked about the difference between describing what you want versus describing what something looks like. Can you walk us through that shift? Because I think that’s really valuable for people who maybe don’t have as much experience with the visuals as you do but are trying to get better images out of these tools.
Ross: Can you repeat the question?
Rich: Yeah. You kind of talked a little bit about is there a difference between describing what you want, like a man walking down the beach, versus describing what something looks like. You’ve talked about the generic faces that AI will do if you don’t give it more specifics. Is there something in the difference there?
Ross: Yes, absolutely. So I think that describing what it is you want is like the subject. So a man walking on a beach, it will return four images of a man walking on a beach. The beach is going to be very generic, so is the man, so is what he is wearing. But the more you’re able to describe to the model what it is you want to see in different areas of the image itself.
So if you say, “a perfectly symmetrical shot of a man with a blue shirt, wearing a cowboy hat and red socks”, the more detail you go into it, the closer you are going to get to the image in your head that you actually want.
And I think this is where it becomes a little bit frustrating for people because they’re like, but why is it not listening to my prompt? The reality is, it’s listening to exactly what the prompt is. The difference is you don’t know how to use a specific language in order to get it to create the image that you want.
And this is where I think a lot of people don’t really think about it until they are shown, but you don’t need to know all of those details because you can use a large language model, which has an infinite amount of knowledge about every field of expertise that has ever been out there. So if you tell it, “you’re a creative director with 25 years’ experience in photography, write me a Midjourney prompt that creates an image that looks like X,” with X explained in your basic English or your basic language, it will return a prompt in exactly the right format so that Midjourney will then be able to produce something. So that is how you get from what it is you are asking it to produce to what it is you see in your head, by using multiple tools.
Rich: That’s interesting. So I don’t have any experience in the design world, but I can hire somebody for a specific project, bring them in and all their expertise. And using the LLM is an intermediary step. I’m not going to have as much skill as perhaps you would because of all your years of experience, but I am tapping into the LLM’s language capabilities so that it can give me the words to describe to Midjourney or Nano or whatever my tool would be, and I’m going to get better results that way by having that interim step of the LLM creating a richer, more detailed prompt than I would be able to generate on my own. Is that what I’m hearing from you?
Ross: Exactly that. Someone pointed out to me the other day, just as an analogy, they said, “With a large language model, you now have a thousand interns at your disposal.” And I love that phrase, because that’s exactly, to your point, exactly what it is.
You now have someone who can research, I mean, I say ‘intern’, I mean interns are not paid a lot of money and still take a lot longer to do the work. And the work’s not great. It’s almost like you have a super intern that is way smarter than you, way faster than you, and can just produce at a scale and nonstop for 24 hours a day.
But if you know how to ask that intern the right questions, and this comes back to just being human, right? I mean, you can have a creative director who is okay at the job if they don’t know how to explain to the intern or the sort of mid or beginner level staff member that this is what they want to see.
This person is going to go away and come back and show them something, and they’re like, yeah, but you’re not listening to what I’m saying. And the reality is, maybe you don’t know how to explain what you want. And I think that is exactly where it is.
And not to say that copywriters or someone who’s an author or reads more is going to be better at this. On the contrary, actually. Because it’s just people who are able to break down what it is that they want to see on the screen in a format that the diffusion model or large language model – in this case, Nano Banana – is able to translate into an image that best suits the prompt itself.
Rich: That definitely makes a lot of sense. Although as you’re talking, I’m not trying to put you on the spot here, but it feels like a lot of these tools like Midjourney and Nano Banana, and I’m sure the others, I could upload an image – whether I created the image, somebody else created the image – and say, “this is the style I want”, or “this is what I want the character to look like”.
Is that a shortcut to some of the experience we’re talking about, or do you still feel that even with some of these new, built-in tools, do you feel like that’s a way that we can skip over the step, or do you feel that the expertise is still a critical component in terms of getting the results you’re looking for?
Ross: I think the expertise comes down to how well you’re able to use a large language model to reverse engineer. I haven’t said this a lot, but I think AI’s main function or its main strength is reverse engineering.
So I know a little bit about music, a little bit about art, a little bit about film, but not enough to go into any of those disciplines and become a very good artist. And what I do know is I can take a piece of art or a piece of music, I can give it to a large language model, and I can say, “break down what makes this a good song.” “What makes this a good design?” “What makes this a good painting?” And break it down into tokens or phrases so that a diffusion model would be able to replicate that style.
I mean, this is essentially exactly how every artist, be it a musician or even an actor, they’ve watched people that have done things before them. That is their training data. They’ve then taken that training data, maybe not in the same mechanical or computer driven way that an AI tool would do it. But taking all that data in, analyzing it, and over time learning and crafting, but essentially copying what a certain person or an artist that they admired and enjoyed their work, they try to emulate that in some way. I mean, there’s that book, Steal Like an Artist, which is, you’ve got to know how to hide your sources.
And if you look at anyone’s work, this is even before AI, you can see where people have taken certain things from other people. I’ve heard Eric Clapton say on numerous occasions, he stole this from this person, this from this person. Dave Grohl said, “I got this riff from that person.” And I think now with AI, it’s just really speeding that process up.
So I think understanding that if you know how to reverse engineer something and you can do that by dropping it into, it’s not as simple as just dropping it in and saying, “Please reverse engineer.” What makes this a great thing? It’s iterative. It takes time because the model can easily break down what the physical… so if you take an image of Salvador Dali, you drop it in and you say, “Please describe what this image looks like and give me a style without using the artist’s name.” Okay, it’s going to say ‘surrealist painting’ something, something.
I seriously lack the vocabulary to fully describe what might be in that prompt, but it’ll break it down for you in a way that you would never be able to do. Maybe you would over time, but now you’ve got a base in a text format. Which is very important because text is what the large language model and diffusion model understands. You have this base style written out in very clear, descriptive language that you can now use as your base to start developing more images.
And then what happens is you take that style and you say, a man walking on a beach in this specific style. You drop that whole style in there. And then you make an image and you’re like, okay, is that further away or closer to what it is I had in my head? Is it very Salvador Dali or is it kind of straying a little bit? And this is where the taste and creative direction comes in, because you are still the person, you’re still the entity asking the LLM, or asking one of your thousand interns, what can I do to change this or make it a little bit more x, make it a little bit less surrealistic, make it slightly more bold, change the colors to this, and then it will… this is me now, hypothetically speaking to a large language model, saying, cool, do these things to the prompt so that when I put it back into the diffusion model, Midjourney or… also, what I’m speaking about here all happens inside Nano Banana by itself. You don’t have to leave Nano Banana or leave Gemini to work within Nano Banana, which makes it a very powerful technology.
But essentially, that is the process and that is where it becomes iterative for you to take one piece of art off the internet. You just have one image, drop it into a large language model and tell it, “Please make me 10 images that are kind of in the style of this.” There’s not enough creative direction there. It’ll create something, but then you have to tell it, okay, that’s not quite what I was looking for. Go in this direction, change this, make the colors less that. And this is where your understanding of that field is very important.
Rich: So I want to bring up a couple of points in your last comment that really struck me. One is the whole idea of reverse engineering, and I completely agree with you on this. And one of the things I really liked about Midjourney when it was in Discord was that you could see what other people were using for prompts. And whether you’re reverse engineering an image or you’re looking at other people’s prompts, it starts to give you the language that you may be lacking in how to describe an image accurately to, like you said, get the image that’s in your head out onto the screen. So that’s just a great learning opportunity right there.
And the other thing you mentioned and said at the end is iteration. This is not a one shot and you’re done sort of thing, and that the professionals may need to go back multiple times to really hone in what that character, what the scene is going to look like. Would you agree with that?
Ross: Yeah, absolutely. And it is about iteration. And I think that it’s for that reason that there’s a lot of negativity around AI at the moment. And yeah, I mean, whether it stays and how long it’s going to stay for is all up for debate. But I think that what happens, people that are willing to iterate, people that are willing to sit with the problem and be curious and try and use the tool to get… it’s like I didn’t get it on the first go. I didn’t get it on the second go. My credit’s running out, but I’m going to buy more and I’m going to work at it until I get the thing right.
When they produce something that they know, okay, cool, I was able to produce this image or video. And then they have the guts, let’s call it that, to put it out on social media. That is what you’re seeing as the final piece. Someone else sees it, they’re like, oh, they used AI to do it. They try it, they try once and there’s no iteration, there’s no understanding of what the tools do, there’s no understanding of prompt structure or the difference between a large language model and a diffusion model. It produces something and they’re like, eh, this is not great. And then that’s where it ends.
And then that becomes a lot easier for people to hate on the technology, because they aren’t able to use it in the way that someone who’s willing and curious enough to sit with it, and also pay the price, can. Which a lot of the time comes down to credits, not thousands of dollars, but you know, it’s money. It costs money and time. And that, I think, is what’s causing this massive divide at the moment. That’s just the nature of a new tool, right? It looks like it’s easy to use, you try to use it, you can’t. Are you going to learn how to use it, or are you just going to decide that it’s not for you? And you either hate on it or just let it be.
Rich: Ross, I want to wrap up with what is one truth that’s being floated around about AI image generation that you feel is just complete bunk.
Ross: Oh, that it’s easy.
Rich: Fair enough.
Ross: It’s that everyone thinks, oh, it’s so easy to do X. If it was so easy, everyone would be doing it. And the reality is the AI slop sort of hashtag, or just the AI slop vibe that’s going around at the moment, is a result of people being excited about the technology, using it, and not iterating over it.
They are sending out the first image or video that gets produced. And I don’t want to knock anyone doing that because I think it’s important that we go through this phase of exploration and we go through this phase of creating. And yes, there is going to be a lot of content that comes out, but you need to just consider that if you’ve never done something like this before, chances are, you know, the quality of the content that you’re putting out is not going to be good. And you need to just be kind of humble and look at it and go, okay, well maybe you can’t see the difference and maybe you’re just like, well, I’m okay with what this looks like.
But I do believe that if you do, if you just keep at it and also try and understand that yes, it’s easy to write a prompt, but it’s not easy to generate an image that is compelling because it’s not just one prompt. It might be multiple prompts with image references and reverse engineering styles and environments. And that’s what it takes. So I think the fact that people are saying that AI is easy, I think is very incorrect.
Rich: Alright. Ross, this has been great. If people want to learn more about you or they’re interested in any of your classes or courses, where can we send them online?
Ross: Yeah, so zenrobot.ai, that is our web address that should link you to some of the programs that we have. We have a masterclass, a bootcamp, which is an in-person training seminar, as well as our Gen AI Academy.
But reaching out to me directly on LinkedIn is also a safe option. I’m Ross Symons, S-Y-M-O-N-S, and my email address, do you mind if I leave my email address? Is that okay?
Rich: If it’s okay with you, it’s okay with us.
Ross: Yeah, yeah, ross@zenrobot.ai, that’s where you can contact me.
Rich: Awesome. And we’ll have all those links in the show notes. Ross, this has been really helpful. Thank you so much for your time today.
Ross: Awesome. Thank you, Rich.
Show Notes:
Ross Symons and his team at Zen Robot are focused on helping marketers and creatives harness AI for content, campaigns, and storytelling. With a background in advertising, animation, and digital development, Ross blends creativity with emerging technology to deliver practical, hands-on training. Check out zenrobot.ai for resources and workshops, and connect with Ross on LinkedIn for insights on AI.
Rich Brooks is the President of flyte new media, a web design & digital marketing agency in Portland, Maine, and founder of the Agents of Change. He’s passionate about helping small businesses grow online and has put his nearly 30 years of experience into the book, The Lead Machine: The Small Business Guide to Digital Marketing.