How to Build AI Products That Don’t Flop

So Devin, the AI programmer, is a lie. And the AI pin is the worst product Marques has ever reviewed.

Image with a heading that says "don't end up like this" and two video screenshots. One is about Devin being debunked, and the other is about the AI pin being the worst product ever reviewed.

AI product demos are faking it and flopping left and right. So how can you build an AI product that doesn't flop?

We've had great success lately building AI products that get actual users and make actual revenue.

Let me show you what worked (and didn't work) for us that could apply to building new products or even just new features within an existing company.

What not to do

An image with a line that reads "thin GPT wrapper" with a red X-mark by it. The second line says "real tech" and has a green checkmark next to it.

The first one, for the love of goodness, is don't just build a thin GPT wrapper. Build something real.

First, let's not forget — generative AI is, at best, 95% accurate. At worst, it's completely wrong and hallucinating.

But even worse than that, it's great at tricking you into thinking it's right when it's not.

Image with a caption that says "Gen AI is, at best, 95% accurate" with a note on a ChatGPT code snippet that says "doesn't exist". This image shows how AI can confidently generate nonsensical code.

The anti-pattern to avoid is making an LLM the brains of your product.

If it's the core thing that's actually providing the value, and you just wrap some UI on top, you're in great danger of building something bad that's error-prone or too opaque and hard to customize.

Image with a header that says "don't be a thin LLM wrapper". Beneath that is a diagram. On the left is a red X above the words "your product". A vertical line sits just to the right of that, and on the right side of the line is a thin box with the words "your thin wrapper" in it, and underneath that a larger box that says "OpenAI".

Or if you're lucky enough to find something that works, anybody can easily compete with you, and you have no differentiation.

And OpenAI can release what you have as a feature anytime. If not OpenAI, then some other startup entirely.

This graphic has a title of "LLMs should not be the brain of your product". Then there are three items, each preceded by a red X-mark. The first is "prone to errors". The second is "opaque and hard to customize". The third is "easy to replicate".

Hint: Devin and the software in the AI pin are thin LLM wrappers. Don't be like them.

How to build AI products right

Instead, AI needs to be more like the icing on the cake: you build some real technology, and AI becomes some essential glue. If it fails, it doesn't make or break the product, but when it succeeds, it really adds value for your users in a unique and powerful way.

The title of this image is AI should be the icing on the cake. On the left is a green checkmark above the words your product, the vertical line to the right of your product and then two rectangles one is very small and has the words a sprinkle of AI in it, and beneath that one is another larger rectangle that says real tech.

A real world example: building a code automation product

Let me stop talking in vague and abstract terms and give some real-world examples from our own learnings. For our use case, I want to address code automation: how can we use AI to make coding faster and easier?

By removing tedious and redundant work, developers can move faster and enjoy what they're doing. And if we do this well enough, we can set our goal to help people who aren't developers at all have some degree of coding superpowers.

The title of this image is "wrong approach: Devin". Then there is a screenshot from a Devin video, and underneath the screenshot are three lines, each beginning with a red X-mark. The first says "solve all at once", the second says "lie", and the third says "raise mucho $$$$".

Let's start with the wrong way to solve this problem. I would not recommend doing what Devin did, where they're trying to act like they've solved everything at once — the AI can do everything programmers can do and be paid for it.

They gave very misleading information and demos that were full of lies, and used that to raise a ridiculous amount of VC money.

Even if we ignore the problems of lies and deceit, they're still making one fundamental mistake: they're trying to boil the ocean. They're trying to make an everything AI. Everything products are pretty much never a good idea.

Image that says "don't boil the ocean" with a line that goes from specialized to generalized, and an arrow closer to the specialized end of the spectrum.

The better approach is to start specialized. Really nail some group of people's problem in a consistent and well-executed way.

And then incrementally expand your solution to be more generalized, to appeal to more people.

Image that has a header that says "start simple". Then there is a line representing the spectrum from a junior dev to a senior dev. The arrow points at the end near the junior dev.

And for our case, when talking about automating development, we want to start simple.

Let's not start with the most advanced problems in computer science. Let's instead start with the tedious work that you would normally have a junior developer do.

So let's make a list of what junior devs tend to do — writing tests, bug fixes, updating documentation, HTML and CSS, and turning designs into code.

This image has a header that reads "junior dev jobs". Then it lists a number of tasks: test writing, bug fixing, HTML/CSS, converting designs to code, manual QA, and doc updates.

This can give us a sense of what types of problems we might want to solve. But then the question is, where are you strong?

If you're an existing business, what does your product do? Or what customers do you serve?

In our case with Builder.io, we already make a product where you can drag and drop with your React or other framework components, so people like marketers, designers, and product managers can create pages and other things within your existing apps.

And in our case, one of the most common things we noticed people doing is always reproducing Figma designs in Builder.io by hand.

Wouldn't it be so much nicer if they could just hit a button in Figma and pull the design into Builder.io where they can publish the update to the existing app via our API, or grab code they can copy and paste into the codebase?

But then the question becomes where to start, because LLMs are not good at everything. In fact, they're bad at a lot of things, if not most things. And we don't want to build a product that only works sometimes. We need our product to work pretty much every time.

Here's a short list of things that I have found LLMs to be pretty reliable at — consistently being at least pretty good at these types of tasks.

The header of this image is "where LLMs are generally reliable". Beneath that is a list of three items, each with a checkmark before it. The first is summarizing text and code. The second is tweaking text and code, and the third is inline suggesting of text and code.

You might say, oh, well, you could fine-tune an LLM to fill in some gaps. In our experience, don't count on that. Fine-tuning doesn't help very much, and you definitely won't teach an LLM to do something new that it's not already pretty good at.

So we went about trying to find out how we could use an LLM to take a design and output code. And our results were utter garbage. It was very, very, very bad at this.

The header reads "LLM didn't help", followed by a red X. Beneath the red X is a flow from design to OpenAI to output code, and at the bottom is the phrase "produced garbage".

We tried every technique under the sun, from fine-tuning to prompt engineering to whatever else you'd imagine, and nothing worked even a little bit.

So, what we did instead is we trained our own models.

We started by trying to solve the problem with no AI at all, by finding out how far pure code could get us, and when we found specific problems that were very hard to code by hand, we started training specialized models geared exactly at that problem.
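To make that "code first, model only where code struggles" split a bit more concrete, here is a minimal TypeScript sketch. Everything in it (the `DesignNode` shape, `layoutByRules`, the `LayoutModel` type) is hypothetical and invented for illustration. It is not our actual implementation, just one way the idea could look.

```typescript
// Hypothetical, simplified design node: a box with a position and children.
interface DesignNode {
  name: string;
  x: number;
  y: number;
  children: DesignNode[];
}

type Layout = "row" | "column";

// A specialized model is modeled here as a plain function. In practice it
// might be a small trained classifier; this type is an assumption.
type LayoutModel = (node: DesignNode) => Layout;

// Plain code handles the clear-cut cases and returns null when unsure.
function layoutByRules(node: DesignNode): Layout | null {
  if (node.children.length < 2) return "column";
  const xs = new Set(node.children.map((c) => c.x));
  const ys = new Set(node.children.map((c) => c.y));
  if (ys.size === 1 && xs.size > 1) return "row"; // all on one horizontal line
  if (xs.size === 1 && ys.size > 1) return "column"; // all on one vertical line
  return null; // ambiguous: positions don't line up cleanly
}

// Rules first. The model is consulted only for cases the rules can't decide.
function detectLayout(node: DesignNode, model: LayoutModel): Layout {
  return layoutByRules(node) ?? model(node);
}
```

The point of the shape is that the model's job stays narrow: it answers one well-defined question, and only when hand-written code has already given up.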

The header reads "training our own models did", with a green checkmark beneath it. Beneath the green checkmark is a diagram that shows the flow from design to code, from code to model, from model to code, and then to output code.

For more depth on that topic, I have two detailed posts on this already: One about our overall approach to AI, and how we learned not to use LLMs for everything and train our own models, and another one on how exactly to train your own models.

In many cases, it's not nearly as hard as it might seem. Now in our case, we could turn designs into output code very reliably, very quickly, very cheaply, and we could customize anything based on any feedback we got.

We only had one remaining challenge that was hard to train our own models for, which was taking our default output code and cleaning it up to be better named, better structured, and generally refactored.

But if you remember, tweaking and modifying code is something LLMs are actually pretty good at. And so this became our final step — taking the output code we generated with our own code and models, and then cleaning it up.

The header reads "refined with an LLM". Beneath that is a diagram that starts with output code, then goes to Anthropic, then goes to final code. At the bottom of the diagram is a green checkmark in front of the phrase "final code cleanup".

At this step, we tried tons of LLMs, and for our use case, Anthropic was the best by a pretty good margin.

And this landed us with a pretty robust toolchain — we had coded our own models and used an LLM for the icing on top, and the whole system worked quite well.
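Here is a rough sketch of what "LLM as the icing" can look like in code. The names `Refiner` and `withOptionalCleanup` are made up for this example, and the LLM call is just an injected function, so no particular vendor SDK is assumed. The idea is only that the deterministic output is always available, and the LLM step can fail without taking the product down with it.

```typescript
// A refiner takes generated code and returns a cleaned-up version.
// In a real system this would call an LLM API; here it is an injected
// async function (an assumption made to keep the sketch self-contained).
type Refiner = (code: string) => Promise<string>;

// Returns the refined code when the LLM step succeeds, otherwise the
// original generated code. The product still works if the icing fails.
async function withOptionalCleanup(
  generatedCode: string,
  refine: Refiner,
): Promise<string> {
  try {
    const refined = await refine(generatedCode);
    // Treat an empty answer as a failure rather than shipping nothing.
    return refined.trim().length > 0 ? refined : generatedCode;
  } catch {
    // LLM error (network, rate limit, etc.): fall back to our own output.
    return generatedCode;
  }
}
```

A real version would probably also want a timeout and some validation that the refined code still parses, but the fallback shape stays the same.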

But then, funny enough, soon after multimodal LLMs came out, a new wave of thin wrappers was released: people wrapping GPT-4 Vision to upload a screenshot and get code out the other side.

And these at first glance actually looked pretty good. Sometimes the output went haywire and sometimes it was not bad. But they were all terrible in one key way.

The header reads "revenge of the thin wrappers" with a line underneath it. Then there is the phrase "screenshot to code, prompt to code". Then there is a screenshot from GitHub of the README of screenshot to code.

And this is why you have to remember to always get customer feedback. We put our AI toolchain in people's hands, and we learned one critical thing that people need that no LLM wrapper is going to easily be able to do. People don't want generic output code.

Image of the phrase, get customer feedback with lines, extending from it like radiating light.

They don't want every design or screenshot to generate new code every time. That's kind of ridiculous.

Even if the code were high quality, if every time you took a design and made new code and never reused existing components or existing code, your codebase would get so bloated and unmanageable so fast.

The header reads "people don't want this". Then there is the question "where are my components?" next to a lot of code that is hard to understand, with arrows that seem to point out the components that are hard to decipher among all the code.

And every one of these tools operated that way, including ours.

Tools like these need to understand your existing components, your existing design systems, the libraries you use, and generate code like what you would actually write.

Importing those components and using them just like you. Anything else is really just a cool demo and not that useful.

Header reads "they want this". Then there is a phrase that says "reuse what I have" with two arrows. The first arrow points to an import of a Hero component. The second arrow points to that Hero component being used in the code.

So the question comes up, how do you teach an LLM about your design system? And more importantly, we don't want an LLM to just guess at our design system. We need true determinism here.

We need to make sure when a design is designed one way, it always turns into components in your code in a very specific way. We don't want random, unexpected outputs throughout the system. So, how do you make an LLM deterministic?

A thinking emoji is featured in the center of this image. At the top is a question that reads "how do you teach an LLM about your design system?" The question at the bottom reads "how do you make an LLM deterministic?"

And the answer is...you don't.

And thank goodness we were not a thin LLM wrapper. Remember, you don't want the LLM to be the brains of your product. Because when you need to add some new feature that LLMs aren't good at, you're left without options.

But in our case, our models can understand your design system. We built the tooling to scan and find the components in your code, map them to components you have in Figma, and give you deterministic mapping functions that we generate and you can edit.

So we added new models to the chain and made a deterministic system to turn designs using design systems into code exactly as you would have written.

Well, at least close enough that with a few edits, you're pretty much done.
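To give a feel for what a deterministic mapping function might look like, here is a small TypeScript sketch. The `DesignInstance` shape, the `mappings` registry, the `toCode` helper, and the `@/components/Hero` import path are all hypothetical names chosen for this example, not our real generated output. What matters is the property: a mapping is a pure function, so the same design input always produces the same code.

```typescript
// Hypothetical shape of a component instance coming from a design file.
interface DesignInstance {
  componentName: string;
  properties: { [key: string]: string | undefined };
}

interface MappedCode {
  importLine: string;
  jsx: string;
}

// A mapping is a pure function: no randomness, no network calls.
type Mapper = (instance: DesignInstance) => MappedCode;

// Registry keyed by design component name. In the workflow described above,
// entries like this would be generated and then edited by hand.
const mappings = new Map<string, Mapper>([
  [
    "Hero",
    (i) => ({
      importLine: `import { Hero } from "@/components/Hero";`,
      // JSON.stringify keeps quotes in the title from breaking the JSX.
      jsx: `<Hero title={${JSON.stringify(i.properties.title ?? "")}} />`,
    }),
  ],
]);

function toCode(instance: DesignInstance): string {
  const mapper = mappings.get(instance.componentName);
  if (!mapper) {
    // No mapping registered: say so explicitly instead of guessing.
    throw new Error(`No mapping registered for "${instance.componentName}"`);
  }
  const { importLine, jsx } = mapper(instance);
  return `${importLine}\n\nexport default function Page() {\n  return ${jsx};\n}\n`;
}
```

Because an unmapped component throws instead of falling back to a guess, surprises show up as explicit errors you can fix by adding or editing a mapping, not as random output.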

And while the thin wrappers can continue to bang their heads against trying to figure out how to offer a similar feature, we can keep taking new feedback and improving our product at a much faster rate than they can.

Conclusion

"we always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next 10." —Bill Gates

Bill Gates has a very good quote about this — that we always overestimate the change that will occur in the next two years, and underestimate the change that will occur in the next 10.

AI is phenomenal technology, but LLMs are not going to solve all of your problems tomorrow. What you need is an incremental approach.

The header reads "right approach? Incremental". Then there are two boxes: a larger one with the word "code" written in it, and beside it a smaller one with the letters "AI" in it. Below the two boxes is an arrow with the word "time" next to it. The arrow points to another pair of boxes, where "code" is now in the small box and "AI" is in the larger box.

Today you need mostly code and then add AI for specialized problems. Over time, you can continue to add more AI as needed, and maybe in 10 years, your product will be mostly AI.

But you'll find success doing it. You'll get real customers who pay you real money, and you'll be sure you're solving real problems, while other people try to boil the ocean and ultimately fail.

And hey, don't take my word for any of this. Unlike Devin, we ship our products for everyone to try before we make claims that we've changed the world.

You can try out our product and let me know if you think it's a flop, like the AI pin, or if it actually speeds up your workflow as expected. Tweet your feedback at me anytime. I try to reply to everyone.