How to design high-quality AI agents for instant productivity and reduced support & training costs
In this issue I give examples of how agents are hugely reducing the friction of using new tools, and massively cutting support and training costs along with it.
Welcome new subscribers to Making AI Agents (MAIA), where you can trust me to avoid the hype and share my direct experiences building and researching AI agents.
Also, each issue is designed to have lasting value, so if you missed out on previous issues, feel free to catch up here.
Grab a coffee, settle down, and sit back. This is going to be a good one! Feedback welcome, link at end 👇
Recap: agents vs bots vs scripts: know the difference
Tired: bots 🤖 and scripts 📝
Wired: agents 🕵️♀️🔎 (ok not that kind of agent)
What's the real difference between bots, scripts, and agents? Fundamentally:
AI agents are better than bots and scripts because they can take plain English instructions and autonomously create their own plans, act on them with tools, then reflect and revise.
Just for fun though, tools that simply run through a fixed set of steps are also being called agents. (In fact, as far as I can tell, expect every marketing department to bend over backwards to present some kind of “agentic” capability, no matter what the product actually does. 🤦♂️)
Agents can be integrated into your existing products in many ways.
I’m seeing agents deployed at many different levels, often as an additional way of accessing existing tools.
In-app “micro agents”: embedded throughout a tool, on hand whenever you have a question.
“Product” agents: automating substantial portions of a specific product or app
“Product bundle” agents: an agent that helps you bring together different apps in a product bundle.
“Cross-product” agents: currently the most common: wire together the inputs and outputs of different services to create your own flow.
In-app “micro agents” everywhere inside your apps to reduce customer support costs
This is a really powerful retrofit to ANY product you own right now. It goes like this:
The existing UI offers a chat box where users can type instructions
The micro agent carries out the task for them (a rough sketch of the pattern follows)
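Here’s that sketch, using OpenAI-style function calling: the chat box hands the user’s sentence to a language model, which picks one of the tools the app exposes, and the app then applies the change. The tool name, setting, and model below are placeholders I’ve made up for illustration, not any particular product’s API.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical app setting exposed as a tool the model can call
tools = [{
    "type": "function",
    "function": {
        "name": "set_links_open_in_new_window",  # made-up name for this sketch
        "description": "Toggle whether links on the user's site open in a new window",
        "parameters": {
            "type": "object",
            "properties": {"enabled": {"type": "boolean"}},
            "required": ["enabled"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; use whichever you trust
    messages=[
        {"role": "system", "content": "You are an in-app assistant. Change settings on the user's behalf."},
        {"role": "user", "content": "Make sure links open up in a new window"},
    ],
    tools=tools,
)

# The model replies with a tool call; the app applies it and confirms back to the user
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```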
I think this is quite a shift. For example, the Unicorn Platform (below) has an “Ask AI” button to help when someone gets stuck. I wanted to make sure links opened up in a new window, and it made the change in the appropriate section for me:

Product agents: bootstrap instant productivity by letting the user just say what they want to do.
Another pattern growing in popularity is a chatbot that drives the UI for you from the very beginning of the journey.
Your users don’t need to know anything about how to use the tool; they simply say what they want in plain English.
This means instant productivity and subsequent reduced training costs.
Example 1: Replit Agent

Example 2: Smyth OS agent creator
Smyth OS is a recent discovery of mine: a tool for making agents visually. Even better, Smyth OS has an agent to help you create an agent 🤯
Start of flow screen shot:

Smyth OS next screen: all the boxes and lines shown here for the agent were created from my prompt, right in front of my eyes. I could have kept going through the chatbot if I’d wanted to.

Example 3: Microsoft Dynamics 365 built-in agents
Microsoft added an agent layer on top of very specific product areas of Dynamics 365: sales orders, supplier communications, financial reconciliation, accounting reconciliation, and time & expenses.
These are:
Super niche agents
Incredibly boring for people who don’t do these tasks daily 🥱
Incredibly valuable for those who DO these tasks daily 🥳💰

Cross-product agents
This is the big one 🧨: cross-product agents automate business workflows across multiple products from different vendors. Smyth OS, which I showed above, is one promising agent-maker tool. Others include n8n.io, Zapier Agents, and Make.com, among many more. I’m building a directory of them all: stay tuned!
Future: product bundle agents will increase bundle value
This is speculation 🔮: I expect a standard part of any product bundle will be tightly integrated agent workflows spanning the products in it. Customised, high-value workflows like these could become a big selling point for buying the bundle over individual solutions.
DABstep: helping make higher quality agents by finding the right language model
So you want to make an agent. How do you know if it’s any good? The promise of agents is to:
Take plain English instructions from people, to
Do useful things with information they provide, by
Planning, acting on the plan and
Revising it until it's good enough.
Agents do this by using:
A language model (such as GPT-4, Claude 3.5, Llama 3.3), combined with
Tools such as file access, web search, sending emails, voice calls with customers, etc. These are variously called "integrations" or "agent tools". (A toy sketch of this loop follows.)
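To make that loop concrete, here is a toy plan-act-reflect sketch in plain Python. The `ask_llm` helper, the stand-in tools, and the model name are all invented for illustration; real agent frameworks do this with far more care (and far better prompts).

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_llm(prompt: str) -> str:
    """One call to the language model (example model name)."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

# Toy stand-ins for real agent tools (file access, email, etc.)
TOOLS = {
    "search_files": lambda q: f"(pretend search results for {q!r})",
    "send_email": lambda body: f"(pretend email sent: {body[:40]})",
}

def run_agent(task: str, max_rounds: int = 3) -> str:
    # 1. Plan: ask the model for a short plan that only uses the tools we expose
    plan = ask_llm(f"Task: {task}\nWrite a short numbered plan using only these tools: {list(TOOLS)}")
    result = ""
    for _ in range(max_rounds):
        # 2. Act: ask the model to pick one tool and its input, then run that tool
        action = ask_llm(f"Plan:\n{plan}\nReply with exactly one line in the form 'tool_name: input'.")
        name, _, arg = action.partition(":")
        result = TOOLS.get(name.strip(), lambda a: "unknown tool")(arg.strip())
        # 3. Reflect: is the result good enough, or does the plan need revising?
        verdict = ask_llm(f"Task: {task}\nLatest result: {result}\nReply DONE or REVISE.")
        if "DONE" in verdict.upper():
            break
        plan = ask_llm(f"Revise the plan given this result:\n{result}")
    return result

print(run_agent("Find last month's invoice and email it to finance"))
```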
But language models are hard to trust: they:
Are very expensive
Are evolving rapidly
Can be quite tricky to get consistent quality results from
Can take hours and hours of experimentation to get a “feel” for.
Benchmarks with leaderboards can help accelerate confidence in a language model for your agent:
They offer a broad baseline of real world use cases with reasonable ways to measure the results
They provide clear measures of how one language model performs against the others
They can be extended easily to support your own usage data.
The previous issue of Making AI Agents covered an evaluation benchmark called The Agent Company, with 175 agent use cases across a range of everyday organisational roles.
In this issue I dig into another agent benchmark called DABstep, from HuggingFace and a data-specialist partner, which is compelling in a number of ways.

DABstep is appealing because:
It has 450 realistic use cases from real world commercial and financial customer queries
It's easy to run evaluations (for a programmer)
Extra bonus: to avoid bias, leaderboard position is also determined by how the language model performs against unseen questions
I'm an AI strategist: how can I use DABstep?
DABstep can give you a signal for which language models work best with financial and commercial plain English use cases, and with natural language data queries in general. For strategy and planning, the simplest approach is to bookmark their leaderboard and, when you're looking to build an agent, check which models rank best. That gives you a shortlist of models to start experimenting with in your agent design.
Also:
Every use case is a plain English question (typically shorter than the long paragraph briefings of The Agent Company) designed to produce a "factoid"
Agent planning required: the steps needed to do the work are left to the agent, and the nature of the questions requires plans of at least two steps ("multi-hop" is the term for this)
Heterogeneous data types: the data is a mix of an operations manual, a very complex set of business rules, and some tabular data, which the agents have access to in order to figure out how to answer each of the 450 questions
I'm an AI Engineer: gimme the details.
It uses the HuggingFace Smolagents AI agent developer toolkit to create the agents and run the benchmarks
Smolagents is a bit of a brain-bender:
It's written in Python (so far so good)
It uses LLM prompts to create plain English plans (yep, with you so far)
… which generate MORE Python (oh wait, what?)
… which is then executed (yes, no human eval; you have to trust the code is ok)
The result is then evaluated: it could outright fail to run, or it might not meet the plain English evaluation criteria
Rinse and repeat: feed all that back into the generation loop and try again (with a hard limit on the number of tries). A minimal example of driving this loop is below.
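Here’s roughly what that looks like with Smolagents. Class names (HfApiModel in particular) have shifted between smolagents versions, and the CSV file and question below are placeholders of mine rather than real DABstep data, so treat this as a sketch, not the benchmark’s actual harness.

```python
from smolagents import CodeAgent, HfApiModel

model = HfApiModel()  # defaults to a hosted model on HuggingFace's inference API

agent = CodeAgent(
    tools=[],                                  # DABstep answers come from local data, not web tools
    model=model,
    additional_authorized_imports=["pandas"],  # the generated Python is allowed to import pandas
)

# The agent writes Python, runs it, checks the result, and retries until done (or it hits its step limit)
answer = agent.run(
    "Load payments.csv and report which merchant paid the most in fees during 2023."
)
print(answer)
```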
Using Smolagents and its novel on-the-fly code generation comes with risks. I'd DEFINITELY not recommend using this in external-facing products, although Python can be sandboxed fairly well these days (see Pyodide, for instance).
The DABstep repo contains a Colab Notebook you can use to explore and run the code. However, it relies on HuggingFace's inference services, which were down when I tried them (less 🤗, more ☹️), so I created a pure Python version of DABstep that you can use here, which only relies on OpenAI. You need HuggingFace API keys (free) and OpenAI API keys (preloaded with a few bucks).
In DABstep-py I also tried various other open-weight models through Ollama but had to settle, at least for now, on 4o-mini. I'd have preferred to use Google Gemini, but Smolagents didn't yet have a Gemini connector (though it's really trivial to add).
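For what it's worth, one way to point the same kind of agent at 4o-mini is Smolagents' LiteLLMModel wrapper, assuming it's available in your version; this isn't necessarily how DABstep-py wires it up.

```python
from smolagents import CodeAgent, LiteLLMModel

# LiteLLMModel routes calls through the litellm package and reads OPENAI_API_KEY
model = LiteLLMModel(model_id="gpt-4o-mini")
agent = CodeAgent(tools=[], model=model)
```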
Recommended non-hypey AI Agent articles
“Deep research” is the new hotness
Deep Research is, at its core, an agent combining a sophisticated language model with a web search tool, running a pipeline of steps to produce more intelligent yet factually accurate research reports on a topic. Both Google and OpenAI have “Deep Research” products, but they're behind expensive paywalls ($20 and $200 😱 respectively).
As the basic mechanism is fairly well known, there was an immediate explosion of open source “Deep Research” clones just days later. Here are the ones I found recently:
AI Researcher (mshumer) (one of the original ones)
Open Source DeepResearch (huggingface) / live demo
Replit Deep Research + Browser Use source (gdntn)
More on this in a future issue as it deserves a deep dive. (Note: only one of them uses an actual agent framework 🤔 and guess what: it was Smolagents!)
AI Agent workflow building blocks
A useful resource looking at reusable AI agent workflow building blocks like orchestrator, parallelization, routing, and chaining.
Have your say!
Want this to be the best newsletter ever? I’d love to get your feedback.