Let’s dig into a client of mine, Bolt Health. It’s an employee wellness startup that “gets to know” its users through a type of AI called large language models, or LLMs (think ChatGPT).
My primary role was to help the team improve the AI’s responses, which were giving users generic, repetitive suggestions.
Discovery
The first thing we did was to define the desired end state.
- The team wanted an AI system that could provide the user with useful suggestions.
- The underlying reason was to help the user address a problem, improve a habit, or accomplish a goal.
In this discovery, we identified two key areas to address.
- A single request to the AI system wasn’t going to work for this
- The chat interface needed to be replaced with a more guided UI
Luckily, solving the first area naturally addressed the second. Instead of a chat interface, the Bolt team used dedicated cards for each category.
(You may be thinking this isn’t “data” work, but stay with me 🙂)
Decomposing the Request
The Bolt team wanted user suggestions to fit into four categories (social, physical, work, and emotion), so we approached each category separately.
This meant writing more focused prompts, narrowing the user context included in each request, and setting up a clear evaluation framework to measure response quality. Note: optimizing the user context process was out of scope.
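As a purely illustrative sketch (not Bolt’s actual prompts or code), the per-category decomposition might look something like this. The prompt wording, context field names, and the build_request helper are all assumptions for the example.

```python
# Illustrative sketch only: prompt wording, context fields, and helper
# names are hypothetical, not Bolt's production code.

CATEGORY_PROMPTS = {
    "physical": (
        "You are a wellness assistant. Using only the context below, "
        "suggest ONE specific physical activity the user could try today.\n"
        "Context: {context}"
    ),
    "social": (
        "You are a wellness assistant. Using only the context below, "
        "suggest ONE specific way the user could connect with someone today.\n"
        "Context: {context}"
    ),
    # "work" and "emotion" follow the same pattern
}

# Each category sees only the slice of user context it needs,
# instead of the full profile going into one giant prompt.
CATEGORY_CONTEXT_FIELDS = {
    "physical": ["recent_activities", "stated_goals"],
    "social": ["recent_checkins", "stated_goals"],
}

def build_request(category: str, user_profile: dict) -> str:
    """Assemble a focused prompt for one category."""
    fields = CATEGORY_CONTEXT_FIELDS.get(category, [])
    context = {field: user_profile.get(field) for field in fields}
    return CATEGORY_PROMPTS[category].format(context=context)
```

Each focused request goes to the LLM on its own, which is also what let the UI move from a single chat thread to one dedicated card per category.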
The evaluation framework was used in two ways:
- Ongoing evaluation of performance
- Comparison of previous evals against new models or inputs (e.g., when a new “better” LLM becomes available); see the sketch below
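As a rough illustration of that second use, eval runs could be logged per model and compared over time. The metric names, the score dictionaries, and the JSONL log format here are placeholders, not Bolt’s actual tooling.

```python
# Placeholder sketch: metric names and the JSONL log format are
# assumptions for illustration, not Bolt's actual eval tooling.

import datetime
import json
import statistics

def record_eval_run(model_name: str, scores: list[dict], path: str) -> None:
    """Append one eval run (per-suggestion 0/1 metric scores) to a JSONL log."""
    run = {
        "model": model_name,
        "date": datetime.date.today().isoformat(),
        "n": len(scores),
        "avg": {m: statistics.mean(s[m] for s in scores) for m in scores[0]},
    }
    with open(path, "a") as f:
        f.write(json.dumps(run) + "\n")

def compare_runs(path: str, metric: str) -> None:
    """Print one metric's average for every recorded run, oldest first."""
    with open(path) as f:
        for line in f:
            run = json.loads(line)
            print(f"{run['date']}  {run['model']:<20}  {metric}: {run['avg'][metric]:.2f}")
```

Keeping every run in a flat log is what makes the “old eval vs. new model” comparison a quick lookup rather than a re-run of history.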
Framework Example: Physical
We decided on three outcomes to measure.
| Wanted Response | Measurement | Average Last 7 Days | Coverage % |
| --- | --- | --- | --- |
| Related to physical wellbeing | Is the suggestion related to physical wellbeing? | 95% (goal: 100%) | 100% |
| Not repetitive of recent suggestions | Was the activity suggested in the last 3 days? Yes / No (0, 1) | 20% (goal: 0%) | 100% |
| “Vibe check” | Team answered Yes or No (0, 1) | 50% (goal: 100%) | 20% |
We used a few natural language processing techniques to automate the first two measurements. The “vibe check” was manually reviewed (which is a good thing).
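As a purely illustrative example, and not the techniques we actually used, those two checks could be automated with something as simple as a keyword match and a lookup against the last few days of suggestions. The keyword list and function names below are hypothetical.

```python
# Purely illustrative stand-ins: the keyword list, function names, and the
# exact-match repetition rule are simplifications for this example.

PHYSICAL_KEYWORDS = {"walk", "run", "stretch", "yoga", "gym", "bike", "swim"}

def related_to_physical(suggestion: str) -> int:
    """1 if the suggestion mentions a physical-activity keyword, else 0."""
    words = set(suggestion.lower().split())
    return int(bool(words & PHYSICAL_KEYWORDS))

def repeated_recently(suggestion: str, last_3_days: list[str]) -> int:
    """1 if the same suggestion (case-insensitive) was made in the last 3 days."""
    return int(suggestion.lower() in {s.lower() for s in last_3_days})

# Score one suggestion against the two automated metrics.
suggestion = "Take a 20 minute walk after lunch"
scores = {
    "related_to_physical": related_to_physical(suggestion),
    "repeated_recently": repeated_recently(
        suggestion, ["Try a 10 minute stretch", "Take a 20 minute walk after lunch"]
    ),
}
print(scores)  # {'related_to_physical': 1, 'repeated_recently': 1}
```

Whatever the technique, the shape is the same: each suggestion gets a 0 or 1 per metric, and those scores roll up into the averages in the table above.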
The Outcome
I was able to apply my expertise to guide the Bolt team in developing the evaluation framework (using AI and standard analysis). Now the team is able to:
- Understand how their product is working
- Confidently tweak model inputs or swap out models
- Prioritize which areas of their product need attention
A good side effect is that the team now has a lot of new ideas for improving their product going forward.