Let’s dig into a client of mine, Bolt Health. It’s an employee wellness startup that “gets to know” its users through a type of AI called large language models, or LLMs (think ChatGPT).

My primary role was to help the team improve the AI responses, which were giving users generic, repetitive suggestions.

Discovery

The first thing we did was to define the desired end state.

In this discovery, we identified two key areas to address.

  1. A single request to the AI system wasn’t going to work
  2. The chat interface needed to be replaced with a more guided UI

Luckily, solving the first area naturally addressed the second. Instead of a chat interface, the Bolt team used dedicated cards for each category.

(You may be thinking this isn’t “data” work, but stay with me 🙂)
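To make the decomposition concrete, here’s a minimal sketch of what that fan-out could look like. `call_llm` is a hypothetical stand-in for whatever LLM client the team actually uses; Bolt’s real implementation isn’t shown here.

```python
# Sketch: one focused request per category card, instead of one chat turn.
CATEGORIES = ["social", "physical", "work", "emotion"]

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    return f"[model response to: {prompt[:40]}...]"

def build_cards(user_context: dict) -> dict:
    """Fan out to four small requests, one per card in the guided UI."""
    cards = {}
    for category in CATEGORIES:
        prompt = (
            f"Suggest one {category} wellness activity for this user.\n"
            f"Context: {user_context.get(category, 'none')}"
        )
        cards[category] = call_llm(prompt)
    return cards

print(build_cards({"physical": "sits at a desk most of the day"}))
```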

Decomposing the Request

The Bolt team wanted user suggestions to fit into four categories: social, physical, work, and emotion. We approached each category separately.

This meant writing more focused prompts, including a narrower set of user context in each one, and setting up a clear evaluation framework to measure response quality. (Note: optimizing the user-context pipeline itself was out of scope.)
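As a sketch of what “narrower context” could mean in practice, here’s one way to select only the fields relevant to a single category before building the prompt. The field names are invented for illustration; Bolt’s actual context schema isn’t shown here.

```python
# Sketch: each category's prompt sees only the context fields it needs.
# Field names below are hypothetical, not Bolt's actual schema.
CONTEXT_FIELDS = {
    "social": ["team_events", "interests"],
    "physical": ["recent_activities", "fitness_goals"],
    "work": ["workload", "focus_goals"],
    "emotion": ["mood_checkins", "stress_level"],
}

def focused_prompt(category: str, user_context: dict) -> str:
    """Build a narrow, category-specific prompt from a small context slice."""
    selected = {k: v for k, v in user_context.items()
                if k in CONTEXT_FIELDS[category]}
    return (
        f"You are a workplace wellness coach. Suggest one {category} "
        f"wellbeing activity for this user.\nUser context: {selected}"
    )

print(focused_prompt("physical", {"fitness_goals": "walk more",
                                  "workload": "heavy"}))
```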

The evaluation framework was used in two ways:

  1. Ongoing evaluation of performance
  2. Comparing previous evals against a new model or inputs (e.g., when a new “better” LLM becomes available); see the sketch below
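Here’s a minimal sketch of that second use, assuming eval results are per-response 0/1 scores. The run names and scores below are made up for illustration; the point is that keeping one summary per run makes a model or prompt swap comparable on evidence.

```python
# Sketch: summarize each eval run so runs are comparable across models.
# Run names and scores are illustrative, not real results.
import statistics

def summarize_run(run_id: str, scores: list) -> dict:
    """Collapse per-response 0/1 scores into one comparable row."""
    return {"run": run_id, "pass_rate": statistics.mean(scores), "n": len(scores)}

baseline = summarize_run("model-a+prompt-v1", [1, 1, 0, 1, 1])
candidate = summarize_run("model-b+prompt-v1", [1, 1, 1, 1, 0])

# Same metric on both runs means a model swap can be judged, not guessed at.
print(baseline)
print(candidate)
```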

Framework Example: Physical

We decided on three outcome metrics to measure.

| Wanted Response | Measurement | Average (Last 7 Days) | Coverage % |
| --- | --- | --- | --- |
| Related to physical wellbeing | Is related to physical wellbeing | 95% (100% is goal) | 100% |
| Not repetitive of previous suggestions | Was the activity suggested in the last 3 days? (0, 1) | 20% (0% is goal) | 100% |
| “Vibe Check” | Team answered Yes or No (0, 1) | 50% (100% is goal) | 20% |

We used a few natural language processing techniques to automate the first two measurements. The “vibe check” was reviewed manually, which is a good thing: it keeps the team reading real responses.
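The write-up doesn’t specify which NLP techniques were used, so here is a deliberately simplified sketch of how the first two checks could be automated with keyword and string matching. A real system might use embeddings or a small classifier; the term list and matching logic below are illustrative only.

```python
# Simplified stand-ins for the two automated checks.
# Keyword list and matching logic are illustrative, not Bolt's actual method.
PHYSICAL_TERMS = {"walk", "stretch", "run", "yoga", "exercise", "workout"}

def is_physical(suggestion: str) -> int:
    """Metric 1: is the suggestion related to physical wellbeing? (1 = yes)"""
    words = set(suggestion.lower().replace(".", "").split())
    return int(bool(words & PHYSICAL_TERMS))

def was_recently_suggested(suggestion: str, last_3_days: list) -> int:
    """Metric 2: was this activity suggested in the last 3 days? (1 = yes)"""
    return int(any(suggestion.strip().lower() == prev.strip().lower()
                   for prev in last_3_days))

print(is_physical("Take a 20 minute walk"))                            # -> 1
print(was_recently_suggested("Take a walk", ["take a walk", "yoga"]))  # -> 1
```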

The Outcome

I was able to apply my expertise to guide the Bolt team while developing the evaluation framework (using AI and standard analysis). Now the team is able to:

  1. Understand how their product is working
  2. Confidently tweak model inputs or swap out models
  3. Prioritize which areas of their product need attention

A good side effect is that the team now has a lot of new ideas for improving their product going forward.
