from enum import Enum
from typing import Union

from pydantic import BaseModel, SerializeAsAny

class GroupOperator(str, Enum):
    OR = "or"
    AND = "and"

class UserAgeOperators(str, Enum):
    GREATER_THAN = ">"
    LESS_THAN = "<"
    EQUALS = "="

class UserSex(str, Enum):
    MALE = "male"
    FEMALE = "female"

class Node(BaseModel):
    category: str
    operator: UserAgeOperators
    value: int

class UserAgeNode(Node):
    category: str = "user_age"
    value: int
    operator: str

class UserSexNode(Node):
    category: str = "user_sex"
    value: UserSex
    operator: str = "equals"

NodeTypes = Union[UserAgeNode, UserSexNode]

class Expression(BaseModel):
    group_operator: GroupOperator
    # to support multiple levels of nested expressions
    nodes: SerializeAsAny[list[Union["Expression", NodeTypes]]]
I was lucky enough to be part of the first (and only?) cohort of the course from Parlance Labs, and I was keen to get started on my use case: improving the UX of a product feature I was working on. This is the first post in a series that will explore using LLMs to add a ✨AI✨ natural language query feature, and systematically evaluating and improving LLMs for this use case.
0.1 Introducing the product feature
To find users, a feature on the platform allows admins to filter users by certain attributes. Currently this is achieved with a dropdown with multiple inputs that are progressively rendered depending on previous choices. In the example above, when the admin searches for users matching this query, the following request is sent to the backend:
{
"group_operator": "and",
"children": [
{
"category": "user_sex",
"operator": "equals",
"value": "male"
},
{
"category": "user_age",
"operator": "greater_than",
"value": "18"
}
]
}
These expressions can be nested (infinitely!) to represent more complex queries:
{
"group_operator": "or",
"children": [
{
"group_operator": "and",
"children": [
{
"category": "user_sex",
"operator": "equals",
"value": "male"
},
{
"category": "user_age",
"operator": "greater_than",
"value": "18"
}
]
},
{
"group_operator": "and",
"children": [
{
"category": "user_sex",
"operator": "equals",
"value": "female"
},
{
"category": "user_age",
"operator": "greater_than",
"value": "21"
}
]
}
]
}
an example of a nested expression: the query above will find users who are (male AND above 18) OR (female AND above 21)
The challenge with the existing UI is that it becomes difficult to find the type of filter you need. Age and sex are easy enough to remember, but the number of filter categories continues to grow, especially as the product increases in complexity to serve more customers with different use cases.
The goal of this project is to guide the user to choosing the correct filter, not to completely replace the existing UI. This hopefully avoids the trauma associated with the dreaded sparkles feature everyone seems to be forcing onto users. The additional benefit is that it also provides us with useful training data: we can compare the generated filter with the actual filter sent to the backend after the user has had a chance to adjust the query with the dropdown UI, propelling our data flywheel to improve the model.
we want to achieve something like this in our product
1 Starting with Simple & Stupid
Before committing to this project, it’s important to get a sense of what is possible with LLMs, so we start by stripping our use case down to a small subset of the expressions we eventually want to support.
To get started, we’ll try with just the two filter expressions shown above, ‘user_age’ and ‘user_sex’, and see what ChatGPT gives us:
Even with just two types of filters, this looks very promising given that there was no fancy prompt engineering or other tricks (which we will explore later). Most importantly, it shows that this use case is feasible and worth exploring a bit further.
Filter expressions are simply JSON objects, so we can take advantage of the ecosystem around structured outputs for LLMs. To do this, we’ll create pydantic models to represent these expressions.
OpenAI offers two routes to generating structured outputs: function calling and response format. They recommend function calling when building an application, so we’ll start with that. To simplify the boilerplate around generating the function schemas, we’ll define the JSON schemas with pydantic models and use the instructor package to generate and send these with our OpenAI API request. Hamel has a great post breaking down how instructor works and what it is actually doing behind the scenes.
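If you’re curious what instructor actually sends, you can peek at the JSON schema pydantic generates from the model; instructor builds (roughly) this into the function definition for the API:

import json

# the JSON schema pydantic derives from our model;
# instructor sends a post-processed version of this as the function schema
print(json.dumps(Expression.model_json_schema(), indent=2))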
We can then represent the above expressions with pydantic models instead of writing raw JSON ourselves:
from pprint import PrettyPrinter

pp = PrettyPrinter(width=40)

Expression(
    group_operator=GroupOperator.AND,
    nodes=[
        UserAgeNode(
            category="user_age", value=18, operator=UserAgeOperators.GREATER_THAN
        )
    ],
).model_dump(mode="json")
{'group_operator': 'and',
'nodes': [{'category': 'user_age', 'operator': '>', 'value': 18}]}
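Since nodes can contain further Expressions, the nested (male AND over 18) OR (female AND over 21) query from earlier can be built with the same models. A quick sketch (not needed for the rest of the post, it just shows the recursion works):

nested = Expression(
    group_operator=GroupOperator.OR,
    nodes=[
        Expression(
            group_operator=GroupOperator.AND,
            nodes=[
                UserSexNode(value=UserSex.MALE),
                UserAgeNode(value=18, operator=UserAgeOperators.GREATER_THAN),
            ],
        ),
        Expression(
            group_operator=GroupOperator.AND,
            nodes=[
                UserSexNode(value=UserSex.FEMALE),
                UserAgeNode(value=21, operator=UserAgeOperators.GREATER_THAN),
            ],
        ),
    ],
)
pp.pprint(nested.model_dump(mode="json"))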
We can then use this schema definition to make an API call to OpenAI with function calling, using instructor’s patched client:
import instructor
import openai
from dotenv import load_dotenv

load_dotenv()

client = instructor.from_openai(openai.OpenAI())

system_prompt = """you are an assistant to help generate filter expressions to search for users for a query.
Use the supplied tools to assist the user to generate a valid expression for the following query:"""

messages = []
messages.append({"role": "system", "content": system_prompt})
messages.append({"role": "user", "content": "find me users over the age of 18"})

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06", messages=messages, response_model=Expression
)
pp.pprint(response)
Expression(group_operator=<GroupOperator.AND: 'and'>, nodes=[UserAgeNode(category='user_age', operator='greater_than', value=18)])
This gives us a valid pydantic model, which we can then dump into the JSON format our query expects:
="json")) pp.pprint(response.model_dump(mode
{'group_operator': 'and',
'nodes': [{'category': 'user_age',
'operator': 'greater_than',
'value': 18}]}
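One wrinkle worth noting: the backend payloads at the top of the post nest sub-expressions under a children key, while our model calls the field nodes. If the dump ever needs to match the backend exactly, pydantic’s serialization alias can bridge the gap. A minimal sketch (AliasedExpression is an illustrative name, not part of the product code):

from pydantic import Field

class AliasedExpression(BaseModel):
    group_operator: GroupOperator
    # dump "nodes" under the backend's "children" key
    nodes: SerializeAsAny[list[Union["AliasedExpression", NodeTypes]]] = Field(
        serialization_alias="children"
    )

pp.pprint(
    AliasedExpression(
        group_operator=GroupOperator.AND,
        nodes=[UserAgeNode(value=18, operator=UserAgeOperators.GREATER_THAN)],
    ).model_dump(mode="json", by_alias=True)
)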
2 Vibe checks
We can try out a few more examples and add some more filter categories to get a better sense of what LLMs are capable of and what kinds of things they struggle with.
from datetime import date
from typing import Literal

class UserOccupationNode(Node):
    category: Literal["user_occupation"]
    value: str
    operator: Literal["equals"] = "equals"

class MaritalStatuses(str, Enum):
    SINGLE = "single"
    MARRIED = "married"
    NOT_SINGLE = "not_single"
    DIVORCED = "divorced"
    WIDOWED = "widowed"

class UserMaritalStatusNode(Node):
    category: Literal["user_marital_status"] = "user_marital_status"
    value: MaritalStatuses
    operator: Literal["equals"] = "equals"

class DateOperators(str, Enum):
    AFTER = "after"
    BEFORE = "before"
    ON = "on"

class UserJoinedDateNode(Node):
    category: Literal["user_joined_date"] = "user_joined_date"
    value: date
    operator: DateOperators

NodeTypes = Union[
    UserAgeNode,
    UserSexNode,
    UserOccupationNode,
    UserMaritalStatusNode,
    UserJoinedDateNode,
]

class Expression(BaseModel):
    group_operator: GroupOperator
    nodes: SerializeAsAny[list[Union["Expression", NodeTypes]]]
def generate_query(natural_language_query: str) -> Expression:
    system_prompt = """you are an assistant to help generate filter expressions
    to search for users for a query. Use the supplied tools to assist the user
    to generate a valid expression for the following query:"""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": natural_language_query},
    ]
    return client.chat.completions.create(
        model="gpt-4o-2024-08-06", messages=messages, response_model=Expression
    )
Let’s start with some simple single-node queries, just to make sure that the correct type of expression node is used:
expected_node = UserMaritalStatusNode
generated_query = generate_query("users who are married")

is_correct_node = isinstance(generated_query.nodes[0], expected_node)
pp.pprint(generated_query.model_dump(mode="json"))
print("✅") if is_correct_node else print("❌")
{'group_operator': 'or',
'nodes': [{'category': 'user_marital_status',
'operator': 'equals',
'value': 'married'}]}
✅
We can bundle this code into a function so we can easily try a few prompts:
def test_single_node(
    natural_language_query: str, expected_node: type[NodeTypes]
) -> tuple[Expression, bool]:
    generated_query = generate_query(natural_language_query)
    if len(generated_query.nodes) != 1:
        # we only expect one node in the expression
        return generated_query, False
    return generated_query, isinstance(generated_query.nodes[0], expected_node)

def print_results(generated_query: Expression, is_correct_node: bool) -> None:
    pp.pprint(generated_query.model_dump(mode="json"))
    print("✅") if is_correct_node else print("❌")

generated_query, is_correct_node = test_single_node(
    "users who are married", UserMaritalStatusNode
)
print_results(generated_query, is_correct_node)
{'group_operator': 'and',
'nodes': [{'category': 'user_marital_status',
'operator': 'equals',
'value': 'married'}]}
✅
expected_node = UserOccupationNode
query = "users who are doctors"

generated_query, is_correct_node = test_single_node(query, expected_node)
print_results(generated_query, is_correct_node)
{'group_operator': 'and',
'nodes': [{'category': 'user_occupation',
'operator': 'equals',
'value': 'doctor'}]}
✅
expected_node = UserSexNode
query = "fetch all female users"

generated_query, is_correct_node = test_single_node(query, expected_node)
print_results(generated_query, is_correct_node)
{'group_operator': 'and',
'nodes': [{'category': 'user_sex',
'operator': 'equals',
'value': 'female'}]}
✅
Let’s up the ante and try some expressions where we expect multiple nodes. We can create a new test function here to reduce the repetition:
from collections import Counter

def test_multi_node(
    natural_language_query: str, expected_nodes: list[type[NodeTypes]]
) -> tuple[Expression, bool]:
    generated_query = generate_query(natural_language_query)
    if len(generated_query.nodes) != len(expected_nodes):
        # we got more or fewer nodes than we expected
        return generated_query, False
    # check that the node types match, with the same counts
    is_correct_nodes = Counter(type(node) for node in generated_query.nodes) == Counter(
        expected_nodes
    )
    return generated_query, is_correct_nodes

generated_query, is_correct_nodes = test_multi_node(
    "joined in 2023", expected_nodes=[UserJoinedDateNode, UserJoinedDateNode]
)
print_results(generated_query, is_correct_nodes)
{'group_operator': 'and',
'nodes': [{'category': 'user_joined_date',
'operator': 'after',
'value': '2022-12-31'}]}
❌
Interestingly, the LLM wasn’t able to identify that we need two nodes here: one to check that the joined date was after the end of 2022 AND one to check it was before the end of 2023.
Let’s see if retrying helps:
is_correct_nodes = False
num_attempts = 0
while not is_correct_nodes:
    num_attempts += 1
    generated_query, is_correct_nodes = test_multi_node(
        "joined in 2023", expected_nodes=[UserJoinedDateNode, UserJoinedDateNode]
    )

print(f"{num_attempts=}")
print_results(generated_query, is_correct_nodes)
num_attempts=1
{'group_operator': 'and',
'nodes': [{'category': 'user_joined_date',
'operator': 'after',
'value': '2023-01-01'},
{'category': 'user_joined_date',
'operator': 'before',
'value': '2023-12-31'}]}
✅
generated_query, is_correct_nodes = test_multi_node(
    "find me female users over the age of 18", expected_nodes=[UserSexNode, UserAgeNode]
)
print_results(generated_query, is_correct_nodes)
{'group_operator': 'and',
'nodes': [{'category': 'user_sex',
'operator': 'equals',
'value': 'female'},
{'category': 'user_age',
'operator': 'greater_than',
'value': 18}]}
✅
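These tests only look at the top-level nodes list. If we later want to vibe check nested queries like the (male AND over 18) OR (female AND over 21) example, a small helper to flatten the tree would let us reuse the same counting trick. A sketch of what that could look like (we don’t need it yet):

def flatten_nodes(expression: Expression) -> list[Node]:
    # recursively collect the leaf nodes of a (possibly nested) expression
    leaves: list[Node] = []
    for node in expression.nodes:
        if isinstance(node, Expression):
            leaves.extend(flatten_nodes(node))
        else:
            leaves.append(node)
    return leaves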
3 Integrating into the product
Even with such a simple and naive implementation, we can start testing the system in the product to collect some data from users, i.e. what kinds of natural language queries do our users actually try? It’s also a lot more motivating to improve the model’s performance when it correlates directly with user satisfaction (or dissatisfaction!)
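A minimal sketch of what that data collection could look like (the event shape and the JSONL sink are illustrative assumptions, not our actual pipeline):

import json
from datetime import datetime, timezone

def log_query_event(
    natural_language_query: str, generated: Expression, final_filter: dict
) -> None:
    # append one JSON line per query so we can later compare the
    # LLM-generated filter against the filter the user actually submitted
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": natural_language_query,
        "generated_filter": generated.model_dump(mode="json"),
        "final_filter": final_filter,
    }
    with open("query_events.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")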
4 Next Steps - Evals
We already introduced some simple tests in this post. test_single_node and test_multi_node look quite basic, but the pydantic models (and instructor) are doing a lot of heavy lifting here to ensure that our model responses are valid expressions.
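For instance, a response whose operator falls outside the DateOperators enum fails validation and never reaches our code as a “valid” expression. A quick illustration:

from pydantic import ValidationError

try:
    # "until" is not a DateOperators member, so this raises
    UserJoinedDateNode(value="2023-01-01", operator="until")
except ValidationError as e:
    print(e)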
In the next post, we’ll build up our suite of evaluations and try out more complex prompts to push this base model to its limits, identifying more expressions where the model fails so that we can improve it.