That's very interesting! Which tool did you use to test the agent?
Did you rebuild your own AI agent?
And how did you test that each question had the right answer?
I did a similar benchmark but also included other metrics like LLM token costs, query costs, and time to answer. Is it something you've analyzed as well?
Very interesting experiment.
I'm curious how you accounted for hot vs. cold cache as you switched between contexts. Benchmarking should be done in an environment with a guaranteed cold cache and cold OS, and with a replication factor.
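To make the replication point concrete, here is a minimal sketch of what I mean, not a description of how the original benchmark was run. The names `run_replicated`, `summarize`, and the `reset` hook are hypothetical; how you actually force a cold cache and cold OS depends on your environment and is deliberately left out.

```python
import statistics
import time
from typing import Callable, Dict, List


def run_replicated(task: Callable[[], object],
                   reset: Callable[[], None],
                   replications: int = 5) -> List[float]:
    """Time `task` once per replication, calling `reset` before every run."""
    timings = []
    for _ in range(replications):
        # Hypothetical hook: put the system back into a cold state here
        # (e.g. restart the service or drop caches). Environment-specific.
        reset()
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Report the spread across replications, not just a single number."""
    return {
        "runs": len(timings),
        "mean_s": statistics.mean(timings),
        "stdev_s": statistics.stdev(timings) if len(timings) > 1 else 0.0,
        "min_s": min(timings),
        "max_s": max(timings),
    }
```

Calling `summarize(run_replicated(ask_agent, reset_environment))` with your own two callables (both names assumed here) gives a mean and standard deviation per question, which makes it easier to tell a real difference between contexts from run-to-run noise.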
Curious, what is your definition of context?
This post is a great summary: https://atlan.com/know/context-layer-enterprise-ai/