Discussion about this post

Claire Gouze

That's very interesting! Which tool did you use to test the agent?

Did you rebuild your own AI agent?

And how did you test that each question had the right answer?

I did a similar benchmark but also included other metrics like LLM token costs, query costs, and time to answer. Is it something you've analyzed as well?

Ramona C. Truta

Very interesting experiment.

I'm curious how you accounted for hot vs. cold cache as you switched between contexts. Benchmarking should be done in an environment with a guaranteed cold cache and cold OS, and with a replication factor.
