The trust/validation layer is the interesting part here. We run ~20 autonomous AI agents on BoTTube (bottube.ai) that create videos, comment, and
interact with each other - the hardest problem by far has been exactly what you're describing: knowing whether an agent's output is grounded vs
hallucinated. We ended up building a similar evidence-quality check where agents that can't back up a claim just abstain.
Curious how the routing score weights (70/20/10) were chosen - have you experimented with letting agents adjust those weights based on task type? For
something like content generation the capability match matters way more than latency, but for real-time data feeds you'd probably want to flip that.
Thanks for checking this out! 20 autonomous agents interacting with each other sounds intense; that's exactly the kind of multi-agent coordination problem I'm trying to make easier.
On the weights (70/20/10 for capability/latency/cost):
Honestly, those were empirically tuned from my own usage patterns. Started with equal weights, then noticed that capability mismatch was causing way more failures than slow responses or high costs. So I kept bumping capability weight until the "wrong tool selected" rate dropped.
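In code, that static scoring is roughly the following. This is just a sketch of the idea, not Axiomeer's actual API; the provider fields and names are hypothetical, and I'm assuming each dimension is pre-normalized to [0, 1] with higher meaning better (so latency and cost are already inverted).

```python
# Sketch of static weighted routing: capability match dominates,
# latency and cost act as tiebreakers. All field names are hypothetical.

WEIGHTS = {"capability": 0.70, "latency": 0.20, "cost": 0.10}

def route_score(provider: dict) -> float:
    """Weighted sum over normalized [0, 1] scores (higher is better)."""
    return sum(WEIGHTS[k] * provider[k] for k in WEIGHTS)

def pick_provider(providers: list[dict]) -> dict:
    """Select the provider with the best weighted score."""
    return max(providers, key=route_score)

providers = [
    {"name": "weather-api", "capability": 0.9, "latency": 0.4, "cost": 0.8},
    {"name": "wiki-api",    "capability": 0.5, "latency": 0.9, "cost": 0.9},
]
best = pick_provider(providers)  # weather-api wins despite worse latency
```

With these weights, a strong capability match beats a fast-but-mismatched provider, which is exactly the behavior the tuning converged on.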
You're spot on about task-type sensitivity though. I actually have additional weights for trust (15%) and semantic relevance (25%) that kick in during the ranking phase. But dynamic weight adjustment per task type is on the roadmap.
The idea would be something like:
- "real-time" or "live" in query → boost latency weight to 40%
- "cheap" or "budget" in query → boost cost weight to 30%
- "accurate" or "reliable" in query → boost trust weight to 25%
Haven't shipped it yet because I wanted to validate the static weights first. But your content generation vs real-time data example is exactly the use case.
On the trust layer - I do evidence-quality scoring where each API response includes a confidence field. APIs that return citations or source URLs get a trust boost. The abstention pattern you mentioned is interesting - I currently surface low-confidence results with a warning rather than hiding them, but abstention might be cleaner for agent-to-agent workflows.
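For the agent-to-agent case, abstention could be as simple as a threshold on the combined evidence score. A sketch, where the schema fields, the citation bonus, and the 0.5 threshold are all illustrative rather than Axiomeer's real values:

```python
# Illustrative evidence-quality scoring: self-reported confidence
# plus a trust boost for responses that carry source citations.

def evidence_score(response: dict) -> float:
    score = response.get("confidence", 0.0)
    if response.get("source_urls"):   # citations earn a trust boost
        score = min(1.0, score + 0.15)
    return score

ABSTAIN_BELOW = 0.5  # illustrative threshold

def answer_or_abstain(response: dict) -> dict:
    """Abstain instead of surfacing a low-confidence answer."""
    score = evidence_score(response)
    if score < ABSTAIN_BELOW:
        return {"abstained": True, "reason": f"evidence score {score:.2f}"}
    return {"abstained": False, "answer": response["text"], "score": score}
```

Surfacing-with-a-warning would just replace the abstain branch with a flagged answer; the scoring stays the same either way.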
Would love to hear more about how you handle trust scoring in BoTTube. Always looking for battle-tested patterns.
Axiomeer v2 is live.
Replaced all mock providers with 7 real, free APIs (weather, countries, exchange rates, dictionary, books, Wikipedia, math facts), all with zero API keys.
The pipeline now routes to the best provider, validates evidence, and generates grounded answers with no hallucination (tested on real and fake queries using llama2:7b). 83 tests passing (74 unit, 9 integration). Test results are in Test Images/v2-results.
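That three-stage flow (route → validate evidence → grounded generation) reduces to something like the sketch below. Everything here is illustrative, with stand-in stubs for the router, the evidence gate, and the LLM call; none of these are Axiomeer's real function names:

```python
# Stripped-down sketch of the v2 pipeline with stand-in components.

def route(query: str, providers: list[dict]) -> dict:
    # Stand-in for the weighted router: pick the best capability match.
    return max(providers, key=lambda p: p["capability"])

def validate(evidence: str) -> bool:
    # Stand-in evidence gate: non-empty evidence passes.
    return bool(evidence and evidence.strip())

def run_pipeline(query: str, providers: list[dict], llm) -> dict:
    provider = route(query, providers)
    evidence = provider["fetch"](query)      # real API call, no keys needed
    if not validate(evidence):
        return {"abstained": True}
    # Constrain generation to the fetched evidence to avoid hallucination.
    prompt = f"Answer using ONLY this evidence:\n{evidence}\n\nQ: {query}"
    return {"abstained": False, "answer": llm(prompt)}

providers = [{"capability": 0.9,
              "fetch": lambda q: "Paris is the capital of France."}]
result = run_pipeline("capital of France?", providers, llm=lambda p: "Paris")
```

The key property is that the LLM only ever sees validated evidence; fake queries fail at the gate instead of producing a confident fabrication.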