Social Intelligence Benchmark(gertlabs.com) |
Social Intelligence Benchmark(gertlabs.com) |
Qwen3.5-9B was able to do this after a lot of time.
Qwen3.6-35B-A3B failed because it kept insisting that it needs to know n, but didn't try to figure it out by messing other agents.
Granite 4.1 9B failed completely because it was just writing non-descriptive massages like "Request to know the order" to the other agents and not replying to anyone.