Abstract: The LLM-as-a-Judge paradigm, in which large language models serve as evaluators, has shown promise across multiple tasks but remains underexplored in tool invocation scenarios, especially for ...