1. Define evaluation criteria (instruction following, completeness, tool efficiency, reasoning quality, coherence).
2. Create test cases covering simple, medium, complex, and edge-case scenarios.
3. Run a direct-scoring evaluation with an LLM-as-judge, requiring a chain-of-thought justification for each score.
4. Mitigate position bias when the judge compares responses side by side, using techniques like position swapping (judge each pair in both orders).
5. Perform human evaluation to catch edge cases and subtle misunderstandings.
6. Analyze evaluation results to identify areas for improvement.
7. Iterate on prompts, context, and agent architecture based on evaluation feedback.
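
Steps 1, 3, and 4 above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the criterion names come from step 1, but the weights, the `weighted_score` and `compare_with_swap` helpers, and the `Judge` signature are all assumptions made for the example. The judge is passed in as a plain callable so that any LLM client can sit behind it.

```python
from typing import Callable, Dict

# Criterion names follow step 1; the weights are illustrative assumptions.
CRITERIA_WEIGHTS: Dict[str, float] = {
    "instruction_following": 0.30,
    "completeness": 0.25,
    "tool_efficiency": 0.15,
    "reasoning_quality": 0.15,
    "coherence": 0.15,
}


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine per-criterion judge scores (e.g. 1-5) into one weighted average."""
    missing = set(CRITERIA_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total_weight = sum(CRITERIA_WEIGHTS.values())
    weighted_sum = sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)
    return weighted_sum / total_weight


# Hypothetical judge interface: (task, first_response, second_response)
# -> "first" or "second". In practice this would wrap an LLM call.
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, task: str, a: str, b: str) -> str:
    """Judge the pair in both orders; keep a verdict only if it survives the swap."""
    first_pass = judge(task, a, b)   # response A shown first
    second_pass = judge(task, b, a)  # response B shown first
    for verdict in (first_pass, second_pass):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")

    a_wins_first = first_pass == "first"
    a_wins_second = second_pass == "second"
    if a_wins_first and a_wins_second:
        return "A"
    if not a_wins_first and not a_wins_second:
        return "B"
    # The verdict flipped with the ordering, so treat it as position bias.
    return "tie"
```

With this design, a judge that always prefers whichever response it sees first yields `"tie"` for every pair, which makes position bias visible instead of letting it silently favor one candidate. Pairs that come back as ties are natural candidates for the human review in step 5.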