Development · Advanced

Evaluation-Driven Development for Python LLM Applications

by github · github · v1.0.0 · updated 2026-04-10


Add instrumentation, build golden datasets, write eval-based tests, run them, root-cause failures, and iterate to ensure your Python LLM application works correctly. Use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM: for example, to catch regressions after prompt changes, fix unexpected behavior, or validate output quality before shipping.

llm-testing · eval-driven-development · python · quality-assurance · llm-integration · test-automation · prompt-engineering


Skill Document

SKILL.md · eval-driven-dev/workflow

Startup checks: Upgrade the `pixie-qa` package using the appropriate package manager.

Project setup: Create a `pixie_qa` directory at the project root to store all generated files.

Code analysis: Read the user's code to identify the core LLM-calling function and its dependencies.

Run harness creation: Build a script to invoke the LLM-calling function and capture traces.

Trace capture: Run the app with representative inputs to generate real traces.

Test creation: Write eval-based tests to assess LLM response quality and agent routing decisions.

Dataset creation: Build golden datasets to compare against the LLM's output.

Test execution: Run the tests and analyze the results.
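The harness, golden-dataset, and test-execution steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `pixie-qa` API, whose interface is not shown on this page: `fake_llm`, `Trace`, `capture_trace`, `run_evals`, `pass_rate`, `save_results`, the `must_contain` field, and the `eval_results.json` filename are all hypothetical names chosen for the example. Only the `pixie_qa` output directory comes from the workflow itself.

```python
import json
import time
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable


@dataclass
class Trace:
    """One captured call: input, output, and wall-clock latency."""
    input: str
    output: str
    latency_s: float


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the app's real LLM-calling function."""
    canned = {"What is the capital of France?": "The capital of France is Paris."}
    return canned.get(prompt, "I am not sure.")


def capture_trace(llm_fn: Callable[[str], str], prompt: str) -> Trace:
    """Run harness: invoke the function once and record what happened."""
    start = time.perf_counter()
    output = llm_fn(prompt)
    return Trace(prompt, output, time.perf_counter() - start)


# Golden dataset (hypothetical schema): each input is paired with a fact
# that a correct answer must mention.
GOLDEN = [
    {"input": "What is the capital of France?", "must_contain": "Paris"},
    {"input": "What is the capital of Peru?", "must_contain": "Lima"},
]


def run_evals(llm_fn: Callable[[str], str], dataset: list) -> list:
    """Score each golden case with a case-insensitive substring check."""
    results = []
    for case in dataset:
        trace = capture_trace(llm_fn, case["input"])
        passed = case["must_contain"].lower() in trace.output.lower()
        results.append(
            {**asdict(trace), "expected": case["must_contain"], "passed": passed}
        )
    return results


def pass_rate(results: list) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0


def save_results(results: list, out_dir: str = "pixie_qa") -> Path:
    """Write results as JSON under the generated-files directory."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "eval_results.json"
    path.write_text(json.dumps(results, indent=2))
    return path


if __name__ == "__main__":
    results = run_evals(fake_llm, GOLDEN)
    print(f"pass rate: {pass_rate(results):.0%}")
    for r in results:
        if not r["passed"]:
            # Failed cases keep the full trace, which is the starting
            # point for root-causing the failure.
            print("FAILED:", r["input"], "->", r["output"])
```

A substring check is the simplest possible evaluator. Real response-quality or routing checks would likely need richer scoring, but the loop of capturing a trace, comparing it against a golden case, and recording the outcome stays the same.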


Related Skills

Similar to API Test Generator: 60%

Similar to Test Data Generator: 60%

Co-executed with Test Data Generator: 49%


Cognitive Phases (6)

1. SENSE

2. CONTEXTUALIZE

3. HYPOTHESIZE

4. EVALUATE

5. ACT

6. REFLECT

Triggers (8)

set up evals for my LLM application

add tests to my Python LLM project

create a QA pipeline for my LLM app

debug why my LLM application is failing

improve the quality of my LLM application

fix failing tests in my LLM project

benchmark my LLM application

evaluate my LLM application


Score Breakdown

Human Rating: 0%

Agent Success: 0%

Freshness: 100%

Dependency Health: 100%

Graph Centrality: 0%

Security: 48%

CompositeScore = α·Human + β·Agent + γ·Recency + δ·Deps + ε·Centrality + ζ·Security
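The composite score is a weighted sum of the six components listed above. The page does not publish the weights α through ζ, so the sketch below takes them as a parameter, and the equal weights in the example are purely an assumption for illustration. The function name `composite_score` and the dictionary keys are likewise hypothetical.

```python
from typing import Mapping

# Component keys, in the order they appear in the formula (hypothetical names).
COMPONENTS = ("human", "agent", "recency", "deps", "centrality", "security")


def composite_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of the six component scores, each expressed in [0, 1]."""
    missing = [c for c in COMPONENTS if c not in scores or c not in weights]
    if missing:
        raise KeyError(f"missing components: {missing}")
    return sum(weights[c] * scores[c] for c in COMPONENTS)


# Component values as listed in the score breakdown, converted to fractions.
listed = {
    "human": 0.0,
    "agent": 0.0,
    "recency": 1.0,
    "deps": 1.0,
    "centrality": 0.0,
    "security": 0.48,
}

# Assumed equal weights; the real weights are not given on this page.
equal_weights = {c: 1 / 6 for c in COMPONENTS}

if __name__ == "__main__":
    print(f"{composite_score(listed, equal_weights):.3f}")
```

With different weights the result changes accordingly, so the printed value should not be read as the score the registry actually assigns.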

Installation

$ synaptic mcp download eval-driven-dev

$ synaptic skills detail eval-driven-dev

$ synaptic skills live eval-driven-dev

Dependencies

pixie-qa
