Detailed step-by-step walkthrough of the skill, referencing the cognitive phases:
1. SENSE: Receive the dataset and analyze the task
```python
import pandas as pd
df = pd.read_csv("data.csv")
print(f"Shape: {df.shape}")
print(f"Target distribution:\n{df['target'].value_counts(normalize=True)}")
print(f"Missing values:\n{df.isnull().sum()}")
```
2. CONTEXTUALIZE: EDA and preprocessing
```python
from sklearn.model_selection import train_test_split
X = df.drop("target", axis=1)
y = df["target"]
# Split BEFORE any preprocessing to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
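The heading mentions EDA, but the block above only performs the split. A minimal sketch of an EDA pass restricted to the training split is shown below, so that nothing learned from the test rows influences later decisions. The toy `X_train` and `y_train` frames and the column names `f1` and `f2` are assumptions made for illustration; in the skill they would come from the split above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for X_train / y_train (hypothetical columns f1, f2)
X_train = pd.DataFrame({
    "f1": rng.normal(size=200),
    "f2": rng.normal(size=200),
})
X_train.loc[X_train.index[:20], "f2"] = np.nan  # inject 10% missing values
y_train = pd.Series((X_train["f1"] > 0).astype(int), name="target")

# Per-feature summary computed on the training split only
summary = pd.DataFrame({
    "missing_rate": X_train.isnull().mean(),
    "n_unique": X_train.nunique(),
    "corr_with_target": X_train.corrwith(y_train),
})
print(summary.sort_values("corr_with_target", key=abs, ascending=False))
```

The missing rates inform the imputation strategy, and the correlations give a rough first ranking of features. Neither replaces the cross-validated evaluation in the later phases.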
3. HYPOTHESIZE: Select and configure the pipeline
```python
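from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Negative-to-positive ratio for scale_pos_weight (assumes a binary 0/1 target)
class_ratio = (y_train == 0).sum() / (y_train == 1).sum()

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),  # optional for tree-based models
    ('model', XGBClassifier(
        n_estimators=500, learning_rate=0.05,
        max_depth=6, scale_pos_weight=class_ratio,
        # early_stopping_rounds is omitted: it needs an eval_set passed to fit,
        # which the plain pipeline.fit / cross_validate calls below do not supply
    )),
])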
```
4. EVALUATE: Cross-validation with StratifiedKFold
```python
from sklearn.model_selection import StratifiedKFold, cross_validate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    pipeline, X_train, y_train, cv=cv,
    scoring=['roc_auc', 'f1_weighted'],
    return_train_score=True,
)
print(f"AUC: {scores['test_roc_auc'].mean():.4f} ± {scores['test_roc_auc'].std():.4f}")
print(f"Overfitting gap: {scores['train_roc_auc'].mean() - scores['test_roc_auc'].mean():.4f}")
```
5. RECOMMEND: Hyperparameter tuning and export
```python
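import joblib
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'model__max_depth': trial.suggest_int('max_depth', 3, 9),
        'model__learning_rate': trial.suggest_float('lr', 1e-3, 0.3, log=True),
        'model__n_estimators': trial.suggest_int('n_est', 100, 1000),
    }
    pipeline.set_params(**params)
    return cross_val_score(pipeline, X_train, y_train, cv=3, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

# Apply the best trial's parameters (otherwise the pipeline keeps the last trial's values)
best = study.best_params
pipeline.set_params(
    model__max_depth=best['max_depth'],
    model__learning_rate=best['lr'],
    model__n_estimators=best['n_est'],
)

# Export
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, 'model.pkl')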
```
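The held-out split created in phase 2 is not used anywhere above. One possible final check, sketched here under stated assumptions, is to reload the exported artifact and score it once on the test rows. To keep the sketch self-contained, it uses synthetic data from `make_classification` and a `LogisticRegression` stand-in instead of the tuned XGBoost pipeline, and it writes to a temporary directory rather than `model.pkl`. None of these substitutions come from the original skill.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for data.csv (assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stand-in for the tuned pipeline from the phases above
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Round-trip through joblib, as the export step does
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(pipeline, path)
    reloaded = joblib.load(path)

# Score the reloaded artifact once on the untouched test split
test_auc = roc_auc_score(y_test, reloaded.predict_proba(X_test)[:, 1])
same_predictions = (reloaded.predict(X_test) == pipeline.predict(X_test)).all()
print(f"Held-out AUC: {test_auc:.4f}")
print(f"Reloaded model reproduces predictions: {same_predictions}")
```

Scoring the test split only once, after tuning is finished, keeps it an unbiased estimate. Reusing it to pick hyperparameters would reintroduce the leakage the split was meant to prevent.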
6. REFLECT: Validate and document
- Confirm there is no data leakage: every transformation must run INSIDE the Pipeline
- Verify that the production features match the training features
- Report telemetry via mcp-skillschain
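
The feature-matching check in the list above can be sketched as a small schema comparison. The helper name `check_feature_schema` and the toy frames are hypothetical and only illustrate the idea; a real check would compare `X_train` against an actual production batch.

```python
import pandas as pd


def check_feature_schema(train_df: pd.DataFrame, prod_df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    missing = [c for c in train_df.columns if c not in prod_df.columns]
    extra = [c for c in prod_df.columns if c not in train_df.columns]
    if missing:
        problems.append(f"missing in production: {missing}")
    if extra:
        problems.append(f"unexpected in production: {extra}")
    # Compare order and dtypes only when the column sets agree
    if not missing and not extra:
        if list(train_df.columns) != list(prod_df.columns):
            problems.append("column order differs")
        for col in train_df.columns:
            if train_df[col].dtype != prod_df[col].dtype:
                problems.append(
                    f"dtype mismatch in {col!r}: "
                    f"{train_df[col].dtype} vs {prod_df[col].dtype}"
                )
    return problems


# Toy frames standing in for X_train and a production batch (hypothetical columns)
train = pd.DataFrame({"age": [31, 45], "income": [52000.0, 61000.0]})
prod_ok = pd.DataFrame({"age": [29], "income": [48000.0]})
prod_bad = pd.DataFrame({"age": [29], "salary": [48000.0]})

print(check_feature_schema(train, prod_ok))
print(check_feature_schema(train, prod_bad))
```

Returning a list of problems instead of raising on the first mismatch makes it easier to log every discrepancy in one pass before deciding whether to block a deployment.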