Detailed step-by-step walkthrough of the skill, organized by cognitive phase:
1. SENSE: Load and inspect
```python
import warnings

import chardet
import pandas as pd

warnings.filterwarnings('ignore')

# Detect the encoding automatically from the raw bytes
with open('data.csv', 'rb') as f:
    encoding = chardet.detect(f.read())['encoding']

# sep=None with the python engine lets pandas sniff the delimiter
df = pd.read_csv('data.csv', encoding=encoding, sep=None, engine='python')
print(f"Shape: {df.shape}")
print(df.dtypes)
```
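Encoding detection is a statistical guess, so it can come back wrong or empty on short or ambiguous files. One possible fallback, sketched here, is to try a fixed list of candidate encodings in order. The helper name `read_csv_with_fallback`, the candidate list, and the sample bytes are illustrative assumptions and not part of the skill.

```python
import io

import pandas as pd


def read_csv_with_fallback(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order; return (DataFrame, encoding used)."""
    last_error = None
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError as exc:
            last_error = exc
            continue
        # sep=None with the python engine lets pandas sniff the delimiter
        return pd.read_csv(io.StringIO(text), sep=None, engine="python"), enc
    raise last_error


# Hypothetical sample: a semicolon-separated file saved as cp1252
raw = "cidade;valor\nSão Paulo;10\nBelém;20\n".encode("cp1252")
df_demo, used = read_csv_with_fallback(raw)
print(used)
print(df_demo)
```

Order matters in the candidate list: `latin-1` can decode any byte sequence, so it only makes sense as the last resort.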
2. CONTEXTUALIZE: Descriptive analysis
```python
# Full summary (numeric and non-numeric columns)
print(df.describe(include='all').round(2))
# Missing values, as a percentage per column
missing = df.isnull().mean().mul(100).round(1).sort_values(ascending=False)
print(missing[missing > 0])
# Duplicate rows
print(f"Duplicates: {df.duplicated().sum()} ({df.duplicated().mean()*100:.1f}%)")
```
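The separate printouts above can also be condensed into one table with a row per column, which is easier to scan on wide datasets. This is a minimal sketch; the helper `column_overview` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, % missing, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().mul(100).round(1),
        "n_unique": df.nunique(dropna=True),
    }).sort_values("missing_pct", ascending=False)


# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 40],
    "city": ["Recife", "Recife", None, None],
    "id": [1, 2, 3, 4],
})
overview = column_overview(toy)
print(overview)
```

A column whose `n_unique` equals the row count (like `id` here) is usually an identifier and a candidate for exclusion from modeling.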
3. HYPOTHESIZE: Detect outliers and patterns
```python
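import numpy as np
from scipy import stats

numeric_cols = df.select_dtypes('number').columns

# Z-score outliers; nan_policy='omit' keeps a single NaN from turning
# the whole column's z-scores into NaN
z_scores = df[numeric_cols].apply(stats.zscore, nan_policy='omit')
outliers = (z_scores.abs() > 3).sum()
print(f"Outliers per column (|Z| > 3):\n{outliers[outliers > 0]}")

# Correlations: keep only the upper triangle so each pair appears once,
# and compare absolute values so strong negative correlations count too
corr = df[numeric_cols].corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = upper.stack()
high_corr = high_corr[high_corr.abs() > 0.8].sort_values(ascending=False)
print(f"High correlations (|r| > 0.8):\n{high_corr}")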
```
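The z-score rule depends on the mean and standard deviation, which the outliers themselves distort. A common alternative, swapped in here as a complement and not part of the original skill, is the interquartile-range (IQR) fence. The helper name `iqr_outlier_counts`, the multiplier `k=1.5`, and the toy data are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # NaN compares as False on both sides, so missing values are never counted
    mask = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return mask.sum()


# Hypothetical toy data: one extreme value in x, none in y
toy = pd.DataFrame({
    "x": [10, 11, 12, 13, 14, 15, 200],
    "y": [1, 2, 3, 4, 5, 6, 7],
})
counts = iqr_outlier_counts(toy)

# For comparison: the largest absolute z-score in x (population std, ddof=0)
z_max = ((toy["x"] - toy["x"].mean()) / toy["x"].std(ddof=0)).abs().max()
print(counts)
print(round(z_max, 2))
```

In this toy sample the IQR fence flags 200, while its z-score stays below 3 because the extreme value inflates the standard deviation it is measured against. On small samples it can be worth running both checks.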
4. RECOMMEND: Generate the full report
```python
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="EDA Report", explorative=True)
profile.to_file("eda_report.html")
print("Report generated: eda_report.html")
```
5. RECOMMEND: Preprocessing pipeline
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column groups derived from df; treating object/category columns as
# categorical is an assumption, adjust it to the dataset
numeric_cols = df.select_dtypes(include='number').columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # sparse_output requires scikit-learn >= 1.2
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols.tolist()),
    ('cat', categorical_transformer, categorical_cols.tolist()),
])
```
6. REFLECT: Validate and document
Verify that no NaN remains after preprocessing. `ColumnTransformer` returns a NumPy array by default, which has no `.isna()` method, so check with `assert not np.isnan(X_transformed).any()`
Save the fitted pipeline: `import joblib; joblib.dump(preprocessor, 'preprocessing.pkl')`
Report telemetry via mcp-skillschain
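The validation and persistence checks can be sketched as a round trip: transform, assert no NaN, dump with joblib, reload, and confirm the reloaded object reproduces the same output. The small numeric pipeline and the temporary directory below are stand-ins chosen for illustration; the telemetry step is omitted because its interface is not described here.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with missing values
X_raw = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
X_transformed = pipe.fit_transform(X_raw)

# Validation: the imputer should have filled every gap
assert not np.isnan(X_transformed).any()

# Persistence: save the fitted pipeline, reload it, and compare outputs
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessing.pkl"
    joblib.dump(pipe, path)
    restored = joblib.load(path)

roundtrip_ok = np.allclose(restored.transform(X_raw), X_transformed)
print(roundtrip_ok)
```

Pickled scikit-learn objects are generally tied to the library version that wrote them, so recording that version next to the file helps when the pipeline is loaded later.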