# Data Workflow

This guide explains how data flows through the ensemble genetic algorithm pipeline, from initial input to final model evaluation. It covers supported formats, preprocessing steps, train/test splitting strategies, and cross-validation workflows.

See also: {doc}`./Configuration_Guide` for hyperparameter configuration examples that impact data handling.

---

## Data Input Formats Supported

### Primary Format: CSV (Comma-Separated Values)

The framework reads its input from a single primary data format: CSV.

#### Requirements

| Requirement | Details |
|------------|---------|
| **File extension** | `.csv` only |
| **Row limit** | None (optimized for large datasets) |
| **Encoding** | UTF-8 with BOM support |
| **Header row** | Required (first row must contain column names) |

#### Sample CSV Structure

```csv
patient_ID,age,male,factor_A,factor_B,disease_outcome_var_1
P001,45,1,2.3,0.8,0
P002,32,0,1.9,0.5,1
P003,67,1,3.1,1.2,0
```

### Alternative Input Methods

#### 1. Programmatic Data Loading

For in-memory DataFrames (for example, in notebooks):

```python
import pandas as pd
from ml_grid.pipeline.read_in import read

# Load a CSV file into a DataFrame
df = pd.read_csv("data/my_data.csv")

# Or create synthetic data for testing
from ml_grid.util.synthetic_data_generator import SyntheticDataGenerator

synth_gen = SyntheticDataGenerator(n_features=10, n_samples=100)
df = synth_gen.generate_df()
```

#### 2. Sampling for Testing

Use the `test_sample_n` and `column_sample_n` parameters:

```python
# Sample 100 rows and 5 columns for debugging
ml_grid_object = data_pipe(
    file_name="data/large_dataset.csv",
    test_sample_n=100,
    column_sample_n=5,
    # ... remaining arguments omitted
)

# Sample columns only (the outcome_var_1 column is always included)
ml_grid_object = data_pipe(
    file_name="data/large_dataset.csv",
    column_sample_n=5,  # Includes 'age', 'male' + 3 random columns
    # ... remaining arguments omitted
)
```
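
The sampling behaviour can be approximated in plain pandas. This is only a sketch: `sample_rows_and_columns` is a hypothetical helper, not part of `ml_grid`, and it assumes that the outcome column is excluded from column sampling and re-attached afterwards.

```python
import pandas as pd


def sample_rows_and_columns(df, outcome_col, test_sample_n=0, column_sample_n=0, seed=1):
    """Hypothetical helper: sample rows and feature columns, always keeping the outcome."""
    if test_sample_n and test_sample_n < len(df):
        df = df.sample(n=test_sample_n, random_state=seed)
    if column_sample_n:
        features = df.columns.drop(outcome_col)
        n_cols = min(column_sample_n, len(features))
        chosen = pd.Series(features).sample(n=n_cols, random_state=seed).tolist()
        df = df[chosen + [outcome_col]]  # outcome column is re-attached last
    return df


demo = pd.DataFrame({
    "age": range(20, 40),
    "male": [0, 1] * 10,
    "factor_A": [x / 10 for x in range(20)],
    "factor_B": [x / 5 for x in range(20)],
    "disease_outcome_var_1": [0, 1] * 10,
})
sampled = sample_rows_and_columns(
    demo, "disease_outcome_var_1", test_sample_n=8, column_sample_n=2
)
print(sampled.shape)
```

Fixing the seed keeps the debugging subset identical between runs, which makes failures easier to reproduce.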

### Data Validation

The input pipeline performs automatic validation:

| Check | Description |
|-------|-------------|
| **File existence** | Ensures CSV path is valid |
| **Numerical types** | Verifies all columns can be converted to float |
| **Outcome variable** | Checks for `outcome_var_1` suffix and binary values |
| **Missing headers** | Validates column names are present |
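
As a rough illustration, the four checks could be expressed in pandas as below. `validate_input` is a hypothetical function written for this guide, not the framework's own validator, and it reports non-numeric columns rather than rejecting the file, which is an assumption about how such columns are treated.

```python
import os
import tempfile

import pandas as pd


def validate_input(path, outcome_suffix="outcome_var_1"):
    """Hypothetical validator mirroring the four checks in the table above."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"CSV not found: {path}")
    df = pd.read_csv(path, encoding="utf-8-sig")  # utf-8-sig tolerates a BOM
    # pandas names blank header cells "Unnamed: <n>"
    unnamed = [c for c in df.columns if str(c).startswith("Unnamed:")]
    if unnamed:
        raise ValueError(f"Columns without a header name: {unnamed}")
    outcome_cols = [c for c in df.columns if str(c).endswith(outcome_suffix)]
    if not outcome_cols:
        raise ValueError(f"No column ends with '{outcome_suffix}'")
    for col in outcome_cols:
        if not set(df[col].dropna().unique()) <= {0, 1}:
            raise ValueError(f"Outcome column '{col}' is not binary")
    # A column is non-numeric if coercion to numbers introduces new NaNs
    non_numeric = []
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors="coerce")
        if converted.isna().sum() > df[col].isna().sum():
            non_numeric.append(col)
    return df, non_numeric


sample = (
    "patient_ID,age,male,factor_A,factor_B,disease_outcome_var_1\n"
    "P001,45,1,2.3,0.8,0\n"
    "P002,32,0,1.9,0.5,1\n"
    "P003,67,1,3.1,1.2,0\n"
)
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    with open(csv_path, "w", encoding="utf-8") as fh:
        fh.write(sample)
    df, non_numeric = validate_input(csv_path)

print(non_numeric)
```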

---

## Preprocessing Steps

The data pipeline automatically executes a sequence of preprocessing steps. These can be configured via the `grid_params` section in `config.yml`.

### Step 1: Initial Feature Selection

Filters columns based on configuration:

```yaml
# Configuration in config.yml
grid_params:
  # Include/exclude features based on column names containing these words
  feature_toggles: {
    "include": ["factor", "biochemical"],
    "exclude": ["ID", "date"]
  }
```
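
One plausible reading of these toggles is a substring match on column names. The sketch below assumes that exclusion takes precedence over inclusion and that the outcome column is never dropped; `filter_columns` is a hypothetical name, and the framework's actual matching rules may differ.

```python
def filter_columns(columns, include, exclude, outcome_suffix="outcome_var_1"):
    """Hypothetical filter: substring match on column names."""
    kept = []
    for col in columns:
        if col.endswith(outcome_suffix):
            kept.append(col)  # never drop the outcome column
        elif any(word in col for word in exclude):
            continue  # exclusion wins over inclusion (assumption)
        elif not include or any(word in col for word in include):
            kept.append(col)
    return kept


columns = ["patient_ID", "age", "male", "factor_A", "factor_B", "disease_outcome_var_1"]
selected = filter_columns(
    columns, include=["factor", "biochemical"], exclude=["ID", "date"]
)
print(selected)
```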

### Step 2: Correlation Filtering

Removes highly correlated features to reduce multicollinearity.

#### How It Works

1. Computes Pearson correlation matrix for all features
2. Iteratively removes one feature from each high-correlation pair (> threshold)
3. Keeps the feature with lower variance (more stable)
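
The three steps above can be sketched with pandas. This is an illustrative implementation under the stated rule (keep the lower-variance column of each pair), not the code behind `handle_correlation_matrix`; the function name `correlated_columns_to_drop` is hypothetical.

```python
import numpy as np
import pandas as pd


def correlated_columns_to_drop(df, threshold=0.95):
    """From each pair with |r| > threshold, drop the higher-variance column."""
    corr = df.corr(method="pearson").abs()
    variances = df.var()
    to_drop = []
    cols = list(df.columns)
    for i, a in enumerate(cols):
        if a in to_drop:
            continue
        for b in cols[i + 1:]:
            if b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                # Keep the lower-variance column, as step 3 describes
                loser = a if variances[a] > variances[b] else b
                to_drop.append(loser)
                if loser == a:
                    break  # a is gone, so stop comparing it
    return to_drop


rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_scaled": 3 * x + rng.normal(scale=0.01, size=200),  # near-duplicate of x
    "noise": rng.normal(size=200),
})
print(correlated_columns_to_drop(df, threshold=0.95))
```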

#### Configuration

```yaml
grid_params:
  corr: [0.95, 0.98]              # Correlation thresholds to test
```

| Threshold | Use Case |
|-----------|----------|
| `0.90` | Strict - removes the most features, leaving the fewest columns and the least multicollinearity |
| `0.95` | Standard balance for most datasets |
| `0.98` | Lenient - removes only near-duplicate features, keeps most columns but tolerates more multicollinearity |

#### Inspecting Dropped Features

To list the columns that a given threshold removes, use:

```python
from ml_grid.pipeline.data_correlation_matrix import handle_correlation_matrix

corr_drops = handle_correlation_matrix(
    local_param_dict={"corr": 0.95},
    drop_list=[],
    df=original_df
)

print(f"Dropped due to correlation: {corr_drops}")
```

### Step 3: Missing Value Handling

Columns with excessive missing data are removed.

#### Configuration

```yaml
grid_params:
  percent_missing: [100, 99, 95]  # Thresholds for column removal
```

| Threshold | Behavior |
|-----------|----------|
| `100` | Removes columns with more than 100% missing values; no column can exceed this, so filtering is effectively disabled |
| `99` | Removes columns with more than 99% missing values (only near-empty columns are dropped) |
| `95` | Most aggressive of the three - removes columns with more than 95% missing values |
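
To make the threshold semantics concrete, here is a small pandas sketch. It assumes a strict "greater than" comparison on the percentage of missing values per column; `columns_exceeding_missing` is a hypothetical helper, not a framework function.

```python
import numpy as np
import pandas as pd


def columns_exceeding_missing(df, percent_missing):
    """Return columns whose share of NaN values exceeds percent_missing (0-100)."""
    missing_pct = df.isna().mean() * 100
    return missing_pct[missing_pct > percent_missing].index.tolist()


df = pd.DataFrame({
    "complete": np.arange(200, dtype=float),
    "mostly_missing": [1.0] * 3 + [np.nan] * 197,  # 98.5% missing
    "all_missing": [np.nan] * 200,                 # 100% missing
})

for threshold in (100, 99, 95):
    print(threshold, columns_exceeding_missing(df, threshold))
```

Under this comparison a threshold of `100` can never match, which is why it acts as an off switch.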

#### Imputation Strategy

Remaining missing values are handled internally:
- **Numerical features**: Mean imputation via `sklearn.impute.SimpleImputer(strategy="mean")`
- **Base learners**: Many algorithms (XGBoost, etc.) handle NaN natively
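
A minimal sketch of mean imputation with scikit-learn follows. Fitting the imputer on the training rows only and reusing those means for the test rows is an assumption made here to avoid leakage; the guide does not state where the framework fits its imputer.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)  # learn column means from training rows
X_test_imputed = imputer.transform(X_test)        # reuse the training means on test rows

print(X_test_imputed)
```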

### Step 4: Constant Column Removal

Features with zero variance across all samples are removed.

#### Why This Matters

Constant columns provide no predictive signal and waste computational resources.

```python
# Example of constant column detection
import pandas as pd

df_with_constant = pd.DataFrame({
    "feature1": [1, 2, 3, 4],     # Variable
    "feature2": [5, 5, 5, 5],     # Constant - will be dropped
    "outcome_var_1": [0, 1, 0, 1]
})

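# Detect columns with a single unique value, then drop them
constant_cols = [c for c in df_with_constant.columns if df_with_constant[c].nunique() <= 1]
df_clean = df_with_constant.drop(columns=constant_cols)  # drops ["feature2"]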
```

### Step 5: Feature Scaling

Optionally standardizes features to zero mean and unit variance.

#### Configuration

```yaml
grid_params:
  scale: [true, false]            # Whether to apply StandardScaler
```

#### When to Use Scaling

| Scenario | Recommended |
|----------|-------------|
| Neural Networks | `scale: true` |
| SVMs | `scale: true` |
| Distance-based methods (KNN) | `scale: true` |
| Tree-based models (RF, XGBoost) | `scale: false` (not required) |
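
For reference, a short `StandardScaler` sketch is shown below. It assumes the scaler is fitted on the training split and then applied unchanged to the test split, so that test data does not influence the learned mean and variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and variance come from training data
X_test_scaled = scaler.transform(X_test)        # test data reuses those statistics

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```

Both features end up on the same scale even though the second one is two orders of magnitude larger in raw units, which is what distance-based and gradient-based learners benefit from.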

### Step 6: Feature Importance Selection

Selects top features based on importance scores.

#### Configuration

```yaml
grid_params:
  n_features: [50, 100, "all"]    # Number of features to retain
```

#### Available Methods

The framework supports multiple importance scoring methods:

| Method | Base Model | When to Use |
|--------|-----------|-------------|
| `RandomForest` | Random Forest | General-purpose, robust |
| `XGBoost` | Gradient Boosting | High-performance scenarios |
| `LogisticRegression` | Logistic Regression | Linear relationships |

#### Feature Selection Process

1. Train temporary model on all features
2. Extract importance scores
3. Rank features by importance
4. Select top `n_features`
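
The four steps above can be sketched with a Random Forest as the temporary model. `top_n_features` is a hypothetical helper and the synthetic dataset is only for demonstration; the framework's own selection code may rank or break ties differently.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_n_features(X, y, n_features, seed=1):
    """Train a temporary model, rank features by importance, keep the top n."""
    if n_features == "all" or n_features >= X.shape[1]:
        return list(X.columns)
    model = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    ranked = importances.sort_values(ascending=False)
    return ranked.head(n_features).index.tolist()


X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selected = top_n_features(X, y, n_features=3)
print(selected)
```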

---

## Train/Test Split Options

The framework implements a **stratified hold-out validation** strategy with three splits.

### Standard Stratified Split

```mermaid
graph LR
    A[100% Data] --> B[75% Training Set]
    A --> C[25% Validation _orig]
    B --> D[75% of 75% = ~56.3% Training]
    B --> E[25% of 75% = ~18.8% Test]
    
    style D fill:#c9f7c9
    style E fill:#ffd7b5
    style C fill:#ffcccc
    
    subgraph "Final Splits"
        D[X_train]
        E[X_test]
        C[X_test_orig]
    end
```

#### Data Split Naming Convention

| Split | Purpose | Usage |
|-------|---------|-------|
| `X_train`, `y_train` | Training new models | During GA evolution |
| `X_test`, `y_test` | Evaluation during GA | Fitness calculation |
| `X_test_orig`, `y_test_orig` | Final validation | Unseen data evaluation |

#### Split Statistics

For a 1,000-sample dataset:
- **Training**: ~563 samples (used for GA training)
- **Testing**: ~188 samples (for GA fitness evaluation)
- **Validation** (`_orig`): ~250 samples (final hold-out)
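
These proportions come from applying a 75/25 split twice. The sketch below reproduces the arithmetic with scikit-learn on a synthetic 1,000-row dataset; the intermediate names `X_rest` and `y_rest` are illustrative, and exact counts can differ by one sample from the rounded figures above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0, 1] * 500)

# First split: hold out 25% as the final validation set (_orig)
X_rest, X_test_orig, y_rest, y_test_orig = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
# Second split: divide the remaining 75% into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest
)

print(len(X_train), len(X_test), len(X_test_orig))
```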

### Resampling Strategies

Controlled via the `resample` parameter, which accepts `null`, `"undersample"`, or `"oversample"`.

#### 1. No Resampling (`null`)

Standard stratified split with class distribution preserved:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
```

**Use Cases**: Balanced datasets, no class imbalance issues

#### 2. Undersampling (`"undersample"`)

Reduces majority class before splitting:

```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)

# Then perform stratified split
```

**Use Cases**: Severe class imbalance (e.g., 1% positive class)

#### 3. Oversampling (`"oversample"`)

Duplicates randomly selected minority-class samples (random oversampling via `RandomOverSampler`, not SMOTE) after the train/test split:

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Split first to prevent data leakage
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Oversample ONLY the training set
ros = RandomOverSampler(random_state=1)
X_train_orig, y_train_orig = ros.fit_resample(X_train_orig, y_train_orig)
```

**Use Cases**: Moderate imbalance, want to preserve all original data

---

## Cross-Validation Workflows

While the main GA uses a single hold-out validation set for efficiency, cross-validation can be integrated for final model evaluation.

### Standard CV Strategy (For Final Evaluation)

After finding the best ensemble via GA, evaluate it with repeated 10-fold CV (3 repeats):

```python
from sklearn.model_selection import RepeatedKFold, cross_validate

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

scores = cross_validate(
    best_ensemble_model,
    X_train,
    y_train,
    cv=cv,
    scoring=['roc_auc', 'accuracy', 'precision', 'recall'],
    n_jobs=-1
)
```

### Genetic Algorithm-Specific CV Integration

The GA fitness function can incorporate CV scores for more robust evaluation:

| Approach | Description | Trade-offs |
|----------|-------------|------------|
| **Single split** (default) | Use hold-out validation for speed | Faster but noisier estimate |
| **K-fold CV in fitness** | Average K splits per individual | More stable estimate but roughly K times slower |
| **3-fold CV only for best N** | Use CV only on top candidates | Balance of accuracy/speed |
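
To illustrate the trade-off, the sketch below contrasts a single-split fitness with a K-fold fitness. The function names `fitness_single_split` and `fitness_kfold` are hypothetical, and a logistic regression stands in for a GA individual; the framework's real fitness function is not shown in this guide.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split


def fitness_single_split(model, X_train, y_train, X_test, y_test):
    """One fit, one AUC: fast, but the score depends on a single split."""
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


def fitness_kfold(model, X, y, k=3, seed=1):
    """k fits, mean AUC: steadier estimate at roughly k times the cost."""
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()


X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
model = LogisticRegression(max_iter=1000)

auc_single = fitness_single_split(model, X_train, y_train, X_test, y_test)
auc_cv = fitness_kfold(model, X_train, y_train, k=3)
print(f"single split AUC: {auc_single:.3f}, 3-fold AUC: {auc_cv:.3f}")
```

Applying the K-fold variant only to the top candidates of each generation keeps most of the speed of the single split while reducing the chance of promoting an individual that scored well by luck.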

### Parameter Grid Cross-Validation

After GA finds optimal base learners, fine-tune hyperparameters via GridSearchCV:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from ml_grid.pipeline.grid_search_cross_validate import grid_search_crossvalidate

# Hyperparameter grid to search for the selected base learner
best_params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15]
}

grid_search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=best_params,
    cv=5,
    scoring='roc_auc'
)

grid_search.fit(X_train, y_train)
```

### Cross-Validation Visualization

```mermaid
graph TD
    A[All Data] --> B[K-Fold Split]
    
    subgraph Fold 1
        B --> C1[Train set]
        B --> D1[Test set - Fold 1]
        C1 --> E1[Model Training]
        E1 --> F1[Predictions]
        F1 --> G1[AUC Score]
    end
    
    subgraph Fold 2
        B --> C2[Train set]
        B --> D2[Test set - Fold 2]
        C2 --> E2[Model Training]
        E2 --> F2[Predictions]
        F2 --> G2[AUC Score]
    end
    
    subgraph ... more folds
    end
    
    G1 --> H[Average AUC across K folds]
    G2 --> H
    
    style H fill:#c9f7c9,stroke:#333
```

---

## Data Pipeline Diagrams

### Complete Data Flow

```mermaid
flowchart TB
    subgraph "Input Layer"
        A[CSV File] --> B[data_pipe]
        B --> C{Sampling enabled?}
        C -->|Yes| D[test_sample_n / column_sample_n]
        C -->|No| E[Full dataset load]
    end
    
    subgraph "Preprocessing Pipeline"
        D --> F[Initial features]
        E --> F
        
        F --> G[Drop correlated features<br/>corr threshold]
        G --> H[Drop missing values<br/>percent_missing]
        H --> I[Remove other outcomes]
        I --> J[Drop constant columns]
    end
    
    subgraph "Safety Net"
        J --> K{Any features left?}
        K -->|No| L[Retain minimal set]
        K -->|Yes| M[Use filtered set]
    end
    
    subgraph "Train/Test Split"
        L --> N[75/25 split]
        M --> N
        
        N --> O[X_train / y_train]
        N --> P[X_test / y_test]
        N --> Q[X_test_orig / y_test_orig]
        
        O --> R{Resample?}
        R -->|undersample| S[Under-sample all]
        R -->|oversample| T[Over-sample train only]
        R -->|null| U[No resampling]
    end
    
    subgraph "Final Processing"
        S --> V[Post-split cleaning]
        T --> V
        U --> V
        
        V --> W{Scale?}
        W -->|Yes| X[StandardScaler]
        W -->|No| Y[Skip scaling]
        
        X --> Z[Feature selection if requested]
        Y --> Z
        
        Z --> AA[X_train_final]
        Z --> AB[X_test_final]
        Z --> AC[X_test_orig_final]
    end
    
    style A fill:#e1f5fe,stroke:#333
    style F fill:#fff9c4,stroke:#333
    style O fill:#b9f6ca,stroke:#333
    style P fill:#ffd7b5,stroke:#333
    style Q fill:#ffcccc,stroke:#333
```

### Feature Transformation Log

The pipeline maintains a detailed log of feature changes:

| Step | Features Before | Features After | Removed | Reason |
|------|----------------|----------------|---------|--------|
| Initial Load | 100 | 100 | 0 | Base count |
| Drop Correlated | 100 | 92 | 8 | corr > 0.95 |
| Drop Missing | 92 | 90 | 2 | >99% missing |
| Drop Other Outcomes | 90 | 90 | 0 | Not present |
| Drop Constants | 90 | 85 | 5 | Zero variance |
| Split | 85 | 85 | 0 | No feature removal |
| Post-Split Clean | 85 | 83 | 2 | Became constant after split |
| Final Features | - | **83** | **17** | Total |
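
A log of this shape is straightforward to keep by recording the column count before and after each step. The `FeatureLog` class below is a hypothetical illustration, not the framework's logging code; it replays the non-zero rows of the table above.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureLog:
    """Hypothetical log of feature counts per pipeline step."""
    rows: list = field(default_factory=list)

    def record(self, step, before, after, reason):
        self.rows.append({
            "step": step,
            "before": before,
            "after": after,
            "removed": before - after,
            "reason": reason,
        })

    def total_removed(self):
        return sum(row["removed"] for row in self.rows)


log = FeatureLog()
log.record("Drop Correlated", 100, 92, "corr > 0.95")
log.record("Drop Missing", 92, 90, ">99% missing")
log.record("Drop Constants", 90, 85, "Zero variance")
log.record("Post-Split Clean", 85, 83, "Became constant after split")

print(log.total_removed(), log.rows[-1]["after"])
```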

---

## Summary

This data workflow guide covered:
1. **Input formats**: CSV requirements and alternative loading methods
2. **Preprocessing steps**: Correlation filtering, missing value handling, scaling
3. **Train/test splits**: Stratified hold-out with 3-way split options
4. **Resampling strategies**: Undersample/oversample configuration
5. **Cross-validation workflows**: Integration for final evaluation

The data pipeline is designed to be fully automated and configurable via `config.yml`, making it easy to adapt to different dataset characteristics while maintaining reproducibility through random state management.

See {doc}`./Implementation_Guide` for a complete implementation walkthrough with setup, execution examples, and result interpretation.