Data Workflow Guide

Learn how Notebook, ML Experiments, Orchestration, Lineage, and Metadata work together as a complete data platform.

The Data Platform Workflow

graph TD
    A["EXPLORE<br/>━━━━━<br/>Notebook<br/>Load & Analyze"] --> B["EXPERIMENT<br/>━━━━━━<br/>ML Experiments<br/>Train Models"]
    B --> C["AUTOMATE<br/>━━━━━━<br/>Orchestration<br/>Run Pipelines"]
    C --> D["TRACK<br/>━━━━<br/>Lineage + Metadata<br/>Monitor Flow"]
    D --> E["SHARE<br/>━━━━<br/>Documentation<br/>Collaborate"]
    E -.-> A
    
    style A fill:#1f97d4,stroke:#0b5394,stroke-width:2px,color:#fff
    style B fill:#ff9900,stroke:#ec7211,stroke-width:2px,color:#fff
    style C fill:#37475a,stroke:#1f1f1f,stroke-width:2px,color:#fff
    style D fill:#1f97d4,stroke:#0b5394,stroke-width:2px,color:#fff
    style E fill:#ff9900,stroke:#ec7211,stroke-width:2px,color:#fff

Phase 1: Explore Data in Notebook

Start here when:

  • You have a new dataset
  • You want to understand patterns in the data
  • You are testing hypotheses
  • You are prototyping an analysis

Notebook Is Your Lab

What you do:
✓ Load data from databases
✓ Write analysis code
✓ Visualize findings
✓ Test approaches
✓ Iterate quickly

Tools:
- Python/SQL cells
- Instant feedback
- Live visualizations
- Easy data exploration

Example: Customer Analysis

# Cell 1: Load customer data
# `connection` is assumed to be an open database connection created earlier
import pandas as pd
customers = pd.read_sql("""
  SELECT * FROM customers
  WHERE created_at > '2024-01-01'
""", connection)

# Cell 2: Explore
print(f"Total customers: {len(customers)}")
print(customers.head())

# Cell 3: Analyze
# Share of customers whose status is 'inactive', used here as a churn proxy
churn_rate = (customers['status'] == 'inactive').mean() * 100
print(f"Churn rate: {churn_rate:.1f}%")

# Cell 4: Visualize
import matplotlib.pyplot as plt
customers.groupby('region')['revenue'].sum().plot(kind='bar')
plt.show()

Next step: If analysis works and is valuable → Move to ML Experiments

Phase 2: Track Experiments

Move here when:

  • Your analysis is solid
  • You want to try variations
  • You need to compare results
  • You are building a model

ML Experiments Track Your Progress

What you do:
✓ Try multiple approaches
✓ Log metrics from each run
✓ Compare performance
✓ Keep best version
✓ Document what worked

Tools:
- Parameter logging
- Metric tracking
- Automatic comparison
- Model versioning

Example: Building Customer Churn Model

from credvault import experiment

# Start experiment
exp = experiment.start('Churn Prediction')

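# Try approach 1: Simple logistic regression
# (train_logistic_regression, train_random_forest, evaluate, and data are placeholders for your own code)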
exp.log_params({'model': 'logistic_regression', 'features': 10})
model1 = train_logistic_regression(data)
accuracy1 = evaluate(model1)
exp.log_metric('accuracy', accuracy1)

# Try approach 2: Random forest with more features
exp.log_params({'model': 'random_forest', 'features': 25})
model2 = train_random_forest(data)
accuracy2 = evaluate(model2)
exp.log_metric('accuracy', accuracy2)

# Compare in ML Experiments dashboard
# Decide: Random forest (92% accuracy) > Logistic (87%)
exp.save_as_model('v1.0-production')

Results in ML Experiments:

Run 1: Accuracy 87% ← Logistic Regression
Run 2: Accuracy 92% ← Random Forest (BEST)
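
The comparison above can be sketched in plain Python. This is a minimal illustration and not the ML Experiments API: the `Run` class and the `best_run` helper are hypothetical names, and the two accuracy values are the ones from the example.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: the parameters tried and the metrics observed."""
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


def best_run(runs, metric):
    """Return the run with the highest value for `metric`."""
    scored = [r for r in runs if metric in r.metrics]
    if not scored:
        raise ValueError(f"no run logged the metric {metric!r}")
    return max(scored, key=lambda r: r.metrics[metric])


# Hypothetical in-memory copies of the two runs from the example.
runs = [
    Run("run-1", {"model": "logistic_regression", "features": 10}, {"accuracy": 0.87}),
    Run("run-2", {"model": "random_forest", "features": 25}, {"accuracy": 0.92}),
]

winner = best_run(runs, "accuracy")
print(f"Best: {winner.name} ({winner.params['model']})")
```

Keeping parameters and metrics together per run is what makes the comparison fair: each score stays attached to the configuration that produced it.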

Next step: Model works well → Automate it with Orchestration

Phase 3: Automate with Orchestration

Move here when:

  • The model or analysis is finalized
  • It needs to run regularly
  • It processes many datasets
  • It involves multiple steps

Orchestration Runs on Schedule

What you do:
✓ Define workflow steps
✓ Schedule daily/hourly
✓ Handle dependencies
✓ Manage retries
✓ Monitor execution

Pipeline:
1. Extract data from databases
2. Transform and validate
3. Run trained model
4. Load results
5. Generate reports
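
To make "handle dependencies" and "manage retries" concrete, here is a minimal sketch in plain Python. It is not how Orchestration is implemented: `run_with_retries`, `run_pipeline`, the toy step functions, and the `days_inactive` field are all hypothetical, and only the first three pipeline steps are shown.

```python
import time


def run_with_retries(step, max_attempts=3, delay_seconds=0.0):
    """Call `step()` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds)


def run_pipeline(steps):
    """Run (name, function) pairs in order, feeding each output to the next step."""
    output = None
    for name, func in steps:
        output = run_with_retries(lambda: func(output))
        print(f"{name}: ok")
    return output


attempts = {"extract": 0}


def extract(_):
    # Fails once to show the retry path, then returns two toy rows.
    attempts["extract"] += 1
    if attempts["extract"] == 1:
        raise ConnectionError("temporary database error")
    return [{"customer_id": 1, "days_inactive": 40},
            {"customer_id": 2, "days_inactive": 3}]


def validate(rows):
    # Drop rows without a customer_id.
    return [r for r in rows if r.get("customer_id") is not None]


def predict(rows):
    # Stand-in for the trained model: flag customers inactive for over 30 days.
    return [{**r, "risk_level": "high" if r["days_inactive"] > 30 else "low"}
            for r in rows]


predictions = run_pipeline([("extract", extract), ("validate", validate), ("predict", predict)])
```

Each step receives the previous step's output, so a step runs only after the step it depends on has succeeded.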

Example: Daily Churn Prediction Pipeline

9:00 AM: Pipeline starts
  ↓
Step 1: Extract fresh customer data (5 min)
  ↓
Step 2: Validate data quality (2 min)
  ↓
Step 3: Run churn model (3 min)
  ↓
Step 4: Load predictions to database (1 min)
  ↓
Step 5: Generate report & send to team (1 min)
  ↓
9:12 AM: Done! Results ready
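
The finish time follows from adding up the step durations. A short check of that arithmetic, using the durations from the timeline (the calendar date is an arbitrary example; only the time of day matters):

```python
from datetime import datetime, timedelta

# Step durations in minutes, taken from the timeline above.
durations = {"extract": 5, "validate": 2, "predict": 3, "load": 1, "report": 1}

start = datetime(2024, 6, 12, 9, 0)  # example date, 9:00 AM start
clock = start
for step, minutes in durations.items():
    clock += timedelta(minutes=minutes)
    print(f"{clock:%H:%M}  {step} finished")

total_minutes = sum(durations.values())
print(f"Total: {total_minutes} minutes")
```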

In Orchestration:

name: Daily Churn Prediction

schedule: "0 9 * * *"  # 9 AM every day

steps:
  - name: extract
    query: SELECT * FROM customers WHERE active = true
    
  - name: validate
    checks:
      - no_nulls: email
      - unique: customer_id
    
  - name: predict
    model: churn_v1.0
    input: extract.output
    
  - name: load
    destination: predictions_daily
    input: predict.output
    
  - name: report
    template: churn_daily_report
    recipients: [team@company.com]
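
The two checks in the validate step can be read as follows. This is a dependency-free sketch of what `no_nulls` and `unique` might mean, not the platform's own validator; the helper names and sample rows are hypothetical.

```python
def check_no_nulls(rows, column):
    """Return the rows where `column` is missing or None."""
    return [r for r in rows if r.get(column) is None]


def check_unique(rows, column):
    """Return the values of `column` that appear more than once."""
    seen, duplicates = set(), set()
    for r in rows:
        value = r.get(column)
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates


# Toy rows with one null email and one repeated customer_id.
rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "c@example.com"},
]

null_emails = check_no_nulls(rows, "email")
duplicate_ids = check_unique(rows, "customer_id")
if null_emails or duplicate_ids:
    print(f"validation failed: {len(null_emails)} null email(s), "
          f"duplicate ids {sorted(duplicate_ids)}")
```

Returning the offending rows and values, rather than only pass or fail, makes a failed run easier to debug from the pipeline logs.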

Next step: Pipeline running → Track data flow with Lineage

Phase 4: Track with Lineage

Use alongside Orchestration to:

  • Understand data flow
  • Debug data issues
  • See impact of changes
  • Ensure quality

Lineage Maps Your Data Journey

Raw Customer DB
    ↓ (Extract)
Cleaned Data
    ↓ (Transform)
Features Table
    ↓ (ML Model)
Predictions
    ↓ (Load)
Production Ready
    ↓ (Reports & Dashboards)
Business Insights

Example: Churn Prediction Lineage

Production Database
├─ customers table
│   └─ Extract step (daily)
│       ↓
├─ features_cleaned
│   └─ Transform step (validation)
│       ↓
├─ model_churn_v1.0
│   └─ Train (from ML Experiments)
│       ↓
├─ predictions
│   └─ Score step (daily 9 AM)
│       ↓
└─ churn_risk_report
    └─ Report generation
        └─ Sent to Sales team

Questions Lineage Answers:

"Why are predictions different this week?"
→ Check lineage: Data source changed? Transform updated?

"Which reports use customer data?"
→ Check lineage: Trace forward from customers table

"If I delete this column, what breaks?"
→ Check lineage: See all downstream dependencies

"Where did this number come from?"
→ Check lineage: Trace backward to source
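
Tracing forward and backward are both graph walks. The sketch below models the churn example as a small dependency graph in plain Python. It is an illustration, not the Lineage API: `trace` and `invert` are hypothetical helpers, and treating the model as a direct input to `predictions` is an assumption about how the example's assets connect.

```python
from collections import deque

# Edges point downstream: asset -> assets built from it (names from the example above).
DOWNSTREAM = {
    "customers": ["features_cleaned"],
    "features_cleaned": ["predictions"],
    "model_churn_v1.0": ["predictions"],
    "predictions": ["churn_risk_report"],
    "churn_risk_report": [],
}


def trace(graph, start):
    """Breadth-first walk returning every asset reachable from `start`."""
    found, queue, seen = [], deque([start]), {start}
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append(nxt)
    return found


def invert(graph):
    """Flip every edge so the same walk traces upstream instead."""
    upstream = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            upstream.setdefault(target, []).append(source)
    return upstream


# "If I change the customers table, what is affected?"
impacted = trace(DOWNSTREAM, "customers")

# "Where did this report's numbers come from?"
sources = trace(invert(DOWNSTREAM), "churn_risk_report")

print("Downstream of customers:", impacted)
print("Upstream of churn_risk_report:", sources)
```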

Phase 5: Document in Metadata

Document everything for team reuse:

  • What data exists
  • How to use it
  • Who owns it
  • Data quality
  • Business meaning

Metadata Catalog

Asset: churn_predictions
Owner: Data Science Team
Description: Daily churn risk scores for all customers
Update frequency: Daily at 9:15 AM

Schema:
- customer_id: Unique customer identifier
- churn_risk: Probability (0-100%)
- risk_level: Category (low/medium/high)

Quality SLA: 99% completeness
Last updated: 2024-06-12 09:15 UTC

Used by:
- Sales Dashboard
- Customer Retention Team
- Marketing Campaigns

Related:
- churn_model_v1.0 (ML model)
- customers (source table)
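
A catalog entry like this one can also be kept as structured data, which allows the quality SLA to be checked automatically. A minimal sketch, assuming "completeness" means the share of non-missing cells; the `CatalogEntry` class, the `completeness` helper, and the sample rows are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """A catalog record mirroring the fields of the entry above."""
    asset: str
    owner: str
    description: str
    schema: dict
    completeness_sla: float  # required fraction of non-missing values, e.g. 0.99
    used_by: list = field(default_factory=list)


def completeness(rows, columns):
    """Fraction of cells across `columns` that are not None."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(1 for r in rows for c in columns if r.get(c) is not None)
    return filled / total


entry = CatalogEntry(
    asset="churn_predictions",
    owner="Data Science Team",
    description="Daily churn risk scores for all customers",
    schema={
        "customer_id": "Unique customer identifier",
        "churn_risk": "Probability (0-100%)",
        "risk_level": "Category (low/medium/high)",
    },
    completeness_sla=0.99,
    used_by=["Sales Dashboard", "Customer Retention Team", "Marketing Campaigns"],
)

# Toy rows: one missing risk_level out of six cells.
rows = [
    {"customer_id": 1, "churn_risk": 12.5, "risk_level": "low"},
    {"customer_id": 2, "churn_risk": 81.0, "risk_level": None},
]

score = completeness(rows, list(entry.schema))
meets_sla = score >= entry.completeness_sla
print(f"{entry.asset}: completeness {score:.1%}, meets SLA: {meets_sla}")
```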

Complete Example: End-to-End

Week 1: Exploration Phase

Monday: Load customer data in Notebook
Tuesday: Analyze churn patterns
Wednesday: Test different models
Thursday: Build simple visualizations
Friday: Present findings

Week 2: Experimentation Phase

Monday: Set up ML Experiments
Tuesday-Thursday: Try 15 different model approaches
Friday: Compare all runs, select best

Week 3: Production Phase

Monday: Build Orchestration pipeline
Tuesday: Test pipeline end-to-end
Wednesday: Run in production
Thursday: Verify results in database
Friday: Set up dashboards from results

Week 4: Operations Phase

Every day: Pipeline runs at 9 AM
Monitor: Check Lineage for data quality
Update: Metadata with new insights
Share: Results with stakeholders

Common Workflows

Scenario 1: Improving an Existing Model

Current: ML model runs daily, accuracy 85%

Step 1 (Notebook):
- Load current predictions
- Analyze errors
- Identify patterns
- Test improvements

Step 2 (ML Experiments):
- Try new feature engineering
- Test new algorithm
- Compare accuracy
- New approach: 92%

Step 3 (Orchestration):
- Update pipeline with new model
- Test end-to-end
- Deploy when ready
- Keep old as fallback

Step 4 (Lineage):
- Monitor data flows
- Verify impact across dashboards
- Document changes

Step 5 (Metadata):
- Update model documentation
- Note improvement (85% → 92%)
- Record change date
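
"Keep old as fallback" in Step 3 can be as simple as wrapping the new model's call. The sketch below is one possible shape, not a platform feature: `predict_with_fallback`, both stand-in models, and the `days_inactive` feature are hypothetical.

```python
def predict_with_fallback(primary, fallback, rows):
    """Score `rows` with `primary`; if it raises, use `fallback` instead."""
    try:
        return primary(rows), "primary"
    except Exception as exc:
        print(f"primary model failed ({exc!r}); using fallback")
        return fallback(rows), "fallback"


def old_model(rows):
    # Stand-in for the current model: a fixed low-risk score per row.
    return [0.1 for _ in rows]


def new_model(rows):
    # Stand-in for the new model; it needs a feature the old one did not.
    return [min(1.0, r["days_inactive"] / 60) for r in rows]


rows_ok = [{"days_inactive": 30}]
rows_missing_feature = [{}]  # the new feature is absent, so new_model raises KeyError

scores, used = predict_with_fallback(new_model, old_model, rows_ok)
scores_fb, used_fb = predict_with_fallback(new_model, old_model, rows_missing_feature)
```

Recording which model produced each batch (the second return value) gives Lineage and Metadata something concrete to track when the fallback is used.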

Scenario 2: Adding a New Data Source

Business Need: Include recent transactions in churn model

Step 1 (Notebook):
- Load transactions data
- Explore patterns
- Test impact on model

Step 2 (ML Experiments):
- Retrain with new feature
- Compare old vs new
- Validate improvement

Step 3 (Orchestration):
- Add new query to pipeline
- Extract transactions
- Join with existing features

Step 4 (Lineage):
- See new data source
- Verify it flows to predictions
- Check downstream impact

Step 5 (Metadata):
- Document transactions table
- Add to data catalog
- Explain usage

Scenario 3: Debugging Data Issues

Problem: Dashboard shows wrong numbers

Step 1 (Metadata):
- Find data source
- Check last update time
- See data quality metrics

Step 2 (Lineage):
- Trace backward from dashboard
- Find transformation that feeds it
- Check for recent changes

Step 3 (Orchestration):
- Check pipeline logs
- See if step failed
- Look at error messages

Step 4 (Notebook):
- Load raw data
- Verify manually
- Test transformations
- Find the issue

Step 5 (Fix & Rerun):
- Fix code
- Rerun pipeline
- Verify results
- Update documentation

Best Practices Across Phases

Notebook Phase

  • ✓ Use descriptive cell names
  • ✓ Document your thinking
  • ✓ Keep experiments organized
  • ✓ Save reusable code snippets
  • ✗ Don't push notebook code to production directly

Experiments Phase

  • ✓ Log all parameters
  • ✓ Compare fairly (same data)
  • ✓ Save model artifacts
  • ✓ Document assumptions
  • ✗ Don't use a model in production until it has been tested

Orchestration Phase

  • ✓ Test thoroughly first
  • ✓ Set up monitoring
  • ✓ Plan for failures
  • ✓ Document workflow
  • ✗ Don't change production pipelines without testing

Lineage & Metadata Phase

  • ✓ Keep documentation current
  • ✓ Document changes
  • ✓ Tag data appropriately
  • ✓ Set quality rules
  • ✗ Don't let documentation lag behind code