Getting Started

Basic Setup

from empowerspark.target.type.dynamics_v2 import Dynamics

# Initialize client
dynamics_client = Dynamics()

# Configure database and log table names
dynamics_client.__artifact_db__ = "empower_ddu"
dynamics_client.__log_table_name__ = "batch_log"

# Configure the client using the KeyVault secret name (see "Secret Configuration in KeyVault" below)
dynamics_client.configure("dataverse-connection-string")

Secret Configuration in KeyVault

The base secret should contain:

{
    "dynamics_url": "<your-dynamics-url>",
    "azure_tenant": "<tenant-id>",
    "client_id": "<client-id>",
    "client_secret": "<client-secret>",
    "dynamics_erp": "<erp-name>",
    "extra_clients_secret_name": "<name-of-secret-containing-extra-clients>"
}

Additional service principals should be stored in a separate secret with the following format:

OAuthClientID=<client-id>;OAuthClientSecret=<client-secret>;
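
As an illustration, the snippet below retrieves and parses both secrets. It assumes a Databricks workspace with a KeyVault-backed secret scope; the scope name keyvault is hypothetical, and the parsing simply follows the key=value;... format shown above:

import json

SCOPE = "keyvault"  # hypothetical KeyVault-backed secret scope name

# The base secret is a JSON document with the fields shown above
# (dbutils is available in Databricks notebooks without an import)
base = json.loads(dbutils.secrets.get(scope=SCOPE, key="dataverse-connection-string"))

# The extra-clients secret is a flat string of alternating
# OAuthClientID=...;OAuthClientSecret=...; pairs
raw = dbutils.secrets.get(scope=SCOPE, key=base["extra_clients_secret_name"])
tokens = [t for t in raw.split(";") if t]
extra_clients = [
    {
        "client_id": tokens[i].split("=", 1)[1],
        "client_secret": tokens[i + 1].split("=", 1)[1],
    }
    for i in range(0, len(tokens), 2)
]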

Worker Count Calculation

Worker count should be configured based on the number of service principals:

  • Base formula: number_of_service_principals * 50
  • Each service principal can handle up to 52 concurrent connections, so a multiplier of 50 leaves a small safety margin
  • Example: with 32 service principals, use 32 * 50 = 1,600 workers (sketched below)
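
A minimal sketch of this calculation, assuming the result feeds the advanced.performance.num_workers setting listed under Core Configuration Settings:

# Workers per service principal: 50 (each supports up to 52 concurrent
# connections, so 50 leaves a small safety margin)
WORKERS_PER_SERVICE_PRINCIPAL = 50

num_service_principals = 32
num_workers = num_service_principals * WORKERS_PER_SERVICE_PRINCIPAL  # 1,600

config = {"advanced.performance.num_workers": num_workers}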

Important Notes

  1. Always re-initialize the dynamics client between loads to prevent connection issues (a usage sketch follows after this list):

    dynamics_client = get_dynamics_client(secret_name, config, num_of_dynamics_clients)

  2. Base configuration example:

    base_config = {
        "advanced.performance.batch_size": 50_000,
        "advanced.columns.exclude_columns": ["owningteam_id", "owningteam_logicalname"]
    }
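
For example, a multi-table load might follow the pattern below. The tables list and load_table helper are hypothetical; the point is that get_dynamics_client (with the arguments from note 1 above) is called fresh before every load rather than reused:

tables = ["account", "contact"]  # hypothetical list of target tables

for table in tables:
    # Re-initialize the client before each load to avoid stale connections
    dynamics_client = get_dynamics_client(secret_name, base_config, num_of_dynamics_clients)
    load_table(dynamics_client, table)  # hypothetical per-table load helper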

Core Configuration Settings

These are the most commonly used settings:

| Setting                                        | Description                     | Default | Notes                     |
|------------------------------------------------|---------------------------------|---------|---------------------------|
| advanced.performance.batch_size                | Number of records per batch     | 10,000  | Recommended: 50,000       |
| advanced.performance.num_workers               | Number of concurrent operations | 32      | Calculate based on SPs    |
| advanced.performance.id_based_strategy_enabled | Use ID-based upsert strategy    | False   | Recommended when possible |
| advanced.debug.log_queries                     | Enable query logging            | False   | For troubleshooting       |
| advanced.debug.sampling_enabled                | Enable data sampling            | False   | For testing               |
| advanced.debug.sample_size                     | Number of records to sample     | 1,000   | Used with sampling        |
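
Putting the recommended values together, a reasonable starting configuration might look like this; the worker count assumes 32 service principals, per the formula above:

config = {
    "advanced.performance.batch_size": 50_000,              # recommended over the 10,000 default
    "advanced.performance.num_workers": 32 * 50,            # 32 service principals * 50
    "advanced.performance.id_based_strategy_enabled": True, # when an ID is available
}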

Logging and Monitoring

Core Log Tables

  1. Batch Log Table (batch_log)

    • Contains execution metrics for each batch
    • Key columns:
      • run_id: Unique identifier for the run
      • partition_id: Batch identifier
      • success_inserts/updates: Successful operations count
      • failed_inserts/updates: Failed operations count
      • skipped: Records that didn't need updating
      • processing_time: Execution duration
  2. Query Log Table (<table_name>_query_log)

    • Records executed SQL queries
    • Useful for debugging data changes
  3. Batch Mapping Table (<table_name>_batch_mapping)

    • Maps records to batches
    • Used for tracking record processing

Example Log Query

SELECT
    run_id,
    SUM(partition_size) AS total_processed,
    SUM(success_inserts) AS success_inserts,
    SUM(failed_inserts) AS failed_inserts,
    MIN(start_time) AS start_time
FROM empower_ddu.batch_log
GROUP BY run_id
ORDER BY MIN(start_time) DESC
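
To drill into a single run, a query along the following lines lists the partitions that saw failures. It assumes failed_updates follows the failed_inserts/updates naming noted above; <run_id> is a placeholder:

SELECT
    partition_id,
    failed_inserts,
    failed_updates,
    skipped,
    processing_time
FROM empower_ddu.batch_log
WHERE run_id = '<run_id>'
  AND (failed_inserts > 0 OR failed_updates > 0)
ORDER BY failed_inserts + failed_updates DESC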

Operation Modes

Available Modes

  1. Insert

    • Adds new records only
    • Fastest operation
    • Use when data is guaranteed new
  2. Upsert

    • Updates existing records and inserts new ones
    • More complex operation
    • Two strategies (see the sketch after this list):
      • ID-based (preferred when possible)
      • Business key-based
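
Conceptually, the difference between the two upsert strategies can be sketched as follows. This is an illustration only, not the package's internal logic; record_id and business_key are hypothetical field names:

def choose_operation(record, existing_ids, existing_keys, id_based=True):
    """Illustrative only: decide insert vs. update for a single record."""
    if id_based:
        # ID-based: match on the record's unique ID (faster, preferred when available)
        return "update" if record["record_id"] in existing_ids else "insert"
    # Business key-based: match on a combination of business columns
    return "update" if record["business_key"] in existing_keys else "insert"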

Processing Stages

  1. Pre-processing

    • Data normalization
    • Key validation
    • Record classification
  2. Processing

    • Batch execution
    • Error handling
    • Retry logic
  3. Post-processing

    • Logging
    • Cleanup
    • Metrics collection

Best Practices

  1. ID Strategy Usage

    config = {
        "advanced.performance.id_based_strategy_enabled": True
    }
    
    • Faster than business key strategy
    • Reduces database load
    • Use when ID is available
  2. Batch Size Optimization

    • Start with 50,000 records per batch
    • Monitor performance metrics
    • Adjust based on error rates
  3. Connection Management

    • Re-initialize client between loads
    • Use multi-pool for better resource utilization
    • Configure worker count appropriately
  4. Data Validation (see the sketch after this list)

    • Use sanitization rules for data cleanup
    • Validate business keys before processing
    • Handle null values appropriately
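
As an illustration of the validation step, the PySpark sketch below drops rows with missing business keys and normalizes string columns before a load. The DataFrame df and the key column account_number are hypothetical:

from pyspark.sql import functions as F

key_col = "account_number"  # hypothetical business key column

# Validate business keys: drop rows where the key is null or blank
clean_df = df.filter(F.col(key_col).isNotNull() & (F.trim(F.col(key_col)) != ""))

# Handle null-like values: trim strings and convert empty strings to null
for name, dtype in clean_df.dtypes:
    if dtype == "string":
        clean_df = clean_df.withColumn(
            name,
            F.when(F.trim(F.col(name)) == "", None).otherwise(F.trim(F.col(name))),
        )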

Troubleshooting Guide

Common Scenarios

  1. Rate Limit Errors

    config = {
        "advanced.performance.batch_size": 25_000,  # Reduce batch size
        "advanced.performance.deferred_retry_enabled": True
    }
    
  2. Memory Issues

    • Enable sampling for testing
    • Reduce batch size
    • Check worker count
  3. Slow Performance

    • Enable ID-based strategy if possible
    • Optimize batch size
    • Check connection pooling

Debugging Process

  1. Enable detailed logging:

    config = {
        "advanced.debug.log_queries": True,
        "advanced.debug.log_batch_mapping": True,
        "advanced.debug.log_skips_missed": True
    }

  2. Use sampling for initial testing:

    config = {
        "advanced.debug.sampling_enabled": True,
        "advanced.debug.sample_size": 5_000
    }

  3. Review log tables for errors:

    SELECT __ddu_error, COUNT(*) AS error_count
    FROM empower_ddu.<table_name>
    WHERE __ddu_run_id = '<run_id>'
    GROUP BY __ddu_error

Key Improvements

Compared to the previous package:

  1. Enhanced Performance

    • Improved connection pooling
    • Optimized batch processing
    • Better resource utilization
  2. Better Error Handling

    • Detailed error logging
    • Automatic retry mechanism
    • Rate limit handling
  3. Advanced Features

    • ID-based upsert strategy
    • Data sanitization rules
    • Improved debugging tools
  4. Monitoring Capabilities

    • Detailed metrics
    • Query logging
    • Batch tracking

Extended Configuration Reference

See Extended Configuration Documentation for a complete list of all available configuration options and their detailed descriptions.