Getting Started
Getting Started
Basic Setup
from empowerspark.target.type.dynamics_v2 import Dynamics
# Initialize client
dynamics_client = Dynamics()
# Configure database and log table names
dynamics_client.__artifact_db__ = "empower_ddu"
dynamics_client.__log_table_name__ = "batch_log"
# Configure with secrets
dynamics_client.configure("dataverse-connection-string")
Secret Configuration in KeyVault
The base secret should contain:
{
"dynamics_url": "<your-dynamics-url>",
"azure_tenant": "<tenant-id>",
"client_id": "<client-id>",
"client_secret": "<client-secret>",
"dynamics_erp": "<erp-name>",
"extra_clients_secret_name": "<name-of-secret-containing-extra-clients>"
}
Additional service principals should be stored in a separate secret with format:
OAuthClientID=<client-id>;OAuthClientSecret=<client-secret>;
Worker Count Calculation
Worker count should be configured based on the number of service principals:
- Base formula:
number_of_service_principals * 50
- Each service principal can handle up to 52 concurrent connections
- Example: With 32 service principals, use:
32 * 50 = 1,600
workers
Important Notes
- Always re-initialize the dynamics client between loads to prevent connection issues:
dynamics_client = get_dynamics_client(secret_name, config, num_of_dynamics_clients)
- Base configuration example:
base_config = {
"advanced.performance.batch_size": 50_000,
"advanced.columns.exclude_columns": ["owningteam_id", "owningteam_logicalname"]
}
Core Configuration Settings
These are the most commonly used settings:
Setting | Description | Default | Notes |
---|---|---|---|
advanced.performance.batch_size | Number of records per batch | 10,000 | Recommended: 50,000 |
advanced.performance.num_workers | Number of concurrent operations | 32 | Calculate based on SPs |
advanced.performance.id_based_strategy_enabled | Use ID-based upsert strategy | False | Recommended when possible |
advanced.debug.log_queries | Enable query logging | False | For troubleshooting |
advanced.debug.sampling_enabled | Enable data sampling | False | For testing |
advanced.debug.sample_size | Number of records to sample | 1,000 | Used with sampling |
Logging and Monitoring
Core Log Tables
-
Batch Log Table (
batch_log
)- Contains execution metrics for each batch
- Key columns:
run_id
: Unique identifier for the runpartition_id
: Batch identifiersuccess_inserts/updates
: Successful operations countfailed_inserts/updates
: Failed operations countskipped
: Records that didn't need updatingprocessing_time
: Execution duration
-
Query Log Table (
<table_name>_query_log
)- Records executed SQL queries
- Useful for debugging data changes
-
Batch Mapping Table (
<table_name>_batch_mapping
)- Maps records to batches
- Used for tracking record processing
Example Log Query
SELECT
run_id,
SUM(partition_size) as total_processed,
SUM(success_inserts) as success_inserts,
SUM(failed_inserts) as failed_inserts,
MIN(start_time)
FROM empower_ddu.batch_log
GROUP BY run_id
ORDER BY MIN(start_time) DESC
Operation Modes
Available Modes
-
Insert
- Adds new records only
- Fastest operation
- Use when data is guaranteed new
-
Upsert
- Updates existing records and inserts new ones
- More complex operation
- Two strategies:
- ID-based (preferred when possible)
- Business key-based
Processing Stages
-
Pre-processing
- Data normalization
- Key validation
- Record classification
-
Processing
- Batch execution
- Error handling
- Retry logic
-
Post-processing
- Logging
- Cleanup
- Metrics collection
Best Practices
-
ID Strategy Usage
config = { "advanced.performance.id_based_strategy_enabled": True }
- Faster than business key strategy
- Reduces database load
- Use when ID is available
-
Batch Size Optimization
- Start with 50,000 records per batch
- Monitor performance metrics
- Adjust based on error rates
-
Connection Management
- Re-initialize client between loads
- Use multi-pool for better resource utilization
- Configure worker count appropriately
-
Data Validation
- Use sanitization rules for data cleanup
- Validate business keys before processing
- Handle null values appropriately
Troubleshooting Guide
Common Scenarios
-
Rate Limit Errors
config = { "advanced.performance.batch_size": 25_000, # Reduce batch size "advanced.performance.deferred_retry_enabled": True }
-
Memory Issues
- Enable sampling for testing
- Reduce batch size
- Check worker count
-
Slow Performance
- Enable ID-based strategy if possible
- Optimize batch size
- Check connection pooling
Debugging Process
- Enable detailed logging:
config = {
"advanced.debug.log_queries": True,
"advanced.debug.log_batch_mapping": True,
"advanced.debug.log_skips_missed": True
}
- Use sampling for initial testing:
config = {
"advanced.debug.sampling_enabled": True,
"advanced.debug.sample_size": 5_000
}
- Review log tables for errors:
SELECT __ddu_error, COUNT(*)
FROM empower_ddu.<table_name>
WHERE __ddu_run_id = '<run_id>'
GROUP BY __ddu_error
Key Improvements
Compared to the previous package:
-
Enhanced Performance
- Improved connection pooling
- Optimized batch processing
- Better resource utilization
-
Better Error Handling
- Detailed error logging
- Automatic retry mechanism
- Rate limit handling
-
Advanced Features
- ID-based upsert strategy
- Data sanitization rules
- Improved debugging tools
-
Monitoring Capabilities
- Detailed metrics
- Query logging
- Batch tracking
Extended Configuration Reference
See Extended Configuration Documentation for a complete list of all available configuration options and their detailed descriptions.
Updated about 2 months ago