Custom Interactive Cluster Configuration

Overview

This document outlines the process for integrating a custom interactive cluster into the Empower product, along with the necessary configuration steps.

Need for an Interactive Cluster

  • Efficiency: Instead of launching multiple job clusters, a single interactive cluster, sized to the workload, can be reused across runs.

  • Performance Boost: This improves product performance by reducing both execution time and cost.

Configuration Steps

  1. Create an interactive cluster in the corresponding Databricks instance using the following script/notebook:
import requests

DATABRICKS_URL = (
    "https://adb-613269140414450.10.azuredatabricks.net"  # Replace with databricks URL
)
DATABRICKS_API_KEY = "<databricks-api-token>"  # Replace with API token


def main():
    header = {"Authorization": f"Bearer {DATABRICKS_API_KEY}"}

    body = {
        "cluster_name": "EmpowerInteractiveCluster",
        "num_workers": 1,
        "spark_version": "14.3.x-photon-scala2.12",
        "data_security_mode": "SINGLE_USER",
        "single_user_name": "de0a3ca9-5e39-4b00-9813-0e298fe12ae4",  # Replace with respective ADF service principal 
        "instance_pool_id": "0927-105040-stop6-pool-el2max83",  # Replace with pool id - this is S Pool 
        "policy_id": "001890F816051BBF",  # Replace with EmpowerSparkJobPolicy Id
        "autotermination_minutes": 60,
    }

    resp = requests.post(
        f"{DATABRICKS_URL}/api/2.1/clusters/create",
        headers=header,
        timeout=60,
        json=body,
    )
    if resp.status_code != 200:
        print(f"Error creating cluster: {resp.text}")
        return

    out = resp.json()
    print(f"Created new cluster with ID: {out['cluster_id']}")


if __name__ == "__main__":
    main()
The cluster created by the script above. Retrieve the cluster ID from the cluster's URL (highlighted); the script also prints it on success.
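
If the cluster ID is needed programmatically, it can also be looked up by name through the Clusters API. A minimal sketch, assuming the same DATABRICKS_URL and DATABRICKS_API_KEY as the creation script (pagination of the list response is ignored here):

import requests

DATABRICKS_URL = "https://adb-613269140414450.10.azuredatabricks.net"  # Replace with databricks URL
DATABRICKS_API_KEY = "<databricks-api-token>"  # Replace with API token


def find_cluster_id(cluster_name):
    """Return the cluster ID of the first cluster whose name matches, or None."""
    resp = requests.get(
        f"{DATABRICKS_URL}/api/2.1/clusters/list",
        headers={"Authorization": f"Bearer {DATABRICKS_API_KEY}"},
        timeout=60,
    )
    resp.raise_for_status()
    for cluster in resp.json().get("clusters", []):
        if cluster["cluster_name"] == cluster_name:
            return cluster["cluster_id"]
    return None


print(find_cluster_id("EmpowerInteractiveCluster"))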

  2. Configure the following values in the [state_config].[DatabaseToStepCommand] table (a scripted update is sketched after this list).

    a. BatchVersion = 'v2'
    b. ClusterID = cluster ID from step 1
    c. JobConcurrency = -1
    d. SparkWorkers = 0
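
The same values can be applied with a scripted UPDATE against the metadata database. A minimal sketch using pyodbc; the connection string and the WHERE clause (filtering on a hypothetical DataSource column) are assumptions and must be adapted to the actual schema:

import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>;DATABASE=<database>;"  # Replace with the Empower metadata database
    "UID=<user>;PWD=<password>;"
)

UPDATE_SQL = """
UPDATE [state_config].[DatabaseToStepCommand]
SET BatchVersion = 'v2',
    ClusterID = ?,       -- cluster ID from step 1
    JobConcurrency = -1,
    SparkWorkers = 0
WHERE DataSource = ?     -- hypothetical filter; adapt to the actual schema
"""

with pyodbc.connect(CONN_STR) as conn:
    conn.execute(UPDATE_SQL, ("<cluster-id>", "<data-source>"))
    # The pyodbc connection context manager commits on successful exit.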
  3. How to confirm the interactive cluster is used for execution.
    Make sure the compute is EmpowerInteractiveCluster and that the run executes as the ADF service principal (a programmatic check is sketched after the caption below).

Notebook activity run parameters
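
As an additional check, the cluster configuration can be read back through the Clusters API. A minimal sketch, assuming the same workspace URL and token as the creation script; the cluster ID is the one configured in step 2, and the service principal ID is the value used for single_user_name:

import requests

DATABRICKS_URL = "https://adb-613269140414450.10.azuredatabricks.net"  # Replace with databricks URL
DATABRICKS_API_KEY = "<databricks-api-token>"  # Replace with API token

resp = requests.get(
    f"{DATABRICKS_URL}/api/2.1/clusters/get",
    headers={"Authorization": f"Bearer {DATABRICKS_API_KEY}"},
    params={"cluster_id": "<cluster-id>"},  # Cluster ID configured in step 2
    timeout=60,
)
resp.raise_for_status()
cluster = resp.json()

# These should match the values used when the cluster was created.
assert cluster["cluster_name"] == "EmpowerInteractiveCluster"
assert cluster["single_user_name"] == "<adf-service-principal-id>"
print(f"Cluster state: {cluster['state']}")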

  4. How to use this feature

    a. Triggering from the Main pipeline

    Initiate the main pipeline for the respective load group of the DataSource (a programmatic trigger is sketched after this list).

    b. Triggering from flows

    Triggering from Flows is currently not supported but will be available in future releases.
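
For reference, a pipeline run can also be initiated through the Azure Data Factory REST API. A minimal sketch using requests; the subscription, resource group, factory, pipeline name, and the load-group parameter name are all assumptions to adapt, and AAD_TOKEN must be a valid Azure AD bearer token for management.azure.com:

import requests

AAD_TOKEN = "<azure-ad-bearer-token>"  # Token scoped to https://management.azure.com
SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY = "<data-factory-name>"
PIPELINE = "<main-pipeline-name>"  # Hypothetical name of the Empower main pipeline

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
    f"/factories/{FACTORY}/pipelines/{PIPELINE}/createRun?api-version=2018-06-01"
)

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {AAD_TOKEN}"},
    json={"LoadGroup": "<load-group>"},  # Hypothetical pipeline parameter
    timeout=60,
)
resp.raise_for_status()
print(f"Pipeline run ID: {resp.json()['runId']}")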

Supported Connectors

  • Spark-based connectors are supported.