Post-Deployment: Additional Setup

Additional recommended setup steps after Empower is deployed.

Certain features and elements of the data estate require additional administrator setup after deployment of the Azure environment. This section covers those concerns.

SCIM - Sync Users and Groups with Databricks

System for Cross-Domain Identity Management (SCIM) is an open standard protocol for automating the exchange of user identity information between identity domains and IT systems. With SCIM, employees added to your teams in Microsoft Entra ID / Active Directory automatically have accounts created in the data environment. User attributes and profiles are synchronized between your directory and the data plane, so users can be assigned to groups in Databricks and are added or removed based on Entra group membership.

You should provision identities to your Databricks workspace using Microsoft Entra ID before wide production use. Please see the Databricks guide on SCIM provisioning to learn how.
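Once provisioning is configured, one way to spot-check the result is to query the workspace SCIM API and confirm that the expected users and groups appear. The following is a minimal sketch, not part of the Empower deployment: the function names are hypothetical, and it assumes that `DATABRICKS_HOST` holds your workspace URL and `DATABRICKS_TOKEN` holds a token with admin rights.

```shell
#!/bin/bash
# Hypothetical helpers for spot-checking SCIM provisioning.
# Assumes DATABRICKS_HOST (workspace URL) and DATABRICKS_TOKEN are set.

scim_url() {
    # Build the workspace SCIM endpoint for a resource type (Users or Groups).
    local host=${1%/}   # drop a trailing slash if present
    local resource=$2
    printf '%s/api/2.0/preview/scim/v2/%s' "${host}" "${resource}"
}

list_scim_resource() {
    # Fetch the SCIM listing for the given resource type as JSON.
    local resource=$1
    curl --silent --show-error --fail \
        --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
        --header "Accept: application/scim+json" \
        "$(scim_url "${DATABRICKS_HOST}" "${resource}")"
}
```

For example, `list_scim_resource Users` and `list_scim_resource Groups` return the identities currently known to the workspace, which you can compare against the Entra groups assigned to the provisioning application.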

Configuring User-Defined Routes for Databricks

The following documentation provides guidance on routing network traffic from your Azure Virtual Network to an external firewall using User-Defined Routes (UDRs) and Azure service tags.

User-Defined Routes (UDRs) are a feature in Azure that allows you to create custom routing paths within your virtual network. By default, Azure routes traffic between subnets, virtual networks, and on-premises networks, but there are scenarios where you might need to override these default routes to meet specific networking requirements.
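As an illustration of overriding a default route, the sketch below creates a route table and adds a default route whose next hop is a firewall appliance. The function names, the route name `default-via-firewall`, and the firewall IP address are assumptions made for this example; by default the commands are only printed, and setting `DRY_RUN=0` executes them.

```shell
#!/bin/bash
# Illustrative sketch; all resource names and addresses are placeholders.
# Set DRY_RUN=0 to execute the commands instead of printing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "${DRY_RUN}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

create_default_route_via_firewall() {
    local resource_group=$1 route_table=$2 region=$3 firewall_ip=$4
    # Create the route table that will hold the custom routes.
    run az network route-table create \
        --resource-group "${resource_group}" \
        --name "${route_table}" \
        --location "${region}"
    # Send all outbound traffic (0.0.0.0/0) to the firewall appliance.
    run az network route-table route create \
        --resource-group "${resource_group}" \
        --route-table-name "${route_table}" \
        --name "default-via-firewall" \
        --address-prefix "0.0.0.0/0" \
        --next-hop-type "VirtualAppliance" \
        --next-hop-ip-address "${firewall_ip}"
}
```

A call such as `create_default_route_via_firewall rg-emp-dev-001 rt-udr-databricks centralus 10.0.100.4` prints the two commands for review before you run them for real.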

Azure Service Tags represent a set of IP address prefixes for a specific Azure service, such as Databricks. Microsoft manages these tags and updates them automatically, removing the need for manual updates and reducing the risk of service outages.
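To see which address prefixes a service tag currently covers, you can query the service tag list with the Azure CLI. This is a small sketch with hypothetical function names; it assumes you are already logged in with `az login`.

```shell
#!/bin/bash
# Hypothetical helpers for inspecting a service tag's address prefixes.

service_tag_query() {
    # Build a JMESPath query selecting the address prefixes of one tag.
    local tag_name=$1
    printf "values[?name=='%s'].properties.addressPrefixes[]" "${tag_name}"
}

show_service_tag_prefixes() {
    # Print one address prefix per line for the given tag in the given region.
    local region=$1 tag_name=$2
    az network list-service-tags \
        --location "${region}" \
        --query "$(service_tag_query "${tag_name}")" \
        --output tsv
}
```

For example, `show_service_tag_prefixes centralus AzureDatabricks` lists the prefixes behind the `AzureDatabricks` tag. Because routes reference the tag rather than these prefixes, they do not need to be edited when Microsoft updates the list.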

To create the UDRs that will direct traffic from your Azure VNet through your external firewall, run the following Azure CLI script. The script requires the Azure CLI and jq, and it adds routes to a route table that must already exist.

  1. Save the following script as databricks-create-user-defined-routes.sh
#!/bin/bash

set -x
set -euo pipefail

function usage() {
    echo "DESCRIPTION
    This script creates a user-defined route (UDR) for Azure Databricks and associates it with a specified subnet.

SYNOPSIS
    $0 SUBSCRIPTION_ID RESOURCE_GROUP_NAME REGION VNET_NAME SUBNET_NAME ROUTE_TABLE_NAME

EXAMPLES
    $0 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX rg-emp-dev-001 centralus vnet-emp-dev-01 adbpublic rt-udr-databricks
"
}

function add_rt() {
    # Route traffic for the given service tag out to the internet
    local route_name=${1}
    local service_tag=${2}
    az network route-table route create \
        --subscription "${subscription}" \
        --resource-group "${resource_group_name}" \
        --route-table-name "${route_table_name}" \
        --name "${route_name}" \
        --address-prefix "${service_tag}" \
        --next-hop-type "Internet"
}

function cache_service_tags() {
    # Store service tags in a temporary file
    # to avoid multiple calls to the API
    local region=${1}
    local tmp_file=${2}
    az network list-service-tags \
        --location "${region}" > "${tmp_file}"
}

function get_service_tag_name() {
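    # Get the region-specific service tag name (for example, Storage.WestUS)
    local region=${1}
    local tag=${2}
    local tmp_file=${3}
    # -r prints the raw name without JSON quotes; the region is lowercased
    # first because the comparison is case-sensitive
    jq -r --arg region "${region}" --arg tag "${tag}" '.values[] | select(.properties.region == ($region | ascii_downcase)) | select(.name | contains($tag + ".")) | .name' < "${tmp_file}"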
}

if [ ! $# -eq 6 ]; then
    usage
    exit 1
fi

subscription=$1
resource_group_name=$2
region=$3
vnet_name=$4
subnet_name=$5
route_table_name=$6

# Keep the X's at the end of the template so that mktemp also works on BSD/macOS
TMP_FILE=$(mktemp /tmp/az_service_tags.XXXXXX)
cache_service_tags "${region}" "${TMP_FILE}"

# Get region-specific service tags 
event_hub_service_tag=$(get_service_tag_name "${region}" "EventHub" "${TMP_FILE}")
sql_service_tag=$(get_service_tag_name "${region}" "Sql" "${TMP_FILE}")
storage_service_tag=$(get_service_tag_name "${region}" "Storage" "${TMP_FILE}")

# Add routes to UDR
add_rt "adb-servicetag" "AzureDatabricks"
add_rt "adb-eventhub" "${event_hub_service_tag}"
add_rt "adb-metastore" "${sql_service_tag}"
add_rt "adb-storage" "${storage_service_tag}"

# Associate routing table with subnet
az network vnet subnet update \
  --subscription "${subscription}" \
  --resource-group "${resource_group_name}" \
  --vnet-name "${vnet_name}" \
  --name "${subnet_name}" \
  --route-table "${route_table_name}"

# Clean up cache
rm -f "${TMP_FILE}"

  2. Replace the placeholder values below with your configuration details, then run the script:

subscription="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
resource_group_name="rg-empower-dev-001"
region="centralus"
vnet_name="vnet-emp-dev-01"
subnet_name="adbpublic"
route_table_name="rt-user-defined-routes"

bash ./databricks-create-user-defined-routes.sh \
    "$subscription" \
    "$resource_group_name" \
    "$region" \
    "$vnet_name" \
    "$subnet_name" \
    "$route_table_name"
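
After the script finishes, it can be useful to confirm that the routes exist and that the subnet is attached to the intended route table. The following is a minimal sketch with hypothetical function names; it assumes the target subscription is already selected (for example with `az account set`).

```shell
#!/bin/bash
# Hypothetical verification helpers; not part of the script above.

route_table_name_from_id() {
    # Extract the trailing resource name from an Azure resource ID.
    local resource_id=$1
    printf '%s' "${resource_id##*/}"
}

verify_udr() {
    local resource_group=$1 vnet=$2 subnet=$3 route_table=$4
    # List the routes that were added to the route table.
    az network route-table route list \
        --resource-group "${resource_group}" \
        --route-table-name "${route_table}" \
        --output table
    # Read the ID of the route table attached to the subnet.
    local attached_id
    attached_id=$(az network vnet subnet show \
        --resource-group "${resource_group}" \
        --vnet-name "${vnet}" \
        --name "${subnet}" \
        --query "routeTable.id" \
        --output tsv)
    if [ "$(route_table_name_from_id "${attached_id}")" = "${route_table}" ]; then
        echo "OK: ${subnet} uses ${route_table}"
    else
        echo "MISMATCH: ${subnet} is attached to '${attached_id}'" >&2
        return 1
    fi
}
```

For example, `verify_udr "$resource_group_name" "$vnet_name" "$subnet_name" "$route_table_name"` prints the route list followed by an OK or MISMATCH line.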