Azure Databricks
To set up the Databricks Spark job:
1. In your warehouse, create a catalog namespace called sightmachine (a notebook sketch follows these steps).
2. Upload the provided sm_cdc_script.py file into your workspace on Databricks. The script syncs the CDC data from an Azure cloud storage container into the warehouse.
3. Add the Service Principal credentials used to access the cloud storage to the cluster configuration, as described below.
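A minimal sketch of step 1, run from a notebook attached to the warehouse or cluster. Whether sightmachine should be a catalog or a schema inside an existing catalog depends on your Unity Catalog layout (an assumption here), so adjust the statement accordingly:

# Create the "sightmachine" namespace. In a Databricks notebook, `spark` is predefined;
# the getOrCreate() call only matters if you run this outside a notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE SCHEMA IF NOT EXISTS sightmachine")  # or CREATE CATALOG IF NOT EXISTS, per your layout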
Adding Secrets to Databricks
Add the Service Principal secret value to Databricks using the Secrets API. Use the Databricks CLI to do this:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/tutorial
CLI example (Service Principal):
databricks secrets create-scope <secret-scope>
databricks secrets put-secret <secret-scope> sp_client_secret --string-value <secret-value>
For example:
databricks secrets create-scope smreleasetesting
databricks secrets put-secret smreleasetesting sp_client_secret --string-value 'REDACT'
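To confirm the secret is readable from the workspace, a notebook attached to a cluster can fetch it back with dbutils (a sketch; the scope and key names match the example above, and secret values are redacted if printed in notebook output):

# Read the Service Principal client secret from the secret scope created above.
# `dbutils` is predefined in Databricks notebooks; the scope/key names come from the CLI example.
client_secret = dbutils.secrets.get(scope="smreleasetesting", key="sp_client_secret")
print(len(client_secret))  # prints the length only; the value itself stays redacted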
In the Databricks UI, go to Workflows and click “Create Job”.
Set “Task name” as “SightMachineETL”
Set “Type” as “Python Script”
Set “Source” as “Workspace”
Set “Path” as the workspace path to the sm_cdc_script.py file uploaded in step 2.
Set “Parameters” to point the script at the location of the CDC data in cloud storage:
["--file-path=abfss://<storage account container>@<storage account>.dfs.core.windows.net/wal2json", "--context=<storage account>"]
E.g.: ["--file-path=abfss://<container>@smreleasetesting.dfs.core.windows.net/wal2json", "--context=smreleasetesting"]
For “Cluster” row, click on the “edit” button.
For “Databricks runtime version”, use DBR 14.3 LTS or later.
For “Driver Type”, select the instance size that works for your use case; we strongly recommend using a Delta Cache Accelerated (d_ads) instance.
Click on “Advanced options”
Configure the Spark cluster with the Service Principal credentials and cloud storage location information. Use Databricks secrets (the scope created earlier) to protect the credential values. Spark config values to add (a notebook-level alternative is sketched after these steps):
Service Principal:
spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<service-credential-key>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
E.g.:
spark.hadoop.fs.azure.account.auth.type.smreleasetesting.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.smreleasetesting.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.smreleasetesting.dfs.core.windows.net 4669e29c-ce6e-417f-a495-e331ae969ccb
spark.hadoop.fs.azure.account.oauth2.client.secret.smreleasetesting.dfs.core.windows.net {{secrets/smreleasetesting/sp_client_secret}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.smreleasetesting.dfs.core.windows.net https://login.microsoftonline.com/beb1d7f9-8e2e-4dc4-83be-190ebceb70ea/oauth2/token
Click the “Confirm” button to save the cluster configuration.
Click “Create task” to save the job configuration.
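For reference, the task configured above corresponds roughly to the following Jobs API 2.1-style payload. This is a sketch with placeholder paths, not the exact definition the UI creates:

import json

# Approximate Jobs API 2.1 payload matching the UI settings above.
# "<path to sm_cdc_script.py>" and the storage placeholders must be replaced with your values.
job_spec = {
    "name": "SightMachineETL",
    "tasks": [
        {
            "task_key": "SightMachineETL",
            "spark_python_task": {
                "python_file": "/Workspace/<path to sm_cdc_script.py>",
                "parameters": [
                    "--file-path=abfss://<storage account container>@<storage account>.dfs.core.windows.net/wal2json",
                    "--context=<storage account>",
                ],
            },
        }
    ],
}
print(json.dumps(job_spec, indent=2))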
Edit the job title to something more meaningful than “New Job DATE”.
Click “Run now” to run the task. The task runs continuously to sync data between Sight Machine cloud storage and Databricks.
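If you want to test storage access from a notebook before relying on the cluster-level Spark config, the same OAuth settings can be applied at the session level. This is a sketch of the standard ADLS Gen2 OAuth pattern, reusing the secret scope from earlier; <application-id> and <directory-id> are the same placeholders as above:

# Session-level equivalent of the cluster Spark config above (standard ADLS Gen2 OAuth pattern).
# Run in a Databricks notebook, where `spark` and `dbutils` are predefined.
storage_account = "smreleasetesting"  # example account name used throughout this article
client_secret = dbutils.secrets.get(scope="smreleasetesting", key="sp_client_secret")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")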
Troubleshooting notes:
If the job fails with error:
Server failed to authenticate the request
This means the secrets were not configured correctly, so the job could not read the files from cloud storage. Double-check that the secret scope, key, and Spark configuration entries match what you created (a quick notebook check is sketched below).
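A quick way to isolate the problem is to list the CDC directory directly from a notebook attached to the same cluster (a sketch; the placeholders match the --file-path job parameter):

# List the CDC directory using the cluster's configured credentials.
# If this fails with an authentication error, the Spark config / secret setup is at fault,
# not the job itself. `dbutils` is predefined in Databricks notebooks.
files = dbutils.fs.ls("abfss://<storage account container>@<storage account>.dfs.core.windows.net/wal2json")
for f in files[:10]:
    print(f.path)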
If the job fails with error:
NOTE: When using the `ipython kernel` entry point, Ctrl-C will not work. To exit, you will have to explicitly quit this process, by either sending "quit" from a client, or using Ctrl-\ in UNIX-like environments. To read more about this, see https://github.com/ipython/ipython/issues/2049
This means the cluster does not have enough memory. Change the “Node type” to one with more memory.