Azure Data Lake

TL;DR: Clone this git project, set the parameters, and run 0_script.sh to deploy one ADLSgen2 hub and N Databricks spokes.

A data lake is a centralized repository of data that allows enterprises to create business value from data. Azure Databricks is a popular tool to analyze data and build data pipelines. This blog discusses how Azure Databricks can be connected to an ADLSgen2 storage account in a secure and scalable way. Three aspects are key:

  • Defense in depth: ADLSgen2 contains sensitive data and shall be secured using private endpoints and Azure AD (with access keys disabled). Databricks can only access ADLSgen2 using private link and Azure AD.
  • Access control: Business units typically have their own Databricks workspace. Each workspace shall be granted access to its ADLSgen2 File Systems using Role Based Access Control (RBAC).
  • Hub/spoke architecture: Only one hub network can access the ADLSgen2 account using private link. Databricks spoke networks are peered with the hub network, which simplifies networking.
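
The three points above can be sketched as a dry-run plan of Azure CLI calls. This is a minimal illustration, not the contents of 0_script.sh: the `Spoke` class, the `build_plan` function, all resource names, and the choice of the "Storage Blob Data Contributor" role are assumptions made for this example. It also assumes that the hub and spoke networks live in one resource group, and it leaves out private DNS and public network access settings. The function only builds the commands and runs nothing.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class Spoke:
    """One Databricks spoke; every field is a hypothetical placeholder."""
    vnet: str          # name of the spoke virtual network
    principal_id: str  # Azure AD object ID used by the workspace
    filesystem: str    # ADLSgen2 File System (container) it may access


def build_plan(subscription: str, group: str, account: str,
               hub_vnet: str, hub_subnet: str,
               spokes: List[Spoke]) -> List[List[str]]:
    """Return the az CLI commands (as argument lists) for one hub and N spokes."""
    account_id = (f"/subscriptions/{subscription}/resourceGroups/{group}"
                  f"/providers/Microsoft.Storage/storageAccounts/{account}")
    plan = [
        # Defense in depth: disable access keys so that only Azure AD works.
        ["az", "storage", "account", "update", "--name", account,
         "--resource-group", group, "--allow-shared-key-access", "false"],
        # Defense in depth: a single private endpoint (dfs) in the hub network.
        ["az", "network", "private-endpoint", "create",
         "--name", f"pe-{account}", "--resource-group", group,
         "--vnet-name", hub_vnet, "--subnet", hub_subnet,
         "--private-connection-resource-id", account_id,
         "--group-id", "dfs", "--connection-name", f"{account}-dfs"],
    ]
    for spoke in spokes:
        # Hub/spoke: peering must be created in both directions.
        for local, remote in ((spoke.vnet, hub_vnet), (hub_vnet, spoke.vnet)):
            plan.append(
                ["az", "network", "vnet", "peering", "create",
                 "--name", f"{local}-to-{remote}", "--resource-group", group,
                 "--vnet-name", local, "--remote-vnet", remote,
                 "--allow-vnet-access"])
        # Access control: RBAC scoped to the spoke's own File System.
        scope = f"{account_id}/blobServices/default/containers/{spoke.filesystem}"
        plan.append(
            ["az", "role", "assignment", "create",
             "--assignee", spoke.principal_id,
             "--role", "Storage Blob Data Contributor",  # assumed role
             "--scope", scope])
    return plan


if __name__ == "__main__":
    demo = [Spoke("vnet-bu1", "<object-id-1>", "fs-bu1"),
            Spoke("vnet-bu2", "<object-id-2>", "fs-bu2")]
    for cmd in build_plan("<subscription-id>", "rg-datalake", "adlshub",
                          "vnet-hub", "snet-pe", demo):
        print(shlex.join(cmd))
```

Each spoke adds two peering commands and one role assignment, so the plan grows linearly with N while the storage account keeps a single private endpoint in the hub.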

Tags: Azure Data