TL;DR: Clone this git project, set the parameters, and run 0_script.sh to deploy one ADLSgen2 hub and N Databricks spokes.
A data lake is a centralized repository of data that allows enterprises to create business value from data. Azure Databricks is a popular tool to analyze data and build data pipelines. This blog discusses how Azure Databricks can be connected to an ADLSgen2 storage account in a secure and scalable way. Three principles are key:
- Defense in depth: ADLSgen2 contains sensitive data and shall be secured using private endpoints and Azure AD (with access keys disabled). Databricks can only access ADLSgen2 through Private Link and Azure AD.
- Access control: Business units typically have their own Databricks workspace. Multiple workspaces shall be granted access to ADLSgen2 file systems using role-based access control (RBAC).
- Hub/spoke architecture: Only one hub network can access the ADLSgen2 account using Private Link. Databricks spoke networks peer to the hub network, which simplifies networking.
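To make the access-control principle concrete, here is a minimal sketch of how per-workspace RBAC on individual file systems could be planned. It is not taken from the project's scripts. The role name `Storage Blob Data Contributor` is an assumed choice (the text above only says RBAC), the scope string follows the Azure resource ID format for a blob container, and all names, IDs, and helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: each spoke gets read/write access via this built-in Azure role.
ROLE = "Storage Blob Data Contributor"


@dataclass(frozen=True)
class RoleAssignment:
    principal_id: str  # Azure AD object ID of the spoke's identity (hypothetical)
    role: str          # name of the Azure role to grant
    scope: str         # resource ID of the ADLSgen2 file system (container)


def file_system_scope(subscription_id: str, resource_group: str,
                      account: str, file_system: str) -> str:
    """Build the resource ID of one file system in the hub storage account."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
        f"/blobServices/default/containers/{file_system}"
    )


def plan_assignments(subscription_id: str, resource_group: str, account: str,
                     spokes: Dict[str, List[str]]) -> List[RoleAssignment]:
    """Return one role assignment per (spoke identity, file system) pair.

    `spokes` maps a spoke's principal ID to the file systems it may access.
    """
    return [
        RoleAssignment(
            principal_id=principal_id,
            role=ROLE,
            scope=file_system_scope(subscription_id, resource_group, account, fs),
        )
        for principal_id, file_systems in spokes.items()
        for fs in file_systems
    ]


if __name__ == "__main__":
    # Hypothetical example: two business units, one shared hub account.
    plan = plan_assignments(
        "sub-0000", "rg-hub", "hubadls",
        {"spoke-a-id": ["raw", "curated"], "spoke-b-id": ["raw"]},
    )
    for assignment in plan:
        print(assignment.principal_id, "->", assignment.scope)
```

Scoping each assignment to a single file system rather than to the whole storage account keeps a business unit's workspace limited to the data it was explicitly granted. Each planned entry corresponds to one role assignment on the hub account.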