# Ingesting Data from Databricks
## Connecting to Databricks
To connect to Databricks clusters or SQL warehouses, you need to set up authentication using OAuth 2.0 with a service principal and provide the following information (a connection sketch follows the list):
- Name: A friendly name for your connection to easily identify and reuse it for ingesting additional tables
- Host: The hostname of your Databricks workspace (e.g., your-workspace.cloud.databricks.com)
- HTTP Path: The HTTP path of your Databricks cluster or SQL warehouse (found in the connection details)
- Catalog: The Unity Catalog name containing the data you want to ingest
- Schema: The schema within the catalog where your tables are located
- Client ID: The client ID for OAuth authentication (Application ID)
- Client Secret: The client secret for OAuth authentication
- OAuth Server: The OAuth 2.0 server endpoint for authentication
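
For readers who want to see how these fields fit together outside Vendia, here is a minimal connection sketch using the open-source `databricks-sql-connector` and `databricks-sdk` Python packages; this is an illustration, not Vendia's internal implementation. The table name `orders` is hypothetical, and all other values are the placeholders used throughout this page.

```python
# pip install databricks-sql-connector databricks-sdk
from databricks import sql
from databricks.sdk.core import Config, oauth_service_principal

HOST = "your-workspace.cloud.databricks.com"        # Host
HTTP_PATH = "/sql/1.0/warehouses/abc123def456"      # HTTP Path
CLIENT_ID = "12345678-1234-1234-1234-123456789012"  # Client ID
CLIENT_SECRET = "****"                              # Client Secret

def credentials_provider():
    # OAuth 2.0 machine-to-machine flow for the service principal.
    config = Config(host=f"https://{HOST}",
                    client_id=CLIENT_ID,
                    client_secret=CLIENT_SECRET)
    return oauth_service_principal(config)

with sql.connect(server_hostname=HOST,
                 http_path=HTTP_PATH,
                 credentials_provider=credentials_provider) as conn:
    with conn.cursor() as cursor:
        # Catalog and Schema qualify the table; `orders` is a hypothetical table name.
        cursor.execute("SELECT * FROM analytics_catalog.sales_data.orders LIMIT 5")
        for row in cursor.fetchall():
            print(row)
```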
## Prerequisites
Before connecting to Databricks, ensure that:
- Your Databricks workspace is accessible from Vendia
- You have valid OAuth credentials with appropriate permissions
- The target cluster or SQL warehouse is running and accepting connections
- Unity Catalog is enabled if accessing catalog-managed tables
- Network connectivity allows HTTPS access to your Databricks workspace (a quick connectivity check follows this list)
- OAuth 2.0 authentication is configured for your application
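
A quick way to check the network prerequisite before filling in the connection form is to confirm that your environment can reach the workspace over HTTPS. Here is a minimal sketch using only the Python standard library; the hostname is the placeholder from this page.

```python
import socket
import ssl

HOST = "your-workspace.cloud.databricks.com"

# Verify DNS resolution and a TLS handshake on port 443 (HTTPS).
context = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        print(f"Reached {HOST} over {tls.version()}")
```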
## Authentication Setup
To set up OAuth authentication for Databricks (a sketch of the first step follows the list):

1. Create a service principal in your Databricks workspace
2. Generate OAuth credentials (a Client ID and Client Secret) for the service principal
3. Configure permissions for the service principal on the required catalogs and schemas
4. Note the OAuth server endpoint for your workspace
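
The first step can be done in the workspace admin settings or programmatically. Here is a hedged sketch using the `databricks-sdk` Python package; the display name `vendia-ingest` is hypothetical, and the OAuth secret in step 2 is typically generated from the Databricks account console rather than this API.

```python
# pip install databricks-sdk
from databricks.sdk import WorkspaceClient

# Assumes you are already authenticated as a workspace admin,
# e.g., via the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables.
w = WorkspaceClient()

sp = w.service_principals.create(display_name="vendia-ingest")  # hypothetical name
print(sp.application_id)  # becomes the connection's Client ID
```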
## Required Permissions
The OAuth application connecting to Databricks must have the following permissions (example grant statements follow the list):

- `SELECT` privilege on the tables you want to ingest
- `USE CATALOG` privilege on the target catalog
- `USE SCHEMA` privilege on the target schema
- Access to the specified cluster or SQL warehouse
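
These are standard Unity Catalog `GRANT` statements. As a sketch, an administrator can issue them over the same SQL connector; the table name `orders` is hypothetical, and the backticked grantee is the service principal's application ID (the Client ID).

```python
from databricks import sql

SP_APP_ID = "12345678-1234-1234-1234-123456789012"  # service principal application ID

grants = [
    f"GRANT USE CATALOG ON CATALOG analytics_catalog TO `{SP_APP_ID}`",
    f"GRANT USE SCHEMA ON SCHEMA analytics_catalog.sales_data TO `{SP_APP_ID}`",
    f"GRANT SELECT ON TABLE analytics_catalog.sales_data.orders TO `{SP_APP_ID}`",
]

# Connect as an administrator; a personal access token is used here for
# brevity, though OAuth (as shown above) is preferred in practice.
with sql.connect(server_hostname="your-workspace.cloud.databricks.com",
                 http_path="/sql/1.0/warehouses/abc123def456",
                 access_token="<admin-access-token>") as conn:
    with conn.cursor() as cursor:
        for statement in grants:
            cursor.execute(statement)
```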
## Example Configuration
Here’s an example of a typical Databricks connection configuration (a verification sketch follows the table):
| Field | Example Value |
|---|---|
| Name | Production Databricks Lakehouse |
| Host | your-workspace.cloud.databricks.com |
| HTTP Path | /sql/1.0/warehouses/abc123def456 |
| Catalog | analytics_catalog |
| Schema | sales_data |
| Client ID | 12345678-1234-1234-1234-123456789012 |
| Client Secret | **** |
| OAuth Server | https://your-workspace.cloud.databricks.com/oidc/v1/authorize |
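
To verify a configuration like this outside Vendia, recent versions of `databricks-sql-connector` also accept the Catalog and Schema fields as connection defaults. A hedged sketch, reusing the `credentials_provider` defined in the first example:

```python
from databricks import sql

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # Host
    http_path="/sql/1.0/warehouses/abc123def456",           # HTTP Path
    credentials_provider=credentials_provider,              # defined in the first sketch
    catalog="analytics_catalog",                            # Catalog
    schema="sales_data",                                    # Schema
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT current_catalog(), current_schema()")
        print(cursor.fetchone())  # expect analytics_catalog, sales_data
```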
## Vendia Supported and Unsupported Databricks Data Types
| Vendia Supported Databricks Data Types | Vendia Unsupported Databricks Data Types |
|---|---|
| BIGINT | ARRAY |
| BOOLEAN | BINARY |
| DATE | DayTimeIntervalType |
| DECIMAL | GEOGRAPHY |
| DOUBLE | GEOMETRY |
| FLOAT | INTERVAL |
| INT | MAP |
| LONG | NULL |
| SMALLINT | OBJECT |
| STRING | STRUCT |
| TIMESTAMP | VARIANT |
| TIMESTAMP_NTZ | VOID |
| TINYINT | YearMonthIntervalType |
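
Before ingesting, you can screen a table for columns in the unsupported list. A minimal sketch using `DESCRIBE TABLE` over an existing connection `conn` (as created in the earlier examples); `orders` is hypothetical, and complex types such as `ARRAY<...>` or interval types are matched by their base name:

```python
UNSUPPORTED = {"ARRAY", "BINARY", "GEOGRAPHY", "GEOMETRY", "INTERVAL",
               "MAP", "NULL", "OBJECT", "STRUCT", "VARIANT", "VOID"}

with conn.cursor() as cursor:
    cursor.execute("DESCRIBE TABLE analytics_catalog.sales_data.orders")
    for row in cursor.fetchall():
        col_name, data_type = row[0], row[1]
        if not col_name or col_name.startswith("#"):
            continue  # skip blank separators and metadata sections
        base_type = data_type.split("<")[0].split("(")[0].strip().upper()
        if any(base_type.startswith(t) for t in UNSUPPORTED):
            print(f"Column {col_name!r} has unsupported type {data_type}")
```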
## Best Practices
- Security: Use service principals with OAuth 2.0 for programmatic access instead of personal access tokens
- Permissions: Apply the principle of least privilege when granting catalog and schema permissions
- Performance: Use SQL warehouses for better performance and cost optimization for analytical workloads
- Testing: Test connectivity with a small sample table before ingesting large datasets
- Catalog Management: Organize data using Unity Catalog for better governance and access control
## Troubleshooting
If you encounter connection issues (an error-handling sketch follows this list):
- Authentication Failed: Verify that Client ID, Client Secret, and OAuth server URL are correct
- Connection Refused: Check that the host and HTTP path are properly configured
- Cluster Not Found: Ensure that the cluster or SQL warehouse is running and the HTTP path is valid
- Catalog Not Found: Verify that the catalog name exists and Unity Catalog is enabled
- Schema Not Found: Confirm that the schema exists within the specified catalog
- Permission Denied: Check that the service principal has the required catalog and schema permissions
- Network Issues: Ensure that firewall rules allow HTTPS connections to Databricks
- OAuth Errors: Verify that OAuth configuration and token expiration settings are correct
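
When diagnosing these failures programmatically, the connector raises DB-API style exceptions whose messages usually indicate which of the cases above you have hit. A minimal sketch, assuming the `credentials_provider` from the first example and that `sql.Error` is exposed as the connector's base exception per the DB-API convention:

```python
from databricks import sql

try:
    conn = sql.connect(server_hostname="your-workspace.cloud.databricks.com",
                       http_path="/sql/1.0/warehouses/abc123def456",
                       credentials_provider=credentials_provider)
except sql.Error as exc:
    # Authentication, permission, and routing problems surface here;
    # match the message against the checklist above.
    print(f"Connection failed: {exc}")
    raise
```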
## Next Steps
After successfully connecting to your Databricks workspace, you can:
- Select specific tables to ingest
- Configure data transformations and mappings
- Set up incremental data ingestion jobs
- Schedule regular data synchronization tasks