# Ingesting Data from Databricks
## Connecting to Databricks
To connect to Databricks clusters or SQL warehouses, you need to set up authentication using OAuth 2.0 with a service principal and provide the following information (a connection sketch follows the list):
- Name: A friendly name for your connection to easily identify and reuse it for ingesting additional tables
- Host: The hostname of your Databricks workspace (e.g., your-workspace.cloud.databricks.com)
- HTTP Path: The HTTP path of your Databricks cluster or SQL warehouse (found in the connection details)
- Catalog: The Unity Catalog name containing the data you want to ingest
- Schema: The schema within the catalog where your tables are located
- Client ID: The client ID for OAuth authentication (Application ID)
- Client Secret: The client secret for OAuth authentication
- OAuth Server: The OAuth 2.0 server endpoint for authentication
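
For readers who want to see how these fields fit together outside Vendia, here is a minimal connection sketch using the open-source `databricks-sql-connector` and `databricks-sdk` Python packages; this is an illustration, not Vendia's internal implementation. The table name `orders` is hypothetical, and all other values are the placeholders used throughout this page.

```python
# pip install databricks-sql-connector databricks-sdk
from databricks import sql
from databricks.sdk.core import Config, oauth_service_principal

HOST = "your-workspace.cloud.databricks.com"        # Host
HTTP_PATH = "/sql/1.0/warehouses/abc123def456"      # HTTP Path
CLIENT_ID = "12345678-1234-1234-1234-123456789012"  # Client ID
CLIENT_SECRET = "****"                              # Client Secret

def credentials_provider():
    # OAuth 2.0 machine-to-machine flow for the service principal.
    config = Config(host=f"https://{HOST}",
                    client_id=CLIENT_ID,
                    client_secret=CLIENT_SECRET)
    return oauth_service_principal(config)

with sql.connect(server_hostname=HOST,
                 http_path=HTTP_PATH,
                 credentials_provider=credentials_provider) as conn:
    with conn.cursor() as cursor:
        # Catalog and Schema qualify the table; `orders` is a hypothetical table name.
        cursor.execute("SELECT * FROM analytics_catalog.sales_data.orders LIMIT 5")
        for row in cursor.fetchall():
            print(row)
```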
## Prerequisites
Before connecting to Databricks, ensure that:
- Your Databricks workspace is accessible from Vendia
- You have valid OAuth credentials with appropriate permissions
- The target cluster or SQL warehouse is running and accepting connections
- Unity Catalog is enabled if accessing catalog-managed tables
- Network connectivity allows HTTPS access to your Databricks workspace (a quick connectivity check follows this list)
- OAuth 2.0 authentication is configured for your application
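
A quick way to check the network prerequisite before filling in the connection form is to confirm that your environment can reach the workspace over HTTPS. Here is a minimal sketch using only the Python standard library; the hostname is the placeholder from this page.

```python
import socket
import ssl

HOST = "your-workspace.cloud.databricks.com"

# Verify DNS resolution and a TLS handshake on port 443 (HTTPS).
context = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        print(f"Reached {HOST} over {tls.version()}")
```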
## Authentication Setup
To set up OAuth authentication for Databricks (a sketch of the first step follows the list):

1. Create a service principal in your Databricks workspace
2. Generate OAuth credentials (a Client ID and Client Secret) for the service principal
3. Configure permissions for the service principal on the required catalogs and schemas
4. Note the OAuth server endpoint for your workspace
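
The first step can be done in the workspace admin settings or programmatically. Here is a hedged sketch using the `databricks-sdk` Python package; the display name `vendia-ingest` is hypothetical, and the OAuth secret in step 2 is typically generated from the Databricks account console rather than this API.

```python
# pip install databricks-sdk
from databricks.sdk import WorkspaceClient

# Assumes you are already authenticated as a workspace admin,
# e.g., via the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables.
w = WorkspaceClient()

sp = w.service_principals.create(display_name="vendia-ingest")  # hypothetical name
print(sp.application_id)  # becomes the connection's Client ID
```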
## Required Permissions
The OAuth application connecting to Databricks must have the following permissions (example grant statements follow the list):

- `SELECT` privilege on the tables you want to ingest
- `USE CATALOG` privilege on the target catalog
- `USE SCHEMA` privilege on the target schema
- Access to the specified cluster or SQL warehouse
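
These are standard Unity Catalog `GRANT` statements. As a sketch, an administrator can issue them over the same SQL connector; the table name `orders` is hypothetical, and the backticked grantee is the service principal's application ID (the Client ID).

```python
from databricks import sql

SP_APP_ID = "12345678-1234-1234-1234-123456789012"  # service principal application ID

grants = [
    f"GRANT USE CATALOG ON CATALOG analytics_catalog TO `{SP_APP_ID}`",
    f"GRANT USE SCHEMA ON SCHEMA analytics_catalog.sales_data TO `{SP_APP_ID}`",
    f"GRANT SELECT ON TABLE analytics_catalog.sales_data.orders TO `{SP_APP_ID}`",
]

# Connect as an administrator; a personal access token is used here for
# brevity, though OAuth (as shown above) is preferred in practice.
with sql.connect(server_hostname="your-workspace.cloud.databricks.com",
                 http_path="/sql/1.0/warehouses/abc123def456",
                 access_token="<admin-access-token>") as conn:
    with conn.cursor() as cursor:
        for statement in grants:
            cursor.execute(statement)
```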
## Example Configuration
Here’s an example of a typical Databricks connection configuration (a verification sketch follows the table):
| Field | Example Value |
|---|---|
| Name | Production Databricks Lakehouse |
| Host | your-workspace.cloud.databricks.com |
| HTTP Path | /sql/1.0/warehouses/abc123def456 |
| Catalog | analytics_catalog |
| Schema | sales_data |
| Client ID | 12345678-1234-1234-1234-123456789012 |
| Client Secret | **** |
| OAuth Server | https://your-workspace.cloud.databricks.com/oidc/v1/authorize |
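
To verify a configuration like this outside Vendia, recent versions of `databricks-sql-connector` also accept the Catalog and Schema fields as connection defaults. A hedged sketch, reusing the `credentials_provider` defined in the first example:

```python
from databricks import sql

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # Host
    http_path="/sql/1.0/warehouses/abc123def456",           # HTTP Path
    credentials_provider=credentials_provider,              # defined in the first sketch
    catalog="analytics_catalog",                            # Catalog
    schema="sales_data",                                    # Schema
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT current_catalog(), current_schema()")
        print(cursor.fetchone())  # expect analytics_catalog, sales_data
```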
## Vendia Supported and Unsupported Databricks Data Types
| Vendia Supported Databricks Data Types | Vendia Unsupported Databricks Data Types |
|---|---|
| BIGINT | ARRAY |
| BOOLEAN | BINARY |
| DATE | DayTimeIntervalType |
| DECIMAL | GEOGRAPHY |
| DOUBLE | GEOMETRY |
| FLOAT | INTERVAL |
| INT | MAP |
| LONG | NULL |
| SMALLINT | OBJECT |
| STRING | STRUCT |
| TIMESTAMP | VARIANT |
| TIMESTAMP_NTZ | VOID |
| TINYINT | YearMonthIntervalType |
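
Before ingesting, you can screen a table for columns in the unsupported list. A minimal sketch using `DESCRIBE TABLE` over an existing connection `conn` (as created in the earlier examples); `orders` is hypothetical, and complex types such as `ARRAY<...>` or interval types are matched by their base name:

```python
UNSUPPORTED = {"ARRAY", "BINARY", "GEOGRAPHY", "GEOMETRY", "INTERVAL",
               "MAP", "NULL", "OBJECT", "STRUCT", "VARIANT", "VOID"}

with conn.cursor() as cursor:
    cursor.execute("DESCRIBE TABLE analytics_catalog.sales_data.orders")
    for row in cursor.fetchall():
        col_name, data_type = row[0], row[1]
        if not col_name or col_name.startswith("#"):
            continue  # skip blank separators and metadata sections
        base_type = data_type.split("<")[0].split("(")[0].strip().upper()
        if any(base_type.startswith(t) for t in UNSUPPORTED):
            print(f"Column {col_name!r} has unsupported type {data_type}")
```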
## Best Practices
- Security: Use service principals with OAuth 2.0 for programmatic access instead of personal access tokens
- Permissions: Apply the principle of least privilege when granting catalog and schema permissions
- Performance: Use SQL warehouses for better performance and cost optimization for analytical workloads
- Testing: Test connectivity with a small sample table before ingesting large datasets
- Catalog Management: Organize data using Unity Catalog for better governance and access control
## Troubleshooting
If you encounter connection issues (an error-handling sketch follows this list):
- Authentication Failed: Verify that Client ID, Client Secret, and OAuth server URL are correct
- Connection Refused: Check that the host and HTTP path are properly configured
- Cluster Not Found: Ensure that the cluster or SQL warehouse is running and the HTTP path is valid
- Catalog Not Found: Verify that the catalog name exists and Unity Catalog is enabled
- Schema Not Found: Confirm that the schema exists within the specified catalog
- Permission Denied: Check that the service principal has the required catalog and schema permissions
- Network Issues: Ensure that firewall rules allow HTTPS connections to Databricks
- OAuth Errors: Verify that OAuth configuration and token expiration settings are correct
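
When diagnosing these failures programmatically, the connector raises DB-API style exceptions whose messages usually indicate which of the cases above you have hit. A minimal sketch, assuming the `credentials_provider` from the first example and that `sql.Error` is exposed as the connector's base exception per the DB-API convention:

```python
from databricks import sql

try:
    conn = sql.connect(server_hostname="your-workspace.cloud.databricks.com",
                       http_path="/sql/1.0/warehouses/abc123def456",
                       credentials_provider=credentials_provider)
except sql.Error as exc:
    # Authentication, permission, and routing problems surface here;
    # match the message against the checklist above.
    print(f"Connection failed: {exc}")
    raise
```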
## Next Steps
After successfully connecting to your Databricks workspace, you can:
- Select specific tables to ingest
- Configure data transformations and mappings
- Set up incremental data ingestion jobs
- Schedule regular data synchronization tasks