Ingesting Data from Databricks

Connecting to Databricks

To connect to a Databricks cluster or SQL warehouse, set up OAuth 2.0 authentication with a service principal and provide the following information (a minimal connection sketch follows the list):

  • Name: A friendly name for your connection to easily identify and reuse it for ingesting additional tables
  • Host: The hostname of your Databricks workspace (e.g., your-workspace.cloud.databricks.com)
  • HTTP Path: The HTTP path of your Databricks cluster or SQL warehouse (found in the connection details)
  • Catalog: The Unity Catalog name containing the data you want to ingest
  • Schema: The schema within the catalog where your tables are located
  • Client ID: The client ID for OAuth authentication (Application ID)
  • Client Secret: The client secret for OAuth authentication
  • OAuth Server: The OAuth 2.0 server endpoint for authentication
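
For reference, here is a minimal sketch of how these fields map onto a programmatic connection. It assumes the databricks-sql-connector and databricks-sdk Python packages; the hostname, HTTP path, identifiers, and secret are placeholders matching the example configuration below.

  # Minimal sketch: connecting to a Databricks SQL warehouse with an OAuth
  # service principal (M2M). All values are placeholders.
  from databricks import sql
  from databricks.sdk.core import Config, oauth_service_principal

  HOST = "your-workspace.cloud.databricks.com"    # Host
  HTTP_PATH = "/sql/1.0/warehouses/abc123def456"  # HTTP Path

  config = Config(
      host=f"https://{HOST}",
      client_id="12345678-1234-1234-1234-123456789012",  # Client ID
      client_secret="<client-secret>",                   # Client Secret
  )

  with sql.connect(
      server_hostname=HOST,
      http_path=HTTP_PATH,
      credentials_provider=lambda: oauth_service_principal(config),
  ) as connection:
      with connection.cursor() as cursor:
          # Catalog and Schema scope which tables are visible for ingestion.
          cursor.execute("USE CATALOG analytics_catalog")
          cursor.execute("USE SCHEMA sales_data")
          cursor.execute("SELECT current_catalog(), current_schema()")
          print(cursor.fetchone())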

Prerequisites

Before connecting to Databricks, ensure that:

  • Your Databricks workspace is accessible from Vendia
  • You have valid OAuth credentials with appropriate permissions
  • The target cluster or SQL warehouse is running and accepting connections
  • Unity Catalog is enabled if accessing catalog-managed tables
  • Network connectivity allows HTTPS access to your Databricks workspace
  • OAuth 2.0 authentication is configured for your application
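
Several of these prerequisites can be verified at once with a short preflight check. The sketch below is one way to do it, assuming the databricks-sdk Python package and placeholder credentials; it confirms that the workspace is reachable over HTTPS, the OAuth credentials are valid, and the SQL warehouse is running.

  # Preflight sketch: verify reachability, credentials, and warehouse state.
  # All values are placeholders.
  from databricks.sdk import WorkspaceClient

  w = WorkspaceClient(
      host="https://your-workspace.cloud.databricks.com",
      client_id="12345678-1234-1234-1234-123456789012",
      client_secret="<client-secret>",
  )

  # Succeeds only if the workspace is reachable and the service principal
  # can authenticate.
  me = w.current_user.me()
  print(f"Authenticated as {me.user_name}")

  # The warehouse ID is the last segment of the HTTP path.
  warehouse = w.warehouses.get(id="abc123def456")
  print(f"Warehouse state: {warehouse.state}")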

Authentication Setup

To set up OAuth authentication for Databricks:

  1. Create a service principal in your Databricks workspace
  2. Generate OAuth credentials (Client ID and Client Secret)
  3. Configure permissions for the service principal on the required catalogs and schemas
  4. Note the OAuth server endpoint for your workspace
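
You can confirm the endpoint and credentials from step 4 by requesting a token directly with the OAuth client-credentials grant. The sketch below assumes the requests Python package and Databricks' workspace-level token endpoint; all credentials are placeholders.

  # Sketch: fetch an OAuth access token for the service principal using the
  # client-credentials grant. Host and credentials are placeholders.
  import requests

  HOST = "your-workspace.cloud.databricks.com"

  response = requests.post(
      f"https://{HOST}/oidc/v1/token",
      auth=("12345678-1234-1234-1234-123456789012", "<client-secret>"),
      data={"grant_type": "client_credentials", "scope": "all-apis"},
      timeout=30,
  )
  response.raise_for_status()  # a 4xx here means the credentials are wrong
  print("Token obtained:", response.json()["access_token"][:20], "...")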

Required Permissions

The service principal connecting to Databricks must have the following permissions:

  • SELECT privilege on the tables you want to ingest
  • USE CATALOG privilege on the target catalog
  • USE SCHEMA privilege on the target schema
  • Access to the specified cluster or SQL warehouse
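
Expressed as Databricks SQL, the grants look like the sketch below. It must be run by a catalog owner or admin; the catalog, schema, table, and principal names are placeholders (the backticked identifier is the service principal's application ID), and the connection is opened as in the earlier sketch.

  # Sketch: grant the minimum privileges the ingesting service principal
  # needs. Run as a catalog owner/admin; all names are placeholders.
  from databricks import sql
  from databricks.sdk.core import Config, oauth_service_principal

  SP = "`12345678-1234-1234-1234-123456789012`"  # application ID, backticked
  grants = [
      f"GRANT USE CATALOG ON CATALOG analytics_catalog TO {SP}",
      f"GRANT USE SCHEMA ON SCHEMA analytics_catalog.sales_data TO {SP}",
      f"GRANT SELECT ON TABLE analytics_catalog.sales_data.orders TO {SP}",
  ]

  config = Config(
      host="https://your-workspace.cloud.databricks.com",
      client_id="<admin-client-id>",
      client_secret="<admin-client-secret>",
  )
  with sql.connect(
      server_hostname="your-workspace.cloud.databricks.com",
      http_path="/sql/1.0/warehouses/abc123def456",
      credentials_provider=lambda: oauth_service_principal(config),
  ) as connection:
      with connection.cursor() as cursor:
          for statement in grants:
              cursor.execute(statement)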

Example Configuration

Here’s an example of a typical Databricks connection configuration:

Field          Example Value
-------------  -------------------------------------------------------------
Name           Production Databricks Lakehouse
Host           your-workspace.cloud.databricks.com
HTTP Path      /sql/1.0/warehouses/abc123def456
Catalog        analytics_catalog
Schema         sales_data
Client ID      12345678-1234-1234-1234-123456789012
Client Secret  ****
OAuth Server   https://your-workspace.cloud.databricks.com/oidc/v1/authorize

Vendia Supported and Unsupported Databricks Data Types

Vendia Supported Databricks Data Types  Vendia Unsupported Databricks Data Types
--------------------------------------  ----------------------------------------
BIGINT                                  ARRAY
BOOLEAN                                 BINARY
DATE                                    DayTimeIntervalType
DECIMAL                                 GEOGRAPHY
DOUBLE                                  GEOMETRY
FLOAT                                   INTERVAL
INT                                     MAP
LONG                                    NULL
SMALLINT                                OBJECT
STRING                                  STRUCT
TIMESTAMP                               VARIANT
TIMESTAMP_NTZ                           VOID
TINYINT                                 YearMonthIntervalType
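
Before ingesting a table, you can check its columns against this list. The sketch below is one way to do so, querying the Unity Catalog information_schema through the same connection pattern as above; the table name (orders) and all other identifiers are placeholders.

  # Sketch: flag columns whose Databricks types Vendia does not ingest.
  # All names and credentials are placeholders.
  from databricks import sql
  from databricks.sdk.core import Config, oauth_service_principal

  SUPPORTED = {
      "BIGINT", "BOOLEAN", "DATE", "DECIMAL", "DOUBLE", "FLOAT", "INT",
      "LONG", "SMALLINT", "STRING", "TIMESTAMP", "TIMESTAMP_NTZ", "TINYINT",
  }

  config = Config(
      host="https://your-workspace.cloud.databricks.com",
      client_id="<client-id>",
      client_secret="<client-secret>",
  )
  with sql.connect(
      server_hostname="your-workspace.cloud.databricks.com",
      http_path="/sql/1.0/warehouses/abc123def456",
      credentials_provider=lambda: oauth_service_principal(config),
  ) as connection:
      with connection.cursor() as cursor:
          cursor.execute(
              "SELECT column_name, data_type "
              "FROM analytics_catalog.information_schema.columns "
              "WHERE table_schema = 'sales_data' AND table_name = 'orders'"
          )
          for column_name, data_type in cursor.fetchall():
              status = "ok" if data_type.upper() in SUPPORTED else "UNSUPPORTED"
              print(f"{column_name}: {data_type} -> {status}")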

Best Practices

  • Security: Use service principals with OAuth 2.0 for programmatic access instead of personal access tokens
  • Permissions: Apply the principle of least privilege when granting catalog and schema permissions
  • Performance: Use SQL warehouses for analytical workloads; they offer better performance and cost efficiency
  • Testing: Test connectivity with a small sample table before ingesting large datasets
  • Catalog Management: Organize data using Unity Catalog for better governance and access control

Troubleshooting

If you encounter connection issues:

  1. Authentication Failed: Verify that Client ID, Client Secret, and OAuth server URL are correct
  2. Connection Refused: Check that the host and HTTP path are properly configured
  3. Cluster Not Found: Ensure that the cluster or SQL warehouse is running and the HTTP path is valid
  4. Catalog Not Found: Verify that the catalog name exists and Unity Catalog is enabled
  5. Schema Not Found: Confirm that the schema exists within the specified catalog
  6. Permission Denied: Check that the service principal has the required catalog and schema permissions
  7. Network Issues: Ensure that firewall rules allow HTTPS connections to Databricks
  8. OAuth Errors: Verify that OAuth configuration and token expiration settings are correct
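
When working through this checklist, a staged smoke test can pinpoint which layer is failing. The sketch below, assuming the requests, databricks-sql-connector, and databricks-sdk packages from the earlier sketches and placeholder values throughout, separates OAuth problems (items 1 and 8) from host, path, and warehouse problems (items 2 and 3).

  # Staged smoke-test sketch. Step 1 isolates OAuth failures; step 2
  # isolates host/HTTP-path/warehouse failures. Values are placeholders.
  import requests
  from databricks import sql
  from databricks.sdk.core import Config, oauth_service_principal

  HOST = "your-workspace.cloud.databricks.com"
  CLIENT_ID = "12345678-1234-1234-1234-123456789012"
  CLIENT_SECRET = "<client-secret>"

  # Step 1: can we obtain a token at all?
  token_resp = requests.post(
      f"https://{HOST}/oidc/v1/token",
      auth=(CLIENT_ID, CLIENT_SECRET),
      data={"grant_type": "client_credentials", "scope": "all-apis"},
      timeout=30,
  )
  token_resp.raise_for_status()  # failure here -> authentication/OAuth issue
  print("OAuth OK")

  # Step 2: can we reach the warehouse and run a trivial query?
  config = Config(host=f"https://{HOST}", client_id=CLIENT_ID,
                  client_secret=CLIENT_SECRET)
  with sql.connect(
      server_hostname=HOST,
      http_path="/sql/1.0/warehouses/abc123def456",
      credentials_provider=lambda: oauth_service_principal(config),
  ) as connection:
      with connection.cursor() as cursor:
          cursor.execute("SELECT 1")  # failure here -> host/path/warehouse
          print("Warehouse OK:", cursor.fetchone())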

Next Steps

After successfully connecting to your Databricks workspace, you can:

  • Select specific tables to ingest
  • Configure data transformations and mappings
  • Set up incremental data ingestion jobs
  • Schedule regular data synchronization tasks