Autoloader Deep Dive: Incremental File Ingestion with Lakeflow Pipelines

Hello and welcome back!

In today’s post, we’ll explore file ingestion with Autoloader in Databricks: how to process incoming files efficiently and incrementally, and how to pair that ingestion with Lakeflow Declarative Pipelines for transformations.


What is Autoloader?

Autoloader is a powerful Databricks feature designed for incremental file ingestion. If you’re receiving files from multiple sources (Azure, AWS, or Google Cloud), Autoloader provides an efficient way to ingest and process new files as they arrive in your cloud storage.

When paired with Lakeflow Declarative Pipelines, you can automatically apply SCD Type 1 and Type 2 transformations, making it a robust solution for data architecture pipelines.

The main advantage? It’s easy to set up and maintain across multiple file sources without adding unnecessary complexity to your code.


Setting Up Autoloader with Lakeflow Pipelines

Here’s a step-by-step guide based on my demo:

1. Create an ETL Pipeline

  1. Go to Jobs & Pipelines in Databricks.
  2. Create a new ETL pipeline.
  3. Add your root folder, source code, catalog, and schema.

Note: I already had my pipeline set up, so I skipped the creation steps in the demo.

2. Bronze to Silver Architecture

I use a bronze-to-silver workflow:

  • Bronze table: Raw file ingestion (CSV, JSON, etc.)
  • Silver table: Processed output after transformations (with SCD history)

Key steps in the code:

  • Specify file type (e.g., CSV).
  • Enable schema evolution to handle new columns dynamically.
  • Optionally, provide schema hints for specific columns (e.g., IDs, timestamps); see the sketch after this list.
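
To make these options concrete, here is a minimal sketch of a bronze table that enables schema evolution and column hints. The path, table name, and hint values are placeholders for illustration; in a Lakeflow/DLT pipeline the schema location is tracked for you, so only the format and schema options need to be set.

import dlt
from pyspark.sql.functions import current_timestamp

@dlt.table(
    name="bronze_example",
    comment="Illustrative bronze table with schema evolution and hints"
)
def bronze_example():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")                                   # file type
        .option("cloudFiles.inferColumnTypes", "true")                        # infer types instead of reading everything as strings
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")            # pick up new columns as they appear
        .option("cloudFiles.schemaHints", "id BIGINT, updated_at TIMESTAMP")  # hints for specific columns
        .load("/Volumes/<catalog>/<schema>/<volume>/incoming/")               # placeholder path
        .withColumn("ingested_at", current_timestamp())
    )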

3. Creating Views and Silver Tables

  1. Create a staging view over the bronze data.
  2. Use the create_auto_cdc_flow function to transform the data into the silver table.
  3. Declare your primary key(s) so updates are matched to the right records.

This allows you to maintain incremental updates with SCD Type 2 logic, tracking both inserts and updates over time.
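
For comparison, switching from SCD Type 2 to SCD Type 1 (updates overwrite the row in place, so no history is kept) is mostly a one-argument change. This is a hedged sketch that reuses the bronze_view and timestamp column from the full demo code further below; silver_table_scd1 is just an illustrative name.

import dlt

# The target streaming table must exist before the flow writes to it
dlt.create_streaming_table(name="silver_table_scd1")

# SCD Type 1 variant: the latest version per key replaces the previous one
dlt.create_auto_cdc_flow(
    target="silver_table_scd1",
    source="bronze_view",          # same staging view as in the demo code below
    keys=["id"],
    sequence_by="timestamp",
    stored_as_scd_type=1
)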


Demo Highlights

  • Uploaded initial CSV files into a volume (instead of cloud storage for simplicity).
  • Demonstrated insert, update, and delete operations:
    • New records are appended to the bronze table.
    • Updates create new versions in the silver table, preserving history.
    • Deletes can be handled optionally via the apply_as_deletes argument, driven by an operation column in the incoming data.

Caveat: Autoloader does not support Excel files directly. A workaround exists, but CSV is preferred for smooth ingestion.
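
As one possible workaround (not the one from the demo, purely an illustration), you could convert Excel files to CSV before they land in the folder Autoloader watches, for example with pandas (which needs an Excel engine such as openpyxl on the cluster). The source path below is a placeholder; the destination is the same volume folder used in the demo code.

# Hypothetical pre-processing step: convert an Excel file to CSV so Autoloader can pick it up
import pandas as pd

excel_df = pd.read_excel("/Volumes/workspace/autoloader/raw_excel/report.xlsx")  # placeholder source
excel_df.to_csv("/Volumes/workspace/autoloader/files/report.csv", index=False)   # drop into the monitored folder

With that caveat out of the way, here is the full pipeline code from the demo: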

import dlt
from pyspark.sql.functions import current_timestamp, expr

# 1) Bronze streaming table (Autoloader)
@dlt.table(
    name="bronze_table",
    comment="Testing DLT with Autoloader CSV ingestion"
)
def source_table():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/Volumes/workspace/autoloader/files/")
        .withColumn("timestamp",current_timestamp())
    )

# 2) Create a streaming view over bronze
@dlt.view(name="bronze_view", comment="Streaming view of bronze_table")
def bronze_view():
    return dlt.read_stream("bronze_table")

# 3) Create the target streaming table for CDC output
dlt.create_streaming_table(
    name="silver_table",
    comment="SCD2 target table populated via AUTO CDC"
)

# 4) Apply changes (SCD Type 2) from the view into the streaming target
dlt.create_auto_cdc_flow(
    target="silver_table",
    source="bronze_view",
    keys=["id"],
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=2,
    sequence_by="timestamp"
)
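
After a run, you can confirm the SCD Type 2 behavior by querying silver_table: Lakeflow adds __START_AT and __END_AT columns based on the sequencing column, the current version of each key has a null __END_AT, and superseded versions keep the timestamp at which they were replaced.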

Handling Different File Types

Autoloader supports:

  • CSV – used throughout this demo; works well with schema inference.
  • Text files – watch the delimiter; data can land in a single value column (see the sketch after this list).
  • Binary files – set cloudFiles.format to binaryFile.
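
As a small illustration of the text-file case (the path and the ";" delimiter are assumptions, not from the demo), the raw lines arrive in a single value column that you can split yourself:

import dlt
from pyspark.sql.functions import split, col

@dlt.table(
    name="bronze_text_example",
    comment="Illustrative ingestion of plain text files"
)
def bronze_text_example():
    raw = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "text")                    # each line arrives as a single "value" column
        .load("/Volumes/workspace/autoloader/text_files/")      # placeholder path
    )
    return raw.select(split(col("value"), ";").alias("fields"))  # split manually on the real delimiter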

You can also:

  • Infer schema automatically or provide column hints.
  • Set up file notification services so new files are picked up as soon as they arrive (see the sketch below).
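
Enabling file notification mode is mostly a one-option change. This sketch assumes a direct cloud storage path (the bucket and prefix are placeholders) and that the pipeline has the permissions Autoloader needs to create the notification resources in your cloud account.

import dlt
from pyspark.sql.functions import current_timestamp

@dlt.table(
    name="bronze_notified_example",
    comment="Illustrative bronze table using file notification mode"
)
def bronze_notified_example():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.useNotifications", "true")   # subscribe to storage events instead of listing the directory
        .load("s3://<bucket>/<incoming-prefix>/")        # placeholder cloud storage path
        .withColumn("ingested_at", current_timestamp())
    )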

Benefits of Using Autoloader

  1. Scalable: Handle large volumes of files efficiently.
  2. Incremental processing: Only new files are processed, improving performance.
  3. Minimal code: Typical pipelines require fewer than 50 lines of code.
  4. Flexible schema handling: Infer, provide hints, or fail safely on mismatches.
  5. Cost-effective: Reduces cloud costs by avoiding full directory scans.

Conclusion

Autoloader is a powerful, scalable solution for incremental file ingestion, especially when paired with Lakeflow Declarative Pipelines for automatic SCD transformations. While there are caveats (Excel files, delimiter issues), the performance, ease of setup, and minimal code requirements make it a highly practical option for modern data pipelines.

For a deeper dive, check the Databricks documentation on Autoloader and explore the various options for schema evolution and notifications.

https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/

https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options


I hope this guide helps you set up efficient, incremental file ingestion in Databricks. Questions or suggestions? Drop them in the comments!
