Read and Write Excel Files in Databricks

đź‘‹ Hey there! This video is for members only.

Please log in or create a free account below.

Login / Sign Up

Working with Excel files in Databricks can feel tricky, especially when you’re handling large datasets or dealing with formatting quirks. In this guide, we’ll explore how to read and write Excel files using Pandas and PySpark, highlight some common challenges, and share practical workarounds.

Reading Excel Files in Databricks

Databricks doesn’t support Excel natively the way it does formats like CSV or Parquet, so you’ll first need the openpyxl library installed:

%pip install openpyxl

  • Pandas: Ideal for small to medium Excel files or proofs of concept. pandas.read_excel() with the correct engine gives you a quick way to load and display the data.
  • PySpark (pyspark.pandas): Best for large datasets, since it scales across a Spark cluster. The syntax mirrors Pandas but is designed for distributed computing.
# Read with pandas (single-node; good for small to medium files)
import pandas as pd
df_pandas = pd.read_excel('/Volumes/workspace/excel/excel/pandas_excel.xlsx', engine='openpyxl')
display(df_pandas)

# Read with the pandas API on Spark (distributed; better for large files)
import pyspark.pandas as ps
df_pandas_spark = ps.read_excel('/Volumes/workspace/excel/excel/pandas_excel.xlsx', engine='openpyxl')
display(df_pandas_spark)

Writing Excel Files in Databricks

Data loaded with Pandas lives in driver memory, so to store it efficiently you first convert it into a Spark DataFrame and save it as a Delta table for fast storage and querying. A pandas-on-Spark DataFrame is already distributed, so to_spark() hands it over to Spark without moving data through the driver, which makes it the better fit for large-scale pipelines.

# Write with pandas: convert to a Spark DataFrame, then save as a Delta table
df_spark = spark.createDataFrame(df_pandas)
df_spark.write.format('delta').mode('overwrite').saveAsTable('workspace.excel.pandas_excel')

# Write with pyspark: to_spark() is a cheap conversion since the data is already distributed
df_pandas_spark.to_spark().write.format('delta').mode('overwrite').saveAsTable('workspace.excel.spark_excel')
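
If you need an actual .xlsx file back out, say for a report handoff, Pandas can write one directly to a Volume. A minimal sketch, assuming a hypothetical output path on the same Volume; note it runs entirely on the driver, so it only suits data that fits in memory:

# Write an .xlsx with pandas (output path is hypothetical)
df_pandas.to_excel('/Volumes/workspace/excel/excel/pandas_excel_out.xlsx', index=False, engine='openpyxl')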

Challenges and Limitations

While the process works, there are a few drawbacks:

  • No Autoloader support for Excel, meaning incremental ingestion and schema evolution aren’t available.
  • Multiple sheets and formatting quirks can cause issues when importing; see the multi-sheet sketch after this list.
  • Performance limitations with Pandas on large files.
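
For the multi-sheet case, one workable pattern is to load every sheet at once and stack them into a single DataFrame. A minimal sketch with Pandas, assuming the sheets share the same columns and reusing the file path from earlier:

# sheet_name=None loads every sheet into a dict of {sheet name: DataFrame}
import pandas as pd
sheets = pd.read_excel('/Volumes/workspace/excel/excel/pandas_excel.xlsx', sheet_name=None, engine='openpyxl')

# Tag each row with its source sheet, then stack the sheets together
df_all = pd.concat(
    [df.assign(source_sheet=name) for name, df in sheets.items()],
    ignore_index=True,
)
display(df_all)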

Workarounds

  • Convert Excel to CSV before ingestion to leverage Autoloader, as sketched after this list.
  • Use external libraries like com.crealytics.spark.excel, though setup can be tricky.
  • Normalize messy column names and formatting with small cleanup scripts before writing out.
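
Here’s a minimal sketch of the CSV route, assuming made-up Volume locations for the CSV output, schema, and checkpoint. The workbook is flattened to CSV once (cleaning column names along the way), and Autoloader then picks up the CSV folder incrementally:

import pandas as pd

# One-off conversion: flatten the workbook to CSV
df = pd.read_excel('/Volumes/workspace/excel/excel/pandas_excel.xlsx', engine='openpyxl')

# Normalize messy column names while we're at it
df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
df.to_csv('/Volumes/workspace/excel/csv/pandas_excel.csv', index=False)

# Incremental ingestion with Autoloader, which does support CSV
(
    spark.readStream.format('cloudFiles')
    .option('cloudFiles.format', 'csv')
    .option('cloudFiles.schemaLocation', '/Volumes/workspace/excel/_schema')
    .option('header', 'true')
    .load('/Volumes/workspace/excel/csv/')
    .writeStream
    .option('checkpointLocation', '/Volumes/workspace/excel/_checkpoint')
    .trigger(availableNow=True)
    .toTable('workspace.excel.csv_excel')
)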

Conclusion

Reading and writing Excel files in Databricks is possible with Pandas and PySpark, but each has its strengths and limitations. Pandas is best for quick tests or smaller files, while PySpark excels in scalability and performance. With the right workarounds, you can integrate Excel into your Databricks data engineering workflows effectively.
