Writing PySpark logs in Apache Spark and Databricks

The closer your data product gets to production, the more important it becomes to collect and analyse logs properly. Logs help both when debugging in-depth issues and when analysing the behaviour of your application.

For general Python applications the classical choice is the built-in logging library, which has all the necessary components and provides very convenient interfaces for both configuring and working with logs.
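For reference, a minimal setup with the standard library might look like the sketch below; the logger name and format string are illustrative choices, not anything prescribed.

```python
import logging

# Basic configuration: the level and message format here are arbitrary examples
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)

logger = logging.getLogger("my_app")  # hypothetical logger name
logger.info("Application started")
```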

For PySpark applications, the logging configuration is a little more intricate, but still very controllable; it is simply done in a slightly different way than classical Python logging.
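To give a rough idea of that difference, a pattern often seen in PySpark code is to reach through the py4j gateway and write to the JVM-side Log4j logger rather than the Python logging module. The sketch below assumes a running SparkSession, and the logger name is purely illustrative; the exact behaviour also depends on the Spark version and its logging backend, so treat this as a sketch rather than the approach developed in this post.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Access the JVM-side Log4j logger through the py4j gateway;
# "my_app" is an illustrative logger name
log4j_logger = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger("my_app")
log4j_logger.info("This message goes to the Spark (JVM) log output")
```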

In this blog post I would like to describe an approach to effectively creating and managing the log setup in PySpark applications, both in a local environment and on Databricks clusters.
