The closer your data product gets to production, the more important it becomes to properly collect and analyse logs. Logs help both when debugging in-depth issues and when analysing the behaviour of your application.
For general Python applications, the classic choice is the built-in logging library, which has all the necessary components and provides convenient interfaces for both configuring and working with logs.
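As a quick reminder, a minimal setup with the standard library usually looks something like this (the logger name and format string here are just illustrative):

```python
import logging

# Basic configuration: set the level and a simple output format.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

logger = logging.getLogger(__name__)
logger.info("Application started")
logger.debug("Hidden unless the level is lowered to DEBUG")
```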
For PySpark applications, the logging configuration is a bit more intricate, but still fully controllable; it is simply done in a slightly different way than classic Python logging.
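To illustrate the difference, one commonly shown pattern (not necessarily the setup described later in this post) is to reach the JVM-side Log4j logger through the active Spark session, so that your messages end up in the same log stream as Spark's own output:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Access the JVM Log4j LogManager via py4j and get a named logger.
# The logger name "my_app" is just an example.
log4j = spark._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("my_app")

logger.info("This message goes to the Spark (JVM) log output")
```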
In this blog post I would like to describe an approach to effectively creating and managing the logging setup in PySpark applications, both in a local environment and on Databricks clusters.