JSON in Databricks and PySpark

In the simple case, JSON is easy to handle within Databricks. You can read a file of JSON objects directly into a DataFrame or table, and Databricks knows how to parse the JSON into individual fields. But, as with most things software-related, there are wrinkles and variations. This article shows how to handle the most common situations and includes detailed coding examples.

My use case was HL7 healthcare data that had been translated to JSON, but the methods here apply to any JSON data. The three formats considered are:

  • A text file containing complete JSON objects, one per line. This is typical when you are loading JSON files to Databricks tables.
  • A text file containing various fields (columns) of data, one of which is a JSON object. This is often seen in computer logs, where some plain-text metadata is followed by more detail in a JSON string.
  • A variation of the above where the JSON field is an array of objects.

Getting each of these types of input into Databricks requires different techniques.
