In the simple case, JSON is easy to handle in Databricks: you can read a file of JSON objects directly into a DataFrame or table, and Databricks parses the JSON into individual fields automatically. But, as with most things in software, there are wrinkles and variations. This article shows how to handle the most common situations, with detailed coding examples.
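The "simple case" is newline-delimited JSON (one complete object per line), which Spark's `spark.read.json()` ingests directly. A minimal sketch of that shape, using Python's standard `json` module so it runs anywhere (the field names are hypothetical, loosely modeled on healthcare records):

```python
import json

# Newline-delimited JSON ("JSON Lines"): one complete object per line.
# This is the format spark.read.json() consumes out of the box;
# here we parse it with the stdlib to show the structure.
raw = "\n".join([
    '{"patient_id": "p1", "status": "admitted"}',
    '{"patient_id": "p2", "status": "discharged"}',
])

# Parse each line into a dict; Spark does the equivalent and infers a schema.
records = [json.loads(line) for line in raw.splitlines()]
```

In Databricks the same file would load with `spark.read.json(path)`, which infers column names and types from the objects.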
My use case was HL7 healthcare data that had been translated to JSON, but the methods here apply to any JSON data. The three formats considered are:
- A text file containing complete JSON objects, one per line. This is typical when you are loading JSON files into Databricks tables.
- A text file containing various fields (columns) of data, one of which is a JSON object. This is often seen in computer logs, where some plain-text metadata is followed by more detail in a JSON string.
- A variation of the above where the JSON field is an array of objects.
Getting each of these types of input into Databricks requires a different technique.
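To make the second and third formats concrete, here is a sketch of what such lines look like and how the JSON portion is extracted. The log layout and field names are hypothetical; the parsing uses the stdlib `json` module (in Spark you would instead apply `from_json` from `pyspark.sql.functions` with a declared schema):

```python
import json

# Format 2 (hypothetical log line): plain-text metadata, then a JSON object.
log_line = '2024-01-15 INFO {"event": "hl7_parsed", "segments": 12}'

# Split off the two metadata fields; the remainder is the JSON payload.
timestamp, level, payload = log_line.split(" ", 2)
detail = json.loads(payload)

# Format 3 (hypothetical): the JSON field is an array of objects.
array_line = '2024-01-15 INFO [{"code": "A1"}, {"code": "B2"}]'
_, _, array_payload = array_line.split(" ", 2)
items = json.loads(array_payload)  # parses to a Python list of dicts
```

In Databricks, format 2 maps to `from_json(col, schema)` producing a struct column, and format 3 to `from_json` with an `ArrayType` schema, typically followed by `explode` to get one row per array element.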