JSON in Databricks and PySpark

In the simple case, JSON is easy to handle within Databricks. You can read a file of JSON objects directly into a DataFrame or table, and Databricks knows how to parse the JSON into individual fields. But, as with most things software-related, there are wrinkles and variations. This article shows how to handle the most common situations and includes detailed coding examples.

My use case was HL7 healthcare data that had been translated to JSON, but the methods here apply to any JSON data. The three formats considered are:

  • A text file containing complete JSON objects, one per line. This is typical when you are loading JSON files to Databricks tables.
  • A text file containing various fields (columns) of data, one of which is a JSON object. This is often seen in computer logs, where some plain-text metadata is followed by more detail in a JSON string.
  • A variation of the above where the JSON field is an array of objects.

Getting each of these types of input into Databricks requires different techniques.
