Anatomy of Semi-Structured (JSON) Data with PySpark

Introduction

This guide explores the fundamental concepts of handling semi-structured data like JSON within a PySpark environment. We'll delve into the various complex data types Spark uses to represent JSON and demonstrate how to parse, access, and flatten nested data structures for effective data analysis and transformation.


Data Types in Spark

Before parsing JSON, it's crucial to understand how Spark represents its data. JSON data often contains complex, nested structures. The three most common complex data types in Spark for handling this are Array, Struct, and Map.


Array

An Array in Spark is a list of homogeneous elements, meaning every element in the array must have the same data type.

Example JSON:

{
  "numbers": [1, 2, 3, 4, 5, 6]
}

Read JSON into a Spark DataFrame:

input_json = """
{
"numbers": [1, 2, 3, 4, 5, 6]
}
"""
adf = spark.read.json(sc.parallelize([input_json]))
adf.printSchema()

Output:

root
 |-- numbers: array (nullable = true)
 |    |-- element: long (containsNull = true)

Show the data:

adf.show(truncate=False)

Output:

+------------------+
|numbers           |
+------------------+
|[1, 2, 3, 4, 5, 6]|
+------------------+

Flatten the array using explode:

from pyspark.sql.functions import explode

# explode produces one output row per array element
adf.select(explode("numbers").alias("number")).show()

Output:

+------+
|number|
+------+
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
+------+
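
Arrays can also be inspected in place, without exploding. As a minimal sketch using the same DataFrame (element_at requires Spark 2.4 or later): size counts the elements, element_at retrieves one element by 1-based position, and bracket indexing is 0-based.

from pyspark.sql.functions import col, element_at, size

# size counts elements; element_at is 1-based; [] indexing is 0-based
adf.select(
    size("numbers").alias("count"),
    element_at("numbers", 1).alias("first"),
    col("numbers")[0].alias("also_first")
).show()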

Struct

A Struct is a grouped set of named fields (like a record); unlike an Array, the fields can have different data types. Fields are accessed using dot notation.

Example JSON:

{
  "car_details": {
    "model": "Tesla S",
    "year": 2018
  }
}

Read into Spark:

input_json = """
{
"car_details": {
"model": "Tesla S",
"year": 2018
}
}
"""
sdf = spark.read.json(sc.parallelize([input_json]))
sdf.printSchema()

Output:

root
 |-- car_details: struct (nullable = true)
 |    |-- model: string (nullable = true)
 |    |-- year: long (nullable = true)

Accessing fields:

# Using dot notation
sdf.select(sdf.car_details.model, sdf.car_details.year).show()

Output:

+-----------------+----------------+
|car_details.model|car_details.year|
+-----------------+----------------+
|          Tesla S|            2018|
+-----------------+----------------+

Or with col, which also yields cleaner output column names:

from pyspark.sql.functions import col
sdf.select(col("car_details.model"), col("car_details.year")).show()

Output:

+-------+----+
|  model|year|
+-------+----+
|Tesla S|2018|
+-------+----+
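
To promote every field of a struct to a top-level column at once, a star expansion also works; a minimal sketch with the same DataFrame:

# Expand all struct fields into top-level columns
sdf.select("car_details.*").show()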

Map

A Map is a collection of key-value pairs (like a Python dictionary), where all keys share one data type and all values share another.

Example JSON:

{
  "Car": {
    "model_id": 835,
    "year": 2008
  }
}

Read with an explicit schema:

from pyspark.sql.types import StructType, MapType, StringType, IntegerType

input_json = """
{
  "Car": {
    "model_id": 835,
    "year": 2008
  }
}
"""

# Without an explicit schema, Spark would infer "Car" as a struct,
# so we supply a MapType schema to read it as key-value pairs
schema = StructType().add("Car", MapType(StringType(), IntegerType()))
mdf = spark.read.json(sc.parallelize([input_json]), schema=schema)
mdf.printSchema()

Output:

root
 |-- Car: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

Access elements by key:

mdf.select(mdf.Car["model_id"], mdf.Car["year"]).show()

Output:

+-------------+---------+
|Car[model_id]|Car[year]|
+-------------+---------+
|          835|     2008|
+-------------+---------+
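
explode works on maps as well, emitting one row per entry with key and value columns. A minimal sketch with the DataFrame above, which yields one row for model_id and one for year:

from pyspark.sql.functions import explode

# Each map entry becomes a (key, value) row
mdf.select(explode("Car")).show()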

Reading JSON Records

By default, Spark reads JSON in the JSON Lines format, where each line is a separate, self-contained JSON record.

Example file:

{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}

Read into Spark:

# multiLine=False (the default) expects one complete JSON record per line
df = spark.read.json(path, multiLine=False)
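
As a quick check without a file, the same two records can be parsed inline via sc.parallelize, mirroring the earlier examples:

# Each string is one self-contained JSON record, as in the file above
records = [
    '{"id" : "1201", "name" : "satish", "age" : "25"}',
    '{"id" : "1202", "name" : "krishna", "age" : "28"}',
]
df = spark.read.json(sc.parallelize(records))
df.show()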

With deeply nested records, inspect the schema before flattening. Here is the inferred schema for a food-ordering example:

root
 |-- customerId: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- basket: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- grossMerchandiseValueEur: double (nullable = true)
 |    |    |    |    |-- productId: string (nullable = true)
 |    |    |    |    |-- productType: string (nullable = true)
 |    |    |-- orderId: string (nullable = true)

In a real-world scenario, you would combine the Array, Struct, and Map techniques above to parse and flatten such a nested structure for further processing and analysis, as the sketch below illustrates.
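
As a minimal sketch (assuming the DataFrame is named df, as in the read above), two explode calls unnest the orders and basket arrays, and dot notation pulls out the struct fields:

from pyspark.sql.functions import col, explode

flat = (
    df
    # One row per order
    .withColumn("order", explode("orders"))
    # One row per basket item within each order
    .withColumn("item", explode("order.basket"))
    .select(
        "customerId",
        col("order.orderId").alias("orderId"),
        col("item.productId").alias("productId"),
        col("item.productType").alias("productType"),
        col("item.grossMerchandiseValueEur").alias("grossMerchandiseValueEur"),
    )
)
flat.printSchema()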