Published 2023. 10. 1. 17:57

Splitting a List Column into Rows with Spark explode()


Suppose a Spark DataFrame has a column that holds a list, as shown below.

scala> val df = Seq(("Nam", List("A", "B", "C", "D"))).toDF("name", "grade")
df: org.apache.spark.sql.DataFrame = [name: string, grade: array<string>]

scala> df.show()
+----+------------+
|name|       grade|
+----+------------+
| Nam|[A, B, C, D]|
+----+------------+

In a case like this, you may need to split the grade column so that each element of the list gets its own row.
The explode() function does this: it expands a list into one row per element.

Pass the target column to explode() to expand it. As the output below shows, each element of the list ends up on its own row.

scala> df.select(explode(col("grade")).alias("score")).show()
+-----+
|score|
+-----+
|    A|
|    B|
|    C|
|    D|
+-----+
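
The select() above keeps only the exploded column, so the name value does not appear in the result. When each score needs to stay attached to its owner, the original column can be selected next to explode(). The following is a minimal sketch written as a standalone application rather than a spark-shell session; the object name ExplodeWithName, the helper explodeGrades, and the local master setting are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeWithName {
  // Hypothetical helper: keeps `name` and emits one row per element of `grade`.
  def explodeGrades(df: DataFrame): DataFrame =
    df.select(col("name"), explode(col("grade")).alias("score"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("explode-with-name")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the post.
    val df = Seq(("Nam", List("A", "B", "C", "D"))).toDF("name", "grade")

    // Each of the four grades becomes its own row, repeated next to "Nam".
    explodeGrades(df).show()

    spark.stop()
  }
}
```

With the sample data, the result should have two columns, name and score, and four rows, one per grade.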

Reference: official Spark documentation

https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.explode.html

pyspark.sql.functions.explode (PySpark 3.1.3 documentation)

Returns a new row for each element in the given array or map. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise.
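
The documentation linked above mentions default column names and map input, which the example in this post does not cover. The sketch below prints the resulting schemas for both cases. Note that the linked page describes the PySpark function; the Scala explode() in org.apache.spark.sql.functions should behave the same way. The object name ExplodeDefaults, the DataFrame mapDf, and its scores column are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object ExplodeDefaults {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("explode-defaults")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val arrayDf = Seq(("Nam", List("A", "B"))).toDF("name", "grade")
    // Without alias(), the exploded array column gets the default name `col`.
    arrayDf.select(explode(col("grade"))).printSchema()

    // Hypothetical map column: subject -> grade.
    val mapDf = Seq(("Nam", Map("math" -> "A", "english" -> "B"))).toDF("name", "scores")
    // Exploding a map produces two columns, `key` and `value`, one row per entry.
    mapDf.select(col("name"), explode(col("scores"))).printSchema()

    spark.stop()
  }
}
```

Using alias() as in the post is usually the clearer choice, since the default name col says nothing about what the column contains.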

