Getting Started with PySpark SQL


방황하는 데이터불도저 2024. 1. 4. 19:50

https://spark.apache.org/docs/latest/sql-getting-started.html#getting-started

Getting Started - Spark 3.5.0 Documentation

This post is based on the official documentation.

1. Starting Point : SparkSession

First, install Java and add its java/bin directory to your environment variables.

- Installing Java

- How to set the Java environment variable on Linux
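
Before creating a session, it can help to confirm that Java is actually visible to the Python process. The following is a minimal sketch using only the standard library; the helper name `check_java` is hypothetical and not part of PySpark.

```python
import os
import shutil


def check_java():
    """Report where (and whether) Java is visible to this Python process."""
    return {
        # JAVA_HOME, if set, should point at the Java install directory
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        # Full path of the `java` executable found on PATH, or None
        "java_on_path": shutil.which("java"),
    }


info = check_java()
for key, value in info.items():
    print(f"{key}: {value if value else 'not found'}")
```

If both entries come back as "not found", the environment variable step above has most likely not taken effect in the shell that launched Python.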

Once the setup is complete, create a session with the code below.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

2. Creating DataFrames

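With a SparkSession, you can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

That is how the documentation describes it, but there is no need to overthink it: the code below is enough to read my own JSON file.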

# Path to the JSON file
root_path = "./aihub/13.한국어글자체/"
textwild_json = root_path+"04._Text_in_the_wild_230209_add/textinthewild_data_info.json"

df = spark.read.json(textwild_json, multiLine=True)
df.printSchema()

# Output

root
 |-- annotations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attributes: struct (nullable = true)
 |    |    |    |-- class: string (nullable = true)
 |    |    |-- bbox: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- image_id: string (nullable = true)
 |    |    |-- text: string (nullable = true)
 |-- images: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- file_name: string (nullable = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- width: long (nullable = true)
 |-- info: struct (nullable = true)
 |    |-- date_created: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- licenses: array (nullable = true)
 |    |-- element: string (containsNull = true)
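
The `multiLine=True` argument in the read call above matters. By default, `spark.read.json` expects JSON Lines input, where each line is a self-contained JSON object, whereas a single document that spans many lines (as this file presumably does) has to be read in multi-line mode. The plain-Python sketch below only illustrates the difference between the two layouts. It does not use Spark, and the sample data and the helper name `parses_line_by_line` are made up for illustration.

```python
import json

# A pretty-printed document: one JSON object spread over several lines
multi_line_doc = json.dumps({"info": {"name": "sample"}, "licenses": ["a"]}, indent=2)

# JSON Lines: every line is a complete JSON object on its own
json_lines_doc = "\n".join(json.dumps({"id": i}) for i in range(3))


def parses_line_by_line(text):
    """Return True if every non-empty line is valid JSON by itself."""
    try:
        for line in text.splitlines():
            if line.strip():
                json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


print(parses_line_by_line(json_lines_doc))  # True
print(parses_line_by_line(multi_line_doc))  # False
```

The second layout is the one that cannot be split line by line, which is why the multi-line option is needed for it.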

My JSON file has a deeply nested structure, so the show() method from the tutorial below is not very useful for it. printSchema(), used above, is a convenient way to inspect the schema instead.

# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
