Python

[Python][TIL] ๋ฐ์ดํ„ฐ ๋ถ„์„์— ํ™œ์šฉ - numpy & pandas

breadz 2021. 7. 22. 14:58

๐Ÿ”” numpy ์— ๋Œ€ํ•ด์„œ ์•Œ์•„๋ณด์ž

#๏ธโƒฃ numpy import

import numpy as np

 

#๏ธโƒฃ 1์ฐจ์› ๋ฐฐ์—ด

d1 ์ด๋ผ๋Š” 1์ฐจ์› ๋ฐฐ์—ด์ด ์žˆ๋‹ค๊ณ  ํ•  ๋•Œ

d1 = np.array([1,2,3,4])

์ฐจ์› dimension ์€ d1.ndim ์œผ๋กœ, ํ–‰๋ ฌ shape ์€ d1.shape ๋กœ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค

print('dim = ' +str(d1.ndim), 'shape = '+str(d1.shape))

d1 ์ด๋ผ๋Š” ๋ฐฐ์—ด์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด๋„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค. โžก arange(1,5)

d1_1 = np.array(np.arange(1,5))

 

#๏ธโƒฃ 2์ฐจ์› ๋ฐฐ์—ด

2์ฐจ์› d2 ๋ฐฐ์—ด์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค. ๋Œ€๊ด„ํ˜ธ ์•ˆ์— ์žˆ๋Š” ๋ฐฐ์—ด ์š”์†Œ๋“ค์ด ํ•˜๋‚˜์˜ ํ–‰์— ๋“ค์–ด๊ฐ„๋‹ค.

d2 = np.array([[1,2,3,4],
               [5,6,7,8]])

dimension๊ณผ shape์„ ๊ตฌํ•ด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. โžก 2์ฐจ์›, 2ํ–‰ 4์—ด

print('dim = '+str(d2.ndim), 'shape = '+str(d2.shape))

d2 ๋ผ๋Š” ๋ฐฐ์—ด์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด๋„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

d2_1 = np.array([np.arange(1,5),
                 np.arange(5,9)])

 

#๏ธโƒฃ 3์ฐจ์› ๋ฐฐ์—ด

๋‹ค์Œ๊ณผ ๊ฐ™์ด 2์ฐจ์› ๋ฐฐ์—ด ๋‘ ๊ฐœ๋ฅผ ํ•ฉ์นœ ๊ฒƒ์ฒ˜๋Ÿผ ํ‘œํ˜„ํ•ด์„œ 3์ฐจ์› ๋ฐฐ์—ด์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.

d3 = np.array([[[1,2,3],[4,5,6],[7,8,9],[10,11,12]],
               [[21,22,23],[24,25,26],[27,28,29],[30,31,32]]])

index 0๊ณผ index 1

dimension ๊ณผ shape ์„ ๊ตฌํ•ด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

print('dim = '+str(d3.ndim), 'shape = '+str(d3.shape))

d3 ๋ฐฐ์—ด์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด๋„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

d3_1 =np.array([[np.arange(1,4),np.arange(4,7),np.arange(7,10),np.arange(10,13)],
               [np.arange(21,24),np.arange(24,27),np.arange(27,30),np.arange(30,33)]])

index 0๊ณผ index 1

 


๐Ÿ”” pandas ์— ๋Œ€ํ•ด์„œ ์•Œ์•„๋ณด์ž

#๏ธโƒฃ pandas import

import pandas as pd

 

#๏ธโƒฃ Series ๊ฐ์ฒด

- ์ผ์ฐจ์› ๋ฐฐ์—ด๊ณผ ๋‹ฌ๋ฆฌ ๊ฐ’๋ฟ ์•„๋‹ˆ๋ผ ๊ฐ ๊ฐ’์— ์—ฐ๊ฒฐ๋œ ์ธ๋ฑ์Šค ๊ฐ’๋„ ๋™์‹œ์— ์ €์žฅ

 

1. List ๋กœ Series ์ƒ์„ฑ

๋‹ค์Œ๊ณผ ๊ฐ™์€ list ๊ฐ€ ์žˆ๋‹ค.

data = ['2017', '2018', '2019', '2020']

Series ๋กœ ์ƒ์„ฑ์‹œ ๊ฐ ๊ฐ’์— ์—ฐ๊ฒฐ๋œ ์ธ๋ฑ์Šค๊ฐ€ ํ•จ๊ป˜ ๋ณ€์ˆ˜์— ์ €์žฅ๋œ๋‹ค.

se = pd.Series(data)

์ผ๋ฐ˜ list ์™€ Series ๋Š” ์ถœ๋ ฅ๋˜๋Š” ๊ฐ’์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‹ค๋ฅด๋‹ค. 

 

2. ์ธ๋ฑ์Šค๋ช… ๋ณ€๊ฒฝ

๋‹ค์Œ๊ณผ ๊ฐ™์ด index ๋ช…์„ ์ˆ˜์ •ํ•  ์ˆ˜๋„ ์žˆ๋‹ค.

se = pd.Series(data, index = ['a', 'b', 'c', 'd'])

 

2. Dictionary ๋กœ Series ์ƒ์„ฑ

Series๋Š” dictionary์™€ ์ƒ๊น€์ƒˆ๊ฐ€ ๋ฒŒ์จ ์œ ์‚ฌํ•˜๋‹ค.

๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋”•์…”๋„ˆ๋ฆฌ๊ฐ€ ์žˆ๋‹ค.

dic_data = {'kim':35000, 'e':60000, 'nok':40000}

Dictionary๋กœ Series ์ƒ์„ฑ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ•œ๋‹ค.

con_dictose = pd.Series(dic_data)

์›๋ž˜ ๋”•์…”๋„ˆ๋ฆฌ ๋ณ€์ˆ˜์™€ Series๋กœ ์ƒ์„ฑ๋œ ๋ณ€์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๊ฐ’์ด ๋‹ค๋ฅด๋‹ค.

 

2. ์ธ๋ฑ์Šค๋ช… ๋ณ€๊ฒฝ

์ด๊ฒƒ๋„ index ๋ช…์„ ๋งˆ์Œ๋Œ€๋กœ ์ˆ˜์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

con_dictose.index = ['A', 'B', 'C']

#๏ธโƒฃ Dataframe ๊ฐ์ฒด

- pandas์˜ ๊ธฐ๋ณธ ์ž๋ฃŒ๊ตฌ์กฐ๋กœ, row์™€ column์˜ ์ธ๋ฑ์Šค๋ช…์ด ์žˆ๋‹ค.

- Series๋ฅผ ๋ถ™์—ฌ ๋งŒ๋“  data table

- MS Office์˜ ์—‘์…€ ํ”„๋กœ๊ทธ๋žจ๊ณผ ์—ฐ๋™์ด ์‰ฝ๋‹ค

 

1. Dataframe ์ƒ์„ฑ

๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋”•์…”๋„ˆ๋ฆฌ ๋ณ€์ˆ˜๊ฐ€ ์žˆ๋‹ค.

data = {'name':['Lee', 'Hwang', 'Kim', 'Choi'],
        'score':[100, 95, 80, 85],
        'grade':['A', 'A', 'B', 'B']
        }

dataframe ์„ ์ƒ์„ฑ

df = pd.DataFrame(data)

 

2. ์ปฌ๋Ÿผ ์ˆœ์„œ ๋ณ€๊ฒฝ

๋‹ค์Œ๊ณผ ๊ฐ™์ด name, grade, score ์ˆœ์„œ๋กœ ์ปฌ๋Ÿผ์˜ ์ˆœ์„œ๋„ ์ˆ˜์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

df = pd.DataFrame(data, columns=['name', 'grade', 'score'])

 

3. ์ธ๋ฑ์Šค๋ช… ๋ณ€๊ฒฝ

์ธ๋ฑ์Šค๋ช…๋„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‘ ๊ฐ€์ง€ ๋ช…๋ น์–ด๋กœ ์ˆ˜์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

๋ฐฉ๋ฒ• 1)

df = pd.DataFrame(data, index = ['Lee', 'Hwang', 'Kim', 'Choi'])

๋ฐฉ๋ฒ• 2)

df.index = ['์ด', 'ํ™ฉ', '๊น€', '์ตœ']

4. ์ธ๋ฑ์Šค ๊ธฐ์ค€ ์ •๋ ฌ

์˜ค๋ฆ„์ฐจ์ˆœ ์ •๋ ฌ

df.sort_index(axis=0)

์ธ๋ฑ์Šค๋ช…์ด '์ดํ™ฉ๊น€์ตœ'์—์„œ '๊น€์ด์ตœํ™ฉ'์œผ๋กœ ๋ณ€๊ฒฝ๋๋‹ค

5. ์ปฌ๋Ÿผ ๊ธฐ์ค€ ์ •๋ ฌ

์˜ค๋ฆ„์ฐจ์ˆœ ์ •๋ ฌ

df.sort_values(by=['score'])

score ์ปฌ๋Ÿผ ๊ธฐ์ค€ ์˜ค๋ฆ„์ฐจ์ˆœ์œผ๋กœ ์ •๋ ฌ๋๋‹ค

๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ

df.sort_values(by=['score'], ascending=False)

score ์ปฌ๋Ÿผ ๊ธฐ์ค€ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ ฌ๋๋‹ค

์ปฌ๋Ÿผ ๋‘ ๊ฐœ ๊ธฐ์ค€ ์˜ค๋ฆ„์ฐจ์ˆœ ์ •๋ ฌ

df.sort_values(by=['grade','score'])

grade ์ปฌ๋Ÿผ ๊ธฐ์ค€์œผ๋กœ๋Š” ์˜ค๋ฆ„์ฐจ์ˆœ, score ์ปฌ๋Ÿผ ๊ธฐ์ค€์œผ๋กœ๋Š” ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ ฌ

df.sort_values(by=['grade','score'], ascending=[True, False])

 

6. ๊ทธ๋ฃนํ™” (SQL์˜ group by ๊ฐœ๋…)

๋‹ค์Œ๊ณผ ๊ฐ™์€ list๊ฐ€ ์žˆ๋‹ค.

student_list = [{'name': 'John', 'major': "Computer Science", 'sex':"male"},
                {'name': 'Nate', 'major': "Computer Science", 'sex':"male"},
                {'name': 'Abraham', 'major': "Physics", 'sex':"male"},
                {'name': 'Brian', 'major': "Psychology", 'sex':"male"},
                {'name': 'Janny', 'major': "Economics", 'sex':"female"},
                {'name': 'Yuna', 'major': "Economics", 'sex':"female"},
                {'name': 'Jeniffer', 'major': "Computer Science",'sex': "female"},
                {'name': 'Edward', 'major': "Computer Science",'sex': "male"},
                {'name': 'Zara', 'major': "Psychology", 'sex':"female"},
                {'name': 'Wendy', 'major': "Economics", 'sex':"female"},
                {'name': 'Sera', 'major': "Psychology", 'sex':"female"}]

์œ„์˜ list๋กœ dataframe ์ƒ์„ฑ

df = pd.DataFrame(student_list, columns = ['name', 'major', 'sex'])

major ๋ผ๋Š” ์ปฌ๋Ÿผ์„ ๊ธฐ์ค€์œผ๋กœ ๊ทธ๋ฃนํ™”ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

groupby_major = df.groupby('major')

data type์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. โžก pandas.core.groupby.groupby.DataFrameGroupBy

type(groupby_major)

๋ช‡ ๋ฒˆ์งธ ํ–‰๋“ค์ด ๊ทธ๋ฃน์œผ๋กœ ๋ฌถ์˜€๋Š”์ง€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ช…๋ น์–ด๋กœ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

groupby_major.groups

์ด๊ฑธ for ๋ฌธ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ…Œ์ด๋ธ” ํ˜•์‹์œผ๋กœ ์ถœ๋ ฅํ•ด๋ณด๋ฉด

for name, group in groupby_major:
    print(name + ": " + str(len(group)))
    print(group)
    print()

๊ฐ ๊ทธ๋ฃน์˜ ์š”์†Œ๋“ค์˜ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ช…๋ น์–ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

df_major_cnt = pd.DataFrame({'count' : groupby_major.size()}).reset_index()

 

#๏ธโƒฃ ์—‘์…€ ํŒŒ์ผ ์ฝ์–ด์˜ค๊ธฐ

Y2018 = pd.read_excel('์ €์žฅ๊ฒฝ๋กœ/์—‘์…€ ํŒŒ์ผ๋ช….xlsx', 'Y2018')
Y2019 = pd.read_excel('์ €์žฅ๊ฒฝ๋กœ/์—‘์…€ ํŒŒ์ผ๋ช….xlsx', 'Y2019')
Y2020 = pd.read_excel('์ €์žฅ๊ฒฝ๋กœ/์—‘์…€ ํŒŒ์ผ๋ช….xlsx', 'Y2020')

2018, 2019, 2020๋…„๋„ ๋‚ ์”จ ์ •๋ณด ์—‘์…€ ํŒŒ์ผ์„ ํŒŒ์ด์ฌ์œผ๋กœ ์ฝ์–ด์™”๋‹ค.

1. ์„œ๋กœ ๋‹ค๋ฅธ ์—‘์…€ ํŒŒ์ผ ํ•ฉ์น˜๊ธฐ (๋งˆ์น˜ SQL์˜ union all๊ณผ ๊ฐ™๋‹ค)

๋ฐฉ๋ฒ•1)

Y_total = pd.concat([Y2018,Y2019,Y2020], axis=0,ignore_index = True)

๋ฐฉ๋ฒ•2)

2018๋…„๋„์— 2019๋…„๋„๋ฅผ ํ•ฉ์น˜๊ณ , 2020๋…„๋„๋ฅผ ์ˆœ์„œ๋Œ€๋กœ ํ•ฉ์นœ๋‹ค.

Y_total2 = Y2018.append(Y2019,ignore_index = True)
Y_total2 = Y_total2.append(Y2020,ignore_index = True)

 

2. ์—‘์…€ ํŒŒ์ผ๋กœ ์ €์žฅํ•˜๊ธฐ (csv ํŒŒ์ผ์„ ์ œ๊ณตํ•ด์•ผ ํ•  ๋•Œ)

Y_total.to_csv('์ €์žฅ๊ฒฝ๋กœ/์—‘์…€ ํŒŒ์ผ๋ช….csv')

๋งŒ์•ฝ header ๋ฅผ ์ง€์šฐ๊ณ  ์‹ถ๋‹ค๋ฉด

Y_total.to_csv('์ €์žฅ๊ฒฝ๋กœ/์—‘์…€ ํŒŒ์ผ๋ช….csv', header = False, index =False)

 

3. ์ด์–ด์ง„ ํŠน์ • ํ–‰ ์ถ”์ถœ (ํŒŒ์ด์ฌ์˜ ์Šฌ๋ผ์ด์‹ฑ์ฒ˜๋Ÿผ)

# ์ฒซ ํ–‰๋ถ€ํ„ฐ 22ํ–‰๊นŒ์ง€๋งŒ ์ถ”์ถœ
a1 = Y_total[:23]
# 23ํ–‰๋ถ€ํ„ฐ 46ํ–‰๊นŒ์ง€ ์ถ”์ถœ
a2 = Y_total[23:47]

 

4. ์ด์–ด์ง€์ง€ ์•Š์€ ํŠน์ • ํ–‰ ์ถ”์ถœ

# ๋งจ ์œ—์ค„ ๋ฝ‘์•„
b1 = Y_total.loc[0]
# 0๋ฒˆ 4๋ฒˆ์ค„ ๋ฝ‘์•„
b2 = Y_total.loc[[0,4]]

 

5. ํ•„ํ„ฐ๋ง (SQL์˜ where์ ˆ์— ์กฐ๊ฑด ์“ฐ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ)

d1 = Y_total[Y_total.๊ธฐ์˜จ >= 35]
d2 = Y_total[Y_total.๊ธฐ์˜จ <= -15]
d3 = Y_total[(Y_total.๊ธฐ์˜จ >=35)&(Y_total.์ง€๋ฉด์˜จ๋„ >=40)]

d4 = Y_total[Y_total.์ผ์‹œ == '2018-01-01 01:00:00']

๋งŒ์•ฝ ์ปฌ๋Ÿผ์ด ๋‚ ์งœ๊ด€๋ จ์ด๋ผ๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์—ฐ๋„, ์›”, ์ผ, ์‹œ๊ฐ„์„ ์žก์•„๋†“์„ ์ˆ˜ ์žˆ๋‹ค.

Y_total['Year'] = Y_total['์ผ์‹œ'].dt.year 
Y_total['Month'] =Y_total['์ผ์‹œ'].dt.month 
Y_total['Day'] =Y_total['์ผ์‹œ'].dt.day 
Y_total['time'] =Y_total['์ผ์‹œ'].dt.time 
Y_total['hour'] =Y_total['์ผ์‹œ'].dt.hour

๊ณ ๋ ‡๊ฒŒ ์žก์•„๋†“๊ณ  11์›” 3์ผ์ž์˜ ๋ฐ์ดํ„ฐ๋“ค๋งŒ ์ถœ๋ ฅํ•  ์ˆ˜ ์žˆ๋‹ค.

d5 = Y_total[(Y_total.Month == 11)&(Y_total.Day == 3)]

 

6. merge (SQL์˜ join์ฒ˜๋Ÿผ)

์•„๋ž˜ ๋‘ ์—‘์…€ํŒŒ์ผ์€ state ์ปฌ๋Ÿผ์ด ๋™์‹œ์— ์กด์žฌํ•œ๋‹ค. state ์ปฌ๋Ÿผ์„ ์ธ๋ฑ์Šค๋กœ ๊ฐ€์ ธ์™€ dataframe์œผ๋กœ ์ €์žฅํ•œ๋‹ค.

alco = pd.read_csv("์ €์žฅ๊ฒฝ๋กœ/์—‘์…€ ํŒŒ์ผ๋ช….csv", index_col="State")
popu = pd.read_csv("์ €์žฅ๊ฒฝ๋กœ/์—‘์…€ ํŒŒ์ผ๋ช….csv", index_col="State")

 

state ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‘ ์—‘์…€ํŒŒ์ผ์„ merge ํ•œ๋‹ค.

us_mer1 = pd.merge(alco.reset_index(), popu.reset_index()).set_index("State")

 

7. ์ธํ„ฐ๋„ท ์ฃผ์†Œ๋กœ ์—‘์…€ ํŒŒ์ผ ๊ฐ€์ ธ์˜ค๊ธฐ

 

#๏ธโƒฃ ๊ทธ๋ž˜ํ”„ ํ˜•์ƒํ™”

matplotlib.pyplot import

# ๊ทธ๋ž˜ํ”„ ๊ทธ๋ ค์ฃผ๋Š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
import matplotlib.pyplot as plt
# figure๋ฅผ ์ž๋™์œผ๋กœ ์œˆ๋„์šฐ ์ฐฝ์œผ๋กœ ๋„์šฐ๋Š” ๋ช…๋ น์–ด
%matplotlib auto

 

๋‹ค์Œ๊ณผ ๊ฐ™์ด x ์™€ y ๋ณ€์ˆ˜์—, ์—‘์…€๋กœ ๊ฐ€์ ธ์˜จ sheet ์˜ ํŠน์ • ์ปฌ๋Ÿผ๋“ค๋งŒ ์ถ”์ถœํ•ด์„œ ์ €์žฅํ•œ๋‹ค.

x = Y_total['์ผ์‹œ']
y1 = Y_total['๊ธฐ์˜จ']
y2 = Y_total['์ง€๋ฉด์˜จ๋„']
y3 = Y_total['ํ’์†']

 

1. ๊ทธ๋ž˜ํ”„ ํ˜•์ƒํ™”

plt.figure()

์œ„์˜ ํ•จ์ˆ˜๋ฅผ ์‹คํ–‰ํ•˜๋ฉด Figure์˜ ์œˆ๋„์šฐ ์ฐฝ์ด ์ƒˆ๋กญ๊ฒŒ ์ƒ์„ฑ๋œ๋‹ค.

 

plt.plot(x,y1*0.5)
plt.plot(x,y2)

๋‘ ์ปฌ๋Ÿผ ๊ฐ’์„ ๊ฐ๊ฐ x์ถ•, y์ถ•์œผ๋กœ ์žก๊ณ  ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ฆฌ๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ทธ๋ ค์งˆ ์ˆ˜ ์žˆ๋‹ค.

1. ์ผ์‹œ์™€ ๊ธฐ์˜จ์˜ 0.5๋ฐฐ / 2. ์ผ์‹œ์™€ ์ง€๋ฉด์˜จ๋„์˜ ๊ทธ๋ž˜ํ”„ ํ˜•์ƒํ™”

 

2. ๊ฒฉ์ž๋ฌด๋Šฌ

plt.grid()

 

3. ์ถ•์ด๋ฆ„ ์ถ”๊ฐ€

plt.xlabel('time')
plt.ylabel('temp')