KT AICE Associate Special Lecture, Day 3 - Data Analysis, from Preprocessing to Deep Learning.

2023. 7. 12. 13:22 · Certifications/KT-AICE Associate

The modeling process.

Load the data
Analyze the data
Split into X and y
Machine learning modeling
Deep learning modeling
Evaluate deep learning performance

iris = sns.load_dataset('iris')

This loads the iris dataset. Did we originally load it from seaborn, though? I don't think so. Either way, it's convenient.

dir(iris)

Running this lists the attributes and methods available on the iris variable.

(iris['species'].value_counts()).plot(kind='bar')

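Pretty neat. With the data loaded through seaborn, calling value_counts() and then plot is all it takes to draw the chart. The outer parentheses around iris['species'].value_counts() are optional, though.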

Label encoding step.

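from sklearn.preprocessing import LabelEncoder

# y is assumed to hold the string labels (for example iris['species'])
le = LabelEncoder()
y = le.fit_transform(y)

le.classes_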
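
To see what the encoder is doing, here is a pure-Python sketch of the same idea: collect the unique labels, sort them, and replace each label with its position in that sorted list. The sample labels are made up, and this is an illustration rather than scikit-learn's actual implementation.

```python
# Pure-Python sketch of label encoding (illustration only, not scikit-learn's code).
labels = ["setosa", "virginica", "versicolor", "setosa"]  # made-up sample labels

# Sorted unique labels play the role of le.classes_.
classes = sorted(set(labels))

# Each label is replaced by its index in the sorted class list.
class_to_index = {name: index for index, name in enumerate(classes)}
encoded = [class_to_index[name] for name in labels]

print(classes)  # ['setosa', 'versicolor', 'virginica']
print(encoded)  # [0, 2, 1, 0]
```

Because the classes are sorted, the integer assigned to a label depends on alphabetical order, not on the order in which labels appear in the data.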

Decision tree practice.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

dt = DecisionTreeClassifier()
dt.fit(X, y)
dt.score(X, y)

If you don't know something, look it up.

1. Loading the data

!pip install seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

boston = pd.read_csv('boston.csv')

# Check the first 5 rows of the loaded data
boston.head(5)

# Check the descriptive statistics of the boston data: mean, standard deviation, min, max, and so on for each column
boston.describe()

2. Data analysis

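# Look at the continuous MEDV (house price) column
boston['MEDV']

# Use histplot() to check the distribution of the MEDV house price column as a histogram
sns.histplot(boston['MEDV'])

# Scatter plot of RM against MEDV with matplotlib's scatter (the pandas equivalent is boston.plot.scatter(x='RM', y='MEDV'))
plt.scatter(x='RM', y='MEDV', data=boston)

# Show the values of the four columns ["MEDV", "RM", "AGE", "CHAS"]
boston[["MEDV", "RM", "AGE", "CHAS"]]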

# pairplot function (it creates its own figure, so a separate plt.figure() call is not needed)
sns.pairplot(boston[["MEDV", "RM", "AGE", "CHAS"]])
plt.show()

# Correlation matrix: corr() expresses the relationships between columns as numbers
boston[["MEDV", "RM", "AGE", "CHAS"]].corr()

# Use seaborn's heatmap to check the relationships between columns
# (the figure size and font scale must be set before drawing the heatmap)
plt.figure(figsize=(10,10))
sns.set(font_scale=0.6)
sns.heatmap(data=boston[["MEDV", "RM", "AGE", "CHAS"]].corr(), annot=True, cmap='Blues')

# Which column has the highest correlation with the MEDV house price?
# Use corr() > take the MEDV column (excluding MEDV itself) > absolute value > sort in descending order
boston.corr()['MEDV'].drop('MEDV').abs().sort_values(ascending=False)
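
As a reminder of what corr() reports by default, below is a small sketch of the Pearson correlation coefficient written by hand. The numbers are made up and pearson_r is a hypothetical helper name, not a pandas function. It also shows why abs() is applied before sorting: a strong negative relationship is just as informative as a strong positive one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: one target rises with the feature, the other falls.
rooms = [4.0, 5.0, 6.0, 7.0]
price_up = [10.0, 20.0, 30.0, 40.0]
price_down = [40.0, 30.0, 20.0, 10.0]

r_up = pearson_r(rooms, price_up)      # perfectly linear, positive
r_down = pearson_r(rooms, price_down)  # perfectly linear, negative
print(r_up, r_down)
print(abs(r_up) == abs(r_down))        # both are equally strong relationships
```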

3. Splitting X and y

# Look at the last 5 rows of the Boston data
boston.tail(5)

# Separate X: use the pandas drop function
X = boston.drop('MEDV', axis=1)

# Separate y: take only the 'MEDV' column
y = boston['MEDV']

# Convert the DataFrame and Series to numpy arrays
X = X.values
y = y.values

# MinMaxScaler scaling
from sklearn.preprocessing import MinMaxScaler

# 1. Define the MinMaxScaler as mmx, then fit it and transform X
mmx = MinMaxScaler()
X = mmx.fit_transform(X)

# Check the shape of the training data
X.shape
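
For reference, min-max scaling maps each column to the range 0 to 1 with (x - min) / (max - min). The sketch below applies that formula to one made-up column in plain Python; min_max_scale is a hypothetical helper, and returning zeros for a constant column is a choice made for this sketch.

```python
def min_max_scale(column):
    """Scale one column to [0, 1] using (x - min) / (max - min)."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # A constant column has no spread; this sketch returns zeros for it.
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]

rm = [4.0, 5.0, 6.0, 8.0]  # made-up values for a single feature column
scaled = min_max_scale(rm)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always becomes 0 and the largest becomes 1, so features measured on very different scales end up comparable.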

4. Machine learning modeling

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor

# 1. Define the DecisionTreeRegressor model -> store it in dt
dt = DecisionTreeRegressor()
dt.fit(X,y)
dt.score(X,y)

# 1. Define the RandomForestRegressor model -> store it in rf
rf = RandomForestRegressor()
rf.fit(X,y)
rf.score(X,y)
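
For regressors, score() returns the coefficient of determination R². The sketch below computes it by hand on made-up numbers; r_squared is a hypothetical helper name. Keep in mind that scoring on the same data used for training is optimistic: a fully grown decision tree can memorize the training set, so a value close to 1.0 there says little about unseen data.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0]  # made-up targets

perfect = r_squared(y_true, [10.0, 20.0, 30.0])   # every prediction is exact
baseline = r_squared(y_true, [20.0, 20.0, 20.0])  # always predicts the mean
print(perfect, baseline)  # 1.0 0.0
```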

# Print the 150th sample (index 149) and its target
print(X[149])
print(y[149])

# Feed the 150th sample into the model to get a prediction
# Use the predict function of the rf model
pred = rf.predict([X[149]])
print(pred)

5. Deep learning modeling

# Import the libraries needed for deep learning
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Build a Sequential model --> store it in the model variable
# input layer : (13, )
# hidden layer : 6 units, activation='relu'
# output layer : 1 unit
model = Sequential()
model.add(Dense(6, activation='relu', input_shape = (13,)))
model.add(Dense(1))
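
A quick sanity check on this architecture is to count its trainable parameters by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper below is a hypothetical name used only for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(13, 6)  # 13 input features -> 6 hidden units
output = dense_params(6, 1)   # 6 hidden units -> 1 output unit
total = hidden + output

print(hidden, output, total)  # 84 7 91
```

The total should match the "Total params" figure that model.summary() prints for this model.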

# Compile the model : compile
model.compile(loss = 'mse', 
              optimizer = 'adam', 
              metrics = ['mse','mae'])

# Train the model : fit
# X, y, epochs=10, batch_size=8
# Store the training result : history
history = model.fit(X, y, epochs=10, batch_size=8)

# Train again with more epochs : fit
history = model.fit(X, y, epochs=50, batch_size=8)
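
To get a feel for what epochs and batch_size mean together, the sketch below counts the weight updates involved. It assumes the standard 506-row Boston housing data; the notes never print X.shape, so treat n_samples as an assumption and substitute the real row count.

```python
import math

n_samples = 506   # assumption: row count of the standard Boston housing data
batch_size = 8
epochs = 50

# Each epoch walks through the data once; the last batch may be smaller.
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 64
print(total_updates)    # 3200
```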

6. Evaluating deep learning performance

plt.plot(history.history['loss'], 'r')
plt.plot(history.history['mse'], 'b')
plt.title('Loss and MSE')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend(["Loss", "MSE"])
plt.show()
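
Since the model was compiled with loss='mse', the 'loss' and 'mse' curves track the same quantity and lie on top of each other; the 'mae' entry in history.history is the one that adds new information. As a reminder of what the two metrics measure, here is a hand computation on made-up values.

```python
y_true = [3.0, 5.0, 7.0]  # made-up targets
y_pred = [2.0, 5.0, 9.0]  # made-up predictions

errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean squared error punishes large misses more heavily.
mse = sum(e ** 2 for e in errors) / len(errors)
# Mean absolute error stays in the same unit as the target.
mae = sum(abs(e) for e in errors) / len(errors)

print(mse)  # (1 + 0 + 4) / 3
print(mae)  # (1 + 0 + 2) / 3 = 1.0
```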

Data cleansing

# Strip the reporting markers from the text (regex=False treats each pattern as a literal string)
df['text'] = df['text'].str.replace('MMS스팸신고', '', regex=False)
df['text'] = df['text'].str.replace('[Web발신]', '', regex=False)
df['text'] = df['text'].str.replace('[WEB발신]', '', regex=False)
df['text'] = df['text'].str.replace('KISA신고메시지', '', regex=False)
df['text'] = df['text'].str.replace('KISA신고', '', regex=False)
df.head()
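
The same cleanup can be sketched with the standard re module in a single pass. Square brackets are special characters in a regular expression, so each marker goes through re.escape first. The sample message and the strip_markers helper are made up for illustration.

```python
import re

MARKERS = ["MMS스팸신고", "[Web발신]", "[WEB발신]", "KISA신고메시지", "KISA신고"]

# Longer markers come first so "KISA신고메시지" is matched before its prefix "KISA신고".
ordered = sorted(MARKERS, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(marker) for marker in ordered))

def strip_markers(text):
    """Remove every reporting marker from the text."""
    return pattern.sub("", text)

sample = "[Web발신]KISA신고메시지 free coupon"  # made-up message
cleaned = strip_markers(sample)
print(repr(cleaned))  # ' free coupon'
```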

# Drop missing values
df = df.dropna(axis=0)
df.info()
df['length'] = df['text'].apply(len)

# Drop duplicate rows
df = df.drop_duplicates(keep='first', ignore_index=True)
df.info()

2. Tokenization and dictionary creation

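# Tokenize the text into nouns
# mecab is assumed to be a morphological analyzer created earlier,
# for example with konlpy: from konlpy.tag import Mecab; mecab = Mecab()
token_list = []
for text in df['text']:
    token_list.extend(mecab.nouns(text))

print(token_list[:20])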

# Sort by frequency in descending order
# Import the Counter class from the collections module.
from collections import Counter
vocab_collection = Counter(token_list)
vocab = vocab_collection.most_common()  # with no argument, returns every word, most frequent first
print(vocab[:20])

# Build the dictionary: each word maps to its frequency rank, starting at 1
nouns_dic = {}
i = 0
for (word, frequency) in vocab :
    i += 1
    nouns_dic[word] = i
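
Here is the same counting and indexing idea on a handful of made-up tokens, written compactly with enumerate. Starting the ranks at 1 keeps index 0 free, which is convenient because padding fills sequences with 0 later on.

```python
from collections import Counter

tokens = ["광고", "무료", "광고", "통화", "광고", "무료"]  # made-up tokens

# most_common() with no argument returns all words, most frequent first.
vocab = Counter(tokens).most_common()

# Rank 1 goes to the most frequent word; index 0 stays unused.
word_to_index = {word: rank for rank, (word, _count) in enumerate(vocab, start=1)}

print(vocab)          # [('광고', 3), ('무료', 2), ('통화', 1)]
print(word_to_index)  # {'광고': 1, '무료': 2, '통화': 3}
```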
    
# Check the resulting dict
print(nouns_dic['통화'])
print(nouns_dic['안녕'])
print(nouns_dic['광고'])
print(nouns_dic['무료'])

# Check the dictionary size, then save the data
nouns_word_size = len(nouns_dic)
print("word_size : ", nouns_word_size)

import pickle
with open("nouns_dic", "wb") as file:
    pickle.dump(nouns_dic, file)
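
The saved dictionary is only useful if it can be read back at prediction time. The sketch below does a round trip through a temporary directory with a made-up two-word dictionary; in the real notebook the path would be the "nouns_dic" file written above.

```python
import os
import pickle
import tempfile

nouns_dic = {"광고": 1, "무료": 2}  # made-up stand-in for the real dictionary

# Write into a temporary directory so the sketch leaves the working folder alone.
path = os.path.join(tempfile.mkdtemp(), "nouns_dic")
with open(path, "wb") as file:
    pickle.dump(nouns_dic, file)

# Read it back in binary mode, mirroring how it was written.
with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored == nouns_dic)  # True
```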

3. Encoding the spam data (X values)

#refactoring
nouns_x = list(map(lambda text: list(map(lambda x : nouns_dic[x], mecab.nouns(text))), df['text']))
nouns_x[0:10]
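
One thing to watch: nouns_dic[x] raises a KeyError for any word that was not in the text used to build the dictionary, which will happen as soon as a new message is encoded. A possible workaround, sketched below with a made-up dictionary, is dict.get with a fallback index. Using 0 as the fallback is an assumption of this sketch; it shares the index with padding, and simply dropping unknown words is another option.

```python
nouns_dic = {"광고": 1, "무료": 2, "통화": 3}  # made-up dictionary
UNKNOWN_INDEX = 0  # hypothetical choice for words missing from the dictionary

def encode(tokens, vocab):
    """Map each token to its index, falling back to UNKNOWN_INDEX."""
    return [vocab.get(token, UNKNOWN_INDEX) for token in tokens]

encoded = encode(["무료", "쿠폰", "광고"], nouns_dic)  # "쿠폰" is not in the dictionary
print(encoded)  # [2, 0, 1]
```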

4. Padding

max_length = max(len(item) for item in nouns_x)
print("max_length : ", max_length)

from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_x = pad_sequences(nouns_x, maxlen=max_length)
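
With its default settings, pad_sequences adds zeros at the front of short sequences and cuts long ones from the front. The pure-Python sketch below mimics that default behavior on made-up sequences; pad_pre is a hypothetical helper that assumes maxlen is at least 1 and returns plain lists rather than a numpy array.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of long sequences."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]           # drop items from the front if too long
        filler = [value] * (maxlen - len(kept))
        padded.append(filler + kept)         # zeros go in front of the data
    return padded

sequences = [[5, 3], [7], [1, 2, 3, 4]]  # made-up encoded messages
result = pad_pre(sequences, maxlen=3)
print(result)  # [[0, 5, 3], [0, 0, 7], [2, 3, 4]]
```

In the notes, maxlen is the longest sequence in the data, so nothing is actually truncated there; every shorter message just gets zeros in front.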

5. Encoding the label data

y = df.apply(lambda r: 0 if r.label == 'ham' else 1 if r.label =='spam' else r.label, axis = 1)
y.value_counts()
