[파이썬, Python] 머신러닝 - 5️⃣ 랜덤 포레스트(Random Forest)

728x90

SMALL

📄예제에 사용한 파일

1. hotel 데이터셋 살펴보기

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

hotel_df = pd.read_csv('/content/drive/MyDrive/KDT/Python/4.머신러닝 딥러닝/hotel.csv')

hotel_df.head()

✅ hotel_df의 정보 보기

hotel_df.info()

✅ 모델 학습에 필요 없는 피쳐 삭제하기

hotel_df.drop(['credit_card', 'name', 'email', 'phone-number', 'reservation_status_date'], axis=1, inplace=True)

✅ 수치형 컬럼의 정보 알아보기

hotel_df.describe()

✅ 예약 시점으로부터 체크인 될 때까지의 기간의 데이터가 얼마나 있는지 시각화해서 알아보기

sns.displot(hotel_df['lead_time'])

✅ 이상치 확인을 위해 boxplot으로 데이터 분포 알아보기

sns.boxplot(y=hotel_df['lead_time'])

▲ 700 이상의 데이터들은 연속적이지 않게 분포하기 때문에 이상치라고 판

✅ 예약 방법에 따른 취소여부의 데이터 수를 알아보기

sns.barplot(x=hotel_df['distribution_channel'], y=hotel_df['is_canceled'])

✅ 예약방법에 대하여 중복값을 제외한 데이터의 종류와 개수 알아보기

hotel_df['distribution_channel'].value_counts()

✅ 호텔 종류에 따른 취소 여부 알아보기

sns.barplot(x=hotel_df['hotel'], y=hotel_df['is_canceled'])

✅ 도착 예정 연도에 따른 취소 여부 알아보기

sns.barplot(x=hotel_df['arrival_date_year'], y=hotel_df['is_canceled'])

✅ 도착 예정 월에 대한 취소 여부 알아보기

plt.figure(figsize=(15,5))
sns.barplot(x=hotel_df['arrival_date_month'], y=hotel_df['is_canceled'])

✅ x축의 범위를 1월부터 12월까지 정렬하여 나타내기

import calendar

print(calendar.month_name[1])  # 1월에 있는 달에 대한 이름이 나옴
print(calendar.month_name[2])  # 2월에 있는 달에 대한 이름이 나옴
print(calendar.month_name[3])  # 3월에 있는 달에 대한 이름이 나옴

>>> January
February
March

# 월 별 이름을 담을 리스트
months = []

# 1~12까지 반복
for i in range(1, 13):
  months.append(calendar.month_name[i])
  
months
>>> ['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

✅ x축의 범위에 months 리스트로 정렬하기

plt.figure(figsize=(15,5))
sns.barplot(x=hotel_df['arrival_date_month'], y=hotel_df['is_canceled'], order=months)  # months(1~12월)의 순서대로 정렬

✅ 취소했다가 다시 예약한 적이 있는지에 대한 여부에 따른 취소 여부 알아보기

sns.barplot(x=hotel_df['is_repeated_guest'], y=hotel_df['is_canceled'])

✅ 요금 납부 방식에 따른 취소 여부 알아보기

sns.barplot(x=hotel_df['deposit_type'], y=hotel_df['is_canceled'])

✅ 'deposil_type'의 중복값을 제외한 데이터와 데이터 개수 알아보기

hotel_df['deposit_type'].value_counts()

✅ 각 피쳐들간의 상관관계에 대해 알아보기 위해 heatmap 이용하기

sns.heatmap()

coolwarm : 빨강, 파랑을 같이 쓰는 팔레트, 0보다 크면 빨강색, 작으면 파랑색으로 표시
vmax/vmin: -1~1까지 값으로 정규화
annot=True: 네모칸 안에 정규화 값을 작성

plt.figure(figsize=(15,15))
sns.heatmap(hotel_df.corr(), cmap='coolwarm', vmax=1, vmin =-1, annot=True)

▲ lead_time과 is_canceled는 정비례하는 연관성이 유의미하게 있고, arrival_date_year과 arrival_date_week_number은 반비례하는 연관성이 있다고 볼 수 있음

✅ hotel_df에 대한 정보 다시 알아보기

hotel_df.info()

✅ NaN값의 비율이 작기 때문에 NaN값은 제거 후 재저장하기

hotel_df = hotel_df.dropna()

✅ 어른들이 없이 예약된 데이터 알아보기(adult의 값이 0)

hotel_df[hotel_df['adults']  == 0]

✅ 성인, 아이, 아기를 포함하는 인원수라는 파생변수를 만들

hotel_df['people'] = hotel_df['adults'] + hotel_df['children'] + hotel_df['babies']
hotel_df.head()

✅ 인원수가 0명인 데이터를 삭제하기

hotel_df[hotel_df['people']==0]
hotel_df = hotel_df[hotel_df['people']!=0]  # 인원수가 0이 아닌 데이터만 저장

✅ 며칠을 묵었는지(total_nights)에 대한 파생변수를 만들기

hotel_df['total_nights'] = hotel_df['stays_in_week_nights'] + hotel_df['stays_in_weekend_nights']

✅ total_nights가 0인 데이터 알아보기

hotel_df[hotel_df['total_nights']==0]

▲ 숙박이 아니라 시간단위로 머물렀을 수도 있기 때문에 데이터를 삭제하지 않음

✅ 계절에 따른 취소여부를 알아보기 위해 월을 기준으로 계절(season)이라는 파생변수를 만들기

hotel_df['arrival_date_month'].apply(lambda x: 'spring' if x in ['March', 'April', 'May'] else 'summer' if x in ['June', 'July', 'August'] else 'fall' if x in ['September', 'October', 'November'] else 'winter')

✅ dictionary를 이용하여 파생변수 만들기

season_dic = {'spring':[3,4,5], 'summer':[6,7,8], 'fall':[9,10,11], 'winter':[12,1,2]}

new_season_dic = {}

for i in season_dic:  # season_dic의 길이만큼(4번) 반복하면서
  for j in season_dic[i]:  # 딕셔너리의 각 키에 대해 값들을 나타내는 리스트에 대해 반복문을 수행, j는 각 키값에 대한 value값을 나타냄
    # print(j)
    new_season_dic[calendar.month_name[j]] = i
    
new_season_dic

# map(): 시리즈(Series) 객체나 데이터프레임(DataFrame) 객체의 각 요소에 대해 지정된 함수를 적용하여 새로운 값을 반환하는 메서드
# new_season_dic 딕셔너리를 참조하여 해당하는 계절로 매핑
hotel_df['season'] = hotel_df['arrival_date_month'].map(new_season_dic)
hotel_df.head()

✅ 예약한 방의 타입과 배정된 방의 타입이 같으면 True, 아니면 False -> 0, 1로 변환하기

hotel_df['expected_room_type'] = (hotel_df['reserved_room_type'] == hotel_df['assigned_room_type']).astype(int)

✅ hotel_df 정보보기

hotel_df.info()

✅ 취소율(이전의 취소 횟수 / 취소된 횟수+취소안된 횟수) 파생변수 만들기

# 이전 예약에서 얼마나 많은 비율로 취소되었는지를 나타내는 지표
hotel_df['cancel_rate'] = hotel_df['previous_cancellations'] / (hotel_df['previous_cancellations'] + hotel_df['previous_bookings_not_canceled'])

✅ 취소율이 NaN인 데이터 ➡ 처음 방문한 사람들

hotel_df[hotel_df['cancel_rate'].isna()]

✅ 방문한 적은 있지만 취소한적이 없는 경우

hotel_df[~hotel_df['cancel_rate'].isna()]

✅ 취소한적이 한번이라도 있는경우

hotel_df[hotel_df['cancel_rate']>0]

✅ NaN의 경우 별도의 카테고리로 만들기 위해 -1로 채우기

hotel_df['cancel_rate'] = hotel_df['cancel_rate'].fillna(-1)
hotel_df.info()

✅ country 열의 데이터 타입 알아보기

hotel_df['country'].dtype  # dtype('O'): Object
>>> dtype('O')

✅ 'people' 열의 데이터 타입 알아보기

hotel_df['people'].dtype  
>>> dtype('float64')

✅ 데이터 타입이 object 인 컬럼을 list에 추가하기

obj_list = []

for i in hotel_df.columns:
  if hotel_df[i].dtype == 'O': # 컬럼들 중 dtype이 'O'인 경우
    obj_list.append(i)         # 리스트에 추가

✅ 데이터 타입이 object인 컬럼의 중복값을 제외한 데이터의 개수 알아보기

for i in obj_list:
  print(i, hotel_df[i].nunique())

✅ 'meal' 열의 중복값을 제외한 데이터와 개수 알아보기

hotel_df['meal'].value_counts()

✅ country와 meal 열을 제거

hotel_df.drop(['country', 'meal'], axis=1, inplace=True)

✅ obj_list에서도 제거

obj_list.remove("country")
obj_list.remove("meal")

✅ 원핫인코딩 하기

hotel_df = pd.get_dummies(hotel_df, columns=obj_list)
hotel_df.head()

✅ 학습데이터와 검증데이터 분리하기

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(hotel_df.drop('is_canceled', axis=1), hotel_df['is_canceled'], test_size=0.3, random_state=10)

2. 랜덤 포레스트(Random Forest)

Decision Tree는 매우 훌륭한 모델이지만 학습 데이터에 오버피팅(과대적합) 하는 경향이 있음(가지치기 같은 방법을 통해 부작용을 최소화할 수 있지만 부족함)
학습을 통해 구성해 놓은 다수의 나무들(Decision Tree)로 부터 분류 결과를 취합해서 결론을 얻는 방식의 모델
Decision Tree 기반의 Bagging 앙상블 모델
굉장히 인기 있는 모델이며, 사용성이 쉽고 성능도 꽤 우수한편

2-1. 앙상블(Ensemble) 모델

여러개의 머신러닝 모델을 이용해 최적의 답을 찾아내는 기법

1) 보팅(Voting)

다른 알고리즘 model을 조합해서 사용
모델에 대해 투표로 결과를 도출
분류를 할 때 voting이라는 하이퍼 파라미터를 사용

2) 배깅(Bagging)

같은 알고리즘 내에서 다른 sample 조합을 사용
샘플 중복 생성을 통해 결과를 도출
여러개의 dataset을 중첩을 허용하여 샘플링하고 분할하는 방식

3) 부스팅(Boosting)

이전 오차를 보완해가면서 가중치를 부여
성능이 매우 우수하지만, 학습을 여러번 하기 때문에 잘못된 레이블이나 아웃라이어에 대해 필요 이상으로 민감(데이터가 한쪽으로 쏠려서 아웃라이어쪽으로 학습하게 되면 이상한 결과가 도출될 수 있음)

4) 스태킹(Stacking)

여러 모델을 기반으로 예측된 결과를 통해 meta 모델(성능 좋은 모델)이 다시 한번 예측
성능을 극으로 끌어올릴 때 활용하지만 과대적합을 유발할 수 있음(특히 데이터셋이 적은 경우)

✅ 랜덤 포레스트 분류 실습

from sklearn.ensemble import RandomForestClassifier

# 객체 생성
rf = RandomForestClassifier()

# 학습
rf.fit(X_train, y_train)

# 예측
pred1 = rf.predict(X_test)

pred1
>>> array([0, 0, 0, ..., 0, 0, 0])

# 분류된 클래스 가능성
proba1 = rf.predict_proba(X_test)
proba1
>>> array([[1.  , 0.  ],
       [1.  , 0.  ],
       [0.73, 0.27],
       ...,
       [1.  , 0.  ],
       [0.88, 0.12],
       [0.97, 0.03]])
       
       
# 첫번째 테스트 데이터에 대한 예측 결과
proba1[0]
>>> array([1., 0.])

# 모든 테스트 데이터에 대한 호텔 예약을 취소할 확률만 출력
proba1[:, 1]   # 모든행, 1번열
>>> array([0.  , 0.  , 0.27, ..., 0.  , 0.12, 0.03])

3. ROC Curve

이진 분류의 성능을 측정하는 도구
민감도와 특이도로 그려지는 곡선을 의미
FPR(False Positive Rate)
- 특이도(거짓 양성 비율)
- FP / TN + FP
- 실제값은 음성이지만 양성으로 잘못 분류
TPR(True Positive Rate)
- 민감도(참인 양성 비율)
- TP / FN + TP
- 실제로 양성이고 양성으로 분류

4. AUC(Area Under th ROC Curve)

ROC 커브와 직선 사이의 면적을 의미
AUC값의 범위는 0.5 ~ 1이며, 값이 클수록 예측의 정확도가 높음

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

accuracy_score(y_test, pred1)  
>>> 0.8642297650130548   # 데이터가 한쪽으로 쏠리진 않음

# 혼돈 행렬
confusion_matrix(y_test, pred1)
>>> array([[20821,  1537],
       [ 3299,  9962]])
       
# 리포트로 보기
print(classification_report(y_test, pred1))

# roc_auc_score(모든 데이터에서 참일 확률을 넣어야함!)
roc_auc_score(y_test, proba1[:,1])

✅ 하이퍼 파라미터 튜닝 - max_depth=30을 적용했을 때

rf2 = RandomForestClassifier(max_depth=30, random_state=10)
rf2.fit(X_train, y_train)
proba2 = rf2.predict_proba(X_test)
roc_auc_score(y_test, proba2[:, 1])

>>> 0.9320285483491656

✅ 하이퍼 파라미터 튜닝 - max_depth=50을 적용했을 때

rf2 = RandomForestClassifier(max_depth=50, random_state=10)
rf2.fit(X_train, y_train)
proba2 = rf2.predict_proba(X_test)
roc_auc_score(y_test, proba2[:,1])

>>> 0.9303745518246758

✅ 하이퍼 파라미터 튜닝 - min_samples_split=5 로 적용했을 때

rf2 = RandomForestClassifier(min_samples_split=5, random_state=10)
rf2.fit(X_train, y_train)
proba2 = rf2.predict_proba(X_test)
roc_auc_score(y_test, proba2[:,1])

>>> 0.931436154565479

✅ 하이퍼 파라미터 수정 - min_samples_split=7 로 적용했을 때

rf2 = RandomForestClassifier(min_samples_split=7, random_state=10)
rf2.fit(X_train, y_train)
proba2 = rf2.predict_proba(X_test)
roc_auc_score(y_test, proba2[:,1])

>>> 0.9312578210627522

✅ 튜닝 후 정확도 비교

✅ 하이퍼 파라미터 최적의 값 적용

rf2 = RandomForestClassifier(min_samples_split=5,max_depth=30, random_state=10)
rf2.fit(X_train, y_train)
proba2 = rf2.predict_proba(X_test)
roc_auc_score(y_test, proba2[:,1])
>>> 0.9320878540030826

1. 하이퍼 파라미터에 넣는 속성을 콜라보했을 때 무조건 성능이 좋아지지 않는다.
2. 하이퍼 파라미터의 주로 사용되어지는 하이퍼 파라미터의 주어지는 값의 범위를 알고있을 필요가 있다.

5. 하이퍼 파라미터 최적의 값을 찾는 방법

GridSearchCV: 원하는 모든 하이퍼 파라미터를 적용하여 최적의 값을 찾음

from sklearn.model_selection import GridSearchCV

params = {
    'max_depth': [None, 10, 30, 50],  # 범위값을 지정
    'min_samples_split': [2, 3, 5, 7, 20]
          }
          
# 객체 생성
rf3 = RandomForestClassifier(random_state=10)

# 최적의 하이퍼 파라미터를 찾는 객체 생성
grid_df = GridSearchCV(rf3, params, cv=5)  # cv: 데이터 교차검증 수

# 학습
grid_df.fit(X_train, y_train)

# best_params_: 최적의 성능을 만들 하이퍼 파라미터를 알려줌(params로 설정해놓은 범위 내에서)
grid_df.best_params_ 

# 최적의 하이퍼 파라미터 적용
rf3 = RandomForestClassifier(min_samples_split=3,max_depth=30, random_state=10)
rf3.fit(X_train, y_train)
proba3 = rf3.predict_proba(X_test)
roc_auc_score(y_test, proba3[:,1])
>>> 0.9322237694686447

6. 피쳐 중요도

Decision Tree에서 노드가 분기할 때 해당 피쳐가 클래스를 나누는데 얼마나 영향을 미쳤는지 표기하는 척도
0 이면 클래스를 구분하는데 해당 피쳐가 선택되지 않았다는 것이며, 1이면 해당 피쳐가 클래스를 완벽하게 나누었다는 것을 의미
normalized ndarray 를 리턴시켜줌(0 ~ 1의 수, 합이 1이 됨)
이 프로젝트에서 중요도를 알려주는 지표로 사용될 수 있음

proba3 = rf3.predict_proba(X_test)
proba3

>>> array([[0.99      , 0.01      ],
       [0.98278521, 0.01721479],
       [0.6161123 , 0.3838877 ],
       ...,
       [1.        , 0.        ],
       [0.92583825, 0.07416175],
       [0.9605    , 0.0395    ]])
       
# roc_auc_score
roc_auc_score(y_test, proba3[:,1])
>>> 0.9322237694686447


# 첫번째 feature는 0.1223~만큼의 가중치를 가짐(중요도가 12% 정도)
rf3.feature_importances_

✅ 피쳐중요도를 데이터프레임으로 만들기

feat_imp = pd.DataFrame({
    'features': X_train.columns,
    'importances':rf3.feature_importances_
})

feat_imp

✅ 가장 영향력 있는 feature 10개를 뽑아보기

top10 = feat_imp.sort_values('importances', ascending=False).head(10)
top10

✅ top10 그래프로 그려보기

plt.figure(figsize=(5,10))
sns.barplot(x='importances', y='features', data=top10)

728x90

LIST

'Python > ML&DL' 카테고리의 다른 글

[파이썬, Python] 머신러닝 - 7️⃣ KMeans (0)	2023.06.19
[파이썬, Python] 머신러닝 - 6️⃣ LightGBM (0)	2023.06.18
[파이썬, Python] 머신러닝 - 4️⃣ 서포트 벡터 머신(Support Vector Machine) (0)	2023.06.18
[파이썬, Python] 머신러닝 - 3️⃣ 로지스틱 회귀(Logistic Regression) (0)	2023.06.15
[파이썬, Python] 머신러닝 - 2️⃣ 의사결정나무(Decision Tree) (2)	2023.06.14

코딩하는 춘식이

[파이썬, Python] 머신러닝 - 5️⃣ 랜덤 포레스트(Random Forest)

1. hotel 데이터셋 살펴보기

2. 랜덤 포레스트(Random Forest)

2-1. 앙상블(Ensemble) 모델

3. ROC Curve

4. AUC(Area Under th ROC Curve)

5. 하이퍼 파라미터 최적의 값을 찾는 방법

6. 피쳐 중요도

'Python > ML&DL' 카테고리의 다른 글

티스토리툴바

[파이썬, Python] 머신러닝 - 5️⃣ 랜덤 포레스트(Random Forest)

1. hotel 데이터셋 살펴보기

2. 랜덤 포레스트(Random Forest)

2-1. 앙상블(Ensemble) 모델

3. ROC Curve

4. AUC(Area Under th ROC Curve)

5. 하이퍼 파라미터 최적의 값을 찾는 방법

6. 피쳐 중요도

'Python > ML&DL' 카테고리의 다른 글

관련글

티스토리툴바