Decision tree algorithm -> the tree depth (max_depth) needs to be chosen
Decision tree learning - Korean Wikipedia
Decision tree learning uses a decision tree as a predictive model that links observations about an item to its target value. In statistics and data mining, ...
ko.wikipedia.org
1. Load Titanic Datasets
In [1]:
import pandas as pd
train = pd.read_csv('data/titanic/train.csv', index_col='PassengerId')
print(train.shape)
print(train.info())
train.head()
Out[1]:
In [2]:
test = pd.read_csv('data/titanic/test.csv', index_col='PassengerId')
print(test.shape)
print(test.info())
test.head()
Out[2]:
In [3]:
# Count null values in each column
train.isnull().sum()
Out[3]:
In [4]:
test.isnull().sum()
Out[4]:
2. Data Preprocessing
- Convert string data to numbers
- One Hot Encoding
- Handle null data
2.1 Encoding the Sex column
- 'male' => 0, 'female' => 1
In [5]:
train['Sex'].unique()
Out[5]:
In [6]:
train['Sex'].value_counts()
Out[6]:
In [7]:
# Replace the values in the Sex column (train)
train.loc[train['Sex'] == 'male', 'Sex'] = 0
train.loc[train['Sex'] == 'female', 'Sex'] = 1
train['Sex'].unique()
Out[7]:
In [8]:
train.head(2)
Out[8]:
In [9]:
# Replace the values in the Sex column (test)
test.loc[test['Sex'] == 'male', 'Sex'] = 0
test.loc[test['Sex'] == 'female', 'Sex'] = 1
test['Sex'].unique()
Out[9]:
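The same encoding can also be written as a single `map` call. This is a minimal sketch on a small made-up DataFrame (`demo` and `sex_codes` are illustrative names, not part of the notebook). Note that `map` turns any value missing from the dictionary into NaN, so the final `astype` only succeeds when every value is covered.

```python
import pandas as pd

# Hypothetical stand-in for the Sex column of train/test
demo = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Same mapping as above: 'male' => 0, 'female' => 1
sex_codes = {'male': 0, 'female': 1}

# map() replaces each string with its code; astype gives a clean integer dtype
demo['Sex'] = demo['Sex'].map(sex_codes).astype('int64')
print(demo['Sex'].unique())
```

Unlike the `.loc` version, running this cell a second time would fail, because the column no longer contains the original strings.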
2.2 Handling null data in the Fare column
In [10]:
test.loc[test['Fare'].isnull(),'Fare'] = 0
test.loc[test['Fare'].isnull()]
Out[10]:
2.3 Handling the Embarked column
- One Hot Encoding
- C=0, S=1, Q=2 (wrong: integer codes imply an ordering among the ports that does not exist)
- C=[True,False,False], S=[False,True,False], Q=[False,False,True] (right)
- Add three columns: Embarked_C, Embarked_S, Embarked_Q
In [11]:
train['Embarked'].value_counts()
Out[11]:
In [12]:
train['Embarked_C'] = train['Embarked'] == 'C'
train['Embarked_S'] = train['Embarked'] == 'S'
train['Embarked_Q'] = train['Embarked'] == 'Q'
print(train.shape)
train[['Embarked', 'Embarked_C', 'Embarked_S', 'Embarked_Q']].head()
train[['Embarked', 'Embarked_C', 'Embarked_S', 'Embarked_Q']].tail()
Out[12]:
In [13]:
test['Embarked_C'] = test['Embarked'] == 'C'
test['Embarked_S'] = test['Embarked'] == 'S'
test['Embarked_Q'] = test['Embarked'] == 'Q'
print(test.shape)
test[['Embarked', 'Embarked_C', 'Embarked_S', 'Embarked_Q']].head()
test[['Embarked', 'Embarked_C', 'Embarked_S', 'Embarked_Q']].tail()
Out[13]:
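pandas can also build these columns with `pd.get_dummies`. The sketch below uses two small made-up DataFrames (`demo_train` and `demo_test` are illustrative names) and assumes the test data might lack one of the ports, so the test columns are aligned to the training columns with `reindex`.

```python
import pandas as pd

# Hypothetical data: the train set has one missing value, the test set has no 'Q'
demo_train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S', None]})
demo_test = pd.DataFrame({'Embarked': ['S', 'C', 'S']})

# One boolean column per category; a missing value becomes all-False
train_onehot = pd.get_dummies(demo_train['Embarked'], prefix='Embarked', dtype=bool)
test_onehot = pd.get_dummies(demo_test['Embarked'], prefix='Embarked', dtype=bool)

# Give the test set exactly the same columns as the train set
test_onehot = test_onehot.reindex(columns=train_onehot.columns, fill_value=False)

print(train_onehot)
print(test_onehot)
```

The manual comparisons used above (`train['Embarked'] == 'C'`) always produce all three columns, so they do not need this alignment step.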
2.4 Handling the Age column
- Fill null values with the mean of all ages.
In [14]:
mean_age = train['Age'].mean()
mean_age
Out[14]:
In [15]:
# Fill rows where Age is null with the mean age
train.loc[train['Age'].isnull(),'Age'] = mean_age
train.info()
In [16]:
test_mean_age = test['Age'].mean()
test.loc[test['Age'].isnull(),'Age'] = test_mean_age
test.info()
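The cells above fill each dataset with its own mean. A common alternative, shown here only as a sketch on made-up data (`demo_train`, `demo_test`, and `train_mean_age` are illustrative names), is to compute the mean once on the training data and reuse it for the test data, so both sets go through the same transformation.

```python
import pandas as pd

# Hypothetical Age columns with missing values
demo_train = pd.DataFrame({'Age': [20.0, None, 40.0]})
demo_test = pd.DataFrame({'Age': [None, 10.0]})

# Mean computed from the training data only (NaN is skipped by default)
train_mean_age = demo_train['Age'].mean()

# fillna replaces only the missing entries
demo_train['Age'] = demo_train['Age'].fillna(train_mean_age)
demo_test['Age'] = demo_test['Age'].fillna(train_mean_age)

print(demo_train['Age'].tolist())
print(demo_test['Age'].tolist())
```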
3. Data Visualization
- countplot - bar chart of counts; only one of the x-axis or y-axis is set.
- barplot - bar chart; both the x-axis and y-axis are set.
- pointplot - line plot
- distplot - histogram (distribution plot)
- lmplot - scatter plot with a fitted regression line
In [17]:
%matplotlib inline
import seaborn as sns
In [18]:
# countplot of the Embarked column
sns.countplot(data=train, x='Embarked')
Out[18]:
In [19]:
# Relationship between survival and the Embarked column
sns.countplot(data=train, x='Embarked', hue='Survived')
Out[19]:
In [20]:
sns.countplot(data=train, x='Pclass')
Out[20]:
In [21]:
sns.countplot(data=train, x='Pclass', hue='Survived')
Out[21]:
In [22]:
sns.countplot(data=train, x='Sex')
Out[22]:
In [23]:
sns.countplot(data=train, x='Sex', hue='Survived')
Out[23]:
In [24]:
import warnings
warnings.filterwarnings(action='ignore')
In [25]:
# Relationship between Pclass and Fare
sns.barplot(data=train, x='Pclass', y='Fare')
Out[25]:
In [26]:
sns.barplot(data=train, x='Pclass', y='Fare', hue='Survived')
Out[26]:
In [27]:
sns.pointplot(data=train, x='Pclass', y='Fare', hue='Survived')
Out[27]:
In [28]:
sns.distplot(train['Age'], hist=True)
Out[28]:
In [29]:
sns.distplot(train['Fare'], hist=False)
Out[29]:
In [30]:
# Select the rows where Fare is below $100
low_fare = train.loc[train['Fare'] < 100]
print(low_fare.shape)
sns.distplot(low_fare['Fare'], hist=False)
Out[30]:
In [31]:
sns.lmplot(data=train, x='Age', y='Fare', hue='Survived')
Out[31]:
In [32]:
sns.lmplot(data=low_fare, x='Age', y='Fare', hue='Survived')
Out[32]:
4. Train & Predict
- Feature Engineering
- Select the features (input data) the model will use
- Create X_train, y_train, X_test
- Decision Tree algorithm: use the DecisionTreeClassifier class
In [33]:
train.columns
Out[33]:
In [34]:
feature_names = ['Pclass', 'Sex', 'Fare', 'Embarked_C', 'Embarked_S', 'Embarked_Q']
feature_names
Out[34]:
In [35]:
# Create X_train
X_train = train[feature_names]
print(X_train.shape)
X_train.head()
Out[35]:
In [36]:
# Create X_test
X_test = test[feature_names]
print(X_test.shape)
X_test.head()
Out[36]:
In [37]:
# Create y_train
label_name = 'Survived'
y_train = train[label_name]
print(y_train.shape)
y_train.head()
Out[37]:
In [38]:
# Create the Decision Tree model object
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=5)
model
Out[38]:
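The value `max_depth=5` is chosen by hand here. One common way to compare candidate depths is cross-validation. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` rather than the Titanic features, and the candidate depths and all `demo_` names are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (not the Titanic data)
demo_X, demo_y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)

# Mean 5-fold accuracy for each candidate depth (None = grow until leaves are pure)
scores_by_depth = {}
for depth in [1, 3, 5, 8, None]:
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores_by_depth[depth] = cross_val_score(candidate, demo_X, demo_y, cv=5).mean()

for depth, score in scores_by_depth.items():
    print(f'max_depth={depth}: mean CV accuracy = {score:.3f}')

# Depth with the highest mean cross-validated accuracy
best_depth = max(scores_by_depth, key=scores_by_depth.get)
print('best depth:', best_depth)
```

A very shallow tree tends to underfit and an unlimited one tends to overfit, which is why the depth is worth tuning rather than leaving at its default.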
In [39]:
# Train the model
model.fit(X_train, y_train)
Out[39]:
In [40]:
!pip show graphviz
In [44]:
from sklearn.tree import export_graphviz
import graphviz
# Write the fitted tree to a DOT file, then read it back and render it
export_graphviz(model, feature_names=feature_names, class_names=['Perished','Survived'], out_file='decision-tree.dot')
with open('decision-tree.dot') as file:
    dot_graph = file.read()
graphviz.Source(dot_graph)
Out[44]:
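If graphviz is not installed, scikit-learn's `export_text` prints the learned rules as plain text instead. This is a small sketch on made-up data (the `demo_` names are illustrative and the labels are constructed to equal the Sex column, so the tree only needs one split).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features; the label below simply copies Sex
demo_X = pd.DataFrame({
    'Pclass': [1, 3, 3, 1, 2, 3, 1, 2],
    'Sex':    [1, 0, 1, 0, 1, 0, 1, 0],
})
demo_y = [1, 0, 1, 0, 1, 0, 1, 0]

demo_model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(demo_X, demo_y)

# Text rendering of the split rules, one line per node
rules = export_text(demo_model, feature_names=list(demo_X.columns))
print(rules)
```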
In [45]:
# Predict
predictions = model.predict(X_test)
print(predictions.shape)
predictions
Out[45]:
5. Submission
In [46]:
submit = pd.read_csv('data/titanic/gender_submission.csv', index_col='PassengerId')
print(submit.shape)
submit.head()
Out[46]:
In [48]:
submit['Survived'] = predictions
print(submit.shape)
submit.head()
Out[48]:
In [49]:
# Create the CSV file to submit
submit.to_csv('data/titanic/titanic01.csv')
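The submission frame can also be built directly from the predictions and the test index, without loading `gender_submission.csv` first. A minimal sketch with made-up values (`demo_test`, `demo_predictions`, and `demo_submit` are illustrative names); calling `to_csv()` without a path returns the CSV text, which makes it easy to inspect before writing a file.

```python
import numpy as np
import pandas as pd

# Hypothetical test frame indexed by PassengerId, plus matching predictions
demo_test = pd.DataFrame(
    {'Pclass': [3, 1, 2]},
    index=pd.Index([892, 893, 894], name='PassengerId'),
)
demo_predictions = np.array([0, 1, 0])

# One Survived value per passenger, keyed by the same index as the test data
demo_submit = pd.DataFrame({'Survived': demo_predictions}, index=demo_test.index)

# With no path argument, to_csv returns the CSV content as a string
csv_text = demo_submit.to_csv()
print(csv_text)
```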