Decision tree algorithm -> the tree depth (max_depth) has to be set
1. Load Titanic Datasets
In [1]:
import pandas as pd
train = pd.read_csv('data/titanic/train.csv', index_col='PassengerId')
print(train.shape)
print(train.info())
train.head()
Out[1]:
In [2]:
test = pd.read_csv('data/titanic/test.csv', index_col='PassengerId')
print(test.shape)
print(test.info())
test.head()
Out[2]:
In [3]:
# null data counting
train.isnull().sum()
Out[3]:
In [4]:
test.isnull().sum()
Out[4]:
2. Data Preprocessing
- Convert string data to numbers
- One Hot Encoding
- Handle null data
2.1 Encoding Sex
- 'male' => 0 , 'female' => 1
In [5]:
train['Sex'].unique()
Out[5]:
In [6]:
train['Sex'].value_counts()
Out[6]:
In [7]:
# Change the values of the Sex column (train)
train.loc[train['Sex'] == 'male', 'Sex'] = 0
train.loc[train['Sex'] == 'female', 'Sex'] = 1
train['Sex'].unique()
Out[7]:
In [8]:
train.head(2)
Out[8]:
In [9]:
# Change the values of the Sex column (test)
test.loc[test['Sex'] == 'male', 'Sex'] = 0
test.loc[test['Sex'] == 'female', 'Sex'] = 1
test['Sex'].unique()
Out[9]:
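The two `.loc` assignments above do the job, but the column typically keeps its `object` dtype, and a typo in a label would pass through unnoticed. A minimal sketch of an alternative using `Series.map` follows; the toy DataFrame and the names `toy`, `sex_codes`, and `encoded` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Map each label to its code in one step; a label missing from the dict becomes NaN.
sex_codes = {"male": 0, "female": 1}
encoded = toy["Sex"].map(sex_codes)

# Fail loudly if an unexpected label slipped through, then store plain integers.
assert encoded.notna().all(), "unmapped value in Sex"
toy["Sex"] = encoded.astype(int)

print(toy["Sex"].tolist())
```

The same dictionary can be applied to both train and test, which keeps the two encodings in sync.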
2.2 Handling null data in the Fare column
In [10]:
test.loc[test['Fare'].isnull(),'Fare'] = 0
test.loc[test['Fare'].isnull()]
Out[10]:
2.3 Handling the Embarked column
- One Hot Encoding
- C=0, S=1, Q=2 (X: avoid this, the codes would suggest an order among the ports)
- C=[True,False,False], S=[False,True,False], Q=[False,False,True] (O: use this)
- Add three columns: Embarked_C, Embarked_S, Embarked_Q
In [11]:
train['Embarked'].value_counts()
Out[11]:
In [12]:
train['Embarked_C'] = train['Embarked'] == 'C'
train['Embarked_S'] = train['Embarked'] == 'S'
train['Embarked_Q'] = train['Embarked'] == 'Q'
print(train.shape)
train[['Embarked', 'Embarked_C', 'Embarked_S', 'Embarked_Q']].head()
train[['Embarked', 'Embarked_C', 'Embarked_S', 'Embarked_Q']].tail()
Out[12]:
In [13]:
test['Embarked_C'] = test['Embarked'] == 'C'
test['Embarked_S'] = test['Embarked'] == 'S'
test['Embarked_Q'] = test['Embarked'] == 'Q'
print(test.shape)
test[['Embarked', 'Embarked_C', 'Embarked_S', 'Embarked_Q']].head()
test[['Embarked', 'Embarked_C', 'Embarked_S', 'Embarked_Q']].tail()
Out[13]:
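The three comparison columns above can also be produced in one call with `pd.get_dummies`. The sketch below runs on a made-up toy frame (the names `toy` and `dummies` are assumptions for illustration); note that `get_dummies` orders the new columns alphabetically, and a missing Embarked value ends up with all three columns off, which matches the comparison approach.

```python
import pandas as pd

# Toy frame standing in for the Titanic data; the last row has a missing port.
toy = pd.DataFrame({"Embarked": ["C", "S", "Q", "S", None]})

# One column per category, named Embarked_<value>.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# Attach the new columns next to the original one.
toy = pd.concat([toy, dummies], axis=1)

print(toy.columns.tolist())
print(toy.tail(2))
```

One caveat with this approach: if a category is absent from test but present in train, the two frames get different columns, so it may be worth aligning them afterwards, for example with `reindex(columns=..., fill_value=0)`.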
2.4 Handling the Age column
- Fill null values with the mean of all ages.
In [14]:
mean_age = train['Age'].mean()
mean_age
Out[14]:
In [15]:
# Replace null Age values with the mean age
train.loc[train['Age'].isnull(),'Age'] = mean_age
train.info()
In [16]:
test_mean_age = test['Age'].mean()
test.loc[test['Age'].isnull(),'Age'] = test_mean_age
test.info()
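The cells above fill train with the train mean and test with the test mean. A common alternative is to compute the statistic once on train and reuse it for test, so both sets go through exactly the same transformation. Here is a minimal sketch on made-up numbers; `toy_train`, `toy_test`, and `train_mean_age` are hypothetical names. (In this notebook Age is not among the features used for training later, so the choice does not affect the predictions.)

```python
import pandas as pd

# Toy frames standing in for train and test (ages are made up for illustration).
toy_train = pd.DataFrame({"Age": [20.0, 40.0, None]})
toy_test = pd.DataFrame({"Age": [None, 10.0]})

# mean() skips NaN, so this is the average of the known training ages.
train_mean_age = toy_train["Age"].mean()

# Fill both sets with the same training statistic.
toy_train["Age"] = toy_train["Age"].fillna(train_mean_age)
toy_test["Age"] = toy_test["Age"].fillna(train_mean_age)

print(toy_train["Age"].tolist())
print(toy_test["Age"].tolist())
```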
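3. Data Visualization
- countplot - bar chart; only one of the x or y axis can be set.
- barplot - bar chart; both the x and y axes can be set.
- pointplot - line chart
- distplot - histogram (distribution); deprecated since seaborn 0.11 in favor of histplot / displot
- lmplot - scatter plot (with a fitted regression line)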
In [17]:
%matplotlib inline
import seaborn as sns
In [18]:
# countplot of the Embarked column
sns.countplot(data=train, x='Embarked')
Out[18]:
In [19]:
# Relationship between survival and the Embarked column
sns.countplot(data=train, x='Embarked', hue='Survived')
Out[19]:
In [20]:
sns.countplot(data=train, x='Pclass')
Out[20]:
In [21]:
sns.countplot(data=train, x='Pclass', hue='Survived')
Out[21]:
In [22]:
sns.countplot(data=train, x='Sex')
Out[22]:
In [23]:
sns.countplot(data=train, x='Sex', hue='Survived')
Out[23]:
In [24]:
import warnings
warnings.filterwarnings(action='ignore')
In [25]:
# Relationship between Pclass and Fare
sns.barplot(data=train, x='Pclass', y='Fare')
Out[25]:
In [26]:
sns.barplot(data=train, x='Pclass', y='Fare', hue='Survived')
Out[26]:
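To read the charts above as numbers: a countplot with `hue` draws the row count for each category combination, and a barplot draws the mean of `y` per category by default. The sketch below computes the same quantities with pandas on a made-up toy frame; `toy`, `counts`, and `mean_fare` are names assumed for illustration.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# What countplot(x='Pclass', hue='Survived') counts: rows per (Pclass, Survived) pair.
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

# What barplot(x='Pclass', y='Fare') plots by default: the mean Fare per Pclass.
mean_fare = toy.groupby("Pclass")["Fare"].mean()

print(counts)
print(mean_fare)
```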
In [27]:
sns.pointplot(data=train, x='Pclass', y='Fare', hue='Survived')
Out[27]:
In [28]:
sns.distplot(train['Age'], hist=True)
Out[28]:
In [29]:
sns.distplot(train['Fare'], hist=False)
Out[29]:
In [30]:
# Extract the rows with Fare below $100
low_fare = train.loc[train['Fare'] < 100]
print(low_fare.shape)
sns.distplot(low_fare['Fare'], hist=False)
Out[30]:
In [31]:
sns.lmplot(data=train, x='Age', y='Fare', hue='Survived')
Out[31]:
In [32]:
sns.lmplot(data=low_fare, x='Age', y='Fare', hue='Survived')
Out[32]:
4. Train & Predict
- Feature Engineering
- Select the features (input data) the model will use
- Create X_train, y_train, X_test
- Decision Tree algorithm: use the DecisionTreeClassifier class
In [33]:
train.columns
Out[33]:
In [34]:
feature_names = ['Pclass', 'Sex', 'Fare', 'Embarked_C', 'Embarked_S', 'Embarked_Q']
feature_names
Out[34]:
In [35]:
# Create X_train
X_train = train[feature_names]
print(X_train.shape)
X_train.head()
Out[35]:
In [36]:
# Create X_test
X_test = test[feature_names]
print(X_test.shape)
X_test.head()
Out[36]:
In [37]:
# Create y_train
label_name = 'Survived'
y_train = train[label_name]
print(y_train.shape)
y_train.head()
Out[37]:
In [38]:
# Create the Decision Tree model object
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=5)
model
Out[38]:
In [39]:
# Train the model
model.fit(X_train, y_train)
Out[39]:
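The model above uses `max_depth=5`, and the note at the top says the depth has to be chosen. One way to pick it is to compare cross-validated accuracy across a few candidate depths. The sketch below does that on synthetic data from `make_classification`, not on the Titanic set; the candidate depths and the names `scores`, `model_d`, and `best_depth` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; on the real notebook this would be X_train, y_train.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

# Mean 5-fold accuracy for each candidate depth (None means an unlimited tree).
scores = {}
for depth in [1, 3, 5, 8, None]:
    model_d = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores[depth] = cross_val_score(model_d, X, y, cv=5).mean()

# Depth with the highest mean accuracy.
best_depth = max(scores, key=scores.get)

for depth, score in scores.items():
    print(depth, round(score, 3))
print("best:", best_depth)
```

A very shallow tree tends to underfit and an unlimited one tends to overfit, so the best score often sits somewhere in between.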
In [40]:
!pip show graphviz
In [44]:
from sklearn.tree import export_graphviz
import graphviz
export_graphviz(model, feature_names=feature_names, class_names=['Perished','Survived'], out_file='decision-tree.dot')
with open('decision-tree.dot') as file:
    dot_graph = file.read()
graphviz.Source(dot_graph)
Out[44]:
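If the graphviz package is not installed, scikit-learn's `export_text` prints the same split rules as plain text. A minimal sketch on a made-up toy dataset follows; `toy_X`, `toy_y`, `toy_model`, and `tree_text` are hypothetical names, and in the notebook the fitted `model` and `feature_names` would take their place.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data in which Sex alone separates the labels (made up for illustration).
toy_X = pd.DataFrame({"Pclass": [1, 3, 1, 3, 2, 2], "Sex": [0, 0, 1, 1, 0, 1]})
toy_y = [0, 0, 1, 1, 0, 1]

toy_model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy_X, toy_y)

# Render the learned rules as indented text, one line per split or leaf.
tree_text = export_text(toy_model, feature_names=list(toy_X.columns))
print(tree_text)
```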
In [45]:
# Predict
predictions = model.predict(X_test)
print(predictions.shape)
predictions
Out[45]:
5. Submission
In [46]:
submit = pd.read_csv('data/titanic/gender_submission.csv', index_col='PassengerId')
print(submit.shape)
submit.head()
Out[46]:
In [48]:
submit['Survived'] = predictions
print(submit.shape)
submit.head()
Out[48]:
In [49]:
# Create the CSV file to submit
submit.to_csv('data/titanic/titanic01.csv')
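The submission frame can also be built straight from the test index instead of loading gender_submission.csv, which makes the link between each PassengerId and its prediction explicit. A minimal sketch with made-up IDs and predictions follows; `toy_test`, `toy_predictions`, and `toy_submit` are assumed names, and the CSV is written to an in-memory buffer here rather than to a file.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-ins for the test frame and the model output (values are made up).
toy_test = pd.DataFrame(
    {"Pclass": [3, 1, 2]},
    index=pd.Index([892, 893, 894], name="PassengerId"),
)
toy_predictions = np.array([0, 1, 0])

# One prediction per test row, in the same order as the test index.
assert len(toy_predictions) == len(toy_test)
toy_submit = pd.DataFrame({"Survived": toy_predictions}, index=toy_test.index)

# Write to a buffer to inspect the CSV; pass a file path instead to save it.
buffer = io.StringIO()
toy_submit.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```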