
Kaggle Course Problem Log

Problems encountered while working through Kaggle courses

Intermediate Machine Learning

Course link: Learn Intermediate Machine Learning Tutorials

Exercise: Categorical Variables

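Section link: Categorical Variables

Problem description:

After the model was set up, one-hot encoding produced different columns for the test dataset than for the train dataset, because the two datasets contain different values. As a result, the encoded data had different dimensions.

What I tried:

First I one-hot encoded the whole dataset and only then split it. I later realized that the train split itself was not the issue; what matters is the difference between the train dataset and the test dataset. So I went back to the two datasets and tried to find which columns contain different values, where "different values" means a different range of values (a different set of categories).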
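To make the mismatch concrete, here is a minimal sketch on invented toy data (not the Kaggle housing data). It assumes each dataset is encoded independently with `pd.get_dummies`, which is one way the problem can arise; the column values are made up for illustration.

```python
import pandas as pd

# Toy data invented for illustration: 'FV' appears only in train.
train = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})
test = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]})

# Encoding each dataset on its own creates one column per category seen in that dataset.
train_encoded = pd.get_dummies(train, columns=["MSZoning"])
test_encoded = pd.get_dummies(test, columns=["MSZoning"])

print(train_encoded.shape)  # (4, 3)
print(test_encoded.shape)   # (3, 2)

# Encoded columns that exist in train but not in test.
missing_in_test = sorted(set(train_encoded.columns) - set(test_encoded.columns))
print(missing_in_test)  # ['MSZoning_FV']
```

A model fitted on three encoded columns cannot score a test matrix that has only two, which matches the dimension error described here.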

import pandas as pd

# Load the competition data (paths as used in the Kaggle notebook)
X_test = pd.read_csv('../input/test.csv', index_col='Id')
X = pd.read_csv('../input/train.csv', index_col='Id')

# Train: drop columns with missing values, then collect the categorical columns
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
object_cols = [col for col in X.columns if X[col].dtype == "object"]
low_cardinality_cols = [col for col in object_cols if X[col].nunique() < 10]

# Test: same steps, applied to the test set on its own
cols_with_missing = [col for col in X_test.columns if X_test[col].isnull().any()]
X_test.drop(cols_with_missing, axis=1, inplace=True)
test_object_cols = [col for col in X_test.columns if X_test[col].dtype == "object"]
test_low_cardinality_cols = [col for col in test_object_cols if X_test[col].nunique() < 10]

# Low-cardinality categorical columns kept in train but not in test
different_cols = list(set(low_cardinality_cols)-set(test_low_cardinality_cols))
print(different_cols)

# All categorical columns kept in train but not in test
different_cols = list(set(object_cols)-set(test_object_cols))
print(different_cols)

Output:
['KitchenQual', 'Functional', 'Utilities', 'SaleType', 'MSZoning']
['KitchenQual', 'Exterior1st', 'Functional', 'Utilities', 'SaleType', 'MSZoning', 'Exterior2nd']

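That is, seven columns of the train dataset, ['KitchenQual', 'Exterior1st', 'Functional', 'Utilities', 'SaleType', 'MSZoning', 'Exterior2nd'], differ from the test dataset. Among the low-cardinality columns (fewer than 10 unique values), five differ: ['KitchenQual', 'Functional', 'Utilities', 'SaleType', 'MSZoning']. In other words, train has these columns and test does not, which is what causes trouble for one-hot encoding. Strictly speaking, the code compares column names after each dataset drops its own columns with missing values, so these are columns that survive in train but were dropped from test because they contain missing values there. It does not compare the sets of category values directly.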
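Comparing the actual value ranges column by column could look like the sketch below. The helper name `value_range_differences` and the toy data are hypothetical, introduced only for illustration; the idea is simply to take the set difference of the unique values in each direction.

```python
import pandas as pd

def value_range_differences(train, test, columns):
    """Hypothetical helper: for each column, list categories seen in only one dataset."""
    differences = {}
    for col in columns:
        train_values = set(train[col].dropna().unique())
        test_values = set(test[col].dropna().unique())
        if train_values != test_values:
            differences[col] = {
                "only_in_train": sorted(train_values - test_values),
                "only_in_test": sorted(test_values - train_values),
            }
    return differences

# Toy data invented for illustration (not the real housing data).
train = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex"],
    "Utilities": ["AllPub", "AllPub", "NoSeWa"],
    "Street": ["Pave", "Grvl", "Pave"],
})
test = pd.DataFrame({
    "KitchenQual": ["Gd", "Fa", None],
    "Utilities": ["AllPub", "AllPub", "AllPub"],
    "Street": ["Grvl", "Pave", "Pave"],
})

result = value_range_differences(train, test, ["KitchenQual", "Utilities", "Street"])
for col, diff in result.items():
    print(col, diff)
```

Missing values are dropped before the comparison, so a column is reported only when its non-missing categories differ between the two datasets.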

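Solution: not solved yet

Update: the next lesson simply preprocesses the test dataset directly... I had assumed the test dataset must not be touched. Once you are allowed to work on it, things get much simpler: take the columns shared by the test and train sets, and separate the categorical variables from the numerical ones.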

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if
                  X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
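Once the same columns are selected for every dataset, one common way to keep the encoded dimensions identical is to fit a scikit-learn `OneHotEncoder` on the training data only and reuse it for the other sets. The sketch below assumes toy data and `handle_unknown="ignore"`; it is an illustration, not necessarily the exact setup used in the course.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data invented for illustration: 'Oth' never appears in train.
X_train = pd.DataFrame({"SaleType": ["WD", "New", "COD", "WD"]})
X_test = pd.DataFrame({"SaleType": ["WD", "Oth", "New"]})

# Fit on train only; categories unseen during fit are encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(X_train).toarray()
test_encoded = encoder.transform(X_test).toarray()

print(train_encoded.shape)  # (4, 3)
print(test_encoded.shape)   # (3, 3)
print(test_encoded[1])      # all zeros for the unseen category 'Oth'
```

Because the encoder learns its categories from the training data, the test matrix always has the same number of columns as the training matrix.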

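Update: the Data Leakage lesson explains that leakage (train-test contamination) happens if you run preprocessing, such as fitting an imputer for missing values, before calling train_test_split... Each lesson ends up revising what the previous one taught.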
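A minimal sketch of the leakage-free ordering, assuming scikit-learn's `SimpleImputer` and invented toy data: split first, fit the imputer on the training part only, then apply the fitted imputer to the validation part.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature with missing values, invented for illustration.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan]])
y = np.array([0, 1, 0, 1, 0, 1])

# 1. Split BEFORE any preprocessing is fitted.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# 2. Fit the imputer on the training part only.
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)

# 3. Reuse the fitted imputer on the validation part (transform only, no fit).
X_valid_imputed = imputer.transform(X_valid)

# The fill value comes from the training rows alone.
print(imputer.statistics_)
```

Fitting the imputer on all of `X` before the split would let validation rows influence the fill value, which is the contamination the lesson warns about.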