A Summary of Basic Data Science Interview Questions
Source: Deephub Imba
This article covers how to prepare for a successful data science interview, along with some resources that can help.
Coding Fundamentals
SQL:
- Selecting specific columns from a table
- Joining two tables (inner, left, right, and outer joins)
- Aggregating results (sum, average, max, min)
- Using window functions
- Date handling
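The SQL items above can be demonstrated end to end; below is a sketch that runs a join, an aggregation, and a window function through Python's built-in sqlite3 module (the customers/orders tables are made up for illustration, and window functions require SQLite 3.25 or newer):

import sqlite3

# In-memory database with two small hypothetical tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     amount REAL, order_date TEXT);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (1, 1, 10.0, '2022-01-05'),
                          (2, 1, 25.0, '2022-02-10'),
                          (3, 2, 40.0, '2022-02-11');
""")

# Inner join + aggregation: total and average order amount per customer
agg = """
SELECT c.name, SUM(o.amount) AS total, AVG(o.amount) AS average
FROM customers AS c
INNER JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name;
"""
for row in conn.execute(agg):
    print(row)

# Left join + window function: running total per customer by date
window = """
SELECT c.name, o.order_date, o.amount,
       SUM(o.amount) OVER (PARTITION BY c.id ORDER BY o.order_date)
           AS running_total
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
ORDER BY c.name, o.order_date;
"""
for row in conn.execute(window):
    print(row)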
Python (pandas):
- Working with DataFrames, e.g. reading, joining, merging, filtering
- Manipulating and formatting dates
- Manipulating strings, e.g. with regular expressions, searching for what a string contains
- Using loops effectively
- Working with lists and dictionaries
- Writing functions and classes in Python
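A compact pandas sketch touching most of these operations (the data and column names are made up for illustration):

import pandas as pd

# Hypothetical data; in practice these might come from pd.read_csv(...)
users = pd.DataFrame({"user_id": [1, 2], "name": ["Ann", "Bob2"]})
orders = pd.DataFrame({
    "user_id": [1, 1, 2],
    "amount": [120.0, 30.0, 80.0],
    "order_date": ["2022-01-05", "2022-02-10", "2022-02-11"],
})

# Merge (join) the two DataFrames, then filter rows
df = users.merge(orders, on="user_id", how="left")
big_orders = df[df["amount"] > 100]

# Parse and format dates
df["order_date"] = pd.to_datetime(df["order_date"])
df["order_month"] = df["order_date"].dt.strftime("%Y-%m")

# Regular-expression string search: names containing a digit
with_digit = df[df["name"].str.contains(r"\d", regex=True)]

# Dictionary comprehension instead of an explicit loop
totals = {name: grp["amount"].sum() for name, grp in df.groupby("name")}
print(totals)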
Understanding Data Structures and Algorithms
- Big O notation
- Binary search
- Arrays and linked lists
- Selection sort
- Quicksort
- Bubble sort
- Merge sort
- Hash tables
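For instance, binary search, the canonical O(log n) algorithm on this list, is a frequent whiteboard question; a minimal Python version over a sorted list:

def binary_search(arr, target):
    """Return the index of target in sorted list arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # midpoint of the current range
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1              # discard the left half
        else:
            hi = mid - 1              # discard the right half
    return -1

print(binary_search([1, 3, 5, 7, 9], 7))  # 3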
Linear Regression
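A minimal scikit-learn linear regression sketch on synthetic data (all names here are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

# Generate data from y = 3x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # approximately [3.0] and 1.0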
Logistic Regression
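A matching logistic regression sketch on a synthetic classification problem; note that predict_proba returns class probabilities, not just labels:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))        # accuracy on held-out data
print(clf.predict_proba(X_test[:3]))    # per-class probabilities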
Clustering
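A minimal clustering sketch with k-means (synthetic blobs; the number of clusters is assumed known here):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # learned centroids
print(km.labels_[:10])       # cluster assignment per point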
Random Forests and Boosted Trees
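A sketch contrasting the two ensemble styles on synthetic data: bagging (random forest, trees trained independently on bootstrap samples) versus boosting (trees trained sequentially, each correcting the previous ones):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: independent trees, predictions averaged
rf = RandomForestClassifier(n_estimators=200, random_state=0)
# Boosting: sequential trees fit to the remaining errors
gb = GradientBoostingClassifier(n_estimators=200, random_state=0)

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())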
Autoencoders
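A minimal autoencoder sketch, assuming TensorFlow/Keras is available (layer sizes are arbitrary; the model learns to reconstruct its own input through a narrow bottleneck):

import numpy as np
import tensorflow as tf

# Toy data: 64-dimensional inputs compressed to an 8-dimensional code
X = np.random.rand(1000, 64).astype("float32")

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(8, activation="relu"),     # bottleneck (the code)
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(64, activation="sigmoid"), # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")

# The target is the input itself
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)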
Gradient Descent
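Gradient descent is easy to demonstrate from scratch; a NumPy sketch minimizing mean squared error for a one-variable linear fit:

import numpy as np

# Data from y = 2x + 0.5 plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=200)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    y_hat = w * x + b
    grad_w = 2 * np.mean((y_hat - y) * x)  # dL/dw of the MSE loss
    grad_b = 2 * np.mean(y_hat - y)        # dL/db
    w -= lr * grad_w                       # step against the gradient
    b -= lr * grad_b

print(w, b)  # approximately 2.0 and 0.5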
One-Hot Encoding vs. Label Encoding
Label encoding maps each category to an integer, while one-hot encoding expands a categorical column into one binary column per category.

Label Encoding
# Import the label encoder
from sklearn import preprocessing

# A LabelEncoder maps each distinct label to an integer
label_encoder = preprocessing.LabelEncoder()

# Encode the labels in column 'Country' (data is a pandas DataFrame)
data['Country'] = label_encoder.fit_transform(data['Country'])
print(data.head())
One-Hot Encoding

# Import the one-hot encoder
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create the one-hot encoder object
onehotencoder = OneHotEncoder()

# Reshape the 1-D country array to 2-D, since fit_transform expects 2-D input
X = onehotencoder.fit_transform(data.Country.values.reshape(-1, 1)).toarray()

# Add the encoded columns back into the original DataFrame
# (range over X.shape[1], the number of one-hot columns)
dfOneHot = pd.DataFrame(
    X, columns=["Country_" + str(int(i)) for i in range(X.shape[1])])
df = pd.concat([data, dfOneHot], axis=1)

# Drop the original country column
df = df.drop(['Country'], axis=1)

# Print to verify
print(df.head())
Rules of thumb for the variance inflation factor (VIF):
- VIF = 1: very little multicollinearity
- VIF < 5: moderate multicollinearity
- VIF > 5: extreme multicollinearity (this is what we must avoid)
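VIF can be computed per feature with statsmodels; a sketch on a made-up feature matrix where x3 is nearly a copy of x1, so both should be flagged:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = rng.normal(size=100)
df["x3"] = df["x1"] + rng.normal(scale=0.05, size=100)  # near-duplicate of x1

X = add_constant(df)  # VIF assumes an intercept column is present
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # expect large values for x1 and x3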
Hyperparameter Tuning
import numpy as np
from pprint import pprint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
# ('auto' was removed in newer scikit-learn; use 1.0 or 'sqrt' there)
max_features = ['auto', 'sqrt']
# Maximum number of levels in each tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)
# {'bootstrap': [True, False],
#  'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
#  'max_features': ['auto', 'sqrt'],
#  'min_samples_leaf': [1, 2, 4],
#  'min_samples_split': [2, 5, 10],
#  'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

# Use the random grid to search for the best hyperparameters.
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3-fold cross-validation,
# searching across 100 different combinations and all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=3, verbose=2, random_state=42,
                               n_jobs=-1)
# Fit the random search model
# (train_features and train_labels come from your training split)
rf_random.fit(train_features, train_labels)

rf_random.best_params_
# {'bootstrap': True,
#  'max_depth': 70,
#  'max_features': 'auto',
#  'min_samples_leaf': 4,
#  'min_samples_split': 10,
#  'n_estimators': 400}
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of the random search
param_grid = {'bootstrap': [True],
              'max_depth': [80, 90, 100, 110],
              'max_features': [2, 3],
              'min_samples_leaf': [3, 4, 5],
              'min_samples_split': [8, 10, 12],
              'n_estimators': [100, 200, 300, 1000]}

# Create a base model
rf = RandomForestRegressor()

# Instantiate the grid search model and fit it on the training data
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)
grid_search.fit(train_features, train_labels)
Precision and Recall
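Precision = TP / (TP + FP) measures how many predicted positives are real; recall = TP / (TP + FN) measures how many real positives are found. A quick check with scikit-learn on toy labels:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Here TP = 3, FP = 1, FN = 1
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75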
Loss Functions
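Two of the most common losses, mean squared error for regression and binary cross-entropy for classification, written out in plain NumPy as a sketch:

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of squared residuals
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Clip probabilities away from 0 and 1 to keep the logs finite
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])
print(mse(y, p), binary_cross_entropy(y, p))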
Summary