Author: Arunn Thevapalan
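If you work in the AI industry with real-world data, you know the pain: no matter how streamlined the data collection process is, the data we end up modeling is always a mess.
According to Google Research, "Everyone wants to do the model work, not the data work", and I am guilty of this myself. That paper also introduces a phenomenon called data cascades: compounding events, originating from underlying data issues, that cause adverse downstream effects. In practice, the problem is three-fold:
(1) most data scientists do not enjoy cleaning and wrangling data; (2) only 20% of the time is spent on meaningful analysis; (3) data quality issues, if not treated early, cascade and affect downstream work.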

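The walkthrough covers five steps: load a messy dataset; analyze its data quality issues; dig deeper into the warnings raised; apply strategies to mitigate them; and check the final quality report on the semi-cleaned data.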
pip install ydata-quality

from ydata_quality import DataQuality
import pandas as pd

df = pd.read_csv('../datasets/transformed/census_10k.csv')
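Before handing the frame to any quality engine, a quick manual look can already hint at trouble. The sketch below is not part of ydata-quality: the helper name `quick_look` and the list of placeholder strings are assumptions made for illustration, and the toy frame stands in for the census data.

```python
import pandas as pd

# Hypothetical placeholder strings; adjust to whatever your data source uses.
PLACEHOLDERS = ["?", "N/A", "NA", "unknown", ""]

def quick_look(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing values, and placeholder strings per column."""
    # Compare stripped string representations so padded values like ' ? ' count too.
    as_text = df.astype(str).apply(lambda col: col.str.strip())
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "placeholders": as_text.isin(PLACEHOLDERS).sum(),
    })

# Toy stand-in for the real dataset.
toy = pd.DataFrame({
    "age": [39, 50, None, 28],
    "workclass": ["Private", "?", "Private", " ? "],
})
summary = quick_look(toy)
print(summary)
```

A summary like this only catches the obvious cases; it is meant as a first glance, not a substitute for a full quality analysis.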
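Analyzing data quality is a lengthy process, but the DataQuality engine does a good job of abstracting away the details. Simply create the main class and call its evaluate() method.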
# create the main class that holds all quality modules
dq = DataQuality(df=df)

# run the tests
results = dq.evaluate()
Warnings:
    TOTAL: 5 warning(s)
    Priority 1: 1 warning(s)
    Priority 2: 4 warning(s)

Priority 1 - heavy impact expected:
    * [DUPLICATES - DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns.
Priority 2 - usage allowed, limited human intelligibility:
    * [DATA RELATIONS - HIGH COLLINEARITY - NUMERICAL] Found 3 numerical variables with high Variance Inflation Factor (VIF>5.0). The variables listed in results are highly collinear with other variables in the dataset. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove the highest VIF variables.
    * [ERRONEOUS DATA - PREDEFINED ERRONEOUS DATA] Found 1960 ED values in the dataset.
    * [DATA RELATIONS - HIGH COLLINEARITY - CATEGORICAL] Found 10 categorical variables with significant collinearity (p-value < 0.05). The variables listed in results are highly collinear with other variables in the dataset and sorted descending according to propensity. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove variables following the provided order.
    * [DUPLICATES - EXACT DUPLICATES] Found 3 instances with exact duplicate feature values.
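Putting it all together, five warnings were identified, one of which is a high-priority issue. It was detected by the "Duplicates" module and means that an entire column is duplicated and needs fixing. To dig deeper into this issue, we use the get_warnings() method, filtering on the "Duplicate Columns" test.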
[QualityWarning(category='Duplicates', test='Duplicate Columns', description='Found 1 columns with exactly the same feature values as other columns.', priority=<Priority.P1: 1>, data={'workclass': ['workclass2']})]
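To make the warning concrete, here is a minimal sketch of how duplicated columns can be found with plain pandas. It is not the library's implementation; `find_duplicate_columns` is a hypothetical helper, and the toy frame only mimics the workclass/workclass2 situation reported above.

```python
import pandas as pd

def find_duplicate_columns(df: pd.DataFrame) -> dict:
    """Map each column to the later columns that hold identical values."""
    duplicates = {}
    already_matched = set()
    columns = list(df.columns)
    for i, first in enumerate(columns):
        if first in already_matched:
            continue  # already reported as a copy of an earlier column
        for second in columns[i + 1:]:
            # Series.equals treats NaN in the same position as equal.
            if df[first].equals(df[second]):
                duplicates.setdefault(first, []).append(second)
                already_matched.add(second)
    return duplicates

# Toy frame with one duplicated column (hypothetical values).
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", None],
    "age": [39, 50, 38],
    "workclass2": ["Private", "State-gov", None],
})
found = find_duplicate_columns(toy)
print(found)  # {'workclass': ['workclass2']}
```

The pairwise comparison is quadratic in the number of columns, which is fine for a table of this width; for very wide tables, hashing each column first would be a reasonable alternative.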
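A full picture of data quality requires several perspectives, hence the eight separate modules. Although they are all wrapped in the DataQuality class, some modules will not run unless we provide specific arguments.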
from ydata_quality.bias_fairness import BiasFairness

# create the main class that holds all quality modules
bf = BiasFairness(df=df, sensitive_features=['race', 'sex'], label='income')

# run the tests
bf_results = bf.evaluate()
Warnings:
    TOTAL: 2 warning(s)
    Priority 2: 2 warning(s)

Priority 2 - usage allowed, limited human intelligibility:
    * [BIAS&FAIRNESS - PROXY IDENTIFICATION] Found 1 feature pairs of correlation to sensitive attributes with values higher than defined threshold (0.5).
    * [BIAS&FAIRNESS - SENSITIVE ATTRIBUTE REPRESENTATIVITY] Found 2 values of 'race' sensitive attribute with low representativity in the dataset (below 1.00%).
bf.get_warnings(test='Proxy Identification')
[QualityWarning(category='Bias&Fairness', test='Proxy Identification', description='Found 1 feature pairs of correlation to sensitive attributes with values higher than defined threshold (0.5).', priority=<Priority.P2: 2>, data=features
relationship_sex    0.650656
Name: association, dtype: float64)]
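The warning reports an association of about 0.65 between relationship and sex, above the 0.5 threshold. The output does not say which association measure the library computes, so the sketch below uses Cramér's V, a common choice for two categorical variables, purely as an assumption. The function name and toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = none, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the marginals over n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return 0.0  # one of the series is constant, so no association to measure
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy data: 'relationship' fully reveals 'sex', 'occupation' does not.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "relationship": ["Husband", "Husband", "Wife", "Wife"],
    "occupation": ["Sales", "Tech", "Sales", "Tech"],
})
v_proxy = cramers_v(toy["relationship"], toy["sex"])
v_neutral = cramers_v(toy["occupation"], toy["sex"])
print(v_proxy, v_neutral)
```

A feature whose association with a sensitive attribute is high can act as a proxy for it, which is why dropping the sensitive column alone does not remove the bias.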
def improve_quality(df: pd.DataFrame):
    """Clean the data based on the Data Quality issues found previously."""
    # Bias & Fairness
    df = df.replace({'relationship': {'Husband': 'Married', 'Wife': 'Married'}})  # Substitute gender-based 'Husband'/'Wife' for generic 'Married'

    # Duplicates
    df = df.drop(columns=['workclass2'])  # Remove the duplicated column
    df = df.drop_duplicates()  # Remove exact feature value duplicates

    return df

clean_df = improve_quality(df.copy())
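Before re-running the quality engines, it can help to confirm what a cleaning step actually removed. The sketch below applies the same three operations to a toy frame and compares before and after; `summarize_cleanup` is a hypothetical helper, not part of ydata-quality.

```python
import pandas as pd

def summarize_cleanup(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Report what a cleaning step removed (hypothetical helper)."""
    return {
        "rows_removed": len(before) - len(after),
        "columns_removed": sorted(set(before.columns) - set(after.columns)),
        "duplicate_rows_left": int(after.duplicated().sum()),
    }

# Toy frame mimicking the issues found above.
toy = pd.DataFrame({
    "relationship": ["Husband", "Wife", "Husband"],
    "workclass": ["Private", "Private", "Private"],
    "workclass2": ["Private", "Private", "Private"],
})
cleaned = (
    toy.replace({"relationship": {"Husband": "Married", "Wife": "Married"}})
       .drop(columns=["workclass2"])
       .drop_duplicates()
)
report = summarize_cleanup(toy, cleaned)
print(report)
```

Note that in this toy frame, merging 'Husband' and 'Wife' into 'Married' makes rows identical that were distinct before, so replacing values before dropping duplicates (the order used above) catches them.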
*DataQuality Engine Report:*
Warnings:
    TOTAL: 3 warning(s)
    Priority 2: 3 warning(s)

Priority 2 - usage allowed, limited human intelligibility:
    * [ERRONEOUS DATA - PREDEFINED ERRONEOUS DATA] Found 1360 ED values in the dataset.
    * [DATA RELATIONS - HIGH COLLINEARITY - NUMERICAL] Found 3 numerical variables with high Variance Inflation Factor (VIF>5.0). The variables listed in results are highly collinear with other variables in the dataset. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove the highest VIF variables.
    * [DATA RELATIONS - HIGH COLLINEARITY - CATEGORICAL] Found 9 categorical variables with significant collinearity (p-value < 0.05). The variables listed in results are highly collinear with other variables in the dataset and sorted descending according to propensity. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove variables following the provided order.
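The numerical collinearity warning that remains is based on the Variance Inflation Factor. For a given column, VIF is 1 / (1 - R²), where R² comes from regressing that column on all the other numerical columns. The sketch below computes it with NumPy least squares; it illustrates the definition and is not the library's code. The function name and the synthetic data are assumptions, and constant columns are not handled.

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(df: pd.DataFrame) -> pd.Series:
    """VIF per numeric column: 1 / (1 - R^2) from regressing it on the others."""
    X = df.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(df.columns):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column so R^2 is measured around the mean.
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())  # assumes a non-constant column
        r2 = 1.0 - ss_res / ss_tot
        vifs[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic data: 'c' is almost a + b, 'd' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + b + 0.1 * rng.normal(size=200),
    "d": rng.normal(size=200),
})
vif = variance_inflation_factors(toy)
print(vif[vif > 5.0])  # columns above the VIF > 5.0 threshold quoted in the warning
```

Removing the column with the highest VIF and recomputing, one column at a time, is the usual way to act on such a warning, since dropping one collinear variable lowers the VIF of the others.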
*Bias & Fairness Report:*
Warnings:
    TOTAL: 1 warning(s)
    Priority 2: 1 warning(s)

Priority 2 - usage allowed, limited human intelligibility:
    * [BIAS&FAIRNESS - SENSITIVE ATTRIBUTE REPRESENTATIVITY] Found 2 values of 'race' sensitive attribute with low representativity in the dataset (below 1.00%).
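The remaining representativity warning cannot be fixed by cleaning alone, but it is easy to inspect by hand. A minimal sketch, assuming the check is simply "share of rows below 1%"; the helper name and the toy category labels are hypothetical.

```python
import pandas as pd

def low_representativity(series: pd.Series, threshold: float = 0.01) -> pd.Series:
    """Return the share of each category that falls below `threshold`."""
    shares = series.value_counts(normalize=True)
    return shares[shares < threshold]

# Toy sensitive attribute: two groups fall below the 1% threshold.
race = pd.Series(["group_a"] * 990 + ["group_b"] * 6 + ["group_c"] * 4, name="race")
rare = low_representativity(race)
print(rare)
```

Whether to collect more data for the under-represented groups, resample, or simply document the limitation depends on how the model will be used.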

A Data Scientist’s Guide to Identify and Resolve Data Quality Issues
