A Tutorial on Analyzing the COCO Pose Estimation Dataset with Python

Reposted from: AI算法与图像处理

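When training a pose estimation model, the commonly used datasets are public ones such as COCO, MPII and CrowdPose. Compared with the number of public datasets available for other computer vision tasks, such as object detection or classification, there are not many to choose from.

Pose estimation is a fairly complex problem, and building a suitable dataset for a neural network model is hard: every joint of every person in every image has to be located and labeled, which is tedious and time-consuming.

The most popular pose estimation dataset is COCO, which has about 80 object classes and about 250,000 person instances.

If you inspect some random images from this dataset, you may come across instances that are irrelevant to the problem you are trying to solve. Academia aims for the highest accuracy, but that is not always the goal in a real production environment.

In the real world, we may be more interested in a model that works well in a very specific setting, for example pedestrians, basketball players or a gym.

Let's look at this image from the COCO dataset:

Do you see the red dot? It is a keypoint: the nose.

Sometimes you may not want the network to see examples that contain only part of a head, especially at the bottom of the frame.

In this article, I will show you an example analysis of the COCO dataset.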

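The COCO dataset

COCO is a large-scale, general-purpose dataset for many computer vision tasks. 1.5 million object instances, 80 object categories and 250,000 people make it impressive. You can find more details on the source site, where you can also download all the required files: https://cocodataset.org/

The dataset consists of image files and annotation files. An annotation file is a JSON document containing all the metadata about a person (or some other category). In it we find the position and size of the bounding box, the area, the keypoints, the file name of the source image, and so on.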
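
To make the annotation layout concrete, here is a minimal sketch that parses one hand-written person annotation with the standard json module. The values are invented and the keypoint list is shortened to 3 keypoints (real COCO person annotations carry 17); only the field names follow the COCO format.

```python
import json

# Hand-written stand-in for one entry of the "annotations" list in
# person_keypoints_*.json. All numbers are invented for illustration.
raw = """
{
  "image_id": 42,
  "iscrowd": 0,
  "bbox": [100.0, 50.0, 80.0, 200.0],
  "area": 9000.0,
  "num_keypoints": 2,
  "keypoints": [140, 70, 2, 150, 65, 1, 0, 0, 0]
}
"""
ann = json.loads(raw)

# bbox is [x, y, width, height] in pixels
x, y, box_w, box_h = ann["bbox"]

# keypoints is a flat list of (x, y, v) triplets; v is the visibility flag:
# 0 = not labeled, 1 = labeled but not visible, 2 = labeled and visible
kps = ann["keypoints"]
triplets = [tuple(kps[i:i + 3]) for i in range(0, len(kps), 3)]
labeled = [t for t in triplets if t[2] > 0]

print(triplets)
print(len(labeled) == ann["num_keypoints"])
```

The num_keypoints field counts the labeled keypoints (v > 0), which is why the last comparison holds for this example.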

We do not have to parse the JSON by hand. There is a handy Python library for this, pycocotools (https://github.com/cocodataset/cocoapi/tree/master/PythonAPI)

We need train2017.zip (https://cocodataset.org/#download), val2017.zip (https://cocodataset.org/#download) and annotations_trainval2017.zip (https://cocodataset.org/#download)

Specifically, we only need the person annotations. annotations_trainval2017.zip contains the two files we want: person_keypoints_train2017.json and person_keypoints_val2017.json

I suggest placing the files in the following folder hierarchy:

dataset_coco
   |---annotations
         |---person_keypoints_train2017.json
         |---person_keypoints_val2017.json
   |---train2017
         |---*.jpg
   |---val2017
         |---*.jpg

The following code shows how to load the annotations:

from pycocotools.coco import COCO
...

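train_annot_path = 'dataset_coco/annotations/person_keypoints_train2017.json'
val_annot_path = 'dataset_coco/annotations/person_keypoints_val2017.json'
train_coco = COCO(train_annot_path) # load the annotations for the training set
val_coco = COCO(val_annot_path) # load the annotations for the validation set
...
# Generator that walks over every image in the dataset and yields the relevant data one image at a time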
def get_meta(coco):
    ids = list(coco.imgs.keys())
    for i, img_id in enumerate(ids):
        img_meta = coco.imgs[img_id]
        ann_ids = coco.getAnnIds(imgIds=img_id)
        # Basic image parameters
        img_file_name = img_meta['file_name']
        w = img_meta['width']
        h = img_meta['height']
        # Retrieve the metadata for all people in the current image
        anns = coco.loadAnns(ann_ids)

        yield [img_id, img_file_name, w, h, anns]

...

# Iterate over the images
for img_id, img_fname, w, h, meta in get_meta(train_coco):
    ...
    # Iterate over all annotations of the image
    for m in meta:
        # m is a dictionary
        keypoints = m['keypoints']
        ...
...

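First we load the COCO objects, which are wrappers around the JSON data (the two COCO(...) calls).

Inside get_meta, coco.imgs.keys() gives us all image identifiers.

In the next few lines we load the metadata for each image: a dictionary with general information such as the image width, height, file name and license.

The coco.getAnnIds and coco.loadAnns calls load the annotation metadata for the given image: a list of dictionaries, each representing one person.

The final loop shows how to iterate over the whole training set (train_coco). The validation set (val_coco) can be processed in the same way.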
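
If you want to try the generator without downloading the dataset, the sketch below runs the same logic against a tiny in-memory stand-in. FakeCoco is a hypothetical class written for this example; it mimics only the three pieces of the pycocotools COCO object that get_meta touches (imgs, getAnnIds, loadAnns), and all of its data is invented.

```python
class FakeCoco:
    """Hypothetical stand-in for pycocotools' COCO, for illustration only."""

    def __init__(self, images, annotations):
        self.imgs = {img["id"]: img for img in images}
        self._anns = {ann["id"]: ann for ann in annotations}

    def getAnnIds(self, imgIds):
        # Ids of all annotations that belong to the given image
        return [a["id"] for a in self._anns.values() if a["image_id"] == imgIds]

    def loadAnns(self, ids):
        return [self._anns[i] for i in ids]


def get_meta(coco):
    # Same logic as the generator in the article
    for img_id in coco.imgs.keys():
        img_meta = coco.imgs[img_id]
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
        yield [img_id, img_meta["file_name"], img_meta["width"],
               img_meta["height"], anns]


fake = FakeCoco(
    images=[{"id": 1, "file_name": "a.jpg", "width": 640, "height": 480},
            {"id": 2, "file_name": "b.jpg", "width": 500, "height": 375}],
    annotations=[{"id": 10, "image_id": 1, "keypoints": [0, 0, 0]},
                 {"id": 11, "image_id": 1, "keypoints": [5, 5, 2]}],
)

rows = list(get_meta(fake))
for img_id, fname, w, h, anns in rows:
    print(img_id, fname, w, h, len(anns))
```

Note that the second image yields an empty annotation list: an image without people still produces a row.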

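Converting COCO to a pandas dataframe

Let's convert the COCO metadata into a pandas dataframe. We use libraries such as matplotlib, sklearn and pandas.

This makes filtering, visualizing and manipulating the data much easier. In addition, we can export the data to csv, parquet and so on.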

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def convert_to_df(coco):
    images_data = []
    persons_data = []
    
    # Iterate over all images
    for img_id, img_fname, w, h, meta in get_meta(coco):
        images_data.append({
            'image_id': int(img_id),
            'path': img_fname,
            'width': int(w),
            'height': int(h)
        })
        
        # Iterate over all person annotations of this image
        for m in meta:
            persons_data.append({
                'image_id': m['image_id'],
                'is_crowd': m['iscrowd'],
                'bbox': m['bbox'],
                'area': m['area'],
                'num_keypoints': m['num_keypoints'],
                'keypoints': m['keypoints'],
            })
            
    # Create the dataframe with the image paths
    images_df = pd.DataFrame(images_data)
    images_df.set_index('image_id', inplace=True)
    
    # Create the dataframe with the person data
    persons_df = pd.DataFrame(persons_data)
    persons_df.set_index('image_id', inplace=True)
    return images_df, persons_df

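We use the get_meta function to build two dataframes: one for the image paths and one for the person metadata. An image may contain several people, so this is a one-to-many relationship.

In the next step we merge the two tables (an inner join on the image_id index, which is what pd.merge does by default) and combine the training and validation sets. We also add a new column, source, with value 0 for the training set and 1 for the validation set.

This information is necessary because we need to know in which folder to look for an image. As you know, the images live in two folders: train2017/ and val2017/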

images_df, persons_df = convert_to_df(train_coco)
train_coco_df = pd.merge(images_df, persons_df, right_index=True, left_index=True)
train_coco_df['source'] = 0

images_df, persons_df = convert_to_df(val_coco)
val_coco_df = pd.merge(images_df, persons_df, right_index=True, left_index=True)
val_coco_df['source'] = 1

coco_df = pd.concat([train_coco_df, val_coco_df], ignore_index=True)

Finally, we have a single dataframe that represents the whole COCO dataset.
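
The behaviour of the merge is easy to check on toy data. The sketch below uses three invented images and three invented person rows. It shows that pd.merge on the index defaults to an inner join, so images without any person annotation drop out, while how='left' would keep them with NaN in the person columns.

```python
import pandas as pd

images_df = pd.DataFrame(
    {"image_id": [1, 2, 3], "path": ["a.jpg", "b.jpg", "c.jpg"]}
).set_index("image_id")

# Image 1 has two people, image 2 has one, image 3 has none
persons_df = pd.DataFrame(
    {"image_id": [1, 1, 2], "num_keypoints": [17, 0, 9]}
).set_index("image_id")

# Default how='inner': image 3 disappears because it has no person rows
inner_df = pd.merge(images_df, persons_df, left_index=True, right_index=True)

# how='left' keeps image 3, with NaN in the person columns
left_df = pd.merge(images_df, persons_df, left_index=True, right_index=True,
                   how="left")

inner_df["source"] = 0  # 0 marks the training set, as in the article
print(inner_df)
print(len(inner_df), len(left_df))
```

For this analysis the inner join is convenient, because every remaining row is guaranteed to carry a bounding box and a keypoint list.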

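How many people are in an image

Now we can run the first analysis.

COCO contains images with multiple people, and we want to know how many images contain just one person.

Here is the code:

# Counting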

annotated_persons_df = coco_df[coco_df['is_crowd'] == 0]
crowd_df = coco_df[coco_df['is_crowd'] == 1]

print("Number of people in total: " + str(len(annotated_persons_df)))
print("Number of crowd annotations: " + str(len(crowd_df)))

persons_in_img_df = pd.DataFrame({
    'cnt': annotated_persons_df['path'].value_counts()
})
persons_in_img_df.reset_index(level=0, inplace=True)
persons_in_img_df.rename(columns = {'index':'path'}, inplace = True)

# Group by cnt to get, for each number of annotated people, the number of images

persons_in_img_df = persons_in_img_df.groupby(['cnt']).count()

# Extract the arrays

x_occurences = persons_in_img_df.index.values
y_images = persons_in_img_df['path'].values

# Plot

plt.bar(x_occurences, y_images)
plt.title('People on a single image ')
plt.xticks(x_occurences, x_occurences)
plt.xlabel('Number of people in a single image')
plt.ylabel('Number of images')
plt.show()

The resulting chart:

As you can see, most COCO images contain a single person.
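
The counting step can also be written more compactly. The following sketch, on invented toy rows, first counts people per image and then counts images per people-count. It is an equivalent formulation, not the code used in the article.

```python
import pandas as pd

# Toy rows: one row per annotated person, 'path' identifies the image
toy_df = pd.DataFrame(
    {"path": ["a.jpg", "a.jpg", "b.jpg", "c.jpg", "d.jpg", "d.jpg", "d.jpg"]}
)

# Number of people in each image
people_per_image = toy_df.groupby("path").size()

# Number of images for each people-count, sorted by people-count
images_per_count = people_per_image.value_counts().sort_index()

print(images_per_count.to_dict())  # {1: 2, 2: 1, 3: 1}
```

The index of images_per_count plays the role of x_occurences and its values the role of y_images.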

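But there are also quite a few images with 13 people. Let's look at a few examples:

Well, there is even one image with 19 annotations (non-crowd):

Shouldn't the top area of this image be labeled as a crowd?

Yes, it should. Instead, we have multiple bounding boxes without any keypoints! Such annotations should be treated like crowds, which means they should be masked out.

In this image, only the 3 boxes in the middle have any keypoints.

Let's refine the query to get statistics on images containing people with and without keypoints, as well as the total number of people with and without keypoints: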

annotated_persons_nokp_df = coco_df[(coco_df['is_crowd'] == 0) & (coco_df['num_keypoints'] == 0)]
annotated_persons_kp_df = coco_df[(coco_df['is_crowd'] == 0) & (coco_df['num_keypoints'] > 0)]

print("Number of people (with keypoints) in total: " +
        str(len(annotated_persons_kp_df)))
print("Number of people without any keypoints in total: " +
        str(len(annotated_persons_nokp_df)))

persons_in_img_kp_df = pd.DataFrame({
    'cnt': annotated_persons_kp_df[['path','source']].value_counts()
})
persons_in_img_kp_df.reset_index(level=[0,1], inplace=True)
persons_in_img_cnt_df = persons_in_img_kp_df.groupby(['cnt']).count()
x_occurences_kp = persons_in_img_cnt_df.index.values
y_images_kp = persons_in_img_cnt_df['path'].values

f = plt.figure(figsize=(14, 8))
width = 0.4
plt.bar(x_occurences_kp, y_images_kp, width=width, label='with keypoints')
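# y_images was computed earlier from all non-crowd annotations, with or without keypoints
plt.bar(x_occurences + width, y_images, width=width, label='all non-crowd annotations')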

plt.title('People on a single image ')
plt.xticks(x_occurences + width/2, x_occurences)
plt.xlabel('Number of people in a single image')
plt.ylabel('Number of images')
plt.legend(loc = 'best')
plt.show()

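Now the difference is clearly visible.

The official COCO page says there are 250,000 people with keypoints, yet we found only 156,165 such examples.

They should probably drop the words "with keypoints".

Adding extra columns

Once COCO is converted into a pandas dataframe, it is easy to add extra columns computed from the existing ones.

I think it is best to extract all keypoint coordinates into separate columns. We can also add a column with a scale factor.

In particular, the scale of a person's bounding box is very useful information. For example, we may want to discard all people whose scale is too small, or apply an upscaling operation.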
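
As a small worked example of what "scale" means here, the sketch below computes the bounding box size relative to the image size. The helper bbox_scale, the sample numbers and the MIN_SCALE_Y threshold are all assumptions made for illustration.

```python
def bbox_scale(bbox, img_w, img_h):
    """Return (scale_x, scale_y): bbox width and height as fractions of the image."""
    _, _, box_w, box_h = bbox  # COCO bbox layout: [x, y, width, height]
    return box_w / img_w, box_h / img_h


# Invented example: a 160x120 box in a 640x480 image
scale_x, scale_y = bbox_scale([100.0, 50.0, 160.0, 120.0], img_w=640, img_h=480)
print(scale_x, scale_y)  # 0.25 0.25

# Hypothetical filter: keep only people at least 10% of the image height
MIN_SCALE_Y = 0.1
keep = scale_y >= MIN_SCALE_Y
print(keep)  # True
```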

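To do this, we use a transformer object from the Python library sklearn.

In general, sklearn transformers are powerful tools for cleaning, reducing, expanding and generating feature representations in data science models. We will use only a small part of the API.

Here is the code: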

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class AttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, num_keypoints, w_ix, h_ix, bbox_ix, kp_ix):
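        """
        :param num_keypoints: number of keypoints
        :param w_ix: index of the column holding the image width
        :param h_ix: index of the column holding the image height
        :param bbox_ix: index of the column holding the bounding box data
        :param kp_ix: index of the column holding the keypoint data
        """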
        self.num_keypoints = num_keypoints
        self.w_ix = w_ix
        self.h_ix = h_ix
        self.bbox_ix = bbox_ix
        self.kp_ix = kp_ix
        
    def fit(self, X, y=None):
        return self 
      
    def transform(self, X):
      
        # Retrieve the specific columns
        
        w = X[:, self.w_ix]
        h = X[:, self.h_ix]
        bbox = np.array(X[:, self.bbox_ix].tolist())  # to matrix
        keypoints = np.array(X[:, self.kp_ix].tolist()) # to matrix
        
        # Compute the scale factors of the bounding box
        
        scale_x = bbox[:,2] / w
        scale_y = bbox[:,3] / h
        aspect_ratio = w / h
        
        # Compute the scale category
        
        scale_cat = pd.cut(scale_y,
            bins=[0., 0.4, 0.6, 0.8, np.inf],
            labels=['S', 'M', 'L', 'XL'])
        
        return np.c_[X, scale_x, scale_y, scale_cat, aspect_ratio, keypoints]
      
      
# Transformer object used to add the new columns

attr_adder = AttributesAdder(num_keypoints=17, ...)
coco_extra_attribs = attr_adder.transform(coco_df.values)

# Build the list of new column names: x0, y0, v0, x1, y1, v1, ...

keypoints_cols = [['x'+str(idx), 'y'+str(idx), 'v'+str(idx)]
                        for idx in range(attr_adder.num_keypoints)]
keypoints_cols = np.concatenate(keypoints_cols).tolist()

# Create the new, richer dataframe

coco_extra_attribs_df = pd.DataFrame(
    coco_extra_attribs,
    columns=list(coco_df.columns) +
        ["scale_x", "scale_y", "scale_cat", "aspect_ratio"] +
        keypoints_cols,
    index=coco_df.index)

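In the pd.cut call we assign a scale category (S, M, L or XL) to every row. Because pd.cut closes each bin on the right by default, the rule is:

If scale_y is in (0, 0.4], the category is S
If scale_y is in (0.4, 0.6], the category is M
If scale_y is in (0.6, 0.8], the category is L
If scale_y is above 0.8, the category is XL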
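
The binning can be checked on a few hand-picked values. This sketch calls pd.cut with the same bins and labels as the transformer. The input values are invented, chosen to sit on and around the bin edges.

```python
import numpy as np
import pandas as pd

scale_y = np.array([0.1, 0.4, 0.41, 0.8, 0.95, 1.2])

scale_cat = pd.cut(scale_y,
                   bins=[0., 0.4, 0.6, 0.8, np.inf],
                   labels=['S', 'M', 'L', 'XL'])

# With the default right=True, each bin includes its right edge
print(list(scale_cat))  # ['S', 'S', 'M', 'L', 'XL', 'XL']
```

Two details follow from pd.cut's defaults: a value exactly on an edge (0.4, 0.8) falls into the lower category, and the last bin is open-ended, so a ratio above 1.0 still maps to XL.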

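In the return statement (np.c_[...]), we concatenate the original columns with the new ones.

The line that builds keypoints expands the keypoints into separate columns. Keypoint data in COCO is a flat list, [x0, y0, v0, x1, y1, ...]. We can turn this column into a matrix of shape [number of rows] x [number of keypoints * 3] and then return it without any extra effort (again in the return statement).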
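
The list-to-matrix step is plain NumPy. Below is a minimal sketch with two invented people and only 3 keypoints each (COCO uses 17). It also reshapes the matrix to (rows, keypoints, 3), which is not done in the article but makes individual keypoints easy to address.

```python
import numpy as np

# Each inner list is one person's flat keypoint list: x0, y0, v0, x1, y1, v1, ...
flat = [[10, 20, 2, 30, 40, 1, 0, 0, 0],
        [11, 21, 2, 0, 0, 0, 31, 41, 2]]

# Same conversion as in the transformer: list of lists -> 2D matrix
kp_matrix = np.array(flat)
print(kp_matrix.shape)  # (2, 9)

# Optional extra view: (rows, keypoints, 3)
kp_3d = kp_matrix.reshape(len(flat), -1, 3)

nose_xy = kp_3d[:, 0, :2]                           # x, y of keypoint 0
visible_counts = (kp_3d[:, :, 2] == 2).sum(axis=1)  # keypoints with v == 2

print(nose_xy.tolist())         # [[10, 20], [11, 21]]
print(visible_counts.tolist())  # [1, 2]
```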

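Finally, we create a new dataframe (the pd.DataFrame call at the end of the listing).

Where is the nose?

We examine the distribution of head positions in the images by taking the nose coordinates and plotting each one as a point in a normalized 2D chart.

The code that renders this chart:

# Normalize keypoint coordinates for horizontal (landscape) images

horiz_imgs_df = coco_extra_attribs_df[coco_extra_attribs_df['aspect_ratio'] >= 1.]

# Average width and height, used to scale the keypoint coordinates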

avg_w = int(horiz_imgs_df['width'].mean())
avg_h = int(horiz_imgs_df['height'].mean())

class NoseAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, avg_w, avg_h, w_ix, h_ix, x1_ix, y1_ix, v1_ix):
        self.avg_w = avg_w
        self.avg_h = avg_h
        self.w_ix = w_ix
        self.h_ix = h_ix
        self.x1_ix = x1_ix
        self.y1_ix = y1_ix
        self.v1_ix = v1_ix

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        w = X[:, self.w_ix]
        h = X[:, self.h_ix]
        x1 = X[:, self.x1_ix]
        y1 = X[:, self.y1_ix]

        # Normalize the nose coordinates to the average width and height

        scale_x = self.avg_w / w
        scale_y = self.avg_h / h
        nose_x = x1 * scale_x
        nose_y = y1 * scale_y

        return np.c_[X, nose_x, nose_y]

# Transformer object that adds the normalized nose coordinate columns

w_ix = horiz_imgs_df.columns.get_loc('width')
h_ix = horiz_imgs_df.columns.get_loc('height')
x1_ix = horiz_imgs_df.columns.get_loc('x0')  # the x coordinate of the nose is in column 'x0'
y1_ix = horiz_imgs_df.columns.get_loc('y0')  # the y coordinate of the nose is in column 'y0'
v1_ix = horiz_imgs_df.columns.get_loc('v0')  # visibility of the nose

attr_adder = NoseAttributesAdder(avg_w, avg_h, w_ix, h_ix, x1_ix, y1_ix, v1_ix)
coco_noses = attr_adder.transform(horiz_imgs_df.values)

# Create a new dataframe with the normalized data

coco_noses_df = pd.DataFrame(
    coco_noses,
    columns=list(horiz_imgs_df.columns) + ["normalized_nose_x", "normalized_nose_y"],
    index=horiz_imgs_df.index)

# Filter: keep only visible noses

coco_noses_df = coco_noses_df[coco_noses_df["v0"] == 2]

coco_noses_df.plot(kind="scatter", x="normalized_nose_x",
                   y="normalized_nose_y", alpha=0.3).invert_yaxis()

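As before, we use a transformer to add the new columns.

COCO contains images of different widths and heights, so we have to normalize the x, y nose coordinates in each image before we can plot the points representing noses in one output chart.

We first determine the average width and height of all horizontal images (avg_w and avg_h). Any values would do here, because they are only used to determine the scale factors.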
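
The normalization is simple arithmetic: each coordinate is multiplied by the ratio of a reference size to the image's own size. In this sketch, normalize_point is a hypothetical helper and the image sizes, reference size and nose positions are invented. Two noses at the centre of differently sized images end up at the same normalized point.

```python
def normalize_point(x, y, img_w, img_h, ref_w, ref_h):
    """Rescale a point from an img_w x img_h image to a ref_w x ref_h canvas."""
    return x * ref_w / img_w, y * ref_h / img_h


# Nose in the centre of a 640x480 image
p1 = normalize_point(320, 240, img_w=640, img_h=480, ref_w=600, ref_h=450)

# Nose in the centre of a 500x375 image
p2 = normalize_point(250, 187.5, img_w=500, img_h=375, ref_w=600, ref_h=450)

print(p1)  # (300.0, 225.0)
print(p2)  # (300.0, 225.0)
```

This is also why the choice of reference size does not matter for the shape of the scatter plot: it only rescales the axes.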

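With the get_loc calls we look up the indices of the required columns in the dataframe.

We then run the transformation (attr_adder.transform) and create a new dataframe containing the new columns normalized_nose_x and normalized_nose_y.

The last line draws the 2D chart.

Now we can inspect some images. For example, to look at images where the head is very close to the bottom edge, we filter the dataframe on the normalized_nose_y column: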

low_noses_df = coco_noses_df[coco_noses_df['normalized_nose_y'] > 430 ]
low_noses_df

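Here is an example image that satisfies this condition:

Number of keypoints

The number of bounding boxes with a given number of keypoints is another useful piece of information.

Why bounding boxes?

A bounding box has a special flag, iscrowd, that determines whether its content should be treated as a group (no keypoints) or as a single person (who should have keypoints). In general, iscrowd is set for bounding boxes containing many small instances of people, for example the audience at a tennis match.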

y_images = coco_extra_attribs_df['num_keypoints'].value_counts()
x_keypoints = y_images.index.values

# Plot

plt.figure(figsize=(10, 5))
plt.bar(x_keypoints, y_images)
plt.title('Histogram of keypoints')
plt.xticks(x_keypoints)
plt.xlabel('Number of keypoints')
plt.ylabel('Number of bboxes')
plt.show()

# Share of bboxes (column) for each number of keypoints (rows)

kp_df = pd.DataFrame({
    "Num keypoints %": coco_extra_attribs_df[
                           "num_keypoints"].value_counts() / len(coco_extra_attribs_df)
}).sort_index()

As you can see, showing the same information as a table is just as easy.
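
The same kind of table can be produced with value_counts(normalize=True), which divides by the number of rows for you. The sketch below uses an invented toy series and is an equivalent shortcut, not the code from the article.

```python
import pandas as pd

# Invented num_keypoints values for 8 bounding boxes
toy = pd.Series([0, 0, 17, 17, 17, 5, 0, 12], name="num_keypoints")

# Fraction of bounding boxes for each keypoint count, sorted by keypoint count
share = toy.value_counts(normalize=True).sort_index()
kp_table = pd.DataFrame({"Num keypoints %": share})

print(kp_table)
```

As in the original listing, the column holds fractions between 0 and 1 despite the "%" in its name. Multiply by 100 if you want percentages.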

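Scale

This is by far the most valuable metric.

Training a deep neural network for pose estimation is very sensitive to variation in the scale of the people in the samples. Providing a balanced dataset is critical; otherwise the model may become biased towards the dominant scale.

Remember the extra attribute scale_cat? Now we put it to good use.

Code: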

persons_df = coco_extra_attribs_df[coco_extra_attribs_df['num_keypoints'] > 0]
persons_df['scale_cat'].hist()

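This renders the following chart:

We can clearly see that COCO contains many small people, shorter than 40% of the total image height. Let's put this into a table: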

scales_props_df = pd.DataFrame({
    "Scales": persons_df["scale_cat"].value_counts() / len(persons_df)
})
scales_props_df

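Stratified sampling of the COCO dataset

First, stratified sampling means that when we split the whole dataset into training and validation sets (and so on), we want each subset to contain the same proportion of each specific group of data.

Suppose we have 1000 people, 57% male and 43% female. We cannot just pick random data for the training and validation sets, because one group could end up under-represented in one of the subsets. We have to sample proportionally from the 57% of men and the 43% of women.

In other words, stratified sampling preserves the 57% male / 43% female ratio in both the training and validation sets.

In the same way, we can check whether the ratios of the different scales are preserved across the COCO training and validation sets.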

persons_df = coco_extra_attribs_df[coco_extra_attribs_df['num_keypoints'] > 0]
train_df = persons_df[persons_df['source'] == 0]
val_df = persons_df[persons_df['source'] == 1]

scales_props_df = pd.DataFrame({
    "Scales in train set %": train_df["scale_cat"].value_counts() / len(train_df),
    "Scales in val set %": val_df["scale_cat"].value_counts() / len(val_df)
})
scales_props_df["Diff 100%"] = 100 * \
    np.absolute(scales_props_df["Scales in train set %"] -
                scales_props_df["Scales in val set %"])

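With train_df and val_df we split the dataframe into separate dataframes for the training and validation sets. This is equivalent to loading them separately from person_keypoints_train2017.json and person_keypoints_val2017.json.

Next, we create a new dataframe with the share of each scale group in the training and validation sets, plus a column with the difference between the two sets in percentage points.

The result:

As we can see, COCO is stratified very well: the scale groups differ only slightly (1-2%) between the training and validation sets.

Now let's check a different grouping: the number of keypoints in a bounding box.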

train_df = coco_extra_attribs_df[coco_extra_attribs_df['source'] == 0]
val_df = coco_extra_attribs_df[coco_extra_attribs_df['source'] == 1]

kp_props_df = pd.DataFrame({
    "Num keypoints in train set %": train_df["num_keypoints"].value_counts() / 
    len(train_df),
    "Num keypoints in val set %": val_df["num_keypoints"].value_counts() /
    len(val_df)
}).sort_index()

kp_props_df["Diff 100%"] = 100 * \
    np.absolute(kp_props_df["Num keypoints in train set %"] -
                kp_props_df["Num keypoints in val set %"])

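Similarly, the number of keypoints is distributed about equally across the COCO training and validation sets, which is great!

Now you can merge all your datasets (MPII, COCO) into one pool and do the split yourself. There is a nice sklearn class for this: StratifiedShuffleSplit.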
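
A minimal sketch of such a split is shown below, assuming the merged data is a dataframe with a scale_cat column to stratify on. The toy dataframe, the category proportions and the 80/20 split are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for a merged dataset: 1000 rows, each with a scale category
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "scale_cat": rng.choice(["S", "M", "L", "XL"], size=1000,
                            p=[0.5, 0.25, 0.15, 0.10]),
})

# One stratified 80/20 split on the scale category
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(toy_df, toy_df["scale_cat"]))
train_part = toy_df.iloc[train_idx]
val_part = toy_df.iloc[val_idx]

# Compare the category proportions in both parts
props = pd.DataFrame({
    "train": train_part["scale_cat"].value_counts(normalize=True),
    "val": val_part["scale_cat"].value_counts(normalize=True),
})
props["diff"] = (props["train"] - props["val"]).abs()
print(props)
```

The diff column should stay close to zero, which is the same check the article performs on the official COCO split.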

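Summary

This article analyzed the structure of the COCO dataset. Knowing what is in it can help you decide more wisely whether to augment or discard some irrelevant samples.

The analysis can be carried out in a Jupyter notebook.

We presented some more or less useful metrics from the COCO dataset, such as the distribution of people per image, the scale of people's bounding boxes and the location of certain body parts.

Finally, we described how to check the stratification of the validation set.

GitHub repository: https://github.com/michalfaber/dataset_toolkit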
