Face Recognition on an Amazon Ring Doorbell with Python, OpenCV, and FaceNet

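As a new owner of an Amazon Ring doorbell, I like the features it offers. Still, I thought I could improve on it. What I wanted was a doorbell customized for the people who live in my house: one that can recognize who is at the door. Given how popular these doorbells are, I decided the best way to help other households was to let them customize theirs with as little effort as possible.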

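I built an application that tells you who is at your door; all it needs is the username and password of your Ring account. Knowing who is there without waiting for the video to load on your smartphone is convenient. It can also improve security, and it could even be connected to a system that opens the door automatically. In the age of deep learning, I think this kind of system belongs in every home. The diagram below illustrates how my system works.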

The complete code is available in this Git repository:

https://github.com/dude123studios/SmarterRingV2

The requirements are:

tensorflow==2.4.1
opencv-python==4.5.1.48
mtcnn==0.1.0
ring_doorbell==0.7.0
oauthlib~=3.1.0
numpy~=1.19.5
scipy~=1.6.1
scikit-learn==0.24.1
gtts==2.2.2
playsound~=1.2.2

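Let's break down what is happening. With your username and password supplied as environment variables, the Ring API can connect to your account. The API exposes the doorbell's features to Python. Its repository, with brief documentation, is here: https://github.com/tchellomello/python-ring-doorbell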

Here is a snippet from ring.py that sets up a connection to your doorbell:

import os
import json
from pathlib import Path
from ring_doorbell import Ring, Auth
from oauthlib.oauth2 import MissingTokenError

cache_file = Path("test_token.cache")


def token_updated(token):
    cache_file.write_text(json.dumps(token))


def otp_callback():
    auth_code = input("[INPUT] 2FA code: ")
    return auth_code


def main(download_only=False):
    if cache_file.is_file():
        auth = Auth("MyProject/1.0", json.loads(cache_file.read_text()), token_updated)
    else:
        username = os.environ.get('USERNAME')
        password = os.environ.get('PASSWORD')
        auth = Auth("MyProject/1.0", None, token_updated)
        try:
            auth.fetch_token(username, password)
        except MissingTokenError:
            auth.fetch_token(username, password, otp_callback())

    ring = Ring(auth)
    ring.update_data()

    wait_for_update(ring, download_only=download_only)
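
A note on the credentials read above: on Windows, the USERNAME environment variable is already set by the operating system to the logged-in user's name, so dedicated variable names may be safer. The sketch below assumes two hypothetical variables, RING_USERNAME and RING_PASSWORD (these names are not taken from the repository), and fails early with a clear message when either is missing.

```python
import os


def load_credentials(env=None):
    """Return (username, password) from the environment, or raise if missing.

    RING_USERNAME and RING_PASSWORD are hypothetical names chosen for this
    sketch; the snippet above reads USERNAME and PASSWORD instead.
    """
    env = os.environ if env is None else env
    missing = [key for key in ("RING_USERNAME", "RING_PASSWORD") if not env.get(key)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["RING_USERNAME"], env["RING_PASSWORD"]
```

Passing a plain dictionary as env makes the helper easy to try without touching the real environment.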

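The wait_for_update function runs continuously, polling until it finds a new entry in Ring's stored history. When that happens, it checks whether the doorbell was pressed and, if so, downloads the whole video to your device. To speed this up, shorten the video recording length in the Ring app on your smartphone.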

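When your doorbell rings, the latest video is downloaded to your computer. From it we grab several frames, so that the person's face is unobstructed in at least one of them. I define that function in utils.py; it is shown later. Here is another snippet from ring.py, which runs the main loop: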

import time

def wait_for_update(ring, download_only=False):
    last_id = -1
    start = time.time()
    while True:
        try:
            ring.update_data()
        except Exception:
            # Transient network or API error: wait and retry.
            time.sleep(1)
            continue
        doorbell = ring.devices()['authorized_doorbots'][0]
        # Newest 'ding' (button press) event, or None if there is none yet.
        latest = next(iter(doorbell.history(limit=20, kind='ding')), None)
        if latest is None:
            time.sleep(1)
            continue
        current_id = latest['id']
        if current_id != last_id:
            last_id = current_id
            print('[INFO] finished search in:', str(time.time() - start))
            start = time.time()
            if download_only:
                handle_video(ring, True)
                return
            handle = handle_video(ring)
            if handle:
                text_to_speech(handle)
            else:
                text_to_speech('The person at the door is not very clear')
        time.sleep(1)
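
The core of this loop is change detection: remember the id of the last handled event and act only when the newest id differs. Below is a minimal sketch of that idea, separated from the Ring API so it can be run on its own. The callables fetch_latest_id and on_new_event are hypothetical stand-ins, not part of ring_doorbell.

```python
import time


def poll_for_new_events(fetch_latest_id, on_new_event, interval=1.0, max_polls=None):
    """Call on_new_event(event_id) each time fetch_latest_id() returns a new id.

    fetch_latest_id returns the newest event id, or None when there is no event.
    max_polls bounds the loop (None means poll forever).
    Returns the list of ids that were handled, in order.
    """
    last_id = None
    handled = []
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        current_id = fetch_latest_id()
        if current_id is not None and current_id != last_id:
            last_id = current_id
            on_new_event(current_id)
            handled.append(current_id)
        time.sleep(interval)
    return handled
```

Bounding the loop with max_polls is only there to make the sketch testable; a real doorbell watcher would poll indefinitely.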

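If calls such as handle_video and text_to_speech look unfamiliar, don't worry: the face recognition and text-to-speech pieces behind them are covered below. Now that the handler is in place, let's move on to face recognition.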

FaceNet

FaceNet is a model developed by Google in 2015. It learns an embedding in which face images of the same person cluster together.

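The goal is to create an embedding, much like a word embedding. The only difference is that instead of learning a vector for each token id, the model compresses an image into a small latent space. Specifically, given an image of shape (160, 160, 3), FaceNet produces a vector of shape (128,), called its embedding. The model is trained so that faces of different people are far apart in the embedding space and faces of the same person are close together. That way a person can be recognized regardless of lighting, viewing angle, or makeup.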
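
According to the FaceNet paper (arXiv:1503.03832, listed in the references), this property is obtained by training with a triplet loss: an anchor image should be closer to a positive (same person) than to a negative (different person) by some margin. The following NumPy sketch evaluates that loss on made-up two-dimensional vectors; the vectors and the margin value are illustrative only.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||a - p||^2 - ||a - n||^2 + margin, 0) for a single triplet."""
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    return max(pos_dist - neg_dist + margin, 0.0)


anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # same person: close to the anchor
negative = np.array([0.0, 1.0])   # different person: far from the anchor

# Well-separated triplet: the loss is zero, nothing left to learn.
easy = triplet_loss(anchor, positive, negative)
# Roles swapped: the "positive" is far away, so the loss is large.
hard = triplet_loss(anchor, negative, positive)
```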

FaceNet Architecture

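The FaceNet architecture resembles ResNet and InceptionV3 and is shown below. The input image passes through 1x1 convolution and 2x2 pooling layers, then through a deep residual stack built from pairs of Inception blocks and residual connections. The final layers consist of several 3x3 convolution, concatenation, and 2x2 pooling layers.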

The code to load the model is simple. The model is stored in the directory model/files/.

from tensorflow.keras.models import load_model

model = load_model('model/facenet_keras.h5')

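Building a model that generalizes to faces it has never seen is hard. This FaceNet model was trained on the MS-Celeb-1M dataset, which contains roughly 10 million images of about 100,000 celebrities. By L2-normalizing the combined encodings of each person's images and comparing faces with cosine similarity, FaceNet achieves very high recognition accuracy.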
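
L2 normalization and cosine similarity go together naturally: once vectors have unit length, squared Euclidean distance and cosine distance rank neighbours identically, because ||a - b||^2 = 2(1 - cos(a, b)). The NumPy sketch below checks this on random vectors standing in for embeddings; the helper names are hypothetical.

```python
import numpy as np


def l2_normalize(v, eps=1e-10):
    """Scale v to unit Euclidean length."""
    return v / max(np.linalg.norm(v), eps)


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity scipy.spatial.distance.cosine returns."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


rng = np.random.default_rng(0)
a = l2_normalize(rng.normal(size=128))
b = l2_normalize(rng.normal(size=128))

squared_euclidean = np.sum((a - b) ** 2)
# For unit vectors: ||a - b||^2 == 2 * cosine_distance(a, b)
identity_gap = abs(squared_euclidean - 2.0 * cosine_distance(a, b))
```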

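I wrote a convenient way to enroll your family's faces: run submit_face.py and pass the argument "name" (the name of the person to register). To improve accuracy and match the lighting conditions at the door, you can also set the boolean argument "from_door"; if true, the image is saved directly from your doorbell's last recorded video.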
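
The script's actual command-line interface is not shown here, so the following is only a sketch of how such a script might parse its two arguments with argparse. The flag spellings --name and --from_door are assumptions.

```python
import argparse


def build_parser():
    """Parser for a hypothetical face-enrollment script."""
    parser = argparse.ArgumentParser(description="Enroll a face for the doorbell.")
    parser.add_argument("--name", required=True,
                        help="name of the person being enrolled")
    parser.add_argument("--from_door", action="store_true",
                        help="take the image from the doorbell's last recorded video")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--name", "alice", "--from_door"])
```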

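The images are stored in the directory data/faces/ and are pre-cropped with MTCNN face detection. The detection function is part of face_recognition.py and is shown later. For captured video, I grab specific frames and test which ones work. We also need some image preprocessing and a few other small functions, which I define in utils.py: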

import cv2


def normalize(img):
    mean, std = img.mean(), img.std()
    return (img - mean) / (std + 1e-7)


def preprocess(cv2_img):
    cv2_img = normalize(cv2_img)
    cv2_img = cv2.resize(cv2_img, (160, 160))
    return cv2_img


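def get_specific_frames(video_path, times):
    """Return the frames found at the given timestamps (in seconds)."""
    vidcap = cv2.VideoCapture(video_path)
    frames = []
    for seconds in times:
        # Seek by frame index; the recording is assumed to run at 15 fps.
        vidcap.set(cv2.CAP_PROP_POS_FRAMES, seconds * 15)
        success, image = vidcap.read()
        if success:
            frames.append(image)
    vidcap.release()
    return frames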
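
get_specific_frames assumes 15 frames per second. If recordings may use a different rate, the frame index can be derived from the rate that OpenCV reports through cv2.CAP_PROP_FPS. The helper below is a hypothetical, pure-Python sketch of that arithmetic only (it does not open the video), and it drops timestamps that fall beyond the end of the clip.

```python
def times_to_frame_indices(times, fps, frame_count):
    """Convert timestamps in seconds to valid frame indices.

    fps and frame_count would come from vidcap.get(cv2.CAP_PROP_FPS) and
    vidcap.get(cv2.CAP_PROP_FRAME_COUNT). Timestamps outside the clip are dropped.
    """
    indices = []
    for seconds in times:
        index = int(round(seconds * fps))
        if 0 <= index < frame_count:
            indices.append(index)
    return indices
```

Each returned index could then be passed to vidcap.set(cv2.CAP_PROP_POS_FRAMES, index) before reading a frame.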

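Once images of everyone you want to recognize are in data/faces/, we can convert them to encodings. This is a separate step because we combine all images of each person into one encoding and L2-normalize it.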

import os
from utils import preprocess
import cv2
import numpy as np
from sklearn.preprocessing import Normalizer
import face_recognition
import pickle

encoding_dict = {}
l2_normalizer = Normalizer('l2')

for face_names in os.listdir('data/faces/'):
    person_dir = os.path.join('data/faces/', face_names)
    encodes = []
    for image_name in os.listdir(person_dir):
        image_path = os.path.join(person_dir, image_name)

        face = cv2.imread(image_path)

        face = preprocess(face)
        encoding = face_recognition.encode(face)
        encodes.append(encoding)

    if encodes:
        encoding = np.sum(encodes, axis=0)
        encoding = l2_normalizer.transform(np.expand_dims(encoding, axis=0))[0]
        encoding_dict[face_names] = encoding

path = 'data/encodings/encoding.pkl'
with open(path, 'wb') as file:
    pickle.dump(encoding_dict, file)

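preprocess is the function I use to standardize an image and resize it to (160, 160, 3), and face_recognition is the module that provides the encode function. Notice that the encodings are saved as a dictionary. That is handy during real-time recognition, since it is a simple way to store names together with encodings.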
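
One detail of the script above: summing the encodings and then L2-normalizing gives the same unit vector as averaging them first, because normalization removes the scale. A small NumPy check, using random arrays as stand-ins for FaceNet encodings (the helper name is hypothetical):

```python
import numpy as np


def combine_encodings(encodings):
    """Sum per-image encodings and scale the result to unit length."""
    total = np.sum(encodings, axis=0)
    return total / np.linalg.norm(total)


rng = np.random.default_rng(1)
encodings = rng.normal(size=(5, 128))   # five images of one person

from_sum = combine_encodings(encodings)
mean = np.mean(encodings, axis=0)
from_mean = mean / np.linalg.norm(mean)

# Both routes point in the same direction and have length 1.
same_direction = np.allclose(from_sum, from_mean)
```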

Real-Time Face Recognition

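Now that we have images of the people we want to recognize, how does the real-time recognition process work? It is shown in the figure below: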

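When the doorbell rings, a video is downloaded and several frames are selected. The detect_faces function then finds every face in those frames. Here is a snippet from face_recognition.py: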

import cv2
import mtcnn

face_detector = mtcnn.MTCNN()
conf_t = 0.99

def detect_faces(cv2_img):
    img_rgb = cv2.cvtColor(cv2_img, cv2.COLOR_BGR2RGB)
    results = face_detector.detect_faces(img_rgb)
    faces = []
    for res in results:
        x1, y1, width, height = res['box']
        x1, y1 = abs(x1), abs(y1)
        x2, y2 = x1 + width, y1 + height

        confidence = res['confidence']
        if confidence < conf_t:
            continue
        faces.append(cv2_img[y1:y2, x1:x2])
    return faces


def detect_face(cv2_img):
    """Return the crop of the first detected face, or None."""
    img_rgb = cv2.cvtColor(cv2_img, cv2.COLOR_BGR2RGB)
    results = face_detector.detect_faces(img_rgb)
    if not results:
        # No face found in this image.
        return None
    x1, y1, width, height = results[0]['box']
    x1, y1 = abs(x1), abs(y1)
    x2, y2 = x1 + width, y1 + height

    confidence = results[0]['confidence']
    if confidence < conf_t:
        return None
    return cv2_img[y1:y2, x1:x2]
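
MTCNN can return boxes that extend slightly outside the image, which is why the code above takes abs() of the top-left corner. An alternative, shown here only as a sketch, is to clip the box to the image bounds so that the crop never shifts. clip_box is a hypothetical helper, not part of the repository or of mtcnn.

```python
def clip_box(box, image_width, image_height):
    """Clip an (x, y, width, height) box to the image; return (x1, y1, x2, y2)."""
    x, y, width, height = box
    x1, y1 = max(0, x), max(0, y)
    x2 = min(image_width, x + width)
    y2 = min(image_height, y + height)
    return x1, y1, x2, y2
```

With a NumPy image, the crop would then be image[y1:y2, x1:x2], using image.shape[1] and image.shape[0] as the width and height.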

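Each face is preprocessed and fed into FaceNet, which outputs a 128-dimensional embedding per face. These vectors are then compared, using cosine distance, with the vectors stored in encoding.pkl. The person whose stored encoding is closest to the input face is returned. If even the closest encoding is farther away than a set threshold, "UNKNOWN" is returned, meaning the face does not resemble any known face. Here is the rest of face_recognition.py: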

from utils import preprocess
from model.facenet_loader import model
import numpy as np
from scipy.spatial.distance import cosine
import pickle
from sklearn.preprocessing import Normalizer

l2_normalizer = Normalizer('l2')


def encode(img):
    img = np.expand_dims(img, axis=0)
    out = model.predict(img)[0]
    return out


def load_database():
    with open('data/encodings/encoding.pkl', 'rb') as f:
        database = pickle.load(f)
    return database


recog_t = 0.35


def recognize(img):
    people = detect_faces(img)
    if len(people) == 0:
        return None
    best_people = []
    people = [preprocess(person) for person in people]
    encoded = [encode(person) for person in people]
    encoded = [l2_normalizer.transform(encoding.reshape(1, -1))[0]
               for encoding in encoded]
    database = load_database()
    for person in encoded:
        best = 1
        best_name = ''
        for k, v in database.items():
            dist = cosine(person, v)
            if dist < best:
                best = dist
                best_name = k
        if best > recog_t:
            best_name = 'UNKNOWN'
        best_people.append(best_name)
    return best_people
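
The inner loop of recognize is a nearest-neighbour search with a rejection threshold. Here is a sketch of the same decision rule on a toy database, using plain NumPy. The names and three-dimensional vectors are made up for illustration, while 0.35 is the recog_t value from the code above.

```python
import numpy as np


def best_match(encoding, database, threshold=0.35):
    """Return the closest name by cosine distance, or 'UNKNOWN' above threshold."""
    best_name, best_dist = "UNKNOWN", float("inf")
    for name, known in database.items():
        dist = 1.0 - np.dot(encoding, known) / (
            np.linalg.norm(encoding) * np.linalg.norm(known))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else "UNKNOWN"


database = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
}
# Points almost in alice's direction: accepted as alice.
near_alice = best_match(np.array([0.9, 0.1, 0.0]), database)
# Orthogonal to every known face (distance 1.0): rejected.
stranger = best_match(np.array([0.0, 0.0, 1.0]), database)
```

A lower threshold makes the system stricter (more faces reported as unknown); a higher one makes it more permissive.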

That completes most of the recognition work.

Speech Synthesis

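I wanted to hear who is at the door. At first I thought playing a sound on the Ring device itself would be the best approach, but Amazon does not allow that; only the default chime that accompanies a ring can be played. Text-to-speech on the computer therefore seemed the better route. Two packages make this easy: gTTS and playsound. gTTS is a Python interface to Google Translate's text-to-speech API, which I understand to be based on Google's Tacotron 2 model. Understanding how it works in detail is not essential, but for interested readers the figure illustrates its architecture.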

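Tacotron is very similar to Seq2Seq, but it uses bidirectional LSTMs, convolutional layers, a pre-net and, most importantly, a 2D output from the decoder (a spectrogram). If you want to learn more about Tacotron 2, here is a video on the topic by CodeEmporium: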

https://www.youtube.com/watch?v=le1LH4nPfmE&ab_channel=CodeEmporium

Tacotron 2 is hardly the best available, especially compared with transformer models, but it gets the job done. The gTTS Python API is used as follows:

from gtts import gTTS
from playsound import playsound

language = 'en'
slow_audio_speed = False
filename = 'tts_file.mp3'

def text_to_speech(text):
    audio_created = gTTS(text=text, lang=language,
                         slow=slow_audio_speed)
    audio_created.save(filename)
    playsound(filename)

Simple enough.

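I use playsound rather than os.system because os.system would open the default audio player application, whereas playsound does not pop up any window. That completes the last step of the project.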
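
What text_to_speech receives is a plain string. How the repository builds that string is not shown here; one way to turn the list of names returned by recognize into a sentence might look like this. The helper and its wording are hypothetical, apart from the fallback sentence, which is the one used in ring.py.

```python
def announcement(names):
    """Build a spoken sentence from recognized names; 'UNKNOWN' entries are counted."""
    if not names:
        return "The person at the door is not very clear"
    known = [name for name in names if name != "UNKNOWN"]
    unknown = len(names) - len(known)
    if unknown == 1:
        known.append("an unknown person")
    elif unknown > 1:
        known.append(f"{unknown} unknown people")
    if len(known) == 1:
        subject = known[0]
    else:
        subject = ", ".join(known[:-1]) + " and " + known[-1]
    verb = "is" if len(names) == 1 else "are"
    return f"{subject} {verb} at the door"
```

The result can be passed straight to text_to_speech.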

Summary and Git Repository

See my Git repository for the complete code and to customize your own doorbell:

https://github.com/dude123studios/SmarterRingV2

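The README.md has instructions and explains the exact steps for using this system in your own home. Setup takes only about five minutes. Amazon, put this in your next doorbell!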

Further Exploration and Questions

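FaceNet is a fairly dated model. The past five years have brought major advances in transformer models, such as ViT, and GPT-3 generalizes remarkably well. Would a transformer like GPT-3 do a better job of producing general-purpose embeddings? Convolutional neural networks may not be the best choice for face recognition, because capturing long-range dependencies (such as between the ears and the jawline) requires a very large network. Transformer models, on the other hand, can account for self-similarity and might perform real-time face recognition considerably faster.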

References

https://www.youtube.com/watch?v=le1LH4nPfmE&ab_channel=CodeEmporium

https://github.com/tchellomello/python-ring-doorbell

https://arxiv.org/abs/1503.03832

https://gtts.readthedocs.io/en/latest/module.html

https://pypi.org/project/playsound/

https://pypi.org/project/mtcnn/
