Face Recognition With Ring

Atharv Naphade
7 min read · Mar 8, 2021

As the buyer of a new Amazon Ring Doorbell, I enjoyed the cool features it offered. However, I thought I could make some improvements: what I needed was a doorbell customized for the people who live in our house. It would be nice if the doorbell recognized who was at the door. Seeing how popular Ring is, I decided the best way to help most households would be to let them customize their Ring Doorbell with little to no effort.

I have developed an application that can tell you who’s at your door, given just your Ring account’s username and password. Knowing who’s at your door, without having to wait for your Ring to display the video on your smartphone, is very convenient. It improves safety, and it could even be wired to a system that automatically opens the door as well. In the age of deep learning, systems like this could easily be in every household. The graphic below illustrates how my system works.

Architecture for the customized Ring Doorbell face recognition system

The complete code can be found in my Git Repository here.

The requirements are as follows:
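The original requirements list did not survive extraction; the following is a plausible reconstruction based on the libraries discussed in this post and listed in the references (exact package versions unverified):

```text
ring_doorbell
mtcnn
tensorflow
opencv-python
numpy
gTTS
playsound
```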

Let’s break down what happens. By entering a username and password as environment variables, the Ring API can connect to your account. The API lets you access your Ring’s features from Python; the API repository, with brief documentation, is here. This is a snippet of ring.py that instantiates a connection with your Ring:
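The original snippet was not preserved, so here is a minimal sketch of the connection step, following the documented usage of the python-ring-doorbell package. The user-agent string, helper names, and the choice of the first doorbell on the account are my own illustrative assumptions:

```python
import os


def get_credentials():
    """Read the Ring account credentials from environment variables
    (hypothetical variable names, matching the post's description)."""
    return os.environ["RING_USERNAME"], os.environ["RING_PASSWORD"]


def connect_to_ring():
    """Authenticate against the Ring API and return the first doorbell.

    Based on python-ring-doorbell's documented Auth/Ring flow; imported
    lazily so the rest of the module can be used without the package.
    """
    from ring_doorbell import Auth, Ring

    username, password = get_credentials()
    auth = Auth("FaceRecognition/1.0")   # arbitrary user-agent string
    auth.fetch_token(username, password)
    ring = Ring(auth)
    ring.update_data()                   # refresh account/device state
    return ring.devices()["doorbots"][0]
```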

The wait_for_update method runs continuously, instantiating a handler that waits on your client. It keeps refreshing until it notices an update in the Ring’s stored history. Once this happens, it checks whether a “ding” has occurred (i.e., the doorbell was pressed). If so, it downloads the entire video to your device. To speed up the process, turn down the video recording size in the Ring app on your smartphone.
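The polling loop just described can be sketched as follows. The history() and recording_download() calls follow python-ring-doorbell's API; the on_ding callback, the filename, and the max_polls parameter are my own additions (the latter purely so the loop can terminate in a test):

```python
import time


def wait_for_update(doorbell, on_ding, poll_seconds=5, max_polls=None):
    """Poll the doorbell's event history until a new 'ding' appears,
    then download the recording and hand the file path to `on_ding`."""
    # Baseline: remember whatever event is already in the history.
    events = doorbell.history(limit=1, kind="ding")
    last_id = events[0]["id"] if events else None

    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        events = doorbell.history(limit=1, kind="ding")
        if events and events[0]["id"] != last_id:
            # A new ding: grab its recording and hand it off.
            last_id = events[0]["id"]
            doorbell.recording_download(last_id, filename="last_ding.mp4")
            on_ding("last_ding.mp4")
        time.sleep(poll_seconds)
    return last_id
```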

The Ring doorbell that you own will send its last video to your computer. From there, we grab multiple frames of that video, making sure that a person’s face isn’t turned away in any of them. I define this method in utils.py, which will be shown later. The following is another snippet of ring.py, handling the main thread:
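Since the original snippet is missing, here is a sketch of what that main-thread glue plausibly looks like: take the downloaded clip, extract frames, recognize whoever appears, and announce the result. The callables are passed in as stand-ins for the recognize, frame-grabbing, and text_to_speech methods the post describes, which keeps this sketch testable:

```python
def handle_ding(video_path, get_frames, recognize, text_to_speech):
    """Run recognition over a downloaded doorbell clip and announce it.

    `get_frames`, `recognize`, and `text_to_speech` are stand-ins for
    the methods described in the post.
    """
    names = set()
    for frame in get_frames(video_path):
        names.update(recognize(frame))
    names.discard("unknown")  # don't announce unrecognized faces by name

    if names:
        message = ", ".join(sorted(names)) + " is at the door"
    else:
        message = "Someone is at the door"
    text_to_speech(message)
    return message
```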

If you are a bit confused about the recognize, get_first_frame, and text_to_speech method calls, don't worry! We’re getting to that! Now that our handler is in place, let us get to the vital aspect of Facial Recognition!

FaceNet

FaceNet is a model by Google, developed in 2015. FaceNet uses a process known as clustering.

Clustering Diagram

Clustering aims to create an embedding, just like the embeddings used for words. The only difference is that the model is not learning vectors for tokenized IDs; rather, it is compressing images down to a small latent space. Specifically, given an image of shape (160, 160, 3), the FaceNet model produces a vector of shape (128,), known as its embedding. The model ensures that different people’s faces are farther apart in the embedding space, while faces of the same person are close together. In this way, a person can be recognized regardless of lighting conditions, angle, or the makeup they wear.
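The distance property described above can be illustrated with synthetic 128-dimensional vectors standing in for real embeddings: a second capture of the same face lands close to the first, while a different face lands far away. Real FaceNet embeddings behave this way by construction of the training loss:

```python
import numpy as np

rng = np.random.default_rng(0)
anchor = rng.normal(size=128)                             # person A, photo 1
same_person = anchor + rng.normal(scale=0.05, size=128)   # person A, photo 2
other_person = rng.normal(size=128)                       # person B


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(a - b))


# Same person stays close; a different person is far away.
assert l2_distance(anchor, same_person) < l2_distance(anchor, other_person)
```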

FaceNet Architecture

FaceNet is similar to ResNet and InceptionV3. The architecture is shown below. The input image is passed through 1x1 conv and 2x2 pooling layers, then flows down a deep ResNet with pairs of Inception layers followed by residual connections. The final layers contain multiple 3x3 conv, concat, and 2x2 pooling layers.

FaceNet architecture

The code to load the model is simple. The model is stored in the directory model/files/.
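The loading snippet itself was not preserved; a sketch of what it plausibly looks like, assuming the model is a Keras .h5 export (the filename facenet_keras.h5 is a guess based on the commonly shared Keras FaceNet port, so adjust it to whatever actually sits in model/files/):

```python
def load_facenet(path="model/files/facenet_keras.h5"):
    """Load the pretrained FaceNet Keras model from disk.

    The path default is a guess; TensorFlow is imported lazily so the
    rest of the code can be exercised without it installed.
    """
    from tensorflow.keras.models import load_model
    return load_model(path)
```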

Developing a model that can generalize to faces it has never seen before is tough. The FaceNet model was trained on the MS-Celeb-1M dataset of a million images of different celebrities. With L2 normalization applied to groups of images from the same person, and a cosine similarity function, FaceNet is capable of incredibly high recognition accuracy.

I have created a convenient way to register the faces of your family. Run submit_face.py and pass the argument “name” (the name of the person whose face you want to register). Alternatively, to improve accuracy and match lighting conditions, you can use the Boolean argument “from_door”, which, if true, will save images directly from your Ring’s last recorded video.
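The command-line interface just described might be wired up as follows; the exact flag spellings are my guess at the script's interface, since the original source isn't reproduced here:

```python
import argparse


def build_parser():
    """Argument parser for submit_face.py as described in the post."""
    parser = argparse.ArgumentParser(description="Register a face")
    parser.add_argument(
        "--name", required=True,
        help="name of the person whose face to register")
    parser.add_argument(
        "--from_door", action="store_true",
        help="save crops from the Ring's last video instead of a webcam")
    return parser
```

Usage would then look like `python submit_face.py --name alice --from_door`.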

The images are stored in the directory data/faces/. They are saved pre-cropped with MTCNN face detection. The detection method, part of face_recognition.py, will be shown later. For the Ring video, I grab specific frames and test which ones work. We will also need some image preprocessing, along with other small functions, which I define in utils.py:
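The utils.py listing is missing, so here is a sketch of the two helpers the post relies on: a preprocess step that resizes a face crop to the model's (160, 160, 3) input and standardizes it, and a frame grabber for the downloaded clip. Helper names, the frame indices, and the use of OpenCV are assumptions:

```python
import numpy as np


def standardize(face):
    """Zero-mean, unit-variance scaling (the usual FaceNet input convention)."""
    face = face.astype("float32")
    return (face - face.mean()) / face.std()


def preprocess(face, size=(160, 160)):
    """Resize a face crop to the model's input size and standardize it."""
    import cv2  # opencv-python, imported lazily
    return standardize(cv2.resize(face, size))


def get_frames(video_path, indices=(0, 10, 20)):
    """Grab a handful of frames from the downloaded clip."""
    import cv2
    cap = cv2.VideoCapture(video_path)
    frames = []
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)  # seek to frame i
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```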

Once images of everyone you would like to recognize are in the directory data/faces/, we are ready to convert them to encodings. We make this a separate step because we L2-normalize all of the images corresponding to each person.
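A sketch of that encoding step, under the assumption (mine) that faces are laid out as data/faces/&lt;name&gt;/&lt;image&gt; and that embed_fn maps an image file to a 128-d FaceNet vector:

```python
import os
import pickle

import numpy as np


def l2_normalize(vectors, eps=1e-10):
    """Scale each row to unit L2 norm."""
    vectors = np.asarray(vectors, dtype="float32")
    norms = np.maximum(np.linalg.norm(vectors, axis=-1, keepdims=True), eps)
    return vectors / norms


def build_encodings(embed_fn, faces_dir="data/faces/", out_path="encodings.pkl"):
    """Embed every stored face and pickle a {name: mean embedding} dict."""
    encodings = {}
    for name in sorted(os.listdir(faces_dir)):
        person_dir = os.path.join(faces_dir, name)
        vecs = l2_normalize([
            embed_fn(os.path.join(person_dir, f))
            for f in sorted(os.listdir(person_dir))
        ])
        encodings[name] = vecs.mean(axis=0)  # one vector per person
    with open(out_path, "wb") as fh:
        pickle.dump(encodings, fh)
    return encodings
```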

The preprocess function normalizes an image and reshapes it to (160, 160, 3), and recognize is a class that performs the encoding. Notice that I save the encodings as a dictionary: during real-time recognition it is a convenient way to store both the name and the encoding of each person.

Real Time Face Recognition

Now that we have images of the people we would like to recognize, how does the real-time recognition process work? It is shown in the following diagram:

When the doorbell is rung, a video is downloaded and multiple frames are selected. On these frames, multi-instance face detection is performed with the detect_faces method. The following is a snippet of the face_recognition.py class:
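With the original listing missing, here is a sketch of the detection step using the mtcnn package's documented detect_faces call, which returns one dict per face with a "box" entry. The box-clamping helper is my own addition (MTCNN occasionally returns boxes with negative corners):

```python
def clamp_box(box, height, width):
    """Clip a (x, y, w, h) detection box to the image bounds."""
    x, y, w, h = box
    x, y = max(0, x), max(0, y)
    return x, y, min(w, width - x), min(h, height - y)


def detect_faces(image):
    """Return one crop per face detected in a BGR/RGB numpy image."""
    from mtcnn import MTCNN  # imported lazily; heavy dependency
    detector = MTCNN()
    crops = []
    for det in detector.detect_faces(image):
        x, y, w, h = clamp_box(det["box"], image.shape[0], image.shape[1])
        crops.append(image[y:y + h, x:x + w])
    return crops
```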

The image crops are preprocessed and fed into FaceNet, which outputs each face’s 128-dimensional embedding. Each of these vectors is then compared, using cosine similarity, to the vectors stored in encodings.pkl. The people whose stored faces are closest to the input faces are returned. If a face is more than a certain threshold away from its closest match, “unknown” is returned, indicating that the face does not resemble any known face. The following is the remainder of the face_recognition.py class:
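The matching logic described above amounts to a nearest-neighbour lookup with a rejection threshold. A self-contained sketch (the 0.5 threshold is illustrative, not the post's actual value):

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype="float32")
    b = np.asarray(b, dtype="float32")
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def match(embedding, known, threshold=0.5):
    """Return the name of the closest stored encoding, or 'unknown'
    if nothing clears the similarity threshold."""
    best_name, best_sim = "unknown", threshold
    for name, encoding in known.items():
        sim = cosine_similarity(embedding, encoding)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name
```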

This completes the majority of the recognition task.

Text-To-Speech

I would prefer to be told who is at the door. At first, I thought that playing a sound on the Ring Chime device was the optimal strategy, but Amazon prevents me from doing this, only allowing the default sounds that ship with the Ring. For this reason, text-to-speech seems a more appropriate modality. This is made simple with two packages, gTTS and playsound. gTTS uses Google’s Tacotron 2 model. While it is not important that you fully understand how it works, for the interested reader the diagram below illustrates its architecture.

Tacotron 2 is similar to Seq2Seq with attention, but it uses a bidirectional LSTM, convolutional layers, a pre-net layer, and, most importantly, 2D generative inputs to the decoder (spectrograms). If you would like to learn more about Tacotron 2, here is a video by CodeEmporium on the subject. While Tacotron 2 is not the state of the art, especially compared to transformer models, it does the job. To use the gTTS Python API, I use the following method:
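The original method is missing, so here is a sketch using gTTS's documented save() call and playsound. The message-formatting helper and file name are my own additions:

```python
def format_announcement(names):
    """Build the spoken sentence from a list of recognized names."""
    names = [n for n in names if n != "unknown"]
    if not names:
        return "Someone is at the door"
    return " and ".join(names) + " is at the door"


def text_to_speech(message, filename="announce.mp3"):
    """Synthesize `message` to an mp3 via gTTS and play it.

    gTTS hits Google's TTS endpoint, so this needs network access;
    playsound plays the file without opening a media-player window.
    """
    from gtts import gTTS
    from playsound import playsound
    gTTS(text=message, lang="en").save(filename)
    playsound(filename)
```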

Pretty simple.

The reason I use playsound rather than os.system is that os.system will open the system’s default sound player app, while playsound produces no popups. That completes the final aspect of the project.

Conclusion and Git Repository

Check out my git repository here to get the entire code and easily customize your own Ring Doorbell. The instructions are in the README.md file and explain the exact steps to use this system in your own home. It takes only five minutes to set up! Amazon, put this in your next doorbell!

Further Exploration and Questions

FaceNet is quite an outdated model. Over the past five years, major advances in Transformers have been made, such as the ViT, and GPT-3 generalizes astonishingly well. For the task of creating generalized embeddings, would transformers work better? ConvNets may not be ideal for facial recognition, given the deep networks required to capture long-range dependencies such as ears or jawline. Transformers, on the other hand, can take self-similarity into account, and may well be fast enough to perform facial recognition in real time.

References

  1. https://www.youtube.com/watch?v=le1LH4nPfmE&ab_channel=CodeEmporium
  2. https://github.com/tchellomello/python-ring-doorbell
  3. https://arxiv.org/abs/1503.03832
  4. https://gtts.readthedocs.io/en/latest/module.html
  5. https://pypi.org/project/playsound/
  6. https://pypi.org/project/mtcnn/
