Metrics Context
With the model running on the PYNQ-Z2, what remains is to establish a means of comparison. The only way to know whether a PLD is a good device for executing neural networks is to compare it against other devices using common metrics. Previously we compared the inference speed against a regular computer, but now we will evaluate some accuracy measurements.
To start, I'll give you some context on common object detection metrics. This information can also be found in Kiprono Elijah Koech's article and in Rafael Padilla's work.
Intersection over Union (IoU) is a metric that evaluates the overlap between two bounding boxes. It requires the ground truth bounding boxes (the ones that actually delimit the object) and the predicted bounding boxes (the ones the model outputs). The IoU is the area of the intersection of the ground truth bounding box with the predicted bounding box, divided by the area of their union. The following equation shows that mathematical relationship:
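Here B_gt denotes the ground truth box and B_p the predicted box (notation chosen just for this write-up):

```latex
\text{IoU} = \frac{\text{area}(B_{gt} \cap B_{p})}{\text{area}(B_{gt} \cup B_{p})}
```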
To visualize the situation better, the next image illustrates the same equation, with the ground truth bounding box in green and the predicted bounding box in red:
IoU varies between 0 and 1, where 0 means there is no intersection between the ground truth bounding box and the predicted bounding box, and 1 means they overlap perfectly. This metric is essential to distinguish a correct detection from a wrong one, so it needs to be associated with a threshold such as 50%, 75% or 95%. The following concepts apply that distinction based on the threshold value (a small code sketch follows the list):
True Positive (TP) - A correct detection: IoU greater than or equal to the threshold.
False Positive (FP) - An incorrect detection: IoU below the threshold.
False Negative (FN) - A ground truth that was not detected.
True Negative (TN) - Would represent the bounding boxes that should not be detected in the image. This does not apply to object detection, since there are countless possible boxes that should not be detected.
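As a minimal sketch of how that check works in practice, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates (a convention chosen here just for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Width and height are clamped to 0 when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classify_detection(gt_box, pred_box, threshold=0.5):
    """A detection is a TP if its IoU with the ground truth reaches the threshold."""
    return "TP" if iou(gt_box, pred_box) >= threshold else "FP"


print(classify_detection((0, 0, 10, 10), (2, 2, 12, 12)))  # IoU ~0.47 -> "FP" at 0.5
```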
These concepts can be organized in a confusion matrix, as shown below:
Now, two important metrics for object detection will be introduced: precision and recall. Precision is the model's ability to detect only relevant objects, i.e. the percentage of detections that are correct, as shown in the equation:
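In terms of the counts defined above (all detections = TP + FP):

```latex
\text{Precision} = \frac{TP}{TP + FP}
```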
Recall, in turn, represents the model's ability to find all the ground truths. In other words, it is the fraction of ground truths that were detected as true positives (TP).
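Using the same notation, and noting that every ground truth is either detected (TP) or missed (FN):

```latex
\text{Recall} = \frac{TP}{TP + FN}
```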
To help you understand these measurements better, I created the following image, which we will now discuss:
The figure represents an image where the ground truth bounding boxes are shown in green and the bounding boxes predicted by the model in red. The IoU threshold was set to 50%, so if the IoU is smaller than that value the detection is a False Positive (FP), and if the IoU is greater than or equal to 50% it is a True Positive (TP), in other words a correct detection. In the image there are, in fact, 2 FP and 1 TP. That being said, the precision is the number of TP divided by all detections/predictions, so P = 1/3 = 33.3%. On the other hand, the recall is the number of TP divided by the number of ground truths, so the recall is 1 TP divided by 4 ground truths (green), giving R = 25%.
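The same arithmetic as a tiny sketch, with the counts read off the figure described above:

```python
tp = 1                  # detections with IoU >= 0.5
fp = 2                  # detections with IoU < 0.5
num_ground_truths = 4   # green boxes in the figure

precision = tp / (tp + fp)          # TP over all detections
recall = tp / num_ground_truths     # TP over all ground truths

print(f"Precision = {precision:.1%}")  # 33.3%
print(f"Recall    = {recall:.1%}")     # 25.0%
```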
A good way to measure the performance of an object detector is to use a graph that relates precision and recall for each class. A good object detector is one whose precision remains high as the recall increases, because this means that even as the confidence threshold changes, precision and recall stay high. In the following image, the blue curve corresponds to a good object detector and the red curve to a worse one.
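As a rough sketch of how such a curve is built from a model's output, assuming each detection already carries a confidence score and has been labelled TP or FP with the IoU check from before (the data below is made up for illustration):

```python
# Detections for one class: (confidence, is_true_positive), labelled via the IoU check.
detections = [(0.95, True), (0.90, True), (0.80, False),
              (0.70, True), (0.60, False), (0.50, True)]
num_ground_truths = 5  # total ground truth boxes of this class (illustrative)

# Sweep the confidence threshold by walking the detections in descending confidence.
detections.sort(key=lambda d: d[0], reverse=True)

precisions, recalls = [], []
tp = fp = 0
for _, is_tp in detections:
    if is_tp:
        tp += 1
    else:
        fp += 1
    precisions.append(tp / (tp + fp))
    recalls.append(tp / num_ground_truths)

for p, r in zip(precisions, recalls):
    print(f"recall={r:.2f}  precision={p:.2f}")
```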
Another good way to compare the performance of object detection models is the Average Precision (AP). This metric is simply the area under the precision x recall curve. It exists because directly comparing different curves from different models is often hard, since there can be abrupt changes or the curves may cross. Basically, AP is the average precision across recall values between 0 and 1. Nowadays the interpolation is done over all points of the graph, but before 2010 different methods were used, based on interpolation at 11 equally spaced points. The next images show, in a simple way, how the AP is obtained:
As you can see, the first step is to interpolate the precision x recall curve. This process simplifies the original curve so that the AP calculation becomes easier and less computationally demanding.
The next step consists of generating rectangles under the interpolated curve so we can obtain an approximation of its total area. The sum of the areas of all rectangles is the AP. In this case, if the AP is 0.2456, it means the average precision of the model across the recall range is 24.56%.
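A minimal sketch of that procedure, using all-point interpolation (each precision is replaced by the maximum precision at that recall or higher, then the rectangle areas are summed) and assuming precision and recall lists like the ones built in the earlier snippet:

```python
def average_precision(recalls, precisions):
    """All-point interpolated AP: area under the interpolated precision x recall curve."""
    # Pad the curve so the rectangles cover the full recall range.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]

    # Interpolation: replace each precision by the maximum precision to its right.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])

    # Sum the rectangle areas wherever the recall increases.
    ap = 0.0
    for i in range(1, len(r)):
        ap += (r[i] - r[i - 1]) * p[i]
    return ap


# Example with the (rounded) values produced by the precision/recall sweep above.
recalls = [0.2, 0.4, 0.4, 0.6, 0.6, 0.8]
precisions = [1.0, 1.0, 0.67, 0.75, 0.6, 0.67]
print(f"AP = {average_precision(recalls, precisions):.4f}")
```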
To compare the performance of models across all classes, it is possible to compute the mean Average Precision (mAP), which is the average of the AP over the full set of classes. It is important that the models being compared have the same number of classes, otherwise this metric is no longer relevant.
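As a final sketch, mAP is simply the mean of the per-class AP values (the class names and numbers below are made up):

```python
# Hypothetical per-class AP values; mAP is their mean.
ap_per_class = {"person": 0.62, "car": 0.55, "dog": 0.48}
map_score = sum(ap_per_class.values()) / len(ap_per_class)
print(f"mAP = {map_score:.4f}")  # 0.5500
```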
The number of classes is the number of object categories that can be identified. In our case, the YOLO model used the COCO dataset, which has the full 80 classes.