Operator Reference

Object Detection and Instance Segmentation

This chapter explains how to use object detection based on deep learning.

With object detection we want to find the different instances in an image and assign them to a class. The instances can partially overlap and still be distinguished as distinct. This is illustrated in the following schema.

A possible example for object detection: Within the input image three instances are found and assigned to a class.

Instance segmentation is a special case of object detection, where the model also predicts an instance mask marking the specific region of the instance within the image. This is illustrated in the following schema. In general, the explanations given for object detection also apply to instance segmentation. Possible differences are pointed out in the specific sections.

A possible example for instance segmentation: Within the input image three instances are found. Each instance is assigned to a class and obtains a mask marking its particular region.

Object detection comprises two different tasks: finding the instances and classifying them. In order to do so, we use a combined network consisting of three main parts.

The first part, called backbone, consists of a pretrained classification network. Its task is to generate various feature maps, so the classifying layer is removed. These feature maps encode different kinds of information at different scales, depending on how deep they are in the network. See also the chapter Deep Learning. Feature maps with the same width and height are said to belong to the same level.

In the second part, backbone layers of different levels are combined. More precisely, backbone layers of different levels are specified as docking layers and their feature maps are combined. As a result, we obtain feature maps containing information of lower and higher levels. These are the feature maps we will use in the third part. This second part is also called feature pyramid, and together with the first part it constitutes the feature pyramid network.

The third part consists of additional networks, called heads, for every selected level. They get the corresponding feature maps as input and learn how to localize and classify potential objects, respectively. Additionally, this third part includes the reduction of overlapping predicted bounding boxes. An overview of the three parts is shown in the following figure.

A schematic overview of the mentioned three parts: (1) The backbone. (2) Backbone feature maps are combined and new feature maps generated. (3) Additional networks, called heads, which learn how to localize and classify, respectively, potential objects. Overlapping bounding boxes are suppressed.

Let us have a look at what happens in this third part. In object detection, the location of an instance in the image is given by a rectangular bounding box. Hence, the first task is to find a suitable bounding box for every single instance. To do so, the network generates reference bounding boxes and learns how to modify them to fit the instances as well as possible. These reference bounding boxes are called anchors. The better these anchors represent the shape of the different ground truth bounding boxes, the easier the network can learn them. For this purpose, the network generates a set of anchors on every anchor point, thus on every pixel of the used feature maps of the feature pyramid. Such a set consists of anchors of all combinations of shapes, sizes, and, for instance type 'rectangle2' (see below), also orientations. The shape of those boxes is affected by the parameter 'anchor_aspect_ratios', the size by the parameter 'anchor_num_subscales', and the orientation by the parameter 'anchor_angles', see the illustration below and get_dl_model_param. If the parameters generate multiple identical anchors, the network internally ignores those duplicates.

Schema for anchors in the feature map (right) and in the input image (left). (1) Anchors are created on the feature maps of different levels, e.g., the ones drawn (in light blue, orange, and dark blue). (2) Anchors of different sizes are created setting 'anchor_num_subscales'. (3) Anchors of different shapes are created setting 'anchor_aspect_ratios'. (4) Anchors of different orientations are created setting 'anchor_angles' (only for instance type 'rectangle2').
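
The way an anchor set is built from all combinations can be sketched in a few lines of Python. This is an illustrative sketch only and not HALCON's implementation: the function name anchor_set, the base size, the definition of the aspect ratio as height divided by width, and the subscale spacing of 2 ** (k / num_subscales) are assumptions made for this example. It merely shows that one anchor arises per combination of shape, size, and orientation, and that identical anchors collapse into one.

```python
import math
from itertools import product


def anchor_set(base_size, aspect_ratios, num_subscales, angles=(0.0,)):
    """Return the distinct anchors (height, width, angle) for one anchor point.

    Assumptions for this sketch (not taken from HALCON):
    - aspect ratio = height / width, with the anchor area kept constant,
    - subscale k scales the base size by 2 ** (k / num_subscales).
    """
    anchors = set()
    for ratio, k, angle in product(aspect_ratios, range(num_subscales), angles):
        size = base_size * 2.0 ** (k / num_subscales)
        height = size * math.sqrt(ratio)
        width = size / math.sqrt(ratio)
        # Rounding lets numerically identical anchors collapse in the set.
        anchors.add((round(height, 6), round(width, 6), round(angle, 6)))
    return sorted(anchors)


# 3 aspect ratios x 3 subscales x 1 angle = 9 distinct anchors.
anchors = anchor_set(32.0, aspect_ratios=[1.0, 2.0, 0.5], num_subscales=3)
print(len(anchors))

# A duplicated aspect ratio yields identical anchors, which are dropped.
deduplicated = anchor_set(32.0, aspect_ratios=[1.0, 1.0], num_subscales=1)
print(len(deduplicated))
```

Since the number of anchors per anchor point is the product of the three parameter counts, every additional aspect ratio, subscale, or angle multiplies the number of boxes the heads have to evaluate.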

The network predicts offsets describing how to modify the anchors in order to obtain bounding boxes that fit the potential instances better. The network learns this with its bounding box heads, which compare the anchors generated for their level with the corresponding ground truth bounding boxes, i.e., the information where in the image the single instances are. An illustration is shown in the figure below.


Bounding box comparisons, illustrated for instance type 'rectangle1'. (1) The network modifies the anchor (light blue) in order to predict a better fitting bounding box (orange). (2) During training, the predicted bounding box (orange) gets compared with the most overlapping ground truth bounding box (blue) so the network can learn the necessary modifications.

As mentioned before, feature maps of different levels are used. Depending on the size of your instances in comparison to the total image, it can be beneficial to include early feature maps (where the feature map is not very compressed and therefore small features are still visible), deeper feature maps (where the feature map is very compressed and only large features are visible), or both. This can be controlled by the parameters 'min_level' and 'max_level', which determine the levels of the feature pyramid.
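
To get a feeling for how the choice of levels affects the number of anchors, the following Python sketch counts anchors per level. It is a rough estimate under stated assumptions, not a description of HALCON internals: it assumes that the feature map of level l is the input image downsampled by a factor of 2 ** l in each dimension, and the function name anchors_per_level as well as its arguments are hypothetical.

```python
def anchors_per_level(image_height, image_width, min_level, max_level,
                      anchors_per_point):
    """Count anchors per pyramid level.

    Assumption for this sketch: the feature map of level l is the input
    image downsampled by a factor of 2 ** l in each dimension.
    """
    counts = {}
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        # Ceiling division (assumed): a partially covered border still
        # receives an anchor point.
        height = -(-image_height // stride)
        width = -(-image_width // stride)
        counts[level] = height * width * anchors_per_point
    return counts


counts = anchors_per_level(512, 640, min_level=2, max_level=4,
                           anchors_per_point=9)
for level, count in counts.items():
    print(level, count)
```

Under this assumption each lower level contributes about four times as many anchors as the next higher one, which is why including early levels only pays off when small instances actually occur.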

With these bounding boxes we have the localization of a potential instance, but the instance is not classified yet. Hence, the second task consists of classifying the content of the image part within the bounding boxes. This is done by the class heads. For more information about classification in general, see the chapter Deep Learning / Classification and the “Solution Guide on Classification”.

Most probably the network will find several promising bounding boxes for a single object. The reduction of overlapping predicted bounding boxes is done by non-maximum suppression, which is controlled by the parameters 'max_overlap' and 'max_overlap_class_agnostic'. They can be set when creating the model or afterwards using set_dl_model_param. An illustration is given in the figure below.


Suppression of significantly overlapping bounding boxes, illustrated for instance type 'rectangle1'. (1) The network finds several promising instances for the classes apple (orange) and lemon (blue). (2) The suppression of overlapping instances assigned to the same class is set by 'max_overlap'. Overlapping instances of different classes are not suppressed. (3) Using the parameter 'max_overlap_class_agnostic', strongly overlapping instances of different classes are suppressed as well.
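
The interplay of the two overlap parameters can be illustrated with a small greedy non-maximum suppression in Python. This is a conceptual sketch, not HALCON's implementation: it assumes that the overlap is measured as the IoU of axis-aligned boxes given as (row1, col1, row2, col2) and that detections are processed in order of decreasing confidence. The dictionary keys 'box', 'class_id', and 'confidence' are hypothetical names chosen for this example.

```python
def iou_rectangle1(a, b):
    """IoU of two axis-aligned boxes given as (row1, col1, row2, col2)."""
    inter_h = min(a[2], b[2]) - max(a[0], b[0])
    inter_w = min(a[3], b[3]) - max(a[1], b[1])
    if inter_h <= 0 or inter_w <= 0:
        return 0.0
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_maximum_suppression(detections, max_overlap,
                            max_overlap_class_agnostic=1.0):
    """Greedy NMS over dicts with keys 'box', 'class_id', 'confidence'."""
    kept = []
    for det in sorted(detections, key=lambda d: d["confidence"], reverse=True):
        suppressed = False
        for other in kept:
            overlap = iou_rectangle1(det["box"], other["box"])
            same_class = det["class_id"] == other["class_id"]
            if same_class and overlap > max_overlap:
                suppressed = True  # same class, overlaps too much
            elif overlap > max_overlap_class_agnostic:
                suppressed = True  # any class, overlaps too much
            if suppressed:
                break
        if not suppressed:
            kept.append(det)
    return kept


detections = [
    {"box": (0, 0, 10, 10), "class_id": "apple", "confidence": 0.9},
    {"box": (1, 1, 11, 11), "class_id": "apple", "confidence": 0.8},
    {"box": (0, 1, 10, 11), "class_id": "lemon", "confidence": 0.7},
]
# Only the weaker apple is suppressed; the lemon belongs to another class.
print(len(non_maximum_suppression(detections, max_overlap=0.5)))
# With a class-agnostic limit the strongly overlapping lemon is dropped too.
print(len(non_maximum_suppression(detections, 0.5, 0.7)))
```

With the default value of 1.0 for the class-agnostic limit in this sketch, the second condition never triggers, which corresponds to panel (2) of the figure.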

As output you get bounding boxes proposing possible localizations of objects and confidence values, expressing the affinity of this image part to one of the classes.

In case of instance segmentation a further part follows. An additional head obtains as input those parts of the feature maps that correspond to the predicted bounding boxes. From these it predicts a (class-agnostic) binary mask image. This mask marks the region within the image belonging to the predicted instance. A schematic overview is given below.

A schematic overview of the mask prediction. The illustration uses the parts shown in the overview of the three parts of an object detection network shown above.

In HALCON, object detection with deep learning is implemented within the more general deep learning model. For more information on the latter, see the chapter Deep Learning / Model. Two different instance types of object detection models are implemented, differing in the orientation of the bounding boxes:

instance type 'rectangle1': The rectangular bounding boxes are axis-aligned.
instance type 'rectangle2': The rectangular bounding boxes have an arbitrary orientation.

For the specific system requirements in order to apply deep learning, please refer to the HALCON “Installation Guide”.

The following sections are introductions to the general workflow needed for object detection, information related to the involved data and parameters, and explanations to the evaluation measures.

General Workflow

In this paragraph, we describe the general workflow for an object detection task based on deep learning. It is subdivided into four parts: creating the model and preprocessing the data, training the model, evaluating the trained model, and inference on new images. We assume that your dataset is already labeled, see also the section “Data” below. Have a look at the HDevelop example series detect_pills_deep_learning for an application. The example dl_instance_segmentation_workflow.hdev shows the complete workflow for an instance segmentation application.

Creation of the model and dataset preprocessing

This part covers the creation of a DL object detection model and the adaptation of the data for this model. The single steps are also shown in the HDevelop example detect_pills_deep_learning_1_prepare.hdev.

Create a model using the operator
- create_dl_model_detection.
Thereby you will have to specify at least the backbone and the number of classes to be distinguished. For instance segmentation, the parameter 'instance_segmentation' has to be set to create a corresponding model.

Further parameters can be set over the dictionary DLModelDetectionParam. Their values should be well chosen for the specific task, not least to possibly reduce memory consumption and runtime. See the operator documentation for more information.

Anchor parameters suiting your dataset can be estimated using the procedure
- determine_dl_model_detection_param.
In case the dataset is not representing all orientations the network will face during training (e.g., due to augmentation), the suggested values need to be adapted accordingly.

Note, after creation of the model, its underlying network architecture is fixed to the specified input values. As a result the operator returns a handle DLModelHandle.

Alternatively you can also use
- read_dl_model
to read in a model you have already saved with write_dl_model.
The information about what is to be found in which image of your training dataset needs to be read in. This can be done by reading the data from
- a DLDataset dictionary using read_dict, or
- a file in the COCO data format using the procedure read_dl_dataset_from_coco, whereby a dictionary DLDataset is created.
The dictionary DLDataset serves as a database storing all necessary information about your data. For more information about the data and the way it is transferred, see the section “Data” below and the chapter Deep Learning / Model.
Split the dataset represented by the dictionary DLDataset. This can be done using the procedure
- split_dl_dataset.
The resulting split will be saved under the key split in each sample entry of DLDataset.
The network imposes requirements on the images, e.g., regarding the image width and height. You can retrieve every single value using the operator
- get_dl_model_param
or you can retrieve all necessary parameters using the procedure
- create_dl_preprocess_param_from_model.
Note, for classes declared by 'class_ids_no_orientation' the bounding boxes need to be treated specially during preprocessing. As a consequence, these classes should be set at this point at the latest.

Now you can preprocess your dataset. For this, you can use the procedure
- preprocess_dl_dataset.
This procedure also offers guidance on how to implement a customized preprocessing procedure. We recommend preprocessing and storing all images used for the training before starting the training, since this speeds up the training significantly.

To visualize the preprocessed data, the procedure
- dev_display_dl_data
is available.

Training of the model

This part covers the training of a DL object detection model. The single steps are also shown in the HDevelop example detect_pills_deep_learning_2_train.hdev.

Set the training parameters and store them in the dictionary TrainParam. These parameters include:
- the hyperparameters, for an overview see the section “Model Parameters and Hyperparameters” below and the chapter Deep Learning.
- parameters for possible data augmentation
Train the model. This can be done using the procedure
- train_dl_model.
The procedure expects:
- the model handle DLModelHandle
- the dictionary with the data information DLDataset
- the dictionary with the training parameters TrainParam
- the number of epochs the training shall run.
During the training you should see how the total loss decreases.

Evaluation of the trained model

In this part we evaluate the object detection model. The single steps are also shown in the HDevelop example detect_pills_deep_learning_3_evaluate.hdev.

Set the model parameters which may influence the evaluation.
The evaluation can conveniently be done using the procedure
- evaluate_dl_model.
This procedure expects a dictionary GenParamEval with the evaluation parameters. Set the parameter detailed_evaluation to 'true' to get the data necessary for the visualization.
You can visualize your evaluation results using the procedure
- dev_display_detection_detailed_evaluation.

Inference on new images

This part covers the application of a DL object detection model. The single steps are also shown in the HDevelop example detect_pills_deep_learning_4_infer.hdev.

Request the requirements the network imposes on the images using the operator
- get_dl_model_param
or the procedure
- create_dl_preprocess_param_from_model.
Set the model parameters described in the section “Model Parameters and Hyperparameters” below, using the operator
- set_dl_model_param.
The 'batch_size' can generally be set independently from the number of images to be inferred. See apply_dl_model for details on how to set this parameter for greater efficiency.
Generate a data dictionary DLSample for each image. This can be done using the procedure
- gen_dl_samples_from_images.
Every image has to be preprocessed as done for the training. For this, you can use the procedure
- preprocess_dl_samples.
Apply the model using the operator
- apply_dl_model.
Retrieve the results from the dictionary DLResultBatch.

Data

We distinguish between data used for training and evaluation, consisting of images with their information about the instances, and data for inference, which are bare images. For the first ones, you provide the information defining for each instance to which class it belongs and where it is in the image (via its bounding box). For instance segmentation the pixel-precise region of the objects is needed (provided via the masks).

As a basic concept, the model handles data via dictionaries, meaning it receives the input data in a dictionary DLSample and returns a dictionary DLResult (inference) or DLTrainResult (training). More information on the data handling can be found in the chapter Deep Learning / Model.

Data for training and evaluation

The dataset consists of images and corresponding information. They have to be provided in a way the model can process them. Concerning the image requirements, find more information in the section “Images” below.

The training data is used to train and evaluate a network for your specific task. With the aid of this data the network can learn which classes are to be distinguished, what such examples look like, and how to find them. The necessary information is provided by stating for each object in every image to which class it belongs and where it is located. This is done by providing a class label and an enclosing bounding box for every object. In case of instance segmentation, a mask is additionally needed for every instance. There are different possible ways to store and retrieve this information. How the data has to be formatted in HALCON for a DL model is explained in the chapter Deep Learning / Model. In short, a dictionary DLDataset serves as a database for the information needed by the training and evaluation procedures. You can label your data and directly create the dictionary DLDataset in the respective format using the MVTec Deep Learning Tool, available from the MVTec website. If you have your data already labeled in the standard COCO format, you can use the procedure read_dl_dataset_from_coco (for 'instance_type' = 'rectangle1' only). It formats the data and creates a dictionary DLDataset. For further information on the needed part of the COCO data format, please refer to the documentation of the procedure.

You also want enough training data to split it into three subsets, used for training, validation and testing the network. These subsets are preferably independent and identically distributed, see the section “Data” in the chapter Deep Learning.
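
What splitting into three subsets amounts to can be sketched in Python as follows. This is a plain random split for illustration and does not claim to reproduce the behavior of split_dl_dataset. The sample dictionaries, the helper name split_samples, and the split names 'train', 'validation', and 'test' are assumptions made for this example; only the idea of storing the result under a key split in each sample entry is taken from the workflow description above.

```python
import random


def split_samples(samples, training_percent, validation_percent, seed=42):
    """Assign a 'split' entry to every sample dictionary in place."""
    indices = list(range(len(samples)))
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    num_train = round(len(samples) * training_percent / 100)
    num_validation = round(len(samples) * validation_percent / 100)
    for position, index in enumerate(indices):
        if position < num_train:
            samples[index]["split"] = "train"
        elif position < num_train + num_validation:
            samples[index]["split"] = "validation"
        else:
            samples[index]["split"] = "test"  # remaining samples
    return samples


samples = [{"image_id": i} for i in range(10)]
split_samples(samples, training_percent=70, validation_percent=20)
for name in ("train", "validation", "test"):
    print(name, sum(1 for s in samples if s["split"] == name))
```

A purely random split like this one does not guarantee that every class appears in every subset, so for small or unbalanced datasets the resulting distribution should be checked.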

Note that in object detection the network has to learn how to find possible locations and sizes of the instances. That is why the instance locations and sizes that will be relevant later need to be represented adequately in your training dataset.

Images

Regardless of the application, the network poses requirements on the images regarding, e.g., the image dimensions. The specific values depend on the network itself and can be queried with get_dl_model_param. In order to fulfill these requirements, you may have to preprocess your images. Standard preprocessing is implemented in preprocess_dl_dataset for the entire dataset, and therewith also the images, and in preprocess_dl_samples for a single sample. The procedure preprocess_dl_dataset also offers guidance on how to implement a customized preprocessing procedure.

Bounding boxes

Depending on the instance type of object detection model, the bounding boxes are parametrized differently:

instance type 'rectangle1': The bounding boxes are defined over the coordinates of the upper left corner ('bbox_row1', 'bbox_col1') and the lower right corner ('bbox_row2', 'bbox_col2'). This is consistent with gen_rectangle1.
instance type 'rectangle2': The bounding boxes are defined over the coordinates of their center ('bbox_row', 'bbox_col'), the orientation 'bbox_phi', and the half edge lengths 'bbox_length1' and 'bbox_length2'. The orientation is given in radians and indicates the angle between the horizontal axis and 'bbox_length1' (mathematically positive). This is consistent with gen_rectangle2.

If, in case of 'rectangle2', you are interested in the oriented bounding box but not in the direction of the object within the bounding box, the parameter 'ignore_direction' can be set to 'true'. This is illustrated in the figure below.


Bounding box formats of the different object detection instance types: (1) Instance type 'rectangle1'. (2) Instance type 'rectangle2', where the bounding box is oriented towards the banana end. (3) Instance type 'rectangle2', where the oriented bounding box is of interest without considering the direction of the banana within the bounding box.
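
The relation between the two parametrizations can be made concrete with a short Python sketch that computes the corner points of a 'rectangle2' box and its enclosing axis-aligned 'rectangle1' box. The function names are hypothetical and the code is an illustration rather than part of HALCON. It relies on the image coordinate system with the row axis pointing downwards, so that a mathematically positive angle decreases the row coordinate.

```python
import math


def rectangle2_corners(row, col, phi, length1, length2):
    """Corner points (row, col) of an oriented box.

    The row axis points downwards, so a mathematically positive
    (counterclockwise) angle decreases the row coordinate.
    """
    # Unit vectors (row, col) along length1 and, perpendicular, length2.
    axis1 = (-math.sin(phi), math.cos(phi))
    axis2 = (-math.cos(phi), -math.sin(phi))
    corners = []
    for s1, s2 in ((1, 1), (1, -1), (-1, -1), (-1, 1)):
        corner_row = row + s1 * length1 * axis1[0] + s2 * length2 * axis2[0]
        corner_col = col + s1 * length1 * axis1[1] + s2 * length2 * axis2[1]
        corners.append((corner_row, corner_col))
    return corners


def enclosing_rectangle1(corners):
    """Smallest axis-aligned box (row1, col1, row2, col2) around the corners."""
    rows = [r for r, _ in corners]
    cols = [c for _, c in corners]
    return min(rows), min(cols), max(rows), max(cols)


# Without rotation, length1 extends along the columns.
corners = rectangle2_corners(10.0, 20.0, 0.0, 4.0, 2.0)
print(enclosing_rectangle1(corners))  # (8.0, 16.0, 12.0, 24.0)
```

Rotating the same box by a quarter turn swaps the extents along rows and columns, which is a quick way to check that the sign conventions of a custom conversion are right.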

Mask

Instance segmentation requires, in addition to a tight bounding box, a mask for every object to be learned. Such a mask is given as a region. Note, these regions are given with respect to the image.

The masks for a single image are given as an object tuple of regions, see the illustration below. The order of the masks corresponds to the order of the bounding box annotations.

Illustration of the masks of an image: The tuple contains an independent region for every object to be learned.

Network output

The network output depends on the task:

training: As output, the operator train_dl_model_batch will return a dictionary DLTrainResult with the current value of the total loss as well as values for all other losses included in your model.
inference and evaluation: As output, the operator apply_dl_model will return a dictionary DLResult for every image. For object detection, this dictionary will include for every detected instance its bounding box and the confidence value of the assigned class, as well as its mask in case of instance segmentation. Several instances may be detected for the same object in the image, see the explanation of the non-maximum suppression above. The resulting bounding boxes are parametrized according to the instance type (specified via 'instance_type') and given in pixel-centered, subpixel-accurate coordinates. For more information on the coordinate system, see the chapter Transformations / 2D Transformations. Further information on the output dictionary can be found in the chapter Deep Learning / Model.

Model Parameters and Hyperparameters

In addition to the general DL hyperparameters explained in Deep Learning, there are further hyperparameters relevant for object detection:

'bbox_heads_weight'
'class_heads_weight'
'mask_head_weight' (in case of instance segmentation)

These hyperparameters are explained in more detail in get_dl_model_param and set using create_dl_model_detection.

For an object detection model, there are two different types of model parameters:

Parameters defining your architecture. They cannot be changed anymore once your model is created. These parameters are all set using the operator create_dl_model_detection when creating your model.
Parameters influencing your predictions and, as a consequence, the evaluation results. Those only relevant for object detection are
- 'max_num_detections'
- 'max_overlap'
- 'max_overlap_class_agnostic'
- 'min_confidence'
They are explained in more detail in get_dl_model_param. To set them you can use create_dl_model_detection when creating your model or set_dl_model_param afterwards.
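
How 'min_confidence' and 'max_num_detections' shape the list of predictions can be illustrated with a few lines of Python. This is a conceptual sketch under assumptions, not HALCON's implementation: the helper name filter_detections and the dictionary key 'confidence' are hypothetical, the threshold is assumed to be inclusive, and nothing is implied about the order in which the model applies these filters relative to the non-maximum suppression.

```python
def filter_detections(detections, min_confidence, max_num_detections):
    """Keep the most confident detections above a confidence threshold."""
    # Assumption: a detection exactly at the threshold is kept.
    confident = [d for d in detections if d["confidence"] >= min_confidence]
    confident.sort(key=lambda d: d["confidence"], reverse=True)
    return confident[:max_num_detections]


detections = [{"confidence": c} for c in (0.95, 0.40, 0.80, 0.65)]
kept = filter_detections(detections, min_confidence=0.5, max_num_detections=2)
print([d["confidence"] for d in kept])  # [0.95, 0.8]
```

Because both parameters change which predictions are reported, they directly change the evaluation results, which is why they should be fixed before comparing models.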

Evaluation measures for the Results from Object Detection

For object detection, the following evaluation measures are supported in HALCON. Note that for computing such a measure for an image, the related ground truth information is needed.

Mean average precision, mAP and average precision (AP) of a class for an IoU threshold, ap_iou_classname

The AP value is an average of the maximum precision at different recall values. In simple terms, it tells us whether the objects predicted for this class are generally correct detections or not. Thereby we pay more attention to the predictions with high confidence values. The higher the value, the better.

To count a prediction as a hit, both its top-1 classification and its localization must be correct. The measure telling us the correctness of the localization is the intersection over union, IoU: an instance is localized correctly if the IoU is higher than the demanded threshold. The IoU is explained in more detail below. For this reason, the AP value depends on the class and on the IoU threshold.

You can obtain the specific AP values, the averages over the classes, the averages over the IoU thresholds, and the average over both, the classes and the IoU thresholds. The latter one is the mean average precision, mAP, a measure to tell us how well instances are found and classified.
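
The idea of averaging the maximum precision at different recall values can be sketched in Python as follows. This is an illustrative sketch and not necessarily the exact computation used by evaluate_dl_model: the choice of 11 evenly spaced recall levels and the function names are assumptions. The input is, for one class and one IoU threshold, the list of predictions sorted by decreasing confidence, each marked as hit or miss.

```python
def average_precision(hits, num_ground_truth, num_recall_levels=11):
    """AP for one class and one IoU threshold.

    hits: booleans sorted by decreasing confidence, True for a hit.
    Assumption: precision is sampled at evenly spaced recall levels.
    """
    if num_ground_truth == 0:
        return 0.0
    precisions, recalls = [], []
    true_positives = 0
    for rank, is_hit in enumerate(hits, start=1):
        true_positives += int(is_hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    total = 0.0
    for i in range(num_recall_levels):
        level = i / (num_recall_levels - 1)
        # Maximum precision among all points reaching at least this recall.
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable, default=0.0)
    return total / num_recall_levels


def mean_average_precision(ap_values):
    """Mean over the AP values of all classes and IoU thresholds."""
    return sum(ap_values) / len(ap_values)


print(average_precision([True, True], num_ground_truth=2))  # 1.0
print(average_precision([True, False], num_ground_truth=2))
```

In the second call only half of the ground truth instances are found, so the precision counts as zero for all recall levels above 0.5, which pulls the AP down even though the most confident prediction is correct.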

True Positives, False Positives, False Negatives

The concept of true positives, false positives, and false negatives is explained in Deep Learning. It applies to object detection with the exception that there are different kinds of false positives, e.g.:
- An instance got classified wrongly.
- An instance was found where there is only background.
- An instance was localized badly, meaning the IoU between the instance and its ground truth is lower than the evaluation IoU threshold.
- There is a duplicate, thus at least two instances overlap mainly with the same ground truth bounding box, but they overlap not more than 'max_overlap' with each other, so none of them got suppressed.
Note, these values are only available from the detailed evaluation. This means, in evaluate_dl_model the parameter detailed_evaluation has to be set to 'true'.
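
The different kinds of false positives listed above can be told apart with a simple greedy matching, sketched here in Python. The matching rule is an assumption made for illustration and not necessarily the one used by the detailed evaluation: each prediction, in order of decreasing confidence, is compared with the ground truth box it overlaps most. Boxes are axis-aligned and given as (row1, col1, row2, col2), and the dictionary keys and category names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (row1, col1, row2, col2)."""
    inter_h = min(a[2], b[2]) - max(a[0], b[0])
    inter_w = min(a[3], b[3]) - max(a[1], b[1])
    if inter_h <= 0 or inter_w <= 0:
        return 0.0
    inter = inter_h * inter_w
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


def categorize_predictions(predictions, ground_truths, iou_threshold):
    """Label predictions (by decreasing confidence) and count false negatives."""
    matched = set()
    categories = []
    ordered = sorted(predictions, key=lambda p: p["confidence"], reverse=True)
    for pred in ordered:
        overlaps = [box_iou(pred["box"], gt["box"]) for gt in ground_truths]
        best = max(range(len(overlaps)), key=overlaps.__getitem__, default=None)
        if best is None or overlaps[best] == 0.0:
            categories.append("fp_background")
        elif overlaps[best] <= iou_threshold:
            categories.append("fp_localization")
        elif ground_truths[best]["class_id"] != pred["class_id"]:
            categories.append("fp_classification")
        elif best in matched:
            categories.append("fp_duplicate")
        else:
            matched.add(best)
            categories.append("tp")
    return categories, len(ground_truths) - len(matched)


ground_truths = [{"box": (0, 0, 10, 10), "class_id": 0}]
predictions = [
    {"box": (0, 0, 10, 10), "class_id": 0, "confidence": 0.9},
    {"box": (1, 1, 11, 11), "class_id": 0, "confidence": 0.8},
    {"box": (0, 0, 10, 10), "class_id": 1, "confidence": 0.7},
    {"box": (5, 5, 15, 15), "class_id": 0, "confidence": 0.6},
    {"box": (50, 50, 60, 60), "class_id": 0, "confidence": 0.5},
]
print(categorize_predictions(predictions, ground_truths, iou_threshold=0.5))
```

In this example the five predictions fall into the five categories in turn: one true positive followed by a duplicate, a wrong classification, a bad localization, and a detection on background.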

Score of Angle Precision (SoAP)

The SoAP value is a score for the precision of the inferred orientation angles. This score is determined by the angle differences between the inferred instances (I) and the corresponding ground truth annotations (GT): where the index runs over all inferred instances. This score only applies for detection models of 'instance_type' 'rectangle2'.

The measures mentioned above use the intersection over union (IoU). The IoU is a measure for the accuracy of an object detection. For a proposed bounding box, it is the ratio between the area of its intersection with the ground truth bounding box and the area of their union. A visual example is shown in the following schema.


Visual example of the IoU, illustrated for instance type 'rectangle1'. (1) The input image with the ground truth bounding box (orange) and the predicted bounding box (light blue). (2) The IoU is the ratio between the area of intersection and the area of union.

In case of instance segmentation the IoU is calculated (by default) based on the masks. It is possible to change the default and use the bounding boxes instead.
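
Both variants of the IoU can be written down in a few lines of Python. This is a minimal sketch for illustration, assuming axis-aligned boxes given as (row1, col1, row2, col2) and masks represented as sets of (row, col) pixel coordinates. The function names are hypothetical and the code is not HALCON's implementation.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (row1, col1, row2, col2)."""
    inter_h = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_w = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_h <= 0 or inter_w <= 0:
        return 0.0  # the boxes do not overlap
    intersection = inter_h * inter_w
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


def mask_iou(mask_a, mask_b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)


# Two boxes of area 100 sharing an area of 50: IoU = 50 / 150.
print(box_iou((0, 0, 10, 10), (0, 5, 10, 15)))
# Two masks of two pixels each sharing one pixel: IoU = 1 / 3.
print(mask_iou({(0, 0), (0, 1)}, {(0, 1), (0, 2)}))
```

The mask-based value is usually the stricter of the two for irregularly shaped objects, since two instances can have nearly identical bounding boxes while covering quite different pixels.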

List of Operators

create_dl_model_detection: Create a deep learning network for object detection or instance segmentation.
