train_dl_model_batch

Short description

train_dl_model_batch: Train a deep learning model.

Signature

train_dl_model_batch( dl_model DLModelHandle, dict DLSampleBatch, out dict DLTrainResult )

Description

The operator train_dl_model_batch performs a training step of the deep learning model contained in DLModelHandle. The current loss values are returned in the dictionary DLTrainResult.

For DLModelHandle, all model types except 'anomaly_detection' and 'counting' are valid. See train_dl_model_anomaly_dataset for the training of anomaly detection models.

Here, a training step means a single update of the weights, based on the batch images given in DLSampleBatch. The available optimization algorithms are explained in the subsection “Further Information on the Algorithms” below. For more information on how to train a network, please see the subchapter “The Network and the Training Process” in Deep Learning.

To successfully train the model, its applicable hyperparameters need to be set and the training data handed over according to the model requirements. For information on the hyperparameters, see the chapter of the corresponding model and the general chapter Deep Learning.

The training data consists of images and corresponding information. This operator expects one batch of training data, handed over in the tuple of dictionaries DLSampleBatch. A DLSample dictionary is created from DLDataset for every image sample, e.g., by the procedure gen_dl_samples. See the chapter Deep Learning / Model for further information on the dictionaries used and their keys.

The number of images in a DLSampleBatch tuple needs to be a multiple of the 'batch_size'. Particularly on a GPU, the parameter 'batch_size' is limited by the amount of available memory. In order to process more images in one training step, the model parameter 'batch_size_multiplier' can be set to a value greater than 1. The number of DLSample dictionaries passed to the training operator needs to be equal to 'batch_size' times 'batch_size_multiplier'. Note that a training step calculated for a batch with a 'batch_size_multiplier' greater than 1 is an approximation of a training step calculated for the same batch with a 'batch_size_multiplier' equal to 1 and an accordingly greater 'batch_size'. As an example, the loss calculated with a 'batch_size' of 4 and a 'batch_size_multiplier' of 2 is usually not equal to the loss calculated with a 'batch_size' of 8 and a 'batch_size_multiplier' of 1, although the same number of DLSample dictionaries is used for training in both cases. However, the approximation generally delivers comparably good results, so it can be used if you wish to train with a larger number of images than your GPU allows. In some rare cases the approximation with a 'batch_size' of 1 and an accordingly large 'batch_size_multiplier' does not show the expected performance. Setting the 'batch_size' to a value greater than 1 can help to solve this issue.
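The counting rule above can be illustrated with a small Python sketch. The helper names `required_sample_count` and `split_into_sub_batches` are hypothetical and not part of HALCON; the chunking only shows how the same samples could be grouped and is an assumption, since this page does not describe how the operator processes them internally.

```python
def required_sample_count(batch_size: int, batch_size_multiplier: int = 1) -> int:
    """Number of DLSample dictionaries expected for one training step."""
    if batch_size < 1 or batch_size_multiplier < 1:
        raise ValueError("batch_size and batch_size_multiplier must be >= 1")
    return batch_size * batch_size_multiplier


def split_into_sub_batches(samples: list, batch_size: int) -> list:
    """Group samples into consecutive chunks of batch_size (illustration only)."""
    if len(samples) % batch_size != 0:
        raise ValueError("number of samples must be a multiple of batch_size")
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


# Integers stand in for 8 DLSample dictionaries.
samples = list(range(8))

# Both settings require the same number of samples per training step.
count_a = required_sample_count(batch_size=4, batch_size_multiplier=2)
count_b = required_sample_count(batch_size=8, batch_size_multiplier=1)

# With batch_size 4, the 8 samples form two groups of 4.
sub_batches = split_into_sub_batches(samples, batch_size=4)
```

Both configurations consume 8 samples per step, which is why they are comparable, even though the resulting loss values are usually not identical.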

In the output dictionary DLTrainResult you get the current value of the total loss under the key total_loss, as well as the values of all other losses included in your model.

For object detection models of 'type' = 'detection', such losses are, e.g., the losses for the heads of every selected level, namely the 'Huber Loss' for the bounding box regression heads and the 'Focal Loss' for the classification heads (see also Deep Learning / Instance Segmentation as well as 'max_level' and 'min_level' in get_dl_model_param).

For models retrieved from a Deep 3D Matching model, of 'type' = 'detection' or 'type' = '3d_pose_estimation', the (typically synthetically generated) training samples are automatically augmented.

For models of 'type' = '3d_pose_estimation', the training can be sped up if OpenGL 2.1, GLSL 1.2, and the OpenGL extensions GL_EXT_framebuffer_object and GL_EXT_framebuffer_blit are available.

Further Information on the Algorithms

During training, an optimization algorithm is applied with the goal of minimizing the value of the total loss function. The latter is determined based on the prediction of the neural network for the current batch of images.

HALCON currently provides two optimization algorithms: SGD (stochastic gradient descent) and Adam (adaptive moment estimation).

SGD: The SGD updates the layers’ weights of the previous iteration \(t-1\), \(w_{t-1}\), to the new values \(w_{t}\) at iteration \(t\) as follows:

\[\begin{eqnarray*} v_{t} & = & \mu v_{t-1} - \lambda \nabla_{w} L\\ w_{t} & = & w_{t-1} + v_{t}. \end{eqnarray*}\]

Here, \(\lambda\) is the learning rate, \(\mu\) the momentum, \(L\) the total loss, and \(\nabla_{w} L\) the gradient of the total loss with respect to the weights. The variable \(v_{t}\) is used to include the influence of the momentum \(\mu\).
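As a minimal sketch of the two SGD equations above, the following Python snippet applies the update to a toy problem. The function name `sgd_momentum_step`, the toy loss, and the chosen hyperparameter values are assumptions for illustration only and do not reflect HALCON's implementation.

```python
def sgd_momentum_step(w, v, grad, learning_rate, momentum):
    """One step: v_t = mu * v_{t-1} - lambda * grad, then w_t = w_{t-1} + v_t."""
    v_new = [momentum * vi - learning_rate * gi for vi, gi in zip(v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


# Toy loss L(w) = 0.5 * sum(w_i^2), whose gradient with respect to w is w itself.
w = [1.0, -2.0]
v = [0.0, 0.0]  # the momentum term starts at zero
for _ in range(3):
    w, v = sgd_momentum_step(w, v, grad=w, learning_rate=0.1, momentum=0.9)
```

Because \(v_{t}\) carries over a fraction \(\mu\) of the previous step, consecutive updates in the same direction grow larger than with plain gradient descent.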

Adam: Like SGD, Adam updates the layers’ weights of the previous iteration, but it uses adaptive moment estimation to automatically estimate a scaling for the learning rate. This determines how fast the solver moves towards a minimum. The estimated moments are the first two moments of the weights’ gradients, i.e., the mean and the uncentered variance. To estimate the moments, Adam uses the exponential moving averages \(m_{t}\) and \(v_{t}\), computed on the gradient evaluated on a mini-batch. This results in the following formulas:

\[\begin{eqnarray*} m_{t} & = & \beta_{1} m_{t-1} + (1 - \beta_{1}) g_{t}\\ v_{t} & = & \beta_{2} v_{t-1} + (1 - \beta_{2}) g_{t}^{2}. \end{eqnarray*}\]

Here, \(g_{t}\) is the weights’ gradient on the current mini-batch, \(\beta_{1}\) is the moment for the linear term, and \(\beta_{2}\) is the moment for the quadratic term of the Adam solver. Furthermore, Adam uses the so-called bias-corrected estimates \(\hat{m_{t}}\) and \(\hat{v_{t}}\), which are computed as follows:

\[\begin{eqnarray*} \hat{m_{t}} & = & \frac{m_{t}}{1 - \beta_{1}^{t}}\\ \hat{v_{t}} & = & \frac{v_{t}}{1 - \beta_{2}^{t}}. \end{eqnarray*}\]

As a last step, the moving averages are used to scale the learning rate individually for each parameter. With \(w_{t}\) as the model weights and \(\lambda\) as the learning rate, the weight update is given by the following formula:

\[\begin{eqnarray*} w_{t} & = & w_{t-1} - \lambda \frac{\hat{m_{t}}}{\sqrt{\hat{v_{t}}} + \epsilon}. \end{eqnarray*}\]

Here \(\epsilon\) is the parameter to ensure numeric stability. For a more detailed description we refer to the referenced paper.
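The Adam formulas above can be combined into a single update step, sketched here in Python. The function name `adam_step` is hypothetical, and the default values for `learning_rate`, `beta1`, `beta2`, and `epsilon` are commonly used settings taken as an assumption; they are not necessarily the defaults of HALCON.

```python
import math


def adam_step(w, m, v, grad, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for iteration t (t starts at 1)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m_new = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v_new = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in m_new]
    v_hat = [vi / (1 - beta2 ** t) for vi in v_new]
    # Per-parameter scaled weight update.
    w_new = [wi - learning_rate * mh / (math.sqrt(vh) + epsilon)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w_new, m_new, v_new


w = [1.0, -2.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
w, m, v = adam_step(w, m, v, grad=[0.5, -4.0], t=1)
```

In the first iteration the bias-corrected ratio is close to the sign of the gradient, so both weights move by roughly the learning rate, even though the gradient magnitudes differ by a factor of 8.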

The different models may have several losses implemented, which are summed up. To this sum the regularization term \(E_{\alpha}(w)\) is added, which generally penalizes large weights, and together they form the total loss. The different types of losses are:

Huber Loss (model of 'type' = 'detection'): The 'Huber Loss' is also known as 'Smooth L1 Loss'. The total 'Huber Loss' is the sum of the contributions from all bounding box variables of all found instances in the batch. For a single bounding box variable, this contribution is defined as follows:

\[\begin{eqnarray*} L_{Huber}(x) = \left\{ \begin{array}{ll} 0.5 x^2/\beta & \textnormal{if } |x|<\beta \\ |x|-0.5\beta & \textnormal{else} \end{array} \right.. \end{eqnarray*}\]

Thereby, \(x\) denotes a bounding box variable and \(\beta\) a parameter fixed to a value of 0.11.

We refer to create_dl_layer_loss_huber for more information.
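The piecewise definition of the 'Huber Loss' given above can be written as a short Python function. This is a sketch of the formula only; the function name `huber_loss` and the example values are hypothetical, and \(\beta\) defaults to the value 0.11 stated above.

```python
def huber_loss(x: float, beta: float = 0.11) -> float:
    """Contribution of a single bounding box variable x."""
    abs_x = abs(x)
    if abs_x < beta:
        # Quadratic region for small values.
        return 0.5 * x * x / beta
    # Linear region for larger values.
    return abs_x - 0.5 * beta


# The total loss sums the contributions of all bounding box variables.
box_variables = [0.05, -0.5, 1.0]
total_huber = sum(huber_loss(x) for x in box_variables)
```

Both branches yield \(0.5\beta\) at \(|x| = \beta\), so the loss is continuous at the transition from the quadratic to the linear region.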

Focal Loss (model of 'type' = 'detection'): The total 'Focal Loss' is the sum of the contributions from all found instances in the batch. For a single sample, this contribution is defined as follows:

\[\begin{eqnarray*} L_{focal}(p) & = & -\sum_{c=0}^{C-1} \alpha_{t}^{c}\big(1-p_{t}^{c}\big)^{\gamma}\log\big(p_{t}^{c}\big), \end{eqnarray*}\]

where \(\gamma\) is a parameter fixed to a value of 2. \(\alpha^{c}\) stands for the class-specific weight ('class_weights') of the \(c\)-th class, and \(p_{t}\), \(\alpha_{t}\) are defined as

\[\begin{eqnarray*} p_{t} := \left\{ \begin{array}{ll} p & \textnormal{if } y=1 \\ 1-p & \textnormal{else} \end{array} \right., \quad\textnormal{and}\quad \alpha_{t} := \left\{ \begin{array}{ll} \alpha & \textnormal{if } y=1 \\ 1-\alpha & \textnormal{else} \end{array} \right.. \end{eqnarray*}\]

Here, \(p=(p^0,\ldots,p^{C-1})\) is a tuple of the model’s estimated probabilities for each of the \(C\)-many classes, and \(y_{c}\) is a one-hot encoded target vector that encodes the class of the annotation.

We refer to create_dl_layer_loss_focal for more information.
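The 'Focal Loss' contribution of a single sample, as defined above, can be sketched in Python as follows. The function name `focal_loss` and the example probabilities and class weights are hypothetical; the snippet assumes all probabilities lie strictly between 0 and 1 so that the logarithm is defined.

```python
import math


def focal_loss(p, y, alpha, gamma=2.0):
    """Focal loss of one sample, summed over all classes.

    p: estimated probability per class, y: one-hot target,
    alpha: class-specific weight per class.
    """
    loss = 0.0
    for p_c, y_c, alpha_c in zip(p, y, alpha):
        p_t = p_c if y_c == 1 else 1.0 - p_c
        alpha_t = alpha_c if y_c == 1 else 1.0 - alpha_c
        loss -= alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return loss


target = [1, 0]
weights = [0.5, 0.5]  # hypothetical class weights
confident = focal_loss(p=[0.9, 0.1], y=target, alpha=weights)
uncertain = focal_loss(p=[0.6, 0.4], y=target, alpha=weights)
```

The factor \((1-p_{t})^{\gamma}\) strongly down-weights well-classified samples: the confident prediction contributes far less to the loss than the uncertain one.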

Multinomial Logistic Loss (model of 'type' = 'classification', 'segmentation'): The 'Multinomial Logistic Loss' is also known as 'Cross Entropy Loss'. It is defined as follows:

\[\begin{eqnarray*} L_{mn}\big(f(x,w)\big) & = & - \frac{1}{N} \sum_{n=0}^{N-1} \alpha(y_{n}) \langle y_{n} , \log\big(f(x_{n}, w)\big) \rangle \end{eqnarray*}\]

Here, \(f(x,w)\) is the predicted result which depends on the network weights \(w\) and the input batch \(x\). \(y_{n}\) is a one-hot encoded target vector that encodes the label of the \(n\)-th image \(x_{n}\) of the batch \(x\) containing \(N\)-many images, and \(\log (f(x_{n}, w))\) shall be understood to be a vector such that \(\log\) is applied on each component of \(f(x_{n}, w)\). The value \(\alpha(y_{n})\) is a class specific weight for the class given by \(y_{n}\). This weight corresponds to the value set by 'class_weights' and is normalized by the sum over the weights for all classes in addition.

We refer to create_dl_layer_loss_cross_entropy for more information.
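A minimal Python sketch of the 'Multinomial Logistic Loss' formula above is given below. The function name `multinomial_logistic_loss` and the example data are hypothetical. The normalization of the class weights by their sum follows one reading of the description above and should be treated as an assumption; the predicted probabilities are assumed to be strictly positive.

```python
import math


def multinomial_logistic_loss(predictions, targets, class_weights):
    """Weighted cross entropy over a batch of N samples.

    predictions: per-sample class probabilities f(x_n, w),
    targets: per-sample one-hot vectors y_n.
    """
    weight_sum = sum(class_weights)
    normalized = [cw / weight_sum for cw in class_weights]
    total = 0.0
    for probs, one_hot in zip(predictions, targets):
        label = one_hot.index(1)
        # Inner product <y_n, log(f(x_n, w))>, with log applied per component.
        inner = sum(y * math.log(p) for y, p in zip(one_hot, probs))
        total += normalized[label] * inner
    return -total / len(predictions)


predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [[1, 0, 0], [0, 1, 0]]
loss = multinomial_logistic_loss(predictions, targets, class_weights=[1.0, 1.0, 1.0])
```

Since \(y_{n}\) is one-hot, only the logarithm of the probability predicted for the annotated class contributes to each summand.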

The regularization term \(E_{\alpha}(w)\) is a weighted \(l^2\)-norm involving all \(K\) weights except for biases. Its influence can be controlled through \(\alpha\), which is the hyperparameter 'weight_prior' that can be set with set_dl_model_param.

\[\begin{eqnarray*} E_{\alpha}(w) = \frac{\alpha}{2} \sum_{k=0}^{K-1} |w_{k}|^{2} \end{eqnarray*}\]

Here the index \(k\) runs over all weights of the network, except for the biases which are not regularized. The regularization term \(E_{\alpha}(w)\) generally penalizes large weights, thus pushing the weights towards zero, which effectively reduces the complexity of the model.
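The regularization term can be computed directly from its definition, as in the following Python sketch. The function name `l2_regularization`, the example weights, and the loss value of 0.8 are hypothetical.

```python
def l2_regularization(weights, weight_prior):
    """E_alpha(w) = alpha / 2 * sum of squared weights (biases excluded)."""
    return 0.5 * weight_prior * sum(w * w for w in weights)


# Hypothetical non-bias weights of a tiny network.
weights = [0.5, -1.0, 2.0]
regularization = l2_regularization(weights, weight_prior=0.01)

# The total loss is the sum of the model losses plus the regularization term.
summed_model_losses = 0.8  # hypothetical value
total_loss = summed_model_losses + regularization
```

A larger 'weight_prior' increases the share of the regularization term in the total loss and therefore penalizes large weights more strongly.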

Alternatively, a decoupled weight decay mechanism can be used to constrain the magnitude of the weights. It can be controlled by the hyperparameter 'weight_decay', which can be set with set_dl_model_param. The weight decay updates the layers’ weights of the previous iteration \(t-1\), \(w_{t-1}\), to the new values \(w_{t}\) at iteration \(t\) as follows:

\[\begin{eqnarray*} w_{t} & = & w_{t-1} - \alpha \lambda w_{t-1} \end{eqnarray*}\]

Here, \(\lambda\) is the learning rate and \(\alpha\) is the weight decay parameter. The weight decay update is applied after the gradients are computed, but before the solver applies its update to the weights. Weight decay is typically preferred to L2 regularization when using the Adam solver, which is then referred to as AdamW.
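The weight decay formula above amounts to shrinking every weight by a constant factor, as this Python sketch shows. The function name `apply_weight_decay` and the example values are hypothetical; the subsequent solver update is omitted here.

```python
def apply_weight_decay(w, learning_rate, weight_decay):
    """w_t = w_{t-1} - alpha * lambda * w_{t-1}, applied to every weight."""
    factor = 1.0 - weight_decay * learning_rate
    return [factor * wi for wi in w]


weights = [1.0, -2.0]
decayed = apply_weight_decay(weights, learning_rate=0.1, weight_decay=0.01)
```

Unlike the regularization term \(E_{\alpha}(w)\), this shrinkage does not enter the loss or its gradient, so it is not rescaled by the adaptive moment estimates of Adam.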

Attention

The operator train_dl_model_batch internally calls functions that might not be deterministic. Therefore, results from multiple calls of train_dl_model_batch can slightly differ, although the same input values have been used. Setting 'cudnn_deterministic' of set_system may influence this behavior.

System requirements: Implementation on CPU is limited to specific platform types. To run this operator on GPU by setting 'runtime' to 'gpu' (see get_dl_model_param), cuDNN and cuBLAS are required. Please refer to the “Installation Guide”, paragraph “Requirements for Deep Learning and Deep-Learning-Based Methods”, for the specific system requirements.

Execution information

Multithreading type: reentrant (runs in parallel with non-exclusive operators).
Multithreading scope: global (may be called from any thread).
Automatically parallelized on internal data level.

Parameters

DLModelHandle (input_control) dl_model → (handle)

Deep learning model handle.

DLSampleBatch (input_control) dict-array → (handle)

Tuple of dictionaries with input images and corresponding information.

DLTrainResult (output_control) dict → (handle)

Dictionary with the train result data.

Result

If the parameters are valid, the operator train_dl_model_batch returns the value 2 (H_MSG_TRUE). If necessary, an exception is raised.

Combinations with other operators

Possible predecessors

read_dl_model, set_dl_model_param, get_dl_model_param

Possible successors

apply_dl_model

See also

apply_dl_model

References

D. P. Kingma, J. Ba: "Adam: A method for Stochastic Optimization", 2014, https://arxiv.org/pdf/1412.6980

I. Loshchilov, F. Hutter: "Decoupled weight decay regularization", 2017, https://arxiv.org/abs/1711.05101

Module

This operator uses dynamic licensing (see the ‘Installation Guide’). Which of the following modules is required depends on the specific usage of the operator:

3D Metrology, OCR/OCV, Deep Learning Professional